Saturday, November 2, 2024

macos – Highlight “shady” characters in scripts copied in Terminal from PDFs

FOREWORD:

The question above was deleted by the OP while I was working on the following answer. Not being keen on wasted effort, I managed to copy the OP’s original question, and pasted it into the “new question” above. Yes… this is a bit odd 🙂


I think what you may be looking for is a CLI utility called iconv. Inconveniently, iconv requires “from” and “to” argument declarations (ref man iconv) of the encoding type (e.g. UTF-8, ascii, unicode, etc)… and AFAIK, “shady” is not a recognized encoding type 🙂 However – the encoding type may be determined from another CLI utility called file. Still more inconveniently, both iconv and file specify that the input be contained in a file :/

Your question intrigued me as it seems a reasonable thing to do; i.e. C&P from PDF to CLI. So I spent a few minutes wrangling with iconv and file to get the following answer; an answer which does not require you to C&P your PDF strings into a file. <caveat>This works on my Ventura Mac under zsh, but it’s been tested nowhere else.</caveat>

You’ve not provided an example, and I was unable to find any malfunctioning PDF code strings in a brief search. So – instead, I found this string in a French-language PDF on Python programming:

print(“Numéro de boucle”, i)

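So – first we’ll need to run this string through file to determine the encoding (note the use of the “dash” -: a reference to stdin in lieu of a proper filename). Note also that the inner double quotes in the command are consumed by the shell, so the string that file actually receives is print(Numéro de boucle, i):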

echo "print("Numéro de boucle", i)" | file -
/dev/stdin: Unicode text, UTF-8 text
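
file reports the encoding, but not which characters are the “shady” ones. A minimal sketch for locating them, assuming a grep that supports -a, -E and --color (GNU and BSD grep both do, though I have only reasoned about this rather than tested it widely); the function name find_non_ascii is made up for illustration:

```shell
# find_non_ascii: hypothetical helper, not part of file or iconv.
# Reads stdin and prints, with line numbers, every line containing a byte
# outside tab + printable ASCII. LC_ALL=C makes grep work byte by byte, and the
# trailing + groups consecutive non-ASCII bytes into a single match, so a
# multi-byte character is highlighted as a whole.
# Optional argument: a grep --color mode (auto, always, never); default auto.
find_non_ascii() {
  LC_ALL=C grep -a -n -E --color="${1:-auto}" "$(printf '[^\t -~]+')"
}

# Example: only the second line contains a non-ASCII character (é).
printf 'for i in range(3):\n    print("Numéro de boucle", i)\n' | find_non_ascii
```

grep exits with status 1 when nothing matches, so the same function can serve as a quick “is this safe to paste?” check before running the text.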

So – the string was encoded in UTF-8. Now let’s convert the string to ASCII from UTF-8 using iconv:

NOTE: The //translit option is not addressed in the macOS version of man iconv, but it still works (?!). It tells iconv to transliterate: a character that cannot be represented in the target encoding is replaced by one or more similar-looking characters that can. Another option is to ignore the non-ascii character(s): //ignore

echo "print("Numéro de boucle", i)" | iconv -f utf-8 -t ascii//translit
print(Num'ero de boucle, i)

And so you may be wondering, “Why did it add the extra ' character?” That’s a good question, and I think the answer has already been supplied here. Apple may be using utf-8-mac instead of utf-8. Which I guess would be OK if they had bothered to reflect that in their implementation of iconv! In fact, there is a UTF8-MAC encoding listed in the output of iconv --list – but it doesn’t improve the transliteration!

As written, the iconv utility cannot properly convert all utf-8-mac characters to ASCII. It converts the ones it can, and issues an error for the others. To get a “best effort” from iconv you can add the -c option, causing iconv to simply drop the characters it cannot convert. If you have a reasonably current Linux box handy, you can verify that iconv does a correct and proper ‘transliteration’ (//TRANSLIT) of the example used in this answer; i.e. no extra ' character.
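
The “best effort” -c variant can be sketched as a small function. The name to_ascii_lossy is hypothetical, and ignoring the exit status is my assumption: some iconv builds may exit non-zero after dropping a character even when -c is given.

```shell
# to_ascii_lossy: hypothetical helper name. Reads UTF-8 text on stdin and
# writes ASCII to stdout, dropping any character iconv cannot convert (-c).
# The exit status is deliberately ignored (assumption: some iconv builds
# report failure after dropping characters), which also hides real errors.
to_ascii_lossy() {
  iconv -c -f UTF-8 -t ASCII || true
}

# Example: the é has no ASCII equivalent, so -c removes it from the output.
printf 'print("Numéro de boucle", i)\n' | to_ascii_lossy
```

On macOS the clipboard can be piped in directly, with no intermediate file: pbpaste | to_ascii_lossy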

And so, iconv seems to work at least some of the time in macOS… hope this helps.
