Saturday, October 21, 2017

How to handle non-ASCII characters?

Background: I write documents in MS Word, but my target format is HTML. After I do a Save as "Web Page (filtered)", I can use global replaces to get rid of most of the cruft that Word generates, but I have a problem with non-ASCII characters: cent sign, circle-r, dash, nbsp, etc.

None of these looks like itself when I edit the file with vim in a Cygwin terminal window. I can search for [^ -~^t] to find the non-ASCII characters, then go to the original Word document to find out what the correct character is. If I had only a few of these, that would be enough. But in a longer document, a given non-ASCII character can occur hundreds of times. So once I've found (e.g.) an em dash, I want to replace _all_ occurrences with "&mdash;". But I have no way of representing the character I want to replace on the command line.
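Outside of any editor, the same find-then-replace-all idea can be sketched in a few lines of Python. This is only a rough sketch under some assumptions: the function names are made up for illustration, the sample string stands in for a real file, and replacing each character with a numeric entity (rather than a named one like the one for an em dash) is a simplification. Word's filtered HTML is often saved as Windows-1252, but that is a guess, so check the charset declared in the file before decoding it.

```python
import unicodedata
from collections import Counter


def non_ascii_inventory(text):
    """Count characters outside tab, line breaks, and printable ASCII (space through ~).

    Hypothetical helper; mirrors the [^ -~^t] search, with line breaks excluded.
    """
    return Counter(
        ch for ch in text
        if ch not in "\t\r\n" and not (" " <= ch <= "~")
    )


def to_entities(text):
    """Replace every non-ASCII character with a numeric HTML entity.

    Relies on the standard 'xmlcharrefreplace' error handler of str.encode.
    """
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")


# Stand-in for file contents. For a real file, something like
# open(path, encoding="cp1252").read() might work, but the encoding is an assumption.
sample = "5\u00a2 off \u2014 Acme\u00ae\u00a0Inc. \u2014 today"

# List each distinct non-ASCII character once, with its code point and name.
for ch, count in non_ascii_inventory(sample).most_common():
    print("U+{:04X} {} x{}".format(ord(ch), unicodedata.name(ch, "UNKNOWN"), count))

print(to_entities(sample))
```

The inventory step answers "what is this character?" once per distinct character instead of once per occurrence, and the entity step leaves a pure-ASCII file that displays the same in a terminal as in a browser.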

I usually bring up the HTML file in Emacs so I can tell it to do a replace-all on the character. I sort of know Emacs, but every time I want to do anything more than basic editing I have to look up the commands I want with ^hapropos. Is there a way to do this in vim without getting into Emacs?

Note: ^t is what a tab character looks like on the vim command line.
