Friday, January 22, 2016

problem: editing a _corrupted_ CP1252 file

I have a number of 8-bit text files that _should_ be in CP1252, but they may
contain byte values that are undefined in CP1252, e.g. \x81, \x8D, \x8F, \x90 and \x9D.
I.e., these are potentially corrupted files that are mostly legal CP1252, and I have to
make them fully legal CP1252.
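
To make the kind of corruption concrete, here is a minimal sketch, outside of Vim, that reports where the undefined byte values occur in a file's raw bytes. It is only an illustration in Python; the names UNDEFINED_CP1252, find_undefined_bytes and format_hit are hypothetical, and it assumes lines end in "\n".

```python
# The byte values that CP1252 leaves undefined (the five listed above).
UNDEFINED_CP1252 = frozenset({0x81, 0x8D, 0x8F, 0x90, 0x9D})


def find_undefined_bytes(data: bytes):
    """Return a list of (line, column, byte_value) for each undefined byte.

    Lines are split on b"\n"; line and column numbers are 1-based.
    CP1252 is a single-byte encoding, so a byte column is also a
    character column.
    """
    hits = []
    for line_no, line in enumerate(data.split(b"\n"), start=1):
        for col, value in enumerate(line, start=1):
            if value in UNDEFINED_CP1252:
                hits.append((line_no, col, value))
    return hits


def format_hit(filename: str, hit) -> str:
    """Format one hit as filename:line:column: <XX>."""
    line_no, col, value = hit
    return f"{filename}:{line_no}:{col}: <{value:02X}>"
```

To use it, read the file in binary mode, e.g. open(path, "rb").read(), and pass the bytes to find_undefined_bytes. Reading in binary mode matters here: any text-mode decoding would already have to decide what to do with the bad bytes.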

The Problem: if I edit them as CP1252, the illegal bytes get converted into question-mark characters in the buffer.

Background

My buffer 'encoding' is always UTF-8. (I have to edit files in a number of different encodings, and
this usually works well.)

I have a little alias gvim1252 set to

gvim -c "e ++enc=cp1252"

so that invoking

$ gvim1252 filename.txt

loads filename.txt (let's assume that it _should_ be CP1252) and effectively invokes the command

:e ++enc=cp1252

telling gvim that the 'fileencoding' is (or at least should be) cp1252.

Inside the edit buffer (where the 'encoding' is UTF-8), any illegal byte values from the original input file
(such as \x81 and the four others listed above) cannot be converted from CP1252 to UTF-8,
because they are undefined in CP1252, so they are silently replaced with plain question-mark characters.

Even worse, if I then just write the buffer back out to file, the question marks in the buffer are
written to file as question marks. I lose the information about the original bad bytes, and in my case,
that's dangerous behavior. I need to easily find, evaluate, and fix such illegal characters during my editing.

Desired Behavior

1. When I edit a file that should be CP1252 (but might be corrupted with byte values
like \x81), and when I specify ++enc=cp1252, I'd like the bad byte values to be retained in the buffer,
perhaps shown as highlighted

<81>

or something else that stands out more than a plain question-mark character. These files can also
contain original question-marks that are supposed to be question marks.

2. If I write the buffer back to file, I'd like any illegal bytes like <81> that I haven't found/fixed
to be written back to file as they were originally. (I understand that this might be problematic.)

3. And, when I invoke ++enc=cp1252 on a corrupted file, perhaps I'd like some kind of error message telling me
that the file was not in the indicated cp1252 encoding. Even refusing to accept the ++enc command, for a corrupted
file, would be better than the current silent replacement of illegal bytes with question marks.
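
For comparison only, the three behaviors above can be sketched outside of Vim with Python's standard cp1252 codec. This is not a Vim solution, just an illustration of what I mean: a strict decode refuses a corrupted file (point 3), the "surrogateescape" error handler keeps each undecodable byte as a placeholder code point that can be displayed as <81> (point 1), and encoding with the same handler writes the original byte back (point 2). All function names here are hypothetical.

```python
def is_valid_cp1252(data: bytes) -> bool:
    """Point 3: a strict decode fails on any byte undefined in CP1252."""
    try:
        data.decode("cp1252")
    except UnicodeDecodeError:
        return False
    return True


def load_keeping_bad_bytes(data: bytes) -> str:
    """Decode CP1252, keeping each undecodable byte 0xXX as U+DCXX."""
    return data.decode("cp1252", errors="surrogateescape")


def show_bad_bytes(text: str) -> str:
    """Point 1: render the kept bytes as <XX> so they stand out."""
    return "".join(
        f"<{ord(ch) - 0xDC00:02X}>" if 0xDC80 <= ord(ch) <= 0xDCFF else ch
        for ch in text
    )


def save_keeping_bad_bytes(text: str) -> bytes:
    """Point 2: encode back to CP1252, restoring any unfixed bad bytes."""
    return text.encode("cp1252", errors="surrogateescape")
```

With this approach an original question mark stays a question mark, and a bad byte that I have not fixed survives a load/save round trip unchanged. The <XX> rendering is for display only; the text that gets saved must be the one returned by load_keeping_bad_bytes, not the rendered copy.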

**** Any help getting the desired behavior would be much appreciated.

Ken


********************************
Kenneth R. Beesley, D.Phil.
PO Box 540475
North Salt Lake UT 84054
USA





--
You received this message from the "vim_use" maillist.
Do not top-post! Type your reply below the text you are replying to.
For more information, visit http://www.vim.org/maillist.php
