Thursday, January 28, 2016

Re: editing a _corrupted_ CP1252 file


Erik Christiansen <dvalin@internode.on.net>: Jan 27 08:03PM +1100

On 26.01.16 13:11, Kenneth Reid Beesley wrote:
> setglobal fileencoding=utf-8
 
> " when editing an existing file, try to read it in these encodings, use the first that succeeds
> set fileencodings=ucs-bom,utf-8,latin1
 
It doesn't seem all that complicated. A quick test on a "simulated
corrupted CP1252" file like yours displayed <81>, and 8g8 worked,
both without any fiddling at all. I have:
 
set encoding=utf-8 " Is default anyway, IIRC.
 
fileencodings=ucs-bom,utf-8,default,latin1
 
which seems to have latched on: fileencoding=latin1
given the input file.
 
That seems to confirm what ":h 8g8" says:
 
This works in two situations:
1. when 'encoding' is any 8-bit encoding
2. when 'encoding' is "utf-8" and 'fileencoding' is any 8-bit encoding
 
And I don't even have any "++bad=keep" anywhere.
 
Erik




Hi Erik,

Thanks again.  Here are the key points:

I know (omnisciently) that certain files should be cp1252, but that some of them are corrupted with undefined-in-cp1252 bytes like \x81.

I want to edit the file as CP1252, i.e., I want the fileencoding to be cp1252 because that's what the file is _supposed_ to be.
I want to see/find any bad bytes so that I can manually correct them (but I'm not perfect—I might miss some bad bytes).

And, crucially, when I write the buffer back to the file, I want the fileencoding to be cp1252 so that gvim will refuse to write the file if I've missed any bad bytes like \x81.

gvim -c "e ++enc=cp1252 ++bad=keep" filethatshouldbecp1252.txt

gives me exactly what I want.  The fileencoding is forced to be cp1252.  I see bad-for-cp1252 bytes like \x81 kept and displayed in blue as <81>, and I can find them with 8g8.  And if I neglect to fix a bad byte like \x81 and then try to write the buffer back to the file, I get an error message saying that conversion failed.
That effectively forces me to fix the file, making it legal cp1252, before the buffer is written back to the file.
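
For readers who want to check a file outside Vim, here is a minimal Python sketch of the same idea: locate bytes that have no meaning in cp1252. It assumes Python's built-in "cp1252" codec, which treats 0x81, 0x8D, 0x8F, 0x90 and 0x9D as undefined. The function name and the sample bytes are hypothetical, and this only illustrates the check; it is not how Vim implements it.

```python
def undefined_cp1252_offsets(data):
    """Return the offsets of bytes that Python's cp1252 codec cannot decode."""
    offsets = []
    for index, value in enumerate(data):
        try:
            # Decode one byte at a time so every bad byte is reported,
            # not just the first one.
            bytes([value]).decode("cp1252")
        except UnicodeDecodeError:
            offsets.append(index)
    return offsets


# Hypothetical sample: valid cp1252 text with one stray \x81 at offset 5.
sample = b"caf\xe9 \x81 na\xefve"
print(undefined_cp1252_offsets(sample))  # [5]
```

An empty list means every byte in the data is defined in cp1252 as Python understands it.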

If the fileencoding defaults to latin1, gvim will happily write the bad-for-cp1252 bytes back to the original file, which remains corrupted.
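
The difference between the two encodings can be sketched in Python, again assuming Python's built-in codecs rather than Vim's converter: latin1 maps all 256 byte values, so a stray \x81 round-trips silently, while cp1252 rejects it. The helper name and sample bytes are hypothetical.

```python
def roundtrips(data, encoding):
    """Return True if data decodes and re-encodes unchanged in the given encoding."""
    try:
        return data.decode(encoding).encode(encoding) == data
    except (UnicodeDecodeError, UnicodeEncodeError):
        return False


# Hypothetical corrupted data: \x80 is the euro sign in cp1252, \x81 is undefined.
corrupted = b"price: \x80 5 \x81"
print(roundtrips(corrupted, "latin1"))  # True: latin1 accepts every byte
print(roundtrips(corrupted, "cp1252"))  # False: \x81 has no cp1252 meaning
```

This is why a latin1 fileencoding gives no protection here: nothing ever fails to convert, so the corruption is written back unnoticed.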

Best,

Ken
