Tuesday, January 26, 2016

Re: editing a _corrupted_ CP1252 file


Kenneth Reid Beesley <krbeesley@gmail.com>: Jan 25 11:18AM -0700


I know that some of my files are _supposed_ to be CP1252.
But beforehand I don't know if or how they are corrupted. Usually the problem in a corrupted file is the presence of \x81, \x8D, \x8F, \x90 and/or \x9D bytes, which are illegal/undefined bytes in CP1252.
The files are programs, so I need to zero in on each invalid byte (invalid for CP1252), figure what's going on, and edit it appropriately.
So it needs to be done by hand. (There are not a lot of such bad characters.)
 
Again, the problem is that if I (try to) edit a corrupted file as CP1252 with :e ++enc=cp1252, the bad bytes get silently replaced in the buffer with question marks, which hides the problem rather than helping me find the bad bytes.
If I use 'tr' to replace the illegal bytes with some kind of valid bytes, then the problems are just hidden some other way.
If I try to edit a file as CP1252, using :e ++enc=cp1252, and the file contains invalid bytes, then I need alarm bells to go off somehow.
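
One way to get that kind of alarm outside of Vim is to scan the raw bytes before editing. The following is only a minimal Python sketch under stated assumptions: the function name find_undefined_cp1252 is hypothetical, the five byte values are the ones listed above, and columns are counted in bytes (which matches character columns only when the file really is a single-byte encoding).

```python
# The five byte values that have no character assigned in CP1252.
CP1252_UNDEFINED = frozenset({0x81, 0x8D, 0x8F, 0x90, 0x9D})


def find_undefined_cp1252(data: bytes):
    """Return a list of (line, column, byte) for each undefined byte.

    Lines and columns are 1-based; columns count bytes, not characters.
    """
    hits = []
    line, col = 1, 0
    for b in data:
        col += 1
        if b in CP1252_UNDEFINED:
            hits.append((line, col, b))
        if b == 0x0A:  # newline: the next byte starts a new line
            line, col = line + 1, 0
    return hits


# Example on in-memory bytes; for a real file, read it with open(path, "rb").
for line, col, b in find_undefined_cp1252(b"abc\x81\nok\x9d"):
    print(f"line {line}, byte column {col}: 0x{b:02X}")
```

The reported line and column can then be used to jump to each spot by hand in the editor; nothing is replaced or rewritten by the scan itself.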
 

 


Erik Christiansen <dvalin@internode.on.net>: Jan 26 02:58PM +1100

On 25.01.16 11:18, Kenneth Reid Beesley wrote:
> (invalid for CP1252), figure what's going on, and edit it
> appropriately. So it needs to be done by hand. (There are not a lot
> of such bad characters.)
 
Ah, not simply remapping, then. For UTF-8, Vim has the "8g8" command, to
hop to the next encoding violation. Unfortunately, there's no mention
there of any ability to do that for CP1252.
 
What happens if you have fenc=utf-8, open the cp1252 file, and press 8g8 ?
 
Erik


Thanks again, Erik,

Here's my usual .gvimrc setup for encodings:

"  encoding used internally in the edit buffer
set encoding=utf-8

"  default encoding for saving any new files created
setglobal fileencoding=utf-8

" when editing an existing file, try to read it in these encodings, use the first that succeeds
set fileencodings=ucs-bom,utf-8,latin1

***************

Here's a little test.  First I create a little file that has bytes for 'a', 'b', 'c' and \x81 (which has no graphic character assigned in either ISO-8859-1 (Latin 1) or CP1252 (also known as Windows-1252)):

$ printf "\x61\x62\x63\x81" > corrupted.txt

I confirm that the bytes are as expected using

$ od -t x1 corrupted.txt

If I try to iconv the file from cp1252 to UTF-8

$ iconv -f cp1252 -t UTF-8 corrupted.txt > utf8.txt

iconv chokes appropriately on the \x81 byte, outputting the error message

"iconv: corrupted.txt:1:2: cannot convert"

If I simply try to gvim the file, with the fileencodings as shown above, 

$ gvim corrupted.txt

the fileencoding gets set to latin1 (the last option in fileencodings), and the offending \x81 byte gets displayed as <81> (in blue).
The blue <81> represents a single byte, and the command 8g8 (which you suggested) moves the cursor to that byte.  That's not bad.  
At least I can find the offending bytes in a corrupted file.

********* Tests: Fiddling with fileencodings

\x81 is undefined in both ISO-8859-1 and in CP1252
\x80 is  _defined_ in CP1252 but not assigned in ISO-8859-1 (but see below; \x80 seems officially to be a legal but non-graphic "C1" control character in latin1)

I create another little test file:

$ printf "\x80\x61\x62\x63\x81" > corrupted2.txt

$ iconv -f latin1 -t utf-8 corrupted2.txt

This, unfortunately, works.  Even though \x80 and \x81 are non-graphic bytes in Latin1, they are somehow considered valid, though little used, "C1" control bytes.
I've never quite understood this.  It makes it very hard to distinguish between ISO-8859-1 and CP1252.

$ iconv -f cp1252 -t utf-8 corrupted2.txt

chokes (appropriately) on the \x81 byte, which is not defined in cp1252.
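
The same fail-loudly behaviour can be reproduced without iconv. As a hedged illustration, Python's built-in cp1252 codec also treats \x81 as undefined and raises an error under its default strict handling. The helper name check_cp1252 is hypothetical, and the bytes are the ones written to corrupted2.txt above.

```python
def check_cp1252(data: bytes):
    """Return None if data decodes as CP1252, else (offset, byte) of the first failure."""
    try:
        data.decode("cp1252")  # error handling defaults to "strict"
    except UnicodeDecodeError as err:
        return err.start, data[err.start]
    return None


corrupted2 = b"\x80\x61\x62\x63\x81"  # same bytes as corrupted2.txt
print(check_cp1252(corrupted2))       # (4, 129): 0-based offset 4, byte 0x81
print(check_cp1252(b"\x80\x61\x62\x63"))  # None: \x80 is the euro sign in CP1252
```

Unlike the scan for all bad bytes, this stops at the first failure, which is closer to what iconv does.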

$ iconv -f latin1 -t utf-8 corrupted2.txt

works without complaint.  Sigh.
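
The reason latin1 never complains can be checked byte by byte. The sketch below (Python, with a hypothetical helper name rejected_bytes) tries every single-byte value against a codec. It relies on Python's own codec tables, which may differ in detail from what iconv or Windows itself does with these bytes.

```python
def rejected_bytes(encoding: str):
    """List the single byte values that fail a strict decode in the given codec."""
    bad = []
    for value in range(256):
        try:
            bytes([value]).decode(encoding)
        except UnicodeDecodeError:
            bad.append(value)
    return bad


print(rejected_bytes("latin-1"))                   # []
print([hex(v) for v in rejected_bytes("cp1252")])  # ['0x81', '0x8d', '0x8f', '0x90', '0x9d']
```

Latin1 maps all 256 byte values (with \x80 through \x9F going to the C1 control range U+0080 through U+009F), so a latin1 decode cannot fail on any input and cannot be used as a validity test.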

If I change fileencodings to

set fileencodings=ucs-bom,utf-8,iso-8859-1,cp1252

and

$ gvim corrupted2.txt

the fileencoding gets set to latin1, but with <80> and <81> displayed in blue (representing single bytes) in the edit buffer.  The blue <HH> notation
seems to indicate that the byte is non-graphic.  There is no font glyph assigned to it.  I can use 8g8 to find the bad bytes displayed in blue as <HH>.

If I change fileencodings to

set fileencodings=ucs-bom,utf-8,cp1252,iso-8859-1

and

$ gvim corrupted2.txt

then the fileencoding is again detected as (or defaulted to) latin1, with the <80> and <81> displayed in blue.

******* testing with a valid cp1252 file

$ printf "\x80\x61\x62\x63" > cp1252.txt

$ gvim cp1252.txt

brings up the file without any blue <HH> bytes.  The file is detected with fileencoding cp1252.

If I then change fileencodings to

set fileencodings=ucs-bom,utf-8,iso-8859-1,cp1252

and open the cp1252.txt file

$ gvim cp1252.txt

the encoding is detected as latin1 (because iso-8859-1 gets tried before cp1252, and it succeeds because
the value \x80 is a legal "C1" but non-graphic control byte in Latin1).

********** 

Conclusions:

It's rather hard to test if a file is Latin1 vs CP1252 because Latin1 does allow non-graphic "C1" control bytes
in the range \x80 through \x9F.

It seems worthless to set

set fileencodings=ucs-bom,utf-8,iso-8859-1,cp1252

because any valid cp1252 file (or even a file intended to be cp1252 but containing illegal, for cp1252, bytes like \x81)
will succeed as iso-8859-1, and the fileencoding will be assigned as iso-8859-1 (latin1).

Setting

set fileencodings=ucs-bom,utf-8,cp1252,iso-8859-1

will cause 'gvim' to edit a file with fileencoding cp1252 if the file contains bytes in the \x80 to \x9F range that are
legal for cp1252.
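
The effect of the ordering can be imitated with a try-each-in-turn loop. This is only a rough Python analogy of the "first encoding that succeeds" rule, not Vim's actual implementation: the function name first_matching_encoding is hypothetical, Python's codecs stand in for Vim's conversions, and the ucs-bom entry is left out.

```python
def first_matching_encoding(data: bytes, candidates):
    """Return the first codec name in candidates that decodes data strictly, else None."""
    for name in candidates:
        try:
            data.decode(name)
        except UnicodeDecodeError:
            continue  # this candidate failed; try the next one
        return name
    return None


valid = b"\x80\x61\x62\x63"          # same bytes as cp1252.txt
corrupted = b"\x80\x61\x62\x63\x81"  # same bytes as corrupted2.txt

print(first_matching_encoding(valid, ["utf-8", "cp1252", "latin-1"]))      # cp1252
print(first_matching_encoding(valid, ["utf-8", "latin-1", "cp1252"]))      # latin-1
print(first_matching_encoding(corrupted, ["utf-8", "cp1252", "latin-1"]))  # latin-1
```

With latin-1 listed before cp1252, the cp1252 entry is never reached, which mirrors the results of the gvim tests above.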

What's dangerous (for me) is invoking

$ gvim -c "e ++enc=cp1252" filethatshouldbecp1252.txt

on a file that should be cp1252 but might contain illegal bytes \x81, \x8D, \x8F, \x90 and \x9D, because
if such undefined bytes do appear in the file, they get silently converted to question-mark characters.
Ideally, alarm bells would go off.  This should fail like iconv does when told to convert a file as cp1252
when it isn't valid cp1252.  At least the illegal/undefined bytes should be displayed in blue as <81>
or whatever.  What's dangerous for me is the silent conversion of invalid characters like \x81 to question marks.

Thanks again,

Ken


********************************
Kenneth R. Beesley, D.Phil.
PO Box 540475
North Salt Lake UT 84054
USA




