Monday, January 25, 2016

Re: editing a _corrupted_ CP1252 file


On 23Jan2016, at 22:44, vim_use@googlegroups.com wrote:

On 22.01.16 17:45, Kenneth Reid Beesley wrote:
> contain byte values that are undefined for CP1252, e.g. \x81, \x8D, \x8F, \x90 and \x9d.
> I.e. these are potentially corrupted files that are mostly legal CP1252, should be legal
> CP1252, and I have to make them legal CP1252.

Eric Christiansen replied:
 
Have you considered using e.g. tr to translate everything in one go?
E.g.
 
$ tr '\201\215\217\220\235' 'ABCDE' < filename
 
In that line, \201 is octal for \x81, etc. The replacement characters
could also be specified in octal, if they're sufficiently weird. It
won't handle unicode, but that's not required here.
 
The job could also be done by sed or awk. Doing it by hand seems rather
laborious.

Thanks for the message, but tr is not a very attractive solution in my case.
I know that the files are _supposed_ to be CP1252.
But beforehand I don't know whether or how they are corrupted.  Usually the problem in a corrupted file is the presence of \x81, \x8D, \x8F, \x90 and/or \x9D bytes,
which are illegal/undefined bytes in CP1252.
The files are programs, so I need to zero in on each invalid byte (invalid for CP1252), figure out what's going on, and edit it appropriately.
So it needs to be done by hand.  (There are not a lot of such bad characters.)

Again, the problem is that if I (try to) edit a corrupted file as CP1252 with :e ++enc=cp1252, the bad bytes get silently replaced in the buffer with question marks, which hides the problem rather than helping me find the bad bytes.
If I use 'tr' to replace the illegal bytes with some kind of valid bytes, then the problems are just hidden some other way.
If I try to edit a file as CP1252, using :e ++enc=cp1252, and the file contains invalid bytes, then I need alarm bells to go off somehow.
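One way to get such alarm bells outside the editor might be a small scanner that reports where the undefined bytes sit, so that each one can then be visited and fixed by hand. The following is only a sketch: the function name find_undefined_cp1252 and the output format are made up for illustration, and the set of bytes is just the five listed above.

```python
import sys

# The five byte values that have no assignment in CP1252.
UNDEFINED_CP1252 = frozenset(b"\x81\x8d\x8f\x90\x9d")


def find_undefined_cp1252(data):
    """Return (line, column, byte) for each undefined byte in data.

    Lines and columns are 1-based; the column counts raw bytes
    within the line, not displayed characters.
    """
    hits = []
    line, col = 1, 0
    for b in data:
        col += 1
        if b in UNDEFINED_CP1252:
            hits.append((line, col, b))
        if b == 0x0A:  # newline byte: the next byte starts a new line
            line += 1
            col = 0
    return hits


if __name__ == "__main__":
    # Hypothetical usage: python find_bad_cp1252.py file1 [file2 ...]
    for path in sys.argv[1:]:
        with open(path, "rb") as f:  # binary mode: no decoding, no replacement
            for line, col, b in find_undefined_cp1252(f.read()):
                print(f"{path}:{line}:{col}: undefined CP1252 byte 0x{b:02X}")
```

Reading the file in binary mode means nothing is silently replaced, and the reported line numbers can be used to jump to each spot in the editor.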

Looking at my .gvimrc file, I have the line

set fileencodings=ucs-bom,utf-8,iso-8859-1

I note that if I simply edit such a corrupted file without specifying :e ++enc=cp1252, then apparently gvim goes through the list of fileencodings, failing with ucs-bom, failing with utf-8, and then defaulting to try to edit the file as iso-8859-1.  The resulting edit buffer _retains_ any bad bytes, displaying them as <81>, <8d>, <8f>, <90> and <9d>, which is helpful.

Perhaps the best I can do right now is to specify 

set fileencodings=ucs-bom,utf-8,cp1252,iso-8859-1

I'll try that for now.
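The behaviour hoped for from that setting can be modelled roughly as "take the first encoding in the list that decodes the whole file without error". The sketch below is not Vim's implementation: it uses Python's strict codecs, leaves out the ucs-bom step, and the name detect_encoding is invented for illustration. Whether Vim itself rejects the undefined bytes when trying cp1252 depends on the converter it uses, which is an assumption here.

```python
def detect_encoding(data, candidates=("utf-8", "cp1252", "iso-8859-1")):
    """Return the first candidate encoding that decodes data without error.

    Mimics a try-in-order fallback list; returns None if every candidate fails.
    """
    for name in candidates:
        try:
            data.decode(name)  # strict decoding raises on any invalid byte
        except UnicodeDecodeError:
            continue  # this encoding failed, try the next one
        return name
    return None


# A clean CP1252 file is accepted as cp1252 ...
print(detect_encoding("caf\xe9".encode("cp1252")))
# ... while a file containing \x81 falls through to iso-8859-1.
print(detect_encoding(b"x\x81y"))
```

Under this model a file that ends up as iso-8859-1 rather than cp1252 is exactly one that needs a closer look; in Vim, :set fileencoding? shows which encoding was actually picked.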

Thanks again,

Ken


********************************
Kenneth R. Beesley, D.Phil.
PO Box 540475
North Salt Lake UT 84054
USA