Thursday, August 30, 2012

Re: ASCIIfication (removal of accent, cedilla, etc)

Hi Benjamin!

[reformatted]

On Wed, 29 Aug 2012, Benjamin R. Haskell wrote:

> On Wed, 29 Aug 2012, Tim Chase wrote:
>
> >I've got some Portuguese text that I need to perform some
> >transformations on to make them ASCII (7-bit). That means
> >removing accent marks, cedillas, tildes, etc.
>
> Just to cover my bases: this seems like a bad idea in general. I
> don't know much about Portuguese, but one of the minimal pairs
> listed in the Wikipedia article for Portuguese phonology¹ is:
>
> pensamos "we think"
> vs.
> pensámos "we thought"
>
>
> >Is there some fast transform in Vim that I've missed, or an easy
> >way to go about this?
>
> In most contexts, Unicode strings are stored in Normal Form C (NFC),
> which means they're equivalent to having passed through Canonical
> Decomposition followed by Canonical Composition. This means that
> any characters that have "combined" codepoints are so combined.
>
> Characters in Unicode strings stored in Normal Form D (NFD) (==
> Canonical Decomposition) have their "combined" codepoints split into
> the base codepoint and "combining character" codepoints.
>
> As a practical example, the string "é" is:
>
> in NFC:
>
> U+00E9 LATIN SMALL LETTER E WITH ACUTE
>
> in NFD:
>
> U+0065 LATIN SMALL LETTER E
> U+0301 COMBINING ACUTE ACCENT
>
> The Unicode Consortium has full details².
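
The NFC/NFD difference described above can be reproduced with Python's standard unicodedata module. This is only a small sketch; the helper name describe() is made up for illustration:

```python
import unicodedata

def describe(text):
    """Return one 'U+XXXX NAME' string per code point (hypothetical helper)."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch)}" for ch in text]

composed = "\u00e9"  # e with acute as a single precomposed code point
nfc = unicodedata.normalize("NFC", composed)
nfd = unicodedata.normalize("NFD", composed)

print(describe(nfc))  # one entry: the precomposed letter
print(describe(nfd))  # two entries: base letter plus combining accent
print(nfc == nfd)     # the strings render alike but do not compare equal
```

Both strings display the same character, which is why the difference usually goes unnoticed until something compares or counts code points.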

The interesting part is that you can stack many combining chars on a
single base char to create new characters that don't exist as separate
precomposed code points. And BTW: for decomposed chars, Vim's
'delcombine' option can be useful.

One of the major drawbacks is that this can cause interoperability
issues when exchanging data between Unix and Mac OS X, because on Unix
the NFC form is typically used, while Mac OS X stores filenames in (a
variant of) NFD form. I have already seen problems like this:

#v+
chrisbra@R500:~/charset$ ls
ä ä
chrisbra@R500:~/charset$ ls |xxd
0000000: 61cc 880a c3a4 0a a......
#v-

So one filename consists of
U+0061 LATIN SMALL LETTER A
U+0308 COMBINING DIAERESIS
while the other filename is stored as
U+00E4 LATIN SMALL LETTER A WITH DIAERESIS

In this case you can convert the filenames with convmv, using its
--nfc or --nfd switch.
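
To find such names before converting them, something like the following sketch may help. It assumes the names are already decoded as UTF-8; not_nfc() is a made-up helper, not part of convmv:

```python
import unicodedata

def not_nfc(names):
    """Return the names that would change if normalized to NFC."""
    return [n for n in names if unicodedata.normalize("NFC", n) != n]

decomposed = "a\u0308"   # U+0061 + U+0308, UTF-8 bytes 61 cc 88
precomposed = "\u00e4"   # U+00E4, UTF-8 bytes c3 a4

# Same byte sequences as in the xxd output above
print(decomposed.encode("utf-8").hex(), precomposed.encode("utf-8").hex())

# Only the decomposed name is reported
print(not_nfc([decomposed, precomposed]) == [decomposed])
```

For a real directory, the list could come from os.listdir(".") instead of the two hard-coded names.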

I have also seen queries from developers asking why data sometimes
looks totally garbled. After investigating, this turned out to be caused
by NFC/NFD confusion (or by programs not converting those chars
correctly).

> The 'icu' project³ (International Components for Unicode) has a
> converter similar to `iconv` called `uconv`, which also lets you
> specify a transliterator to run over the input. So, to get rid of
> accents, cedillas, tildes, etc, you can convert your text into
> Unicode NFD, then convert it to ASCII and discard any characters not
> in ASCII (which includes the combining accent marks).
>
> Assuming the text is encoded in the same encoding as your current
> locale and you're in a Unicode locale, you can pipe it through:
>
> uconv -t ASCII -x nfd -c
>
> -t ASCII = convert to ASCII (t = to/target)
> -x nfd = use the NFD transliterator
> -c = discard any characters that don't have equivalents in the target
>
> If your source data is in a different encoding and/or you're not in
> a Unicode locale (or just a differently-encoded locale), you might
> have to be more explicit, e.g.:
>
> uconv -f SOURCE-ENCODING -t ASCII -x nfd -c
>
> (where SOURCE-ENCODING could be, e.g. ISO-8859-1 or ISO-8859-15 --
> full list from running `uconv -l`)
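
The same idea (decompose to NFD, then throw away everything that is not ASCII) can be sketched in Python. This is not a reimplementation of uconv, just an illustration of the technique; strip_to_ascii() is a made-up name:

```python
import unicodedata

def strip_to_ascii(text):
    """Decompose to NFD, then drop every non-ASCII code point."""
    decomposed = unicodedata.normalize("NFD", text)
    # "ignore" silently discards what ASCII cannot represent,
    # which includes the combining accent marks
    return decomposed.encode("ascii", "ignore").decode("ascii")

print(strip_to_ascii("pens\u00e1mos"))       # input: pensámos
print(strip_to_ascii("cora\u00e7\u00e3o"))   # input: coração
```

Like the uconv pipeline, this silently drops any character that has no canonical decomposition into an ASCII base letter.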

Thanks Benjamin, that is really useful. I didn't know about uconv and
this looks interesting. Unfortunately, this doesn't work really well.
Consider this test file:
#v+
chrisbra@R500:~/charset$ cat file_utf8_nfc.txt
èéêëē
ß
ü
€
Æ
Oﬃce
ế
2⁵
chrisbra@R500:~/charset$ uconv -f utf-8 -t ASCII -x nfd -c file_utf8_nfc.txt
eeeee

u


Oce
e
2
#v-

Slightly better is to transliterate into NFKD form (compatibility
decomposition, which also replaces single characters such as ligatures
and superscripts with similar plain letters) before deleting the
non-ASCII chars, but this doesn't work correctly either:

#v+
chrisbra@R500:~/charset$ uconv -f utf-8 -t ASCII -x nfkd -c file_utf8_nfc.txt
eeeee

u


Office
e
25
#v-
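
Why some characters survive NFKD and others vanish can be checked per character. A small sketch using Python's unicodedata module (the sample labels and the decomposition_report() helper are made up for illustration):

```python
import unicodedata

samples = {
    "ffi ligature": "\ufb03",
    "superscript five": "\u2075",
    "eszett": "\u00df",
    "ae ligature": "\u00c6",
    "euro sign": "\u20ac",
}

def decomposition_report(chars):
    """Map each label to a tuple (NFD result, NFKD result)."""
    return {
        label: (unicodedata.normalize("NFD", ch),
                unicodedata.normalize("NFKD", ch))
        for label, ch in chars.items()
    }

for label, (nfd, nfkd) in decomposition_report(samples).items():
    # ascii() shows escapes, so unchanged characters are easy to spot
    print(label, ascii(nfd), ascii(nfkd))
```

Characters whose NFKD result is still non-ASCII have no decomposition at all, so discarding non-ASCII code points afterwards removes them entirely.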

As you can see, this doesn't work really well for some of the more
exotic chars. Even the German Eszett 'ß', which is hardly obscure, isn't
converted to ss, although that should certainly be possible.
In this case, iconv still works better:

#v+
chrisbra@R500:~/charset$ iconv -f utf-8 -t ascii//translit file_utf8_nfc.txt
eeeee
ss
ue
EUR
AE
Office
e
2?
#v-

The //translit suffix tells iconv to use an approximation if a char
cannot be converted directly.
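
A similar approximation can be sketched by hand: replace the characters that have no decomposition from a small table first, then decompose and drop the rest. This is not how iconv implements //translit; the FALLBACKS table and the asciify() name are assumptions made for illustration, and the table is deliberately tiny:

```python
import unicodedata

# Hypothetical replacement table for chars without a decomposition
FALLBACKS = {
    "\u00df": "ss",   # eszett
    "\u00c6": "AE",   # AE ligature
    "\u00e6": "ae",
    "\u20ac": "EUR",  # euro sign
}

def asciify(text, fallbacks=FALLBACKS):
    """Apply the table, decompose to NFKD, then drop non-ASCII chars."""
    replaced = "".join(fallbacks.get(ch, ch) for ch in text)
    decomposed = unicodedata.normalize("NFKD", replaced)
    return decomposed.encode("ascii", "ignore").decode("ascii")

for line in ["\u00df", "\u00fc", "\u20ac", "\u00c6", "O\ufb03ce", "2\u2075"]:
    print(asciify(line))
```

Unlike the iconv run above, this sketch turns ü into a plain u rather than ue, and anything missing from the table that has no decomposition is still dropped silently.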

To come back to Vim: it should be possible to use Vim's iconv() function
together with the //translit suffix to strip those diacritics.
Unfortunately, this doesn't seem to work very well (and doesn't seem to
work on Windows at all, although my Vim has +iconv/dyn and I have
iconv.dll¹ lying around):

:%s#.#\=iconv(submatch(0), 'utf-8', 'ascii//translit')#g
produces:
?????
ss
?
EUR
AE
Office
?
2?

For German readers, I have also blogged about this at:
https://blog.256bit.org/archives/768-Das-Problem-mit-UTF-8-Teil2.html
https://blog.256bit.org/archives/724-Das-Problem-mit-UTF-8.html

For reference, I'll save this mail at
http://www.256bit.org/~chrisbra/utf8_mail.html
in case Google Groups mangles the characters; browsers seem to be
better at rendering multibyte characters.

¹) In case you are looking for an iconv.dll for Windows, you can download
it from here:
http://sourceforge.net/projects/gettext/files/latest/download
and while you are at it, you should probably also download intl.dll from
http://sourceforge.net/projects/gettext/files/gettext-win32/0.13.1/gettext-runtime-0.13.1.bin.woe32.zip


regards,
Christian

