Wednesday, August 29, 2012

Re: ASCIIfication (removal of accent, cedilla, etc)

The tl;dr version, pipe it through:

uconv -t ASCII -x nfd -c


On Wed, 29 Aug 2012, Tim Chase wrote:

> I've got some Portuguese text that I need to perform some
> transformations on to make them ASCII (7-bit). That means removing
> accent marks, cedillas, tildes, etc.

Just to cover my bases: this seems like a bad idea in general, since an
accent can be the only thing that distinguishes two words. I don't know
much about Portuguese, but one of the minimal pairs listed in the
Wikipedia article on Portuguese phonology¹ is:

pensamos "we think"
vs.
pensámos "we thought"


> Is there some fast transform in Vim that I've missed, or an easy way
> to go about this?

In most contexts, Unicode strings are stored in Normalization Form C
(NFC), which means they're equivalent to having passed through Canonical
Decomposition followed by Canonical Composition. This means that any
sequence that has a precomposed ("combined") codepoint is stored as that
single codepoint.

Characters in Unicode strings stored in Normalization Form D (NFD),
i.e. Canonical Decomposition, have their "combined" codepoints split
into the base codepoint and "combining character" codepoints.

As a practical example, the string "é" is:

in NFC:

U+00E9 LATIN SMALL LETTER E WITH ACUTE

in NFD:

U+0065 LATIN SMALL LETTER E
U+0301 COMBINING ACUTE ACCENT

The Unicode Consortium has full details².
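
To see the two forms side by side, here is a minimal sketch in Python
using only the standard library's `unicodedata` module. Python is
just a convenient choice for illustration, and the helper name
`codepoints` is made up for this example.

```python
import unicodedata

def codepoints(s):
    """Return one 'U+XXXX NAME' string per codepoint in s (hypothetical helper)."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch)}" for ch in s]

e_acute = "\u00e9"  # "é" as a single precomposed codepoint

nfc = unicodedata.normalize("NFC", e_acute)  # stays one codepoint
nfd = unicodedata.normalize("NFD", e_acute)  # base letter + combining mark

print(codepoints(nfc))
print(codepoints(nfd))
```

The NFC list should have one entry and the NFD list two, matching the
breakdown above.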

The ICU project³ (International Components for Unicode) has a converter
similar to `iconv` called `uconv`, which also lets you specify a
transliterator to run over the input. So, to get rid of accents,
cedillas, tildes, etc., you can convert your text into Unicode NFD, then
convert it to ASCII and discard any characters not in ASCII (which
includes the combining accent marks).

Assuming the text is encoded in the same encoding as your current
locale and you're in a Unicode locale, you can pipe it through:

uconv -t ASCII -x nfd -c

-t ASCII = convert to ASCII (t = to/target)
-x nfd = use the NFD transliterator
-c = discard any characters that don't have equivalents in the target
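
The same decompose-then-discard idea can be sketched in Python with the
standard library. This is not what `uconv` does internally, only an
approximation of the pipeline above under the assumption that the input
is already a decoded string; the function name `asciify` is made up for
this example.

```python
import unicodedata

def asciify(text):
    """Decompose to NFD, then drop every codepoint outside 7-bit ASCII."""
    decomposed = unicodedata.normalize("NFD", text)
    # errors="ignore" silently discards anything ASCII cannot represent,
    # which includes the combining marks that NFD split off.
    return decomposed.encode("ascii", errors="ignore").decode("ascii")

print(asciify("coração"))
print(asciify("pensámos"))
```

Note that the second call yields the same spelling as "pensamos", which
is exactly the loss of information mentioned at the top. Also, a
character with no canonical decomposition (the ordinal indicators "ª"
and "º", for instance) is not reduced to a base letter by NFD, so this
sketch drops it entirely.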

If your source data is in a different encoding and/or you're not in a
Unicode locale (or just a differently-encoded locale), you might have to
be more explicit, e.g.:

uconv -f SOURCE-ENCODING -t ASCII -x nfd -c

(where SOURCE-ENCODING could be, e.g. ISO-8859-1 or ISO-8859-15 -- full
list from running `uconv -l`)
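
For comparison, a Python sketch of the explicit-source-encoding case:
decode the raw bytes first, then decompose and discard as before. The
function name `asciify_bytes` and the sample bytes are assumptions made
for this example, and Python's codec names are not guaranteed to match
the names `uconv -l` prints.

```python
import unicodedata

def asciify_bytes(raw, source_encoding):
    """Decode raw bytes, decompose to NFD, and keep only ASCII bytes."""
    text = raw.decode(source_encoding)
    decomposed = unicodedata.normalize("NFD", text)
    # Anything without an ASCII equivalent (combining marks included)
    # is dropped rather than raising an error.
    return decomposed.encode("ascii", errors="ignore")

# "ação" as ISO-8859-1 bytes: 0xE7 is "ç", 0xE3 is "ã".
latin1_bytes = b"a\xe7\xe3o"
print(asciify_bytes(latin1_bytes, "iso-8859-1"))
```

Decoding with the wrong source encoding will not necessarily fail; it
can just produce different characters, so the result is only as good as
the encoding you name.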

--
Best,
Ben

¹: http://en.wikipedia.org/wiki/Portuguese_phonology
²: http://unicode.org/reports/TR15/
³: http://icu-project.org

--
You received this message from the "vim_use" maillist.
Do not top-post! Type your reply below the text you are replying to.
For more information, visit http://www.vim.org/maillist.php
