The tl;dr version: pipe it through
uconv -t ASCII -x nfd -c
On Wed, 29 Aug 2012, Tim Chase wrote:
> I've got some Portuguese text that I need to perform some 
> transformations on to make them ASCII (7-bit).  That means removing 
> accent marks, cedillas, tildes, etc.
Just to cover my bases: this seems like a bad idea in general.  I don't 
know much about Portuguese, but one of the minimal pairs listed in the 
Wikipedia article for Portuguese phonology¹ is:
pensamos "we think"
vs.
pensámos "we thought"
Stripping the accent makes the two indistinguishable.
> Is there some fast transform in Vim that I've missed, or an easy way 
> to go about this?
In most contexts, Unicode strings are stored in Normalization Form C 
(NFC), which means they're equivalent to having passed through Canonical 
Decomposition followed by Canonical Composition.  This means that any 
sequence that has a precomposed ("combined") codepoint is stored as that 
single codepoint.
Strings stored in Normalization Form D (NFD) (== Canonical 
Decomposition) instead have each "combined" codepoint split into its 
base codepoint followed by its "combining character" codepoints.
As a practical example, the string "é" is:
  in NFC:
    U+00E9 LATIN SMALL LETTER E WITH ACUTE
  in NFD:
    U+0065 LATIN SMALL LETTER E
    U+0301 COMBINING ACUTE ACCENT
The Unicode Consortium has full details².
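To see the two forms for yourself, here is a minimal sketch using Python's standard `unicodedata` module (Python is my choice for illustration, and `describe` is just a hypothetical helper name).  It normalizes "é" both ways and lists the resulting codepoints:

```python
import unicodedata


def describe(text):
    """Return one 'U+XXXX NAME' string per codepoint in text."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch)}" for ch in text]


# Start from the decomposed sequence and compose it (NFC).
composed = unicodedata.normalize("NFC", "e\u0301")
# Start from the precomposed codepoint and decompose it (NFD).
decomposed = unicodedata.normalize("NFD", "\u00e9")

print(describe(composed))
print(describe(decomposed))
```

Both strings render as "é", but the NFC one is a single codepoint and the NFD one is two.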
The ICU project³ (International Components for Unicode) has a 
converter similar to `iconv` called `uconv`, which also lets you specify 
a transliterator to run over the input.  So, to get rid of accents, 
cedillas, tildes, etc., you can convert your text into Unicode NFD, then 
convert it to ASCII and discard any characters not in ASCII (which 
includes the combining accent marks).
Assuming the text is encoded in the same encoding as your current 
locale and you're in a Unicode locale, you can pipe it through:
uconv -t ASCII -x nfd -c
  -t ASCII = convert to ASCII (t = to/target)
  -x nfd   = use the NFD transliterator
  -c       = discard any characters that don't have equivalents in the target
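If you'd rather see the same idea spelled out in code, the following is a rough Python sketch of the decompose-then-discard approach.  It is not how `uconv` is implemented, only the same two steps done with the standard library, and `to_ascii` is a hypothetical name:

```python
import unicodedata


def to_ascii(text):
    """Decompose to NFD, then drop everything that is not 7-bit ASCII."""
    decomposed = unicodedata.normalize("NFD", text)
    # errors="ignore" silently discards the combining marks (and any
    # other non-ASCII character), much like -c does for uconv.
    return decomposed.encode("ascii", errors="ignore").decode("ascii")


print(to_ascii("pensámos, ação, coração"))
```

Note that a character with no canonical decomposition (e.g. "ø") has no ASCII base letter to fall back on, so in this sketch it is dropped entirely rather than replaced.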
If your source data is in a different encoding and/or you're not in a 
Unicode locale (or just a differently-encoded locale), you might have to 
be more explicit, e.g.:
uconv -f SOURCE-ENCODING -t ASCII -x nfd -c
(where SOURCE-ENCODING could be, e.g. ISO-8859-1 or ISO-8859-15 -- full 
list from running `uconv -l`)
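The explicit-source-encoding case can be sketched the same way in Python, assuming the input arrives as raw bytes.  The function name `bytes_to_ascii` and the sample data are made up for illustration; the encoding name is whatever your data actually uses:

```python
import unicodedata


def bytes_to_ascii(raw, source_encoding):
    """Decode raw bytes from source_encoding, then strip to ASCII via NFD."""
    text = raw.decode(source_encoding)
    decomposed = unicodedata.normalize("NFD", text)
    # Drop the combining marks and anything else outside ASCII.
    return decomposed.encode("ascii", errors="ignore")


# Sample input: "coração" encoded as ISO-8859-1 bytes.
latin1_bytes = "coração".encode("iso-8859-1")
print(bytes_to_ascii(latin1_bytes, "iso-8859-1"))
```

Getting the source encoding wrong will not usually raise an error with single-byte encodings such as ISO-8859-1; it just produces the wrong characters, so it is worth checking a sample of the output.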
-- 
Best,
Ben
¹: http://en.wikipedia.org/wiki/Portuguese_phonology
²: http://unicode.org/reports/TR15/
³: http://icu-project.org
-- 
You received this message from the "vim_use" maillist.