On 20/08/09 15:47, Brian Anderson wrote:
>
> I'm interested in learning how to use regular expressions in Vi(m) to
> search for Unicode code points.
>
> In a book about regexp, it describes how to search for Unicode code
> points by various means, and for various programming languages.
There are various kinds of regexps, for various languages (such as Perl)
and programs (such as egrep). Vim has its own. They are each subtly
different.
The only authoritative documentation about Vim is the help that comes
with Vim. It is extremely complete. The "chapter" about regular
expressions (usually named "patterns" in Vim parlance) is accessed by
	:help pattern.txt
IMHO the most useful part of that starts at
	:help pattern-overview
>
> The book describes searching for a specific Unicode code point as \u2122
> or \x{2122}.
>
> From what I've seen in the Vim help files, \u is to identify uppercase
> characters, not Unicode code points, and \x is for hexadecimal digits.
Yes and no. Outside a collection, \u does match an uppercase character
and \x a hexadecimal digit, but Vim has other ways to name a code point.
You can search for, let's say, a Cyrillic lowercase soft sign (U+044C)
with any of:
	/ь
(typing the soft sign directly, e.g. if you have a Russian keyboard or
use a keymap)
	/\%u044c
(see ":help /\%u")
	/[\u044c]
(see a little below ":help /\]"; usually only valid in 'nocompatible'
mode)
Or the same with ? instead of / to search backwards, or fь Fь tь or Tь
(with no <Enter> needed) to search only within the current line.
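If you have the character itself and want its \%u form, a few lines of
Python (run outside Vim) can produce it. This is only a sketch: the name
vim_codepoint_atom is made up for illustration, and it assumes \%u for
code points up to U+FFFF and \%U (up to eight hex digits) above that.

```python
def vim_codepoint_atom(ch):
    """Return a Vim pattern atom matching the single character ch.

    Hypothetical helper, not part of Vim or of any library.
    """
    if len(ch) != 1:
        raise ValueError("expected exactly one character")
    cp = ord(ch)
    # \%u takes up to four hex digits, \%U up to eight.
    if cp <= 0xFFFF:
        return "\\%u{:04x}".format(cp)
    return "\\%U{:08x}".format(cp)


print(vim_codepoint_atom("\u044c"))  # Cyrillic small letter soft sign
```

For the soft sign this prints \%u044c. Inside Vim, ga on the character
under the cursor shows the same code point.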
>
> The book also talks about  using Unicode property or categories in the
> search. The book indicates there are 30 Unicode categories, grouped into
> 7 super-categories.
> For example, \p{Ll} would find any lowercase letter that has an
> uppercase variant, and \p{Lo} any letter or ideograph that does not have
> lowercase and uppercase variants.
>
> Unicode blocks are defined as \p{IsGreekExtended}. Blocks consist of a
> single range of code points. Example: searching for any code point
> between U+0000...U+007F can be found with \p{InBasicLatin}.
>
> Unicode script is \p{Greek}. Each Unicode code point is part of only one
> Unicode script. So if I wanted to search for any Greek letter, I'd use
> \p{Greek}.
>
> Unicode grapheme is \X or \P{M}. This would be either single codepoints
> (U+00E0 Latin small letter a with grave accent) or combined codepoints
> (U+0061 Latin small letter a + U+0300 combining grave accent).
>
> Help on any of these, either in examples or where to look in the help
> files, welcome.
None of these apply to Vim (where \p matches a printable character and
\X a non-hexadecimal digit). To search for anything between U+0000 and
U+007F inclusive, you could use
	/[\x00-\x7F]
though (with Vim) it might be more prudent to use
	/[\x00\x01-\x7F]
since a NUL byte in the file is usually represented in memory by 0x0A
(see ":help NL-used-for-Nul") to avoid having it terminate the C string
representing the line. Vim cannot search for collections of more than
256 (or is it 257?) consecutive byte values, however, so this construct
is useless if you want to search for "any CJK hanzi / kanji / hanja".
To search for either U+00E0 or U+0061 U+0300, use
	/\%xE0\|\%x61\%u0300
or even
	/à\|a\%u0300
where \| means "or" (so this tells Vim explicitly to search for _either_ 
a precomposed a-grave _or_ a small-a followed by a combining-grave). If 
in the middle of a longer regexp, you may wish to wrap that within \%( 
and \) to avoid having the \| apply to what comes before or after the 
a-grave.
Or you may wish to run a "normalization pass" first, such as
	:%s/a\%u0300/à/g
but it might be unwieldy if there are many different such possible 
characters to normalize.
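When there are many such characters, one alternative is to normalize
the text before opening it in Vim. The sketch below is not a Vim
feature. It assumes Python 3 and its standard unicodedata module, and
uses Unicode normalization form NFC, which composes base-plus-combining
sequences into precomposed characters wherever Unicode defines one. The
function name to_nfc is made up for illustration.

```python
import unicodedata


def to_nfc(text):
    """Compose base + combining sequences into precomposed characters."""
    return unicodedata.normalize("NFC", text)


decomposed = "voila\u0300"  # "a" followed by U+0300 COMBINING GRAVE ACCENT
composed = to_nfc(decomposed)

# The decomposed string is one code point longer than the composed one.
print(len(decomposed), len(composed))
print(composed == "voil\u00e0")
```

This handles every composable pair in one pass, with no per-character
:s command needed.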
The only pattern atom specifically relevant for combining characters is
\Z, which tells Vim to ignore any combining characters found in the
text. This might be useful for Arabic or Hebrew text, where short vowels
(represented by combining characters) are optional. It would be less
useful for Latin, where "combined" a-acute, a-grave, a-circumflex,
a-umlaut, a-ring, a-ogonek, a-macron and a-breve would all be lumped
together with plain a, but "precomposed" á à â ä å ą ā ă would each be
regarded as different (from each other and from plain a).
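The difference between "combined" and "precomposed" forms can be seen
outside Vim with a rough Python model. This is not how Vim implements
\Z. It only mimics the effect described above by dropping combining
characters, and the function name drop_combining is made up for
illustration.

```python
import unicodedata


def drop_combining(text):
    """Remove combining characters, leaving everything else untouched."""
    return "".join(c for c in text if unicodedata.combining(c) == 0)


combined = "a\u0301"    # "a" followed by U+0301 COMBINING ACUTE ACCENT
precomposed = "\u00e1"  # U+00E1, a single code point

print(drop_combining(combined) == "a")     # lumped together with plain a
print(drop_combining(precomposed) == "a")  # stays a distinct character
```

The first comparison is true and the second is false, which is why
ignoring combining characters does nothing for precomposed text.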
>
> Thanks.
>
> Brian
Best regards,
Tony.
-- 
You buttered your bread, now lie in it.
--~--~---------~--~----~------------~-------~--~----~
You received this message from the "vim_use" maillist.
For more information, visit http://www.vim.org/maillist.php
-~----------~----~----~----~------~----~------~--~---
 