Thursday, August 20, 2009

using regexp to search for Unicode code points and properties

I'm interested in learning how to use regular expressions in Vi(m) to
search for Unicode code points.

A book about regular expressions that I'm reading describes how to
search for Unicode code points by various means, in various programming
languages.

The book describes searching for a specific Unicode code point as \u2122
or \x{2122}.
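
For comparison outside Vim, here is a minimal sketch of the \u2122 form
in Python. It assumes Python 3 and its standard re module, which, as far
as I know, accepts \uXXXX in a pattern but not the \x{2122} form. The
sample string and variable names are made up for illustration.

```python
import re

# Sample text containing U+2122 TRADE MARK SIGN after the word "Vim".
text = "Vim\u2122 is an editor"

# A raw string hands the backslash sequence to the regex engine itself,
# which interprets \u2122 as the single code point U+2122.
pattern = re.compile(r"\u2122")

match = pattern.search(text)
found_at = match.start() if match else -1
found_char = match.group(0) if match else ""

print(found_at, found_char)
```

Other flavors spell the same thing differently, which is why I'm asking
what the Vim equivalent is.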

From what I've seen in the Vim help files, \u matches uppercase
characters, not Unicode code points, and \x matches hexadecimal digits.

The book also talks about using Unicode properties or categories in the
search. The book indicates there are 30 Unicode categories, grouped into
7 super-categories.
For example, \p{Ll} would find any lowercase letter that has an
uppercase variant, and \p{Lo} any letter or ideograph that does not have
lowercase and uppercase variants.
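
To make the category idea concrete, here is a rough sketch in Python. As
far as I know the standard re module has no \p{...} support, so this
assumes the standard unicodedata module instead and filters characters
by their general category. The helper name chars_in_category and the
sample string are hypothetical.

```python
import unicodedata


def chars_in_category(text, category):
    """Return the characters of text whose general category equals category."""
    return [ch for ch in text if unicodedata.category(ch) == category]


# 'a' is a lowercase letter, 'B' an uppercase letter, U+4E2D a CJK
# ideograph (a letter without case), and '1' a decimal digit.
sample = "aB\u4e2d1"

lowercase = chars_in_category(sample, "Ll")      # what \p{Ll} would match
other_letters = chars_in_category(sample, "Lo")  # what \p{Lo} would match

print(lowercase, other_letters)
```

This only emulates what the book's \p{Ll} and \p{Lo} describe; it is not
a regex feature in itself.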

Unicode blocks are written as \p{IsGreekExtended} or, with an In prefix,
as \p{InBasicLatin}. Each block consists of a single range of code
points. Example: \p{InBasicLatin} matches any code point between U+0000
and U+007F.
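
Because a block is one contiguous range, it can be emulated with an
explicit character class where \p{In...} is not available. A small
sketch, again assuming Python's re module; the variable names and sample
string are only for illustration.

```python
import re

# Explicit range standing in for the Basic Latin block, U+0000 to U+007F.
basic_latin = re.compile(r"[\u0000-\u007F]")

# 'a' and 'b' fall inside the block; U+00E9 (e with acute) does not.
sample = "a\u00e9b"

inside_block = basic_latin.findall(sample)

print(inside_block)
```

The range endpoints are the ones quoted above for Basic Latin; any other
block would need its own endpoints looked up.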

A Unicode script is written as \p{Greek}. Each Unicode code point is
part of only one Unicode script. So if I wanted to search for any Greek
letter, I'd use \p{Greek}.

A Unicode grapheme is matched with \X or \P{M}\p{M}*. This would be
either a single code point (U+00E0 Latin small letter a with grave
accent) or combined code points (U+0061 Latin small letter a + U+0300
combining grave accent).
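
The grapheme idea (a non-mark code point followed by any combining
marks) can be sketched without regex support. This assumes Python's
standard unicodedata module; split_graphemes is a hypothetical helper,
and it is only an approximation, not the full grapheme cluster rules
that \X implements in flavors that have it.

```python
import unicodedata


def split_graphemes(text):
    """Group each code point with the combining marks that follow it."""
    clusters = []
    for ch in text:
        # General categories Mn, Mc and Me all start with "M" (marks).
        is_mark = unicodedata.category(ch).startswith("M")
        if is_mark and clusters:
            clusters[-1] += ch   # attach the mark to the preceding base
        else:
            clusters.append(ch)  # start a new cluster
    return clusters


precomposed = "\u00e0"   # single code point: a with grave accent
decomposed = "a\u0300"   # base letter a + combining grave accent

print(len(split_graphemes(precomposed)), len(split_graphemes(decomposed)))
```

Both spellings come out as one cluster, which is the behavior I would
like a Vim search to have.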

Any help with these, either examples or pointers to where to look in the
help files, is welcome.

Thanks.

Brian

--~--~---------~--~----~------------~-------~--~----~
You received this message from the "vim_use" maillist.
For more information, visit http://www.vim.org/maillist.php
-~----------~----~----~----~------~----~------~--~---
