Monday, May 28, 2012

Re: how to match all Chinese chars?

On 28/05/12 08:10, Chris Jones wrote:
> On Sun, May 27, 2012 at 09:25:30PM EDT, William Fugy wrote:
>> On Mon, May 28, 2012 at 10:15 AM, Xell Liu <xell.liu@gmail.com> wrote:
>
> [..]
>
>>>> Unless I missed something, and if you absolutely need to do this,
>>>> you could bypass the limitation by breaking up the range like so:
>>>>
>>>> | :g/[一-仿伀-俿倀-儀 ... 鼀-龻]/
>>>>
>>> Good one! I'll give it a try. But so many characters...
>
> Depends how much one needs a regex that works for all cases or if
> something more relaxed can do the job at hand. I was also thinking that
> depending on the particular use case it might be possible to have
> a script create the regex and initialize a variable/register and use its
> contents in interactive commands to simulate a [:CJK:] character class
> more conveniently.
>
>>>> This corresponds to ranges:
>>>>
>>>> | \u4e00-\u4eff
>>>> | \u4f00-\u4fff
>>>> | \u5000-\u50ff
>>>> | ..
>>>> | \u9f00-\u9fbb¹
>>>>
>>>> Trouble is, this is going to add up to something like 80+ subranges and
>>>> may cause you to run into other limitations. I haven't tested the whole
>>>> range, only the above (it works here), but if nobody comes up with
>>>> a better idea, and you choose to go down this path, I would suggest
>>>> generating the regex programmatically.
>>>>
>>>
>> Thank you. Apparently it just has to be done this way. For now I'm
>> dealing with this problem in Perl. I hope Vim can accomplish it too.
>
> I don't use Perl but I would have expected it to provide native support
> for Unicode blocks. In this instance that would be
> '\p{InCJK_Unified_Ideographs}', which corresponds precisely to
> U+4E00...U+9FFF.
>
> See this:
>
> http://www.regular-expressions.info/unicode.html
>
>>>> ¹ I think \u4e00-\u9fbb is the correct CJK range
>>>>
>>> Yes, it's accurate.
>
> Sorry.. in fact, 'correct' was the wrong word.. I really meant something
> like 'effectively assigned'.. \u9fbb-\u9fff do belong to the Unicode
> range but afaict no characters have been assigned, which makes it
> impossible to refer to them by character.. only by code point.
>
> CJ
>

There are additional "rare" CJK characters outside the BMP (in plane 2),
and there are other CJK "wide" characters elsewhere in the BMP (e.g.
fullwidth space, U+3000). For details, see "East Asian Scripts" in the
rightmost column of http://www.unicode.org/charts/ — hovering your mouse
over a link will display the codepoint range in a tooltip.
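
To make the point about planes concrete, here is a minimal Python sketch.
It is only code point arithmetic, nothing Vim provides, and the helper
names plane and in_main_cjk_block are made up for illustration. It shows
that U+3000 sits in the BMP but outside 4E00..9FFF, and that U+20000 is
in plane 2.

```python
def plane(ch):
    """Return the Unicode plane number of a single character (0 = BMP)."""
    return ord(ch) >> 16


def in_main_cjk_block(ch):
    """True if ch lies in U+4E00..U+9FFF, the range discussed in this thread."""
    return 0x4E00 <= ord(ch) <= 0x9FFF


# Sample characters: first ideograph of the range, the U+3000 wide space
# mentioned above, and the first code point of plane 2.
samples = ["\u4e00", "\u3000", "\U00020000"]

for ch in samples:
    print("U+%04X  plane=%d  in 4E00..9FFF=%s"
          % (ord(ch), plane(ch), in_main_cjk_block(ch)))
```

So a pattern limited to 4E00..9FFF would miss both of the last two
samples, whatever tool ends up doing the matching.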

However, there is also a limitation in Vim, namely, a collection can
only match (IIRC) at most 257 different individual characters at the
same point. 4E00..9FFF alone is already much more than that.
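
If you do want to generate the broken-up collection rather than type it,
something along these lines might do. This is an untested-in-Vim sketch in
Python, not a recommended solution: the function name vim_collection and
the chunk size of 256 are my own assumptions, chosen to stay under the
limit mentioned above, and I have not checked whether Vim accepts a
collection with this many subranges.

```python
def vim_collection(first=0x4E00, last=0x9FFF, chunk=256):
    """Build a Vim collection made of sub-ranges of at most `chunk` code points.

    The result looks like [\\u4e00-\\u4eff\\u4f00-\\u4fff...], using the
    \\uXXXX notation quoted earlier in this thread.
    """
    parts = []
    start = first
    while start <= last:
        # Clamp the final sub-range so it never runs past `last`.
        end = min(start + chunk - 1, last)
        parts.append("\\u%04x-\\u%04x" % (start, end))
        start = end + 1
    return "[" + "".join(parts) + "]"


pattern = vim_collection()
print(len(pattern), "characters of pattern text")
print(pattern[:40] + " ...")
```

The output could then be pasted into a :g/ command or stored in a
register, as suggested in the quoted text. Pass last=0x9FBB to stop at
the end point used in the footnote instead of the end of the block.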


Best regards,
Tony.
--
Genderplex, n.:
The predicament of a person in a restaurant who is unable to
determine his or her designated restroom (e.g., turtles and
tortoises).
-- Rich Hall, "Sniglets"
