Tuesday, February 24, 2015

Re: Word boundary would not work when using some weird Unicode chars with the 'contained' syntax

On Sunday, February 8, 2015 at 5:16:21 AM UTC+8, Jacky Liu wrote:
> Here is the VimL code I wrote:
>
> " Use some weird Unicode chars to mark the region; '+' is included for contrast.
> syntax region myCmdLine matchgroup=myCmdLine_ start=/[⣱+]/ end=/[⡇⡗⡧+]/
> hi link myCmdLine _LightGreen_233b5a
> hi link myCmdLine_ Normal
>
> syntax keyword myCmdName man bind less containedin=myCmdLine contained
> hi link myCmdName _Green_233b5a
>
> And here is its effect on some simple demonstration text (see attached image file).
>
> With '+' as the marker, all three syntax keywords were correctly recognized, but not with the unusual Unicode chars.
>
> Also, using '*' for a quick search works normally, as does the following search command:
>
> /\<man\|bind\|less\>
>
> Neither the 'iskeyword' nor the 'regexpengine' option seems to have any effect here.
>
> Should this be considered a bug?



Update:

I've found a solution. It requires a small change to the Vim source, but it solves the problem with no apparent side effects.

The method is to change the classification of the characters in question by editing vim74/src/mbyte.c:

/*
 * Get class of a Unicode character.
 * 0: white space
 * 1: punctuation
 * 2 or bigger: some class of word character.
 */
    int
utf_class(c)
    int c;
{
    /* sorted list of non-overlapping intervals */
    static struct clinterval
    {
        unsigned short first;
        unsigned short last;
        unsigned short class;
    } classes[] =
    {
        {0x037e, 0x037e, 1}, /* Greek question mark */
        {0x0387, 0x0387, 1}, /* Greek ano teleia */
        {0x055a, 0x055f, 1}, /* Armenian punctuation */
        {0x0589, 0x0589, 1}, /* Armenian full stop */
        {0x05be, 0x05be, 1},
        {0x05c0, 0x05c0, 1},
        ... ...


The list above in mbyte.c defines ranges of code points in the Unicode table and the class assigned to each range. Changing the last value of an entry to 1 makes that range punctuation. After recompiling and installing, word boundaries are recognized wherever those characters appear.
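To illustrate how a table of this shape can drive the classification, here is a minimal standalone sketch. It is not the code from mbyte.c: the names demo_classes, demo_class and demo_is_boundary are hypothetical, the table is shortened, the U+2800..U+28FF (Braille Patterns) row is an assumed example of an entry forced to class 1, and every character outside the table is simply treated as class 2.

```c
#include <assert.h>
#include <stddef.h>

/* One row of a sorted, non-overlapping interval table, shaped like the
 * excerpt above. The third field is called `class` there; it is renamed
 * `cls` in this sketch. */
struct demo_clinterval {
    unsigned short first;
    unsigned short last;
    unsigned short cls;
};

/* Hypothetical, shortened table. NOT the real contents of mbyte.c. */
static const struct demo_clinterval demo_classes[] = {
    {0x037e, 0x037e, 1},    /* Greek question mark */
    {0x0387, 0x0387, 1},    /* Greek ano teleia */
    {0x055a, 0x055f, 1},    /* Armenian punctuation */
    {0x2800, 0x28ff, 1},    /* assumption: Braille range set to punctuation */
};

/* Binary search over the table. Simplification: any character that is
 * not covered by an interval is reported as class 2 (word character). */
static int demo_class(int c)
{
    size_t lo = 0;
    size_t hi = sizeof(demo_classes) / sizeof(demo_classes[0]);

    assert(c >= 0);
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;

        if (c < demo_classes[mid].first)
            hi = mid;
        else if (c > demo_classes[mid].last)
            lo = mid + 1;
        else
            return demo_classes[mid].cls;
    }
    return 2;
}

/* Two adjacent characters of different class form a word boundary. */
static int demo_is_boundary(int prev, int next)
{
    return demo_class(prev) != demo_class(next);
}
```

With the assumed Braille row in place, demo_is_boundary(0x28f1, 'm') is true, which corresponds to the situation where a keyword such as "man" directly follows the marker character.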

There's another data structure in the same file which specifies the display width of characters:

/*
 * For UTF-8 character "c" return 2 for a double-width character, 1 for others.
 * Returns 4 or 6 for an unprintable character.
 * Is only correct for characters >= 0x80.
 * When p_ambw is "double", return 2 for a character with East Asian Width
 * class 'A'(mbiguous).
 */
    int
utf_char2cells(c)
    int c;
{
    /* Sorted list of non-overlapping intervals of East Asian double width
     * characters, generated with ../runtime/tools/unicode.vim. */
    static struct interval doublewidth[] =
    {
        {0x1100, 0x115f},
        {0x11a3, 0x11a7},
        {0x11fa, 0x11ff},
        {0x2329, 0x232a},
        {0x2e80, 0x2e99},
        {0x2e9b, 0x2ef3},
        ... ...

Characters in this list are drawn double width. As the comment on the function says, characters of East Asian Width class 'A' (ambiguous) are also drawn double width when the 'ambiwidth' option is set to "double".

The Unicode table is so large that no single classification of characters can suit everybody, so I think local modifications like the above are sometimes inevitable.
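The width lookup can be sketched the same way. This is only an illustration under stated assumptions: demo_interval, demo_doublewidth, demo_intable and demo_char2cells are hypothetical names, the table holds only the six rows visible in the excerpt, and the handling of unprintable and ambiguous-width characters is left out.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed layout of one interval: inclusive first and last code point. */
struct demo_interval {
    long first;
    long last;
};

/* Only the rows shown in the excerpt above; the real list is longer. */
static const struct demo_interval demo_doublewidth[] = {
    {0x1100, 0x115f},
    {0x11a3, 0x11a7},
    {0x11fa, 0x11ff},
    {0x2329, 0x232a},
    {0x2e80, 0x2e99},
    {0x2e9b, 0x2ef3},
};

/* Return 1 if c lies inside one of the n sorted intervals, else 0. */
static int demo_intable(const struct demo_interval *table, size_t n, long c)
{
    size_t lo = 0;
    size_t hi = n;

    assert(table != NULL);
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;

        if (c < table[mid].first)
            hi = mid;
        else if (c > table[mid].last)
            lo = mid + 1;
        else
            return 1;
    }
    return 0;
}

/* Simplified width: 2 cells for a listed character, 1 for anything else. */
static int demo_char2cells(long c)
{
    size_t n = sizeof(demo_doublewidth) / sizeof(demo_doublewidth[0]);

    return demo_intable(demo_doublewidth, n, c) ? 2 : 1;
}
```

Adding or removing a row in such a table, while keeping it sorted and non-overlapping, is all it takes to change the width reported for a range of characters.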

Thank you ~
