Thursday, December 15, 2011

Re: Find non-printing characters

On 15/12/11 22:15, Graham Lawrence wrote:
> How can I find non-printing characters in a text? I do not know which
> specific characters I'm looking for, only that two different such
> exist. I have tried /Ctrl+V Ctrl+A thru Z to no avail. Others that I
> found visually appeared in vim as ~V ~W etc, but /~ would not go to any
> of them so the tilde must designate tokens for something else. As the
> text was derived from html, I suspect what I'm looking for are those
> curly opening and closing double-quotes.
>
> --
> You received this message from the "vim_use" maillist.
> Do not top-post! Type your reply below the text you are replying to.
> For more information, visit http://www.vim.org/maillist.php

For Latin1, the nonprinting characters are 0x00 to 0x1F (Ctrl-@ to
Ctrl-_) and 0x7F to 0x9F (Ctrl-? to Ctrl-Alt-_). The following mappings
ought to find them (assuming 'magic' and 'nocompatible'):

:map <F4> /[<Bslash>x00-<Bslash>x1F<Bslash>x7F-<Bslash>x9F]<CR>
:map <S-F4> ?[<Bslash>x00-<Bslash>x1F<Bslash>x7F-<Bslash>x9F]<CR>

Note: this considers the space (0x20), the no-break space (0xA0) and the
soft hyphen (0xAD) as "printing", and the tab (0x09), carriage return
(0x0D) and form feed (0x0C) as "nonprinting"; it also does not regard the
end-of-line character (0x0A under Unix, 0x0D followed by 0x0A under
Windows, 0x0D under Mac OS 9 or earlier) as part of the line. If your
assumptions are different, a more or less trivial modification of the
above mappings should suit you.
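
If you want to cross-check a file outside Vim, the same byte ranges can be sketched in Python. This is only an illustration under the assumptions above, plus one more of my own: Unix line ends, so only 0x0A is treated as end-of-line. The name find_nonprinting is made up for the example.

```python
def find_nonprinting(data: bytes):
    """Return (line, column, byte) triples for every nonprinting Latin1 byte.

    Nonprinting here means 0x00-0x1F and 0x7F-0x9F, so tab, CR and FF count,
    while space, no-break space (0xA0) and soft hyphen (0xAD) do not.
    """
    hits = []
    # Assumption: Unix line ends. Splitting on 0x0A keeps the end-of-line
    # byte out of the lines, so it is never reported.
    for lineno, line in enumerate(data.split(b"\n"), start=1):
        for col, byte in enumerate(line, start=1):
            if byte <= 0x1F or 0x7F <= byte <= 0x9F:
                hits.append((lineno, col, byte))
    return hits
```

Feeding it the raw bytes of the file (read with open(name, "rb")) gives 1-based line and column numbers that you can jump to in Vim.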

For UTF-8 it's harder since there is a limit (257 or 258 I think) to the
number of different characters that a collection can match, and OTOH
there are non-printing characters all over the Unicode range, especially
if you include "noncharacters", "invalid codepoints", unpaired
surrogates (or any surrogates, even paired, if found in anything other
than UTF-16BE or UTF-16LE) and "private-use" codepoints.
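
Outside Vim, one way around the collection limit is to classify by Unicode general category instead of listing ranges. The following Python sketch does that; which categories count as "non-printing" is my assumption, not something Vim defines, and find_nonprinting_unicode is a made-up name.

```python
import unicodedata

# General categories treated as non-printing in this sketch (an assumption):
# Cc control, Cf format, Cs surrogate, Co private use, Cn unassigned
# (Cn also covers the noncharacters such as U+FFFE).
NONPRINTING = {"Cc", "Cf", "Cs", "Co", "Cn"}


def find_nonprinting_unicode(text: str):
    """Return (index, 'U+XXXX', category) for each non-printing codepoint."""
    return [
        (i, f"U+{ord(ch):04X}", unicodedata.category(ch))
        for i, ch in enumerate(text)
        if unicodedata.category(ch) in NONPRINTING
    ]
```

Two caveats: the soft hyphen (U+00AD) is in category Cf, so unlike the Latin1 mappings this sketch reports it; and the curly double quotes (U+201C, U+201D) are punctuation, so they are not reported at all.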

To find _only_ invalid UTF-8 bytes (in Latin1 text), use 8g8 in Normal mode.
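
For comparison, a rough equivalent outside Vim is to let a strict UTF-8 decoder report where it gives up. This Python sketch is not how Vim implements 8g8, and Python's idea of "invalid" may differ from Vim's in edge cases; first_invalid_utf8 is a made-up name.

```python
def first_invalid_utf8(data: bytes):
    """Return the offset of the first byte that breaks UTF-8 decoding, or None."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as err:
        # err.start is the 0-based offset where the bad sequence begins.
        return err.start
    return None
```

The offset is 0-based and counts bytes, not characters.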

To find the value of the character under the cursor (as a printable
character if it is one, and in decimal, octal and hex), use ga in Normal mode.

The representations ^A, ~B and |C (usually in blue), which Vim uses for
characters not included in 'isprint', mean Ctrl-A, Ctrl-Alt-B and Alt-C
respectively. See the option's help for details.
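
To go from what Vim shows to the actual byte value, the display rule can be written out as a small Python function. This follows the table in the 'isprint' help as I understand it, for an 8-bit 'encoding' such as Latin1, and applies only to bytes that are not in 'isprint'; vim_notation is a made-up name.

```python
def vim_notation(byte: int) -> str:
    """Show how Vim would display a byte that is not in 'isprint' (8-bit encoding)."""
    if byte < 0x20:
        return "^" + chr(byte + 0x40)   # 0x00-0x1F: ^@ to ^_
    if byte < 0x7F:
        return chr(byte)                # 0x20-0x7E: always shown as themselves
    if byte == 0x7F:
        return "^?"
    if byte < 0xA0:
        return "~" + chr(byte - 0x40)   # 0x80-0x9F: ~@ to ~_
    if byte < 0xFF:
        return "|" + chr(byte - 0x80)   # 0xA0-0xFE: "| " to |~
    return "~?"                         # 0xFF
```

By this rule the ~V and ~W from the original question are the bytes 0x96 and 0x97, which fall in the 0x7F to 0x9F range discussed earlier.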

See
:help /[]
:help /\]
:help map_backslash
:help 8g8
:help ga
:help 'isprint'
http://www.unicode.org/charts/
and in particular
http://www.unicode.org/charts/PDF/U0000.pdf
http://www.unicode.org/charts/PDF/U0080.pdf

(about the latter two, note that Unicode codepoints U+0000 to U+00FF are
the 256 characters of Latin1 in the same order).

Best regards,
Tony.
--
Conscience is a mother-in-law whose visit never ends.
-- H. L. Mencken
