Friday, December 16, 2011

Re: Find non-printing characters



On Thu, Dec 15, 2011 at 4:42 PM, Tony Mechelynck <antoine.mechelynck@gmail.com> wrote:
On 15/12/11 22:15, Graham Lawrence wrote:
How can I find non-printing characters in a text?  I do not know which
specific characters I'm looking for, only that two different such
exist.  I have tried /Ctrl+V Ctrl+A thru Z to no avail.  Others that I
found visually appeared in vim as ~V ~W etc, but /~ would not go to any
of them so the tilde must designate tokens for something else.  As the
text was derived from html, I suspect what I'm looking for are those
curly opening and closing double-quotes.

--
You received this message from the "vim_use" maillist.
Do not top-post! Type your reply below the text you are replying to.
For more information, visit http://www.vim.org/maillist.php

For Latin1, the nonprinting characters are 0x00 to 0x1F (Ctrl-@ to Ctrl-_) and 0x7F to 0x9F (Ctrl-? to Ctrl-Alt-_). The following mappings ought to find them (assuming 'magic' and 'nocompatible'):

 :map <F4> /[<Bslash>x00-<Bslash>x1F<Bslash>x7F-<Bslash>x9F]<CR>
 :map <S-F4> ?[<Bslash>x00-<Bslash>x1F<Bslash>x7F-<Bslash>x9F]<CR>

Note: this considers the space (0x20), the no-break space (0xA0) and the soft hyphen (0xAD) as "printing", and the tab (0x09), carriage return (0x0D) and form feed (0x0C) as "nonprinting"; it also does not regard the end-of-line character (0x0A under Unix, 0x0D followed by 0x0A under Windows, 0x0D under Mac OS 9 or earlier) as part of the line. If your assumptions are different, a more or less trivial modification of the above mappings should suit you.
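
As one possible modification of that kind, the sketch below treats the tab (0x09) as "printing" by leaving it out of the first range; everything else matches the Latin1 control ranges 0x00 to 0x1F and 0x7F to 0x9F. The keys <F5> and <S-F5> are arbitrary choices for illustration, not part of the original mappings, and the same assumptions ('magic' and 'nocompatible') apply.

```vim
" Sketch only: like the control-character search, but skipping tab (0x09).
" <F5>/<S-F5> are hypothetical key choices; pick any free keys.
:map <F5> /[<Bslash>x00-<Bslash>x08<Bslash>x0A-<Bslash>x1F<Bslash>x7F-<Bslash>x9F]<CR>
:map <S-F5> ?[<Bslash>x00-<Bslash>x08<Bslash>x0A-<Bslash>x1F<Bslash>x7F-<Bslash>x9F]<CR>
```

Other characters can be excluded the same way, by splitting a range around the byte you want to keep.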

For UTF-8 it's harder since there is a limit (257 or 258 I think) to the number of different characters that a collection can match, and OTOH there are non-printing characters all over the Unicode range, especially if you include "noncharacters", "invalid codepoints", unpaired surrogates (or any surrogates, even paired, if found in other than UTF-16 be or le) and "private-use" codepoints.
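
When the suspects are known, or when anything outside plain ASCII is suspicious, a small collection stays well under that limit. The following is a rough sketch, assuming 'encoding' is utf-8 and 'magic' is set; the keys <F6> and <F7> are arbitrary, and the choice of U+2018, U+2019, U+201C and U+201D (the curly single and double quotes) is only a guess at what text derived from HTML might contain.

```vim
" Sketch only: jump to the next curly quote (U+2018, U+2019, U+201C, U+201D).
" <F6> is a hypothetical key choice.
:map <F6> /[<Bslash>u2018<Bslash>u2019<Bslash>u201C<Bslash>u201D]<CR>

" Sketch only: jump to the next character that is not 7-bit ASCII.
" <F7> is a hypothetical key choice.
:map <F7> /[^<Bslash>x00-<Bslash>x7F]<CR>
```

Once the cursor lands on a match, ga shows which character it actually is.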

To find _only_ invalid UTF-8 bytes (in UTF-8 text), use 8g8 in Normal mode.

To find the value of the character under the cursor (as a printable character if it is one, and in decimal, octal and hex), use ga

The representation ^A ~B |C (usually in blue), which Vim uses for characters that are not part of 'isprint', means Ctrl-A, Ctrl-Alt-B, Alt-C. See the option's help for details.

see
       :help /[]
       :help /\]
       :help map_backslash
       :help 8g8
       :help ga
       :help 'isprint'
       http://www.unicode.org/charts/
and in particular
       http://www.unicode.org/charts/PDF/U0000.pdf
       http://www.unicode.org/charts/PDF/U0080.pdf

(about the latter two, note that Unicode codepoints U+0000 to U+00FF are the 256 characters of Latin1 in the same order).

Best regards,
Tony.
--
Conscience is a mother-in-law whose visit never ends.
               -- H. L. Mencken

Many thanks, just what I needed.

Graham

