Thursday, September 11, 2014

Re: sorted file takes much longer to load

On 11 September 2014, John Little <John.B.Little@gmail.com> wrote:
> On Friday, September 12, 2014 2:55:17 AM UTC+12, Ben Fritz wrote:
>
> > I have an idea:
> >
> > If the unsorted file has "bad" characters early in the file, then
> > the early encodings in 'fileencodings' will fail quickly.
> >
> > But if the sorted file places those bad characters late in the file,
> > then the conversion may need to read most of the file before it
> > fails, repeated for possibly multiple encodings.
>
> Yes, something like this is happening. After
> :g/[^ -~]/move 1
>
> The file then loads quickly. If those 13 lines are moved to the end
> of the file the file takes nearly 3 minutes to load.

Here's a simple experiment that shows that this is indeed what's
going on.

In what follows ascii.txt is a 35M file of purely ASCII text:

$ LC_ALL=C pcregrep '[^[:print:]]' ascii.txt

$ ls -hs ascii.txt
35M ascii.txt

Then we add the character \x83 at the beginning of one copy and at the end of another:

$ perl -e 'print "\x83\n"' | cat - ascii.txt >test1.txt
$ perl -e 'print "\x83\n"' | cat ascii.txt - >test2.txt

Opening the first file is fast, and opening the second one is slow:

$ time LC_CTYPE=en_US.UTF-8 vim -u NONE -i NONE -N -X test1.txt -c q
real 0m0.273s
user 0m0.247s
sys 0m0.025s

$ time LC_CTYPE=en_US.UTF-8 vim -u NONE -i NONE -N -X test2.txt -c q
real 0m1.296s
user 0m1.256s
sys 0m0.042s
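
To see why the position matters, here is a rough sketch (not part of the timings above) that reports where the first non-ASCII byte sits in small stand-ins for test1.txt and test2.txt. The helper name first_non_ascii and the file names head.txt and tail.txt are made up for illustration. The assumption is that a UTF-8 validity check cannot fail before it reaches that byte, so the offset approximates how much of the file has to be scanned before utf-8 is rejected:

```shell
#!/bin/sh
# Hypothetical helper: print the 0-based offset of the first byte above 0x7F
# in the given file, or -1 if the file is pure ASCII.
first_non_ascii() {
    # od dumps every byte as an unsigned decimal; awk counts bytes until one
    # exceeds 127.
    od -An -v -tu1 "$1" | awk '
        BEGIN { n = 0; found = 0 }
        {
            for (i = 1; i <= NF; i++) {
                if ($i > 127) { print n; found = 1; exit }
                n++
            }
        }
        END { if (!found) print "-1" }'
}

# Small stand-ins for test1.txt (bad byte first) and test2.txt (bad byte last).
tmp=$(mktemp -d)
trap 'rm -rf "$tmp"' EXIT
{ printf '\203\n'; printf 'plain ascii line\n'; } > "$tmp/head.txt"
{ printf 'plain ascii line\n'; printf '\203\n'; } > "$tmp/tail.txt"

first_non_ascii "$tmp/head.txt"   # prints 0
first_non_ascii "$tmp/tail.txt"   # prints 17
```

On the real 35M files the second offset would be close to the file size, which fits with test2.txt being scanned almost completely before the fallback to the next encoding.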

But setting LC_CTYPE to C makes opening both files a lot faster:

$ time LC_CTYPE=C vim -u NONE -i NONE -N -X test1.txt -c q
real 0m0.109s
user 0m0.084s
sys 0m0.024s

$ time LC_CTYPE=C vim -u NONE -i NONE -N -X test2.txt -c q
real 0m0.111s
user 0m0.098s
sys 0m0.013s

The difference is in the default value of 'fileencodings':

$ LC_CTYPE=en_US.UTF-8 vim -u NONE -i NONE -N -X -c 'redir >out1 | echo &fencs | q'
$ cat out1
ucs-bom,utf-8,default,latin1

$ LC_CTYPE=C vim -u NONE -i NONE -N -X -c 'redir >out2 | echo &fencs | q'
$ cat out2
ucs-bom

And indeed, setting fileencodings to ucs-bom makes reading test2.txt
fast:

$ time LC_CTYPE=en_US.UTF-8 vim -u NONE -i NONE -N -X -c 'set fencs=ucs-bom | e test2.txt | q'
real 0m0.119s
user 0m0.106s
sys 0m0.013s

> However, using
>
> vim -u NONE ++enc=latin1 file.txt

That's because:

E492: Not an editor command: +enc=latin1

> or
>
> vim -u NONE -c "set fencs=latin1" file.txt

That's because "-c" commands are run after the file has been loaded:

$ vim -h | fgrep -w -- -c
-c <command> Execute <command> after loading the first file
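
If the option has to be set from the command line, "--cmd" is the counterpart that "vim -h" documents as running before any vimrc file is loaded, so it should also run before the first file is read. The following is only a sketch: the wrapper name open_with_fencs and the VIM_BIN variable are made up for illustration, and this variant was not timed here.

```shell
#!/bin/sh
# Hypothetical wrapper: start Vim with 'fileencodings' set before the first
# file is read. VIM_BIN is an assumed override for the Vim binary to run.
open_with_fencs() {
    fencs=$1
    shift
    # --cmd is executed before any vimrc is processed; with -u NONE there is
    # no vimrc that could change the option again afterwards.
    "${VIM_BIN:-vim}" -u NONE -i NONE -N -X --cmd "set fencs=$fencs" "$@"
}

# Example (uncomment to run; needs Vim and a test file):
# time open_with_fencs ucs-bom test2.txt -c q
```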

> or setting fencs=latin1 in my .vimrc do not avoid the
> slowness. Starting vim with just -u NONE then
>
> :e ++enc=latin1 file.txt
>
> does. I don't understand.

That's because your test file contains character \x83, which is
illegal in latin1. Try ucs-bom instead.

/lcd
