Wednesday, November 19, 2014

Re: :%s//\=@o/gce ignores c flag in key mapping

On 18.11.14 15:26, porphyry5 wrote:
> I'm not sure how awk organizes arrays internally, but I used a plain
> numeric index as I figured it must use an address array to reference
> the words array, and with a numeric index I could use a binary search
> pattern to locate the word.

That is all unnecessary - the lookup is done for us efficiently,
_without_ searching, by an associative array:

http://en.wikipedia.org/wiki/Associative_array

> I think an associative array must use a linear search pattern because
> awk has no way of knowing if the array is actually in sequence.

Erroneous assumption. The wikipedia page has links to "hash table" and
"hash function"; a quick glance at those should suffice to explain the
mechanism. An associative array uses a hash function to vector directly
to the array element indexed by an arbitrary string. As a consequence,
iteration over the array does not visit elements in any easily
anticipated sequence. Does that matter when the array exists primarily
for testing set membership?
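
That unpredictability is easy to demonstrate with a throwaway sketch
(the keys here are arbitrary; gawk, for one, walks the array in
internal hash order unless told otherwise):

awk 'BEGIN {
  list["apple"]; list["pear"]; list["zebra"]; list["mango"]
  for (w in list)
    print w   # order reflects the internal hash, not insertion or the alphabet
}'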

With the associative array, no iteration over the list is required. The
hashing algorithm rapidly computes the address of a "bucket" holding
the strings which share that hash - one, or at worst a handful if the
hashing algorithm is poor and the array elements exceedingly numerous.
I.e. we go straight to the match. It can be seen as a form of "content
addressable memory", perhaps.

That is why I previously wrote:
> > and membership tested with an "if (word in list) ..." in an
> > unconditional action handling the input stream.

If words arrive one per line, then:

{ if ($1 in my_word_list) ... ; else ... }

immediately tests membership, without any search loop - explicit or
implicit. To fill the associative array, this should suffice:

BEGIN {
  file = "path/to/my_words"
  while ((x = getline < file) > 0)   # The extra parentheses around the getline are needed.
    { my_word_list[$1]++ }           # If the count exceeds 1, the word is a duplicate.
  if (x < 0)                         # getline returns -1 when the file cannot be read.
    { print "\n\t my_scriptname: " file ": File not found.\n" ; exit }
}
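
Put together, a complete filter might look like the sketch below. The
list path, the "known"/"unknown" actions, and gawk's "/dev/stderr" are
placeholders, not your actual script:

#!/usr/bin/awk -f
BEGIN {
  file = "path/to/my_words"
  while ((x = getline < file) > 0)   # read the list, one word per line
    my_word_list[$1]++               # count duplicates as a side effect
  if (x < 0) {                       # -1: the list could not be read
    print "my_scriptname: " file ": cannot be read." > "/dev/stderr"
    exit 1
  }
  close(file)
}
{
  if ($1 in my_word_list)
    print $1 ": known"               # placeholder action
  else
    print $1 ": unknown"             # placeholder action
}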

AND, if an alphabetical sort of the word list is ever required for some
reason, then either presort "path/to/my_words", or pipe it to sort,
either in the shell or from within awk, when needed. It is just not
economical to search a 200,000-word list (even a binary search costs
about log2(200000) = 18 probes) for _every_ input word, merely to keep
a sorted word list that may be wanted once or twice a year, if ever.
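
For instance, in the shell:

sort path/to/my_words > my_words.sorted

or from within awk, reusing the my_word_list array from above (awk runs
the command named by the string and feeds the pipe):

END { for (w in my_word_list) print w | "sort" }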

> And of course, I have a second array of word suffixes to reference if
> the word of interest is not the root form.

Now I'm curious - how are the suffixes matched to those of the 200,000
words which may take them? I have a Danish word list which simply
includes each word variant explicitly. If an explicit list takes you to
a million words, then you just need a good hashing algorithm and a
million buckets, _if_ a short search through two or three words is to be
avoided. But a short search in a small bucket would have to be faster
than figuring out which suffixes go with what. Such matching would need
each list word to be identified as noun, adjective, or verb, at a
minimum, to permit correct attachment of adverb suffixes, wouldn't it?
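
Just guessing at a possible layout - say each list entry carries its
word class, and each suffix the class it may attach to - a root-first
lookup could run along these lines (every name and the class encoding
here are assumptions, not your actual scheme):

# Hypothetical layout: my_word_list[root] = "noun|verb|...",
#                      suffixes["ness"]   = "adjective", etc.
function known(word,    s, root) {
  if (word in my_word_list)           # the root form itself
    return 1
  for (s in suffixes) {               # otherwise try stripping each suffix
    if (length(word) > length(s) &&
        substr(word, length(word) - length(s) + 1) == s) {
      root = substr(word, 1, length(word) - length(s))
      if (root in my_word_list &&
          index(my_word_list[root], suffixes[s]))  # class must permit the suffix
        return 1
    }
  }
  return 0
}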

You seem to have a quite interesting task to deal with.

Erik

--
mfox: You can't have infinite growth in a finite world, over population will doom
us all in the end because capitalism depends on infinite growth. We are just
rearranging deck chairs on the titanic.
Maxx: Totally agree mfox, I think someone put Norman Lindsay's "The Magic
Pudding" in the non-fiction section.
- http://www.abc.net.au/news/2014-10-16/kohler-when-a-central-banker-talks-like-this-pay-attention/5815392
