On Wednesday, November 19, 2014 12:10:08 AM UTC-8, Erik Christiansen wrote:
> On 18.11.14 15:26, porphyry5 wrote:
> > I'm not sure how awk organizes arrays internally, but I used a plain
> > numeric index as I figured it must use an address array to reference
> > the words array, and with a numeric index I could use a binary search
> > pattern to locate the word.
> 
> That is all unnecessary - location is efficiently done for us, _without_
> searching, in an associative array:
> 
> http://en.wikipedia.org/wiki/Associative_array
> 
> > I think an associative array must use a linear search pattern because
> > awk has no way of knowing if the array is actually in sequence.
> 
> Erroneous assumption. The wikipedia page has links to "hash table" and
> "hash function". A quick glance at them indicates that they should
> suffice to explain. An associative array uses a hash to directly vector
> to an array element indexed by an arbitrary string. As a consequence,
> iteration over the array does not result in an easily anticipated
> sequence. Does that matter when the array exists primarily for testing
> set membership?
> 
> With the associative array, no iteration over the list is required. The
> hashing algorithm rapidly generates the address of a "bucket",
> containing from one to just a handful (if the hashing algorithm is poor,
> and the array elements exceedingly numerous) of strings with the same
> hash. I.e. we go straight to the match. It can be seen as a form of
> "content addressable memory", perhaps.
> 
> That is why I previously wrote:
> > > and membership tested with an "if (word in list) ..." in an
> > > unconditional action handling the input stream".
> 
> If words arrive one per line, then:
> 
>    { if ($1 in my_word_list) ... ; else ... }
> 
> immediately tests membership, without any search loop - explicit or
> implicit. To fill the associative array, this should suffice:
> 
> BEGIN {
>    file = "path/to/my_words"
>    while ((x = getline < file) > 0)     # It does need the extra braces.
>    {  my_word_list[$1]++ }              # If entry > 1, duplicate.
>    if (x < 0)
>    {  print "\n\t my_scriptname: " file ": File not found.\n" ; exit }
> }
> 
> AND, if an alphabetical sort of the word list is ever required for some
> reason, then either presort "path/to/my_words", or pipe it to sort,
> either in the shell or within awk, when needed. It is just not
> economical to iterate (even in a binary search) over 200,000 words, for
> _every_ input word, just to have a sorted word list, maybe once or twice
> a year, if ever.
> 
> > And of course, I have a second array of word suffixes to reference if
> > the word of interest is not the root form.
> 
> Now I'm curious - how are the suffixes matched to those of the 200,000
> words which may take them? I have a Danish word list which simply
> includes each word variant explicitly. If an explicit list takes you to
> a million words, then you just need a good hashing algorithm and a
> million buckets, _if_ a short search through two or three words is to be
> avoided. But a short search in a small bucket would have to be faster
> than figuring out which suffixes go with what. Such matching would need
> each list word to be identified as noun, adjective, or verb, at a
> minimum, to permit correct attachment of adverb suffixes, wouldn't it?
> 
> You seem to have a quite interesting task to deal with.
> 
> Erik
> 
> -- 
> mfox: You can't have infinite growth in a finite world, over population will doom
> us all in the end because capitalism depends on infinite growth. We are just
> rearranging deck chairs on the titanic.
> Maxx: Totally agree mfox, I think someone put Norman Lindsay's "The Magic
> Pudding" in the non-fiction section.
> - http://www.abc.net.au/news/2014-10-16/kohler-when-a-central-banker-talks-like-this-pay-attention/5815392
So if I read this hash table stuff correctly any data item that one wants to look up in an associative array generates its own address in memory, by using (a constant number of bytes of?) its value as a single binary number and modifying that with some arithmetic algorithm to produce an address in the available range.  That is a really sweet concept, but it must be rather wasteful of memory, lots of unused spaces within its range.  I use ultra-cheap, under-powered laptops (currently acer c720, new laptop for < $200) which has only 2gb memory, but for 200,000 odd words that's 10K available per word, lots of available waste space.
I really appreciate you hammering home this point about associative arrays, I always assumed, their being so convenient to use, that they had to be a very performance-costly luxury.  Who knew.
But I will forthwith rewrite the awk program with associative arrays and compare the timing for the two methods.  Thank you very much for the trouble you have gone to on this.
-- 
-- 
You received this message from the "vim_use" maillist.
Do not top-post! Type your reply below the text you are replying to.
For more information, visit http://www.vim.org/maillist.php
--- 
You received this message because you are subscribed to the Google Groups "vim_use" group.
To unsubscribe from this group and stop receiving emails from it, send an email to vim_use+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.
Wednesday, November 19, 2014
Subscribe to:
Post Comments (Atom)
 
No comments:
Post a Comment