User talk:Tonyjv


Can we talk about your Python edits please?

Hi, I have spent time, and looked at most of your edits and disagreed with them. Is it possible that we could talk about our respective views on your edits, and maybe invite others for their views too so we can reach a consensus?

The site is a collaborative wiki, so it would be great if several people gave their input on this, so please speak up!
Thanks --Paddy3118 21:23, 30 June 2010 (UTC)

I have been busy with DaniWeb (member of the month in April) and trying to do some work, so I have not checked these pages for ages. Now that I am looking into getting involved with the Go language as well, I came back to these pages and saw your comment from a long time ago.

Sort instead of max in Anagrams

Hi Tony, if you only use the max value in your Anagrams#Python entry, why change from using max to using sort? Thanks, --Paddy3118 11:26, 19 September 2011 (UTC)

Because it pays off when looping to output all groups of that length; see the break in the loop and compare it to the other solution. --Tonyjv 16:10, 20 September 2011 (UTC)

How is sorting everything for just the largest value 'better' than just explicitly finding the largest? Isn't the latter clearer? --Paddy3118 18:14, 20 September 2011 (UTC)

Because we want not just one maximum but all groups that reach the maximum, in the same way that groupby needs sorted input to do its job. Sorting the anagram groups by the number of anagrams first does this more economically (around 10% faster in this case on my computer, with an in-place sort). We can then print out the result without a full pass over the groups (which would mean taking each length again). --Tonyjv 20:43, 20 September 2011 (UTC)
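To make the comparison concrete, here is a minimal sketch of the two approaches; the variable names and the tiny sample data are mine, purely for illustration:

<lang python># 'groups' stands for any list of anagram groups (sample data for illustration).
groups = [['abel', 'able', 'bale', 'bela', 'elba'],
          ['cat', 'act'],
          ['alger', 'glare', 'lager', 'large', 'regal']]

# max then filter: find the largest group size, then pick every group of that size.
largest = max(len(g) for g in groups)
by_max = [g for g in groups if len(g) == largest]

# sort then break: sort by size descending and stop as soon as the size drops.
groups.sort(key=len, reverse=True)
by_sort = []
for g in groups:
    if len(g) < len(groups[0]):
        break
    by_sort.append(g)

assert sorted(by_max) == sorted(by_sort)</lang>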

The other style version I used for comparison goes like this (Python 3): <lang python>import urllib.request, itertools
import time

words = urllib.request.urlopen('http://www.puzzlers.org/pub/wordlists/unixdict.txt').read().split()
print('Words ready')

t0 = time.clock()
anagrams = [list(g) for k, g in itertools.groupby(sorted(words, key=sorted), key=sorted)]
count = max(len(ana) for ana in anagrams)
for ana in anagrams:
    if len(ana) == count:
        print(ana)

t0 -= time.clock()
print('Finished in %f s' % -t0)</lang>

I get it now. Rather than getting the max and then plucking those with that length from a list, you sort on the group lengths and store the result, then select from the maximum end of the sorted list until the length is no longer equal to that of the first. Neat. Thanks for taking the time to explain. --Paddy3118 21:07, 20 September 2011 (UTC)
Sorting takes O(n log n) time, whereas finding the maximum and filtering a list takes linear time. It is true that sorting is being done regardless (because of the groupby(sorted(...))), so both your previous example and the current example take O(n log n) time overall. But since the grouping part and the finding-the-maximum-length part are independent, I don't see how choosing a slower algorithm for the latter part will improve the algorithm overall. --Spoon! 01:43, 21 September 2011 (UTC)
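For what it is worth, the maximum-size groups can also be collected in a single linear pass with no sorting at all; a minimal sketch (again, 'groups' is any list of anagram groups and the names are mine):

<lang python># Sample data for illustration; 'groups' is any list of anagram groups.
groups = [['cat', 'act'], ['abel', 'able', 'bale'], ['dog', 'god']]

# One pass: track the best size seen so far and the groups that reach it.
best, winners = 0, []
for g in groups:
    if len(g) > best:
        best, winners = len(g), [g]
    elif len(g) == best:
        winners.append(g)</lang>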
Isn't this a moot point, anyway? Most of the time is spent sorting letters in words and grouping anagrams; optimizing after that is barking up the wrong tree. If you really want to make it faster, avoid things like modules that you don't have control over. The following runs about twice as fast as the example above on my system dictionary, probably because it doesn't call sorted on each word twice: <lang python>d = {}

for w in open('dict').read().split():
    x = "".join(sorted(w))
    if x in d:
        d[x].append(w)
    else:
        d[x] = [w]

lm = max([len(d[x]) for x in d])
for x in d:
    if (len(d[x]) == lm):
        print lm, d[x]</lang>

P.S. I try not to make the assumption that better==faster. There are other considerations to take into account. Having said that, your faster example above has a certain elegance... --Paddy3118 06:03, 21 September 2011 (UTC)

That will not run in Python 3, as bytes cannot be joined. Here is my fastest version not using groupby; I did not compare it with the non-groupby version on the page under discussion (this runs in both Python 2 and Python 3): <lang python>from collections import defaultdict
import time

try:
    import urllib.request
    words = urllib.request.urlopen('http://www.puzzlers.org/pub/wordlists/unixdict.txt').read().split()
except ImportError:
    import urllib
    words = urllib.urlopen('http://www.puzzlers.org/pub/wordlists/unixdict.txt').read().split()

print('Words ready')

d = defaultdict(list)
t0 = time.clock()
for w in words:
    d["".join(sorted(str(w)))].append(w)

d = sorted(d.values(), key=len, reverse=True)
lm = len(d[0])
for value in d:
    if (len(value) == lm):
        print(value)
    else:
        break

t0 -= time.clock()
print('Finished in %f s\n' % -t0)</lang>

You can also collect the result as you go, though the running time is about the same:

<lang python>print('Collect result')
d = defaultdict(list)
t0 = time.clock()
lm = 0

for w in words:
    key = "".join(sorted(str(w)))
    d[key].append(w)
    lk = len(d[key])
    if lk < lm:
        continue

    if lk > lm:
        lm = lk
        result = [d[key]]
        #print('New length: %i (%s)' % (lm, result))
    else:
        result.append(d[key])

t0 -= time.clock()
for value in result:
    print(value)

print('Finished in %f s (except print)\n' % -t0)</lang>
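As an aside on the Python 3 bytes issue mentioned above: another option is to decode the downloaded data once, so the words are plain strings and can be joined without wrapping each one in str(); a minimal sketch, assuming the same word list URL:

<lang python>import urllib.request

# Decode the downloaded bytes once; the resulting words are plain str,
# so "".join(sorted(w)) works directly on each word.
data = urllib.request.urlopen('http://www.puzzlers.org/pub/wordlists/unixdict.txt').read()
words = data.decode('utf-8').split()</lang>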

A few suggestions:
  1. If you are going to benchmark a lot, pull the dict.txt file to your local hard drive; don't repeatedly urlopen it from the remote host (see the sketch below).
  2. My example code runs under Python 3.1 just fine, except for the print syntax change.
  3. Like I said, the majority of the time is spent sorting letters; once all anagrams are in place, sorting the list or not makes no big difference (your sorted version above is only marginally slower).
--Ledrug 07:21, 21 September 2011 (UTC)
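A minimal sketch of the local-caching idea from point 1 (the cache filename is my own choice; any path will do):

<lang python>import os

URL = 'http://www.puzzlers.org/pub/wordlists/unixdict.txt'
CACHE = 'unixdict.txt'  # local copy; the filename is an arbitrary choice

# Download the word list once and reuse the local copy on later runs.
if not os.path.exists(CACHE):
    try:
        from urllib.request import urlopen   # Python 3
    except ImportError:
        from urllib import urlopen           # Python 2
    with open(CACHE, 'wb') as f:
        f.write(urlopen(URL).read())

with open(CACHE) as f:
    words = f.read().split()</lang>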

urllib.request.urlopen('http://www.puzzlers.org/pub/wordlists/unixdict.txt').read().split() produces bytes, which cannot be joined before converting to str; it gives the error TypeError: must be str, not bytes. For my own anagrams program, I actually build the anagram synonym dictionary only once, if it is not already saved on disc. I do need to put in a little sorting effort to get the list in descending order of length, and to filter special characters (we'd -> dew, for example), so that the dictionary can be any collection of words, one per line; that works better for multiword anagrams. I only split into words that are possible from the current source words; otherwise I keep the list as a string of words. On my humble Sempron PC, the anagrams of my own name from the Finnish dictionary (82229 words, 79248 sorted letter combinations; your program took 0.8 s and interestingly also found a 5 word anagram, 5 different ones) take around 600 ms (Python 2.6 and psyco for best results):

52 words loaded in 68 ms. 1564 anagrams of tonyveijalainen found! Processing took 567 ms.
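A rough sketch of the kind of build-once, reuse-later dictionary described above; the file names, the pickle cache and the filtering helper are my own assumptions, not the original program:

<lang python>import os
import pickle
from collections import defaultdict

WORDLIST = 'words.txt'          # any collection of words, one per line (assumed filename)
CACHE = 'anagram_dict.pickle'   # on-disc cache of the anagram synonym dictionary (assumed filename)

def letters_key(word):
    # Drop special characters so that e.g. "we'd" and "dew" share the key 'dew'.
    return "".join(sorted(c for c in word.lower() if c.isalpha()))

if os.path.exists(CACHE):
    # Reuse the dictionary built on an earlier run.
    with open(CACHE, 'rb') as f:
        anagrams = pickle.load(f)
else:
    # Build the sorted-letters -> words mapping once and save it to disc.
    anagrams = defaultdict(list)
    with open(WORDLIST) as f:
        for w in f.read().split():
            key = letters_key(w)
            if key:
                anagrams[key].append(w)
    anagrams = dict(anagrams)
    with open(CACHE, 'wb') as f:
        pickle.dump(anagrams, f)</lang>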

You are describing a completely different problem. Sorting may help with multi-word anagrams, and should help with data reuse, but for the specific task it brings no benefit.
Incidentally, what do you mean by "5 word anagram"? Here's a 12 from my Finnish dictionary: (12, ['ankarasti', 'ankarista', 'arkistaan', 'karitsana', 'karsintaa', 'karsitaan', 'kitaransa', 'narikasta', 'rakastani', 'rankaista', 'raskainta', 'sarkainta']). If that doesn't count somehow, at least there is: (6, ['painottumassa', 'poistumastaan', 'punoittamassa', 'putoamistansa', 'upottamassani', 'upottamissaan']) --Ledrug 20:21, 21 September 2011 (UTC)

I have a Finnish dictionary with words in basic forms only. On the Linux side I did some processing on the word file installed in Ubuntu (I had to remove \\$ at the end and filter repeated words). I also got two 12s: ['rankaista', 'ankarista', 'karitsana', 'karsintaa', 'raskainta', 'kitaransa', 'karsitaan', 'rakastani', 'sarkainta', 'arkistaan', 'narikasta', 'ankarasti'] ['nostaa', 'otsaan', 'tasona', 'saaton', 'sotaan', 'sotana', 'sontaa', 'sanota', 'otsana', 'satona'] with my collect version, but it took 9.36 s without psyco (interestingly 9.64 s with psyco), Python 2.6.5.