User talk:Tonyjv

:I get it now. Rather than getting the max then plucking those with that length from a list, you calculate the length, sort on it, and store it, then select from the maximum end of the sorted list until the length is no longer equal to the length of the first. Neat. Thanks for taking the time to explain. --[[User:Paddy3118|Paddy3118]] 21:07, 20 September 2011 (UTC)
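: (A compact sketch of the scheme just described, with made-up data; this is not the task-page code itself:)<lang python>
from itertools import takewhile

# made-up anagram groups, purely illustrative
anagram_groups = [['ab', 'ba'], ['abc', 'bca', 'cab'], ['de', 'ed'], ['x']]
anagram_groups.sort(key=len, reverse=True)       # sort on length, descending
# select from the maximum end until a group is shorter than the first
top = list(takewhile(lambda g: len(g) == len(anagram_groups[0]), anagram_groups))
print(top)   # [['abc', 'bca', 'cab']]
</lang>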
 
:Sorting is slow because it takes O(n log n) time; whereas finding the maximum and filtering a list takes linear time. It is true that sorting is being done regardless (because of the <code>groupby(sorted(...))</code>), so both your previous example and the current example take O(n log n) time overall. But since the grouping part and the finding-maximum-length part are independent, I don't see how choosing a slower algorithm for the latter part will improve the algorithm overall. --[[User:Spoon!|Spoon!]] 01:43, 21 September 2011 (UTC)
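: (For illustration, a minimal sketch of the selection just described, with made-up data: after grouping, one linear pass finds the maximum group size and a second pass filters, so no sort of the groups is needed. <code>defaultdict</code> is used here as in the examples further down.)<lang python>
from collections import defaultdict

words = ['abc', 'cab', 'bca', 'ab', 'ba', 'x']   # illustrative word list
groups = defaultdict(list)
for w in words:
    groups["".join(sorted(w))].append(w)         # group by sorted letters

lm = max(len(g) for g in groups.values())        # linear pass: max group size
largest = [g for g in groups.values() if len(g) == lm]   # linear pass: filter
print(largest)   # [['abc', 'cab', 'bca']]
</lang>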
:: Isn't this a moot point, anyway? Most of the time is spent sorting letters in words and grouping anagrams; optimizing ''after'' that is barking up the wrong tree. If you really want to make it faster, avoid things like modules you have no control over. The following runs about twice as fast as the example above on my system dictionary, probably because it doesn't call <code>sorted</code> on each word twice:<lang python>d = {}
for w in open('dict').read().split():
    x = "".join(sorted(w))      # sorted letters identify the anagram group
    if x in d:
        d[x].append(w)
    else:
        d[x] = [w]

lm = max([len(d[x]) for x in d])    # size of the largest group
for x in d:
    if (len(d[x]) == lm): print lm, d[x]</lang>
 
:::P.S. I try and not make the assumption that better==faster. There are other considerations to take into account. Having said that, your faster example above has a certain elegance... --[[User:Paddy3118|Paddy3118]] 06:03, 21 September 2011 (UTC)
That will not run in Python 3, as bytes cannot be joined. Here is my fastest version not using groupby; I did not compare it with the non-groupby version on the page under discussion (it runs in both Python 2 and Python 3):
<lang python>
from collections import defaultdict
import time

try:
    # Python 3
    import urllib.request
    words = urllib.request.urlopen('http://www.puzzlers.org/pub/wordlists/unixdict.txt').read().split()
except ImportError:
    # Python 2
    import urllib
    words = urllib.urlopen('http://www.puzzlers.org/pub/wordlists/unixdict.txt').read().split()

print('Words ready')

d = defaultdict(list)
t0 = time.clock()
for w in words:
    d["".join(sorted(str(w)))].append(w)    # str() so bytes work under Python 3

d = sorted(d.values(), key=len, reverse=True)   # longest groups first
lm = len(d[0])
for value in d:
    if len(value) == lm:
        print(value)
    else:
        break
t0 -= time.clock()
print('Finished in %f s\n' % -t0)
 
</lang>
 
You can also collect the result as the dictionary is built, though the running time is about the same:
 
<lang python>
print('Collect result')
d = defaultdict(list)
t0 = time.clock()
lm = 0

for w in words:
    key = "".join(sorted(str(w)))
    d[key].append(w)
    lk = len(d[key])
    if lk < lm:
        continue
    if lk > lm:
        # new record length: start a fresh result list
        lm = lk
        result = [d[key]]
        #print('New length: %i (%s)' % (lm, result))
    else:
        result.append(d[key])
t0 -= time.clock()
for value in result:
    print(value)
print('Finished in %f s (except print)\n' % -t0)
 
</lang>
: A few suggestions:
# If you are going to benchmark a lot, pull the dict.txt file to your local hard drive instead of repeatedly fetching it from the remote host with urlopen (see the sketch after this list).
# My example code runs under python 3.1 just fine, except for the <code>print</code> syntax change.
# Like I said, the majority of the time is spent sorting letters; once all anagrams are in place, sorting the list or not makes no big difference (your sorted version above is only marginally slower).
:--[[User:Ledrug|Ledrug]] 07:21, 21 September 2011 (UTC)
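: (A sketch of suggestion 1, caching the word list on disk so repeated benchmark runs skip the network; the local file name is an arbitrary choice, Python 3 shown.)<lang python>
import os
import urllib.request

URL = 'http://www.puzzlers.org/pub/wordlists/unixdict.txt'
CACHE = 'unixdict.txt'   # arbitrary local file name

if not os.path.exists(CACHE):
    urllib.request.urlretrieve(URL, CACHE)   # fetch once, reuse afterwards

with open(CACHE) as f:
    words = f.read().split()
</lang>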
 
urllib.request.urlopen('http://www.puzzlers.org/pub/wordlists/unixdict.txt').read().split() produces bytes, which cannot be joined without converting to str first; it gives the error: TypeError: must be str, not bytes. For my anagrams program, I actually build the anagram synonym dictionary only once, if it is not already saved on disc. I do need to put in a little sorting effort to get the list in descending order of length, and to filter special characters (we'd -> dew, for example), allowing the dictionary to be any collection of words, one per line, as that works better for multi-word anagrams. I only split off the words possible for the current source words; otherwise I keep the list as a string of words. On my humble Sempron PC, finding anagrams of my own name from a Finnish dictionary (82229 words, 79248 sorted letter combinations; your program took 0.8 s and interestingly also found a 5-word anagram, 5 different ones) takes around 600 ms (Python 2.6 and psyco for best results):
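(A minimal way around that TypeError under Python 3 is to decode the downloaded bytes before splitting, assuming the word list is plain ASCII/UTF-8, instead of wrapping each word in str():)<lang python>
import urllib.request

data = urllib.request.urlopen('http://www.puzzlers.org/pub/wordlists/unixdict.txt').read()
words = data.decode('utf-8').split()   # real str objects, so join works
key = "".join(sorted(words[0]))        # no str(w) workaround needed
</lang>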
 
52 words loaded in 68 ms.
1564 anagrams of
tonyveijalainen found!
Processing took 567 ms.
 
: You are describing a completely different problem. Sorting may help with multi-word anagrams, and should help with data reuse, but for the specific task here it brings no benefit.
: Incidentally, what do you mean by "5 word anagram"? Here's a 12 from my Finnish dictionary: (12, ['ankarasti', 'ankarista', 'arkistaan', 'karitsana', 'karsintaa', 'karsitaan', 'kitaransa', 'narikasta', 'rakastani', 'rankaista', 'raskainta', 'sarkainta']). If that doesn't count somehow, at least there is: (6, ['painottumassa', 'poistumastaan', 'punoittamassa', 'putoamistansa', 'upottamassani', 'upottamissaan']) --[[User:Ledrug|Ledrug]] 20:21, 21 September 2011 (UTC)
 
I have a Finnish dictionary with words in basic forms only. On the Linux side I did some processing of the word file installed with Ubuntu (I had to remove the \\$ at the end of entries and filter repeated words), and I also got two 12s: ['rankaista', 'ankarista', 'karitsana', 'karsintaa', 'raskainta', 'kitaransa', 'karsitaan', 'rakastani', 'sarkainta', 'arkistaan', 'narikasta', 'ankarasti']
['nostaa', 'otsaan', 'tasona', 'saaton', 'sotaan', 'sotana', 'sontaa', 'sanota', 'otsana', 'satona']
with my collect version, but it took 9.36 s without psyco (interestingly, 9.64 s with psyco) on Python 2.6.5.
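(A sketch of that kind of cleanup, assuming entries end with a literal \$ marker and one word per line; the file names are hypothetical.)<lang python>
seen = set()
with open('words.raw') as fin, open('words.clean', 'w') as fout:
    for line in fin:
        w = line.strip()
        if w.endswith('\\$'):          # drop the trailing \$ marker
            w = w[:-2]
        if w and w not in seen:        # filter repeated words
            seen.add(w)
            fout.write(w + '\n')
</lang>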