User talk:Tonyjv

:I get it now. Rather than getting the max then plucking those with that length from a list, you calculate the length, sort on it, and store it, then select from the maximum end of the sorted list until the length is no longer equal to the length of the first. Neat. Thanks for taking the time to explain. --[[User:Paddy3118|Paddy3118]] 21:07, 20 September 2011 (UTC)
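: (A compact sketch of the scheme just described, with made-up data; this is not the task-page code itself:)<lang python>
from itertools import takewhile

# made-up anagram groups, purely illustrative
anagram_groups = [['ab', 'ba'], ['abc', 'bca', 'cab'], ['de', 'ed'], ['x']]
anagram_groups.sort(key=len, reverse=True)       # sort on length, descending
# select from the maximum end until a group is shorter than the first
top = list(takewhile(lambda g: len(g) == len(anagram_groups[0]), anagram_groups))
print(top)   # [['abc', 'bca', 'cab']]
</lang>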
 
:Sorting is slow because it takes O(n log n) time; whereas finding the maximum and filtering a list takes linear time. It is true that sorting is being done regardless (because of the <code>groupby(sorted(...))</code>), so both your previous example and the current example take O(n log n) time overall. But since the grouping part and the finding-maximum-length part are independent, I don't see how choosing a slower algorithm for the latter part will improve the algorithm overall. --[[User:Spoon!|Spoon!]] 01:43, 21 September 2011 (UTC)
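: (For illustration, a minimal sketch of the selection just described, with made-up data: after grouping, one linear pass finds the maximum group size and a second pass filters, so no sort of the groups is needed. <code>defaultdict</code> is used here as in the examples further down.)<lang python>
from collections import defaultdict

words = ['abc', 'cab', 'bca', 'ab', 'ba', 'x']   # illustrative word list
groups = defaultdict(list)
for w in words:
    groups["".join(sorted(w))].append(w)         # group by sorted letters

lm = max(len(g) for g in groups.values())        # linear pass: max group size
largest = [g for g in groups.values() if len(g) == lm]   # linear pass: filter
print(largest)   # [['abc', 'cab', 'bca']]
</lang>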
:: Isn't this a moot point, anyway? Most of the time is spent sorting letters in words and grouping anagrams; optimizing ''after'' that is barking up the wrong tree. If you really want to make it faster, avoid things like modules you have no control over. The following runs about twice as fast as the example above on my system dictionary, probably because it doesn't call <code>sorted</code> on each word twice:<lang python>d = {}
for w in open('dict').read().split():
    x = "".join(sorted(w))      # sorted letters identify the anagram group
    if x in d:
        d[x].append(w)
    else:
        d[x] = [w]

lm = max([len(d[x]) for x in d])    # size of the largest group
for x in d:
    if (len(d[x]) == lm): print lm, d[x]</lang>
 
:::P.S. I try and not make the assumption that better==faster. There are other considerations to take into account. Having said that, your faster example above has a certain elegance... --[[User:Paddy3118|Paddy3118]] 06:03, 21 September 2011 (UTC)
That will not run in Python 3, as bytes cannot be joined. Here is my fastest version not using groupby; I did not compare it with the non-groupby version on the page under discussion (it runs in both Python 2 and Python 3):
<lang python>
from collections import defaultdict
import time

try:
    # Python 3
    import urllib.request
    words = urllib.request.urlopen('http://www.puzzlers.org/pub/wordlists/unixdict.txt').read().split()
except ImportError:
    # Python 2
    import urllib
    words = urllib.urlopen('http://www.puzzlers.org/pub/wordlists/unixdict.txt').read().split()

print('Words ready')

d = defaultdict(list)
t0 = time.clock()
for w in words:
    d["".join(sorted(str(w)))].append(w)    # str() so bytes work under Python 3

d = sorted(d.values(), key=len, reverse=True)   # longest groups first
lm = len(d[0])
for value in d:
    if len(value) == lm:
        print(value)
    else:
        break
t0 -= time.clock()
print('Finished in %f s\n' % -t0)
 
</lang>
 
You can also collect the result as the dictionary is built, though the running time is about the same:
 
<lang python>
print('Collect result')
d = defaultdict(list)
t0 = time.clock()
lm = 0

for w in words:
    key = "".join(sorted(str(w)))
    d[key].append(w)
    lk = len(d[key])
    if lk < lm:
        continue
    if lk > lm:
        # new record length: start a fresh result list
        lm = lk
        result = [d[key]]
        #print('New length: %i (%s)' % (lm, result))
    else:
        result.append(d[key])
t0 -= time.clock()
for value in result:
    print(value)
print('Finished in %f s (except print)\n' % -t0)
 
</lang>
: A few suggestions:
# If you are going to benchmark a lot, pull the dict.txt file to your local hard drive instead of repeatedly fetching it from the remote host with urlopen (see the sketch after this list).
# My example code runs under python 3.1 just fine, except for the <code>print</code> syntax change.
# Like I said, the majority of the time is spent sorting letters; once all anagrams are in place, sorting the list or not makes no big difference (your sorted version above is only marginally slower).
:--[[User:Ledrug|Ledrug]] 07:21, 21 September 2011 (UTC)
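: (A sketch of suggestion 1, caching the word list on disk so repeated benchmark runs skip the network; the local file name is an arbitrary choice, Python 3 shown.)<lang python>
import os
import urllib.request

URL = 'http://www.puzzlers.org/pub/wordlists/unixdict.txt'
CACHE = 'unixdict.txt'   # arbitrary local file name

if not os.path.exists(CACHE):
    urllib.request.urlretrieve(URL, CACHE)   # fetch once, reuse afterwards

with open(CACHE) as f:
    words = f.read().split()
</lang>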
 
urllib.request.urlopen('http://www.puzzlers.org/pub/wordlists/unixdict.txt').read().split() produces bytes, which cannot be joined without converting to str first; it gives the error: TypeError: must be str, not bytes. For my anagrams program, I actually build the anagram synonym dictionary only once, if it is not already saved on disc. I do need to put in a little sorting effort to get the list in descending order of length, and to filter special characters (we'd -> dew, for example), allowing the dictionary to be any collection of words, one per line, as that works better for multi-word anagrams. I only split off the words possible for the current source words; otherwise I keep the list as a string of words. On my humble Sempron PC, finding anagrams of my own name from a Finnish dictionary (82229 words, 79248 sorted letter combinations; your program took 0.8 s and interestingly also found a 5-word anagram, 5 different ones) takes around 600 ms (Python 2.6 and psyco for best results):
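(A minimal way around that TypeError under Python 3 is to decode the downloaded bytes before splitting, assuming the word list is plain ASCII/UTF-8, instead of wrapping each word in str():)<lang python>
import urllib.request

data = urllib.request.urlopen('http://www.puzzlers.org/pub/wordlists/unixdict.txt').read()
words = data.decode('utf-8').split()   # real str objects, so join works
key = "".join(sorted(words[0]))        # no str(w) workaround needed
</lang>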
 
52 words loaded in 68 ms.
1564 anagrams of
tonyveijalainen found!
Processing took 567 ms.
 
: You are describing a completely different problem. Sorting may help with multi-word anagrams, and should help with data reuse, but for the specific task here it brings no benefit.
: Incidentally, what do you mean by "5 word anagram"? Here's a 12 from my Finnish dictionary: (12, ['ankarasti', 'ankarista', 'arkistaan', 'karitsana', 'karsintaa', 'karsitaan', 'kitaransa', 'narikasta', 'rakastani', 'rankaista', 'raskainta', 'sarkainta']). If that doesn't count somehow, at least there is: (6, ['painottumassa', 'poistumastaan', 'punoittamassa', 'putoamistansa', 'upottamassani', 'upottamissaan']) --[[User:Ledrug|Ledrug]] 20:21, 21 September 2011 (UTC)
 
I have a Finnish dictionary with words in basic forms only. On the Linux side I did some processing of the word file installed with Ubuntu (I had to remove the \\$ at the end of entries and filter repeated words), and I also got two 12s: ['rankaista', 'ankarista', 'karitsana', 'karsintaa', 'raskainta', 'kitaransa', 'karsitaan', 'rakastani', 'sarkainta', 'arkistaan', 'narikasta', 'ankarasti']
['nostaa', 'otsaan', 'tasona', 'saaton', 'sotaan', 'sotana', 'sontaa', 'sanota', 'otsana', 'satona']
with my collect version, but it took 9.36 s without psyco (interestingly, 9.64 s with psyco) on Python 2.6.5.
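(A sketch of that kind of cleanup, assuming entries end with a literal \$ marker and one word per line; the file names are hypothetical.)<lang python>
seen = set()
with open('words.raw') as fin, open('words.clean', 'w') as fout:
    for line in fin:
        w = line.strip()
        if w.endswith('\\$'):          # drop the trailing \$ marker
            w = w[:-2]
        if w and w not in seen:        # filter repeated words
            seen.add(w)
            fout.write(w + '\n')
</lang>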