User talk:WillNess

: I've now re-coded foldl' as loops too and got essentially the same timings as your entry on Ideone (about a 20% speedup; I won't edit it in, to keep the clarity of the current version). But there is still potential for improvement: the k-j enumeration is easily parallelizable: partition the `k`s so that the k-j triangle is divided into 4 (8, ...) strips of roughly equal area, and process each strip in parallel, as sketched below. Another idea I once had can be seen [http://stackoverflow.com/questions/10126285/generating-integers-in-ascending-order-using-a-set-of-prime-numbers/10160054#comment15193650_10126894 here], but it's really vague. :) -- [[User:WillNess|WillNess]] ([[User talk:WillNess|talk]]) 11:07, 20 August 2016 (UTC)
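: A minimal sketch of the strip-partitioning idea, assuming a placeholder <code>countStrip</code> in place of the entry's actual k-j computation:
<lang haskell>
import Control.Parallel.Strategies (parMap, rdeepseq)

-- Placeholder for the real per-strip work: count over one contiguous
-- range of k, with j running over the k-j triangle.
countStrip :: (Int, Int) -> Int
countStrip (kLo, kHi) = length [ () | k <- [kLo .. kHi], _j <- [0 .. k] ]

-- Split [0 .. kMax] into n contiguous strips and evaluate them in
-- parallel; equal-area strips would balance better than equal-length
-- ones, since the j range grows with k.
parCount :: Int -> Int -> Int
parCount n kMax = sum (parMap rdeepseq countStrip strips)
  where
    step   = (kMax + n) `div` n
    strips = [ (lo, min kMax (lo + step - 1)) | lo <- [0, step .. kMax] ]
</lang>
: Built with <code>ghc -O2 -threaded</code> and run with <code>+RTS -N</code>, each strip is then reduced on its own capability.
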
:: I think you'll find that the loops encoding is quite a bit more than a 20% to 30% speed-up when using 64-bit code and Word64 for the internal count computation - more like 200% (at least on my AMD Bulldozer, which isn't very efficient at using CPU cache). The reasons for the extreme difference are several: 1) simulated 64-bit operations in 32-bit code are extremely inefficient in current GHC due to its poor register allocation, which means a large number of register spills and reloads to/from RAM or at least CPU cache, and 2) the inner loop consists of fairly simple computations but uses quite a large number of values, including j, jlmt, lb3, hi, w (or lo in my code), q, frac, r, and i (in both integer and float form), which, if register use isn't optimized properly, require more registers than the x86 architecture provides (and current GHC does worse still). When each of these is held in one 64-bit register, the inner loop runs very fast, in just 15 CPU clock cycles or so for the loop encoding without the list-comprehension overhead; but this is multiplied when values go through memory, and multiplied yet again when some of the operations need 64-bit precision (for the counting) simulated from 32-bit operations, at which point the list-comprehension overhead is no longer as significant. Using the LLVM backend might help some here, even for 32-bit code, through more efficient register use, but of course we can't use that on IdeOne (AFAIK). Note that I haven't been able to test LLVM either, as under Windows LLVM doesn't work with 64-bit code and I can't get it working with 32-bit code for this program (promised to be fixed in GHC version 8.2)... --[[User:GordonBGood|GordonBGood]] ([[User talk:GordonBGood|talk]]) 01:20, 21 August 2016 (UTC)
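:: As an aside, here is the general shape of the "loops" encoding under discussion, next to a list-comprehension equivalent (a minimal model, not the entry's actual inner loop):
<lang haskell>
{-# LANGUAGE BangPatterns #-}
import Data.Word (Word64)

-- List-comprehension form: concise, but the list machinery has to be
-- fused away by the compiler for it to run in registers.
sumComp :: Word64 -> Word64
sumComp n = sum [ k * k | k <- [1 .. n] ]

-- Loop encoding: a strict, tail-recursive worker that GHC's -O2
-- reliably compiles to a register-only counting loop.
sumLoop :: Word64 -> Word64
sumLoop n = go 1 0
  where
    go !k !acc
      | k > n     = acc
      | otherwise = go (k + 1) (acc + k * k)
</lang>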
 
:: Yes, of course parallelization of the outer loop is easily possible and would speed the code up by a constant factor, but it is probably a little much for a RosettaCode posting. --[[User:GordonBGood|GordonBGood]] ([[User talk:GordonBGood|talk]]) 01:20, 21 August 2016 (UTC)
 
::: Interesting stuff, thanks. Of course I always expect the more abstract source code to be compiled into low-level code that is as efficient as possible; that was The High Premise of Equational Reasoning as I understood it. All this toiling with the minutiae of coding is off-putting to me. I'd much rather write <code>(sum *** concat) . unzip $ [.....]</code> (illustrated below). :) :) -- [[User:WillNess|WillNess]] ([[User talk:WillNess|talk]]) 09:03, 21 August 2016 (UTC)
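::: A self-contained illustration of that combinator shape, with a stand-in comprehension body (the original elides it as <code>[.....]</code>):
<lang haskell>
import Control.Arrow ((***))

-- Build (count, items) pairs once, then split them into a grand total
-- and one collected list in a single unzip pass.
sumAndConcat :: [(Int, [Int])] -> (Int, [Int])
sumAndConcat = (sum *** concat) . unzip

example :: (Int, [Int])
example = sumAndConcat [ (i, [i, i * i]) | i <- [1 .. 4] ]
-- (10, [1,1,2,4,3,9,4,16])
</lang>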
 
:::: That's a beautiful expression of the code, but it has a horrible space leak: it goes "out of memory" for anything much above 10 billion and spends about 80% of its time on garbage collection. I think the problem is that the generated zip list is entirely unevaluated (including the "add to list" predicate check) until used by the "sum" and "sortBy" functions - a problem the "foldl'" version didn't have, as it was forced to be strict. I'm afraid that while the highly abstract THPoER form can sometimes reduce complex code to something simple enough that it runs faster, in this case the abstraction comes at an extremely high cost, because it is in practice expressed as non-strict lists whose construction takes many extra cycles in the innermost loop. There are also the practical matters of memory use and garbage collection... Sometimes we need to look at code from a low-level perspective to see what it is really doing; I'm afraid that in this case the non-strictness of Haskell has bitten you on the a... ;) --[[User:GordonBGood|GordonBGood]] ([[User talk:GordonBGood|talk]]) 03:56, 22 August 2016 (UTC)
 
::::: Yes, I saw that error above 10--20B; I decided not to worry about it here, and to go with the shorter code that expresses the algorithm clearly. The "out of memory" bit comes from Ideone, and I'm not entirely sure why, when it still reports the exact same memory consumption for all the cases below 10B. ''Real'' space leaks usually show up on Ideone as reported memory climbing into the 100s of MBs; not so here. In any case, it's a compiler issue. And we have your version on the page now for the optimized code. :) -- [[User:WillNess|WillNess]] ([[User talk:WillNess|talk]]) 06:50, 22 August 2016 (UTC)
 
:::::: Just for completeness, the "out of memory" error isn't only on IdeOne; it also occurs with Windows GHC version 8.0.1 for values approaching a trillion on my machine (which has a lot of memory; I haven't tried to determine the exact limit). The < 20 billion limit on IdeOne is due to the amount of memory assigned. I suspect it isn't flagged as a "real" space leak because there is no fixed pointer to the unevaluated zipped list; there are just the unevaluated list comprehensions. When the "sum" operation is run, it forces the partial evaluation of every list element to the first part of the tuple '''but not the evaluation of the predicate on whether a given element gets included in the concatenated list or not'''; this last doesn't get evaluated until the determined values get sorted, at which point the unused values get dumped (and garbage collected). So, during the course of calculation to a trillion, something like hundreds of gigabytes would be consumed (my machine doesn't have that much and thus fails). Thus, during the execution of the program the memory use goes from zero to very high but is back to very low again by the time the program ends. Although the expression of the algorithm is beautiful, the memory use profile is not! And yes, my version shows what can be done, including that if IdeOne were 64-bit, 10 billion should take less than 150 milliseconds, rather than about half a second for the "foldl'" version or about 1.5 seconds for this version, and that maximum memory residency (as determined by "+RTS -s") is almost zero, not many megabytes. --[[User:GordonBGood|GordonBGood]] ([[User talk:GordonBGood|talk]]) 08:28, 22 August 2016 (UTC)
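:::::: For contrast, a minimal model of the strict accumulation in the spirit of the "foldl'" version (not its actual code):
<lang haskell>
{-# LANGUAGE BangPatterns #-}
import Data.List (foldl')

-- The bang pattern forces the running count each time the accumulator
-- is matched, so no long chain of (+) thunks is retained while the
-- input pairs stream past.
sumAndKeep :: [(Int, [Int])] -> (Int, [Int])
sumAndKeep = foldl' step (0, [])
  where
    step (!c, kept) (n, xs) = (c + n, xs ++ kept)
</lang>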
 
::::::: Ah, great, thanks for that. Yes, I understand the space leak here, that's why I wrote those other versions back then; just thought ''maybe'' they fixed it by now with the newer GHC somehow. You're right, all this should be made clear in the text. -- [[User:WillNess|WillNess]] ([[User talk:WillNess|talk]]) 10:34, 22 August 2016 (UTC)
 
::::::: I've tried it locally, and it reports "234MB memory in use" for 1B and 1187MB for 10B. I don't understand what this 7.8MB figure that Ideone shows for both is; it misled me. -- [[User:WillNess|WillNess]] ([[User talk:WillNess|talk]]) 10:45, 22 August 2016 (UTC)
 
::::::: BTW, the really short version of it is just <code>(Sum c,b) = mconcat [ ( Sum (i+1), ...</code> (see the sketch below). Here it could be argued that it '''''ought''''' to be compiled efficiently, as the whole premise of Monoids is that associativity enables re-parenthesization ''(<code>a:b:c:... = [a]++([b]++([c]++...)) = ([a]++[b])++([c]++...)</code>)'', which is the basis for the efficiency of the strict left fold. Yet it performs more than twice as badly as the <code>prod</code> code, in both time and space.
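::::::: Filled out with a stand-in comprehension body (the original elides it after <code>Sum (i+1)</code>), the shape is:
<lang haskell>
import Data.Monoid (Sum(..))

-- The mconcat shape: each element contributes a count (via the Sum
-- monoid) and a conditionally kept value (via the list monoid); the
-- pair monoid combines both components pointwise.
countAndKeep :: Int -> (Int, [Int])
countAndKeep n = (c, b)
  where
    (Sum c, b) = mconcat [ (Sum (i + 1), [ i | even i ]) | i <- [1 .. n] ]
</lang>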
 
::::::: BTW, I got 1B:0.45s-269MB and 10B:1.75s-1343MB, which suggests 1T:38s-'''28.8GB''' ''"total memory in use"''. The fold-based version indeed did well at 1T:11.2s-'''10MB'''. I use the signature <code>Int -> (Double, (Int,Int,Int))</code>, which produced the fastest code in my tests (simple -O2). -- [[User:WillNess|WillNess]] ([[User talk:WillNess|talk]]) 17:13, 22 August 2016 (UTC)
 
:::::::: Looks good; it's still about three times slower and uses more memory than the 1T:3.75s:3MB "total memory in use" (measured locally) for my version on IdeOne, but it is good that readers will have a choice. I went back to using Word64 (just a slightly higher range than Int) even for 32-bit code, as it is faster since the same type is then used for the internal count. It looks like, for all the Monoid stuff to work as you would like to see it, all kinds of specializations that haven't been added would be needed - complex stuff, and doing "concat" efficiently has always been a headache. --[[User:GordonBGood|GordonBGood]] ([[User talk:GordonBGood|talk]]) 00:50, 23 August 2016 (UTC)
 
== Factors of an integer ==
 
The current algorithms <code>factors_co</code> and <code>factors_o</code> seem to give the wrong result. For example, <code>factors_co 120</code> is missing the number 12 in its result. --[[User:Helge|Helge]] ([[User talk:Helge|talk]]) 13:15, 26 December 2016 (UTC)
 
: Good catch, thanks! Will fix. -- [[User:WillNess|WillNess]] ([[User talk:WillNess|talk]]) 21:06, 26 December 2016 (UTC)
: Done! Thanks again for spotting this. An added bonus: the code simplified a bit, too. -- [[User:WillNess|WillNess]] ([[User talk:WillNess|talk]]) 07:43, 27 December 2016 (UTC)
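: For reference, a minimal correct divisor enumeration by trial division (a generic sketch, not the page's <code>factors_co</code>/<code>factors_o</code> code, which isn't quoted here):
<lang haskell>
-- All divisors of n in ascending order: collect the small divisors up
-- to sqrt n, then mirror each to its cofactor n `div` d (skipping the
-- square root itself when n is a perfect square, to avoid a duplicate).
factors :: Int -> [Int]
factors n = lows ++ highs
  where
    lows  = [ d | d <- [1 .. isqrt n], n `mod` d == 0 ]
    highs = [ n `div` d | d <- reverse lows, d * d /= n ]
    isqrt = floor . (sqrt :: Double -> Double) . fromIntegral
</lang>
: For example, <code>factors 120</code> yields <code>[1,2,3,4,5,6,8,10,12,15,20,24,30,40,60,120]</code>, including the 12 that was missing.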