User talk:WillNess: Difference between revisions

Line 61:

: I've now re-coded foldl' as loops too and got essentially same timings as your entry on Ideone (about 20% speedup; won't edit it in, to keep the clarity of the current version). But, there is still potential for improvement: the k-j enumeration is easily parallelizable: partition the `k`s such that k-j triangle is divided into 4 (8, ...) strips of roughly equal areas, and process each bundle in parallel. Another, really vague idea I once had, can be seen [http://stackoverflow.com/questions/10126285/generating-integers-in-ascending-order-using-a-set-of-prime-numbers/10160054#comment15193650_10126894 here], but it's really vague. :) -- [[User:WillNess|WillNess]] ([[User talk:WillNess|talk]]) 11:07, 20 August 2016 (UTC)

:: I think you'll find that the loops encoding is quite a bit more than a 20% to 30% speed-up when using 64-bit code and Word64 for the internal count computation - more like 200% (at least on my AMD Bulldozer which isn't very efficient at using CPU cache). The reasons for the extreme difference are several: 1) simulated 64 bit operations in 32-bit code are extremely inefficient in current GHC due to the extremely poor register allocations, which means there will be a large number of register spills and reloads to/from RAM or at least CPU cache, and 2) the inner loop consists of fairly simple computations but uses quite a large number of values, including j, jlmt, lb3, hi, w (or lo in my code), q, frac, r, and i (in both integer and float form), which if not optimized properly as to register use require a lot more registers than the x86 architecture provides (even worse in current GHC's view). When each of these are one 64-bit register, the inner loop runs very fast in just 15 CPU clock cycles or so for the loop encoding without the list comprehensions overhead, but this is multiplied when using memory and multiplied yet again when some of the operations need to use 64-bit precision (for counting) by simulating from 32-bit so that the list comprehensions overhead isn't as significant. Using the LLVM backend might help some here even for 32-bit code in more efficient register use, but of course we can't use that on IdeOne (AFAIK). Note that I haven't been able to test LLVM use either as under Windows LLVM doesn't work with 64-bit and I can't get it working with 32-bit for this program (promised to be fixed in GHC version 8.2)... --[[User:GordonBGood|GordonBGood]] ([[User talk:GordonBGood|talk]]) 01:20, 21 August 2016 (UTC)

:: Yes, of course parallelization of the outer loop is easily possible and would speed the code up by a constant factor, but is probably a little much for a RosettaCode posting. --[[User:GordonBGood|GordonBGood]] ([[User talk:GordonBGood|talk]]) 01:20, 21 August 2016 (UTC)