Legendre prime counting function: Difference between revisions

Content deleted Content added
→‎{{header|Nim}}: Add more efficient non-memoized version...
→‎Non-Memoized Version: Add an even faster version by reducing recursion...
Line 740:
let nstrt = getMonoTime()
var pow = 1'i64
for i in 0 .. 9: echo "π(10^", i, ") = ", pow.countPrimes; pow *= 10
echo "π(10^", i, ") = ", pow.countPrimes; pow *= 10
let nelpsd = (getMonoTime() - nstrt).inMilliseconds
echo "This took ", nelpsd, " milliseconds."</syntaxhighlight>
</syntaxhighlight>
{{out}}
<pre>π(10^0) = 0
Line 757 ⟶ 755:
π(10^9) = 50847534
This took 194 milliseconds.</pre>
As shown in the output, the above program runs without memoization about 12 times faster and using a huge amount less memory than the previous memoized algorithm. Other than not requiring memoization, the implementation also substitutes floating point division for integer division so as to be faster for many CPU architectures (although only being of 53-bits accuracy instead of 64-bits; which limited precision will limit the range over which the primes can be accurately counted, but as this algorithm is only reasonably useful to about ten to the fourteenth power anyway, that shouldn't be a problem.
 
'''Using Look Up Table for "TinyPhi" and Decreasing Recursion'''
A further optimization that would greatly reduce the number of operations and thus the run time for these small ranges is to use a pre-computed Look Up Table (LUT) array for the "leaves" where the "degree" is small (the smallest primes used); this would reduce the "width" of the tree at its widest "base" to eliminate many divisions and recursions and thus reduce the execution time greatly, although the effect is more for smaller ranges such as this as for greater ranges.
 
A further optimization that would greatly reducereduces the number of operations and thus the run time for these small ranges is to use a pre-computed "TinyPhi" Look Up Table (LUT) array for the "leaves" where the "degree" is small (such as degree of six for the smallest primes used as 13 as used here, which then forms a repeating wheel pattern of 30030 values); this would reduce the "width" of the tree at its widest "base" to eliminate many divisions and recursions and thus reduce the execution time greatly, although the effect is more for smaller ranges such as this as for greater ranges. As well, one can compute the "tree" starting at the lower left, which then reduces some of the recursions to a simple loop and can eliminate the recursions used for succeeding "ones" according to the optimization eliminating the need for more recursion. This is because "Phi"/φ can be calculated by the given "top down" recursion expression given in the task which is derived from the "bottom up" expression as follows:
For larger ranges, a better algorithm such as that of Legarias, Miller, and Odlyzko (LMO) will perform much better than this and use much less memory, although at the cost of greater complexity and more lines of code (about 500 LOC). There is also a variation of the Legendre algorithm which is to this Legendre algorithm as the LMO algorithm is to the Meissel algorithm (which directly followed and improved the Legendre work), which can be implemented in about 60 LOC and is much faster although it still uses a lot of memory.
:<math>\lfloor x\rfloor - \sum_{i}\left\lfloor\frac{x}{p_i}\right\rfloor + \sum_{i<j} \left\lfloor\frac{x}{p_ip_j}\right\rfloor - \sum_{i<j<k}\left\lfloor\frac{x}{p_ip_jp_k}\right\rfloor + \cdots</math>
(where <math>\lfloor{x}\rfloor</math> denotes the floor function).
 
The following Nim code implements these changes:
<syntaxhighlight lang="nim"># compile with: nim c -d:danger -t:-march=native --gc:arc
 
from std/monotimes import getMonoTime, `-`
from std/times import inMilliseconds
from std/math import sqrt
 
let masks = [ 1'u8, 2, 4, 8, 16, 32, 64, 128 ] # faster than bit twiddling
let masksp = cast[ptr[UncheckedArray[byte]]](unsafeAddr(masks[0]))
 
const TinyPhiPrimes = [2, 3, 5, 7, 11, 13]
const TinyPhiCirc = 3 * 5 * 7 * 11 * 13
const TinyPhiRes = 2 * 4 * 6 * 10 * 12
const CC = 6
proc makeTinyPhiLUT(): array[TinyPhiCirc, uint16] =
for i in 0 .. TinyPhiCirc - 1: result[i] = 1
for i in 1 .. 6:
if result[i] == 0: continue
result[i] = 0; let bp = i + i + 1
let sqri = (i + i) * (i + 1)
for c in countup(sqri, TinyPhiCirc - 1, bp): result[c] = 0
var acc = 0'u16;
for i in 0 .. TinyPhiCirc - 1: acc += result[i]; result[i] = acc
const TinyPhiLUT = makeTinyPhiLUT()
proc tinyPhi(x: int64): int64 {.inline.} =
let ndx = (x - 1) div 2; let numtot = ndx div TinyPhiCirc.int64
return numtot * TinyPhiRes.int64 + TinyPhiLUT[(ndx - numtot * TinyPhiCirc.int64).int].int64
 
proc countPrimes(n: int64): int64 =
if n < 169: # below 169 whose sqrt is 13 is where TinyPhi doesn't work...
if n < 3: return if n < 2: 0 else: 1
# adjust for the missing "degree" base primes
if n <= 13: return (n - 1) div 2 + (if n < 9: 1 else: 0)
return 5 + TinyPhiLUT[(n - 1).int div 2].int64
let rtlmt = n.float64.sqrt.int
let mxndx = (rtlmt - 1) div 2
var cmpsts = cast[ptr[UncheckedArray[byte]]](alloc0((mxndx + 8) div 8))
for i in 1 .. mxndx:
if (cmpsts[i shr 3] and masksp[i and 7]) != 0: continue
let sqri = (i + i) * (i + 1)
if sqri > mxndx: break
let bp = i + i + 1
for c in countup(sqri, mxndx, bp):
let w = c shr 3; cmpsts[w] = cmpsts[w] or masksp[c and 7]
var pisqrt = 0'i64
for i in 0 .. mxndx:
if (cmpsts[i shr 3] and masksp[i and 7]) == 0: pisqrt += 1
let primes = cast[ptr[UncheckedArray[uint32]]](alloc(sizeof(uint32) * pisqrt.int))
var j = 0
for i in 0 .. mxndx:
if (cmpsts[i shr 3] and masksp[i and 7]) == 0: primes[j] = (i + i + 1).uint32; j += 1
var phi = tinyPhi(n)
proc lvl(m, mbf: int64; mxa: int) = # recurse from bottom left of "tree"...
for a in CC .. mxa:
let p = primes[a].int64
if m < p * p: phi += mbf * (mxa - a + 1).int64; return # rest of level all ones!
let nm = (m.float64 / p.float64).int64; phi += mbf * tinyPhi(nm)
if a > CC: lvl(nm, -mbf, a - 1) # split
# finished level!
lvl(n, -1, pisqrt.int - 1); result = phi + pisqrt - 1
cmpsts.dealloc; primes.dealloc
 
let strt = getMonoTime()
var pow = 1'i64
for i in 0 .. 9: echo "π(10^", i, ") = ", pow.countPrimes; pow *= 10
let elpsd = (getMonoTime() - strt).inMilliseconds
echo "This took ", elpsd, " milliseconds."</syntaxhighlight>
The results of the above code are identical to the previous version except it is almost 20 times faster due to the reduced recursion; however, the gain due to the "TinyPhi" LUT is most effective for relatively trivial counting ranges such as this and will have less and less benefit as ranges get larger, although the reduced recursion due to reordering is a constant factor gain across the range.
 
Larger "TinyPhi" LUT's can be used for some added benefit (especially for these smaller ranges) although they quickly get quite huge where the benefit is reduced due to lesser cache associativity; for instance the LUT for a "TinyPhi" of degree eight can a wheel circumference of 9699690 values which can be reduced by a factor of two by only considering odd values and another factor of two using the fact that the LUT is symmetric approaching the middle from the beginning and end of the LUT.
 
For larger ranges, a better algorithm such as that of Legarias, Miller, and Odlyzko (LMO) will perform much better than this and use much less memory, although at the cost of greater complexity and more lines of code (about 500 LOC). There is also a variation of the Legendre algorithm which is to this basic Legendre algorithm as the LMO algorithm is to the Meissel algorithm (which directly followed and improved the Legendre work), and which can be implemented in about 60 LOC andto isbe much faster due to reduced computational complexity although it still uses a lot of memory.
 
=={{header|Perl}}==