# Legendre prime counting function

**Legendre prime counting function**

You are encouraged to solve this task according to the task description, using any language you may know.

The prime-counting function π(n) computes the number of primes not greater than n. Legendre was the first mathematician to create a formula to compute π(n) based on the inclusion/exclusion principle.

To calculate:

Define

φ(x, 0) = x φ(x, a) = φ(x, a−1) − φ(⌊x/p_{a}⌋, a−1), where p_{a}is the a^{th}prime number.

then

π(n) = 0 when n < 2 π(n) = φ(n, a) + a - 1, where a = π(√n), n ≥ 2

The Legendre formula still requires the use of a sieve to enumerate primes; however it's only required to sieve up to the √n, and for counting primes, the Legendre method is generally much faster than sieving up to n.

- Task

Calculate π(n) for values up to 1 billion. Show π(n) for n = 1, 10, 100, ... 10^{9}.

For this task, you may refer to a prime number sieve (such as the Sieve of Eratosthenes or the extensible sieve) in an external library to enumerate the primes required by the formula. Also note that it will be necessary to memoize the results of φ(x, a) in order to have reasonable performance, since the recurrence relation would otherwise take exponential time.

**Comments on Task**

Regarding "it will be necessary to memoize the results of φ(x, a)", it will have exponential time performance without memoization only if a very small optimization that should be obvious is not done: it should be obvious that one can stop "splitting" the φ(x, a) "tree" when 'x' is zero, and even before that since the real meaning of the "phi"/φ function is to produce a count of all of the values greater than zero (including one) up to `x` that have been culled of all multiples of the primes up to and including the `p_{a}` prime value, if `x` is less than or equal to `p_{a}`, then the whole "tree" must result in a value of just one. If this minor (and obvious) optimization is done, the "exponential time" performance goes away, memoization is not absolutely necessary (saving the overhead in time and space of doing the memoization), and the time complexity becomes O(n/(log n^{2}) and the space complexity becomes O(n^{1/2}/log n) as they should be.

This is the problem when non-mathematician programmers blindly apply such a general formula as the recursive Legendre one without doing any work to understand it or even doing some trivial hand calculations to better understand how it works just because they have very powerful computers which mask the limitations: a few minutes of hand calculation would make it obvious that there is no need to "split"/recursively call for "phi" nodes where the first argument is zero, and someone with a mathematics interest would then investigate to see if that limit can be pushed a little further as here to the range of nodes whose result will always be one. Once there is no exponential growth of the number of "nodes", then there is no need for memoization as usually implemented with a hash table at a huge cost of memory overhead and constant time computation per operation.

As to "the Legendre method is generally much faster than sieving up to n.", while the number of operations for the Legendre algorithm is about a factor of `log n` squared less than the number of operations for odds-only Sieve of Eratosthenes (SoE) sieving, those operations are "divide" operations which are generally much slower than the simple array access and addition operations used in sieving and the SoE can be optimized further using wheel factorization so that the required time for a given range can be less for a fully optimized SoE than for the common implementation of the Legendre algorithm; the trade off is that a fully optimized SoE is at least 500 lines of code, whereas the basic version of the Legendre prime counting algorithm is only about 40 to 50 lines of code (depending somewhat on the language used).

Also note that the Legendre prime counting function was never used practically at the time it was invented other than to demonstrate that it would find the count of primes to a trivial range only knowing the primes up to the square root of that range and there were too many operations (especially long integer division operations) to actually use it for any reasonably range even with this optimization (about 250 thousand divisions to count primes to ten million), but the follow-on work by Meissel in the 1800's definitely would have used this optimization and others in order to hand calculate the number of primes to a billion (1e9) in about ten years. Even with this optimization, Meissel would have had to hand calculate over five million divisions, so certainly used other Look Up Tables (LUT's) although certainly not caching of Phi/φ values in order to reduce the work to something possible in this amount of time. A "TinyPhi" LUT table for the first six primes of thirteen and less would have reduced the amount of work Meissel did to about 600 thousand divisions, but even that would have been perhaps too much and it is very likely that he also used "partial sieving" techniques, although that would have meant that as well as a table of the primes up to a million, he would have also needed 161 other tables of that range to a million sieved by the primes up to 13, 17, 19, to 997; however, that extra work in building these tables (which might have been done mechanically) would pay off in reducing the number of divisions to about seven thousand so the divisions become a minor problem possible to do over months and the majority of the time would be spent producing the partial sieving tables up to a million.

The reason that Meissel refined the Legendre method would have been that, even applying all of the optimizations including "partial sieving", he would still have had to do about three and a half million divisions to count the primes to a billion even if the number of primes and "partial sieve tables" only needed to be known to about 32 thousand, where his "Meissel" algorithm reduced the number of divisions to only a few thousand as per the above. Without a computer, he could never have completed the calculation of the number of primes to a billion using an optimized Legendre algorithm where he could using his modification. However, modern computers make (reasonably) quick work of integer divisions so that optimized algorithms of the Legendre type become moderately useful although at the cost of memory use as compared to Meissel type algorithms.

## 11l

```
F primes_up_to_limit(Int limit)
[Int] r
I limit >= 2
r.append(2)
V isprime = [1B] * ((limit - 1) I/ 2)
V sieveend = Int(sqrt(limit))
L(i) 0 .< isprime.len
I isprime[i]
Int p = i * 2 + 3
r.append(p)
I i <= sieveend
L(j) ((p * p - 3) >> 1 .< isprime.len).step(p)
isprime[j] = 0B
R r
V p = primes_up_to_limit(Int(sqrt(1'000'000'000)))
F phi(x, =a)
F phi_cached(x, a)
[(Int, Int) = Int] :cache
I (x, a) C :cache
R :cache[(x, a)]
V r = phi(x, a)
:cache[(x, a)] = r
R r
V res = 0
L
I a == 0 | x == 0
R x + res
a--
res -= phi_cached(x I/ :p[a], a)
F legpi(n)
I n < 2
R 0
V a = legpi(Int(sqrt(n)))
R phi(n, a) + a - 1
L(e) 10
print(‘10^’e‘ ’legpi(10 ^ e))
```

- Output:

10^0 0 10^1 4 10^2 25 10^3 168 10^4 1229 10^5 9592 10^6 78498 10^7 664579 10^8 5761455 10^9 50847534

## C

Using fixed width types so requires C99 or later.

Surprisingly only half as fast as Go on the same machine (GCC -O3 Ubuntu 22.04) for a billion numbers.

However, the position is reversed if we let it run through to a trillion numbers - about 100 ms for C compared to 160 ms for Go.

```
#include <stdio.h>
#include <math.h>
#include <stdlib.h>
#include <stdint.h>
#include <time.h>
const uint8_t masks[8] = {1, 2, 4, 8, 16, 32, 64, 128};
#define half(n) ((int64_t)((n) - 1) >> 1)
#define divide(nm, d) ((uint64_t)((double)nm / (double)d))
int64_t countPrimes(uint64_t n) {
if (n < 9) return (n < 2) ? 0 : ((int64_t)n + 1) / 2;
uint64_t rtlmt = (uint64_t)sqrt((double)n);
int64_t mxndx = (int64_t)((rtlmt - 1) / 2);
int arrlen = (int)(mxndx + 1);
uint32_t *smalls = malloc(arrlen * 4);
uint32_t *roughs = malloc(arrlen * 4);
int64_t *larges = malloc(arrlen * 8);
for (int i = 0; i < arrlen; ++i) {
smalls[i] = (uint32_t)i;
roughs[i] = (uint32_t)(i + i + 1);
larges[i] = (int64_t)((n/(uint64_t)(i + i + 1) - 1) / 2);
}
int cullbuflen = (int)((mxndx + 8) / 8);
uint8_t *cullbuf = calloc(cullbuflen, 1);
int64_t nbps = 0;
int rilmt = arrlen;
for (int64_t i = 1; ; ++i) {
int64_t sqri = (i + i) * (i + 1);
if (sqri > mxndx) break;
if (cullbuf[i >> 3] & masks[i & 7]) continue;
cullbuf[i >> 3] |= masks[i & 7];
uint64_t bp = (uint64_t)(i + i + 1);
for (int64_t c = sqri; c < (int64_t)arrlen; c += (int64_t)bp) {
cullbuf[c >> 3] |= masks[c & 7];
}
int nri = 0;
for (int ori = 0; ori < rilmt; ++ori) {
uint32_t r = roughs[ori];
int64_t rci = (int64_t)(r >> 1);
if (cullbuf[rci >> 3] & masks[rci & 7]) continue;
uint64_t d = (uint64_t)r * bp;
int64_t t = (d <= rtlmt) ? larges[(int64_t)smalls[d >> 1] - nbps] :
(int64_t)smalls[half(divide(n, d))];
larges[nri] = larges[ori] - t + nbps;
roughs[nri] = r;
nri++;
}
int64_t si = mxndx;
for (uint64_t pm = (rtlmt/bp - 1) | 1; pm >= bp; pm -= 2) {
uint32_t c = smalls[pm >> 1];
uint64_t e = (pm * bp) >> 1;
for ( ; si >= (int64_t)e; --si) smalls[si] -= c - (uint32_t)nbps;
}
rilmt = nri;
nbps++;
}
int64_t ans = larges[0] + (int64_t)((rilmt + 2*(nbps - 1)) * (rilmt - 1) / 2);
int ri, sri;
for (ri = 1; ri < rilmt; ++ri) ans -= larges[ri];
for (ri = 1; ; ++ri) {
uint64_t p = (uint64_t)roughs[ri];
uint64_t m = n / p;
int ei = (int)smalls[half((uint64_t)m/p)] - nbps;
if (ei <= ri) break;
ans -= (int64_t)((ei - ri) * (nbps + ri - 1));
for (sri = ri + 1; sri < ei + 1; ++sri) {
ans += (int64_t)smalls[half(divide(m, (uint64_t)roughs[sri]))];
}
}
free(smalls);
free(roughs);
free(larges);
free(cullbuf);
return ans + 1;
}
int main() {
uint64_t n;
int i;
clock_t start = clock();
for (i = 0, n = 1; i < 10; ++i, n *= 10) {
printf("10^%d %ld\n", i, countPrimes(n));
}
clock_t end = clock();
printf("\nTook %f seconds\n", (double) (end - start) / CLOCKS_PER_SEC);
return 0;
}
```

- Output:

10^0 0 10^1 4 10^2 25 10^3 168 10^4 1229 10^5 9592 10^6 78498 10^7 664579 10^8 5761455 10^9 50847534 Took 0.003843 seconds

## C++

```
#include <cmath>
#include <iostream>
#include <vector>
std::vector<int> generate_primes(int limit) {
std::vector<bool> sieve(limit >> 1, true);
for (int p = 3, s = 9; s < limit; p += 2) {
if (sieve[p >> 1]) {
for (int q = s; q < limit; q += p << 1)
sieve[q >> 1] = false;
}
s += (p + 1) << 2;
}
std::vector<int> primes;
if (limit > 2)
primes.push_back(2);
for (int i = 1; i < sieve.size(); ++i) {
if (sieve[i])
primes.push_back((i << 1) + 1);
}
return primes;
}
class legendre_prime_counter {
public:
explicit legendre_prime_counter(int limit);
int prime_count(int n);
private:
int phi(int x, int a);
std::vector<int> primes;
};
legendre_prime_counter::legendre_prime_counter(int limit) :
primes(generate_primes(static_cast<int>(std::sqrt(limit)))) {}
int legendre_prime_counter::prime_count(int n) {
if (n < 2)
return 0;
int a = prime_count(static_cast<int>(std::sqrt(n)));
return phi(n, a) + a - 1;
}
int legendre_prime_counter::phi(int x, int a) {
if (a == 0)
return x;
if (a == 1)
return x - (x >> 1);
int pa = primes[a - 1];
if (x <= pa)
return 1;
return phi(x, a - 1) - phi(x / pa, a - 1);
}
int main() {
legendre_prime_counter counter(1000000000);
for (int i = 0, n = 1; i < 10; ++i, n *= 10)
std::cout << "10^" << i << "\t" << counter.prime_count(n) << '\n';
}
```

- Output:

10^0 0 10^1 4 10^2 25 10^3 168 10^4 1229 10^5 9592 10^6 78498 10^7 664579 10^8 5761455 10^9 50847534

## Chapel

There doesn't seem to be much point to contributing memoized (Hash table) versions as they seem to be the result of the task author not realizing that there is a simple way of keeping the computation complexity from having exponential growth with counting range by limiting the "splitting" of the "phi" calculation when the result must be just one. Accordingly, the following Chapel versions are [translated and improved from the non-memoizing Nim versions](https://rosettacode.org/wiki/Legendre_prime_counting_function#Non-Memoized_Versions), which explanations can be referenced for an understanding of how they work:

**The Basic Recursive Legendre Algorithm**

```
// compile with --fast for maximum speed...
use Time;
proc countPrimes(lmt: uint(64)): int(64) {
if lmt < 9 { // when there are no odd primes less than square root...
if lmt < 3 { if lmt < 2 { return 0; } else { return 1; } }
return (lmt - (lmt >> 1)): int(64);
}
// Chapel doesn't have closures, so emulate them with a class...
class LegendrePi {
var n: uint(64);
var dom: domain(1);
var oprms: [dom] uint(32);
proc init(n: uint(64)) {
// first, an array of odd primes to the square root of n is generated...
this.n = n;
const sqrtn = sqrt(n: real(64)): int(64);
const rtlmt = (sqrtn - 3) / 2; this.dom = {0 .. rtlmt};
this.oprms = 0;
for i in 0 .. rtlmt do this.oprms[i] = (i + i + 3): uint(32);
var i = 0;
for i in (0 ..) { // cull the array
var ci = (i + i) * (i + 3) + 3; if ci > rtlmt { break; }
const bp = i + i + 3;
while (ci <= rtlmt) { this.oprms[ci] = 0; ci += bp; }
}
var psz = 0;
for ti in 0 .. rtlmt { // compress the odd primes array...
const tv = this.oprms[ti];
if tv != 0 { this.oprms[psz] = tv; psz += 1; }
}
this.dom = { 0 ..< psz };
}
proc phi(x: uint(64), a: int): int(64) {
if a <= 0 { return (x - (x >> 1)): int(64); } // take care of prime of 2
const na = a - 1; const p = this.oprms[na]: uint(64);
if x <= p { return 1: int(64); }
return phi(x, na) - phi(x / p, na);
}
proc this(): int(64) {
return phi(n, this.oprms.size) + this.oprms.size: int(64);
}
}
return (new LegendrePi(lmt))();
}
proc main() {
var timer: Timer;
timer.start();
for i in 0 .. 9 {
writeln("π(10**", i, ") = ", countPrimesx(10: uint(64) ** i));
}
timer.stop();
writeln("This took ", timer.elapsed(TimeUnits.milliseconds), " milliseconds.");
}
```

- Output:

π(10**0) = 0 π(10**1) = 4 π(10**2) = 25 π(10**3) = 168 π(10**4) = 1229 π(10**5) = 9592 π(10**6) = 78498 π(10**7) = 664579 π(10**8) = 5761455 π(10**9) = 50847534 This took 276.431 milliseconds.

The time is as run on an Intel i5-6500 (3.6 GHz single-threaded boost).

Although this is about ten times faster than if it were using memoization (and uses much less memory), it is only reasonably usable for trivial ranges such as this (trivial in terms of prime counting functions).

**The Less Recursive "Bottom-Up" with a TinyPhi LUT Algorithm**

Just substitute the following Chapel code for the `countPrimes` function above to enjoy the gain in speed:

```
// tiny Phi Look Up for `a` of small degree...
const tinyPhiPrimes = [ 2, 3, 5, 7, 11, 13 ]; // degree six
const cC = tinyPhiPrimes.size - 1;
proc product(a: [] int): int {
var acc = 1; for v in a { acc *= v; }; return acc >> 1; }
const tinyPhiOddCirc = product(tinyPhiPrimes);
proc tot(a: [] int): int {
var acc = 1; for v in a { acc *= v - 1; }; return acc; }
const tinyPhiOddTot = tot(tinyPhiPrimes);
proc makeTinyLUT(ps: [] int, sz: int): [] uint(32) {
var arr: [0 .. sz - 1] uint(32) = 1;
for p in ps {
if p <= 2 { continue; }
arr[p >> 1] = 0;
for c in ((p * p) >> 1) ..< sz by p { arr[c] = 0; }
}
var acc = 0: uint(32);
for i in 0 ..< sz { acc += arr[i]; arr[i] = acc; }
return arr;
}
const tinyPhiLUT = makeTinyLUT(tinyPhiPrimes, tinyPhiOddCirc);
inline proc tinyPhi(x: uint(64)): int(64) {
const ndx = (x - 1) >> 1; const numtot = ndx / tinyPhiOddCirc: uint(64);
return (numtot * tinyPhiOddTot +
tinyPhiLUT[(ndx - numtot * tinyPhiOddCirc): int]): int(64);
}
proc countPrimes(lmt: uint(64)): int(64) {
if lmt < 169 { // below 169 whose sqrt is 13 is where TinyPhi doesn't work...
if lmt < 3 { if lmt < 2 { return 0; } else { return 1; } }
// adjust for the missing "degree" base primes
if lmt <= 13 {
return ((lmt - 1): int(64) >> 1) + (if (lmt < 9) then 1 else 0); }
return 5 + tinyPhiLUT[(lmt - 1): int >> 1]: int(64);
}
// Chapel doesn't have closures, so emulate them with a class...
class LegendrePi {
var n: uint(64);
var dom: domain(1);
var oprms: [dom] uint(32);
proc init(n: uint(64)) {
// first, an array of odd primes to the square root of n is generated...
this.n = n;
const sqrtn = sqrt(n: real(64)): int(64);
const rtlmt = (sqrtn - 3) / 2; this.dom = {0 .. rtlmt};
this.oprms = 0;
for i in 0 .. rtlmt do this.oprms[i] = (i + i + 3): uint(32);
var i = 0;
for i in (0 ..) { // cull the array
var ci = (i + i) * (i + 3) + 3; if ci > rtlmt { break; }
const bp = i + i + 3;
while (ci <= rtlmt) { this.oprms[ci] = 0; ci += bp; }
}
var psz = 0;
for ti in 0 .. rtlmt { // compress the odd primes array...
const tv = this.oprms[ti];
if tv != 0 { this.oprms[psz] = tv; psz += 1; }
}
this.dom = { 0 ..< psz };
}
proc lvl(pilmt: int, m: uint(64)): int(64) {
var acc = 0: int(64);
for pi in cC ..< pilmt {
const p = this.oprms[pi]: uint(64); const nm = m * p;
if this.n <= nm * p { return acc + (pilmt - pi); }
if pi > cC { acc -= this.lvl(pi, nm); }
const q = this.n / nm; acc += tinyPhi(q);
}
return acc;
}
proc this(): int(64) {
return tinyPhi(this.n) - this.lvl(this.oprms.size, 1)
+ this.oprms.size: int(64);
}
}
return (new LegendrePi(lmt))();
}
```

Use of the above function gets a gain in speed of about a further ten times over the above version due to less recursion and the use of the "Tiny_Phi" Look Up Table (LUT) for smaller "degrees" of primes which greatly reduces the number of integer divisions. This version of prime counting function might be considered to be reasonably useful for counting primes up to a trillion (1e12) in a few tens of seconds.

**The Non-Recursive Partial Sieving Algorithm**

Just substitute the following Chapel code for the `countPrimes` function above to enjoy the further gain in speed:

```
const masks = for i in 0 .. 7 do (1 << i): uint(8); // faster bit twiddling
proc countPrimes(lmt: uint(64)): int(64) {
if lmt < 3 { if lmt < 2 { return 0; } else { return 1; } } // odds only!
inline proc half(x: int): int { return (x - 1) >> 1; } // convenience function
inline proc divide(nm: uint(64), d: uint(64)): int {
return (nm: real(64) / d: real(64)): int; } // floating point div faster
const sqrtn = sqrt(lmt: real(64)): uint(64);
const mxndx = (sqrtn - 1): int / 2;
const dom = {0 .. mxndx}; const csz = (mxndx + 8) / 8;
var smalls = for i in dom do i: uint(32);
var roughs = for i in dom do (i + i + 1): uint(32);
var larges = for i in dom do ((lmt / (i + i + 1)) - 1) >> 1;
var cullbuf: [0 ..< csz] uint(8);
// partial sieve loop, adjusting larges/smalls, compressing larges/roughs...
var nobps = 0; var rilmt = mxndx;
for bp in 3: uint(64) .. by 2 {
const i = (bp >> 1): int; const sqri = (i + i) * (i + 1);
if sqri > mxndx { break; } // up to quad root of counting range
if (cullbuf[i >> 3] & masks[i & 7]) != 0 { continue; } // loop not prime
cullbuf[i >> 3] |= masks[i & 7]; // cull bp itself as not a rough
for ci in sqri .. mxndx by bp { // do partial sieving pass for `bp`...
cullbuf[ci >> 3] |= masks[ci & 7]; } // cull all multiples of `bp`
// now adjust `larges` for latest partial sieve pass...
var ori = 0; // compress input rough index to output one
for iri in 0 .. rilmt {
const r = roughs[iri]: uint(64); const rci = (r >> 1): int;
if (cullbuf[rci >> 3] & masks[rci & 7]) != 0 {
continue; } // skip culled roughs in last partial sieving pass
const d = bp: uint(64) * r;
larges[ori] = larges[iri] -
(if d <= sqrtn then
larges[smalls[(d >> 1): int] - nobps]
else smalls[half(divide(lmt, d))]: uint(64)) + nobps;
roughs[ori] = r: uint(32); ori += 1;
}
var si = mxndx; // and adjust `smalls` for latest partial sieve pass...
for bpm in bp .. (sqrtn / bp - 1) | 1 by -2 {
const c = smalls[(bpm >> 1): int] - nobps: uint(32);
const e = ((bpm * bp) >> 1): int;
while si >= e { smalls[si] -= c; si -= 1; }
}
nobps += 1; rilmt = ori - 1;
}
var ans = larges[0]; // answer from larges, adjusting for over subtraction...
for i in 1 .. rilmt { ans -= larges[i]; } // combine!
ans += (rilmt + 1 + 2 * (nobps - 1)) * rilmt / 2; // adjust!
// add final adjustment for pairs of current roughs to cube root of range...
for ri in (1 ..) { // break when reaches cube root of counting range...
const p = roughs[ri]: uint(64); const q = lmt / p;
const ei = smalls[half(divide(q, p))]: int - nobps;
if ei <= ri { break; } // break here when no more pairs!
for ori in ri + 1 .. ei { // for all pairs never the same prime!
ans += smalls[half(divide(q, roughs[ori]))]: int(64); }
// adjust for over subtractions above...
ans -= (ei - ri): uint(64) * (nobps: uint(64) + ri: uint(64) - 1);
}
return ans: int(64) + 1; // add one for only even prime of two!
}
```

The above function enjoys quite a large gain in speed of about a further ten times over the immediately preceding version due to not using recursion at all and the greatly reduced computational complexity of O(n**(3/4)/((log n)**2)) instead of the almost-linear-for-large-counting-ranges O(n/((log n)**2)) for the previous two versions, although the previous two versions differ in performance by a constant factor due to the overheads of recursion and division; for instance, this version can calculate the number of primes to 1e11 in about 22 milliseconds. This version of prime counting function might be considered to be reasonably useful for counting primes up to a 1e16 in under a minute as long as the computer used has the required free memory of about 800 Megabytes. This Chapel version is about the same speed as the Nim version from which it was translated other than the LLVM back-end used here produces code that is a few percent slower than as that produced by the GCC compiler used as a back-end by Nim, at least for this algorithm.

## CoffeeScript

```
sorenson = require('sieve').primes # Sorenson's extensible sieve from task: Extensible Prime Generator
# put in outer scope to avoid recomputing the cache
memoPhi = {}
primes = []
isqrt = (x) -> Math.floor Math.sqrt x
pi = (n) ->
phi = (x, a) ->
y = memoPhi[[x,a]]
return y unless y is undefined
memoPhi[[x,a]] =
if a is 0 then x
else
p = primes[a - 1]
throw "You need to generate at least #{a} primes." if p is undefined
phi(x, a - 1) - phi(x // p, a - 1)
if n < 2
0
else
a = pi isqrt n
phi(n, a) + a - 1
maxPi = 1e9
gen = sorenson()
primes = while (p = gen.next().value) < isqrt maxPi then p
n = 1
for i in [0..9]
console.log "10^#{i}\t#{pi(n)}"
n *= 10
```

- Output:

10^0 0 10^1 4 10^2 25 10^3 168 10^4 1229 10^5 9592 10^6 78498 10^7 664579 10^8 5761455 10^9 50847534

## Crystal

There doesn't seem to be much point to contributing memoized (Hash table) versions as they seem to be the result of the task author not realizing that there is a simple way of keeping the computation complexity from having exponential growth with counting range by limiting the "splitting" of the "phi" calculation when the result must be just one. Accordingly, the following Crystal versions are [translated from the non-memoizing Nim versions](https://rosettacode.org/wiki/Legendre_prime_counting_function#Non-Memoized_Versions), which explanations can be referenced for an understanding of how they work:

**The Basic Recursive Legendre Algorithm**

```
require "bit_array"
def count_primes(n : Int64)
if n < 3_i64
return 0_i64 if n < 2_i64
return 1_i64
end
rtlmt = Math.sqrt(n.to_f64).to_i32
mxndx = (rtlmt - 3) // 2
cmpsts = BitArray.new(mxndx + 1)
i = 0
while true
c = (i + i) * (i + 3) + 3
break if c > mxndx
unless cmpsts[i]
bp = i + i + 3
until c > mxndx
cmpsts[c] = true
c += bp
end
end
i += 1
end
oprms = Array(Int32).new(cmpsts.count { |e| !e }, 0)
pi = 0
cmpsts.each_with_index do |e, i|
unless e
oprms[pi] = (i + i + 3).to_i32; pi += 1
end
end
phi = uninitialized Proc(Int64, Int32, Int64) # recursion target!
phi = ->(x : Int64, a : Int32) {
return x - (x >> 1) if a < 1
p = oprms.unsafe_fetch(a - 1)
return 1_i64 if x <= p
phi.call(x, a - 1) - phi.call((x.to_f64 / p.to_f64).to_i64, a - 1)
}
phi.call(n, oprms.size) + oprms.size
end
start_time = Time.monotonic
(0 .. 9).each { |i| puts "π(10**#{i}) = #{count_primes(10_i64**i)}" }
elpsd = (Time.monotonic - start_time).total_milliseconds
puts "This took #{elpsd} milliseconds."
```

- Output:

π(10**0) = 0 π(10**1) = 4 π(10**2) = 25 π(10**3) = 168 π(10**4) = 1229 π(10**5) = 9592 π(10**6) = 78498 π(10**7) = 664579 π(10**8) = 5761455 π(10**9) = 50847534 This took 272.428954 milliseconds.

The time is as run on an Intel i5-6500 (3.6 GHz single-threaded boost).

Although this is about ten times faster than if it were using memoization (and uses much less memory), it is only reasonably usable for trivial ranges such as this (trivial in terms of prime counting functions).

**The Less Recursive "Bottom-Up" with a TinyPhi LUT Algorithm**

Just substitute the following Crystal code for the `count_primes` function above to enjoy the gain in speed:

```
Tiny_Phi_Primes = [ 2, 3, 5, 7, 11, 13 ]
Tiny_Phi_Odd_Circ = Tiny_Phi_Primes.product // 2
Tiny_Phi_Tot = Tiny_Phi_Primes.reduce(1) { |acc, p| acc * (p - 1) }
CC = Tiny_Phi_Primes.size - 1
def make_Tiny_Phi_LUT()
rslt = Array(UInt16).new(Tiny_Phi_Odd_Circ, 1_u16)
Tiny_Phi_Primes.skip(1).each { |bp|
i = (bp - 1) >> 1; rslt[i] = 0; c = (i + i) * (i + 1)
while c < Tiny_Phi_Odd_Circ
rslt[c] = 0; c += bp
end }
acc = 0_u16; i = 0
while i < Tiny_Phi_Odd_Circ
acc += rslt[i]; rslt[i] = acc; i += 1
end
rslt
end
Tiny_Phi_LUT = make_Tiny_Phi_LUT()
@[AlwaysInline]
def tiny_Phi(x : Int64) : Int64
ndx = (x - 1) >> 1; numtot = ndx // Tiny_Phi_Odd_Circ.to_i64
tpli = ndx - numtot * Tiny_Phi_Odd_Circ.to_i64
numtot * Tiny_Phi_Tot.to_i64 +
Tiny_Phi_LUT.unsafe_fetch(tpli).to_i64
end
def count_primes(n : Int64)
if n < 169_i64 # below 169 whose sqrt is 13 is where TinyPhi doesn't work...
return 0_i64 if n < 2_i64
return 1_i64 if n < 3_i64
# adjust for the missing "degree" base primes
return 1 + (n - 1) // 2 if n < 9_i64
return (n - 1) // 2 if n <= 13_i64
return 5 + Tiny_Phi_LUT[(n - 1).to_i32 // 2].to_i64
end
rtlmt = Math.sqrt(n.to_f64).to_i32
mxndx = (rtlmt - 3) // 2
cmpsts = BitArray.new(mxndx + 1)
i = 0
while true
c = (i + i) * (i + 3) + 3
break if c > mxndx
unless cmpsts[i]
bp = i + i + 3
until c > mxndx
cmpsts[c] = true
c += bp
end
end
i += 1
end
oprms = Array(Int32).new(cmpsts.count { |e| !e }, 0)
opi = 0
cmpsts.each_with_index do |e, i|
unless e
oprms[opi] = (i + i + 3).to_i32; opi += 1
end
end
lvl = uninitialized Proc(Int32, Int32, Int64, Int64) # recursion target!
lvl = ->(pilo : Int32, pilmt : Int32, m : Int64) : Int64 {
pi = pilo; answr = 0_i64
while pi < pilmt
p = oprms.unsafe_fetch(pi).to_i64; nm = p * m
return answr + (pilmt - pi) if n <= nm * p
q = (n.to_f64 / nm.to_f64).to_i64; answr += tiny_Phi(q)
answr -= lvl.call(CC, pi, nm) if pi > CC
pi += 1
end
answr
}
tiny_Phi(n) - lvl.call(CC, oprms.size, 1_i64) + oprms.size
end
```

Use of the above function gets a gain in speed of about a further ten times over the above version due to less recursion, the use of the "Tiny_Phi" Look Up Table (LUT) for smaller "degrees" of primes which greatly reduces the number of integer divisions, and by using floating point division for integer division because it is faster, although it is limited in precision to 53-bits, one wouldn't want to use these algorithms over a range where that would cause a problem. This version of prime counting function might be considered to be reasonably useful for counting primes up to a trillion (1e12) in a few tens of seconds.

**The Non-Recursive Partial Sieving Algorithm**

Just substitute the following Crystal code for the `count_primes` function above to enjoy the further gain in speed:

```
def count_primes(n : Int64) : Int64
if n < 3
if n < 2
return 0_i64
else
return 1_i64
end
end
half = ->(n : Int64) : Int64 { (n - 1) >> 1 }
divide = ->(n : Int64, d : Int64) : Int64 { (n.to_f64 / d.to_f64).to_i64 }
rtlmt = Math.sqrt(n.to_f64).to_i32; mxndx = (rtlmt - 1) // 2
cmpsts = BitArray.new(mxndx + 1)
smalls = Array(Int32).new(mxndx + 1) { |i| i }
roughs = Array(Int32).new(mxndx + 1) { |i| i + i + 1 }
larges =
Array(Int64).new(mxndx + 1) { |i| ((n // (i + i + 1)).to_i64 - 1) >> 1 }
i = 1; nbps = 0; mxri = mxndx
while true
c = (i + i) * (i + 1); break if c > mxndx
if !cmpsts.unsafe_fetch(i)
bp = i + i + 1; cmpsts.unsafe_put(i, true)
until c > mxndx
cmpsts.unsafe_put(c, true); c += bp
end # partial sieving for bp completed here!
j = 0; ri = 0 # adjust `larges` according to partial sieve...
while j <= mxri
q = roughs.unsafe_fetch(j); qi = q >> 1
if !cmpsts.unsafe_fetch(qi)
d = bp.to_i64 * q.to_i64
larges.unsafe_put(ri, larges.unsafe_fetch(j) -
if d <= rtlmt.to_i64
ndx = smalls.unsafe_fetch(d >> 1) - nbps
larges.unsafe_fetch(ndx)
else
ndx = half.call(divide.call(n, d))
smalls.unsafe_fetch(ndx)
end + nbps)
roughs.unsafe_put(ri, q); ri += 1
end; j += 1
end
si = mxndx; bpm = (rtlmt // bp - 1) | 1
while bpm >= bp # adjust smalls according to partial sieve...
c = smalls.unsafe_fetch(bpm >> 1) - nbps; e = (bpm * bp) >> 1
while si >= e
smalls.unsafe_put(si, smalls.unsafe_fetch(si) - c); si -= 1
end
bpm -= 2
end
mxri = ri - 1; nbps += 1
end; i += 1
end
ans = larges.unsafe_fetch(0); i = 1
while i <= mxri # combine results; adjust for over subtraction base primes...
ans -= larges.unsafe_fetch(i); i += 1
end
ans += (mxri.to_i64 + 1 + 2 * (nbps.to_i64 - 1)) * mxri.to_i64 // 2 # adjust!
ri = 1 # do final phi calculation for pairs of larger primes...
while true # break on condition when up to cube root of range!
p = roughs.unsafe_fetch(ri).to_i64; q = n // p
e = smalls.unsafe_fetch(half.call(divide.call(q, p))) - nbps
break if e <= ri; ori = ri + 1
while ori <= e
ndx = half.call(divide.call(q, roughs.unsafe_fetch(ori).to_i64))
ans += smalls.unsafe_fetch(ndx).to_i64; ori += 1
end
ans -= (e - ri).to_i64 * (nbps.to_i64 + ri.to_i64 - 1); ri += 1
end
ans + 1 # for only even prime of two!
end
```

The above function enjoys quite a large gain in speed of about a further ten times over the immediately preceding version due to not using recursion at all and the greatly reduced computational complexity of O(n**(3/4)/((log n)**2)) instead of the almost-linear-for-large-counting-ranges O(n/((log n)**2)) for the previous two versions, although the previous two versions differ in performance by a constant factor due to the overheads of recursion and division. This version of prime counting function might be considered to be reasonably useful for counting primes up to a 1e16 in almost two minutes as long as the computer used has the required free memory of about 800 Megabytes. This Crystal version is about twice as slow as the Nim version from which it was translated because Nim's default GCC back-end is better at optimization (at least for this algorithm) than the LLVM back-end used by Crystal and also there is more numeric calculation checking such as numeric overflow used in Crystal that can't be turned off. At that, this version is about the same speed as when translated to Rust, which also uses a LLVM back-end.

## Elm

Currently, Elm does not have linear arrays and has no mutation whatsoever, even array content mutation protected by "monadic function chains" as for Haskell, so many algorithms can't be easily implemented such as the Legendre Partial Sieving algorithm which requires random access to linear arrays whose contents are mutated according to previous passes. However, given a prime number generator based on persistent data structures such as the List version [the Sieve of Eratosthenes task page], one can translate the algorithm of the "bottom-up" recursive version without the "TinyPhi" Look Up Table (LUT) optmization which would need a linear array to implement, as in the following code:

```
module Main exposing (main)
import Browser exposing (element)
import Task exposing (Task, succeed, perform, andThen)
import Html exposing (div, text)
import Time exposing (now, posixToMillis)
type CIS a = CIS a (() -> CIS a) -- infinite Co-Inductive Stream...
uptoCIS2List : comparable -> CIS comparable -> List comparable
uptoCIS2List n cis =
let loop (CIS hd tl) lst =
if hd > n then List.reverse lst
else loop (tl()) (hd :: lst)
in loop cis []
-- require a range of primes by Sieve of Eratosthenes...
primesTreeFolding : () -> CIS Int
primesTreeFolding() =
let
merge (CIS x xtl as xs) (CIS y ytl as ys) =
case compare x y of
LT -> CIS x <| \ () -> merge (xtl()) ys
EQ -> CIS x <| \ () -> merge (xtl()) (ytl())
GT -> CIS y <| \ () -> merge xs (ytl())
pmult bp =
let adv = bp + bp
pmlt p = CIS p <| \ () -> pmlt (p + adv)
in pmlt (bp * bp)
allmlts (CIS bp bptl) =
CIS (pmult bp) <| \ () -> allmlts (bptl())
pairs (CIS frst tls) =
let (CIS scnd tlss) = tls()
in CIS (merge frst scnd) <| \ () -> pairs (tlss())
cmpsts (CIS (CIS hd tl) tls) =
CIS hd <| \ () -> merge (tl()) <| cmpsts <| pairs (tls())
testprm n (CIS hd tl as cs) =
if n < hd then CIS n <| \ () -> testprm (n + 2) cs
else testprm (n + 2) (tl())
oddprms() =
CIS 3 <| \ () -> testprm 5 <| cmpsts <| allmlts <| oddprms()
in CIS 2 <| \ () -> oddprms()
countPrimesTo : Float -> Float -- only use integral values!
countPrimesTo n =
if n < 3 then if n < 2 then 0 else 1 else
let nnf = toFloat (floor n) -- erase fractional part!
sqrtn = sqrt nnf |> truncate
oprms = primesTreeFolding() |> uptoCIS2List sqrtn |> List.drop 1
opsz = List.length oprms
lvl opi opilmt plst m acc =
if opi >= opilmt then acc else
case plst of
[] -> acc -- should never happen
(op :: optl) ->
let opl = toFloat op
nm = m * opl in
if nnf <= nm * opl then acc + toFloat (opilmt - opi) else
let q = nnf / nm |> floor |> toFloat
nacc = acc + q - toFloat (floor (q / 2))
sacc = if opi <= 0 then 0 else lvl 0 opi oprms nm 0
in lvl (opi + 1) opilmt optl m (nacc - sacc)
in nnf - toFloat (floor (nnf / 2)) - lvl 0 opsz oprms 1 0 + toFloat opsz
-- run the required task tests...
timemillis : () -> Task Never Int -- a side effect function
timemillis() = now |> andThen (\ t -> succeed (posixToMillis t))
test : () -> Cmd Msg -- side effect function chain (includes "perform")...
test() =
timemillis()
|> andThen (\ strt ->
let rsltstrs = List.range 0 9 |> List.map ( \ n ->
"π(10^" ++ String.fromInt n ++ ") = " ++
String.fromFloat (countPrimesTo (toFloat (10^n))))
in timemillis()
|> andThen (\ stop ->
succeed (List.append rsltstrs ["This took "
++ String.fromInt (stop - strt)
++ " milliseconds."])))
|> perform Done
-- following code has to do with outputting to a web page using MUV/TEA...
type alias Model = List String
type Msg = Done Model
main : Program () Model Msg
main = -- starts with empty list of strings; views model of filled list...
element { init = \ _ -> ( [], test() )
, update = \ (Done mdl) _ -> ( mdl , Cmd.none )
, subscriptions = \ _ -> Sub.none
, view = div [] << List.map (div [] << List.singleton << text) }
```

- Output:

π(10^0) = 0 π(10^1) = 4 π(10^2) = 25 π(10^3) = 168 π(10^4) = 1229 π(10^5) = 9592 π(10^6) = 78498 π(10^7) = 664579 π(10^8) = 5761455 π(10^9) = 50847534 This took 341 milliseconds.

## Erlang

```
-mode(native).
-define(LIMIT, 1000000000).
main(_) ->
put(primes, array:from_list(primality:sieve(floor(math:sqrt(?LIMIT))))),
ets:new(memphi, [set, named_table, protected]),
output(0, 9).
nthprime(N) -> array:get(N - 1, get(primes)).
output(A, B) -> output(A, B, 1).
output(A, B, _) when A > B -> ok;
output(A, B, N) ->
io:format("10^~b ~b~n", [A, pi(N)]),
output(A + 1, B, N * 10).
pi(N) ->
Primes = get(primes),
Last = array:get(array:size(Primes) - 1, Primes),
if
N =< Last -> small_pi(N);
true ->
A = pi(floor(math:sqrt(N))),
phi(N, A) + A - 1
end.
phi(X, 0) -> X;
phi(X, A) ->
case ets:lookup(memphi, {X, A}) of
[] ->
Phi = phi(X, A-1) - phi(X div nthprime(A), A-1),
ets:insert(memphi, {{X, A}, Phi}),
Phi;
[{{X, A}, Phi}] -> Phi
end.
% Use binary search to count primes that we have already listed.
small_pi(N) ->
Primes = get(primes),
small_pi(N, Primes, 0, array:size(Primes)).
small_pi(_, _, L, H) when L >= (H - 1) -> L + 1;
small_pi(N, Primes, L, H) ->
M = (L + H) div 2,
P = array:get(M, Primes),
if
N > P -> small_pi(N, Primes, M, H);
N < P -> small_pi(N, Primes, 0, M);
true -> M + 1
end.
```

- Output:

10^0 1 10^1 4 10^2 25 10^3 168 10^4 1229 10^5 9592 10^6 78498 10^7 664579 10^8 5761455 10^9 50847534

## F#

There doesn't seem to be much point to contributing memoized (Hash table) versions as they seem to be the result of the task author not realizing that there is a simple way of keeping the computation complexity from having exponential growth with counting range by limiting the "splitting" of the "phi" calculation when the result must be just one. Accordingly, the following F# versions are [translated from the non-memoizing Nim versions](https://rosettacode.org/wiki/Legendre_prime_counting_function#Non-Memoized_Versions), which explanations can be referenced for an understanding of how they work:

**The Basic Recursive Legendre Algorithm**

```
let countPrimes (lmt: uint64) =
if lmt < 3UL then (if lmt < 2UL then 0L else 1L) else
let sqrtlmt = lmt |> float |> sqrt |> uint64
let mxndx = (sqrtlmt - 3UL) / 2UL |> int
let oprms =
let cb = Array.init (mxndx + 1) <| fun i -> uint32 (i + i + 3)
let rec loopi i =
let sqri = (i + i) * (i + 3) + 3
if sqri > mxndx then () else
if cb.[i] = 0u then loopi (i + 1) else
let bp = i + i + 3
let rec cull c = if c > mxndx then () else cb.[c] <- 0u; cull (c + bp)
cull sqri; loopi (i + 1)
loopi 0; cb |> Array.filter ((<>) 0u)
let rec phi x a =
if a <= 0 then x - (x >>> 1) |> int64 else
let na = a - 1 in let p = uint64 oprms.[na]
if x <= p then 1L else phi x na - phi (x / p) na
phi lmt oprms.Length + int64 oprms.Length
let strt = System.DateTime.Now.Ticks
{ 0 .. 9 } |> Seq.iter (fun i ->
printfn "π(10**%d) = %d" i (countPrimes (uint64(10. ** i))))
let elpsd = (System.DateTime.Now.Ticks - strt) / 10000L
printfn "This took %d milliseconds." elpsd
```

- Output:

π(10**0) = 0 π(10**1) = 4 π(10**2) = 25 π(10**3) = 168 π(10**4) = 1229 π(10**5) = 9592 π(10**6) = 78498 π(10**7) = 664579 π(10**8) = 5761455 π(10**9) = 50847534 This took 525 milliseconds.

The above time as run on an Intel i5-6500 (3.6 GHz single-threaded boost) isn't particularly fast but includes some DotNet initialization and JIT compilation overhead, and is about as fast as a highly optimized page-segmented wheel-factorized Sieve of Eratosthenes; Although this is about ten times faster than if it were using memoization (and uses much less memory), it is only reasonably usable for trivial ranges such as this (trivial in terms of prime counting functions).

**The Less Recursive "Bottom-Up" with a TinyPhi LUT Algorithm**

Just substitute the following F# code for the `countPrimes` function above to enjoy the gain in speed:

```
let TinyPhiPrimes = [| 2; 3; 5; 7; 11; 13 |]
let TinyPhiDeg = TinyPhiPrimes.Length - 1
let TinyPhiOddCirc = (TinyPhiPrimes |> Seq.reduce (*)) / 2
let TinyPhiTot = TinyPhiPrimes |> Seq.fold (fun s p -> s * (p - 1)) 1
let TinyPhiLUT =
let cb = Array.init TinyPhiOddCirc (fun i -> 1u)
TinyPhiPrimes |> Seq.skip 1
|> Seq.iter (fun bp ->
cb.[(bp - 1) >>> 1] <- 0u
{ (bp * bp - 1) >>> 1 .. bp .. TinyPhiOddCirc - 1}
|> Seq.iter (fun c -> cb.[c] <- 0u) )
let rec loopi i acc =
if i >= TinyPhiOddCirc then () else
let nacc = acc + cb[i] in cb.[i] <- nacc; loopi (i + 1) nacc
loopi 0 0u; cb
let tinyPhi (x: uint64): int64 =
let ndx = (x - 1UL) >>> 1 |> int64
let numtots = ndx / int64 TinyPhiOddCirc
let li = ndx - numtots * int64 TinyPhiOddCirc |> int
numtots * int64 TinyPhiTot + int64 TinyPhiLUT.[li]
let countPrimes (lmt: uint64) =
if lmt < 169UL then // below 169 whose sqrt is 13 is where TinyPhi doesn't work...
( if lmt < 2UL then 0L else
if lmt < 3UL then 1L else
// adjust for the missing "degree" base primes
if lmt < 9UL then 1L + int64 (lmt - 1UL) / 2L else
if lmt <= 13UL then int64 (lmt - 1UL) / 2L else
5L + int64 TinyPhiLUT.[int (lmt - 1UL) / 2]) else
let sqrtlmt = lmt |> float |> sqrt |> uint64
let mxndx = (sqrtlmt - 3UL) / 2UL |> int
let oprms =
let cb = Array.init (mxndx + 1) <| fun i -> uint32 (i + i + 3)
let rec loopi i =
let sqri = (i + i) * (i + 3) + 3
if sqri > mxndx then () else
if cb.[i] = 0u then loopi (i + 1) else
let bp = i + i + 3
let rec cull c = if c > mxndx then () else cb.[c] <- 0u; cull (c + bp)
cull sqri; loopi (i + 1)
loopi 0; cb |> Array.filter ((<>) 0u)
let rec lvl pilmt m =
let rec looppi pi acc =
if pi >= pilmt then acc else
let p = uint64 oprms.[pi] in let nm = p * m
if lmt <= nm * p then acc + int64 (pilmt - pi) else
let nacc = if pi <= TinyPhiDeg then acc else acc - lvl pi nm
looppi (pi + 1) (nacc + tinyPhi (lmt / nm))
looppi TinyPhiDeg 0L
tinyPhi lmt - lvl oprms.Length 1UL + int64 oprms.Length
```

Use of the above function gets a gain in speed of about a further ten times for far larger ranges than this over the above version due to less recursion, the use of the "Tiny_Phi" Look Up Table (LUT) for smaller "degrees" of primes which greatly reduces the number of integer divisions. This version of prime counting function might be considered to be reasonably useful for counting primes up to a trillion (1e12) in a few tens of seconds.

**The Non-Recursive Partial Sieving Algorithm**

Just substitute the following F# code for the `countPrimes` function above to enjoy the further gain in speed; note that rather than using a recursive function loop for the partial sieving culling pass as per the above two version, this version uses a Seq iteration which is slower but more concise in expression, but as sieving is a negligible part of the total execution time, it doesn't matter and conciseness was considered to be more important; this version may appear to be recursive to those unfamiliar with Functional Programming (FP) implementations but FP generally turns all function recursions in tail-call position (which all of these are) into the same compiled code as a simple loop in other languages:

```
let masks = Array.init 8 ((<<<) 1uy) // quick bit twiddling
let countPrimes (lmt: uint64): int64 =
if lmt < 3UL then (if lmt < 2UL then 0L else 1L) else
let inline half x = (x - 1) >>> 1
let inline divide nm d = (float nm) / (float d) |> int
let sqrtlmt = lmt |> float |> sqrt |> uint64
let mxndx = (sqrtlmt - 1UL) / 2UL |> int in let cbsz = (mxndx + 8) / 8
let cullbuf = Array.zeroCreate cbsz
let smalls = Array.init (mxndx + 1) uint32
let roughs = Array.init (mxndx + 1) <| fun i -> uint32 (i + i + 1)
let larges = Array.init (mxndx + 1) <| fun i ->
int64 (lmt / uint64 (i + i + 1) - 1UL) / 2L
let rec loopbp bp nobps rilmt =
let i = int (bp - 1UL) >>> 1 in let sqri = (i + i) * (i + 1)
if sqri > mxndx then nobps, rilmt else
if (cullbuf.[i >>> 3] &&& masks.[i &&& 7]) <> 0uy then
loopbp (bp + 2UL) nobps rilmt else
let w = i >>> 3 in cullbuf.[w] <- cullbuf.[w] ||| masks.[i &&& 7] // cull bp
{ sqri .. int bp .. mxndx } |> Seq.iter (fun c -> // cull multiples of bp...
let w = c >>> 3 in cullbuf.[w] <- cullbuf.[w] ||| masks.[c &&& 7] )
// adjust `larges for last partial loop pass;
// compress larges/roughs for current partial sieve pass...
let rec loopri iri ori =
if iri > rilmt then ori - 1 else
let r = uint64 roughs.[iri] in let sri = int (r >>> 1)
if (cullbuf.[sri >>> 3] &&& masks.[sri &&& 7]) <> 0uy then
loopri (iri + 1) ori else // skip for roughs culled this pass!
let d = bp * r
larges.[ori] <- larges.[iri] -
( if d <= sqrtlmt then
larges.[int smalls.[int (d >>> 1)] - nobps]
else let ndx = (half << divide lmt) d
int64 smalls.[ndx] ) + int64 nobps
roughs.[ori] <- uint32 r; loopri (iri + 1) (ori + 1)
// adjust `smalls` for last partial loop pass...
let rec loopbpm bpm mxsi =
if bpm < bp then () else
let c = smalls.[int (bpm >>> 1)] - uint32 nobps
let ei = (bpm * bp) >>> 1 |> int
let rec loopsi si =
if si < ei then si else smalls.[si] <- smalls.[si] - c; loopsi (si - 1)
loopbpm (bpm - 2UL) (loopsi mxsi)
let nrilmt = loopri 0 0
loopbpm ((sqrtlmt / bp - 1UL) ||| 1UL) mxndx
loopbp (bp + 2UL) (nobps + 1) nrilmt
// accumulate result so far; compensate for over subtraction...
let numobps, mxri = loopbp 3UL 0 mxndx
let rec smr i acc =
if i > mxri then // adjust accumulated answer!
acc + (int64 mxri + 1L + 2L * int64 (numobps - 1)) * int64 mxri / 2L
else smr (i + 1) (acc - larges.[i])
let ans0 = smr 1 larges.[0]
// finally, add result from pairs of rough primes up to cube root of range,
// where they are two different primes; compensating for over addition...
let rec loopri ri acc =
let p = uint64 roughs.[ri] in let q = lmt / p
let ei = int smalls.[(half << divide q) p] - numobps
if ei <= ri then acc else
let rec loopori ori oacc =
if ori > ei then oacc else
let ndx = (half << divide q) (uint64 roughs.[ori])
loopori (ori + 1) (oacc + int64 smalls.[ndx])
let nacc = loopori (ri + 1) acc // subtract over addition of base primes:
loopri (ri + 1) (nacc - int64 (ei - ri) * (int64 numobps + int64 ri - 1L))
loopri 1 ans0 + 1L // add one for only even prime of two!
```

The above function enjoys quite a large gain in speed of about a further ten times over the immediately preceding version (although one can't see that gain for these trivial ranges as it is buried in the constant overheads) due to not using recursion at all and the greatly reduced computational complexity of O(n**(3/4)/((log n)**2)) instead of the almost-linear-for-large-counting-ranges O(n/((log n)**2)) for the previous two versions, although the previous two versions differ in performance by a constant factor due to the overheads of recursion and division; for instance, this version can calculate the number of primes to 1e11 in about a hundred milliseconds. This version of prime counting function might be considered to be reasonably useful for counting primes up to a 1e16 in about a hundred seconds as long as the computer used has the required free memory of about 800 Megabytes. This F# version is about twice as slow as the Nim version from which it was translated due to the benefits of native code compilation and more optimizations for Nim rather than JIT compilation for F#. At that, this version is about the same speed as when translated to Rust and Crystal once the range is increased where constant overheads aren't a factor, which both use native code compilers.

## Go

### Memoized

This runs in about 1.9 seconds which is surprisingly slow for Go. However, Go has the same problem as Wren in not being able to use lists for map keys. I tried using a 2-element array for the key but this was a second slower than the Cantor function.

The alternative of sieving a billion numbers and then counting them takes about 4.8 seconds using the present library function though this could be cut to 1.2 seconds using a third party library such as primegen.

```
package main
import (
"fmt"
"log"
"math"
"rcu"
)
func cantorPair(x, y int) int {
if x < 0 || y < 0 {
log.Fatal("Arguments must be non-negative integers.")
}
return (x*x + 3*x + 2*x*y + y + y*y) / 2
}
func pi(n int) int {
if n < 2 {
return 0
}
if n == 2 {
return 1
}
primes := rcu.Primes(int(math.Sqrt(float64(n))))
a := len(primes)
memoPhi := make(map[int]int)
var phi func(x, a int) int // recursive closure
phi = func(x, a int) int {
if a < 1 {
return x
}
if a == 1 {
return x - (x >> 1)
}
pa := primes[a-1]
if x <= pa {
return 1
}
key := cantorPair(x, a)
if v, ok := memoPhi[key]; ok {
return v
}
memoPhi[key] = phi(x, a-1) - phi(x/pa, a-1)
return memoPhi[key]
}
return phi(n, a) + a - 1
}
func main() {
for i, n := 0, 1; i <= 9; i, n = i+1, n*10 {
fmt.Printf("10^%d %d\n", i, pi(n))
}
}
```

- Output:

10^0 0 10^1 4 10^2 25 10^3 168 10^4 1229 10^5 9592 10^6 78498 10^7 664579 10^8 5761455 10^9 50847534

### Non-memoized

Much quicker than before (0.53 seconds) and, of course, uses less memory.

```
package main
import (
"fmt"
"math"
"rcu"
)
func pi(n int) int {
if n < 2 {
return 0
}
if n == 2 {
return 1
}
primes := rcu.Primes(int(math.Sqrt(float64(n))))
a := len(primes)
var phi func(x, a int) int // recursive closure
phi = func(x, a int) int {
if a < 1 {
return x
}
if a == 1 {
return x - (x >> 1)
}
pa := primes[a-1]
if x <= pa {
return 1
}
return phi(x, a-1) - phi(x/pa, a-1)
}
return phi(n, a) + a - 1
}
func main() {
for i, n := 0, 1; i <= 9; i, n = i+1, n*10 {
fmt.Printf("10^%d %d\n", i, pi(n))
}
}
```

- Output:

Same as first version.

### Iterative, partial sieving

This non-memoized, non-recursive version is much quicker than the first two versions, running in about 2 milliseconds which is the same as V.

```
package main
import (
"fmt"
"math"
"time"
)
var masks = [8]uint8{1, 2, 4, 8, 16, 32, 64, 128}
func half(n int) int { return (n - 1) >> 1 }
func divide(nm, d uint64) int { return int(float64(nm) / float64(d)) }
func countPrimes(n uint64) int64 {
if n < 9 {
if n < 2 {
return 0
} else {
return (int64(n) + 1) / 2
}
}
rtlmt := int(math.Sqrt(float64(n)))
mxndx := (rtlmt - 1) / 2
arrlen := mxndx + 1
smalls := make([]uint32, arrlen)
roughs := make([]uint32, arrlen)
larges := make([]int64, arrlen)
for i := uint32(0); i < uint32(arrlen); i++ {
smalls[i] = i
roughs[i] = i + i + 1
larges[i] = int64((n/uint64(i+i+1) - 1) / 2)
}
cullbuflen := (mxndx + 8) / 8
cullbuf := make([]uint8, cullbuflen)
nbps := 0
rilmt := arrlen
for i := 1; ; i++ {
sqri := (i + i) * (i + 1)
if sqri > mxndx {
break
}
if (cullbuf[i>>3] & masks[i&7]) != 0 {
continue
}
cullbuf[i>>3] |= masks[i&7]
bp := i + i + 1
for c := sqri; c < arrlen; c += bp {
cullbuf[c>>3] |= masks[c&7]
}
nri := 0
for ori := 0; ori < rilmt; ori++ {
r := int(roughs[ori])
rci := r >> 1
if (cullbuf[rci>>3] & masks[rci&7]) != 0 {
continue
}
d := r * bp
t := int64(0)
if d <= rtlmt {
t = larges[int(smalls[d>>1])-nbps]
} else {
t = int64(smalls[half(divide(n, uint64(d)))])
}
larges[nri] = larges[ori] - t + int64(nbps)
roughs[nri] = uint32(r)
nri++
}
si := mxndx
for pm := (rtlmt/bp - 1) | 1; pm >= bp; pm -= 2 {
c := smalls[pm>>1]
e := (pm * bp) >> 1
for ; si >= e; si-- {
smalls[si] -= (c - uint32(nbps))
}
}
rilmt = nri
nbps++
}
ans := larges[0] + int64(((rilmt + 2*(nbps-1)) * (rilmt - 1) / 2))
for ri := 1; ri < rilmt; ri++ {
ans -= larges[ri]
}
for ri := 1; ; ri++ {
p := uint64(roughs[ri])
m := n / p
ei := int(smalls[half(int(m/p))]) - nbps
if ei <= ri {
break
}
ans -= int64((ei - ri) * (nbps + ri - 1))
for sri := ri + 1; sri < ei+1; sri++ {
ans += int64(smalls[half(divide(m, uint64(roughs[sri])))])
}
}
return ans + 1
}
func main() {
start := time.Now()
for i, n := uint64(0), uint64(1); i <= 9; i, n = i+1, n*10 {
fmt.Printf("10^%d %d\n", i, countPrimes(n))
}
elapsed := time.Since(start).Microseconds()
fmt.Printf("\nTook %d microseconds\n", elapsed)
}
```

- Output:

10^0 0 10^1 4 10^2 25 10^3 168 10^4 1229 10^5 9592 10^6 78498 10^7 664579 10^8 5761455 10^9 50847534 Took 1862 microseconds

## Haskell

With Memoization:

```
{-# OPTIONS_GHC -O2 -fllvm -Wno-incomplete-patterns #-}
{-# LANGUAGE DeriveFunctor #-}
import Data.Time.Clock.POSIX ( getPOSIXTime ) -- for timing
import Data.Int ( Int64 )
import Data.Bits ( Bits( shiftL, shiftR ) )
data Memo a = EmptyNode | Node a (Memo a) (Memo a)
deriving Functor
memo :: Integral a => Memo p -> a -> p
memo (Node a l r) n
| n == 0 = a
| odd n = memo l (n `div` 2)
| otherwise = memo r (n `div` 2 - 1)
nats :: Integral a => Memo a
nats = Node 0 ((+1).(*2) <$> nats) ((*2).(+1) <$> nats)
memoize :: Integral a => (a -> b) -> a -> b
memoize f = memo (f <$> nats)
memoize2 :: (Integral a, Integral b) => (a -> b -> c) -> a -> b -> c
memoize2 f = memoize (memoize . f)
memoList :: [b] -> Integer -> b
memoList = memo . mkList
where
mkList [] = EmptyNode -- never used; makes complete
mkList (x:xs) = Node x (mkList l) (mkList r)
where (l,r) = split xs
split [] = ([],[])
split [x] = ([x],[])
split (x:y:xs) = let (l,r) = split xs in (x:l, y:r)
isqrt :: Integer -> Integer
isqrt n = go n 0 (q `shiftR` 2)
where
q = head $ dropWhile (< n) $ iterate (`shiftL` 2) 1
go z r 0 = r
go z r q = let t = z - r - q
in if t >= 0
then go t (r `shiftR` 1 + q) (q `shiftR` 2)
else go z (r `shiftR` 1) (q `shiftR` 2)
primes :: [Integer]
primes = 2 : _Y ((3:) . gaps 5 . _U . map(\p-> [p*p, p*p+2*p..])) where
_Y g = g (_Y g) -- = g (g (g ( ... ))) non-sharing multistage fixpoint combinator
gaps k s@(c:cs) | k < c = k : gaps (k+2) s -- ~= ([k,k+2..] \\ s)
| otherwise = gaps (k+2) cs -- when null(s\\[k,k+2..])
_U ((x:xs):t) = x : (merge xs . _U . pairs) t -- tree-shaped folding big union
pairs (xs:ys:t) = merge xs ys : pairs t
merge xs@(x:xt) ys@(y:yt) | x < y = x : merge xt ys
| y < x = y : merge xs yt
| otherwise = x : merge xt yt
phi :: Integer -> Integer -> Integer
phi = memoize2 phiM
where
phiM x 0 = x
phiM x a = phi x (a-1) - phi (x `div` p a) (a - 1)
p = memoList (undefined : primes)
legendrePi :: Integer -> Integer
legendrePi n
| n < 2 = 0
| otherwise = phi n a + a - 1
where a = legendrePi (floor (sqrt (fromInteger n)))
main :: IO ()
main = do
strt <- getPOSIXTime
mapM_ (\n -> putStrLn $ show n ++ "\t" ++ show (legendrePi (10^n))) [0..9]
stop <- getPOSIXTime
let elpsd = round $ 1e3 * (stop - strt) :: Int64
putStrLn $ "This last took " ++ show elpsd ++ " milliseconds."
```

0 0 1 4 2 25 3 168 4 1229 5 9592 6 78498 7 664579 8 5761455 9 50847534 This last took 20383 milliseconds.

The above code reveals all of the problems with the usual Haskell developer with overuse of the multi-precision `Integer` type when there is no way that the range of a 64-bit integer will ever be exceeded, overuse of lists, developed and tested using the interpreter, etc., such that, even compiled using the LLVM back-end, the above code is about ten times slower that any other compiled language for a memoized version.

### Non Memoized Haskell Versions

There doesn't seem to be much point to improving the above memoized version even though one could gain a factor of about ten in doing so as the need for memoization seems to be the result of the task author not realizing that there is a simple way of keeping the computation complexity from having exponential growth with counting range by limiting the "splitting" of the "phi" calculation when the result must be just one. Accordingly, the following Haskell versions are [translated from the non-memoizing Nim versions](https://rosettacode.org/wiki/Legendre_prime_counting_function#Non-Memoized_Versions), which explanations can be referenced for an understanding of how they work:

**The Basic Recursive Legendre Algorithm**

```
{-# OPTIONS_GHC -O2 -fllvm #-}
{-# LANGUAGE FlexibleContexts, BangPatterns #-}
import Data.Time.Clock.POSIX ( getPOSIXTime ) -- for timing
import Data.Int ( Int64 )
import Data.Word ( Word32, Word64 )
import Data.Bits ( Bits( (.&.), (.|.), shiftL, shiftR ) )
import Control.Monad ( unless, when, forM_ )
import Control.Monad.ST ( ST, runST )
import Data.Array.ST ( runSTUArray )
import Data.Array.Base ( UArray(..), IArray(unsafeAt), listArray, elems, assocs,
MArray( unsafeNewArray_, newArray, unsafeRead, unsafeWrite ),
STUArray, unsafeFreezeSTUArray, castSTUArray )
countPrimes :: Word64 -> Int64
countPrimes n =
if n < 3 then (if n < 2 then 0 else 1) else
let sqrtn = truncate $ sqrt $ fromIntegral n
qdrtn = truncate $ sqrt $ fromIntegral sqrtn
rtlmt = (sqrtn - 3) `div` 2
qrtlmt = (qdrtn - 3) `div` 2
oddPrimes@(UArray _ _ psz _) = runST $ do -- UArray of odd primes...
cmpsts <- newArray (0, rtlmt) False :: ST s (STUArray s Int Bool)
forM_ [ 0 .. qrtlmt ] $ \ i -> do
t <- unsafeRead cmpsts i
unless t $ do
let sqri = (i + i) * (i + 3) + 3
bp = i + i + 3
forM_ [ sqri, sqri + bp .. rtlmt ] $ \ c ->
unsafeWrite cmpsts c True
fcmpsts <- unsafeFreezeSTUArray cmpsts
let !numoprms = sum $ [ 1 | False <- elems fcmpsts ]
prms = [ fromIntegral $ i + i + 3 | (i, False) <- assocs fcmpsts ]
return $ listArray (0, numoprms - 1) prms :: ST s (UArray Int Word32)
phi x a =
if a < 1 then x - (x `shiftR` 1) else
let na = a - 1
p = fromIntegral $ unsafeAt oddPrimes na in
if x < p then 1 else
phi x na - phi (x `div` p) na
in fromIntegral (phi n psz) + fromIntegral psz
main :: IO ()
main = do
strt <- getPOSIXTime
mapM_ (\n -> putStrLn $ show n ++ "\t" ++ show (countPrimesx (10^n))) [ 0 .. 9 ]
stop <- getPOSIXTime
let elpsd = round $ 1e3 * (stop - strt) :: Int64
putStrLn $ "This took " ++ show elpsd ++ " milliseconds."
```

- Output:

0 0 1 4 2 25 3 168 4 1229 5 9592 6 78498 7 664579 8 5761455 9 50847534 This took 273 milliseconds.

The time is as run on an Intel i5-6500 (3.6 GHz single-threaded boost).

Although this is much faster than if it were using memoization (and uses much less memory), it is only reasonably usable for trivial ranges such as this (trivial in terms of prime counting functions).

**The Less Recursive "Bottom-Up" with a TinyPhi LUT Algorithm**

```
cTinyPhiPrimes :: [Int]
cTinyPhiPrimes = [ 2, 3, 5, 7, 11, 13 ]
cC :: Int
cC = length cTinyPhiPrimes - 1
cTinyPhiOddCirc :: Int
cTinyPhiOddCirc = product cTinyPhiPrimes `div` 2
cTinyPhiTot :: Int
cTinyPhiTot = product [ p - 1 | p <- cTinyPhiPrimes ]
cTinyPhiLUT :: UArray Int Word32
cTinyPhiLUT = runSTUArray $ do
ma <- newArray (0, cTinyPhiOddCirc - 1) 1
forM_ (drop 1 cTinyPhiPrimes) $ \ bp -> do
let i = (bp - 1) `shiftR` 1
let sqri = (i + i) * (i + 1)
unsafeWrite ma i 0
forM_ [ sqri, sqri + bp .. cTinyPhiOddCirc - 1 ] $ \ c -> unsafeWrite ma c 0
let tot i acc =
if i >= cTinyPhiOddCirc then return ma else do
v <- unsafeRead ma i
if v == 0 then do unsafeWrite ma i acc; tot (i + 1) acc
else do let nacc = acc + 1
unsafeWrite ma i nacc; tot (i + 1) nacc
tot 0 0
tinyPhi :: Word64 -> Int64
tinyPhi n =
let on = (n - 1) `shiftR` 1
numcyc = on `div` fromIntegral cTinyPhiOddCirc
rem = fromIntegral $ on - numcyc * fromIntegral cTinyPhiOddCirc
in fromIntegral numcyc * fromIntegral cTinyPhiTot +
fromIntegral (unsafeAt cTinyPhiLUT rem)
countPrimes :: Word64 -> Int64
countPrimes n =
if n < 3 then (if n < 2 then 0 else 1) else
let sqrtn = truncate $ sqrt $ fromIntegral n
qdrtn = truncate $ sqrt $ fromIntegral sqrtn
rtlmt = (sqrtn - 3) `div` 2
qrtlmt = (qdrtn - 3) `div` 2
oddPrimes@(UArray _ _ psz _) = runST $ do -- UArray of odd primes...
cmpsts <- newArray (0, rtlmt) False :: ST s (STUArray s Int Bool)
forM_ [ 0 .. qrtlmt ] $ \ i -> do
t <- unsafeRead cmpsts i
unless t $ do
let sqri = (i + i) * (i + 3) + 3
bp = i + i + 3
forM_ [ sqri, sqri + bp .. rtlmt ] $ \ c ->
unsafeWrite cmpsts c True
fcmpsts <- unsafeFreezeSTUArray cmpsts
let !numoprms = sum $ [ 1 | False <- elems fcmpsts ]
prms = [ fromIntegral $ i + i + 3 | (i, False) <- assocs fcmpsts ]
return $ listArray (0, numoprms - 1) prms :: ST s (UArray Int Word32)
lvl pi pilmt !m !acc =
if pi >= pilmt then acc else
let p = fromIntegral $ unsafeAt oddPrimes pi
nm = m * p in
if n <= nm * p then acc + fromIntegral (pilmt - pi) else
let !q = fromIntegral $ n `div` nm
!nacc = acc + tinyPhi q
!sacc = if pi <= cC then 0 else lvl cC pi nm 0
in lvl (pi + 1) pilmt m $ nacc - sacc
in tinyPhi n - lvl cC psz 1 0 + fromIntegral psz
```

Use of the above function gets a gain in speed of about a further fifteen times over the above version due to less recursion and the use of the "Tiny_Phi" Look Up Table (LUT) for smaller "degrees" of primes which greatly reduces the number of integer divisions. This version of prime counting function might be considered to be reasonably useful for counting primes up to a trillion (1e12) in a little over ten seconds.

**The Non-Recursive Partial Sieving Algorithm**

Just substitute the following Haskell code for the `countPrimes` function above to enjoy the further gain in speed; this version may appear to be recursive to those unfamiliar with Functional Programming (FP) implementations but FP generally turns all function recursions in tail-call position (which all of these are) into the same compiled code as a simple loop in other languages:

```
countPrimes :: Word64 -> Int64
countPrimes n =
if n < 3 then (if n < 2 then 0 else 1) else
let
{-# INLINE divide #-}
divide :: Word64 -> Word64 -> Int
divide nm d = truncate $ (fromIntegral nm :: Double) / fromIntegral d
{-# INLINE half #-}
half :: Int -> Int
half x = (x - 1) `shiftR` 1
rtlmt = floor $ sqrt (fromIntegral n :: Double)
mxndx = (rtlmt - 1) `div` 2
(!nbps, !nrs, !smalls, !roughs, !larges) = runST $ do
mss <- unsafeNewArray_ (0, mxndx) :: ST s (STUArray s Int Word32)
let msscst =
castSTUArray :: STUArray s Int Word32 -> ST s (STUArray s Int Int64)
mdss <- msscst mss -- for use in adjing counts LUT
forM_ [ 0 .. mxndx ] $ \ i -> unsafeWrite mss i (fromIntegral i)
mrs <- unsafeNewArray_ (0, mxndx) :: ST s (STUArray s Int Word32)
forM_ [ 0 .. mxndx ] $ \ i -> unsafeWrite mrs i (fromIntegral i * 2 + 1)
mls <- unsafeNewArray_ (0, mxndx) :: ST s (STUArray s Int Int64)
forM_ [ 0 .. mxndx ] $ \ i ->
let d = fromIntegral (i + i + 1)
in unsafeWrite mls i (fromIntegral (divide n d - 1) `div` 2)
cmpsts <- unsafeNewArray_ (0, mxndx) :: ST s (STUArray s Int Bool)
let loop i !cbpi !rlmti =
let sqri = (i + i) * (i + 1) in
if sqri > mxndx then do
fss <- unsafeFreezeSTUArray mss
frs <- unsafeFreezeSTUArray mrs
fls <- unsafeFreezeSTUArray mls
return (cbpi, rlmti + 1, fss, frs, fls)
else do
v <- unsafeRead cmpsts i
if v then loop (i + 1) cbpi rlmti else do
unsafeWrite cmpsts i True -- cull current bp so not a "k-rough"!
let bp = i + i + 1
-- partial cull by current base prime, bp...
cull c = if c > mxndx then return () else do
unsafeWrite cmpsts c True; cull (c + bp)
-- adjust `larges` according to partial sieve...
part ri nri = -- old "rough" index to new one...
if ri > rlmti then return (nri - 1) else do
r <- unsafeRead mrs ri -- "rough" always odd!
t <- unsafeRead cmpsts (fromIntegral r `shiftR` 1)
if t then part (ri + 1) nri else do -- skip newly culled
olv <- unsafeRead mls ri
let m = fromIntegral r * fromIntegral bp
adjv <- if m <= fromIntegral rtlmt then do
let ndx = fromIntegral m `shiftR` 1
sv <- unsafeRead mss ndx
unsafeRead mls (fromIntegral sv - cbpi)
else do
sv <- unsafeRead mss (half (divide n m))
return (fromIntegral sv)
unsafeWrite mls nri (olv - (adjv - fromIntegral cbpi))
unsafeWrite mrs nri r; part (ri + 1) (nri + 1)
!pm0 = ((rtlmt `div` bp) - 1) .|. 1 -- max base prime mult
adjc lmti pm = -- adjust smalls according to partial sieve:
if pm < bp then return () else do
c <- unsafeRead mss (pm `shiftR` 1)
let ac = c - fromIntegral cbpi -- correction
bi = (pm * bp) `shiftR` 1 -- start array index
adj si = if si > lmti then adjc (bi - 1) (pm - 2)
else do ov <- unsafeRead mss si
unsafeWrite mss si (ov - ac)
adj (si + 1)
adj bi
dadjc lmti pm =
if pm < bp then return () else do
c <- unsafeRead mss (pm `shiftR` 1)
let ac = c - fromIntegral cbpi -- correction
bi = (pm * bp) `shiftR` 1 -- start array index
ac64 = fromIntegral ac :: Int64
dac = (ac64 `shiftL` 32) .|. ac64
dbi = (bi + 1) `shiftR` 1
dlmti = (lmti - 1) `shiftR` 1
dadj dsi = if dsi > dlmti then return ()
else do dov <- unsafeRead mdss dsi
unsafeWrite mdss dsi (dov - dac)
dadj (dsi + 1)
when (bi .&. 1 /= 0) $ do
ov <- unsafeRead mss bi
unsafeWrite mss bi (ov - ac)
dadj dbi
when (lmti .&. 1 == 0) $ do
ov <- unsafeRead mss lmti
unsafeWrite mss lmti (ov - ac)
adjc (bi - 1) (pm - 2)
cull sqri; nrlmti <- part 0 0
dadjc mxndx pm0
loop (i + 1) (cbpi + 1) nrlmti
loop 1 0 mxndx
!ans0 = unsafeAt larges 0 - -- combine all counts; each includes nbps...
sum [ unsafeAt larges i | i <- [ 1 .. nrs - 1 ] ]
-- adjust for all the base prime counts subracted above...
!adj = (nrs + 2 * (nbps - 1)) * (nrs - 1) `div` 2
!adjans0 = ans0 + fromIntegral adj
loopr ri !acc = -- o final phi calculation for pairs of larger primes...
let r = fromIntegral (unsafeAt roughs ri)
q = n `div` r
lmtsi = half (fromIntegral (q `div` r))
lmti = fromIntegral (unsafeAt smalls lmtsi) - nbps
addcnt pi !ac =
if pi > lmti then ac else
let p = fromIntegral (unsafeAt roughs pi)
ci = half (fromIntegral (divide q p))
in addcnt (pi + 1) (ac + fromIntegral (unsafeAt smalls ci))
in if lmti <= ri then acc else -- break when up to cube root of range!
-- adjust for the `nbps`'s over added in the `smalls` counts...
let !adj = fromIntegral ((lmti - ri) * (nbps + ri - 1))
in loopr (ri + 1) (addcnt (ri + 1) acc - adj)
in loopr 1 adjans0 + 1 -- add one for only even prime of two!
```

The above function enjoys quite a large gain in speed of about a further ten times over the immediately preceding version due to not using recursion at all and the greatly reduced computational complexity of O(n**(3/4)/((log n)**2)) instead of the almost-linear-for-large-counting-ranges O(n/((log n)**2)) for the previous two versions, although the previous two versions differ in performance by a constant factor due to the overheads of recursion and division. This version of prime counting function might be considered to be reasonably useful for counting primes up to 1e16 in just over a minute as long as the computer used has the required free memory of about 800 Megabytes, counting the primes to 1e11 in under 30 milliseconds. This Haskell version is about twenty percent slower than the Nim version from which it was translated because Nim's default GCC back-end is better at optimization (at least for this algorithm) than the LLVM back-end used by Haskell here.

## J

This is "almost a primitive" in J:

```
require'format/printf'
{{echo '10^%d: %d' sprintf y;p:inv 10^y}}"0 i.10
10^0: 0
10^1: 4
10^2: 25
10^3: 168
10^4: 1229
10^5: 9592
10^6: 78498
10^7: 664579
10^8: 5761455
10^9: 50847534
```

A complete implementation of the legendre prime counting function would be `(1&p: + p:inv)`:

```
(1&p: + p:inv) 10
4
(1&p: + p:inv) 11
5
```

But we can simplify that to p:inv when we know that primes are not being tested.

## Java

```
import java.util.*;
public class LegendrePrimeCounter {
public static void main(String[] args) {
LegendrePrimeCounter counter = new LegendrePrimeCounter(1000000000);
for (int i = 0, n = 1; i < 10; ++i, n *= 10)
System.out.printf("10^%d\t%d\n", i, counter.primeCount((n)));
}
private List<Integer> primes;
public LegendrePrimeCounter(int limit) {
primes = generatePrimes((int)Math.sqrt((double)limit));
}
public int primeCount(int n) {
if (n < 2)
return 0;
int a = primeCount((int)Math.sqrt((double)n));
return phi(n, a) + a - 1;
}
private int phi(int x, int a) {
if (a == 0)
return x;
if (a == 1)
return x - (x >> 1);
int pa = primes.get(a - 1);
if (x <= pa)
return 1;
return phi(x, a - 1) - phi(x / pa, a - 1);
}
private static List<Integer> generatePrimes(int limit) {
boolean[] sieve = new boolean[limit >> 1];
Arrays.fill(sieve, true);
for (int p = 3, s = 9; s < limit; p += 2) {
if (sieve[p >> 1]) {
for (int q = s; q < limit; q += p << 1)
sieve[q >> 1] = false;
}
s += (p + 1) << 2;
}
List<Integer> primes = new ArrayList<>();
if (limit > 2)
primes.add(2);
for (int i = 1; i < sieve.length; ++i) {
if (sieve[i])
primes.add((i << 1) + 1);
}
return primes;
}
}
```

- Output:

10^0 0 10^1 4 10^2 25 10^3 168 10^4 1229 10^5 9592 10^6 78498 10^7 664579 10^8 5761455 10^9 50847534

## JavaScript

There doesn't seem to be much point to contributing memoized (Hash table) versions as they seem to be the result of the task author not realizing that there is a simple way of keeping the computation complexity from having exponential growth with counting range by limiting the "splitting" of the "phi" calculation when the result must be just one. Accordingly, the following JavaScript versions are [translated and improved from the non-memoizing Nim versions](https://rosettacode.org/wiki/Legendre_prime_counting_function#Non-Memoized_Versions), which explanations can be referenced for an understanding of how they work:

**The Basic Recursive Legendre Algorithm**

```
"use strict";
function countPrimesTo(lmt) {
if (lmt < 3) { if (lmt < 2) return 0; else return 1; }
const sqrtlmt = Math.sqrt(lmt) >>> 0;
const oprms = function() {
const mxndx = (sqrtlmt - 3) >>> 1;
const arr = new Float64Array(mxndx + 1);
for (let i = 0 >>> 0; i <= mxndx; ++i) arr[i] = (i + i + 3) >>> 0;
let bp = 3 >>> 0;
while (true) {
let i = (bp - 3) >>> 1; let sqri = ((i + i) * (i + 3) + 3) >>> 0;
if (sqri > mxndx) break;
if (arr[i] != 0) for (; sqri <= mxndx; sqri += bp) arr[sqri] = 0;
bp += 2;
}
return arr.filter(v => v != 0);
}();
function phi(x, a) {
if (a <= 0) return x - Math.trunc(x / 2);
const na = (a - 1) >>> 0; const p = oprms[na];
if (x <= p) return 1;
return phi(x, na) - phi(Math.trunc(x / p), na);
}
return phi(lmt, oprms.length) + oprms.length;
}
const start = Date.now();
for (let i = 0; i <= 9; ++i) console.log(`π(10**${i}) =`, countPrimesTo(10**i));
const elpsd = Date.now() - start;
console.log("This took", elpsd, "milliseconds.")
```

- Output:

π(10**0) = 0 π(10**1) = 4 π(10**2) = 25 π(10**3) = 168 π(10**4) = 1229 π(10**5) = 9592 π(10**6) = 78498 π(10**7) = 664579 π(10**8) = 5761455 π(10**9) = 50847534 This took 712 milliseconds.

The time is as run on an Intel i5-6500 (3.6 GHz single-threaded boost) using node.js version 17.3.1.

Although this is about ten times faster than if it were using memoization (and uses much less memory), it is only reasonably usable for trivial ranges such as this (trivial in terms of prime counting functions).

**The Less Recursive "Bottom-Up" with a TinyPhi LUT Algorithm**

Just substitute the following JavaScript code for the `countPrimesTo` function above to enjoy the gain in speed:

```
const TinyPhiPrimes = [ 2, 3, 5, 7, 11, 13 ];
const TinyPhiOddDegree = TinyPhiPrimes.length - 1;
const TinyPhiOddCirc = TinyPhiPrimes.reduce((acc, p) => acc * p) / 2;
const TinyPhiTot = TinyPhiPrimes.reduce((acc, p) => acc * (p - 1), 1)
const TinyPhiLUT = function() {
const arr = new Uint16Array(TinyPhiOddCirc); arr.fill(1);
for (const p of TinyPhiPrimes) {
if (p <= 2) continue; arr[p >> 1] = 0;
for (let c = (p * p) >> 1; c < TinyPhiOddCirc; c += p) arr[c] = 0 >>> 0; }
for (let i = 0 | 0, acc = 0 | 0; i < TinyPhiOddCirc; ++i) {
acc += arr[i]; arr[i] = acc; }
return arr; }();
function tinyPhi(x) {
const ndx = Math.trunc(( x - 1) / 2);
const numtots = Math.trunc(ndx / TinyPhiOddCirc);
const rem = (ndx - numtots * TinyPhiOddCirc) >>> 0;
return numtots * TinyPhiTot + TinyPhiLUT[rem];
}
function countPrimesTo(lmt) {
if (lmt < 169) {
if (lmt < 3) { if (lmt < 2) return 0; else return 1; }
// adjust for the missing "degree" base primes
if (lmt <= 13) return ((lmt - 1) >>> 1) + (lmt < 9 ? 1 : 0);
return 5 + TinyPhiLUT[(lmt - 1) >>> 1];
}
const sqrtlmt = Math.sqrt(lmt) >>> 0;
const oprms = function() {
const mxndx = (sqrtlmt - 3) >>> 1;
const arr = new Float64Array(mxndx + 1);
for (let i = 0 >>> 0; i <= mxndx; ++i) arr[i] = (i + i + 3) >>> 0;
let bp = 3 >>> 0;
while (true) {
let i = (bp - 3) >>> 1; let sqri = ((i + i) * (i + 3) + 3) >>> 0;
if (sqri > mxndx) break;
if (arr[i] != 0) for (; sqri <= mxndx; sqri += bp) arr[sqri] = 0;
bp += 2;
}
return arr.filter(v => v != 0); }();
function lvl(pilmt, m) {
let ans = 0;
for (let pi = TinyPhiOddDegree; pi < pilmt; ++pi) {
const p = oprms[pi]; const nm = m * p;
if (lmt <= nm * p) return ans + pilmt - pi;
ans += tinyPhi(Math.trunc(lmt / nm));
if (pi > TinyPhiOddDegree) ans -= lvl(pi, nm);
}
return ans;
}
return tinyPhi(lmt) - lvl(oprms.length, 1) + oprms.length;
}
```

Use of the above function gets a gain in speed of about a further fifteen times over the above version due to less recursion and the use of the "Tiny_Phi" Look Up Table (LUT) for smaller "degrees" of primes which greatly reduces the number of integer divisions. This version of prime counting function might be considered to be reasonably useful for counting primes up to a trillion (1e12) in a few tens of seconds.

**The Non-Recursive Partial Sieving Algorithm**

Just substitute the following JavaScript code for the `countPrimesTo` function above to enjoy the gain in speed:

```
const masks = new Uint8Array(8); // faster than bit twiddling!
for (let i = 0; i < 8; ++i) masks[i] = (1 << i) >>> 0;
function countPrimesTo(lmt) {
if (lmt < 3) { if (lmt < 2) return 0; else return 1; }
function half(x) { return (x - 1) >>> 1; }
function divide(nm, d) { return (nm / d) >>> 0; }
const sqrtlmt = Math.trunc(Math.sqrt(lmt));
const mxndx = (sqrtlmt - 1) >>> 1; const cbsz = (mxndx + 8) >>> 3;
const cullbuf = new Uint8Array(cbsz);
const smalls = new Uint32Array(mxndx + 1);
for (let i = 0; i <= mxndx; ++i) smalls[i] = i >>> 0;
const roughs = new Uint32Array(mxndx + 1);
for (let i = 0; i <= mxndx; ++i) roughs[i] = (i + i + 1) >>> 0;
const larges = new Float64Array(mxndx + 1);
for (let i = 0; i <= mxndx; ++i) larges[i] =
Math.trunc((Math.trunc(lmt / (i + i + 1)) - 1) / 2);
// partial sieve loop, adjusting larges/smalls, compressing larges/roughs...
let nobps = 0 >>> 0; let rilmt = mxndx; let bp = 3 >>> 0;
while (true) { // break when square root is reached
const i = bp >>> 1; let sqri = ((i + i) * (i + 1)) >>> 0;
if (sqri > mxndx) break; // partial sieving pass if bp is prime:
if ((cullbuf[i >> 3] & masks[i & 7]) == 0) {
cullbuf[i >> 3] |= masks[i & 7]; // cull bp!
for (; sqri <= mxndx; sqri += bp)
cullbuf[sqri >> 3] |= masks[sqri & 7]; // cull bp mults!
// now adjust `larges` for latest partial sieve pass...
var ori = 0 // compress input rough index to output one
for (let iri = 0; iri <= rilmt; ++iri) {
const r = roughs[iri]; const rci = r >>> 1; // skip roughs just culled!
if ((cullbuf[rci >> 3] & masks[rci & 7]) != 0) continue;
const d = bp * r;
larges[ori] = larges[iri] -
( (d <= sqrtlmt) ? larges[smalls[d >>> 1] - nobps]
: smalls[half(divide(lmt, d))] ) +
nobps; // base primes count over subtracted!
roughs[ori++] = r;
}
let si = mxndx // and adjust `smalls` for latest partial sieve pass...
for (let bpm = (sqrtlmt / bp - 1) | 1; bpm >= bp; bpm -= 2) {
const c = smalls[bpm >>> 1] - nobps;
const ei = ((bpm * bp) >>> 1);
while (si >= ei) smalls[si--] -= c;
}
nobps++; rilmt = ori - 1;
}
bp += 2;
}
// combine results to here; correcting for over subtraction in combining...
let ans = larges[0]; for (let i = 1; i <= rilmt; ++i) ans -= larges[i];
ans += Math.trunc((rilmt + 1 + 2 * (nobps - 1)) * rilmt / 2);
// add final adjustment for pairs of current roughs to cube root of range...
let ri = 0
while (true) { // break when reaches cube root of counting range...
const p = roughs[++ri]; const q = Math.trunc(lmt / p);
const ei = smalls[half(divide(q, p))] - nobps;
if (ei <= ri) break; // break here when no more pairs!
for (let ori = ri + 1; ori <= ei; ++ori)
ans += smalls[half(divide(q, roughs[ori]))];
ans -= (ei - ri) * (nobps + ri - 1);
}
return ans + 1; // add one for only even prime of two!
}
```

The above code enjoys quite a large gain in speed of about a further ten times (and increasing with range) over the immediately preceding version for larger "non-trivial" ranges (since there is an appreciable environment overhead in initialization and JIT compilation) due to not using recursion at all and the greatly reduced computational complexity of O(n**(3/4)/((log n)**2)) instead of the almost-linear-for-large-counting-ranges O(n/((log n)**2)) for the previous two versions, although the previous two versions differ in performance by a constant factor due to the overheads of recursion and division; for instance, this version can calculate the number of primes to 1e11 in about 125 milliseconds. This version of prime counting function might be considered to be reasonably useful for counting primes up to a 1e16 in under three minutes as long as the computer used has the required free memory of about 800 Megabytes. This JavaScript version is about three times as slow as the Nim version from which it was translated for large ranges where the initialization overhead is not significant due to being run by the JavaScript engine and JIT compiled.

## jq

**Adapted from Wren**

**Works with gojq, the Go implementation of jq** (*)

The following implementation freely uses the following conveniences of jq syntax:

- the jq expression {$x} is an abbreviation for {"x" : $x}
- the jq expression {x} is an abbreviation for {"x" : .x}

The key-value pairs can be similarly specified in JSON objects with multiple keys.

(*) gojq struggles to get beyond [5,9592] without using huge amounts of memory; jq becomes excessively slow after [7,664579].

**Preliminaries**

```
# For the sake of the accuracy of integer arithmetic when using gojq:
def power($b): . as $in | reduce range(0;$b) as $i (1; . * $in);
# Input: $n, which is assumed to be positive integer,
# Output: an array of primes less than or equal to $n (e.g. 10|eratosthenes #=> [2,3,5,7]
def eratosthenes:
# erase(i) sets .[i*j] to false for integral j > 1
def erase(i):
if .[i] then
reduce range(2; (1 + length) / i) as $j (.; .[i * $j] = false)
else .
end;
(. + 1) as $n
| (($n|sqrt) / 2) as $s
| [null, null, range(2; $n)]
| reduce (2, 1 + (2 * range(1; $s))) as $i (.; erase($i))
| map(select(.)) ;
```

**Phi and the Legendre Counting Function**

```
def legendre:
(sqrt | floor + 1 | eratosthenes) as $primes
# Input: {x, a}
# Output: {phi: phi(x,a), memo} where .memo might have been updated
| def phi:
. as {x: $x, a: $a, memo: $memo}
| if $a == 0 then {phi: $x, $memo}
else "\($x),\($a)" as $ix
| $memo[ $ix ] as $m
| if $m then {phi: $m, $memo}
else .a += -1
| phi as {phi: $phi1, memo: $memo}
| (($x / $primes[$a - 1])|floor) as $x
| ({$x, a, $memo} | phi) as {phi: $phi2, memo: $memo}
| ($phi1 - $phi2) as $phi
| {$phi, $memo}
| .memo[$ix] = $phi
end
end;
def l:
. as $n
| if . < 2 then 0
else ($n|sqrt|floor|l) as $a
| ({x: $n, $a, memo: {}} | phi).phi + $a - 1
end;
l;
def task:
range(0;10)
| . as $i
| [., (10|power($i)|legendre)]
;
task
```

- Output:

[0,0] [1,4] [2,25] [3,168] [4,1229] [5,9592] [6,78498] [7,664579] <terminated>

## Julia

### Using the Julia Primes library

This would be faster with a custom sieve that only counted instead of making a bitset, instead of the more versatile library function.

```
using Primes
function primepi(N)
delta = round(Int, N^0.8)
return sum(i -> count(primesmask(i, min(i + delta - 1, N))), 1:delta:N)
end
@time for power in 0:9
println("10^", rpad(power, 5), primepi(10^power))
end
```

- Output:

10^0 0 10^1 4 10^2 25 10^3 168 10^4 1229 10^5 9592 10^6 78498 10^7 664579 10^8 5761455 10^9 50847534 1.935556 seconds (1.64 k allocations: 415.526 MiB, 2.15% gc time)

### Using the method given in the task

Run twice to show the benefits of precomputing the library prime functions.

```
using Primes
const maxpi = 1_000_000_000
const memoφ = Dict{Vector{Int}, Int}()
const pₐ = primes(isqrt(maxpi))
function π(n)
function φ(x, a)
if !haskey(memoφ, [x, a])
memoφ[[x, a]] = a == 0 ? x : φ(x, a - 1) - φ(x ÷ pₐ[a], a - 1)
end
return memoφ[[x, a]]
end
n < 2 && return 0
a = π(isqrt(n))
return φ(n, a) + a - 1
end
@time for i in 0:9
println("10^", rpad(i, 5), π(10^i))
end
@time for i in 0:9
println("10^", rpad(i, 5), π(10^i))
end
```

- Output:

10^0 1 10^1 4 10^2 25 10^3 168 10^4 1229 10^5 9592 10^6 78498 10^7 664579 10^8 5761455 10^9 50847534 12.939211 seconds (69.44 M allocations: 3.860 GiB, 27.98% gc time, 1.58% compilation time) 10^0 1 10^1 4 10^2 25 10^3 168 10^4 1229 10^5 9592 10^6 78498 10^7 664579 10^8 5761455 10^9 50847534 0.003234 seconds (421 allocations: 14.547 KiB)

### nonrecursive method

```
function countprimes(n)
n < 3 && return typeof(n)(n > 1)
rtlmt = isqrt(n)
mxndx = (rtlmt - 1) ÷ 2
smalls::Array{UInt32} = [i for i in 0:mxndx+1]
roughs::Array{UInt32} = [i + i + 1 for i in 0:mxndx+1]
larges::Array{Int64} = [(n ÷ (i + i + 1) - 1) ÷ 2 for i in 0:mxndx+1]
cmpsts = falses(mxndx + 1)
bp, npc, mxri = 3, 0, mxndx
@inbounds while true
i = bp >> 1
sqri = (i + i) * (i + 1)
sqri > mxndx && break
if !cmpsts[i + 1]
cmpsts[i + 1] = true
for c in sqri:bp:mxndx
cmpsts[c + 1] = true
end
ri = 0
for k in 0:mxri
q = roughs[k + 1]
qi = q >> 1
cmpsts[qi + 1] && continue
d = UInt64(bp) * UInt64(q)
larges[ri + 1] = larges[k + 1] + npc -
(d <= rtlmt ? larges[smalls[d >> 1 + 1] - npc + 1]
: smalls[(Int(floor(n / d)) - 1) >> 1 + 1])
roughs[ri + 1] = q
ri += 1
end
m = mxndx
@simd for k in (rtlmt ÷ bp - 1) | 1 : -2 : bp
c = smalls[k >> 1 + 1] - npc
ee = (k * bp) >> 1
while m >= ee
smalls[m + 1] -= c
m -= 1
end
end
mxri = ri - 1
npc += 1
end
bp += 2
end
result = larges[1]
@simd for i in 2:mxri+1
result -= larges[i]
end
result += (mxri + 1 + 2 * (npc - 1)) * mxri ÷ 2
for j in 1:mxri
p = UInt64(roughs[j + 1])
m = n ÷ p
ee = smalls[(Int(floor(m / p)) - 1) >> 1 + 1] - npc
ee <= j && break
for k in j+1:ee
result += smalls[(Int(floor(m / roughs[k + 1])) - 1) >> 1 + 1]
end
result -= (ee - j) * (npc + j - 1)
end
return result + 1
end
for i in 0:14
println("π(10^$i) = ", countprimes(10^i))
end
@time countprimes(10^13)
@time countprimes(10^14)
```

- Output:

π(10^0) = 0 π(10^1) = 4 π(10^2) = 25 π(10^3) = 168 π(10^4) = 1229 π(10^5) = 9592 π(10^6) = 78498 π(10^7) = 664579 π(10^8) = 5761455 π(10^9) = 50847534 π(10^10) = 455052511 π(10^11) = 4118054813 π(10^12) = 37607912018 π(10^13) = 346065536839 π(10^14) = 3204941750802 0.894891 seconds (13 allocations: 48.442 MiB, 1.06% gc time) 4.479385 seconds (13 allocations: 153.185 MiB, 0.12% gc time)

## Kotlin

There doesn't seem to be much point to contributing memoized (Hash table) versions as they seem to be the result of the task author not realizing that there is a simple way of keeping the computation complexity from having exponential growth with counting range by limiting the "splitting" of the "phi" calculation when the result must be just one. Accordingly, the following Kotlin versions are [translated and improved from the non-memoizing Nim versions](https://rosettacode.org/wiki/Legendre_prime_counting_function#Non-Memoized_Versions), which explanations can be referenced for an understanding of how they work:

**The Basic Recursive Legendre Algorithm**

```
fun countPrimes(lmt: Long): Long {
if (lmt < 3) return if (lmt < 2) 0 else 1
val sqrtlmt = Math.sqrt(lmt.toDouble()).toInt()
fun makePrimes(): IntArray {
val mxndx = (sqrtlmt - 3) / 2
val arr = IntArray(mxndx + 1, { it + it + 3})
var i = 0
while (true) {
val sqri = (i + i) * (i + 3) + 3
if (sqri > mxndx) break
if (arr[i] != 0) {
val bp = i + i + 3
for (c in sqri .. mxndx step bp) arr[c] = 0
}
i++
}
return arr.filter { it != 0 }.toIntArray()
}
val oprms = makePrimes()
fun phi(x: Long, a: Int): Long {
if (a <= 0) return x - (x shr 1)
val na = a - 1; val p = oprms[na].toLong()
if (x <= p) return 1
return phi(x, na) - phi(x / p, na)
}
return phi(lmt, oprms.size) + oprms.size.toLong()
}
fun main() {
val strt = System.currentTimeMillis()
for (i in 0 .. 9) {
val arg = Math.pow(10.toDouble(), i.toDouble()).toLong()
println("π(10**$i) = ${countPrimes(arg)}")
}
val stop = System.currentTimeMillis()
println("This took ${stop - strt} milliseconds.")
}
```

- Output:

π(10**0) = 0 π(10**1) = 4 π(10**2) = 25 π(10**3) = 168 π(10**4) = 1229 π(10**5) = 9592 π(10**6) = 78498 π(10**7) = 664579 π(10**8) = 5761455 π(10**9) = 50847534 This took 480 milliseconds.

The time is as run on an Intel i5-6500 (3.6 GHz single-threaded boost).

**The Less Recursive "Bottom-Up" with a TinyPhi LUT Algorithm**

Just substitute the following Kotlin code for the `countPrimes` function above to enjoy the gain in speed:

```
val cTinyPhiPrimes = intArrayOf(2, 3, 5, 7, 11, 13)
val cTinyPhiDeg = cTinyPhiPrimes.size - 1
val cTinyPhiOddCirc = cTinyPhiPrimes.reduce(Int::times) / 2
val cTinyPhiTot = cTinyPhiPrimes.fold(1) { s, p -> s * (p - 1) }
fun makeTPLUT(): IntArray {
val arr = IntArray(cTinyPhiOddCirc) { _ -> 1 }
for (bp in cTinyPhiPrimes.drop(1)) {
arr[bp shr 1] = 0
for (c in ((bp * bp) shr 1) until cTinyPhiOddCirc step bp) arr[c] = 0 }
var acc = 0
for (i in 0 until cTinyPhiOddCirc) { acc += arr[i]; arr[i] = acc }
return arr
}
val cTinyPhiLUT = makeTPLUT()
fun tinyPhi(x: Long): Long {
val ndx = (x - 1) shr 1; val numtots = ndx / cTinyPhiOddCirc.toLong()
val rem = (ndx - numtots * cTinyPhiOddCirc.toLong()).toInt()
return numtots * cTinyPhiTot.toLong() + cTinyPhiLUT[rem].toLong()
}
fun countPrimes(lmt: Long): Long {
if (lmt < 169) {
if (lmt < 3) return if (lmt < 2) 0 else 1
// adjust for the missing "degree" base primes
if (lmt <= 13)
return ((lmt - 1).toLong() shr 1) + (if (lmt < 9) 1 else 0);
return 5.toLong() + cTinyPhiLUT[(lmt - 1).toInt() shr 1].toLong();
}
val sqrtlmt = Math.sqrt(lmt.toDouble()).toInt()
fun makePrimes(): IntArray {
val mxndx = (sqrtlmt - 3) / 2
val arr = IntArray(mxndx + 1, { it + it + 3})
var i = 0
while (true) {
val sqri = (i + i) * (i + 3) + 3
if (sqri > mxndx) break
if (arr[i] != 0) {
val bp = i + i + 3
for (c in sqri .. mxndx step bp) arr[c] = 0
}
i++
}
return arr.filter { it != 0 }.toIntArray()
}
val oprms = makePrimes()
fun lvl(pilmt: Int, m: Long): Long {
var acc = 0.toLong()
for (pi in cTinyPhiDeg until pilmt) {
val p = oprms[pi].toLong(); val nm = m * p
if (lmt <= nm * p) return acc + (pilmt - pi).toLong()
val q = (lmt.toDouble() / nm.toDouble()).toLong(); acc += tinyPhi(q)
if (pi > cTinyPhiDeg) acc -= lvl(pi, nm)
}
return acc
}
return tinyPhi(lmt) - lvl(oprms.size, 1) + oprms.size.toLong()
}
```

Use of the above function gets a gain in speed of about a further twenty times over the above version due to less recursion and the use of the "Tiny_Phi" Look Up Table (LUT) for smaller "degrees" of primes which greatly reduces the number of integer divisions, which have also been converted to floating point divisions because they are faster although with a lost of precision for counting ranges that one would likely not use this implementation for. This version of prime counting function might be considered to be reasonably useful for counting primes up to a trillion (1e12) in a few tens of seconds.

**The Non-Recursive Partial Sieving Algorithm**

The following Kotlin code enjoys the further gain in speed:

```
import java.util.BitSet
fun countPrimes(lmt: Long): Long {
if (lmt < 3) return if (lmt < 2) 0 else 1 // odds only!
fun half(x: Int): Int = (x - 1) shr 1
fun divide(nm: Long, d: Long): Int = (nm.toDouble() / d.toDouble()).toInt()
val sqrtlmt = Math.sqrt(lmt.toDouble()).toLong()
val mxndx = ((sqrtlmt - 1) shr 1).toInt()
val cmpsts = BitSet(mxndx + 1)
val smalls = IntArray(mxndx + 1) { it }
val roughs = IntArray(mxndx + 1) { it + it + 1 }
val larges = LongArray(mxndx + 1) { (lmt / (it + it + 1).toLong() - 1) shr 1 }
// partial sieve loop, adjusting larges/smalls, compressing larges/roughs...
var nobps = 0; var rilmt = mxndx; var bp = 3.toLong()
while (true) {
val i = (bp shr 1).toInt(); val sqri = (i + i) * (i + 1)
if (sqri > mxndx) break
if (!cmpsts.get(i)) { // condition for partial sieving pass for bp is prime
cmpsts.set(i) // cull bp!
for (c in sqri .. mxndx step bp.toInt()) cmpsts.set(c) // cull bp mults!
// now adjust `larges` for latest partial sieve pass...
var ori = 0 // compress input rough index to output one
for (iri in 0 .. rilmt) {
val r = roughs[iri]; val rci = (r shr 1)
if (cmpsts.get(rci)) continue // skip roughs culled in this sieve pass!
val d = bp * r.toLong()
larges[ori] = larges[iri] -
( if (d <= sqrtlmt)
larges[smalls[(d shr 1).toInt()] - nobps]
else smalls[half(divide(lmt, d))].toLong() ) +
nobps.toLong() // base primes count over subtracted!
roughs[ori++] = r
}
var si = mxndx // and adjust `smalls` for latest partial sieve pass...
for (bpm in (sqrtlmt / bp - 1) or 1 downTo bp step 2) {
val c = smalls[(bpm shr 1).toInt()] - nobps
val ei = ((bpm * bp) shr 1).toInt()
while (si >= ei) smalls[si--] -= c
}
nobps++; rilmt = ori - 1
}
bp += 2
}
// combine results to here; correcting for over subtraction in combining...
var ans = larges[0]; for (i in 1 .. rilmt) ans -= larges[i]
ans += (rilmt.toLong() + 1 + 2 * (nobps.toLong() - 1)) * rilmt.toLong() / 2
// add final adjustment for pairs of current roughs to cube root of range...
var ri = 0
while (true) { // break when reaches cube root of counting range...
val p = roughs[++ri].toLong(); val q = lmt / p
val ei = smalls[half(divide(q, p))] - nobps
if (ei <= ri) break // break here when no more pairs!
for (ori in ri + 1 .. ei)
ans += smalls[half(divide(q, roughs[ori].toLong()))].toLong()
ans -= (ei - ri).toLong() * (nobps.toLong() + ri.toLong() - 1)
}
return ans + 1 // add one for only even prime of two!
}
fun main() {
val strt = System.currentTimeMillis()
for (i in 0 .. 9) {
val arg = Math.pow(10.toDouble(), i.toDouble()).toLong()
println("π(10**$i) = ${countPrimes(arg)}")
}
val stop = System.currentTimeMillis()
println("This took ${stop - strt} milliseconds.")
}
```

The above code enjoys quite a large gain in speed of about a further ten times (and increasing with range) over the immediately preceding version for larger "non-trivial" ranges (since there is an appreciable environment overhead in initialization and JIT compilation) due to not using recursion at all and the greatly reduced computational complexity of O(n**(3/4)/((log n)**2)) instead of the almost-linear-for-large-counting-ranges O(n/((log n)**2)) for the previous two versions, although the previous two versions differ in performance by a constant factor due to the overheads of recursion and division; for instance, this version can calculate the number of primes to 1e11 in about 82 milliseconds. This version of prime counting function might be considered to be reasonably useful for counting primes up to a 1e16 in about two minutes as long as the computer used has the required free memory of about 800 Megabytes. This Kotlin version is about twice as slow as the Nim version from which it was translated for large ranges where the initialization overhead is not significant due to being run under the Java Virtual Machine (JVM) environment and JIT compiled.

## Mathematica/Wolfram Language

```
ClearAll[Phi,pi]
$RecursionLimit = 10^6;
Phi[x_, 0] := x
Phi[x_, a_] := Phi[x, a] = Phi[x, a - 1] - Phi[Floor[x/Prime[a]], a - 1]
pi[n_] := Module[{a}, If[n < 2, 0, a = pi[Floor[Sqrt[n]]]; Phi[n, a] + a - 1]]
Scan[Print[pi[10^#]] &, Range[0,9]]
```

- Output:

0 4 25 168 1229 9592 78498 664579 5761455 50847534

## Nim

The program runs in 2.2 seconds on our laptop. Using int32 instead of naturals (64 bits on our 64 bits computer) saves 0.3 second. Using an integer rather than a tuple as key (for instance `x shl 32 or a`

) saves 0.4 second.

```
import math, strutils, sugar, tables
const
N = 1_000_000_000
S = sqrt(N.toFloat).int
var composite: array[3..S, bool]
for n in countup(3, S, 2):
if n * n > S: break
if not composite[n]:
for k in countup(n * n, S, 2 * n):
composite[k] = true
# Prime list. Add a dummy zero to start at index 1.
let primes = @[0, 2] & collect(newSeq, for n in countup(3, S, 2): (if not composite[n]: n))
var cache: Table[(Natural, Natural), Natural]
proc phi(x, a: Natural): Natural =
if a == 0: return x
let pair = (x, a)
if pair in cache: return cache[pair]
result = phi(x, a - 1) - phi(x div primes[a], a - 1)
cache[pair] = result
proc π(n: Natural): Natural =
if n <= 2: return 0
let a = π(sqrt(n.toFloat).Natural)
result = phi(n, a) + a - 1
var n = 1
for i in 0..9:
echo "π(10^$1) = $2".format(i, π(n))
n *= 10
```

- Output:

π(10^0) = 0 π(10^1) = 4 π(10^2) = 25 π(10^3) = 168 π(10^4) = 1229 π(10^5) = 9592 π(10^6) = 78498 π(10^7) = 664579 π(10^8) = 5761455 π(10^9) = 50847534

### Non-Memoized Versions

The above code takes a huge amount of memory to run, especially if using the full size "Naturals" which not using them will severely limit the usable counting range (which of course is limited anyway by the huge memory use by the memoization). As per the comments on the task, memoization is not strictly needed, as the exponential growth in execution time can be limited by a very simple optimization that stops splitting of the recursive "tree" when there are no more contributions to be had by further "leaves" to the right of the "tree", as per the following code:

```
# compile with: nim c -d:danger -t:-march=native --gc:arc
from std/monotimes import getMonoTime, `-`
from std/times import inMilliseconds
from std/math import sqrt
let masks = [ 1'u8, 2, 4, 8, 16, 32, 64, 128 ] # faster than bit twiddling
let masksp = cast[ptr[UncheckedArray[byte]]](unsafeAddr(masks[0]))
proc countPrimes(n: int64): int64 =
if n < 3:
return if n < 2: 0 else: 1
else:
let rtlmt = n.float64.sqrt.int
let mxndx = (rtlmt - 1) div 2
let sz = (mxndx + 8) div 8
var cmpsts = cast[ptr[UncheckedArray[byte]]](alloc0(sz))
for i in 1 .. mxndx:
if (cmpsts[i shr 3] and masksp[i and 7]) != 0: continue
let sqri = (i + i) * (i + 1)
if sqri > mxndx: break
let bp = i + i + 1
for c in countup(sqri, mxndx, bp):
let w = c shr 3; cmpsts[w] = cmpsts[w] or masksp[c and 7]
var pisqrt = 0'i64
for i in 0 .. mxndx:
if (cmpsts[i shr 3] and masksp[i and 7]) == 0: pisqrt += 1
var primes = cast[ptr[UncheckedArray[uint32]]](alloc(sizeof(uint32) * pisqrt.int))
var j = 0
for i in 0 .. mxndx:
if (cmpsts[i shr 3] and masksp[i and 7]) == 0: primes[j] = (i + i + 1).uint32; j += 1
proc phi(x: int64; a: int): int64 =
if a <= 1:
return if a < 1: x else: x - (x shr 1)
let p = primes[a - 1].int64
if x <= p: return 1 # very simple one-line optimization that limits exponential growth!
return phi(x, a - 1) - phi((x.float64 / p.float64).int64, a - 1)
result = phi(n, pisqrt.int) + pisqrt - 1
cmpsts.dealloc; primes.dealloc
let nstrt = getMonoTime()
var pow = 1'i64
for i in 0 .. 9: echo "π(10^", i, ") = ", pow.countPrimes; pow *= 10
let nelpsd = (getMonoTime() - nstrt).inMilliseconds
echo "This took ", nelpsd, " milliseconds."
```

- Output:

π(10^0) = 0 π(10^1) = 4 π(10^2) = 25 π(10^3) = 168 π(10^4) = 1229 π(10^5) = 9592 π(10^6) = 78498 π(10^7) = 664579 π(10^8) = 5761455 π(10^9) = 50847534 This took 194 milliseconds.

As shown in the output, the above program runs without memoization about 12 times faster and using a huge amount less memory than the previous memoized algorithm. Other than not requiring memoization, the implementation also substitutes floating point division for integer division so as to be faster for many CPU architectures (although only being of 53-bits accuracy instead of 64-bits; which limited precision will limit the range over which the primes can be accurately counted, but as this algorithm is only reasonably useful to about ten to the fourteenth power anyway, that shouldn't be a problem.

**Using Look Up Table for "TinyPhi" and Decreasing Recursion**

A further optimization that greatly reduces the number of operations and thus the run time for these small ranges is to use a pre-computed "TinyPhi" Look Up Table (LUT) array for the "leaves" where the "degree" is small (such as degree of six for the smallest primes used as 13 as used here, which then forms a repeating wheel pattern of 30030 values); this would reduce the "width" of the tree at its widest "base" to eliminate many divisions and recursions and thus reduce the execution time greatly, although the effect is more for smaller ranges such as this as for greater ranges. As well, one can compute the "tree" starting at the lower left, which then reduces some of the recursions to a simple loop and can eliminate the recursions used for succeeding "ones" according to the optimization eliminating the need for more recursion. This is because "Phi"/φ can be calculated by the given "top down" recursion expression given in the task which is derived from the "bottom up" expression as follows:

(where denotes the floor function).

The following Nim code implements these changes:

```
# compile with: nim c -d:danger -t:-march=native --gc:arc
from std/monotimes import getMonoTime, `-`
from std/times import inMilliseconds
from std/math import sqrt
let masks = [ 1'u8, 2, 4, 8, 16, 32, 64, 128 ] # faster than bit twiddling
let masksp = cast[ptr[UncheckedArray[byte]]](unsafeAddr(masks[0]))
const TinyPhiPrimes = [2, 3, 5, 7, 11, 13]
const TinyPhiCirc = 3 * 5 * 7 * 11 * 13
const TinyPhiRes = 2 * 4 * 6 * 10 * 12
const CC = 6
proc makeTinyPhiLUT(): array[TinyPhiCirc, uint16] =
for i in 0 .. TinyPhiCirc - 1: result[i] = 1
for i in 1 .. 6:
if result[i] == 0: continue
result[i] = 0; let bp = i + i + 1
let sqri = (i + i) * (i + 1)
for c in countup(sqri, TinyPhiCirc - 1, bp): result[c] = 0
var acc = 0'u16;
for i in 0 .. TinyPhiCirc - 1: acc += result[i]; result[i] = acc
const TinyPhiLUT = makeTinyPhiLUT()
proc tinyPhi(x: int64): int64 {.inline.} =
let ndx = (x - 1) div 2; let numtot = ndx div TinyPhiCirc.int64
return numtot * TinyPhiRes.int64 + TinyPhiLUT[(ndx - numtot * TinyPhiCirc.int64).int].int64
proc countPrimes(n: int64): int64 =
if n < 169: # below 169 whose sqrt is 13 is where TinyPhi doesn't work...
if n < 3: return if n < 2: 0 else: 1
# adjust for the missing "degree" base primes
if n <= 13: return (n - 1) div 2 + (if n < 9: 1 else: 0)
return 5 + TinyPhiLUT[(n - 1).int div 2].int64
let rtlmt = n.float64.sqrt.int
let mxndx = (rtlmt - 1) div 2
var cmpsts = cast[ptr[UncheckedArray[byte]]](alloc0((mxndx + 8) div 8))
for i in 1 .. mxndx:
if (cmpsts[i shr 3] and masksp[i and 7]) != 0: continue
let sqri = (i + i) * (i + 1)
if sqri > mxndx: break
let bp = i + i + 1
for c in countup(sqri, mxndx, bp):
let w = c shr 3; cmpsts[w] = cmpsts[w] or masksp[c and 7]
var pisqrt = 0'i64
for i in 0 .. mxndx:
if (cmpsts[i shr 3] and masksp[i and 7]) == 0: pisqrt += 1
let primes = cast[ptr[UncheckedArray[uint32]]](alloc(sizeof(uint32) * pisqrt.int))
var j = 0
for i in 0 .. mxndx:
if (cmpsts[i shr 3] and masksp[i and 7]) == 0: primes[j] = (i + i + 1).uint32; j += 1
var phi = tinyPhi(n)
proc lvl(m, mbf: int64; mxa: int) = # recurse from bottom left of "tree"...
for a in CC .. mxa:
let p = primes[a].int64
if m < p * p: phi += mbf * (mxa - a + 1).int64; return # rest of level all ones!
let nm = (m.float64 / p.float64).int64; phi += mbf * tinyPhi(nm)
if a > CC: lvl(nm, -mbf, a - 1) # split
# finished level!
lvl(n, -1, pisqrt.int - 1); result = phi + pisqrt - 1
cmpsts.dealloc; primes.dealloc
let strt = getMonoTime()
var pow = 1'i64
for i in 0 .. 9: echo "π(10^", i, ") = ", pow.countPrimes; pow *= 10
let elpsd = (getMonoTime() - strt).inMilliseconds
echo "This took ", elpsd, " milliseconds."
```

The results of the above code are identical to the previous version except it is almost 20 times faster due to the reduced recursion; however, the gain due to the "TinyPhi" LUT is most effective for relatively trivial counting ranges such as this and will have less and less benefit as ranges get larger, although the reduced recursion due to reordering is a constant factor gain across the range.

Note that the optimization to eliminate the need for memoization has been changed to "if m < p * p: phi += mbf * (mxa - a + 1).int64; return # rest of level all ones!", which has exactly the same effect as the previous optimization but stops the recursion for all "parent" branches of the same "level".

Larger "TinyPhi" LUT's can be used for some added benefit (especially for these smaller ranges) although they quickly get quite huge where the benefit is reduced due to lesser cache associativity; for instance the LUT for a "TinyPhi" of degree eight can a wheel circumference of 9699690 values which can be reduced by a factor of two by only considering odd values and another factor of two using the fact that the LUT is symmetric approaching the middle from the beginning and end of the LUT.

**Reducing Operations by using Partial Sieving**

There has been some work adapting the Legendre theory to the use of partial sieving as is known in the competitive programming communities, although there doesn't seem to be a link to a mathematical paper deriving the basis of that algorithm. The result of that work is a code algorithm that still uses memory proportional to the square root of the range to be counted but greatly reduces the computational complexity from O(n / (log n)^2) to O(n^(3/4) / (log n)^2). Since a mathematical paper supporting the algorithm has not been found, a brief explanation is provided here (if someone knows of such a paper, please edit in a link):

It turns out that there is another way to think of the "Phi"/φ(m, a)/inclusion-exclusion expression from which the task expression and the above implementations have been derived - that expresses the count of values left after the range to m has been partially sieved/partially culled by the primes up to the "ath" one. Now, Legendre would likely not have considered thinking of the inclusion/exclusion principle that way as he made no real practical use of his work other than some trivial examples showing that it worked and could calculate the number of primes up to a range only knowing the primes up to the square root of that range, as one has to be able to see that this can be used repeatedly to adjust Look Up Tables (LUT's) of partial counts with each partial sieve and, of course, there were no computers in his time (1800's).

The LUT's used in the following code are as follows:

- The simple "smalls" LUT is a table of the counts of remaining values after culling of base primes up to a "kth prime" value up to the square root of the counting range (may as well be odd values since two is the only even prime), which could be updated by repeated scanning of the sieve buffer but is more efficiently updated by decrementing the values between each cull value by the constant difference. To make this easier, this "smalls" LUT includes the count of the base primes used up to the current state and doesn't include the one that is normally included in a "Phi" count.
- Another important table is the "roughs" LUT where the "k-roughs" term wasn't used until post 2000 or so, but is the remaining (may as well be odd) values in a range (in this case to the square root of the counting range) after elimination of all multiples of primes up to the kth one. This table will start as half (because odds-only) the square root of the counting range in length but will be reduced in effective length by each partial sieving pass by each base prime value so that only prime values above the last base prime value remain (plus the first value of one) so will be approximately sqrt n / log n in length.
- The final "larges" LUT is the most important one and perhaps the hardest to grasp as it contains the partial counts of values up to the entire counting range for each of the current "roughs" values so thus has an effective length that is synchronized and thus reduced with the above "roughs" LUT. It is initialized with the number of odd un-culled values up to and including each current "k-rough" value, and like the "smalls" LUT, its values are adjusted for each partial sieve pass. The updated values of this "larges" LUT cannot be determined only by reference to the remaining count after culling but the offset to be subtracted must be determined by look up of the previous values in this same LUT of all prime products smaller than the square root of the counting range.
- These three LUT's change until partial sieving is complete, which since they represent values up to the square root of the counting range, happens after all culling by base prime values up to the quad root of the counting range.
- As the Sieve of Eratosthenes sieving is only to the range of the square root of the counting range, it is a negligible part of the total execution time and hasn't been much optimized.

In order to understand the updating of the "larges" LUT as per the above, one needs to understand the relationship between the φ recursion as per the task description and partial sieving as follows:

- The number of (odd) values remaining after a partial sieve and the "phi" expression are closely related as they only differ by the number of base primes and one.
- As per the previous implementation, the "tree" is being adjusted and built from the bottom up, so the initial state of the LUT's represents the bottom of the tree just after partial sieving by the only even base prime of two, and every succeeding partial sieve pass need to build the immediate "parent" node counts based on the values available in the "child" nodes as per the current state of the LUT's.
- The relationship between the "parents" and the "children" for a given count limit is given by the relationship "φ(x, a) = φ(x, a−1) − φ(⌊x/pa⌋, a−1)", which in words says that the "larges" LUT entry for a given count for a "parent" level adjustment will have a new value of the old value minus the looked up count of that value divided by the current partial sieve base prime cull value for the partial sieving pass.
- There are two different means of obtaining the adjustment to be subtracted from the old "larges" count value: for products of the two values less than or equal to the square root of the counting table, the index of the product is looked up in the "smalls" LUT and that index is then used to find the adjustment in the "larges" LUT; for products larger than that limit, the quotient will be less than that limit so can be converted to an index and directly looked up in the "smalls" LUT.
- Since the "larges" LUT needs to be treated in the same way as to offsets and scaling as the "smalls" LUT and the latter includes the count of current base primes used and doesn't include the one, the "larges" LUT needs to do the same. When adjusting for a partial sieving pass by taking a difference, the offset of the current base primes count is cancelled out and needs to be added back in after the difference.
- The synced reduction in effective length of the "roughs" and "larges" LUT's due to culled values for each partial sieving pass is effected in the same loop that does the partial sieving.
- During the partial sieving passes loop, it is impossible for the "non-branch splitting" optimization that makes memoization unnecessary to occur, so that check is not applied during this loop, thus saving a little execution time.
- After partial sieving is complete (when the base primes reach the cube root of the counting range), the loop could continue without sieving to determine the rest of the "parent" values up to the top level with "a" equal to the last base prime up to the square root of the counting range, but that would be less efficient as to having to inject a check for the "non-splitting termination optimization", to determine that the base primes have reached a level where sieving isn't necessary, and a slightly less efficient compensation for the differences eliminating the base primes count and restoring on an operation-by-operation basis, so this loop is terminated and a new loop does this final phase.
- This partial sieving and LUT building loop is the source of the major part of the execution complexity, which for very large ranges will tend toward a O(n^(3/4)/((log n)^2)) execution complexity as can be seen that we are using products of the base primes up the the cube root of the counting range and what will tend toward the number of primes up to the square root of the counting range.

Since we no longer need to use partial sieving and adjustment of the LUT's, at this stage the result of the current state of computation can be combined into a variable. As the zeroth index of "larges" LUT contains the count of the value one "rough" (also the zeroth value in the LUT), it is the result of only one base prime at a time and is therefore included and thus additive where all the other effective values in the "larges" LUT are the result of the product of two co-primes and therefore excluded and subtracted from the zeroth value. This will result in an over subtraction of the number of base primes up to the quad root of the counting range, so that will need to be added back in by the number of effective indexes greater than zero in the "larges" LUT minus one just as this offset is compensated as the values were adjusted and will have to be further adjusted as we add further inclusions to the final result.

At this stage, an answer has been obtained but it is not quite correct as to the inclusions due to the base primes/lowest prime factors between the quad root of the counting range and the square root of the counting range have not yet been added, as follows:

- Again, these are pairs of (this time prime as partial sieving has been completed) factors with the base primes as described above.
- It is easy to demonstrate why there cannot be more than two prime factors used since the smallest prime factor used will be just over the quad root of the counting range, and if there were three successive prime factors used, the quotient would be less than the quad root of the counting range which would be a breaking/non-splitting condition and this is the best case for three prime factors so three prime factors cannot be used in this loop
- This second non-sieving loop will never reach the top end of the range for two large successive prime factors since when the prime factors start at about the cube root of the counting range, they will hit the terminating non-splitting condition and the loop will terminate. This termination is equivalent in effect to the non-splitting optimization so that "tree" splitting is not necessary but since the count LUT's (and the result accumulator variable) don't contain the one offset, no further compensation of adding a computed number of ones needs to be made.
- Since there are always two prime factors as per the above, their compensation will always be positive as in included/added; this is due to that effectively they are being double subtracted as in once as an equivalent compensation to the "larges" values and once for the final transfer of the "larges" values to the result accumulator.
- The product of the pairs of prime factors will always be greater than the square root of the counting range so the quotient will always be less than the square root of the counting range and only the "smalls" LUT needs to be used for these compensations.
- As the compensations for each pair are each adding the count of base primes up to the quad root of the counting range to the accumulator, this needs to be compensated by subtracting a multiple of the number of compensations minus one times the number of base primes used.
- Since this second loop terminates early, there are slightly less operations than in the first partial sieving loop as the base prime multiplier will never exceed about the cube root of the counting range so that the first loop controls the computational complexity more than this second loop.

Finally, we add one to the final answer to account for the only even prime of two.

The reason that this algorithm can reduce the number of operations as compared to the "standard" recursive Legendre algorithm in that the product of some combinations of multiple primes is compensated in one operation rather than many "branching"/recursive operations as in the recursive or semi-recursive algorithms.

One might question the validity of this "partial sieving" optimization considering the task explanation, but the Legendre prime counting function was likely never used practically at the time it was invented other than for trivial examples showing that it worked as mentioned above since there were no electronic computers; the follow-on work by Meissel was used by him to hand calculate the number of primes to a billion (1e9) over the course of some number of years in about 1870, and he most likely would have used partial sieving as otherwise he would have done over five million long divisions or over 600 thousand long divisions even with use of a "TinyPhi" Look Up Table of degree six, which would not be possible in his lifetime at something like a hundred long division calculations a day. D. H. Lehmer did not use partial sieving in his computer work on prime counting in 1959, but that was because he was focused on fitting a version of the Meissel algorithm into the memory limitations of the computers of that day (a few thousand words of RAM memory) such that he had to pre-calculate and store the primes values required on a digital magnetic tape that had to be played several times as the calculation proceeded and thus didn't fully investigate (or fully understand) all of the optimizations as used by Meissel, using the brute mathematical computational strength of the computer to overcome the lack of optimizations. So partial sieving as an optimization technique would likely have been known (by Meissel) but not applied until Legarias, Miller, and Odlyzko (LMO) investigated a way to apply it to a computer algorithm. It could have been applied to both the Legendre and Meissel algorithms much sooner other than, at least for the Legendre algorithm, the need for memory proportional to the square root of the counting range would have limited the useful range to counting ranges of 1e12 or so when computers of the day had at most several Megabytes of RAM memory. One of the biggest advantages of LMO compared to even this optimization of Legendre is it greatly reduces memory requirements such that in 1985 it was possible to determine the count of primes to 4e16 on a mainframe computer with only a few megabytes of RAM memory, **which took 1730 minutes** even multi-threaded (two threads).

In short, the defining characteristic of prime counting functions of the Legendre type is only requiring a sieve to the square root of the counting range, of functions of the Meissel type that they require a sieve to the power of two thirds of the counting range, of functions of the Meissel-Lehmer type that they require a sieve to the power of three quarters of the counting range, with later Meissel variants such as that of Deleglise and Gourdon being tuned variants that are between the Legendre requirement and the Meissel requirement in order to balance the time spent counting and the time spent sieving, with all of them decreasing the number of "Phi" operations as the sieving range increases. All of these algorithms can have the improvement of the LMO "partial sieving" technique applied to them (although LMO is as applied to Meissel) in "splitting" the calculation to above some limit and below that limit, with LMO splitting between values above and below the counting range to the power of two thirds and this algorithm splitting at the square root of the counting range. The combination of "partial sieving" and this "splitting" reduces the computational complexity by combining many of the divisors of products of more than one prime into one operation. This algorithm has approximately the computational complexity of Meissel-Lehmer without the increased sieving range because these operations have been combined into the calculation of "Phi"/φ so that that the calculations of "P2" and "P3" are not necessary and therefore the sieving of the counting range to the three quarters power is not necessary.

A Nim implementation of the above described algorithm is as follows:

```
# compile with: nim c -d:danger -t:-march=native --gc:arc
from std/monotimes import getMonoTime, `-`
from std/times import inMilliseconds
from std/math import sqrt
let masks = [ 1'u8, 2, 4, 8, 16, 32, 64, 128 ] # faster than bit twiddling
let masksp = cast[ptr[UncheckedArray[byte]]](unsafeAddr(masks[0]))
# non-recursive Legendre prime counting function for a range `n`...
# this has O(n^(3/4)/((log n)^2)) time complexity; O(n^(1/2)) space complexity.
proc countPrimes(n: int64): int64 =
if n < 3: # can't odd sieve for value less than 3!
return if n < 2: 0 else: 1
else:
proc half(n: int): int {.inline.} = (n - 1) shr 1 # convenience conversion to index
# dividing using float64 is faster than int64 for some CPU's...
# precision limits range to maybe 1e16!
proc divide(nm, d: int64): int {.inline.} = (nm.float64 / d.float64).int
let rtlmt = n.float64.sqrt.int # precision limits range to maybe 1e16!
let mxndx = (rtlmt - 1) div 2
var smalls = # current accumulated counts of odd primes 1 to sqrt range
cast[ptr[UncheckedArray[uint32]]](alloc(sizeof(uint32) * (mxndx + 1)))
# initialized for no sieving whatsoever other than odds-only - partial sieved by 2:
# 0 odd primes to 1; 1 odd prime to 3, etc....
for i in 0 .. mxndx: smalls[i] = i.uint32
var roughs = # current odd k-rough numbers up to sqrt of range; k = 2
cast[ptr[UncheckedArray[uint32]]](alloc(sizeof(uint32) * (mxndx + 1)))
# initialized to all odd positive numbers 1, 3, 5, ... sqrt range...
for i in 0 .. mxndx: roughs[i] = (i + i + 1).uint32
# array of current phi counts for above roughs...
# these are not strictly `phi`'s since they also include the
# count of base primes in order to match the above `smalls` definition!
var larges = # starts as size of counts just as `roughs` so they align!
cast[ptr[UncheckedArray[int64]]](alloc(sizeof(int64) * (mxndx + 1)))
# initialized for current roughs after accounting for even prime of two...
for i in 0 .. mxndx: larges[i] = ((n div (i + i + 1) - 1) div 2).int64
# cmpsts is a bit-packed boolean array representing
# odd composite numbers from 1 up to rtlmt used for sieving...
# initialized as "zeros" meaning all odd positives are potentially prime
# note that this array starts at (and keeps) 1 to match the algorithm even
# though 1 is not a prime, as 1 is important in computation of phi...
var cmpsts = cast[ptr[UncheckedArray[byte]]](alloc0((mxndx + 8) div 8))
# number of found base primes and current highest used rough index...
var npc = 0; var mxri = mxndx
for i in 1 .. mxndx: # start at index for 3; i will never reach mxndx...
let sqri = (i + i) * (i + 1) # computation of square index!
if sqri > mxndx: break # stop partial sieving due to square index limit!
if (cmpsts[i shr 3] and masksp[i and 7]) != 0'u8: continue # if not prime
# culling the base prime from cmpsts means it will never be found again
cmpsts[i shr 3] = cmpsts[i shr 3] or masksp[i and 7] # cull base prime
let bp = i + i + 1 # base prime from index!
for c in countup(sqri, mxndx, bp): # SoE culling of all bp multiples...
let w = c shr 3; cmpsts[w] = cmpsts[w] or masksp[c and 7]
# partial sieving to current base prime is now completed!
var ri = 0 # to keep track of current used roughs index!
for k in 0 .. mxri: # processing over current roughs size...
# q is not necessarily a prime but may be a
# product of primes not yet culled by partial sieving;
# this is what saves operations compared to recursive Legendre:
let q = roughs[k].int; let qi = q shr 1 # index of always odd q!
# skip over values of `q` already culled in the last partial sieve:
if (cmpsts[qi shr 3] and masksp[qi and 7]) != 0'u8: continue
# since `q` cannot be equal to bp due to cull of bp and above skip;
let d = bp * q # `d` is odd product of some combination of odd primes!
# the following computation is essential to the algorithm's speed:
# see above description in the text for how this works:
larges[ri] = larges[k] -
(if d <= rtlmt: larges[smalls[d shr 1].int - npc]
else: smalls[half(divide(n, d.int64))].int64) + npc.int64
# eliminate rough values that have been culled in partial sieve:
# note that `larges` and `roughs` indices relate to each other!
roughs[ri] = q.uint32; ri += 1 # update rough value; advance rough index
var m = mxndx # adjust `smalls` counts for the newly culled odds...
# this is faster than recounting over the `cmpsts` array for each loop...
for k in countdown(((rtlmt div bp) - 1) or 1, bp, 2): # k always odd!
# `c` is correction from current count to desired count...
# `e` is end limit index no correction is necessary for current cull...
let c = smalls[k shr 1] - npc.uint32; let e = (k * bp) shr 1
while m >= e: smalls[m] -= c; m -= 1 # correct over range down to `e`
mxri = ri - 1; npc += 1 # set next loop max roughs index; count base prime
# now `smalls` is a LUT of odd prime accumulated counts for all odd primes;
# `roughs` is exactly the "k-roughs" up to the sqrt of range with `k` the
# index of the next prime above the quad root of the range;
# `larges` is the partial prime counts for each of the `roughs` values...
# note that `larges` values include the count of the odd base primes!!!
# `cmpsts` are never used again!
# the following does the top most "phi tree" calculation:
result = larges[0] # the answer to here is all valid `phis`
for i in 1 .. mxri: result -= larges[i] # combined here by subtraction
# compensate for the included odd base prime counts over subracted above:
result += ((mxri + 1 + 2 * (npc - 1)) * mxri div 2).int64
# This loop adds the counts due to the products of the `roughs` primes,
# of which we only use two different ones at a time, as all the
# combinations with lower primes than the cube root of the range have
# already been computed and included with the previous major loop...
# see text description above for how this works...
for j in 1 .. mxri: # for all `roughs` (now prime) not including one:
let p = roughs[j].int64; let m = n div p # `m` is the `p` quotient
# so that the end limit `e` can be calculated based on `n`/(`p`^2)
let e = smalls[half((m div p).int)].int - npc
# following break test equivalent to non-memoization/non-splitting optmization:
if e <= j: break # stop at about `p` of cube root of range!
for k in j + 1 .. e: # for all `roughs` greater than `p` to end limit:
result += smalls[half(divide(m, roughs[k].int64))].int64
# compensate for all the extra base prime counts just added!
result -= ((e - j) * (npc + j - 1)).int64
result += 1 # include the count for the only even prime of two
smalls.dealloc; roughs.dealloc; larges.dealloc; cmpsts.dealloc
let strt = getMonoTime()
var pow = 1'i64
for i in 0 .. 9: echo "π(10^", i, ") = ", pow.countPrimes; pow *= 10
let elpsd = (getMonoTime() - strt).inMilliseconds
echo "This took ", elpsd, " milliseconds."
```

The output of the above code is the same as the previous version except that it is almost ten times faster still and the execution time is almost too small to be measured for this trivial range (for prime counting functions) of only a billion; it takes a little over a second to calculate the count of primes up to 1e13 and just over 30 seconds to calculate the number of primes up to 1e16 on a modern CPU.

The above program uses about 250,000 long integer divisions to calculate the count of primes to a billion as here, which is small enough to be able to calculate it by hand in several man-years of work. The Meissel-LMO algorithm with usual basic optimizations greatly reduces the number of integer long divisions (to only about 17 thousand with a "TinyPhi" LUT of degree eight) so as to be even more possible to hand calculate in only a few man-months, although still subject to math errors as Meissel made in this computation to a billion in the about 1870's. Meissel did not have our advantage in electronic computational power, but just using partial sieving and even a small degree of "TinyPhi" table would have made it easily possible to hand calculate in a few man-years.

The advantage of this "k-roughs" algorithm is its relative simplicity, but that comes at the cost of fairly high memory use due to using Legendre such that it is only really usable to about 1e16 even for a modern computer. Another disadvantage is that each partial sieving pass must be completed before being able to proceed with the next, which precludes using page segmentation and the ease of introducing multi-threading that provides. One could use page segmented sieving as for the the LMO algorithm but if one were to add that much added code complexity, one may as well implement full LMO and also enjoy the reduction in memory use. Once one has page-segmented sieving, one can then easily convert the code to multi-threading for gains in speed proportional to the number of effective threads for a given computer.

As counting range is increased above these trivial ranges, a better algorithm such as that of LMO will likely perform much better than this and use much less memory (to proportional to the cube root of the range), although at the cost of greater code complexity and more lines of code (about 500 LOC). The trade-off between algorithms of the Legendre type and those of the Meissel type is that the latter decreases the time complexity but at the cost of more time spent sieving; modern follow-on work to LMO produces algorithms which are able to tune the balance between Legendre-type and Meissel-type algorithm in trading off the execution time costs between the different parts of the given algorithm for the best in performance, with the latest in Kim Walisch's tuning of Xavier Gourdon's work being about five to ten times faster than LMO (plus includes multi-threading abilities).

## Pascal

```
// Rosetta Code task "Legendre prime counting function".
// Solution for Free Pascal (Lazarus) or Delphi.
program LegendrePrimeCount;
{$IFDEF FPC} // Free Pascal
{$MODE Delphi}
{$ELSE} // Delphi
{$APPTYPE CONSOLE}
{$ENDIF}
// Optimization needs to be on if the program is to finish in a reasonable
// length of time. See "Comments on Task" on the Rosetta Code website.
{$DEFINE SIMPLE_OPTIMIZATION}
uses SysUtils, Types;
{-------------------------------------------------------------------
Function to return an array of primes up to the passed-in limit.
Uses a straightforward Eratosthenes sieve.
TIntegerDynArray must be 0-based. To be compatible with Rosetta Code,
the first prime 2 is at result[1], and result[0] is not used.
}
function FindPrimes( limit : integer) : Types.TIntegerDynArray;
var
deleted : array of boolean;
j, k, p, resultSize : integer;
begin
if (limit < 2) then begin
SetLength( result, 1);
exit;
end;
SetLength( deleted, limit + 1); // 0..limit
deleted[0] := true;
for j := 1 to limit do deleted[j] := false;
p := 2;
while (p*p <= limit) do begin
j := 2*p;
while (j <= limit) do begin
deleted[j] := true;
inc( j, p);
end;
repeat inc(p)
until (p > limit) or (not deleted[p]);
end;
resultSize := 0;
for j := 0 to limit do
if not deleted[j] then inc( resultSize);
SetLength( result, resultSize);
k := 0;
for j := 0 to limit do begin
if not deleted[j] then begin
result[k] := j;
inc(k);
end;
end;
end;
{-----------------------------------------------------------------------------
Function to count primes up to the passed-in limit, by Legendre's method.
Iterative, using a stack. Each item in the stack is a term phi(x,a) along
with a sign. If the top item on the stack can be evaluated easily, it is
popped off and its value is added to the result. Else the top item is
replaced by two items according to the formual in the task description.
}
function CountPrimes( n : integer) : integer;
type
TPhiTerm = record
IsNeg : boolean;
x : integer;
a : integer;
end;
const
STACK_SIZE = 100; // 10 is enough for n = 10^9
var
primes : Types.TIntegerDynArray;
nrPrimes : integer;
stack : array [0..STACK_SIZE - 1] of TPhiTerm;
sp : integer; // stack pointer, points to first free entry
tos : TPhiTerm; // top of stack
begin
primes := FindPrimes( Trunc( Sqrt( n + 0.5)));
nrPrimes := Length( primes) - 1; // primes[0] is not used
result := nrPrimes - 1; // initialize total
// Push initial entry onto stack
with stack[0] do begin
IsNeg := false;
x := n;
a := nrPrimes;
end;
sp := 1;
while sp > 0 do begin
tos := stack[sp - 1];
{$IFDEF SIMPLE_OPTIMIZATION}
// Using optimization described in "Comments on Task"
if tos.x = 0 then begin // top of stack = 0
dec(sp); // pop top of stack, no change to result
end
else if (tos.a > 0) and (tos.x < primes[tos.a]) then begin // top of stack = 1
dec( sp); // pop top of stack, update result
if tos.IsNeg then dec( result)
else inc( result);
end
else if tos.a = 0 then begin
{$ELSE}
// Using only the task description, i.e. only phi(x,0) = x
if tos.a = 0 then begin
{$ENDIF}
dec( sp); // pop top of stack, update result
if tos.IsNeg then dec( result, tos.x)
else inc( result, tos.x);
end
else begin
// Replace top of stack by two items as in the task description,
// namely phi(x, a - 1) and -phi(x div primes[a], a - 1)
if (sp >= STACK_SIZE) then
raise SysUtils.Exception.Create( 'Legendre phi stack overflow');
with stack[sp - 1] do begin
IsNeg := tos.isNeg;
x := tos.x;
a := tos.a - 1;
end;
with stack[sp] do begin
IsNeg := not tos.IsNeg;
x := tos.x div primes[tos.a];
a := tos.a - 1;
end;
inc(sp);
end;
end;
end;
{-----------------------------------------------------------
Main routine
}
var
power, limit, count : integer;
begin
WriteLn( 'Limit Count');
limit := 1;
for power := 0 to 9 do begin
if power > 0 then limit := 10*limit;
count := CountPrimes( limit);
WriteLn( SysUtils.Format( '10^%d %10d', [power, count]))
end;
end.
```

- Output:

Limit Count 10^0 0 10^1 4 10^2 25 10^3 168 10^4 1229 10^5 9592 10^6 78498 10^7 664579 10^8 5761455 10^9 50847534

## Perl

```
#!/usr/bin/perl
use strict; # https://rosettacode.org/wiki/Legendre_prime_counting_function
use warnings;
no warnings qw(recursion);
use ntheory qw( nth_prime prime_count );
my (%cachephi, %cachepi);
sub phi
{
return $cachephi{"@_"} //= do {
my ($x, $aa) = @_;
$aa <= 0 ? $x : phi($x, $aa - 1) - phi(int $x / nth_prime($aa), $aa - 1) };
}
sub pi
{
return $cachepi{$_[0]} //= do {
my $n = shift;
$n < 2 ? 0 : do{ my $aa = pi(int sqrt $n); phi($n, $aa) + $aa - 1 } };
}
print "e n Legendre ntheory\n",
"- - -------- -------\n";
for (1 .. 9)
{
printf "%d %12d %10d %10d\n", $_, 10**$_, pi(10**$_), prime_count(10**$_);
}
```

- Output:

e n Legendre ntheory - - -------- ------- 1 10 4 4 2 100 25 25 3 1000 168 168 4 10000 1229 1229 5 100000 9592 9592 6 1000000 78498 78498 7 10000000 664579 664579 8 100000000 5761455 5761455 9 1000000000 50847534 50847534

## Phix

with javascript_semantics -- -- While a phix dictionary can handle keys of {x,a}, for this -- task performance was dreadful (7,612,479 entries, >3mins), -- so instead memophix maps known x to an index to memophia -- which holds the full [1..a] for each x, dropping to a much -- more respectable (albeit not super-fast) 14s -- constant memophix = new_dict() sequence memophia = {} -- 1..a (max 3401) for each x function phi(integer x, a) if a=0 then return x end if integer adx = getd(x,memophix), res if adx=NULL then memophia = append(memophia,repeat(-1,a)) adx = length(memophia) setd(x,adx,memophix) else object ma = memophia[adx] integer l = length(ma) if a>l then memophia[adx] = 0 -- kill refcount memophia[adx] = ma & repeat(-1,a-l) else res = ma[a] if res>=0 then return res end if end if ma = 0 -- kill refcount end if res = phi(x, a-1) - phi(floor(x/get_prime(a)), a-1) memophia[adx][a] = res return res end function function pi(integer n) if n<2 then return 0 end if integer a = pi(floor(sqrt(n))) return phi(n, a) + a - 1 end function atom t0 = time() for i=0 to iff(platform()=JS?8:9) do printf(1,"10^%d %d\n",{i,pi(power(10,i))}) -- printf(1,"10^%d %d\n",{i,length(get_primes_le(power(10,i)))}) end for ?elapsed(time()-t0)

The commented-out length(get_primes_le()) gives the same results, in about the same time, and in both cases re-running the main loop a second time finishes near-instantly.

- Output:

10^0 0 10^1 4 10^2 25 10^3 168 10^4 1229 10^5 9592 10^6 78498 10^7 664579 10^8 5761455 10^9 50847534 "14.0s"

(It is about 4 times slower under pwa/p2js so output is limited to 10^8, unless you like staring at a blank screen for 52s)

### Non-recursive partial sieve

with javascript_semantics requires("1.0.2") -- (for in, tagstart) function half(integer n) return floor((n-1)/2)+1 end function // convenience convert to idx function count_primes(atom n) // non-recursive Legendre prime counting function for a range `n`... // has O(n^(3/4)/((log n)^2)) time complexity; O(n^(1/2)) space complexity. if n<3 then return iff(n<2?0:1) end if // can't odd sieve for n less than 3! integer sqrtn = trunc(sqrt(n)), // (actual limit) mxndx = floor((sqrtn-1)/2) // odds-only limit -- -- smalls is the current accumulated counts of odd primes 1 to sqrt(n), initialized -- to odds-only sieving, ie {0,1,2,3,4...} meaning 0 odd primes to 1, 1 o.p to 3,... -- -- roughs is the current odd k-rough numbers up to sqrt of range; k = 2 -- initialized to all odd positive numbers 1, 3, 5, 7, 9, 11, ... sqrt(n) -- -- larges is an array of current phi counts for the above roughs... except they are -- not strictly `phi`'s since they also include primes, to match `smalls` above! -- initialized for current roughs after accounting for the even prime of two... -- -- composite is a flag array representing odd numbers 1..sqrtn, for sieving. -- initialized false, meaning all positive odd numbers are potentially prime -- note that this array starts at (and keeps) 1 to match the algorithm even -- though 1 is not actually a prime, as 1 is important in computation of phi... -- sequence smalls = tagset(mxndx,0), roughs = tagstart(1,mxndx+1,2), larges = sq_floor_div(sq_sub(sq_div(n,roughs),1),2), composite = repeat(false,mxndx+1) integer bp = 3, // 'current' base prime nbp = 0, // number of base primes found mxri = mxndx, // current highest used rough index i = 2, sqri = 4 // index and square (index-1) limit // partial sieve loop, adjusting larges/smalls, compressing larges/roughs... while sqri<=mxndx do // partial sieve to square index limit if not composite[i] then // cull from composite so they will never be found again composite[i] = true // cull bp and multiples for c=sqri+1 to mxndx+1 by bp do composite[c] = true end for // partial sieving to current base prime is now completed! // now adjust `larges` for latest partial sieve pass... integer ori = 0 // compress input rough index(k) to output one for k,q in roughs to mxri+1 do // q is not necessarily prime but may be a product of primes not yet // culled by partial sieving (saves ops cmprd to recursive Legendre) // skip over values of `q` already culled in the last partial sieve: integer qi = floor(q/2)+1; // index of always odd q! if not composite[qi] then // since `q` cannot be equal to bp due to cull of bp and above skip; atom d = bp*q, // `d` is odd product of some combination of odd primes! // the following computation is essential to the algorithm's speed, // see the Nim entry for the full details of how this works dadj = iff(d<=sqrtn ? larges[smalls[half(d)]-nbp+1] : smalls[half(floor(n/d))]) ori += 1 larges[ori] = larges[k]-dadj+nbp // base primes count over subtracted! // eliminate rough values that have been culled in partial sieve: // note that `larges` and `roughs` indices relate to each other! roughs[ori] = q end if end for integer m = mxndx // and adjust `smalls` for latest partial sieve pass... // this is faster than recounting over the `composite` array for each loop... for k=(sqrtn/bp-1)||1 to bp by -2 do // k always odd! // `c` is correction from current count to desired count... // `e` is end limit index no correction is necessary for current cull... integer c = smalls[half(k)]-nbp, e = floor((k*bp)/2) while m>=e do -- smalls[m+1] -= c -- (grr, js! [I have a plan, working on it]) m += 1 smalls[m] -= c -- m -= 1 m -= 2 end while end for nbp += 1 // increase number of found base primes mxri = ori-1 // advance rough index for later end if bp += 2 sqri = (i+i)*(i+1) i += 1 end while // now `smalls` is a LUT of odd prime accumulated counts for all odd primes; // `roughs` is exactly the "k-roughs" up to the sqrt of range with `k` (erm, // mxri?) the index of the next prime above the quad root of the range; // `larges` is the partial prime counts for each of the `roughs` values... // note that `larges` values include the count of the odd base primes!!! // - and `composite` is never used again! // the following does the top-most "phi tree" calculation: // the answer to here is all valid `phis`, combined here by subtraction, // + compensate for included odd base prime counts over subracted above: atom result = larges[1] - sum(larges[2..mxri+1]) + trunc((mxri+1 + 2*(nbp-1))*mxri/2) + 1 // include the only even prime, ie 2 // This loop adds the counts due to the products of the `roughs` primes, // of which we only use two different ones at a time, as all the // combinations with lower primes than the cube root of the range have // already been computed and included with the previous major loop... // see text description in the Nim entry for how this works... for ri,p in roughs from 2 do // for all `roughs` (now prime) bar '1': atom m = trunc(n/p), // `m` is the `p` quotient // so that the end limit `e` can be calculated based on `n`/(`p`^2) e = smalls[half(floor(m/p))+1]-nbp+1 // the following test is equivalent to non-splitting optmization: if e<=ri then exit end if // quit when no more pairs! - aka stop // at about `p` of cube root of range! for k=ri+1 to e do // for all `roughs` greater than `p` to limit: result += smalls[half(floor(m/roughs[k]))]; end for // compensate for all the extra base prime counts just added! result -= (e-ri)*(nbp+ri-2) end for return result end function atom t = time() constant expected = {0,4,25,168,1229,9592,78498,664579,5761455, 50847534,455052511,4118054813,37607912018, 346065536839,3204941750802} for i=0 to iff(platform()=JS?11:14) do -- (sp: keep js under 2s) atom c = count_primes(power(10,i)) assert(c==expected[i+1]) string e = elapsed(time()-t,0.1," (%s)") printf(1,"10^%d = %d%s\n",{i,c,e}) end for printf(1,"\nTook %s\n",elapsed(time()-t))

- Output:

10^0 = 0 10^1 = 4 10^2 = 25 10^3 = 168 10^4 = 1229 10^5 = 9592 10^6 = 78498 10^7 = 664579 10^8 = 5761455 10^9 = 50847534 10^10 = 455052511 (0.3s) 10^11 = 4118054813 (1.7s) 10^12 = 37607912018 (7.9s) 10^13 = 346065536839 (40.8s) 10^14 = 3204941750802 (3 minutes and 44s) Took 3 minutes and 44s

About 15 times slower than Julia on the same box (gulp, earmarked for further investigation in 2.0.0).

[At the last minute, I nicked the idea of using a simple bool array instead of those bit-fields from Julia.]

A copy of this code [up to 1e9] has also found it's way into tests\t68forin.exw

## Picat

Performance is surprisingly good, considering that the Picat's library SoE is very basic and that I recompute the base primes each time pi() is called. This version takes 23 seconds on my laptop, outperforming the CoffeeScript version which takes 30 seconds.

```
table(+, +, nt)
phi(X, 0, _) = X.
phi(X, A, Primes) = phi(X, A - 1, Primes) - phi(X // Primes[A], A - 1, Primes).
pi(N) = Count, N < 2 => Count = 0.
pi(N) = Count =>
M = floor(sqrt(N)),
A = pi(M),
Count = phi(N, A, math.primes(M)) + A - 1.
main =>
N = 1,
foreach (K in 0..9)
writef("10^%w %w%n", K, pi(N)),
N := N * 10.
```

- Output:

10^0 0 10^1 4 10^2 25 10^3 168 10^4 1229 10^5 9592 10^6 78498 10^7 664579 10^8 5761455 10^9 50847534

## PicoLisp

Uses code from the Sieve of Eratosthenes to generate primes.

```
(setq Legendre-Max (** 10 9))
(load "plcommon/eratosthenes.l") # see task "Sieve of Eratosthenes, 2x3x5x7 wheel version.
# Create an index tree of the first N primes up to √Legendre-Max
(setq Sieve (sieve (sqrt Legendre-Max)))
(balance 'Primes (mapcar 'cons (range 1 (length Sieve)) Sieve))
(de prime (N) (cdr (lup Primes N)))
(de ϕ (X A)
(cache '(NIL) (cons X A)
(if (=0 A)
X
(- (ϕ X (dec A)) (ϕ (/ X (prime A)) (dec A))))))
(de π (N)
(if (< N 2)
0
(let A (π (sqrt N))
(+ (ϕ N A) A -1))))
(for N 10
(prinl "10\^" (dec N) "^I" (π (** 10 (dec N)))))
(bye)
```

- Output:

10^0 0 10^1 4 10^2 25 10^3 168 10^4 1229 10^5 9592 10^6 78498 10^7 664579 10^8 5761455 10^9 50847534

## Python

Using `primesieve`

module (`pip install primesieve`

).

```
from primesieve import primes
from math import isqrt
from functools import cache
p = primes(isqrt(1_000_000_000))
@cache
def phi(x, a):
res = 0
while True:
if not a or not x:
return x + res
a -= 1
res -= phi(x//p[a], a) # partial tail recursion
def legpi(n):
if n < 2: return 0
a = legpi(isqrt(n))
return phi(n, a) + a - 1
for e in range(10):
print(f'10^{e}', legpi(10**e))
```

- Output:

10^0 0 10^1 4 10^2 25 10^3 168 10^4 1229 10^5 9592 10^6 78498 10^7 664579 10^8 5761455 10^9 50847534

## Raku

Seems like an awful lot of pointless work. Using prime sieve anyway, why not just use it?

```
use Math::Primesieve;
my $sieve = Math::Primesieve.new;
say "10^$_\t" ~ $sieve.count: exp($_,10) for ^10;
say (now - INIT now) ~ ' elapsed seconds';
```

- Output:

10^0 0 10^1 4 10^2 25 10^3 168 10^4 1229 10^5 9592 10^6 78498 10^7 664579 10^8 5761455 10^9 50847534 0.071464489 elapsed seconds

## Ruby

```
require 'prime'
def pi(n)
@pr = Prime.each(Integer.sqrt(n)).to_a
a = @pr.size
case n
when 0,1 then 0
when 2 then 1
else phi(n,a) + a - 1
end
end
def phi(x,a)
case a
when 0 then x
when 1 then x-(x>>1)
else
pa = @pr[a-1]
return 1 if x <= pa
phi(x, a-1)- phi(x/pa, a-1)
end
end
(0..9).each {|n| puts "10E#{n} #{pi(10**n)}" }
```

- Output:

10E0 0 10E1 4 10E2 25 10E3 168 10E4 1229 10E5 9592 10E6 78498 10E7 664579 10E8 5761455 10E9 50847534

## Sidef

```
func legendre_phi(x, a) is cached {
return x if (a <= 0)
__FUNC__(x, a-1) - __FUNC__(idiv(x, a.prime), a-1)
}
func legendre_prime_count(n) is cached {
return 0 if (n < 2)
var a = __FUNC__(n.isqrt)
legendre_phi(n, a) + a - 1
}
print("e n Legendre builtin\n",
"- - -------- -------\n")
for n in (1 .. 9) {
printf("%d %12d %10d %10d\n", n, 10**n,
legendre_prime_count(10**n), prime_count(10**n))
assert_eq(legendre_prime_count(10**n), prime_count(10**n))
}
```

- Output:

e n Legendre builtin - - -------- ------- 1 10 4 4 2 100 25 25 3 1000 168 168 4 10000 1229 1229 5 100000 9592 9592 6 1000000 78498 78498 7 10000000 664579 664579

Alternative implementation of the Legendre phi function, by counting k-rough numbers <= n.

```
func rough_count (n,k) {
# Count of k-rough numbers <= n.
func (n,p) is cached {
if (p > n.isqrt) {
return 1
}
if (p == 2) {
return (n >> 1)
}
if (p == 3) {
var t = idiv(n,3)
return (t - (t >> 1))
}
var u = 0
var t = idiv(n,p)
for (var q = 2; q < p; q.next_prime!) {
var v = __FUNC__(t - (t % q), q)
if (v == 1) {
u += prime_count(q, p-1)
break
}
u += v
}
t - u
}(n*k, k)
}
func legendre_phi(n, a) {
rough_count(n, prime(a+1))
}
func legendre_prime_count(n) is cached {
return 0 if (n < 2)
var a = __FUNC__(n.isqrt)
legendre_phi(n, a) + a - 1
}
print("e n Legendre builtin\n",
"- - -------- -------\n")
for n in (1 .. 9) {
printf("%d %12d %10d %10d\n", n, 10**n,
legendre_prime_count(10**n), prime_count(10**n))
assert_eq(legendre_prime_count(10**n), prime_count(10**n))
}
```

## Swift

```
import Foundation
extension Numeric where Self: Strideable {
@inlinable
public func power(_ n: Self) -> Self {
return stride(from: 0, to: n, by: 1).lazy.map({_ in self }).reduce(1, *)
}
}
func eratosthenes(limit: Int) -> [Int] {
guard limit >= 3 else {
return limit < 2 ? [] : [2]
}
let ndxLimit = (limit - 3) / 2 + 1
let bufSize = ((limit - 3) / 2) / 32 + 1
let sqrtNdxLimit = (Int(Double(limit).squareRoot()) - 3) / 2 + 1
var cmpsts = Array(repeating: 0, count: bufSize)
for ndx in 0..<sqrtNdxLimit where (cmpsts[ndx >> 5] & (1 << (ndx & 31))) == 0 {
let p = ndx + ndx + 3
var cullPos = (p * p - 3) / 2
while cullPos < ndxLimit {
cmpsts[cullPos >> 5] |= 1 << (cullPos & 31)
cullPos += p
}
}
return (-1..<ndxLimit).compactMap({i -> Int? in
if i < 0 {
return 2
} else {
if cmpsts[i >> 5] & (1 << (i & 31)) == 0 {
return .some(i + i + 3)
} else {
return nil
}
}
})
}
let primes = eratosthenes(limit: 1_000_000_000)
func φ(_ x: Int, _ a: Int) -> Int {
struct Cache {
static var cache = [String: Int]()
}
guard a != 0 else {
return x
}
guard Cache.cache["\(x),\(a)"] == nil else {
return Cache.cache["\(x),\(a)"]!
}
Cache.cache["\(x),\(a)"] = φ(x, a - 1) - φ(x / primes[a - 1], a - 1)
return Cache.cache["\(x),\(a)"]!
}
func π(n: Int) -> Int {
guard n > 2 else {
return 0
}
let a = π(n: Int(Double(n).squareRoot()))
return φ(n, a) + a - 1
}
for i in 0..<10 {
let n = 10.power(i)
print("π(10^\(i)) = \(π(n: n))")
}
```

- Output:

π(10^0) = 0 π(10^1) = 4 π(10^2) = 25 π(10^3) = 168 π(10^4) = 1229 π(10^5) = 9592 π(10^6) = 78498 π(10^7) = 664579 π(10^8) = 5761455 π(10^9) = 50847534

## Vlang

There doesn't seem to be much point to contributing memoized (Hash table) versions as they seem to be the result of the task author not realizing that there is a simple way of keeping the computation complexity from having exponential growth with counting range by limiting the "splitting" of the "phi" calculation when the result must be just one. Currently, anonymous functions and closure recursion doesn't work very well (which is why the V language is still in beta) so those slower algorithms don't work properly. However, the "partial sieving" Nim version has been [translated from the non-memoizing Nim version](https://rosettacode.org/wiki/Legendre_prime_counting_function#Non-Memoized_Versions), which explanations can be referenced for an understanding of how it works:

**The Non-Memoizing Non-Recursive "Partial-Sieving" Algorithm**

```
// compile with: v -cflags -march=native -cflags -O3 -prod MagicCount.v
import time
import math { sqrt }
const masks = [ u8(1), 2, 4, 8, 16, 32, 64, 128 ] // faster than bit twiddling!
[inline]
fn half(n u64) i64 { return i64((n - 1) >>> 1) } // convenience function!
[inline] // floating point divide faster than integer divide!
fn divide(nm u64, d u64) u64 { return u64(f64(nm) / f64(d)) }
[direct_array_access]
fn count_primes_to(n u64) i64 {
if n < u64(9) { return if n < i64(2) { i64(0) } else { i64((n + 1) / 2) } }
rtlmt := u64(sqrt(f64(n)))
mxndx := i64((rtlmt - 1) / 2)
arrlen := int(mxndx + 1)
mut smalls := []u32{ len: arrlen, cap: arrlen, init: u32(it) }
mut roughs := []u32{ len: arrlen, cap: arrlen, init: u32(it + it + 1) }
mut larges := []i64{ len: arrlen, cap: arrlen,
init: i64(((n / u64(it + it + 1) - 1) / 2)) }
cullbuflen := int((mxndx + 8) / 8)
mut cullbuf := []u8{len: cullbuflen, cap: cullbuflen, init: u8(0)}
mut nbps := i64(0) // do partial sieving for each odd base prime to quad root...
mut rilmt := arrlen // number of roughs start with full size!
for i := i64(1); ; i++ { // start with 3; breaks when partial sieving done...
sqri := (i + i) * (i + 1)
if sqri > mxndx { break } // breaks here!
if (cullbuf[i >>> 3] & masks[i & 7]) != u8(0) { continue } // not prime!
cullbuf[i >>> 3] |= masks[i & 7] // cull bp rep!
bp := u64(i + i + 1) // partial sieve by bp...
for c := sqri; c < arrlen; c += i64(bp) { cullbuf[c >>> 3] |= masks[c & 7] }
mut nri := 0 // transfer from `ori` to `nri` indexes:
for ori in 0 .. rilmt { // update `roughs` and `larges` for partial sieve...
r := roughs[ori]
rci := i64(r >>> 1) // skip recently culled in last partial sieve...
if (cullbuf[rci >>> 3] & masks[rci & 7]) != u8(0) { continue }
d := u64(r) * u64(bp)
larges[nri] = larges[ori] -
(if d <= rtlmt { larges[i64(smalls[d >>> 1]) - nbps] }
else { i64(smalls[half(divide(n, d))]) })
+ nbps // adjust for over subtraction of base primes!
roughs[nri] = r // compress for culled `larges` and `roughs`!
nri++ // advance output `roughs` index!
}
mut si := mxndx // update `smalls` for partial sieve...
for pm := ((rtlmt / bp) - 1) | 1; pm >= bp; pm -= 2 {
c := smalls[pm >>> 1]
e := (pm * bp) >>> 1
for ; si >= e; si-- { smalls[si] -= (c - u32(nbps)) }
}
rilmt = nri // new rough size limit!
nbps++ // for one partial sieve base prime pass!
}
// combine results so far; adjust for over subtraction of base prime count...
mut ans := larges[0] + i64(((rilmt + 2 * (nbps - 1)) * (rilmt - 1) / 2))
for ri in 1 .. rilmt { ans -= larges[ri] } // combine results so far!
// add quotient for product of pairs of primes, quad root to cube root...
for ri := 1; ; ri++ { // break controlled below
p := u64(roughs[ri])
m := n / p
ei := int(smalls[half(u64(m / p))]) - nbps
if ei <= ri { break } // break out when only multiple equal to `p`
ans -= i64((ei - ri) * (nbps + ri - 1)) // adjust for over addition below!
for sri in ri + 1 .. ei + 1 { // add for products of 1st and 2nd primes...
ans += i64(smalls[half(divide(m, u64(roughs[sri])))]) }
}
return ans + 1 // add one for only even prime of two!
}
start_time := time.now()
mut pow := u64(1)
for i in 0 .. 10 {
println("π(10**$i) = ${count_primes_to(pow)}")
pow *= 10 }
duration := (time.now() - start_time).milliseconds()
println("This took $duration milliseconds.")
```

- Output:

π(10**0) = 0 π(10**1) = 4 π(10**2) = 25 π(10**3) = 168 π(10**4) = 1229 π(10**5) = 9592 π(10**6) = 78498 π(10**7) = 664579 π(10**8) = 5761455 π(10**9) = 50847534 This took 2 milliseconds.

The above code is about as fast as the same algorithm in the fastest of languages such as C/C++/Nim due to the greatly reduced computational complexity of O(n**(3/4)/((log n)**2)); it can compute the number of primes to 1e11 in about 22 milliseconds on the same machine as shown where it is run on an Intel i5-6500 (3.6 GHz single-thread boosted). The above code can compute the number of primes to 1e16 in about a minute, although there is a precision/accuracy problem for counting ranges above about 9e15 due to the speed-up optimization of using floating point division operations (precise only to 53-bits, which is about this limit).

## Wren

### Memoized

This runs in about 4.5 seconds which is not too bad for the Wren interpreter. As map keys cannot be lists, the Cantor pairing function has been used to represent [x, a] which is considerably faster than using a string based approach for memoization.

To sieve a billion numbers and then count the primes up to 10^k would take about 53 seconds in Wren so, as expected, the Legendre method represents a considerable speed up.

```
import "./math" for Int
var pi = Fn.new { |n|
if (n < 3) return (n < 2) ? 0 : 1
var primes = Int.primeSieve(n.sqrt.floor)
var a = primes.count
var memoPhi = {}
var phi // recursive closure
phi = Fn.new { |x, a|
if (a <= 1) return (a < 1) ? x : x - (x >> 1)
var pa = primes[a-1]
if (x <= pa) return 1
var key = Int.cantorPair(x, a)
if (memoPhi.containsKey(key)) return memoPhi[key]
return memoPhi[key] = phi.call(x, a-1) - phi.call((x/pa).floor, a-1)
}
return phi.call(n, a) + a - 1
}
var n = 1
for (i in 0..9) {
System.print("10^%(i) %(pi.call(n))")
n = n * 10
}
```

- Output:

10^0 0 10^1 4 10^2 25 10^3 168 10^4 1229 10^5 9592 10^6 78498 10^7 664579 10^8 5761455 10^9 50847534

### Non-memoized

Inspired by the non-memoized Nim version.

The memoized version requires a cache of around 6.5 million map entries, which at 8 bytes each (all numbers are 64 bit floats in Wren) for both the key and the value, equates in round figures to memory usage of 104 MB on top of that needed for the prime sieve. The following version strips out memoization and, whilst somewhat slower at 5.4 seconds, may be preferred in a constrained memory environment.

```
import "./math" for Int
var pi = Fn.new { |n|
if (n < 3) return (n < 2) ? 0 : 1
var primes = Int.primeSieve(n.sqrt.floor)
var a = primes.count
var phi // recursive closure
phi = Fn.new { |x, a|
if (a <= 1) return (a < 1) ? x : x - (x >> 1)
var pa = primes[a-1]
if (x <= pa) return 1
return phi.call(x, a-1) - phi.call((x/pa).floor, a-1)
}
return phi.call(n, a) + a - 1
}
var n = 1
for (i in 0..9) {
System.print("10^%(i) %(pi.call(n))")
n = n * 10
}
```

- Output:

Same as first version.

### Iterative, partial sieving

This non-memoized, non-recursive version is far quicker than the first two versions, running in about 114 milliseconds. Nowhere near as quick as V, of course, but acceptable for Wren which is interpreted and only has one built-in numeric type, a 64 bit float.

```
import "./math" for Int
var masks = [1, 2, 4, 8, 16, 32, 64, 128]
var half = Fn.new { |n| (n - 1) >> 1 }
var countPrimes = Fn.new { |n|
if (n < 9) return (n < 2) ? 0 : ((n + 1)/2).floor
var rtlmt = n.sqrt.floor
var mxndx = Int.quo(rtlmt - 1, 2)
var arrlen = mxndx + 1
var smalls = List.filled(arrlen, 0)
var roughs = List.filled(arrlen, 0)
var larges = List.filled(arrlen, 0)
for (i in 0...arrlen) {
smalls[i] = i
roughs[i] = i + i + 1
larges[i] = Int.quo(Int.quo(n, i + i + 1) - 1 , 2)
}
var cullbuflen = Int.quo(mxndx + 8, 8)
var cullbuf = List.filled(cullbuflen, 0)
var nbps = 0
var rilmt = arrlen
var i = 1
while (true) {
var sqri = (i + i) * (i + 1)
if (sqri > mxndx) break
if ((cullbuf[i >> 3] & masks[i & 7]) != 0) {
i = i + 1
continue
}
cullbuf[i >> 3] = cullbuf[i >> 3] | masks[i & 7]
var bp = i + i + 1
var c = sqri
while (c < arrlen) {
cullbuf[c >> 3] = cullbuf[c >> 3] | masks[c & 7]
c = c + bp
}
var nri = 0
for (ori in 0...rilmt) {
var r = roughs[ori]
var rci = r >> 1
if ((cullbuf[rci >> 3] & masks[rci & 7]) != 0) continue
var d = r * bp
var t = (d <= rtlmt) ? larges[smalls[d >> 1] - nbps] :
smalls[half.call(Int.quo(n, d))]
larges[nri] = larges[ori] - t + nbps
roughs[nri] = r
nri = nri + 1
}
var si = mxndx
var pm = ((rtlmt/bp).floor - 1) | 1
while (pm >= bp) {
var c = smalls[pm >> 1]
var e = (pm * bp) >> 1
while (si >= e) {
smalls[si] = smalls[si] - c + nbps
si = si - 1
}
pm = pm - 2
}
rilmt = nri
nbps = nbps + 1
i = i + 1
}
var ans = larges[0] + ((rilmt + 2 * (nbps - 1)) * (rilmt - 1) / 2).floor
for (ri in 1...rilmt) ans = ans - larges[ri]
var ri = 1
while (true) {
var p = roughs[ri]
var m = Int.quo(n, p)
var ei = smalls[half.call(Int.quo(m, p))] - nbps
if (ei <= ri) break
ans = ans - (ei - ri) * (nbps + ri - 1)
for (sri in (ri + 1..ei)) {
ans = ans + smalls[half.call(Int.quo(m, roughs[sri]))]
}
ri = ri + 1
}
return ans + 1
}
var n = 1
for (i in 0..9) {
System.print("10^%(i) %(countPrimes.call(n))")
n = n * 10
}
```

- Output:

Same as first version.