Legendre prime counting function

From Rosetta Code
Task
Legendre prime counting function
You are encouraged to solve this task according to the task description, using any language you may know.

The prime-counting function π(n) computes the number of primes not greater than n. Legendre was the first mathematician to create a formula to compute π(n) based on the inclusion/exclusion principle.

To calculate:

Define

φ(x, 0) = x
φ(x, a) = φ(x, a−1) − φ(⌊x/pa⌋, a−1), where pa is the ath prime number.

then

π(n) = 0 when n < 2
π(n) = φ(n, a) + a - 1, where a = π(√n), n ≥ 2

The Legendre formula still requires the use of a sieve to enumerate primes; however it's only required to sieve up to the √n, and for counting primes, the Legendre method is generally much faster than sieving up to n.

Task

Calculate π(n) for values up to 1 billion. Show π(n) for n = 1, 10, 100, ... 109.

For this task, you may refer to a prime number sieve (such as the Sieve of Eratosthenes or the extensible sieve) in an external library to enumerate the primes required by the formula. Also note that it will be necessary to memoize the results of φ(x, a) in order to have reasonable performance, since the recurrence relation would otherwise take exponential time.

Comments on Task

Regarding "it will be necessary to memoize the results of φ(x, a)", it will have exponential time performance without memoization only if a very small optimization that should be obvious is not done: it should be obvious that one can stop "splitting" the φ(x, a) "tree" when 'x' is zero, and even before that since the real meaning of the "phi"/φ function is to produce a count of all of the values greater than zero (including one) up to `x` that have been culled of all multiples of the primes up to and including the `pa` prime value, if `x` is less than or equal to `pa`, then the whole "tree" must result in a value of just one. If this minor (and obvious) optimization is done, the "exponential time" performance goes away, memoization is not absolutely necessary (saving the overhead in time and space of doing the memoization), and the time complexity becomes O(n/(log n2) and the space complexity becomes O(n1/2/log n) as they should be.

This is the problem when non-mathematician programmers blindly apply such a general formula as the recursive Legendre one without doing any work to understand it or even doing some trivial hand calculations to better understand how it works just because they have very powerful computers which mask the limitations: a few minutes of hand calculation would make it obvious that there is no need to "split"/recursively call for "phi" nodes where the first argument is zero, and someone with a mathematics interest would then investigate to see if that limit can be pushed a little further as here to the range of nodes whose result will always be one. Once there is no exponential growth of the number of "nodes", then there is no need for memoization as usually implemented with a hash table at a huge cost of memory overhead and constant time computation per operation.

As to "the Legendre method is generally much faster than sieving up to n.", while the number of operations for the Legendre algorithm is about a factor of `log n` squared less than the number of operations for odds-only Sieve of Eratosthenes (SoE) sieving, those operations are "divide" operations which are generally much slower than the simple array access and addition operations used in sieving and the SoE can be optimized further using wheel factorization so that the required time for a given range can be less for a fully optimized SoE than for the common implementation of the Legendre algorithm; the trade off is that a fully optimized SoE is at least 500 lines of code, whereas the basic version of the Legendre prime counting algorithm is only about 40 to 50 lines of code (depending somewhat on the language used).

Also note that the Legendre prime counting function was never used practically at the time it was invented other than to demonstrate that it would find the count of primes to a trivial range only knowing the primes up to the square root of that range and there were too many operations (especially long integer division operations) to actually use it for any reasonably range even with this optimization (about 250 thousand divisions to count primes to ten million), but the follow-on work by Meissel in the 1800's definitely would have used this optimization and others in order to hand calculate the number of primes to a billion (1e9) in about ten years. Even with this optimization, Meissel would have had to hand calculate over five million divisions, so certainly used other Look Up Tables (LUT's) although certainly not caching of Phi/φ values in order to reduce the work to something possible in this amount of time. A "TinyPhi" LUT table for the first six primes of thirteen and less would have reduced the amount of work Meissel did to about 600 thousand divisions, but even that would have been perhaps too much and it is very likely that he also used "partial sieving" techniques, although that would have meant that as well as a table of the primes up to a million, he would have also needed 161 other tables of that range to a million sieved by the primes up to 13, 17, 19, to 997; however, that extra work in building these tables (which might have been done mechanically) would pay off in reducing the number of divisions to about seven thousand so the divisions become a minor problem possible to do over months and the majority of the time would be spent producing the partial sieving tables up to a million.

The reason that Meissel refined the Legendre method would have been that, even applying all of the optimizations including "partial sieving", he would still have had to do about three and a half million divisions to count the primes to a billion even if the number of primes and "partial sieve tables" only needed to be known to about 32 thousand, where his "Meissel" algorithm reduced the number of divisions to only a few thousand as per the above. Without a computer, he could never have completed the calculation of the number of primes to a billion using an optimized Legendre algorithm where he could using his modification. However, modern computers make (reasonably) quick work of integer divisions so that optimized algorithms of the Legendre type become moderately useful although at the cost of memory use as compared to Meissel type algorithms.

11l

Translation of: Python
F primes_up_to_limit(Int limit)
   [Int] r
   I limit >= 2
      r.append(2)

   V isprime = [1B] * ((limit - 1) I/ 2)
   V sieveend = Int(sqrt(limit))
   L(i) 0 .< isprime.len
      I isprime[i]
         Int p = i * 2 + 3
         r.append(p)
         I i <= sieveend
            L(j) ((p * p - 3) >> 1 .< isprime.len).step(p)
               isprime[j] = 0B
   R r

V p = primes_up_to_limit(Int(sqrt(1'000'000'000)))

F phi(x, =a)
   F phi_cached(x, a)
      [(Int, Int) = Int] :cache
      I (x, a) C :cache
         R :cache[(x, a)]
      V r = phi(x, a)
      :cache[(x, a)] = r
      R r

   V res = 0
   L
      I a == 0 | x == 0
         R x + res

      a--
      res -= phi_cached(x I/ :p[a], a)

F legpi(n)
   I n < 2
      R 0

   V a = legpi(Int(sqrt(n)))
   R phi(n, a) + a - 1

L(e) 10
   print(‘10^’e‘ ’legpi(10 ^ e))
Output:
10^0 0
10^1 4
10^2 25
10^3 168
10^4 1229
10^5 9592
10^6 78498
10^7 664579
10^8 5761455
10^9 50847534

C

Translation of: Vlang

Using fixed width types so requires C99 or later.

Surprisingly only half as fast as Go on the same machine (GCC -O3 Ubuntu 22.04) for a billion numbers.

However, the position is reversed if we let it run through to a trillion numbers - about 100 ms for C compared to 160 ms for Go.

#include <stdio.h>
#include <math.h>
#include <stdlib.h>
#include <stdint.h>
#include <time.h>

const uint8_t masks[8] = {1, 2, 4, 8, 16, 32, 64, 128};

#define half(n) ((int64_t)((n) - 1) >> 1)

#define divide(nm, d) ((uint64_t)((double)nm / (double)d))

int64_t countPrimes(uint64_t n) {
    if (n < 9) return (n < 2) ? 0 : ((int64_t)n + 1) / 2;    
    uint64_t rtlmt = (uint64_t)sqrt((double)n);
    int64_t mxndx = (int64_t)((rtlmt - 1) / 2);
    int arrlen = (int)(mxndx + 1);
    uint32_t *smalls = malloc(arrlen * 4);
    uint32_t *roughs = malloc(arrlen * 4);
    int64_t *larges  = malloc(arrlen * 8);
    for (int i = 0; i < arrlen; ++i) {
        smalls[i] = (uint32_t)i;
        roughs[i] = (uint32_t)(i + i + 1);
        larges[i] = (int64_t)((n/(uint64_t)(i + i + 1) - 1) / 2);
    }
    int cullbuflen = (int)((mxndx + 8) / 8);
    uint8_t *cullbuf = calloc(cullbuflen, 1);
    int64_t nbps = 0;
    int rilmt = arrlen;
    for (int64_t i = 1; ; ++i) {
        int64_t sqri = (i + i) * (i + 1);
        if (sqri > mxndx) break;
        if (cullbuf[i >> 3] & masks[i & 7]) continue;
        cullbuf[i >> 3] |= masks[i & 7];
        uint64_t bp = (uint64_t)(i + i + 1);
        for (int64_t c = sqri; c < (int64_t)arrlen; c += (int64_t)bp) {
            cullbuf[c >> 3] |= masks[c & 7];
        }
        int nri = 0;
        for (int ori = 0; ori < rilmt; ++ori) {
            uint32_t r = roughs[ori];
            int64_t rci = (int64_t)(r >> 1);
            if (cullbuf[rci >> 3] & masks[rci & 7]) continue;
            uint64_t d = (uint64_t)r * bp;
            int64_t t = (d <= rtlmt) ? larges[(int64_t)smalls[d >> 1] - nbps] :
                                       (int64_t)smalls[half(divide(n, d))];
            larges[nri] = larges[ori] - t + nbps;
            roughs[nri] = r;
            nri++;
        }
        int64_t si = mxndx;
        for (uint64_t pm = (rtlmt/bp - 1) | 1; pm >= bp; pm -= 2) {
            uint32_t c = smalls[pm >> 1];
            uint64_t e = (pm * bp) >> 1;
            for ( ; si >= (int64_t)e; --si) smalls[si] -= c - (uint32_t)nbps;                           
        }
        rilmt = nri;
        nbps++;
    }
    int64_t ans = larges[0] + (int64_t)((rilmt + 2*(nbps - 1)) * (rilmt - 1) / 2);
    int ri, sri;
    for (ri = 1; ri < rilmt; ++ri) ans -= larges[ri];
    for (ri = 1; ; ++ri) {
        uint64_t p = (uint64_t)roughs[ri];
        uint64_t m = n / p;
        int ei = (int)smalls[half((uint64_t)m/p)] - nbps;
        if (ei <= ri) break;
        ans -= (int64_t)((ei - ri) * (nbps + ri - 1));
        for (sri = ri + 1; sri < ei + 1; ++sri) {
            ans += (int64_t)smalls[half(divide(m, (uint64_t)roughs[sri]))];
        }
    }
    free(smalls);
    free(roughs);
    free(larges);
    free(cullbuf);
    return ans + 1;
}

int main() {
    uint64_t n;
    int i;
    clock_t start = clock();
    for (i = 0, n = 1; i < 10; ++i, n *= 10) {
        printf("10^%d %ld\n", i, countPrimes(n));
    }
    clock_t end = clock();
    printf("\nTook %f seconds\n", (double) (end - start) / CLOCKS_PER_SEC);
    return 0;
}
Output:
10^0 0
10^1 4
10^2 25
10^3 168
10^4 1229
10^5 9592
10^6 78498
10^7 664579
10^8 5761455
10^9 50847534

Took 0.003843 seconds

C++

#include <cmath>
#include <iostream>
#include <vector>

std::vector<int> generate_primes(int limit) {
    std::vector<bool> sieve(limit >> 1, true);
    for (int p = 3, s = 9; s < limit; p += 2) {
        if (sieve[p >> 1]) {
            for (int q = s; q < limit; q += p << 1)
                sieve[q >> 1] = false;
        }
        s += (p + 1) << 2;
    }
    std::vector<int> primes;
    if (limit > 2)
        primes.push_back(2);
    for (int i = 1; i < sieve.size(); ++i) {
        if (sieve[i])
            primes.push_back((i << 1) + 1);
    }
    return primes;
}

class legendre_prime_counter {
public:
    explicit legendre_prime_counter(int limit);
    int prime_count(int n);
private:
    int phi(int x, int a);
    std::vector<int> primes;
};

legendre_prime_counter::legendre_prime_counter(int limit) :
    primes(generate_primes(static_cast<int>(std::sqrt(limit)))) {}

int legendre_prime_counter::prime_count(int n) {
    if (n < 2)
        return 0;
    int a = prime_count(static_cast<int>(std::sqrt(n)));
    return phi(n, a) + a - 1;
}

int legendre_prime_counter::phi(int x, int a) {
    if (a == 0)
        return x;
    if (a == 1)
        return x - (x >> 1);
    int pa = primes[a - 1];
    if (x <= pa)
        return 1;
    return phi(x, a - 1) - phi(x / pa, a - 1);
}

int main() {
    legendre_prime_counter counter(1000000000);
    for (int i = 0, n = 1; i < 10; ++i, n *= 10)
        std::cout << "10^" << i << "\t" << counter.prime_count(n) << '\n';
}
Output:
10^0	0
10^1	4
10^2	25
10^3	168
10^4	1229
10^5	9592
10^6	78498
10^7	664579
10^8	5761455
10^9	50847534

Chapel

There doesn't seem to be much point to contributing memoized (Hash table) versions as they seem to be the result of the task author not realizing that there is a simple way of keeping the computation complexity from having exponential growth with counting range by limiting the "splitting" of the "phi" calculation when the result must be just one. Accordingly, the following Chapel versions are [translated and improved from the non-memoizing Nim versions](https://rosettacode.org/wiki/Legendre_prime_counting_function#Non-Memoized_Versions), which explanations can be referenced for an understanding of how they work:

The Basic Recursive Legendre Algorithm

Translation of: Nim
// compile with --fast for maximum speed...

use Time;

proc countPrimes(lmt: uint(64)): int(64) {
  if lmt < 9 { // when there are no odd primes less than square root...
    if lmt < 3 { if lmt < 2 { return 0; } else { return 1; } }
    return (lmt - (lmt >> 1)): int(64);
  }

  // Chapel doesn't have closures, so emulate them with a class...
  class LegendrePi {
    var n: uint(64);
    var dom: domain(1);
    var oprms: [dom] uint(32);
    proc init(n: uint(64)) {      
      // first, an array of odd primes to the square root of n is generated...
      this.n = n;
      const sqrtn = sqrt(n: real(64)): int(64);
      const rtlmt = (sqrtn - 3) / 2; this.dom = {0 .. rtlmt};
      this.oprms = 0;
      for i in 0 .. rtlmt do this.oprms[i] = (i + i + 3): uint(32);
      var i = 0;
      for i in (0 ..) { // cull the array
        var ci = (i + i) * (i + 3) + 3; if ci > rtlmt { break; }
        const bp = i + i + 3;
        while (ci <= rtlmt) { this.oprms[ci] = 0; ci += bp; }
      }
      var psz = 0;
      for ti in 0 .. rtlmt { // compress the odd primes array...
        const tv = this.oprms[ti];
        if tv != 0 { this.oprms[psz] = tv; psz += 1; }
      }
      this.dom = { 0 ..< psz };
    }
    proc phi(x: uint(64), a: int): int(64) {
      if a <= 0 { return (x - (x >> 1)): int(64); } // take care of prime of 2
      const na = a - 1; const p = this.oprms[na]: uint(64);
      if x <= p { return 1: int(64); }
      return phi(x, na) - phi(x / p, na);
    }
    proc this(): int(64) {
      return phi(n, this.oprms.size) + this.oprms.size: int(64);
    }
  }
  return (new LegendrePi(lmt))();
}

proc main() {
  var timer: Timer;
  timer.start();

  for i in 0 .. 9 {
    writeln("π(10**", i, ") = ", countPrimesx(10: uint(64) ** i));
  }

  timer.stop();

  writeln("This took ", timer.elapsed(TimeUnits.milliseconds), " milliseconds.");
}
Output:
π(10**0) = 0
π(10**1) = 4
π(10**2) = 25
π(10**3) = 168
π(10**4) = 1229
π(10**5) = 9592
π(10**6) = 78498
π(10**7) = 664579
π(10**8) = 5761455
π(10**9) = 50847534
This took 276.431 milliseconds.

The time is as run on an Intel i5-6500 (3.6 GHz single-threaded boost).

Although this is about ten times faster than if it were using memoization (and uses much less memory), it is only reasonably usable for trivial ranges such as this (trivial in terms of prime counting functions).

The Less Recursive "Bottom-Up" with a TinyPhi LUT Algorithm

Just substitute the following Chapel code for the `countPrimes` function above to enjoy the gain in speed:

Translation of: Nim
// tiny Phi Look Up for `a` of small degree...
const tinyPhiPrimes = [ 2, 3, 5, 7, 11, 13 ]; // degree six
const cC = tinyPhiPrimes.size - 1;
proc product(a: [] int): int {
  var acc = 1; for v in a { acc *= v; }; return acc >> 1; }
const tinyPhiOddCirc = product(tinyPhiPrimes);
proc tot(a: [] int): int {
  var acc = 1; for v in a { acc *= v - 1; }; return acc; }
const tinyPhiOddTot = tot(tinyPhiPrimes);
proc makeTinyLUT(ps: [] int, sz: int): [] uint(32) {
  var arr: [0 .. sz - 1] uint(32) = 1;
  for p in ps {
    if p <= 2 { continue; }
    arr[p >> 1] = 0;
    for c in ((p * p) >> 1) ..< sz by p { arr[c] = 0; }
  }
  var acc = 0: uint(32);
  for i in 0 ..< sz { acc += arr[i]; arr[i] = acc; }
  return arr;
}
const tinyPhiLUT = makeTinyLUT(tinyPhiPrimes, tinyPhiOddCirc);
inline proc tinyPhi(x: uint(64)): int(64) {
  const ndx = (x - 1) >> 1; const numtot = ndx / tinyPhiOddCirc: uint(64);
  return (numtot * tinyPhiOddTot +
            tinyPhiLUT[(ndx - numtot * tinyPhiOddCirc): int]): int(64);
}

proc countPrimes(lmt: uint(64)): int(64) {
  if lmt < 169 { // below 169 whose sqrt is 13 is where TinyPhi doesn't work...
    if lmt < 3 { if lmt < 2 { return 0; } else { return 1; } }
    // adjust for the missing "degree" base primes
    if lmt <= 13 {
      return ((lmt - 1): int(64) >> 1) + (if (lmt < 9) then 1 else 0); }
    return 5 + tinyPhiLUT[(lmt - 1): int >> 1]: int(64);
  }

  // Chapel doesn't have closures, so emulate them with a class...
  class LegendrePi {
    var n: uint(64);
    var dom: domain(1);
    var oprms: [dom] uint(32);
    proc init(n: uint(64)) {      
      // first, an array of odd primes to the square root of n is generated...
      this.n = n;
      const sqrtn = sqrt(n: real(64)): int(64);
      const rtlmt = (sqrtn - 3) / 2; this.dom = {0 .. rtlmt};
      this.oprms = 0;
      for i in 0 .. rtlmt do this.oprms[i] = (i + i + 3): uint(32);
      var i = 0;
      for i in (0 ..) { // cull the array
        var ci = (i + i) * (i + 3) + 3; if ci > rtlmt { break; }
        const bp = i + i + 3;
        while (ci <= rtlmt) { this.oprms[ci] = 0; ci += bp; }
      }
      var psz = 0;
      for ti in 0 .. rtlmt { // compress the odd primes array...
        const tv = this.oprms[ti];
        if tv != 0 { this.oprms[psz] = tv; psz += 1; }
      }
      this.dom = { 0 ..< psz };
    }
    proc lvl(pilmt: int, m: uint(64)): int(64) {
      var acc = 0: int(64);
      for pi in cC ..< pilmt {
        const p = this.oprms[pi]: uint(64); const nm = m * p;
        if this.n <= nm * p { return acc + (pilmt - pi); }
        if pi > cC { acc -= this.lvl(pi, nm); }
        const q = this.n / nm; acc += tinyPhi(q);
      }
      return acc;
    }
    proc this(): int(64) {
      return tinyPhi(this.n) - this.lvl(this.oprms.size, 1)
               + this.oprms.size: int(64);
    }
  }
  return (new LegendrePi(lmt))();
}

Use of the above function gets a gain in speed of about a further ten times over the above version due to less recursion and the use of the "Tiny_Phi" Look Up Table (LUT) for smaller "degrees" of primes which greatly reduces the number of integer divisions. This version of prime counting function might be considered to be reasonably useful for counting primes up to a trillion (1e12) in a few tens of seconds.

The Non-Recursive Partial Sieving Algorithm

Just substitute the following Chapel code for the `countPrimes` function above to enjoy the further gain in speed:

Translation of: Nim
const masks = for i in 0 .. 7 do (1 << i): uint(8); // faster bit twiddling

proc countPrimes(lmt: uint(64)): int(64) {
  if lmt < 3 { if lmt < 2 { return 0; } else { return 1; } } // odds only!
  inline proc half(x: int): int { return (x - 1) >> 1; } // convenience function
  inline proc divide(nm: uint(64), d: uint(64)): int {
    return (nm: real(64) / d: real(64)): int; } // floating point div faster
  const sqrtn = sqrt(lmt: real(64)): uint(64);
  const mxndx = (sqrtn - 1): int / 2;
  const dom = {0 .. mxndx}; const csz = (mxndx + 8) / 8;
  var smalls = for i in dom do i: uint(32);
  var roughs = for i in dom do (i + i + 1): uint(32);
  var larges = for i in dom do ((lmt / (i + i + 1)) - 1) >> 1;
  var cullbuf: [0 ..< csz] uint(8);

  // partial sieve loop, adjusting larges/smalls, compressing larges/roughs...
  var nobps = 0; var rilmt = mxndx;
  for bp in 3: uint(64) .. by 2 {
    const i = (bp >> 1): int; const sqri = (i + i) * (i + 1);
    if sqri > mxndx { break; } // up to quad root of counting range
    if (cullbuf[i >> 3] & masks[i & 7]) != 0 { continue; } // loop not prime
    cullbuf[i >> 3] |= masks[i & 7]; // cull bp itself as not a rough
    for ci in sqri .. mxndx by bp { // do partial sieving pass for `bp`...
      cullbuf[ci >> 3] |= masks[ci & 7]; } // cull all multiples of `bp`

    // now adjust `larges` for latest partial sieve pass...
    var ori = 0; // compress input rough index to output one
    for iri in 0 .. rilmt {
      const r = roughs[iri]: uint(64); const rci = (r >> 1): int;
      if (cullbuf[rci >> 3] & masks[rci & 7]) != 0 {
        continue; } // skip culled roughs in last partial sieving pass
      const d = bp: uint(64) * r;
      larges[ori] = larges[iri] -
                      (if d <= sqrtn then
                         larges[smalls[(d >> 1): int] - nobps]
                       else smalls[half(divide(lmt, d))]: uint(64)) + nobps;
      roughs[ori] = r: uint(32); ori += 1;
    }

    var si = mxndx; // and adjust `smalls` for latest partial sieve pass...
    for bpm in bp .. (sqrtn / bp - 1) | 1 by -2 {
      const c = smalls[(bpm >> 1): int] - nobps: uint(32);
      const e = ((bpm * bp) >> 1): int;
      while si >= e { smalls[si] -= c; si -= 1; }
    }

    nobps += 1; rilmt = ori - 1;
  }

  var ans = larges[0]; // answer from larges, adjusting for over subtraction...
  for i in 1 .. rilmt { ans -= larges[i]; } // combine!
  ans += (rilmt + 1 + 2 * (nobps - 1)) * rilmt / 2; // adjust!

  // add final adjustment for pairs of current roughs to cube root of range...
  for ri in (1 ..) { // break when reaches cube root of counting range...
    const p = roughs[ri]: uint(64); const q = lmt / p;
    const ei = smalls[half(divide(q, p))]: int - nobps;
    if ei <= ri { break; } // break here when no more pairs!
    for ori in ri + 1 .. ei { // for all pairs never the same prime!
      ans += smalls[half(divide(q, roughs[ori]))]: int(64); }
    // adjust for over subtractions above...
    ans -= (ei - ri): uint(64) * (nobps: uint(64) + ri: uint(64) - 1);
  }

  return ans: int(64) + 1; // add one for only even prime of two!
}

The above function enjoys quite a large gain in speed of about a further ten times over the immediately preceding version due to not using recursion at all and the greatly reduced computational complexity of O(n**(3/4)/((log n)**2)) instead of the almost-linear-for-large-counting-ranges O(n/((log n)**2)) for the previous two versions, although the previous two versions differ in performance by a constant factor due to the overheads of recursion and division; for instance, this version can calculate the number of primes to 1e11 in about 22 milliseconds. This version of prime counting function might be considered to be reasonably useful for counting primes up to a 1e16 in under a minute as long as the computer used has the required free memory of about 800 Megabytes. This Chapel version is about the same speed as the Nim version from which it was translated other than the LLVM back-end used here produces code that is a few percent slower than as that produced by the GCC compiler used as a back-end by Nim, at least for this algorithm.

CoffeeScript

sorenson = require('sieve').primes  # Sorenson's extensible sieve from task: Extensible Prime Generator

# put in outer scope to avoid recomputing the cache
memoPhi = {}
primes = []

isqrt = (x) -> Math.floor Math.sqrt x

pi = (n) ->
    phi = (x, a) ->
        y = memoPhi[[x,a]]
        return y unless y is undefined

        memoPhi[[x,a]] =
            if a is 0 then x
            else
                p = primes[a - 1]
                throw "You need to generate at least #{a} primes." if p is undefined
                phi(x, a - 1) - phi(x // p, a - 1)

    if n < 2
        0
    else
        a = pi isqrt n
        phi(n, a) + a - 1

maxPi = 1e9
gen = sorenson()
primes = while (p = gen.next().value) < isqrt maxPi then p

n = 1
for i in [0..9]
    console.log "10^#{i}\t#{pi(n)}"
    n *= 10
Output:
10^0	0
10^1	4
10^2	25
10^3	168
10^4	1229
10^5	9592
10^6	78498
10^7	664579
10^8	5761455
10^9	50847534

Crystal

There doesn't seem to be much point to contributing memoized (Hash table) versions as they seem to be the result of the task author not realizing that there is a simple way of keeping the computation complexity from having exponential growth with counting range by limiting the "splitting" of the "phi" calculation when the result must be just one. Accordingly, the following Crystal versions are [translated from the non-memoizing Nim versions](https://rosettacode.org/wiki/Legendre_prime_counting_function#Non-Memoized_Versions), which explanations can be referenced for an understanding of how they work:

The Basic Recursive Legendre Algorithm

require "bit_array"

def count_primes(n : Int64)
  if n < 3_i64
    return 0_i64 if n < 2_i64
    return 1_i64
  end
  rtlmt = Math.sqrt(n.to_f64).to_i32
  mxndx = (rtlmt - 3) // 2
  cmpsts = BitArray.new(mxndx + 1)
  i = 0
  while true
    c = (i + i) * (i + 3) + 3
    break if c > mxndx
    unless cmpsts[i]
      bp = i + i + 3
      until c > mxndx
        cmpsts[c] = true
        c += bp
      end
    end
    i += 1
  end
  oprms = Array(Int32).new(cmpsts.count { |e| !e }, 0)
  pi = 0
  cmpsts.each_with_index do |e, i|
    unless e
      oprms[pi] = (i + i + 3).to_i32; pi += 1
    end
  end
  phi = uninitialized Proc(Int64, Int32, Int64) # recursion target!
  phi = ->(x : Int64, a : Int32) {
    return x - (x >> 1) if a < 1
    p = oprms.unsafe_fetch(a - 1)
    return 1_i64 if x <= p
    phi.call(x, a - 1) - phi.call((x.to_f64 / p.to_f64).to_i64, a - 1)
  }
  phi.call(n, oprms.size) + oprms.size
end

start_time = Time.monotonic
(0 .. 9).each { |i| puts "π(10**#{i}) = #{count_primes(10_i64**i)}" }
elpsd = (Time.monotonic - start_time).total_milliseconds

puts "This took #{elpsd} milliseconds."
Output:
π(10**0) = 0
π(10**1) = 4
π(10**2) = 25
π(10**3) = 168
π(10**4) = 1229
π(10**5) = 9592
π(10**6) = 78498
π(10**7) = 664579
π(10**8) = 5761455
π(10**9) = 50847534
This took 272.428954 milliseconds.

The time is as run on an Intel i5-6500 (3.6 GHz single-threaded boost).

Although this is about ten times faster than if it were using memoization (and uses much less memory), it is only reasonably usable for trivial ranges such as this (trivial in terms of prime counting functions).

The Less Recursive "Bottom-Up" with a TinyPhi LUT Algorithm

Just substitute the following Crystal code for the `count_primes` function above to enjoy the gain in speed:

Tiny_Phi_Primes = [ 2, 3, 5, 7, 11, 13 ]
Tiny_Phi_Odd_Circ = Tiny_Phi_Primes.product // 2
Tiny_Phi_Tot = Tiny_Phi_Primes.reduce(1) { |acc, p| acc * (p - 1) }
CC = Tiny_Phi_Primes.size - 1
def make_Tiny_Phi_LUT()
  rslt = Array(UInt16).new(Tiny_Phi_Odd_Circ, 1_u16)
  Tiny_Phi_Primes.skip(1).each { |bp|
      i = (bp - 1) >> 1; rslt[i] = 0; c = (i + i) * (i + 1)
      while c < Tiny_Phi_Odd_Circ
        rslt[c] = 0; c += bp
      end }
  acc = 0_u16; i = 0
  while i < Tiny_Phi_Odd_Circ
    acc += rslt[i]; rslt[i] = acc; i += 1
  end
  rslt
end
Tiny_Phi_LUT = make_Tiny_Phi_LUT()
@[AlwaysInline]
def tiny_Phi(x : Int64) : Int64
  ndx = (x - 1) >> 1; numtot = ndx // Tiny_Phi_Odd_Circ.to_i64
  tpli = ndx - numtot * Tiny_Phi_Odd_Circ.to_i64
  numtot * Tiny_Phi_Tot.to_i64 +
    Tiny_Phi_LUT.unsafe_fetch(tpli).to_i64
end

def count_primes(n : Int64)
  if n < 169_i64 # below 169 whose sqrt is 13 is where TinyPhi doesn't work...
    return 0_i64 if n < 2_i64
    return 1_i64 if n < 3_i64
    # adjust for the missing "degree" base primes
    return 1 + (n - 1) // 2 if n < 9_i64
    return (n - 1) // 2 if n <= 13_i64
    return 5 + Tiny_Phi_LUT[(n - 1).to_i32 // 2].to_i64
  end
  rtlmt = Math.sqrt(n.to_f64).to_i32
  mxndx = (rtlmt - 3) // 2
  cmpsts = BitArray.new(mxndx + 1)
  i = 0
  while true
    c = (i + i) * (i + 3) + 3
    break if c > mxndx
    unless cmpsts[i]
      bp = i + i + 3
      until c > mxndx
        cmpsts[c] = true
        c += bp
      end
    end
    i += 1
  end
  oprms = Array(Int32).new(cmpsts.count { |e| !e }, 0)
  opi = 0
  cmpsts.each_with_index do |e, i|
    unless e
      oprms[opi] = (i + i + 3).to_i32; opi += 1
    end
  end
  lvl = uninitialized Proc(Int32, Int32, Int64, Int64) # recursion target!
  lvl = ->(pilo : Int32, pilmt : Int32, m : Int64) : Int64 {
    pi = pilo; answr = 0_i64
    while pi < pilmt
      p = oprms.unsafe_fetch(pi).to_i64; nm = p * m
      return answr + (pilmt - pi) if n <= nm * p
      q = (n.to_f64 / nm.to_f64).to_i64; answr += tiny_Phi(q)
      answr -= lvl.call(CC, pi, nm) if pi > CC
      pi += 1
    end
    answr
  }
  tiny_Phi(n) - lvl.call(CC, oprms.size, 1_i64) + oprms.size
end

Use of the above function gets a gain in speed of about a further ten times over the above version due to less recursion, the use of the "Tiny_Phi" Look Up Table (LUT) for smaller "degrees" of primes which greatly reduces the number of integer divisions, and by using floating point division for integer division because it is faster, although it is limited in precision to 53-bits, one wouldn't want to use these algorithms over a range where that would cause a problem. This version of prime counting function might be considered to be reasonably useful for counting primes up to a trillion (1e12) in a few tens of seconds.

The Non-Recursive Partial Sieving Algorithm

Just substitute the following Crystal code for the `count_primes` function above to enjoy the further gain in speed:

def count_primes(n : Int64) : Int64
  if n < 3
    if n < 2
      return 0_i64
    else
      return 1_i64
    end
  end
  half = ->(n : Int64) : Int64 { (n - 1) >> 1 }
  divide = ->(n : Int64, d : Int64) : Int64 { (n.to_f64 / d.to_f64).to_i64 }
  rtlmt = Math.sqrt(n.to_f64).to_i32; mxndx = (rtlmt - 1) // 2
  cmpsts = BitArray.new(mxndx + 1)
  smalls = Array(Int32).new(mxndx + 1) { |i| i }
  roughs = Array(Int32).new(mxndx + 1) { |i| i + i + 1 }
  larges =
    Array(Int64).new(mxndx + 1) { |i| ((n // (i + i + 1)).to_i64 - 1) >> 1 }
  i = 1; nbps = 0; mxri = mxndx
  while true
    c = (i + i) * (i + 1); break if c > mxndx
    if !cmpsts.unsafe_fetch(i)
      bp = i + i + 1; cmpsts.unsafe_put(i, true)
      until c > mxndx
        cmpsts.unsafe_put(c, true); c += bp
      end # partial sieving for bp completed here!

      j = 0; ri = 0 # adjust `larges` according to partial sieve...
      while j <= mxri
        q = roughs.unsafe_fetch(j); qi = q >> 1
        if !cmpsts.unsafe_fetch(qi)
          d = bp.to_i64 * q.to_i64
          larges.unsafe_put(ri, larges.unsafe_fetch(j) -
                                  if d <= rtlmt.to_i64
                                    ndx = smalls.unsafe_fetch(d >> 1) - nbps
                                    larges.unsafe_fetch(ndx)
                                  else
                                    ndx = half.call(divide.call(n, d))
                                    smalls.unsafe_fetch(ndx)
                                  end + nbps)
          roughs.unsafe_put(ri, q); ri += 1
        end; j += 1
      end

      si = mxndx; bpm = (rtlmt // bp - 1) | 1
      while bpm >= bp # adjust smalls according to partial sieve...
        c = smalls.unsafe_fetch(bpm >> 1) - nbps; e = (bpm * bp) >> 1
        while si >= e
          smalls.unsafe_put(si, smalls.unsafe_fetch(si) - c); si -= 1
        end
        bpm -= 2
      end

      mxri = ri - 1; nbps += 1
    end; i += 1
  end

  ans = larges.unsafe_fetch(0); i = 1
  while i <= mxri # combine results; adjust for over subtraction base primes...
    ans -= larges.unsafe_fetch(i); i += 1
  end
  ans += (mxri.to_i64 + 1 + 2 * (nbps.to_i64 - 1)) * mxri.to_i64 // 2 # adjust!

  ri = 1 # do final phi calculation for pairs of larger primes...
  while true # break on condition when up to cube root of range!
    p = roughs.unsafe_fetch(ri).to_i64; q = n // p
    e = smalls.unsafe_fetch(half.call(divide.call(q, p))) - nbps
    break if e <= ri; ori = ri + 1
    while ori <= e
      ndx = half.call(divide.call(q, roughs.unsafe_fetch(ori).to_i64))
      ans += smalls.unsafe_fetch(ndx).to_i64; ori += 1
    end
    ans -= (e - ri).to_i64 * (nbps.to_i64 + ri.to_i64 - 1); ri += 1
  end

  ans + 1 # for only even prime of two!
end

The above function enjoys quite a large gain in speed of about a further ten times over the immediately preceding version due to not using recursion at all and the greatly reduced computational complexity of O(n**(3/4)/((log n)**2)) instead of the almost-linear-for-large-counting-ranges O(n/((log n)**2)) for the previous two versions, although the previous two versions differ in performance by a constant factor due to the overheads of recursion and division. This version of prime counting function might be considered to be reasonably useful for counting primes up to a 1e16 in almost two minutes as long as the computer used has the required free memory of about 800 Megabytes. This Crystal version is about twice as slow as the Nim version from which it was translated because Nim's default GCC back-end is better at optimization (at least for this algorithm) than the LLVM back-end used by Crystal and also there is more numeric calculation checking such as numeric overflow used in Crystal that can't be turned off. At that, this version is about the same speed as when translated to Rust, which also uses a LLVM back-end.

Elm

Currently, Elm does not have linear arrays and has no mutation whatsoever, even array content mutation protected by "monadic function chains" as for Haskell, so many algorithms can't be easily implemented such as the Legendre Partial Sieving algorithm which requires random access to linear arrays whose contents are mutated according to previous passes. However, given a prime number generator based on persistent data structures such as the List version [the Sieve of Eratosthenes task page], one can translate the algorithm of the "bottom-up" recursive version without the "TinyPhi" Look Up Table (LUT) optmization which would need a linear array to implement, as in the following code:

module Main exposing (main)

import Browser exposing (element)
import Task exposing (Task, succeed, perform, andThen)
import Html exposing (div, text)
import Time exposing (now, posixToMillis)

type CIS a = CIS a (() -> CIS a) -- infinite Co-Inductive Stream...

uptoCIS2List : comparable -> CIS comparable -> List comparable
uptoCIS2List n cis =
  let loop (CIS hd tl) lst =
        if hd > n then List.reverse lst
        else loop (tl()) (hd :: lst)
  in loop cis []

-- require a range of primes by Sieve of Eratosthenes...
primesTreeFolding : () -> CIS Int
primesTreeFolding() =
  let
    merge (CIS x xtl as xs) (CIS y ytl as ys) =
      case compare x y of
        LT -> CIS x <| \ () -> merge (xtl()) ys
        EQ -> CIS x <| \ () -> merge (xtl()) (ytl())
        GT -> CIS y <| \ () -> merge xs (ytl())
    pmult bp =
      let adv = bp + bp
          pmlt p = CIS p <| \ () -> pmlt (p + adv)
      in pmlt (bp * bp)
    allmlts (CIS bp bptl) =
      CIS (pmult bp) <| \ () -> allmlts (bptl())
    pairs (CIS frst tls) =
      let (CIS scnd tlss) = tls()
      in CIS (merge frst scnd) <| \ () -> pairs (tlss())
    cmpsts (CIS (CIS hd tl) tls) =
      CIS hd <| \ () -> merge (tl()) <| cmpsts <| pairs (tls())
    testprm n (CIS hd tl as cs) =
      if n < hd then CIS n <| \ () -> testprm (n + 2) cs
      else testprm (n + 2) (tl())
    oddprms() =
      CIS 3 <| \ () -> testprm 5 <| cmpsts <| allmlts <| oddprms()
  in CIS 2 <| \ () -> oddprms()

countPrimesTo : Float -> Float -- only use integral values!
countPrimesTo n =
  if n < 3 then if n < 2 then 0 else 1 else
  let nnf = toFloat (floor n) -- erase fractional part!
      sqrtn = sqrt nnf |> truncate
      oprms = primesTreeFolding() |> uptoCIS2List sqrtn |> List.drop 1
      opsz = List.length oprms
      lvl opi opilmt plst m acc =
        if opi >= opilmt then acc else
        case plst of
          [] -> acc -- should never happen
          (op :: optl) ->
            let opl = toFloat op
                nm = m * opl in
            if nnf <= nm * opl then acc + toFloat (opilmt - opi) else
            let q = nnf / nm |> floor |> toFloat
                nacc = acc + q - toFloat (floor (q / 2))
                sacc = if opi <= 0 then 0 else lvl 0 opi oprms nm 0
            in lvl (opi + 1) opilmt optl m (nacc - sacc)
  in nnf - toFloat (floor (nnf / 2)) - lvl 0 opsz oprms 1 0 + toFloat opsz

-- run the required task tests...
timemillis : () -> Task Never Int -- a side effect function
timemillis() = now |> andThen (\ t -> succeed (posixToMillis t))

test : () -> Cmd Msg -- side effect function chain (includes "perform")...
test() =
  timemillis()
    |> andThen (\ strt ->
       let rsltstrs = List.range 0 9 |> List.map ( \ n ->
                        "π(10^" ++ String.fromInt n ++ ") = " ++
                          String.fromFloat (countPrimesTo (toFloat (10^n))))
       in timemillis()
         |> andThen (\ stop ->
              succeed (List.append rsltstrs ["This took "
                         ++ String.fromInt (stop - strt)
                         ++ " milliseconds."])))
    |> perform Done

-- following code has to do with outputting to a web page using MUV/TEA...
type alias Model = List String

type Msg = Done Model

main : Program () Model Msg
main = -- starts with empty list of strings; views model of filled list...
  element { init = \ _ -> ( [], test() )
          , update = \ (Done mdl) _ -> ( mdl , Cmd.none )
          , subscriptions = \ _ -> Sub.none
          , view = div [] << List.map (div [] << List.singleton << text) }
Output:
π(10^0) = 0
π(10^1) = 4
π(10^2) = 25
π(10^3) = 168
π(10^4) = 1229
π(10^5) = 9592
π(10^6) = 78498
π(10^7) = 664579
π(10^8) = 5761455
π(10^9) = 50847534
This took 341 milliseconds.

Erlang

-mode(native).

-define(LIMIT, 1000000000).

main(_) ->
    put(primes, array:from_list(primality:sieve(floor(math:sqrt(?LIMIT))))),
    ets:new(memphi, [set, named_table, protected]),
    output(0, 9).

nthprime(N) -> array:get(N - 1, get(primes)).

output(A, B) -> output(A, B, 1).
output(A, B, _) when A > B -> ok;
output(A, B, N) ->
    io:format("10^~b    ~b~n", [A, pi(N)]),
    output(A + 1, B, N * 10).

pi(N) ->
    Primes = get(primes),
    Last = array:get(array:size(Primes) - 1, Primes),
    if
        N =< Last -> small_pi(N);
        true ->
            A = pi(floor(math:sqrt(N))),
            phi(N, A) + A - 1
    end.

phi(X, 0) -> X;
phi(X, A) ->
    case ets:lookup(memphi, {X, A}) of
        [] ->
            Phi = phi(X, A-1) - phi(X div nthprime(A), A-1),
            ets:insert(memphi, {{X, A}, Phi}),
            Phi;

        [{{X, A}, Phi}] -> Phi
    end.

% Use binary search to count primes that we have already listed.
small_pi(N) ->
    Primes = get(primes),
    small_pi(N, Primes, 0, array:size(Primes)).

small_pi(_, _, L, H) when L >= (H - 1) -> L + 1;
small_pi(N, Primes, L, H) ->
    M = (L + H) div 2,
    P = array:get(M, Primes),
    if
        N > P -> small_pi(N, Primes, M, H);
        N < P -> small_pi(N, Primes, 0, M);
        true -> M + 1
    end.
Output:
10^0    1
10^1    4
10^2    25
10^3    168
10^4    1229
10^5    9592
10^6    78498
10^7    664579
10^8    5761455
10^9    50847534

F#

There doesn't seem to be much point to contributing memoized (Hash table) versions as they seem to be the result of the task author not realizing that there is a simple way of keeping the computation complexity from having exponential growth with counting range by limiting the "splitting" of the "phi" calculation when the result must be just one. Accordingly, the following F# versions are [translated from the non-memoizing Nim versions](https://rosettacode.org/wiki/Legendre_prime_counting_function#Non-Memoized_Versions), which explanations can be referenced for an understanding of how they work:

The Basic Recursive Legendre Algorithm

Translation of: Nim
let countPrimes (lmt: uint64) =
  if lmt < 3UL then (if lmt < 2UL then 0L else 1L) else
  let sqrtlmt = lmt |> float |> sqrt |> uint64
  let mxndx = (sqrtlmt - 3UL) / 2UL |> int
  let oprms =
    let cb = Array.init (mxndx + 1) <| fun i -> uint32 (i + i + 3)
    let rec loopi i =
      let sqri = (i + i) * (i + 3) + 3
      if sqri > mxndx then () else
      if cb.[i] = 0u then loopi (i + 1) else
      let bp = i + i + 3
      let rec cull c = if c > mxndx then () else cb.[c] <- 0u; cull (c + bp)
      cull sqri; loopi (i + 1)
    loopi 0; cb |> Array.filter ((<>) 0u)
  let rec phi x a =
    if a <= 0 then x - (x >>> 1) |> int64 else
    let na = a - 1 in let p = uint64 oprms.[na]
    if x <= p then 1L else phi x na - phi (x / p) na
  phi lmt oprms.Length + int64 oprms.Length

let strt = System.DateTime.Now.Ticks

{ 0 .. 9 } |> Seq.iter (fun i ->
  printfn "π(10**%d) = %d" i (countPrimes (uint64(10. ** i))))

let elpsd = (System.DateTime.Now.Ticks - strt) / 10000L
printfn "This took %d milliseconds." elpsd
Output:
π(10**0) = 0
π(10**1) = 4
π(10**2) = 25
π(10**3) = 168
π(10**4) = 1229
π(10**5) = 9592
π(10**6) = 78498
π(10**7) = 664579
π(10**8) = 5761455
π(10**9) = 50847534
This took 525 milliseconds.

The above time as run on an Intel i5-6500 (3.6 GHz single-threaded boost) isn't particularly fast but includes some DotNet initialization and JIT compilation overhead, and is about as fast as a highly optimized page-segmented wheel-factorized Sieve of Eratosthenes; Although this is about ten times faster than if it were using memoization (and uses much less memory), it is only reasonably usable for trivial ranges such as this (trivial in terms of prime counting functions).

The Less Recursive "Bottom-Up" with a TinyPhi LUT Algorithm

Just substitute the following F# code for the `countPrimes` function above to enjoy the gain in speed:

Translation of: Nim
let TinyPhiPrimes = [| 2; 3; 5; 7; 11; 13 |]
let TinyPhiDeg = TinyPhiPrimes.Length - 1
let TinyPhiOddCirc = (TinyPhiPrimes |> Seq.reduce (*)) / 2
let TinyPhiTot = TinyPhiPrimes |> Seq.fold (fun s p -> s * (p - 1)) 1
let TinyPhiLUT =
  let cb = Array.init TinyPhiOddCirc (fun i -> 1u)
  TinyPhiPrimes |> Seq.skip 1
    |> Seq.iter (fun bp ->
      cb.[(bp - 1) >>> 1] <- 0u
      { (bp * bp - 1) >>> 1 .. bp .. TinyPhiOddCirc - 1}
        |> Seq.iter (fun c -> cb.[c] <- 0u) )
  let rec loopi i acc =
    if i >= TinyPhiOddCirc then () else
    let nacc = acc + cb[i] in cb.[i] <- nacc; loopi (i + 1) nacc
  loopi 0 0u; cb
let tinyPhi (x: uint64): int64 =
  let ndx = (x - 1UL) >>> 1 |> int64
  let numtots = ndx / int64 TinyPhiOddCirc
  let li = ndx - numtots * int64 TinyPhiOddCirc |> int
  numtots * int64 TinyPhiTot + int64 TinyPhiLUT.[li]

let countPrimes (lmt: uint64) =
  if lmt < 169UL then // below 169 whose sqrt is 13 is where TinyPhi doesn't work...
    ( if lmt < 2UL then 0L else
      if lmt < 3UL then 1L else
      // adjust for the missing "degree" base primes
      if lmt < 9UL then 1L + int64 (lmt - 1UL) / 2L else
      if lmt <= 13UL then int64 (lmt - 1UL) / 2L else
      5L + int64 TinyPhiLUT.[int (lmt - 1UL) / 2]) else
  let sqrtlmt = lmt |> float |> sqrt |> uint64
  let mxndx = (sqrtlmt - 3UL) / 2UL |> int
  let oprms =
    let cb = Array.init (mxndx + 1) <| fun i -> uint32 (i + i + 3)
    let rec loopi i =
      let sqri = (i + i) * (i + 3) + 3
      if sqri > mxndx then () else
      if cb.[i] = 0u then loopi (i + 1) else
      let bp = i + i + 3
      let rec cull c = if c > mxndx then () else cb.[c] <- 0u; cull (c + bp)
      cull sqri; loopi (i + 1)
    loopi 0; cb |> Array.filter ((<>) 0u)
  let rec lvl pilmt m =
    let rec looppi pi acc =
      if pi >= pilmt then acc else
      let p = uint64 oprms.[pi] in let nm = p * m
      if lmt <= nm * p then acc + int64 (pilmt - pi) else
      let nacc = if pi <= TinyPhiDeg then acc else acc - lvl pi nm
      looppi (pi + 1) (nacc + tinyPhi (lmt / nm))
    looppi TinyPhiDeg 0L
  tinyPhi lmt - lvl oprms.Length 1UL + int64 oprms.Length

Use of the above function gets a gain in speed of about a further ten times for far larger ranges than this over the above version due to less recursion, the use of the "Tiny_Phi" Look Up Table (LUT) for smaller "degrees" of primes which greatly reduces the number of integer divisions. This version of prime counting function might be considered to be reasonably useful for counting primes up to a trillion (1e12) in a few tens of seconds.

The Non-Recursive Partial Sieving Algorithm

Just substitute the following F# code for the `countPrimes` function above to enjoy the further gain in speed; note that rather than using a recursive function loop for the partial sieving culling pass as per the above two version, this version uses a Seq iteration which is slower but more concise in expression, but as sieving is a negligible part of the total execution time, it doesn't matter and conciseness was considered to be more important; this version may appear to be recursive to those unfamiliar with Functional Programming (FP) implementations but FP generally turns all function recursions in tail-call position (which all of these are) into the same compiled code as a simple loop in other languages:

Translation of: Nim
let masks = Array.init 8 ((<<<) 1uy) // quick bit twiddling

let countPrimes (lmt: uint64): int64 =
  if lmt < 3UL then (if lmt < 2UL then 0L else 1L) else
  let inline half x = (x - 1) >>> 1
  let inline divide nm d = (float nm) / (float d) |> int
  let sqrtlmt = lmt |> float |> sqrt |> uint64
  let mxndx = (sqrtlmt - 1UL) / 2UL |> int in let cbsz = (mxndx + 8) / 8
  let cullbuf = Array.zeroCreate cbsz
  let smalls = Array.init (mxndx + 1) uint32
  let roughs = Array.init (mxndx + 1) <| fun i -> uint32 (i + i + 1)
  let larges = Array.init (mxndx + 1) <| fun i ->
    int64 (lmt / uint64 (i + i + 1) - 1UL) / 2L 

  let rec loopbp bp nobps rilmt =
    let i = int (bp - 1UL) >>> 1 in let sqri = (i + i) * (i + 1)
    if sqri > mxndx then nobps, rilmt else
    if (cullbuf.[i >>> 3] &&& masks.[i &&& 7]) <> 0uy then
      loopbp (bp + 2UL) nobps rilmt else
    let w = i >>> 3 in cullbuf.[w] <- cullbuf.[w] ||| masks.[i &&& 7] // cull bp
    { sqri .. int bp .. mxndx } |> Seq.iter (fun c -> // cull multiples of bp...
       let w = c >>> 3 in cullbuf.[w] <- cullbuf.[w] ||| masks.[c &&& 7] )

    // adjust `larges for last partial loop pass;
    // compress larges/roughs for current partial sieve pass...
    let rec loopri iri ori =
      if iri > rilmt then ori - 1 else
      let r = uint64 roughs.[iri] in let sri = int (r >>> 1)
      if (cullbuf.[sri >>> 3] &&& masks.[sri &&& 7]) <> 0uy then
        loopri (iri + 1) ori else // skip for roughs culled this pass!
      let d = bp * r
      larges.[ori] <- larges.[iri] -
                        ( if d <= sqrtlmt then
                            larges.[int smalls.[int (d >>> 1)] - nobps]
                          else let ndx = (half << divide lmt) d
                               int64 smalls.[ndx] ) + int64 nobps
      roughs.[ori] <- uint32 r; loopri (iri + 1) (ori + 1)

    // adjust `smalls` for last partial loop pass...
    let rec loopbpm bpm mxsi =
      if bpm < bp then () else
      let c = smalls.[int (bpm >>> 1)] - uint32 nobps
      let ei = (bpm * bp) >>> 1 |> int
      let rec loopsi si =
        if si < ei then si else smalls.[si] <- smalls.[si] - c; loopsi (si - 1)
      loopbpm (bpm - 2UL) (loopsi mxsi)

    let nrilmt = loopri 0 0
    loopbpm ((sqrtlmt / bp - 1UL) ||| 1UL) mxndx
    loopbp (bp + 2UL) (nobps + 1) nrilmt

  // accumulate result so far; compensate for over subtraction...
  let numobps, mxri = loopbp 3UL 0 mxndx
  let rec smr i acc =
    if i > mxri then // adjust accumulated answer!
      acc + (int64 mxri + 1L + 2L * int64 (numobps - 1)) * int64 mxri / 2L
    else smr (i + 1) (acc - larges.[i])
  let ans0 = smr 1 larges.[0]

  // finally, add result from pairs of rough primes up to cube root of range,
  // where they are two different primes; compensating for over addition...
  let rec loopri ri acc =
    let p = uint64 roughs.[ri] in let q = lmt / p
    let ei = int smalls.[(half << divide q) p] - numobps
    if ei <= ri then acc else
    let rec loopori ori oacc =
      if ori > ei then oacc else
      let ndx = (half << divide q) (uint64 roughs.[ori])
      loopori (ori + 1) (oacc + int64 smalls.[ndx])
    let nacc = loopori (ri + 1) acc // subtract over addition of base primes:
    loopri (ri + 1) (nacc - int64 (ei - ri) * (int64 numobps + int64 ri - 1L))

  loopri 1 ans0 + 1L // add one for only even prime of two!

The above function enjoys quite a large gain in speed of about a further ten times over the immediately preceding version (although one can't see that gain for these trivial ranges as it is buried in the constant overheads) due to not using recursion at all and the greatly reduced computational complexity of O(n**(3/4)/((log n)**2)) instead of the almost-linear-for-large-counting-ranges O(n/((log n)**2)) for the previous two versions, although the previous two versions differ in performance by a constant factor due to the overheads of recursion and division; for instance, this version can calculate the number of primes to 1e11 in about a hundred milliseconds. This version of prime counting function might be considered to be reasonably useful for counting primes up to a 1e16 in about a hundred seconds as long as the computer used has the required free memory of about 800 Megabytes. This F# version is about twice as slow as the Nim version from which it was translated due to the benefits of native code compilation and more optimizations for Nim rather than JIT compilation for F#. At that, this version is about the same speed as when translated to Rust and Crystal once the range is increased where constant overheads aren't a factor, which both use native code compilers.

FreeBASIC

Non-memoized

Sub primeSieve(limit As Uinteger, primes() As Uinteger)
    Dim As Uinteger i, j
    Dim As Uinteger sieve(limit)
    
    For i = 2 To limit
        sieve(i) = 1
    Next
    
    For i = 2 To limit
        If sieve(i) Then
            Redim Preserve primes(Ubound(primes) + 1)
            primes(Ubound(primes)) = i
            For j = i * i To limit Step i
                sieve(j) = 0
            Next
        End If
    Next
End Sub

Function phi(x As Uinteger, a As Uinteger, primes() As Uinteger) As Uinteger
    If a <= 1 Then Return Iif(a < 1, x, x - (x \ 2))
    
    Dim As Uinteger pa = primes(a - 1)
    If x <= pa Then Return 1
    
    Return phi(x, a - 1, primes()) - phi(x \ pa, a - 1, primes())
End Function

Function pi(n As Uinteger) As Uinteger
    If n < 3 Then Return Iif(n < 2, 0, 1)
    
    Dim As Uinteger primes()
    primeSieve(Int(Sqr(n)), primes())
    
    Dim As Uinteger a = Ubound(primes) + 1   
    Return phi(n, a, primes()) + a - 1
End Function

Dim As Uinteger i, n = 1
For i = 0 To 9
    Print "10^" & i & "  " & pi(n)
    n *= 10
Next

Sleep
Output:
10^0  0
10^1  4
10^2  25
10^3  168
10^4  1229
10^5  9592
10^6  78498
10^7  664579
10^8  5761455
10^9  50847534

Go

Translation of: Wren
Library: Go-rcu

Memoized

This runs in about 1.9 seconds which is surprisingly slow for Go. However, Go has the same problem as Wren in not being able to use lists for map keys. I tried using a 2-element array for the key but this was a second slower than the Cantor function.

The alternative of sieving a billion numbers and then counting them takes about 4.8 seconds using the present library function though this could be cut to 1.2 seconds using a third party library such as primegen.

package main

import (
    "fmt"
    "log"
    "math"
    "rcu"
)

func cantorPair(x, y int) int {
    if x < 0 || y < 0 {
        log.Fatal("Arguments must be non-negative integers.")
    }
    return (x*x + 3*x + 2*x*y + y + y*y) / 2
}

func pi(n int) int {
    if n < 2 {
        return 0
    }
    if n == 2 {
        return 1
    }
    primes := rcu.Primes(int(math.Sqrt(float64(n))))
    a := len(primes)
    memoPhi := make(map[int]int)

    var phi func(x, a int) int // recursive closure
    phi = func(x, a int) int {
        if a < 1 {
            return x
        }
        if a == 1 {
            return x - (x >> 1)
        }
        pa := primes[a-1]
        if x <= pa {
            return 1
        }
        key := cantorPair(x, a)
        if v, ok := memoPhi[key]; ok {
            return v
        }
        memoPhi[key] = phi(x, a-1) - phi(x/pa, a-1)
        return memoPhi[key]
    }

    return phi(n, a) + a - 1
}

func main() {
    for i, n := 0, 1; i <= 9; i, n = i+1, n*10 {
        fmt.Printf("10^%d  %d\n", i, pi(n))
    }
}
Output:
10^0  0
10^1  4
10^2  25
10^3  168
10^4  1229
10^5  9592
10^6  78498
10^7  664579
10^8  5761455
10^9  50847534

Non-memoized

Much quicker than before (0.53 seconds) and, of course, uses less memory.

package main

import (
    "fmt"
    "math"
    "rcu"
)

func pi(n int) int {
    if n < 2 {
        return 0
    }
    if n == 2 {
        return 1
    }
    primes := rcu.Primes(int(math.Sqrt(float64(n))))
    a := len(primes)

    var phi func(x, a int) int // recursive closure
    phi = func(x, a int) int {
        if a < 1 {
            return x
        }
        if a == 1 {
            return x - (x >> 1)
        }
        pa := primes[a-1]
        if x <= pa {
            return 1
        }
        return phi(x, a-1) - phi(x/pa, a-1)
    }

    return phi(n, a) + a - 1
}

func main() {
    for i, n := 0, 1; i <= 9; i, n = i+1, n*10 {
        fmt.Printf("10^%d  %d\n", i, pi(n))
    }
}
Output:
Same as first version.

Iterative, partial sieving

Translation of: Vlang

This non-memoized, non-recursive version is much quicker than the first two versions, running in about 2 milliseconds which is the same as V.

package main

import (
    "fmt"
    "math"
    "time"
)

var masks = [8]uint8{1, 2, 4, 8, 16, 32, 64, 128}

func half(n int) int { return (n - 1) >> 1 }

func divide(nm, d uint64) int { return int(float64(nm) / float64(d)) }

func countPrimes(n uint64) int64 {
    if n < 9 {
        if n < 2 {
            return 0
        } else {
            return (int64(n) + 1) / 2
        }
    }
    rtlmt := int(math.Sqrt(float64(n)))
    mxndx := (rtlmt - 1) / 2
    arrlen := mxndx + 1
    smalls := make([]uint32, arrlen)
    roughs := make([]uint32, arrlen)
    larges := make([]int64, arrlen)
    for i := uint32(0); i < uint32(arrlen); i++ {
        smalls[i] = i
        roughs[i] = i + i + 1
        larges[i] = int64((n/uint64(i+i+1) - 1) / 2)
    }
    cullbuflen := (mxndx + 8) / 8
    cullbuf := make([]uint8, cullbuflen)
    nbps := 0
    rilmt := arrlen
    for i := 1; ; i++ {
        sqri := (i + i) * (i + 1)
        if sqri > mxndx {
            break
        }
        if (cullbuf[i>>3] & masks[i&7]) != 0 {
            continue
        }
        cullbuf[i>>3] |= masks[i&7]
        bp := i + i + 1
        for c := sqri; c < arrlen; c += bp {
            cullbuf[c>>3] |= masks[c&7]
        }
        nri := 0
        for ori := 0; ori < rilmt; ori++ {
            r := int(roughs[ori])
            rci := r >> 1
            if (cullbuf[rci>>3] & masks[rci&7]) != 0 {
                continue
            }
            d := r * bp
            t := int64(0)
            if d <= rtlmt {
                t = larges[int(smalls[d>>1])-nbps]
            } else {
                t = int64(smalls[half(divide(n, uint64(d)))])
            }
            larges[nri] = larges[ori] - t + int64(nbps)
            roughs[nri] = uint32(r)
            nri++
        }
        si := mxndx
        for pm := (rtlmt/bp - 1) | 1; pm >= bp; pm -= 2 {
            c := smalls[pm>>1]
            e := (pm * bp) >> 1
            for ; si >= e; si-- {
                smalls[si] -= (c - uint32(nbps))
            }
        }
        rilmt = nri
        nbps++
    }
    ans := larges[0] + int64(((rilmt + 2*(nbps-1)) * (rilmt - 1) / 2))
    for ri := 1; ri < rilmt; ri++ {
        ans -= larges[ri]
    }
    for ri := 1; ; ri++ {
        p := uint64(roughs[ri])
        m := n / p
        ei := int(smalls[half(int(m/p))]) - nbps
        if ei <= ri {
            break
        }
        ans -= int64((ei - ri) * (nbps + ri - 1))
        for sri := ri + 1; sri < ei+1; sri++ {
            ans += int64(smalls[half(divide(m, uint64(roughs[sri])))])
        }
    }
    return ans + 1
}

func main() {
    start := time.Now()
    for i, n := uint64(0), uint64(1); i <= 9; i, n = i+1, n*10 {
        fmt.Printf("10^%d  %d\n", i, countPrimes(n))
    }
    elapsed := time.Since(start).Microseconds()
    fmt.Printf("\nTook %d microseconds\n", elapsed)
}
Output:
10^0  0
10^1  4
10^2  25
10^3  168
10^4  1229
10^5  9592
10^6  78498
10^7  664579
10^8  5761455
10^9  50847534

Took 1862 microseconds

Haskell

With Memoization:

{-# OPTIONS_GHC -O2 -fllvm -Wno-incomplete-patterns #-}
{-# LANGUAGE DeriveFunctor #-}

import Data.Time.Clock.POSIX ( getPOSIXTime ) -- for timing

import Data.Int ( Int64 )
import Data.Bits ( Bits( shiftL, shiftR ) )

data Memo a = EmptyNode | Node a (Memo a) (Memo a)
  deriving Functor

memo :: Integral a => Memo p -> a -> p
memo (Node a l r) n
  | n == 0 = a
  | odd n = memo l (n `div` 2)
  | otherwise = memo r (n `div` 2 - 1)

nats :: Integral a => Memo a
nats = Node 0 ((+1).(*2) <$> nats) ((*2).(+1) <$> nats)

memoize :: Integral a => (a -> b) -> a -> b
memoize f = memo (f <$> nats)

memoize2 :: (Integral a, Integral b) => (a -> b -> c) -> a -> b -> c
memoize2 f = memoize (memoize . f)

memoList :: [b] -> Integer -> b
memoList = memo . mkList
  where
    mkList []     = EmptyNode -- never used; makes complete
    mkList (x:xs) = Node x (mkList l) (mkList r)
      where (l,r) = split xs
            split [] = ([],[])
            split [x] = ([x],[])
            split (x:y:xs) = let (l,r) = split xs in (x:l, y:r)

isqrt :: Integer -> Integer
isqrt n = go n 0 (q `shiftR` 2)
 where
   q = head $ dropWhile (< n) $ iterate (`shiftL` 2) 1
   go z r 0 = r
   go z r q = let t = z - r - q
              in if t >= 0
                 then go t (r `shiftR` 1 + q) (q `shiftR` 2)
                 else go z (r `shiftR` 1) (q `shiftR` 2)

primes :: [Integer]
primes = 2 : _Y ((3:) . gaps 5 . _U . map(\p-> [p*p, p*p+2*p..])) where
  _Y g = g (_Y g)  -- = g (g (g ( ... )))   non-sharing multistage fixpoint combinator
  gaps k s@(c:cs) | k < c     = k : gaps (k+2) s  -- ~= ([k,k+2..] \\ s)
                  | otherwise =     gaps (k+2) cs --   when null(s\\[k,k+2..]) 
  _U ((x:xs):t) = x : (merge xs . _U . pairs) t   -- tree-shaped folding big union
  pairs (xs:ys:t) = merge xs ys : pairs t
  merge xs@(x:xt) ys@(y:yt) | x < y     = x : merge xt ys
                            | y < x     = y : merge xs yt
                            | otherwise = x : merge xt yt

phi :: Integer -> Integer -> Integer
phi = memoize2 phiM
  where
    phiM x 0 = x
    phiM x a = phi x (a-1) - phi (x `div` p a) (a - 1)

    p = memoList (undefined : primes)

legendrePi :: Integer -> Integer
legendrePi n
  | n < 2 = 0
  | otherwise = phi n a + a - 1
    where a = legendrePi (floor (sqrt (fromInteger n)))

main :: IO ()
main = do
  strt <- getPOSIXTime
  mapM_ (\n -> putStrLn $ show n ++ "\t" ++ show (legendrePi (10^n))) [0..9]
  stop <- getPOSIXTime
  let elpsd = round $ 1e3 * (stop - strt) :: Int64
  putStrLn $ "This last took " ++ show elpsd ++ " milliseconds."
0       0
1       4
2       25
3       168
4       1229
5       9592
6       78498
7       664579
8       5761455
9       50847534
This last took 20383 milliseconds.

The above code reveals all of the problems with the usual Haskell developer with overuse of the multi-precision `Integer` type when there is no way that the range of a 64-bit integer will ever be exceeded, overuse of lists, developed and tested using the interpreter, etc., such that, even compiled using the LLVM back-end, the above code is about ten times slower that any other compiled language for a memoized version.

Non Memoized Haskell Versions

There doesn't seem to be much point to improving the above memoized version even though one could gain a factor of about ten in doing so as the need for memoization seems to be the result of the task author not realizing that there is a simple way of keeping the computation complexity from having exponential growth with counting range by limiting the "splitting" of the "phi" calculation when the result must be just one. Accordingly, the following Haskell versions are [translated from the non-memoizing Nim versions](https://rosettacode.org/wiki/Legendre_prime_counting_function#Non-Memoized_Versions), which explanations can be referenced for an understanding of how they work:

The Basic Recursive Legendre Algorithm

{-# OPTIONS_GHC -O2 -fllvm #-}
{-# LANGUAGE FlexibleContexts, BangPatterns #-}

import Data.Time.Clock.POSIX ( getPOSIXTime ) -- for timing

import Data.Int ( Int64 )
import Data.Word ( Word32, Word64 )
import Data.Bits ( Bits( (.&.), (.|.), shiftL, shiftR ) )
import Control.Monad ( unless, when, forM_ )
import Control.Monad.ST ( ST, runST )
import Data.Array.ST ( runSTUArray )
import Data.Array.Base ( UArray(..), IArray(unsafeAt), listArray, elems, assocs,
                         MArray( unsafeNewArray_, newArray, unsafeRead, unsafeWrite ),
                         STUArray,  unsafeFreezeSTUArray, castSTUArray )

countPrimes :: Word64 -> Int64
countPrimes n =
  if n < 3 then (if n < 2 then 0 else 1) else
  let sqrtn = truncate $ sqrt $ fromIntegral n
      qdrtn = truncate $ sqrt $ fromIntegral sqrtn
      rtlmt = (sqrtn - 3) `div` 2
      qrtlmt = (qdrtn - 3) `div` 2
      oddPrimes@(UArray _ _ psz _) = runST $ do -- UArray of odd primes...
        cmpsts <- newArray (0, rtlmt) False :: ST s (STUArray s Int Bool)
        forM_ [ 0 .. qrtlmt ] $ \ i -> do
          t <- unsafeRead cmpsts i
          unless t $ do
            let sqri = (i + i) * (i + 3) + 3
                bp = i + i + 3
            forM_ [ sqri, sqri + bp .. rtlmt ] $ \ c ->
              unsafeWrite cmpsts c True
        fcmpsts <- unsafeFreezeSTUArray cmpsts
        let !numoprms = sum $ [ 1 | False <- elems fcmpsts ]
            prms = [ fromIntegral $ i + i + 3 | (i, False) <- assocs fcmpsts ]
        return $ listArray (0, numoprms - 1) prms :: ST s (UArray Int Word32)
      phi x a =
        if a < 1 then x - (x `shiftR` 1) else
        let na = a - 1
            p = fromIntegral $ unsafeAt oddPrimes na in
        if x < p then 1 else
        phi x na - phi (x `div` p) na

  in fromIntegral (phi n psz) + fromIntegral psz

main :: IO ()
main = do
  strt <- getPOSIXTime
  mapM_ (\n -> putStrLn $ show n ++ "\t" ++ show (countPrimesx (10^n))) [ 0 .. 9 ]
  stop <- getPOSIXTime
  let elpsd = round $ 1e3 * (stop - strt) :: Int64
  putStrLn $ "This took " ++ show elpsd ++ " milliseconds."
Output:
0       0
1       4
2       25
3       168
4       1229
5       9592
6       78498
7       664579
8       5761455
9       50847534
This took 273 milliseconds.

The time is as run on an Intel i5-6500 (3.6 GHz single-threaded boost).

Although this is much faster than if it were using memoization (and uses much less memory), it is only reasonably usable for trivial ranges such as this (trivial in terms of prime counting functions).

The Less Recursive "Bottom-Up" with a TinyPhi LUT Algorithm

Just substitute the following Haskell code for the `countPrimes` function above to enjoy the gain in speed:

cTinyPhiPrimes :: [Int]
cTinyPhiPrimes = [ 2, 3, 5, 7, 11, 13 ]
cC :: Int
cC = length cTinyPhiPrimes - 1
cTinyPhiOddCirc :: Int
cTinyPhiOddCirc = product cTinyPhiPrimes `div` 2
cTinyPhiTot :: Int
cTinyPhiTot = product [ p - 1 | p <- cTinyPhiPrimes ]
cTinyPhiLUT :: UArray Int Word32
cTinyPhiLUT = runSTUArray $ do
  ma <- newArray (0, cTinyPhiOddCirc - 1) 1
  forM_ (drop 1 cTinyPhiPrimes) $ \ bp -> do
    let i = (bp - 1) `shiftR` 1
    let sqri = (i + i) * (i + 1)
    unsafeWrite ma i 0
    forM_ [ sqri, sqri + bp .. cTinyPhiOddCirc - 1 ] $ \ c -> unsafeWrite ma c 0
  let tot i acc =
        if i >= cTinyPhiOddCirc then return ma else do
          v <- unsafeRead ma i
          if v == 0 then do unsafeWrite ma i acc; tot (i + 1) acc
          else do let nacc = acc + 1
                  unsafeWrite ma i nacc; tot (i + 1) nacc
  tot 0 0
tinyPhi :: Word64 -> Int64
tinyPhi n =
  let on = (n - 1) `shiftR` 1
      numcyc = on `div` fromIntegral cTinyPhiOddCirc
      rem = fromIntegral $ on - numcyc * fromIntegral cTinyPhiOddCirc
  in fromIntegral numcyc * fromIntegral cTinyPhiTot +
       fromIntegral (unsafeAt cTinyPhiLUT rem)

countPrimes :: Word64 -> Int64
countPrimes n =
  if n < 3 then (if n < 2 then 0 else 1) else
  let sqrtn = truncate $ sqrt $ fromIntegral n
      qdrtn = truncate $ sqrt $ fromIntegral sqrtn
      rtlmt = (sqrtn - 3) `div` 2
      qrtlmt = (qdrtn - 3) `div` 2
      oddPrimes@(UArray _ _ psz _) = runST $ do -- UArray of odd primes...
        cmpsts <- newArray (0, rtlmt) False :: ST s (STUArray s Int Bool)
        forM_ [ 0 .. qrtlmt ] $ \ i -> do
          t <- unsafeRead cmpsts i
          unless t $ do
            let sqri = (i + i) * (i + 3) + 3
                bp = i + i + 3
            forM_ [ sqri, sqri + bp .. rtlmt ] $ \ c ->
              unsafeWrite cmpsts c True
        fcmpsts <- unsafeFreezeSTUArray cmpsts
        let !numoprms = sum $ [ 1 | False <- elems fcmpsts ]
            prms = [ fromIntegral $ i + i + 3 | (i, False) <- assocs fcmpsts ]
        return $ listArray (0, numoprms - 1) prms :: ST s (UArray Int Word32)
      lvl pi pilmt !m !acc =
        if pi >= pilmt then acc else
        let p = fromIntegral $ unsafeAt oddPrimes pi
            nm = m * p in
        if n <= nm * p then acc + fromIntegral (pilmt - pi) else
        let !q = fromIntegral $ n `div` nm
            !nacc = acc + tinyPhi q
            !sacc = if pi <= cC then 0 else lvl cC pi nm 0
        in lvl (pi + 1) pilmt m $ nacc - sacc

  in  tinyPhi n - lvl cC psz 1 0 + fromIntegral psz

Use of the above function gets a gain in speed of about a further fifteen times over the above version due to less recursion and the use of the "Tiny_Phi" Look Up Table (LUT) for smaller "degrees" of primes which greatly reduces the number of integer divisions. This version of prime counting function might be considered to be reasonably useful for counting primes up to a trillion (1e12) in a little over ten seconds.

The Non-Recursive Partial Sieving Algorithm

Just substitute the following Haskell code for the `countPrimes` function above to enjoy the further gain in speed; this version may appear to be recursive to those unfamiliar with Functional Programming (FP) implementations but FP generally turns all function recursions in tail-call position (which all of these are) into the same compiled code as a simple loop in other languages:

countPrimes :: Word64 -> Int64
countPrimes n =
  if n < 3 then (if n < 2 then 0 else 1) else
  let
    {-# INLINE divide #-}
    divide :: Word64 -> Word64 -> Int
    divide nm d = truncate $ (fromIntegral nm :: Double) / fromIntegral d
    {-# INLINE half #-}
    half :: Int -> Int
    half x = (x - 1) `shiftR` 1
    rtlmt = floor $ sqrt (fromIntegral n :: Double) 
    mxndx = (rtlmt - 1) `div` 2
    (!nbps, !nrs, !smalls, !roughs, !larges) = runST $ do
      mss <- unsafeNewArray_ (0, mxndx) :: ST s (STUArray s Int Word32)
      let msscst =
            castSTUArray :: STUArray s Int Word32 -> ST s (STUArray s Int Int64)
      mdss <- msscst mss -- for use in adjing counts LUT
      forM_ [ 0 .. mxndx ] $ \ i -> unsafeWrite mss i (fromIntegral i)
      mrs <- unsafeNewArray_ (0, mxndx) :: ST s (STUArray s Int Word32)
      forM_ [ 0 .. mxndx ] $ \ i -> unsafeWrite mrs i (fromIntegral i * 2 + 1)
      mls <- unsafeNewArray_ (0, mxndx) :: ST s (STUArray s Int Int64)
      forM_ [ 0 .. mxndx ] $ \ i ->
        let d = fromIntegral (i + i + 1)
        in unsafeWrite mls i (fromIntegral (divide n d - 1) `div` 2)
      cmpsts <- unsafeNewArray_ (0, mxndx) :: ST s (STUArray s Int Bool)

      let loop i !cbpi !rlmti =
            let sqri = (i + i) * (i + 1) in
            if sqri > mxndx then do
              fss <- unsafeFreezeSTUArray mss
              frs <- unsafeFreezeSTUArray mrs
              fls <- unsafeFreezeSTUArray mls
              return (cbpi, rlmti + 1, fss, frs, fls)
            else do
              v <- unsafeRead cmpsts i
              if v then loop (i + 1) cbpi rlmti else do
                unsafeWrite cmpsts i True -- cull current bp so not a "k-rough"!
                let bp = i + i + 1
                    -- partial cull by current base prime, bp...
                    cull c = if c > mxndx then return () else do
                               unsafeWrite cmpsts c True; cull (c + bp)

                    -- adjust `larges` according to partial sieve...
                    part ri nri = -- old "rough" index to new one...
                      if ri > rlmti then return (nri - 1) else do
                        r <- unsafeRead mrs ri -- "rough" always odd!
                        t <- unsafeRead cmpsts (fromIntegral r `shiftR` 1)
                        if t then part (ri + 1) nri else do -- skip newly culled
                          olv <- unsafeRead mls ri
                          let m = fromIntegral r * fromIntegral bp
                          adjv <- if m <= fromIntegral rtlmt then do
                                    let ndx = fromIntegral m `shiftR` 1
                                    sv <- unsafeRead mss ndx
                                    unsafeRead mls (fromIntegral sv - cbpi)
                                  else do
                                    sv <- unsafeRead mss (half (divide n m))
                                    return (fromIntegral sv)
                          unsafeWrite mls nri (olv - (adjv - fromIntegral cbpi))
                          unsafeWrite mrs nri r; part (ri + 1) (nri + 1)
                    !pm0 = ((rtlmt `div` bp) - 1) .|. 1 -- max base prime mult

                    adjc lmti pm = -- adjust smalls according to partial sieve:
                      if pm < bp then return () else do
                        c <- unsafeRead mss (pm `shiftR` 1)
                        let ac = c - fromIntegral cbpi -- correction
                            bi = (pm * bp) `shiftR` 1 -- start array index
                            adj si = if si > lmti then adjc (bi - 1) (pm - 2)
                                     else do ov <- unsafeRead mss si
                                             unsafeWrite mss si (ov - ac)
                                             adj (si + 1)
                        adj bi
                    dadjc lmti pm =
                      if pm < bp then return () else do
                        c <- unsafeRead mss (pm `shiftR` 1)
                        let ac = c - fromIntegral cbpi -- correction
                            bi = (pm * bp) `shiftR` 1 -- start array index
                            ac64 = fromIntegral ac :: Int64
                            dac = (ac64 `shiftL` 32) .|. ac64
                            dbi = (bi + 1) `shiftR` 1
                            dlmti = (lmti - 1) `shiftR` 1
                            dadj dsi = if dsi > dlmti then return ()
                                      else do dov <- unsafeRead mdss dsi
                                              unsafeWrite mdss dsi (dov - dac)
                                              dadj (dsi + 1)
                        when (bi .&. 1 /= 0) $ do
                          ov <- unsafeRead mss bi
                          unsafeWrite mss bi (ov - ac)
                        dadj dbi
                        when (lmti .&. 1 == 0) $ do
                          ov <- unsafeRead mss lmti
                          unsafeWrite mss lmti (ov - ac)
                        adjc (bi - 1) (pm - 2)
                cull sqri; nrlmti <- part 0 0
                dadjc mxndx pm0
                loop (i + 1) (cbpi + 1) nrlmti
      loop 1 0 mxndx

    !ans0 = unsafeAt larges 0 - -- combine all counts; each includes nbps...
              sum [ unsafeAt larges i | i <- [ 1 .. nrs - 1 ] ]
    -- adjust for all the base prime counts subracted above...
    !adj = (nrs + 2 * (nbps - 1)) * (nrs - 1) `div` 2
    !adjans0 = ans0 + fromIntegral adj

    loopr ri !acc = -- o final phi calculation for pairs of larger primes...
      let r = fromIntegral (unsafeAt roughs ri)
          q = n `div` r
          lmtsi = half (fromIntegral (q `div` r))
          lmti = fromIntegral (unsafeAt smalls lmtsi) - nbps
          addcnt pi !ac =
            if pi > lmti then ac else
            let p = fromIntegral (unsafeAt roughs pi)
                ci = half (fromIntegral (divide q p))
            in addcnt (pi + 1) (ac + fromIntegral (unsafeAt smalls ci))
      in if lmti <= ri then acc else -- break when up to cube root of range!
         -- adjust for the `nbps`'s over added in the `smalls` counts...
         let !adj = fromIntegral ((lmti - ri) * (nbps + ri - 1))
         in loopr (ri + 1) (addcnt (ri + 1) acc - adj)
  in loopr 1 adjans0 + 1 -- add one for only even prime of two!

The above function enjoys quite a large gain in speed of about a further ten times over the immediately preceding version due to not using recursion at all and the greatly reduced computational complexity of O(n**(3/4)/((log n)**2)) instead of the almost-linear-for-large-counting-ranges O(n/((log n)**2)) for the previous two versions, although the previous two versions differ in performance by a constant factor due to the overheads of recursion and division. This version of prime counting function might be considered to be reasonably useful for counting primes up to 1e16 in just over a minute as long as the computer used has the required free memory of about 800 Megabytes, counting the primes to 1e11 in under 30 milliseconds. This Haskell version is about twenty percent slower than the Nim version from which it was translated because Nim's default GCC back-end is better at optimization (at least for this algorithm) than the LLVM back-end used by Haskell here.

J

This is "almost a primitive" in J:

   require'format/printf'
   {{echo '10^%d: %d' sprintf y;p:inv 10^y}}"0 i.10
10^0: 0
10^1: 4
10^2: 25
10^3: 168
10^4: 1229
10^5: 9592
10^6: 78498
10^7: 664579
10^8: 5761455
10^9: 50847534

A complete implementation of the legendre prime counting function would be (1&p: + p:inv):

   (1&p: + p:inv) 10
4
   (1&p: + p:inv) 11
5

But we can simplify that to p:inv when we know that primes are not being tested.

Java

import java.util.*;

public class LegendrePrimeCounter {
    public static void main(String[] args) {
        LegendrePrimeCounter counter = new LegendrePrimeCounter(1000000000);
        for (int i = 0, n = 1; i < 10; ++i, n *= 10)
            System.out.printf("10^%d\t%d\n", i, counter.primeCount((n)));
    }

    private List<Integer> primes;

    public LegendrePrimeCounter(int limit) {
        primes = generatePrimes((int)Math.sqrt((double)limit));
    }

    public int primeCount(int n) {
        if (n < 2)
            return 0;
        int a = primeCount((int)Math.sqrt((double)n));
        return phi(n, a) + a - 1;
    }

    private int phi(int x, int a) {
        if (a == 0)
            return x;
        if (a == 1)
            return x - (x >> 1);
        int pa = primes.get(a - 1);
        if (x <= pa)
            return 1;
        return phi(x, a - 1) - phi(x / pa, a - 1);
    }

    private static List<Integer> generatePrimes(int limit) {
        boolean[] sieve = new boolean[limit >> 1];
        Arrays.fill(sieve, true);
        for (int p = 3, s = 9; s < limit; p += 2) {
            if (sieve[p >> 1]) {
                for (int q = s; q < limit; q += p << 1)
                    sieve[q >> 1] = false;
            }
            s += (p + 1) << 2;
        }
        List<Integer> primes = new ArrayList<>();
        if (limit > 2)
            primes.add(2);
        for (int i = 1; i < sieve.length; ++i) {
            if (sieve[i])
                primes.add((i << 1) + 1);
        } 
        return primes;
    }
}
Output:
10^0	0
10^1	4
10^2	25
10^3	168
10^4	1229
10^5	9592
10^6	78498
10^7	664579
10^8	5761455
10^9	50847534

JavaScript

There doesn't seem to be much point to contributing memoized (Hash table) versions as they seem to be the result of the task author not realizing that there is a simple way of keeping the computation complexity from having exponential growth with counting range by limiting the "splitting" of the "phi" calculation when the result must be just one. Accordingly, the following JavaScript versions are [translated and improved from the non-memoizing Nim versions](https://rosettacode.org/wiki/Legendre_prime_counting_function#Non-Memoized_Versions), which explanations can be referenced for an understanding of how they work:

The Basic Recursive Legendre Algorithm

Translation of: Nim
"use strict";

function countPrimesTo(lmt) {
  if (lmt < 3) { if (lmt < 2) return 0; else return 1; }
  const sqrtlmt = Math.sqrt(lmt) >>> 0;
  const oprms = function() {
    const mxndx = (sqrtlmt - 3) >>> 1;
    const arr = new Float64Array(mxndx + 1);
    for (let i = 0 >>> 0; i <= mxndx; ++i) arr[i] = (i + i + 3) >>> 0;
    let bp = 3 >>> 0;
    while (true) {
      let i = (bp - 3) >>> 1; let sqri = ((i + i) * (i + 3) + 3) >>> 0;
      if (sqri > mxndx) break;
      if (arr[i] != 0) for (; sqri <= mxndx; sqri += bp) arr[sqri] = 0;
      bp += 2;
    }
    return arr.filter(v => v != 0);
  }();
  function phi(x, a) {
    if (a <= 0) return x - Math.trunc(x / 2);
    const na = (a - 1) >>> 0; const p = oprms[na];
    if (x <= p) return 1;
    return phi(x, na) - phi(Math.trunc(x / p), na);
  }
  return phi(lmt, oprms.length) + oprms.length;
}

const start = Date.now();
for (let i = 0; i <= 9; ++i) console.log(`π(10**${i}) =`, countPrimesTo(10**i));
const elpsd = Date.now() - start;
console.log("This took", elpsd, "milliseconds.")
Output:
π(10**0) = 0
π(10**1) = 4
π(10**2) = 25
π(10**3) = 168
π(10**4) = 1229
π(10**5) = 9592
π(10**6) = 78498
π(10**7) = 664579
π(10**8) = 5761455
π(10**9) = 50847534
This took 712 milliseconds.

The time is as run on an Intel i5-6500 (3.6 GHz single-threaded boost) using node.js version 17.3.1.

Although this is about ten times faster than if it were using memoization (and uses much less memory), it is only reasonably usable for trivial ranges such as this (trivial in terms of prime counting functions).

The Less Recursive "Bottom-Up" with a TinyPhi LUT Algorithm

Just substitute the following JavaScript code for the `countPrimesTo` function above to enjoy the gain in speed:

Translation of: Nim
const TinyPhiPrimes = [ 2, 3, 5, 7, 11, 13 ];
const TinyPhiOddDegree = TinyPhiPrimes.length - 1;
const TinyPhiOddCirc = TinyPhiPrimes.reduce((acc, p) => acc * p) / 2;
const TinyPhiTot = TinyPhiPrimes.reduce((acc, p) => acc * (p - 1), 1)
const TinyPhiLUT = function() {
  const arr = new Uint16Array(TinyPhiOddCirc);  arr.fill(1);
  for (const p of TinyPhiPrimes) {
    if (p <= 2) continue; arr[p >> 1] = 0;
    for (let c = (p * p) >> 1; c < TinyPhiOddCirc; c += p) arr[c] = 0 >>> 0; }
  for (let i = 0 | 0, acc = 0 | 0; i < TinyPhiOddCirc; ++i) {
    acc += arr[i]; arr[i] = acc; }
  return arr; }();
function tinyPhi(x) {
  const ndx = Math.trunc(( x - 1) / 2);
  const numtots = Math.trunc(ndx / TinyPhiOddCirc);
  const rem = (ndx - numtots * TinyPhiOddCirc) >>> 0;
  return numtots * TinyPhiTot + TinyPhiLUT[rem];
}

function countPrimesTo(lmt) {
  if (lmt < 169) {
    if (lmt < 3) { if (lmt < 2) return 0; else return 1; }
    // adjust for the missing "degree" base primes
    if (lmt <= 13) return ((lmt - 1) >>> 1) + (lmt < 9 ? 1 : 0);
    return 5 + TinyPhiLUT[(lmt - 1) >>> 1];
  }
  const sqrtlmt = Math.sqrt(lmt) >>> 0;
  const oprms = function() {
    const mxndx = (sqrtlmt - 3) >>> 1;
    const arr = new Float64Array(mxndx + 1);
    for (let i = 0 >>> 0; i <= mxndx; ++i) arr[i] = (i + i + 3) >>> 0;
    let bp = 3 >>> 0;
    while (true) {
      let i = (bp - 3) >>> 1; let sqri = ((i + i) * (i + 3) + 3) >>> 0;
      if (sqri > mxndx) break;
      if (arr[i] != 0) for (; sqri <= mxndx; sqri += bp) arr[sqri] = 0;
      bp += 2;
    }
    return arr.filter(v => v != 0); }();
  function lvl(pilmt, m) {
    let ans = 0;
    for (let pi = TinyPhiOddDegree; pi < pilmt; ++pi) {
      const p = oprms[pi]; const nm = m * p;
      if (lmt <= nm * p) return ans + pilmt - pi;
      ans += tinyPhi(Math.trunc(lmt / nm));
      if (pi > TinyPhiOddDegree) ans -= lvl(pi, nm);
    }
    return ans;
  }
  return tinyPhi(lmt) - lvl(oprms.length, 1) + oprms.length;
}

Use of the above function gets a gain in speed of about a further fifteen times over the above version due to less recursion and the use of the "Tiny_Phi" Look Up Table (LUT) for smaller "degrees" of primes which greatly reduces the number of integer divisions. This version of prime counting function might be considered to be reasonably useful for counting primes up to a trillion (1e12) in a few tens of seconds.

The Non-Recursive Partial Sieving Algorithm

Just substitute the following JavaScript code for the `countPrimesTo` function above to enjoy the gain in speed:

Translation of: Nim
const masks = new Uint8Array(8); // faster than bit twiddling!
for (let i = 0; i < 8; ++i) masks[i] = (1 << i) >>> 0;

function countPrimesTo(lmt) {
  if (lmt < 3) { if (lmt < 2) return 0; else return 1; }
  function half(x) { return (x - 1) >>> 1; }
  function divide(nm, d) { return (nm / d) >>> 0; }
  const sqrtlmt = Math.trunc(Math.sqrt(lmt));
  const mxndx = (sqrtlmt - 1) >>> 1; const cbsz = (mxndx + 8) >>> 3;
  const cullbuf = new Uint8Array(cbsz);
  const smalls = new Uint32Array(mxndx + 1);
  for (let i = 0; i <= mxndx; ++i) smalls[i] = i >>> 0;
  const roughs = new Uint32Array(mxndx + 1);
  for (let i = 0; i <= mxndx; ++i) roughs[i] = (i + i + 1) >>> 0;
  const larges = new Float64Array(mxndx + 1);
  for (let i = 0; i <= mxndx; ++i) larges[i] =
    Math.trunc((Math.trunc(lmt / (i + i + 1)) - 1) / 2);

  // partial sieve loop, adjusting larges/smalls, compressing larges/roughs...
  let nobps = 0 >>> 0; let rilmt = mxndx; let bp = 3 >>> 0;
  while (true) { // break when square root is reached
    const i = bp >>> 1; let sqri = ((i + i) * (i + 1)) >>> 0;
    if (sqri > mxndx) break; // partial sieving pass if bp is prime:
    if ((cullbuf[i >> 3] & masks[i & 7]) == 0) {
      cullbuf[i >> 3] |= masks[i & 7]; // cull bp!
      for (; sqri <= mxndx; sqri += bp)
        cullbuf[sqri >> 3] |= masks[sqri & 7]; // cull bp mults!

      // now adjust `larges` for latest partial sieve pass...
      var ori = 0 // compress input rough index to output one
      for (let iri = 0; iri <= rilmt; ++iri) {
        const r = roughs[iri]; const rci = r >>> 1; // skip roughs just culled!
        if ((cullbuf[rci >> 3] & masks[rci & 7]) != 0) continue;
        const d = bp * r;
        larges[ori] = larges[iri] -
                        ( (d <= sqrtlmt) ? larges[smalls[d >>> 1] - nobps]
                            : smalls[half(divide(lmt, d))] ) +
                          nobps; // base primes count over subtracted!
        roughs[ori++] = r;
      }

      let si = mxndx // and adjust `smalls` for latest partial sieve pass...
      for (let bpm = (sqrtlmt / bp - 1) | 1; bpm >= bp; bpm -= 2) {
        const c = smalls[bpm >>> 1] - nobps;
        const ei = ((bpm * bp) >>> 1);
        while (si >= ei) smalls[si--] -= c;
      }

      nobps++; rilmt = ori - 1;
    }
    bp += 2;
  }

  // combine results to here; correcting for over subtraction in combining...
  let ans = larges[0]; for (let i = 1; i <= rilmt; ++i) ans -= larges[i];
  ans += Math.trunc((rilmt + 1 + 2 * (nobps - 1)) * rilmt / 2);

  // add final adjustment for pairs of current roughs to cube root of range...
  let ri = 0
  while (true) { // break when reaches cube root of counting range...
    const p = roughs[++ri]; const q = Math.trunc(lmt / p);
    const ei = smalls[half(divide(q, p))] - nobps;
    if (ei <= ri) break; // break here when no more pairs!
    for (let ori = ri + 1; ori <= ei; ++ori)
      ans += smalls[half(divide(q, roughs[ori]))];
    ans -= (ei - ri) * (nobps + ri - 1);
  }

  return ans + 1; // add one for only even prime of two!
}

The above code enjoys quite a large gain in speed of about a further ten times (and increasing with range) over the immediately preceding version for larger "non-trivial" ranges (since there is an appreciable environment overhead in initialization and JIT compilation) due to not using recursion at all and the greatly reduced computational complexity of O(n**(3/4)/((log n)**2)) instead of the almost-linear-for-large-counting-ranges O(n/((log n)**2)) for the previous two versions, although the previous two versions differ in performance by a constant factor due to the overheads of recursion and division; for instance, this version can calculate the number of primes to 1e11 in about 125 milliseconds. This version of prime counting function might be considered to be reasonably useful for counting primes up to a 1e16 in under three minutes as long as the computer used has the required free memory of about 800 Megabytes. This JavaScript version is about three times as slow as the Nim version from which it was translated for large ranges where the initialization overhead is not significant due to being run by the JavaScript engine and JIT compiled.

jq

Adapted from Wren

Works with: jq

Works with gojq, the Go implementation of jq (*)

The following implementation freely uses the following conveniences of jq syntax:

  • the jq expression {$x} is an abbreviation for {"x" : $x}
  • the jq expression {x} is an abbreviation for {"x" : .x}

The key-value pairs can be similarly specified in JSON objects with multiple keys.

(*) gojq struggles to get beyond [5,9592] without using huge amounts of memory; jq becomes excessively slow after [7,664579].

Preliminaries

# For the sake of the accuracy of integer arithmetic when using gojq:
def power($b): . as $in | reduce range(0;$b) as $i (1; . * $in);

# Input: $n, which is assumed to be positive integer,
# Output: an array of primes less than or equal to $n (e.g. 10|eratosthenes #=> [2,3,5,7]
def eratosthenes:
  # erase(i) sets .[i*j] to false for integral j > 1
  def erase(i):
    if .[i] then 
      reduce range(2; (1 + length) / i) as $j (.; .[i * $j] = false)
    else .
    end;

  (. + 1) as $n
  | (($n|sqrt) / 2) as $s
  | [null, null, range(2; $n)]
  | reduce (2, 1 + (2 * range(1; $s))) as $i (.; erase($i))
  | map(select(.)) ;

Phi and the Legendre Counting Function

def legendre:
  (sqrt | floor + 1 | eratosthenes) as $primes

  # Input: {x, a}
  # Output: {phi: phi(x,a), memo} where .memo might have been updated
  | def phi:
    . as {x: $x, a: $a, memo: $memo}
    | if $a == 0 then {phi: $x, $memo}
      else "\($x),\($a)" as $ix
      | $memo[ $ix ] as $m
      | if $m then {phi: $m, $memo}
        else .a += -1
        | phi as {phi: $phi1, memo: $memo}
        | (($x / $primes[$a - 1])|floor) as $x
        | ({$x, a, $memo} | phi) as {phi: $phi2, memo: $memo}
        | ($phi1 - $phi2) as $phi
        | {$phi, $memo}
	| .memo[$ix] = $phi
        end
      end;

  def l:
    . as $n
    | if . < 2 then 0
      else ($n|sqrt|floor|l) as $a
      | ({x: $n, $a, memo: {}} | phi).phi + $a - 1
      end;

  l;


def task:
  range(0;10)
  | . as $i
  | [., (10|power($i)|legendre)]
  ;

task
Output:
[0,0]
[1,4]
[2,25]
[3,168]
[4,1229]
[5,9592]
[6,78498]
[7,664579]
<terminated>

Julia

Using the Julia Primes library

This would be faster with a custom sieve that only counted instead of making a bitset, instead of the more versatile library function.

using Primes

function primepi(N)
    delta = round(Int, N^0.8)
    return sum(i -> count(primesmask(i, min(i + delta - 1, N))), 1:delta:N)
end

@time for power in 0:9
    println("10^", rpad(power, 5), primepi(10^power))
end
Output:
10^0    0
10^1    4
10^2    25
10^3    168
10^4    1229
10^5    9592
10^6    78498
10^7    664579
10^8    5761455
10^9    50847534
  1.935556 seconds (1.64 k allocations: 415.526 MiB, 2.15% gc time)

Using the method given in the task

Translation of: CoffeeScript

Run twice to show the benefits of precomputing the library prime functions.

using Primes

const maxpi = 1_000_000_000
const memoφ = Dict{Vector{Int}, Int}()
const pₐ = primes(isqrt(maxpi))

function π(n)
    function φ(x, a)
        if !haskey(memoφ, [x, a])
            memoφ[[x, a]] = a == 0 ? x : φ(x, a - 1) - φ(x ÷ pₐ[a], a - 1)
        end
        return memoφ[[x, a]]
    end

    n < 2 && return 0
    a = π(isqrt(n))
    return φ(n, a) + a - 1
end

@time for i in 0:9
    println("10^", rpad(i, 5), π(10^i))
end

@time for i in 0:9
    println("10^", rpad(i, 5), π(10^i))
end
Output:
10^0    1
10^1    4   
10^2    25  
10^3    168 
10^4    1229
10^5    9592
10^6    78498
10^7    664579
10^8    5761455
10^9    50847534
 12.939211 seconds (69.44 M allocations: 3.860 GiB, 27.98% gc time, 1.58% compilation time)
10^0    1
10^1    4
10^2    25
10^3    168
10^4    1229
10^5    9592
10^6    78498
10^7    664579
10^8    5761455
10^9    50847534
  0.003234 seconds (421 allocations: 14.547 KiB)

nonrecursive method

Translation of: Nim
function countprimes(n)
    n < 3 && return typeof(n)(n > 1)
    rtlmt = isqrt(n)
    mxndx = (rtlmt - 1) ÷ 2
    smalls::Array{UInt32} = [i for i in 0:mxndx+1]
    roughs::Array{UInt32} = [i + i + 1 for i in 0:mxndx+1]
    larges::Array{Int64} = [(n ÷ (i + i + 1) - 1) ÷ 2 for i in 0:mxndx+1]
    cmpsts = falses(mxndx + 1)
    bp, npc, mxri = 3, 0, mxndx
    @inbounds while true
        i = bp >> 1
        sqri = (i + i) * (i + 1)
        sqri > mxndx && break
        if !cmpsts[i + 1]
           cmpsts[i + 1] = true
            for c in sqri:bp:mxndx
                cmpsts[c + 1] = true
            end
            ri = 0
            for k in 0:mxri
                q = roughs[k + 1]
                qi = q >> 1
                cmpsts[qi + 1] && continue
                d = UInt64(bp) * UInt64(q)
                larges[ri + 1] = larges[k + 1] + npc -
                  (d <= rtlmt ? larges[smalls[d >> 1 + 1] - npc + 1]
                              : smalls[(Int(floor(n / d)) - 1) >> 1 + 1])
                roughs[ri + 1] = q
                ri += 1
            end
            m = mxndx
            @simd for k in (rtlmt ÷ bp - 1) | 1 : -2 : bp
                c = smalls[k >> 1 + 1] - npc
                ee = (k * bp) >> 1
                while m >= ee
                    smalls[m + 1] -= c
                    m -= 1
                end
            end
            mxri = ri - 1
            npc += 1
        end
        bp += 2
    end

    result = larges[1]
    @simd for i in 2:mxri+1
        result -= larges[i]
    end
    result += (mxri + 1 + 2 * (npc - 1)) * mxri ÷ 2

    for j in 1:mxri
        p = UInt64(roughs[j + 1])
        m = n ÷ p
        ee = smalls[(Int(floor(m / p)) - 1) >> 1 + 1] - npc
        ee <= j && break
        for k in j+1:ee
            result += smalls[(Int(floor(m / roughs[k + 1])) - 1) >> 1 + 1]
        end
        result -= (ee - j) * (npc + j - 1)
    end
    return result + 1
end

for i in 0:14
    println("π(10^$i) = ", countprimes(10^i))
end

@time countprimes(10^13)
@time countprimes(10^14)
Output:
π(10^0) = 0
π(10^1) = 4
π(10^2) = 25
π(10^3) = 168
π(10^4) = 1229
π(10^5) = 9592
π(10^6) = 78498
π(10^7) = 664579
π(10^8) = 5761455
π(10^9) = 50847534
π(10^10) = 455052511
π(10^11) = 4118054813
π(10^12) = 37607912018
π(10^13) = 346065536839
π(10^14) = 3204941750802
  0.894891 seconds (13 allocations: 48.442 MiB, 1.06% gc time)
  4.479385 seconds (13 allocations: 153.185 MiB, 0.12% gc time)

Kotlin

There doesn't seem to be much point to contributing memoized (Hash table) versions as they seem to be the result of the task author not realizing that there is a simple way of keeping the computation complexity from having exponential growth with counting range by limiting the "splitting" of the "phi" calculation when the result must be just one. Accordingly, the following Kotlin versions are [translated and improved from the non-memoizing Nim versions](https://rosettacode.org/wiki/Legendre_prime_counting_function#Non-Memoized_Versions), which explanations can be referenced for an understanding of how they work:

The Basic Recursive Legendre Algorithm

Translation of: Nim
fun countPrimes(lmt: Long): Long {
  if (lmt < 3) return if (lmt < 2) 0 else 1
  val sqrtlmt = Math.sqrt(lmt.toDouble()).toInt()
  fun makePrimes(): IntArray {
    val mxndx = (sqrtlmt - 3) / 2
    val arr = IntArray(mxndx + 1, { it + it + 3})
    var i = 0
    while (true) {
      val sqri = (i + i) * (i + 3) + 3
      if (sqri > mxndx) break
      if (arr[i] != 0) {
        val bp = i + i + 3
        for (c in sqri .. mxndx step bp) arr[c] = 0
      }
      i++
    }
    return arr.filter { it != 0 }.toIntArray()
  }
  val oprms = makePrimes()
  fun phi(x: Long, a: Int): Long {
    if (a <= 0) return x - (x shr 1)
    val na = a - 1; val p = oprms[na].toLong()
    if (x <= p) return 1
    return phi(x, na) - phi(x / p, na)
  }
  return phi(lmt, oprms.size) + oprms.size.toLong()
}

fun main() {
  val strt = System.currentTimeMillis()

  for (i in 0 .. 9) {
    val arg = Math.pow(10.toDouble(), i.toDouble()).toLong()
    println("π(10**$i) = ${countPrimes(arg)}")
  }

  val stop = System.currentTimeMillis()

  println("This took ${stop - strt} milliseconds.")
}
Output:
π(10**0) = 0
π(10**1) = 4
π(10**2) = 25
π(10**3) = 168
π(10**4) = 1229
π(10**5) = 9592
π(10**6) = 78498
π(10**7) = 664579
π(10**8) = 5761455
π(10**9) = 50847534
This took 480 milliseconds.

The time is as run on an Intel i5-6500 (3.6 GHz single-threaded boost).

Although this is about ten times faster than if it were using memoization (and uses much less memory), it is only reasonably usable for trivial ranges such as this (trivial in terms of prime counting functions).

The Less Recursive "Bottom-Up" with a TinyPhi LUT Algorithm

Just substitute the following Kotlin code for the `countPrimes` function above to enjoy the gain in speed:

Translation of: Nim
val cTinyPhiPrimes = intArrayOf(2, 3, 5, 7, 11, 13)
val cTinyPhiDeg = cTinyPhiPrimes.size - 1
val cTinyPhiOddCirc = cTinyPhiPrimes.reduce(Int::times) / 2
val cTinyPhiTot = cTinyPhiPrimes.fold(1) { s, p -> s * (p - 1) }
fun makeTPLUT(): IntArray {
  val arr = IntArray(cTinyPhiOddCirc) { _ -> 1 }
  for (bp in cTinyPhiPrimes.drop(1)) {
    arr[bp shr 1] = 0
    for (c in ((bp * bp) shr 1) until cTinyPhiOddCirc step bp) arr[c] = 0 }
  var acc = 0
  for (i in 0 until cTinyPhiOddCirc) { acc += arr[i]; arr[i] = acc }
  return arr
}
val cTinyPhiLUT = makeTPLUT()
fun tinyPhi(x: Long): Long {
  val ndx = (x - 1) shr 1; val numtots = ndx / cTinyPhiOddCirc.toLong()
  val rem = (ndx - numtots * cTinyPhiOddCirc.toLong()).toInt()
  return numtots * cTinyPhiTot.toLong() + cTinyPhiLUT[rem].toLong()
}

fun countPrimes(lmt: Long): Long {
  if (lmt < 169) {
    if (lmt < 3) return if (lmt < 2) 0 else 1
    // adjust for the missing "degree" base primes
    if (lmt <= 13)
      return ((lmt - 1).toLong() shr 1) + (if (lmt < 9) 1 else 0);
    return 5.toLong() + cTinyPhiLUT[(lmt - 1).toInt() shr 1].toLong();
  }
  val sqrtlmt = Math.sqrt(lmt.toDouble()).toInt()
  fun makePrimes(): IntArray {
    val mxndx = (sqrtlmt - 3) / 2
    val arr = IntArray(mxndx + 1, { it + it + 3})
    var i = 0
    while (true) {
      val sqri = (i + i) * (i + 3) + 3
      if (sqri > mxndx) break
      if (arr[i] != 0) {
        val bp = i + i + 3
        for (c in sqri .. mxndx step bp) arr[c] = 0
      }
      i++
    }
    return arr.filter { it != 0 }.toIntArray()
  }
  val oprms = makePrimes()
  fun lvl(pilmt: Int, m: Long): Long {
    var acc = 0.toLong()
    for (pi in cTinyPhiDeg until pilmt) {
      val p = oprms[pi].toLong(); val nm = m * p
      if (lmt <= nm * p) return acc + (pilmt - pi).toLong()
      val q = (lmt.toDouble() / nm.toDouble()).toLong(); acc += tinyPhi(q)
      if (pi > cTinyPhiDeg) acc -= lvl(pi, nm)
    }
    return acc
  }
  return tinyPhi(lmt) - lvl(oprms.size, 1) + oprms.size.toLong()
}

Use of the above function gets a gain in speed of about a further twenty times over the above version due to less recursion and the use of the "Tiny_Phi" Look Up Table (LUT) for smaller "degrees" of primes which greatly reduces the number of integer divisions, which have also been converted to floating point divisions because they are faster although with a lost of precision for counting ranges that one would likely not use this implementation for. This version of prime counting function might be considered to be reasonably useful for counting primes up to a trillion (1e12) in a few tens of seconds.

The Non-Recursive Partial Sieving Algorithm

The following Kotlin code enjoys the further gain in speed:

Translation of: Nim
import java.util.BitSet

fun countPrimes(lmt: Long): Long {
  if (lmt < 3) return if (lmt < 2) 0 else 1 // odds only!
  fun half(x: Int): Int = (x - 1) shr 1
  fun divide(nm: Long, d: Long): Int = (nm.toDouble() / d.toDouble()).toInt()
  val sqrtlmt = Math.sqrt(lmt.toDouble()).toLong()
  val mxndx = ((sqrtlmt - 1) shr 1).toInt()
  val cmpsts = BitSet(mxndx + 1)
  val smalls = IntArray(mxndx + 1) { it }
  val roughs = IntArray(mxndx + 1) { it + it + 1 }
  val larges = LongArray(mxndx + 1) { (lmt / (it + it + 1).toLong() - 1) shr 1 }

  // partial sieve loop, adjusting larges/smalls, compressing larges/roughs...
  var nobps = 0; var rilmt = mxndx; var bp = 3.toLong()
  while (true) {
    val i = (bp shr 1).toInt(); val sqri = (i + i) * (i + 1)
    if (sqri > mxndx) break
    if (!cmpsts.get(i)) { // condition for partial sieving pass for bp is prime
      cmpsts.set(i) // cull bp!
      for (c in sqri .. mxndx step bp.toInt()) cmpsts.set(c) // cull bp mults!

      // now adjust `larges` for latest partial sieve pass...
      var ori = 0 // compress input rough index to output one
      for (iri in 0 .. rilmt) {
        val r = roughs[iri]; val rci = (r shr 1)
        if (cmpsts.get(rci)) continue // skip roughs culled in this sieve pass!
        val d = bp * r.toLong()
        larges[ori] = larges[iri] -
                        ( if (d <= sqrtlmt)
                            larges[smalls[(d shr 1).toInt()] - nobps]
                          else smalls[half(divide(lmt, d))].toLong() ) +
                          nobps.toLong() // base primes count over subtracted!
        roughs[ori++] = r
      }

      var si = mxndx // and adjust `smalls` for latest partial sieve pass...
      for (bpm in (sqrtlmt / bp - 1) or 1 downTo bp step 2) {
        val c = smalls[(bpm shr 1).toInt()] - nobps
        val ei = ((bpm * bp) shr 1).toInt()
        while (si >= ei) smalls[si--] -= c
      }

      nobps++; rilmt = ori - 1
    }
    bp += 2
  }

  // combine results to here; correcting for over subtraction in combining...
  var ans = larges[0]; for (i in 1 .. rilmt) ans -= larges[i]
  ans += (rilmt.toLong() + 1 + 2 * (nobps.toLong() - 1)) * rilmt.toLong() / 2

  // add final adjustment for pairs of current roughs to cube root of range...
  var ri = 0
  while (true) { // break when reaches cube root of counting range...
    val p = roughs[++ri].toLong(); val q = lmt / p
    val ei = smalls[half(divide(q, p))] - nobps
    if (ei <= ri) break // break here when no more pairs!
    for (ori in ri + 1 .. ei)
      ans += smalls[half(divide(q, roughs[ori].toLong()))].toLong()
    ans -= (ei - ri).toLong() * (nobps.toLong() + ri.toLong() - 1)
  }

  return ans + 1 // add one for only even prime of two!
}

fun main() {
  val strt = System.currentTimeMillis()

  for (i in 0 .. 9) {
    val arg = Math.pow(10.toDouble(), i.toDouble()).toLong()
    println("π(10**$i) = ${countPrimes(arg)}")
  }

  val stop = System.currentTimeMillis()

  println("This took ${stop - strt} milliseconds.")
}

The above code enjoys quite a large gain in speed of about a further ten times (and increasing with range) over the immediately preceding version for larger "non-trivial" ranges (since there is an appreciable environment overhead in initialization and JIT compilation) due to not using recursion at all and the greatly reduced computational complexity of O(n**(3/4)/((log n)**2)) instead of the almost-linear-for-large-counting-ranges O(n/((log n)**2)) for the previous two versions, although the previous two versions differ in performance by a constant factor due to the overheads of recursion and division; for instance, this version can calculate the number of primes to 1e11 in about 82 milliseconds. This version of prime counting function might be considered to be reasonably useful for counting primes up to a 1e16 in about two minutes as long as the computer used has the required free memory of about 800 Megabytes. This Kotlin version is about twice as slow as the Nim version from which it was translated for large ranges where the initialization overhead is not significant due to being run under the Java Virtual Machine (JVM) environment and JIT compiled.

Mathematica /Wolfram Language

ClearAll[Phi,pi]
$RecursionLimit = 10^6;
Phi[x_, 0] := x
Phi[x_, a_] := Phi[x, a] = Phi[x, a - 1] - Phi[Floor[x/Prime[a]], a - 1]
pi[n_] := Module[{a}, If[n < 2, 0, a = pi[Floor[Sqrt[n]]]; Phi[n, a] + a - 1]]
Scan[Print[pi[10^#]] &, Range[0,9]]
Output:
0
4
25
168
1229
9592
78498
664579
5761455
50847534

Mojo

Currently, Mojo does not allow many forms of recursion so the simple recursive solutions are not possible; however, those are much slower than the solution by partial sieving anyway, as per the following code:

Translation of: Nim
from time import (now)

alias cLIMIT: UInt64 = 100_000_000_000

@always_inline
fn mkMasks() -> DTypePointer[DType.uint8]:
  let rslt = DTypePointer[DType.uint8].alloc(8)
  for i in range(8): rslt.offset(i).store(1 << i)
  return rslt

let masksp = mkMasks()

fn intsqrt(n: UInt64) -> UInt64:
  if n < 4:
    if n < 1: return 0 else: return 1
  var x: UInt64 = n; var qn: UInt64 = 0; var r: UInt64 = 0
  while qn < 64 and (1 << qn) <= n:
    qn += 2
  var q: UInt64 = 1 << qn
  while q > 1:
    if qn >= 64:
      q = 1 << (qn - 2); qn = 0
    else:
      q >>= 2
    let t: UInt64 =  r + q
    r >>= 1
    if x >= t:
      x -= t; r += q
  return r

fn countPrimes(n: UInt64) -> Int64:
  if n < 3:
    if n < 2: return 0
    else: return 1
  let rtlmt: Int = intsqrt(n).to_int() # precision limits range to maybe 1e16!
  let mxndx = (rtlmt - 1) >> 1
  @always_inline
  fn half(n: Int64) -> Int64    : return ((n - 1) // 2)
  @always_inline
  fn divide(nm: UInt64, d: UInt64) -> Int64: return ((nm * 1.0) / (d * 1.0)).to_int()
  let smalls = # current accumulated counts of odd primes 1 to sqrt range
    DTypePointer[DType.uint32].alloc(mxndx + 1)
  # initialized for no sieving whatsoever other than odds-only - partial sieved by 2:
  #   0 odd primes to 1; 1 odd prime to 3, etc....
  for i in range(mxndx + 1): smalls.offset(i).store(i)
  let roughs = # current odd k-rough numbers up to sqrt of range; k = 2
    DTypePointer[DType.uint32].alloc(mxndx + 1)
  # initialized to all odd positive numbers 1, 3, 5, ... sqrt range...
  for i in range(mxndx + 1): roughs.offset(i).store(i + i + 1)
  # array of current phi counts for above roughs...
  # these are not strictly `phi`'s since they also include the
  # count of base primes in order to match the above `smalls` definition!
  let larges = # starts as size of counts just as `roughs` so they align!
    DTypePointer[DType.uint64].alloc(mxndx + 1)
  # initialized for current roughs after accounting for even prime of two...
  for i in range(mxndx + 1): larges.offset(i).store((n // (i + i + 1) - 1) // 2)
  # cmpsts is a bit-packed boolean array representing
  # odd composite numbers from 1 up to rtlmt used for sieving...
  # initialized as "zeros" meaning all odd positives are potentially prime
  # note that this array starts at (and keeps) 1 to match the algorithm even
  # though 1 is not a prime, as 1 is important in computation of phi...
  let cmpsts = DTypePointer[DType.uint8].alloc((mxndx + 8) // 8)
  memset_zero(cmpsts, (mxndx + 8) // 8)

  # number of found base primes and current highest used rough index...
  var npc: Int = 0; var mxri: Int = mxndx
  for i in range(1, mxndx + 1): # start at index for 3; i will never reach mxndx...
    let sqri = (i + i) * (i + 1) # computation of square index!
    if sqri > mxndx: break # stop partial sieving due to square index limit!
    if (cmpsts.offset(i >> 3).load() & masksp.offset(i & 7).load()) != 0: continue # if not prime
    # culling the base prime from cmpsts means it will never be found again
    let cp = cmpsts.offset(i >> 3)
    cp.store(cp.load() | masksp.offset(i & 7).load()) # cull base prime
    let bp = i + i + 1 # base prime from index!
    for c in range(sqri, mxndx + 1, bp): # SoE culling of all bp multiples...
      let cp = cmpsts.offset(c >> 3); cp.store(cp.load() | masksp.offset(c & 7).load())
    # partial sieving to current base prime is now completed!

    var ri: Int = 0 # to keep track of current used roughs index!
    for k in range(mxri + 1): # processing over current roughs size...
      # q is not necessarily a prime but may be a
      # product of primes not yet culled by partial sieving;
      # this is what saves operations compared to recursive Legendre:
      let q: UInt64 = roughs.offset(k).load().to_int(); let qi = q >> 1 # index of always odd q!
      # skip over values of `q` already culled in the last partial sieve:
      if (cmpsts.offset(qi >> 3).load() & masksp.offset(qi & 7).load()) != 0: continue
      # since `q` cannot be equal to bp due to cull of bp and above skip;
      let d: UInt64 = bp * q # `d` is odd product of some combination of odd primes!
      # the following computation is essential to the algorithm's speed:
      # see above description in the text for how this works:
      larges.offset(ri).store(larges.offset(k).load() -
                   (larges.offset(smalls.offset(d >> 1).load().to_int() - npc).load() if d <= rtlmt
                    else smalls.offset(half(divide(n, d))).load().to_int()) + npc)
      # eliminate rough values that have been culled in partial sieve:
      # note that `larges` and `roughs` indices relate to each other!
      roughs.offset(ri).store(q.to_int()); ri += 1 # update rough value; advance rough index

    var m = mxndx # adjust `smalls` counts for the newly culled odds...
    # this is faster than recounting over the `cmpsts` array for each loop...
    for k in range(((rtlmt // bp) - 1) | 1, bp - 1, -2): # k always odd!
      # `c` is correction from current count to desired count...
      # `e` is end limit index no correction is necessary for current cull...
      let c = smalls.offset(k >> 1).load() - npc; let e = (k * bp) >> 1
      while m >= e:
        let cp = smalls.offset(m)
        cp.store(cp.load() - c); m -= 1 # correct over range down to `e`
    mxri = ri - 1; npc += 1 # set next loop max roughs index; count base prime
  # now `smalls` is a LUT of odd prime accumulated counts for all odd primes;
  # `roughs` is exactly the "k-roughs" up to the sqrt of range with `k` the
  #    index of the next prime above the quad root of the range;
  # `larges` is the partial prime counts for each of the `roughs` values...
  # note that `larges` values include the count of the odd base primes!!!
  # `cmpsts` are never used again!
  
  # the following does the top most "phi tree" calculation:
  var result: Int64 = larges.load().to_int() # the answer to here is all valid `phis`
  for i in range(1, mxri + 1): result -= larges.offset(i).load().to_int() # combined here by subtraction
  # compensate for the included odd base prime counts over subracted above:
  result += ((mxri + 1 + 2 * (npc - 1)) * mxri // 2)

  # This loop adds the counts due to the products of the `roughs` primes,
  # of which we only use two different ones at a time, as all the
  # combinations with lower primes than the cube root of the range have
  # already been computed and included with the previous major loop...
  # see text description above for how this works...
  for j in range(1, mxri + 1):  # for all `roughs` (now prime) not including one:
    let p: UInt64 = roughs.offset(j).load().to_int()
    let m: UInt64 = (n // p) # `m` is the `p` quotient
    # so that the end limit `e` can be calculated based on `n`/(`p`^2)
    let e: Int = smalls.offset(half((m // p).to_int())).load().to_int() - npc
    # following break test equivalent to non-memoization/non-splitting optmization:
    if e <= j: break # stop at about `p` of cube root of range!
    for k in range(j + 1, e + 1): # for all `roughs` greater than `p` to end limit:
      result += smalls.offset(half(divide(m, roughs.offset(k).load().to_int()))).load().to_int()
    # compensate for all the extra base prime counts just added!
    result -= ((e - j) * (npc + j - 1))

  result += 1 # include the count for the only even prime of two
  smalls.free(); roughs.free(); larges.free(); cmpsts.free()

  return result

fn main():
  var pow: Int = 1
  for i in range(10):
    print('10^', i, '=', countPrimes(pow))
    pow *= 10
  let start = now()
  let answr = countPrimes(cLIMIT)
  let elpsd = (now() - start) / 1000000
  print("Found", answr, "primes up to", cLIMIT, "in", elpsd, "milliseconds.")
Output:
10^ 0 = 0
10^ 1 = 4
10^ 2 = 25
10^ 3 = 168
10^ 4 = 1229
10^ 5 = 9592
10^ 6 = 78498
10^ 7 = 664579
10^ 8 = 5761455
10^ 9 = 50847534
Found 4118054813 primes up to 100000000000 in 16.696860000000001 milliseconds.

This is as run on an AMD 7840HS single-thread boosted to 5.1 GHz. It is over twice as slow as when run from the Nim code it is translated from, likely due to LLVM not handling this algorithm as well as the GCC compiler that did the final "back-end" compilation for the Nim code.

Nim

The program runs in 2.2 seconds on our laptop. Using int32 instead of naturals (64 bits on our 64 bits computer) saves 0.3 second. Using an integer rather than a tuple as key (for instance x shl 32 or a) saves 0.4 second.

import math, strutils, sugar, tables

const
  N = 1_000_000_000
  S = sqrt(N.toFloat).int

var composite: array[3..S, bool]
for n in countup(3, S, 2):
  if n * n > S: break
  if not composite[n]:
    for k in countup(n * n, S, 2 * n):
      composite[k] = true

# Prime list. Add a dummy zero to start at index 1.
let primes = @[0, 2] & collect(newSeq, for n in countup(3, S, 2): (if not composite[n]: n))

var cache: Table[(Natural, Natural), Natural]

proc phi(x, a: Natural): Natural =
  if a == 0: return x
  let pair = (x, a)
  if pair in cache: return cache[pair]
  result = phi(x, a - 1) - phi(x div primes[a], a - 1)
  cache[pair] = result

proc π(n: Natural): Natural =
  if n <= 2: return 0
  let a = π(sqrt(n.toFloat).Natural)
  result = phi(n, a) + a - 1

var n = 1
for i in 0..9:
  echo "π(10^$1) = $2".format(i, π(n))
  n *= 10

</syntaxhighlight>

This is not quite as fast as the Nim version or other languages whose back end is C as the LLVM-based back ends don't compile to quite the same optimization level for this algorithm than does GCC.

Output:
π(10^0) = 0
π(10^1) = 4
π(10^2) = 25
π(10^3) = 168
π(10^4) = 1229
π(10^5) = 9592
π(10^6) = 78498
π(10^7) = 664579
π(10^8) = 5761455
π(10^9) = 50847534

Non-Memoized Versions

The above code takes a huge amount of memory to run, especially if using the full size "Naturals" which not using them will severely limit the usable counting range (which of course is limited anyway by the huge memory use by the memoization). As per the comments on the task, memoization is not strictly needed, as the exponential growth in execution time can be limited by a very simple optimization that stops splitting of the recursive "tree" when there are no more contributions to be had by further "leaves" to the right of the "tree", as per the following code:

# compile with: nim c -d:danger -t:-march=native --gc:arc

from std/monotimes import getMonoTime, `-`
from std/times import inMilliseconds
from std/math import sqrt

let masks = [ 1'u8, 2, 4, 8, 16, 32, 64, 128 ] # faster than bit twiddling
let masksp = cast[ptr[UncheckedArray[byte]]](unsafeAddr(masks[0]))

proc countPrimes(n: int64): int64 =
  if n < 3:
    return if n < 2: 0 else: 1
  else:
    let rtlmt = n.float64.sqrt.int
    let mxndx = (rtlmt - 1) div 2
    let sz = (mxndx + 8) div 8
    var cmpsts = cast[ptr[UncheckedArray[byte]]](alloc0(sz))
    for i in 1 .. mxndx:
      if (cmpsts[i shr 3] and masksp[i and 7]) != 0: continue
      let sqri = (i + i) * (i + 1)
      if sqri > mxndx: break
      let bp = i + i + 1
      for c in countup(sqri, mxndx, bp):
        let w = c shr 3; cmpsts[w] = cmpsts[w] or masksp[c and 7]
    var pisqrt = 0'i64
    for i in 0 .. mxndx:
      if (cmpsts[i shr 3] and masksp[i and 7]) == 0: pisqrt += 1
    var primes = cast[ptr[UncheckedArray[uint32]]](alloc(sizeof(uint32) * pisqrt.int))
    var j = 0
    for i in 0 .. mxndx:
      if (cmpsts[i shr 3] and masksp[i and 7]) == 0: primes[j] = (i + i + 1).uint32; j += 1
    proc phi(x: int64; a: int): int64 =
      if a <= 1:
        return if a < 1: x else: x - (x shr 1)
      let p = primes[a - 1].int64
      if x <= p: return 1 # very simple one-line optimization that limits exponential growth!
      return phi(x, a - 1) - phi((x.float64 / p.float64).int64, a - 1)
    result = phi(n, pisqrt.int) + pisqrt - 1
    cmpsts.dealloc; primes.dealloc

let nstrt = getMonoTime()
var pow = 1'i64
for i in 0 .. 9: echo "π(10^", i, ") = ", pow.countPrimes; pow *= 10
let nelpsd = (getMonoTime() - nstrt).inMilliseconds
echo "This took ", nelpsd, " milliseconds."
Output:
π(10^0) = 0
π(10^1) = 4
π(10^2) = 25
π(10^3) = 168
π(10^4) = 1229
π(10^5) = 9592
π(10^6) = 78498
π(10^7) = 664579
π(10^8) = 5761455
π(10^9) = 50847534
This took 194 milliseconds.

As shown in the output, the above program runs without memoization about 12 times faster and using a huge amount less memory than the previous memoized algorithm. Other than not requiring memoization, the implementation also substitutes floating point division for integer division so as to be faster for many CPU architectures (although only being of 53-bits accuracy instead of 64-bits; which limited precision will limit the range over which the primes can be accurately counted, but as this algorithm is only reasonably useful to about ten to the fourteenth power anyway, that shouldn't be a problem.

Using Look Up Table for "TinyPhi" and Decreasing Recursion

A further optimization that greatly reduces the number of operations and thus the run time for these small ranges is to use a pre-computed "TinyPhi" Look Up Table (LUT) array for the "leaves" where the "degree" is small (such as degree of six for the smallest primes used as 13 as used here, which then forms a repeating wheel pattern of 30030 values); this would reduce the "width" of the tree at its widest "base" to eliminate many divisions and recursions and thus reduce the execution time greatly, although the effect is more for smaller ranges such as this as for greater ranges. As well, one can compute the "tree" starting at the lower left, which then reduces some of the recursions to a simple loop and can eliminate the recursions used for succeeding "ones" according to the optimization eliminating the need for more recursion. This is because "Phi"/φ can be calculated by the given "top down" recursion expression given in the task which is derived from the "bottom up" expression as follows:

(where denotes the floor function).

The following Nim code implements these changes:

# compile with: nim c -d:danger -t:-march=native --gc:arc

from std/monotimes import getMonoTime, `-`
from std/times import inMilliseconds
from std/math import sqrt

let masks = [ 1'u8, 2, 4, 8, 16, 32, 64, 128 ] # faster than bit twiddling
let masksp = cast[ptr[UncheckedArray[byte]]](unsafeAddr(masks[0]))

const TinyPhiPrimes = [2, 3, 5, 7, 11, 13]
const TinyPhiCirc = 3 * 5 * 7 * 11 * 13
const TinyPhiRes = 2 * 4 * 6 * 10 * 12
const CC = 6
proc makeTinyPhiLUT(): array[TinyPhiCirc, uint16] =
  for i in 0 .. TinyPhiCirc - 1: result[i] = 1
  for i in 1 .. 6:
    if result[i] == 0: continue
    result[i] = 0; let bp = i + i + 1
    let sqri = (i + i) * (i + 1)
    for c in countup(sqri, TinyPhiCirc - 1, bp): result[c] = 0
  var acc = 0'u16;
  for i in 0 .. TinyPhiCirc - 1: acc += result[i]; result[i] = acc
const TinyPhiLUT = makeTinyPhiLUT()
proc tinyPhi(x: int64): int64 {.inline.} =
  let ndx = (x - 1) div 2; let numtot = ndx div TinyPhiCirc.int64
  return numtot * TinyPhiRes.int64 + TinyPhiLUT[(ndx - numtot * TinyPhiCirc.int64).int].int64

proc countPrimes(n: int64): int64 =
  if n < 169: # below 169 whose sqrt is 13 is where TinyPhi doesn't work...
    if n < 3: return if n < 2: 0 else: 1
    # adjust for the missing "degree" base primes
    if n <= 13: return (n - 1) div 2 + (if n < 9: 1 else: 0)
    return 5 + TinyPhiLUT[(n - 1).int div 2].int64
  let rtlmt = n.float64.sqrt.int
  let mxndx = (rtlmt - 1) div 2
  var cmpsts = cast[ptr[UncheckedArray[byte]]](alloc0((mxndx + 8) div 8))
  for i in 1 .. mxndx:
    if (cmpsts[i shr 3] and masksp[i and 7]) != 0: continue
    let sqri = (i + i) * (i + 1)
    if sqri > mxndx: break
    let bp = i + i + 1
    for c in countup(sqri, mxndx, bp):
      let w = c shr 3; cmpsts[w] = cmpsts[w] or masksp[c and 7]
  var pisqrt = 0'i64
  for i in 0 .. mxndx:
    if (cmpsts[i shr 3] and masksp[i and 7]) == 0: pisqrt += 1
  let primes = cast[ptr[UncheckedArray[uint32]]](alloc(sizeof(uint32) * pisqrt.int))
  var j = 0
  for i in 0 .. mxndx:
    if (cmpsts[i shr 3] and masksp[i and 7]) == 0: primes[j] = (i + i + 1).uint32; j += 1
  var phi = tinyPhi(n)
  proc lvl(m, mbf: int64; mxa: int) = # recurse from bottom left of "tree"...
    for a in CC .. mxa:
      let p = primes[a].int64
      if m < p * p: phi += mbf * (mxa - a + 1).int64; return # rest of level all ones!
      let nm = (m.float64 / p.float64).int64; phi += mbf * tinyPhi(nm)
      if a > CC: lvl(nm, -mbf, a - 1) # split
    # finished level!
  lvl(n, -1, pisqrt.int - 1); result = phi + pisqrt - 1
  cmpsts.dealloc; primes.dealloc  

let strt = getMonoTime()
var pow = 1'i64
for i in 0 .. 9: echo "π(10^", i, ") = ", pow.countPrimes; pow *= 10
let elpsd = (getMonoTime() - strt).inMilliseconds
echo "This took ", elpsd, " milliseconds."

The results of the above code are identical to the previous version except it is almost 20 times faster due to the reduced recursion; however, the gain due to the "TinyPhi" LUT is most effective for relatively trivial counting ranges such as this and will have less and less benefit as ranges get larger, although the reduced recursion due to reordering is a constant factor gain across the range.

Note that the optimization to eliminate the need for memoization has been changed to "if m < p * p: phi += mbf * (mxa - a + 1).int64; return # rest of level all ones!", which has exactly the same effect as the previous optimization but stops the recursion for all "parent" branches of the same "level".

Larger "TinyPhi" LUT's can be used for some added benefit (especially for these smaller ranges) although they quickly get quite huge where the benefit is reduced due to lesser cache associativity; for instance the LUT for a "TinyPhi" of degree eight can a wheel circumference of 9699690 values which can be reduced by a factor of two by only considering odd values and another factor of two using the fact that the LUT is symmetric approaching the middle from the beginning and end of the LUT.

Reducing Operations by using Partial Sieving

There has been some work adapting the Legendre theory to the use of partial sieving as is known in the competitive programming communities, although there doesn't seem to be a link to a mathematical paper deriving the basis of that algorithm. The result of that work is a code algorithm that still uses memory proportional to the square root of the range to be counted but greatly reduces the computational complexity from O(n / (log n)^2) to O(n^(3/4) / (log n)^2). Since a mathematical paper supporting the algorithm has not been found, a brief explanation is provided here (if someone knows of such a paper, please edit in a link):

It turns out that there is another way to think of the "Phi"/φ(m, a)/inclusion-exclusion expression from which the task expression and the above implementations have been derived - that expresses the count of values left after the range to m has been partially sieved/partially culled by the primes up to the "ath" one. Now, Legendre would likely not have considered thinking of the inclusion/exclusion principle that way as he made no real practical use of his work other than some trivial examples showing that it worked and could calculate the number of primes up to a range only knowing the primes up to the square root of that range, as one has to be able to see that this can be used repeatedly to adjust Look Up Tables (LUT's) of partial counts with each partial sieve and, of course, there were no computers in his time (1800's).

The LUT's used in the following code are as follows:

  1. The simple "smalls" LUT is a table of the counts of remaining values after culling of base primes up to a "kth prime" value up to the square root of the counting range (may as well be odd values since two is the only even prime), which could be updated by repeated scanning of the sieve buffer but is more efficiently updated by decrementing the values between each cull value by the constant difference. To make this easier, this "smalls" LUT includes the count of the base primes used up to the current state and doesn't include the one that is normally included in a "Phi" count.
  2. Another important table is the "roughs" LUT where the "k-roughs" term wasn't used until post 2000 or so, but is the remaining (may as well be odd) values in a range (in this case to the square root of the counting range) after elimination of all multiples of primes up to the kth one. This table will start as half (because odds-only) the square root of the counting range in length but will be reduced in effective length by each partial sieving pass by each base prime value so that only prime values above the last base prime value remain (plus the first value of one) so will be approximately sqrt n / log n in length.
  3. The final "larges" LUT is the most important one and perhaps the hardest to grasp as it contains the partial counts of values up to the entire counting range for each of the current "roughs" values so thus has an effective length that is synchronized and thus reduced with the above "roughs" LUT. It is initialized with the number of odd un-culled values up to and including each current "k-rough" value, and like the "smalls" LUT, its values are adjusted for each partial sieve pass. The updated values of this "larges" LUT cannot be determined only by reference to the remaining count after culling but the offset to be subtracted must be determined by look up of the previous values in this same LUT of all prime products smaller than the square root of the counting range.
  4. These three LUT's change until partial sieving is complete, which since they represent values up to the square root of the counting range, happens after all culling by base prime values up to the quad root of the counting range.
  5. As the Sieve of Eratosthenes sieving is only to the range of the square root of the counting range, it is a negligible part of the total execution time and hasn't been much optimized.

In order to understand the updating of the "larges" LUT as per the above, one needs to understand the relationship between the φ recursion as per the task description and partial sieving as follows:

  1. The number of (odd) values remaining after a partial sieve and the "phi" expression are closely related as they only differ by the number of base primes and one.
  2. As per the previous implementation, the "tree" is being adjusted and built from the bottom up, so the initial state of the LUT's represents the bottom of the tree just after partial sieving by the only even base prime of two, and every succeeding partial sieve pass need to build the immediate "parent" node counts based on the values available in the "child" nodes as per the current state of the LUT's.
  3. The relationship between the "parents" and the "children" for a given count limit is given by the relationship "φ(x, a) = φ(x, a−1) − φ(⌊x/pa⌋, a−1)", which in words says that the "larges" LUT entry for a given count for a "parent" level adjustment will have a new value of the old value minus the looked up count of that value divided by the current partial sieve base prime cull value for the partial sieving pass.
  4. There are two different means of obtaining the adjustment to be subtracted from the old "larges" count value: for products of the two values less than or equal to the square root of the counting table, the index of the product is looked up in the "smalls" LUT and that index is then used to find the adjustment in the "larges" LUT; for products larger than that limit, the quotient will be less than that limit so can be converted to an index and directly looked up in the "smalls" LUT.
  5. Since the "larges" LUT needs to be treated in the same way as to offsets and scaling as the "smalls" LUT and the latter includes the count of current base primes used and doesn't include the one, the "larges" LUT needs to do the same. When adjusting for a partial sieving pass by taking a difference, the offset of the current base primes count is cancelled out and needs to be added back in after the difference.
  6. The synced reduction in effective length of the "roughs" and "larges" LUT's due to culled values for each partial sieving pass is effected in the same loop that does the partial sieving.
  7. During the partial sieving passes loop, it is impossible for the "non-branch splitting" optimization that makes memoization unnecessary to occur, so that check is not applied during this loop, thus saving a little execution time.
  8. After partial sieving is complete (when the base primes reach the cube root of the counting range), the loop could continue without sieving to determine the rest of the "parent" values up to the top level with "a" equal to the last base prime up to the square root of the counting range, but that would be less efficient as to having to inject a check for the "non-splitting termination optimization", to determine that the base primes have reached a level where sieving isn't necessary, and a slightly less efficient compensation for the differences eliminating the base primes count and restoring on an operation-by-operation basis, so this loop is terminated and a new loop does this final phase.
  9. This partial sieving and LUT building loop is the source of the major part of the execution complexity, which for very large ranges will tend toward a O(n^(3/4)/((log n)^2)) execution complexity as can be seen that we are using products of the base primes up the the cube root of the counting range and what will tend toward the number of primes up to the square root of the counting range.

Since we no longer need to use partial sieving and adjustment of the LUT's, at this stage the result of the current state of computation can be combined into a variable. As the zeroth index of "larges" LUT contains the count of the value one "rough" (also the zeroth value in the LUT), it is the result of only one base prime at a time and is therefore included and thus additive where all the other effective values in the "larges" LUT are the result of the product of two co-primes and therefore excluded and subtracted from the zeroth value. This will result in an over subtraction of the number of base primes up to the quad root of the counting range, so that will need to be added back in by the number of effective indexes greater than zero in the "larges" LUT minus one just as this offset is compensated as the values were adjusted and will have to be further adjusted as we add further inclusions to the final result.

At this stage, an answer has been obtained but it is not quite correct as to the inclusions due to the base primes/lowest prime factors between the quad root of the counting range and the square root of the counting range have not yet been added, as follows:

  1. Again, these are pairs of (this time prime as partial sieving has been completed) factors with the base primes as described above.
  2. It is easy to demonstrate why there cannot be more than two prime factors used since the smallest prime factor used will be just over the quad root of the counting range, and if there were three successive prime factors used, the quotient would be less than the quad root of the counting range which would be a breaking/non-splitting condition and this is the best case for three prime factors so three prime factors cannot be used in this loop
  3. This second non-sieving loop will never reach the top end of the range for two large successive prime factors since when the prime factors start at about the cube root of the counting range, they will hit the terminating non-splitting condition and the loop will terminate. This termination is equivalent in effect to the non-splitting optimization so that "tree" splitting is not necessary but since the count LUT's (and the result accumulator variable) don't contain the one offset, no further compensation of adding a computed number of ones needs to be made.
  4. Since there are always two prime factors as per the above, their compensation will always be positive as in included/added; this is due to that effectively they are being double subtracted as in once as an equivalent compensation to the "larges" values and once for the final transfer of the "larges" values to the result accumulator.
  5. The product of the pairs of prime factors will always be greater than the square root of the counting range so the quotient will always be less than the square root of the counting range and only the "smalls" LUT needs to be used for these compensations.
  6. As the compensations for each pair are each adding the count of base primes up to the quad root of the counting range to the accumulator, this needs to be compensated by subtracting a multiple of the number of compensations minus one times the number of base primes used.
  7. Since this second loop terminates early, there are slightly less operations than in the first partial sieving loop as the base prime multiplier will never exceed about the cube root of the counting range so that the first loop controls the computational complexity more than this second loop.

Finally, we add one to the final answer to account for the only even prime of two.

The reason that this algorithm can reduce the number of operations as compared to the "standard" recursive Legendre algorithm in that the product of some combinations of multiple primes is compensated in one operation rather than many "branching"/recursive operations as in the recursive or semi-recursive algorithms.

One might question the validity of this "partial sieving" optimization considering the task explanation, but the Legendre prime counting function was likely never used practically at the time it was invented other than for trivial examples showing that it worked as mentioned above since there were no electronic computers; the follow-on work by Meissel was used by him to hand calculate the number of primes to a billion (1e9) over the course of some number of years in about 1870, and he most likely would have used partial sieving as otherwise he would have done over five million long divisions or over 600 thousand long divisions even with use of a "TinyPhi" Look Up Table of degree six, which would not be possible in his lifetime at something like a hundred long division calculations a day. D. H. Lehmer did not use partial sieving in his computer work on prime counting in 1959, but that was because he was focused on fitting a version of the Meissel algorithm into the memory limitations of the computers of that day (a few thousand words of RAM memory) such that he had to pre-calculate and store the primes values required on a digital magnetic tape that had to be played several times as the calculation proceeded and thus didn't fully investigate (or fully understand) all of the optimizations as used by Meissel, using the brute mathematical computational strength of the computer to overcome the lack of optimizations. So partial sieving as an optimization technique would likely have been known (by Meissel) but not applied until Legarias, Miller, and Odlyzko (LMO) investigated a way to apply it to a computer algorithm. It could have been applied to both the Legendre and Meissel algorithms much sooner other than, at least for the Legendre algorithm, the need for memory proportional to the square root of the counting range would have limited the useful range to counting ranges of 1e12 or so when computers of the day had at most several Megabytes of RAM memory. One of the biggest advantages of LMO compared to even this optimization of Legendre is it greatly reduces memory requirements such that in 1985 it was possible to determine the count of primes to 4e16 on a mainframe computer with only a few megabytes of RAM memory, which took 1730 minutes even multi-threaded (two threads).

In short, the defining characteristic of prime counting functions of the Legendre type is only requiring a sieve to the square root of the counting range, of functions of the Meissel type that they require a sieve to the power of two thirds of the counting range, of functions of the Meissel-Lehmer type that they require a sieve to the power of three quarters of the counting range, with later Meissel variants such as that of Deleglise and Gourdon being tuned variants that are between the Legendre requirement and the Meissel requirement in order to balance the time spent counting and the time spent sieving, with all of them decreasing the number of "Phi" operations as the sieving range increases. All of these algorithms can have the improvement of the LMO "partial sieving" technique applied to them (although LMO is as applied to Meissel) in "splitting" the calculation to above some limit and below that limit, with LMO splitting between values above and below the counting range to the power of two thirds and this algorithm splitting at the square root of the counting range. The combination of "partial sieving" and this "splitting" reduces the computational complexity by combining many of the divisors of products of more than one prime into one operation. This algorithm has approximately the computational complexity of Meissel-Lehmer without the increased sieving range because these operations have been combined into the calculation of "Phi"/φ so that that the calculations of "P2" and "P3" are not necessary and therefore the sieving of the counting range to the three quarters power is not necessary.

A Nim implementation of the above described algorithm is as follows:

# compile with: nim c -d:danger -t:-march=native --gc:arc

from std/monotimes import getMonoTime, `-`
from std/times import inMilliseconds
from std/math import sqrt

let masks = [ 1'u8, 2, 4, 8, 16, 32, 64, 128 ] # faster than bit twiddling
let masksp = cast[ptr[UncheckedArray[byte]]](unsafeAddr(masks[0]))

# non-recursive Legendre prime counting function for a range `n`...
# this has O(n^(3/4)/((log n)^2)) time complexity; O(n^(1/2)) space complexity.
proc countPrimes(n: int64): int64 =
  if n < 3: # can't odd sieve for value less than 3!
    return if n < 2: 0 else: 1
  else:
    proc half(n: int): int {.inline.} = (n - 1) shr 1 # convenience conversion to index
    # dividing using float64 is faster than int64 for some CPU's...
    # precision limits range to maybe 1e16!
    proc divide(nm, d: int64): int {.inline.} = (nm.float64 / d.float64).int
    let rtlmt = n.float64.sqrt.int # precision limits range to maybe 1e16!
    let mxndx = (rtlmt - 1) div 2
    var smalls = # current accumulated counts of odd primes 1 to sqrt range
      cast[ptr[UncheckedArray[uint32]]](alloc(sizeof(uint32) * (mxndx + 1)))
    # initialized for no sieving whatsoever other than odds-only - partial sieved by 2:
    #   0 odd primes to 1; 1 odd prime to 3, etc....
    for i in 0 .. mxndx: smalls[i] = i.uint32
    var roughs = # current odd k-rough numbers up to sqrt of range; k = 2
      cast[ptr[UncheckedArray[uint32]]](alloc(sizeof(uint32) * (mxndx + 1)))
    # initialized to all odd positive numbers 1, 3, 5, ... sqrt range...
    for i in 0 .. mxndx: roughs[i] = (i + i + 1).uint32
    # array of current phi counts for above roughs...
    # these are not strictly `phi`'s since they also include the
    # count of base primes in order to match the above `smalls` definition!
    var larges = # starts as size of counts just as `roughs` so they align!
      cast[ptr[UncheckedArray[int64]]](alloc(sizeof(int64) * (mxndx + 1)))
    # initialized for current roughs after accounting for even prime of two...
    for i in 0 .. mxndx: larges[i] = ((n div (i + i + 1) - 1) div 2).int64
    # cmpsts is a bit-packed boolean array representing
    # odd composite numbers from 1 up to rtlmt used for sieving...
    # initialized as "zeros" meaning all odd positives are potentially prime
    # note that this array starts at (and keeps) 1 to match the algorithm even
    # though 1 is not a prime, as 1 is important in computation of phi...
    var cmpsts = cast[ptr[UncheckedArray[byte]]](alloc0((mxndx + 8) div 8))

    # number of found base primes and current highest used rough index...
    var npc = 0; var mxri = mxndx
    for i in 1 .. mxndx: # start at index for 3; i will never reach mxndx...
      let sqri = (i + i) * (i + 1) # computation of square index!
      if sqri > mxndx: break # stop partial sieving due to square index limit!
      if (cmpsts[i shr 3] and masksp[i and 7]) != 0'u8: continue # if not prime
      # culling the base prime from cmpsts means it will never be found again
      cmpsts[i shr 3] = cmpsts[i shr 3] or masksp[i and 7] # cull base prime
      let bp = i + i + 1 # base prime from index!
      for c in countup(sqri, mxndx, bp): # SoE culling of all bp multiples...
        let w = c shr 3; cmpsts[w] = cmpsts[w] or masksp[c and 7]
      # partial sieving to current base prime is now completed!

      var ri = 0 # to keep track of current used roughs index!
      for k in 0 .. mxri: # processing over current roughs size...
        # q is not necessarily a prime but may be a
        # product of primes not yet culled by partial sieving;
        # this is what saves operations compared to recursive Legendre:
        let q = roughs[k].int; let qi = q shr 1 # index of always odd q!
        # skip over values of `q` already culled in the last partial sieve:
        if (cmpsts[qi shr 3] and masksp[qi and 7]) != 0'u8: continue
        # since `q` cannot be equal to bp due to cull of bp and above skip;
        let d = bp * q # `d` is odd product of some combination of odd primes!
        # the following computation is essential to the algorithm's speed:
        # see above description in the text for how this works:
        larges[ri] = larges[k] -
                     (if d <= rtlmt: larges[smalls[d shr 1].int - npc]
                      else: smalls[half(divide(n, d.int64))].int64) + npc.int64
        # eliminate rough values that have been culled in partial sieve:
        # note that `larges` and `roughs` indices relate to each other!
        roughs[ri] = q.uint32; ri += 1 # update rough value; advance rough index

      var m = mxndx # adjust `smalls` counts for the newly culled odds...
      # this is faster than recounting over the `cmpsts` array for each loop...
      for k in countdown(((rtlmt div bp) - 1) or 1, bp, 2): # k always odd!
        # `c` is correction from current count to desired count...
        # `e` is end limit index no correction is necessary for current cull...
        let c = smalls[k shr 1] - npc.uint32; let e = (k * bp) shr 1
        while m >= e: smalls[m] -= c; m -= 1 # correct over range down to `e`
      mxri = ri - 1; npc += 1 # set next loop max roughs index; count base prime
    # now `smalls` is a LUT of odd prime accumulated counts for all odd primes;
    # `roughs` is exactly the "k-roughs" up to the sqrt of range with `k` the
    #    index of the next prime above the quad root of the range;
    # `larges` is the partial prime counts for each of the `roughs` values...
    # note that `larges` values include the count of the odd base primes!!!
    # `cmpsts` are never used again!

    # the following does the top most "phi tree" calculation:
    result = larges[0] # the answer to here is all valid `phis`
    for i in 1 .. mxri: result -= larges[i] # combined here by subtraction
    # compensate for the included odd base prime counts over subracted above:
    result += ((mxri + 1 + 2 * (npc - 1)) * mxri div 2).int64

    # This loop adds the counts due to the products of the `roughs` primes,
    # of which we only use two different ones at a time, as all the
    # combinations with lower primes than the cube root of the range have
    # already been computed and included with the previous major loop...
    # see text description above for how this works...
    for j in 1 .. mxri:  # for all `roughs` (now prime) not including one:
      let p = roughs[j].int64; let m = n div p # `m` is the `p` quotient
      # so that the end limit `e` can be calculated based on `n`/(`p`^2)
      let e = smalls[half((m div p).int)].int - npc
      # following break test equivalent to non-memoization/non-splitting optmization:
      if e <= j: break # stop at about `p` of cube root of range!
      for k in j + 1 .. e: # for all `roughs` greater than `p` to end limit:
         result += smalls[half(divide(m, roughs[k].int64))].int64
      # compensate for all the extra base prime counts just added!
      result -= ((e - j) * (npc + j - 1)).int64

    result += 1 # include the count for the only even prime of two
    smalls.dealloc; roughs.dealloc; larges.dealloc; cmpsts.dealloc

let strt = getMonoTime()
var pow = 1'i64
for i in 0 .. 9: echo "π(10^", i, ") = ", pow.countPrimes; pow *= 10
let elpsd = (getMonoTime() - strt).inMilliseconds
echo "This took ", elpsd, " milliseconds."

The output of the above code is the same as the previous version except that it is almost ten times faster still and the execution time is almost too small to be measured for this trivial range (for prime counting functions) of only a billion; it takes a little over a second to calculate the count of primes up to 1e13 and just over 30 seconds to calculate the number of primes up to 1e16 on a modern CPU.

The above program uses about 250,000 long integer divisions to calculate the count of primes to a billion as here, which is small enough to be able to calculate it by hand in several man-years of work. The Meissel-LMO algorithm with usual basic optimizations greatly reduces the number of integer long divisions (to only about 17 thousand with a "TinyPhi" LUT of degree eight) so as to be even more possible to hand calculate in only a few man-months, although still subject to math errors as Meissel made in this computation to a billion in the about 1870's. Meissel did not have our advantage in electronic computational power, but just using partial sieving and even a small degree of "TinyPhi" table would have made it easily possible to hand calculate in a few man-years.

The advantage of this "k-roughs" algorithm is its relative simplicity, but that comes at the cost of fairly high memory use due to using Legendre such that it is only really usable to about 1e16 even for a modern computer. Another disadvantage is that each partial sieving pass must be completed before being able to proceed with the next, which precludes using page segmentation and the ease of introducing multi-threading that provides. One could use page segmented sieving as for the the LMO algorithm but if one were to add that much added code complexity, one may as well implement full LMO and also enjoy the reduction in memory use. Once one has page-segmented sieving, one can then easily convert the code to multi-threading for gains in speed proportional to the number of effective threads for a given computer.

As counting range is increased above these trivial ranges, a better algorithm such as that of LMO will likely perform much better than this and use much less memory (to proportional to the cube root of the range), although at the cost of greater code complexity and more lines of code (about 500 LOC). The trade-off between algorithms of the Legendre type and those of the Meissel type is that the latter decreases the time complexity but at the cost of more time spent sieving; modern follow-on work to LMO produces algorithms which are able to tune the balance between Legendre-type and Meissel-type algorithm in trading off the execution time costs between the different parts of the given algorithm for the best in performance, with the latest in Kim Walisch's tuning of Xavier Gourdon's work being about five to ten times faster than LMO (plus includes multi-threading abilities).

Pascal

// Rosetta Code task "Legendre prime counting function".
// Solution for Free Pascal (Lazarus) or Delphi.
program LegendrePrimeCount;

{$IFDEF FPC} // Free Pascal
  {$MODE Delphi}
{$ELSE}      // Delphi
  {$APPTYPE CONSOLE}
{$ENDIF}

// Optimization needs to be on if the program is to finish in a reasonable
//   length of time. See "Comments on Task" on the Rosetta Code website.
{$DEFINE SIMPLE_OPTIMIZATION}

uses SysUtils, Types;

{-------------------------------------------------------------------
Function to return an array of primes up to the passed-in limit.
Uses a straightforward Eratosthenes sieve.
TIntegerDynArray must be 0-based. To be compatible with Rosetta Code,
the first prime 2 is at result[1], and result[0] is not used.
}
function FindPrimes( limit : integer) : Types.TIntegerDynArray;
var
  deleted : array of boolean;
  j, k, p, resultSize : integer;
begin
  if (limit < 2) then begin
    SetLength( result, 1);
    exit;
  end;
  SetLength( deleted, limit + 1); // 0..limit
  deleted[0] := true;
  for j := 1 to limit do deleted[j] := false;
  p := 2;
  while (p*p <= limit) do begin
    j := 2*p;
    while (j <= limit) do begin
      deleted[j] := true;
      inc( j, p);
    end;
    repeat inc(p)
    until (p > limit) or (not deleted[p]);
  end;
  resultSize := 0;
  for j := 0 to limit do
    if not deleted[j] then inc( resultSize);
  SetLength( result, resultSize);
  k := 0;
  for j := 0 to limit do begin
    if not deleted[j] then begin
      result[k] := j;
      inc(k);
    end;
  end;
end;

{-----------------------------------------------------------------------------
Function to count primes up to the passed-in limit, by Legendre's method.
Iterative, using a stack. Each item in the stack is a term phi(x,a) along
with a sign. If the top item on the stack can be evaluated easily, it is
popped off and its value is added to the result. Else the top item is
replaced by two items according to the formual in the task description.
}
function CountPrimes( n : integer) : integer;
type
  TPhiTerm = record
    IsNeg : boolean;
    x : integer;
    a : integer;
  end;
const
  STACK_SIZE = 100; // 10 is enough for n = 10^9
var
  primes : Types.TIntegerDynArray;
  nrPrimes : integer;
  stack : array [0..STACK_SIZE - 1] of TPhiTerm;
  sp : integer; // stack pointer, points to first free entry
  tos : TPhiTerm; // top of stack
begin
  primes := FindPrimes( Trunc( Sqrt( n + 0.5)));
  nrPrimes := Length( primes) - 1; // primes[0] is not used
  result := nrPrimes - 1; // initialize total
  // Push initial entry onto stack
  with stack[0] do begin
    IsNeg := false;
    x := n;
    a := nrPrimes;
  end;
  sp := 1;
  while sp > 0 do begin
    tos := stack[sp - 1];
{$IFDEF SIMPLE_OPTIMIZATION}
    // Using optimization described in "Comments on Task"
    if tos.x = 0 then begin // top of stack = 0
      dec(sp); // pop top of stack, no change to result
    end
    else if (tos.a > 0) and (tos.x < primes[tos.a]) then begin // top of stack = 1
      dec( sp); // pop top of stack, update result
      if tos.IsNeg then dec( result)
                   else inc( result);
    end
    else if tos.a = 0 then begin
{$ELSE}
    // Using only the task description, i.e. only phi(x,0) = x
    if tos.a = 0 then begin
{$ENDIF}
      dec( sp); // pop top of stack, update result
      if tos.IsNeg then dec( result, tos.x)
                   else inc( result, tos.x);
    end
    else begin
      // Replace top of stack by two items as in the task description,
      //    namely phi(x, a - 1) and -phi(x div primes[a], a - 1)
      if (sp >= STACK_SIZE) then
        raise SysUtils.Exception.Create( 'Legendre phi stack overflow');
      with stack[sp - 1] do begin
        IsNeg := tos.isNeg;
        x := tos.x;
        a := tos.a - 1;
      end;
      with stack[sp] do begin
        IsNeg := not tos.IsNeg;
        x := tos.x div primes[tos.a];
        a := tos.a - 1;
      end;
      inc(sp);
    end;
  end;
end;

{-----------------------------------------------------------
Main routine
}
var
  power, limit, count : integer;
begin
  WriteLn( 'Limit      Count');
  limit := 1;
  for power := 0 to 9 do begin
    if power > 0 then limit := 10*limit;
    count := CountPrimes( limit);
    WriteLn( SysUtils.Format( '10^%d  %10d', [power, count]))
  end;
end.
Output:
Limit      Count
10^0           0
10^1           4
10^2          25
10^3         168
10^4        1229
10^5        9592
10^6       78498
10^7      664579
10^8     5761455
10^9    50847534

Perl

#!/usr/bin/perl

use strict; # https://rosettacode.org/wiki/Legendre_prime_counting_function
use warnings;
no warnings qw(recursion);
use ntheory qw( nth_prime prime_count );

my (%cachephi, %cachepi);

sub phi
  {
  return $cachephi{"@_"} //= do {
    my ($x, $aa) = @_;
    $aa <= 0 ? $x : phi($x, $aa - 1) - phi(int $x / nth_prime($aa), $aa - 1) };
  }

sub pi
  {
  return $cachepi{$_[0]} //= do {
    my $n = shift;
    $n < 2 ? 0 : do{ my $aa = pi(int sqrt $n); phi($n, $aa) + $aa - 1 } };
  }

print "e             n   Legendre    ntheory\n",
      "-             -   --------    -------\n";
for (1 .. 9)
  {
  printf "%d  %12d %10d %10d\n", $_, 10**$_, pi(10**$_), prime_count(10**$_);
  }
Output:
e             n   Legendre    ntheory
-             -   --------    -------
1            10          4          4
2           100         25         25
3          1000        168        168
4         10000       1229       1229
5        100000       9592       9592
6       1000000      78498      78498
7      10000000     664579     664579
8     100000000    5761455    5761455
9    1000000000   50847534   50847534

Phix

with javascript_semantics
--
-- While a phix dictionary can handle keys of {x,a}, for this 
-- task performance was dreadful (7,612,479 entries, >3mins),
-- so instead memophix maps known x to an index to memophia 
-- which holds the full [1..a] for each x, dropping to a much
-- more respectable (albeit not super-fast) 14s
--
constant memophix = new_dict()
sequence memophia = {}  -- 1..a (max 3401) for each x

function phi(integer x, a)
    if a=0 then return x end if
    integer adx = getd(x,memophix), res
    if adx=NULL then
        memophia = append(memophia,repeat(-1,a))
        adx = length(memophia)
        setd(x,adx,memophix)
    else
        object ma = memophia[adx]
        integer l = length(ma)
        if a>l then
            memophia[adx] = 0   -- kill refcount
            memophia[adx] = ma & repeat(-1,a-l)
        else
            res = ma[a]
            if res>=0 then return res end if
        end if
        ma = 0                  -- kill refcount
    end if
    res = phi(x, a-1) - phi(floor(x/get_prime(a)), a-1)
    memophia[adx][a] = res
    return res
end function
 
function pi(integer n)
    if n<2 then return 0 end if
    integer a = pi(floor(sqrt(n)))
    return phi(n, a) + a - 1
end function
 
atom t0 = time()
for i=0 to iff(platform()=JS?8:9) do
    printf(1,"10^%d  %d\n",{i,pi(power(10,i))})
--  printf(1,"10^%d  %d\n",{i,length(get_primes_le(power(10,i)))})
end for
?elapsed(time()-t0)

The commented-out length(get_primes_le()) gives the same results, in about the same time, and in both cases re-running the main loop a second time finishes near-instantly.

Output:
10^0  0
10^1  4
10^2  25
10^3  168
10^4  1229
10^5  9592
10^6  78498
10^7  664579
10^8  5761455
10^9  50847534
"14.0s"

(It is about 4 times slower under pwa/p2js so output is limited to 10^8, unless you like staring at a blank screen for 52s)

Non-recursive partial sieve

with javascript_semantics
requires("1.0.2") -- (for in, tagstart)
function half(integer n) return floor((n-1)/2)+1 end function // convenience convert to idx

function count_primes(atom n)
    // non-recursive Legendre prime counting function for a range `n`...
    // has O(n^(3/4)/((log n)^2)) time complexity; O(n^(1/2)) space complexity.
    if n<3 then return iff(n<2?0:1) end if // can't odd sieve for n less than 3!
    integer sqrtn = trunc(sqrt(n)),     // (actual limit)
            mxndx = floor((sqrtn-1)/2)  // odds-only limit
    --
    -- smalls is the current accumulated counts of odd primes 1 to sqrt(n), initialized 
    -- to odds-only sieving, ie {0,1,2,3,4...} meaning 0 odd primes to 1, 1 o.p to 3,...
    --
    -- roughs is the current odd k-rough numbers up to sqrt of range; k = 2
    -- initialized to all odd positive numbers 1, 3, 5, 7, 9, 11, ... sqrt(n)
    --
    -- larges is an array of current phi counts for the above roughs... except they are
    -- not strictly `phi`'s since they also include primes, to match `smalls` above!
    -- initialized for current roughs after accounting for the even prime of two...
    --
    -- composite is a flag array representing odd numbers 1..sqrtn, for sieving.
    -- initialized false, meaning all positive odd numbers are potentially prime
    -- note that this array starts at (and keeps) 1 to match the algorithm even
    -- though 1 is not actually a prime, as 1 is important in computation of phi...
    --
    sequence smalls = tagset(mxndx,0),
             roughs = tagstart(1,mxndx+1,2),
             larges = sq_floor_div(sq_sub(sq_div(n,roughs),1),2),
          composite = repeat(false,mxndx+1)

    integer bp = 3,             // 'current' base prime
           nbp = 0,             // number of base primes found 
          mxri = mxndx,         // current highest used rough index
             i = 2, sqri = 4    // index and square (index-1) limit
    // partial sieve loop, adjusting larges/smalls, compressing larges/roughs...
    while sqri<=mxndx do // partial sieve to square index limit
        if not composite[i] then
            // cull from composite so they will never be found again
            composite[i] = true // cull bp and multiples
            for c=sqri+1 to mxndx+1 by bp do
                composite[c] = true
            end for
            // partial sieving to current base prime is now completed!

            // now adjust `larges` for latest partial sieve pass...
            integer ori = 0 // compress input rough index(k) to output one
            for k,q in roughs to mxri+1 do
                // q is not necessarily prime but may be a product of primes not yet 
                // culled by partial sieving (saves ops cmprd to recursive Legendre)
                // skip over values of `q` already culled in the last partial sieve:
                integer qi = floor(q/2)+1; // index of always odd q!
                if not composite[qi] then
                    // since `q` cannot be equal to bp due to cull of bp and above skip;
                    atom d = bp*q, // `d` is odd product of some combination of odd primes!
                    // the following computation is essential to the algorithm's speed,
                    // see the Nim entry for the full details of how this works
                      dadj = iff(d<=sqrtn ? larges[smalls[half(d)]-nbp+1]
                                          : smalls[half(floor(n/d))])
                    ori += 1
                    larges[ori] = larges[k]-dadj+nbp // base primes count over subtracted!
                    // eliminate rough values that have been culled in partial sieve:
                    // note that `larges` and `roughs` indices relate to each other!
                    roughs[ori] = q
                end if
            end for

            integer m = mxndx // and adjust `smalls` for latest partial sieve pass...
            // this is faster than recounting over the `composite` array for each loop...
            for k=(sqrtn/bp-1)||1 to bp by -2 do    // k always odd!
                // `c` is correction from current count to desired count...
                // `e` is end limit index no correction is necessary for current cull...
                integer c = smalls[half(k)]-nbp,
                        e = floor((k*bp)/2)
                while m>=e do
--                  smalls[m+1] -= c    -- (grr, js! [I have a plan, working on it])
                    m += 1
                    smalls[m] -= c
--                  m -= 1
                    m -= 2
                end while
            end for
            
            nbp += 1        // increase number of found base primes
            mxri = ori-1    // advance rough index for later
        end if
        bp += 2
        sqri = (i+i)*(i+1)
        i += 1
    end while

    // now `smalls` is a LUT of odd prime accumulated counts for all odd primes;
    // `roughs` is exactly the "k-roughs" up to the sqrt of range with `k` (erm,
    //   mxri?) the index of the next prime above the quad root of the range;
    // `larges` is the partial prime counts for each of the `roughs` values...
    // note that `larges` values include the count of the odd base primes!!!
    // - and `composite` is never used again!
    
    // the following does the top-most "phi tree" calculation:
    // the answer to here is all valid `phis`, combined here by subtraction,
    // + compensate for included odd base prime counts over subracted above:
    atom result = larges[1] - sum(larges[2..mxri+1])
                + trunc((mxri+1 + 2*(nbp-1))*mxri/2)
                + 1 // include the only even prime, ie 2

    // This loop adds the counts due to the products of the `roughs` primes,
    // of which we only use two different ones at a time, as all the
    // combinations with lower primes than the cube root of the range have
    // already been computed and included with the previous major loop...
    // see text description in the Nim entry for how this works...
    for ri,p in roughs from 2 do // for all `roughs` (now prime) bar '1':
        atom m = trunc(n/p), // `m` is the `p` quotient
        // so that the end limit `e` can be calculated based on `n`/(`p`^2)
             e = smalls[half(floor(m/p))+1]-nbp+1
        // the following test is equivalent to non-splitting optmization:
        if e<=ri then exit end if // quit when no more pairs! - aka stop
                                  // at about `p` of cube root of range!
        for k=ri+1 to e do // for all `roughs` greater than `p` to limit:
            result += smalls[half(floor(m/roughs[k]))];
        end for
        // compensate for all the extra base prime counts just added!
        result -= (e-ri)*(nbp+ri-2)
    end for
    return result
end function

atom t = time()
constant expected = {0,4,25,168,1229,9592,78498,664579,5761455,
                     50847534,455052511,4118054813,37607912018,
                     346065536839,3204941750802}
for i=0 to iff(platform()=JS?11:14) do -- (sp: keep js under 2s)
    atom c = count_primes(power(10,i))
    assert(c==expected[i+1])
    string e = elapsed(time()-t,0.1," (%s)")
    printf(1,"10^%d = %d%s\n",{i,c,e})
end for
printf(1,"\nTook %s\n",elapsed(time()-t))
Output:
10^0 = 0
10^1 = 4
10^2 = 25
10^3 = 168
10^4 = 1229
10^5 = 9592
10^6 = 78498
10^7 = 664579
10^8 = 5761455
10^9 = 50847534
10^10 = 455052511 (0.3s)
10^11 = 4118054813 (1.7s)
10^12 = 37607912018 (7.9s)
10^13 = 346065536839 (40.8s)
10^14 = 3204941750802 (3 minutes and 44s)

Took 3 minutes and 44s

About 15 times slower than Julia on the same box (gulp, earmarked for further investigation in 2.0.0).
[At the last minute, I nicked the idea of using a simple bool array instead of those bit-fields from Julia.]
A copy of this code [up to 1e9] has also found it's way into tests\t68forin.exw

Picat

Performance is surprisingly good, considering that the Picat's library SoE is very basic and that I recompute the base primes each time pi() is called. This version takes 23 seconds on my laptop, outperforming the CoffeeScript version which takes 30 seconds.

table(+, +, nt)
phi(X, 0, _) = X.
phi(X, A, Primes) = phi(X, A - 1, Primes) - phi(X // Primes[A], A - 1, Primes).

pi(N) = Count, N < 2 => Count = 0.
pi(N) = Count =>
    M = floor(sqrt(N)),
    A = pi(M),
    Count = phi(N, A, math.primes(M)) + A - 1.

main =>
    N = 1,
    foreach (K in 0..9)
        writef("10^%w    %w%n", K, pi(N)),
        N := N * 10.
Output:
10^0	0
10^1	4
10^2	25
10^3	168
10^4	1229
10^5	9592
10^6	78498
10^7	664579
10^8	5761455
10^9	50847534

PicoLisp

Uses code from the Sieve of Eratosthenes to generate primes.

(setq Legendre-Max (** 10 9))

(load "plcommon/eratosthenes.l")  # see task "Sieve of Eratosthenes, 2x3x5x7 wheel version.

# Create an index tree of the first N primes up to √Legendre-Max
(setq Sieve (sieve (sqrt Legendre-Max)))
(balance 'Primes (mapcar 'cons (range 1 (length Sieve)) Sieve))
(de prime (N)  (cdr (lup Primes N)))

(de ϕ (X A)
   (cache '(NIL) (cons X A)
      (if (=0 A)
         X
         (- (ϕ X (dec A)) (ϕ (/ X (prime A)) (dec A))))))

(de π (N)
   (if (< N 2)
      0
      (let A (π (sqrt N))
         (+ (ϕ N A) A -1))))

(for N 10 
   (prinl "10\^" (dec N) "^I" (π (** 10 (dec N)))))

(bye)
Output:
10^0	0
10^1	4
10^2	25
10^3	168
10^4	1229
10^5	9592
10^6	78498
10^7	664579
10^8	5761455
10^9	50847534

Python

Using primesieve module (pip install primesieve).

from primesieve import primes
from math import isqrt
from functools import cache

p = primes(isqrt(1_000_000_000))

@cache
def phi(x, a):
    res = 0
    while True:
        if not a or not x:
            return x + res
    
        a -= 1
        res -= phi(x//p[a], a) # partial tail recursion

def legpi(n):
    if n < 2: return 0

    a = legpi(isqrt(n))
    return phi(n, a) + a - 1

for e in range(10):
    print(f'10^{e}', legpi(10**e))
Output:
10^0 0
10^1 4
10^2 25
10^3 168
10^4 1229
10^5 9592
10^6 78498
10^7 664579
10^8 5761455
10^9 50847534

Raku

Seems like an awful lot of pointless work. Using prime sieve anyway, why not just use it?

use Math::Primesieve;

my $sieve = Math::Primesieve.new;

say "10^$_\t" ~ $sieve.count: exp($_,10) for ^10;

say (now - INIT now) ~ ' elapsed seconds';
Output:
10^0	0
10^1	4
10^2	25
10^3	168
10^4	1229
10^5	9592
10^6	78498
10^7	664579
10^8	5761455
10^9	50847534
0.071464489 elapsed seconds

Ruby

Translation of: Wren
require 'prime'

def pi(n)
  @pr = Prime.each(Integer.sqrt(n)).to_a
  a = @pr.size
  case n
    when 0,1 then 0
    when 2 then 1
    else phi(n,a) + a - 1
  end
end

def phi(x,a)
  case a
    when 0 then x
    when 1 then x-(x>>1)
    else  
    pa = @pr[a-1]
    return 1 if x <= pa
    phi(x, a-1)- phi(x/pa, a-1)
  end
end

(0..9).each {|n| puts "10E#{n} #{pi(10**n)}" }
Output:
10E0 0
10E1 4
10E2 25
10E3 168
10E4 1229
10E5 9592
10E6 78498
10E7 664579
10E8 5761455
10E9 50847534

Sidef

func legendre_phi(x, a) is cached {
    return x if (a <= 0)
    __FUNC__(x, a-1) - __FUNC__(idiv(x, a.prime), a-1)
}

func legendre_prime_count(n) is cached {
    return 0 if (n < 2)
    var a = __FUNC__(n.isqrt)
    legendre_phi(n, a) + a - 1
}

print("e             n   Legendre    builtin\n",
      "-             -   --------    -------\n")

for n in (1 .. 9) {
    printf("%d  %12d %10d %10d\n", n, 10**n,
        legendre_prime_count(10**n), prime_count(10**n))
    assert_eq(legendre_prime_count(10**n), prime_count(10**n))
}
Output:
e             n   Legendre    builtin
-             -   --------    -------
1            10          4          4
2           100         25         25
3          1000        168        168
4         10000       1229       1229
5        100000       9592       9592
6       1000000      78498      78498
7      10000000     664579     664579

Alternative implementation of the Legendre phi function, by counting k-rough numbers <= n.

func rough_count (n,k) {

    # Count of k-rough numbers <= n.

    func (n,p) is cached {

        if (p > n.isqrt) {
            return 1
        }

        if (p == 2) {
            return (n >> 1)
        }

        if (p == 3) {
            var t = idiv(n,3)
            return (t - (t >> 1))
        }

        var u = 0
        var t = idiv(n,p)

        for (var q = 2; q < p; q.next_prime!) {

            var v = __FUNC__(t - (t % q), q)

            if (v == 1) {
                u += prime_count(q, p-1)
                break
            }

            u += v
        }

        t - u
    }(n*k, k)
}

func legendre_phi(n, a) {
     rough_count(n, prime(a+1))
}

func legendre_prime_count(n) is cached {
    return 0 if (n < 2)
    var a = __FUNC__(n.isqrt)
    legendre_phi(n, a) + a - 1
}

print("e             n   Legendre    builtin\n",
      "-             -   --------    -------\n")

for n in (1 .. 9) {
    printf("%d  %12d %10d %10d\n", n, 10**n,
        legendre_prime_count(10**n), prime_count(10**n))
    assert_eq(legendre_prime_count(10**n), prime_count(10**n))
}

Swift

import Foundation

extension Numeric where Self: Strideable {
  @inlinable
  public func power(_ n: Self) -> Self {
    return stride(from: 0, to: n, by: 1).lazy.map({_ in self }).reduce(1, *)
  }
}

func eratosthenes(limit: Int) -> [Int] {
  guard limit >= 3 else {
    return limit < 2 ? [] : [2]
  }

  let ndxLimit = (limit - 3) / 2 + 1
  let bufSize = ((limit - 3) / 2) / 32 + 1
  let sqrtNdxLimit = (Int(Double(limit).squareRoot()) - 3) / 2 + 1
  var cmpsts = Array(repeating: 0, count: bufSize)

  for ndx in 0..<sqrtNdxLimit where (cmpsts[ndx >> 5] & (1 << (ndx & 31))) == 0 {
    let p = ndx + ndx + 3
    var cullPos = (p * p - 3) / 2

    while cullPos < ndxLimit {
      cmpsts[cullPos >> 5] |= 1 << (cullPos & 31)

      cullPos += p
    }
  }

  return (-1..<ndxLimit).compactMap({i -> Int? in
    if i < 0 {
      return 2
    } else {
      if cmpsts[i >> 5] & (1 << (i & 31)) == 0 {
        return .some(i + i + 3)
      } else {
        return nil
      }
    }
  })
}

let primes = eratosthenes(limit: 1_000_000_000)

func φ(_ x: Int, _ a: Int) -> Int {
  struct Cache {
    static var cache = [String: Int]()
  }

  guard a != 0 else {
    return x
  }

  guard Cache.cache["\(x),\(a)"] == nil else {
    return Cache.cache["\(x),\(a)"]!
  }

  Cache.cache["\(x),\(a)"] = φ(x, a - 1) - φ(x / primes[a - 1], a - 1)

  return Cache.cache["\(x),\(a)"]!
}

func π(n: Int) -> Int {
  guard n > 2 else {
    return 0
  }

  let a = π(n: Int(Double(n).squareRoot()))

  return φ(n, a) + a - 1
}

for i in 0..<10 {
  let n = 10.power(i)

  print("π(10^\(i)) = \(π(n: n))")
}
Output:
π(10^0) = 0
π(10^1) = 4
π(10^2) = 25
π(10^3) = 168
π(10^4) = 1229
π(10^5) = 9592
π(10^6) = 78498
π(10^7) = 664579
π(10^8) = 5761455
π(10^9) = 50847534

Vlang

There doesn't seem to be much point to contributing memoized (Hash table) versions as they seem to be the result of the task author not realizing that there is a simple way of keeping the computation complexity from having exponential growth with counting range by limiting the "splitting" of the "phi" calculation when the result must be just one. Currently, anonymous functions and closure recursion doesn't work very well (which is why the V language is still in beta) so those slower algorithms don't work properly. However, the "partial sieving" Nim version has been [translated from the non-memoizing Nim version](https://rosettacode.org/wiki/Legendre_prime_counting_function#Non-Memoized_Versions), which explanations can be referenced for an understanding of how it works:

The Non-Memoizing Non-Recursive "Partial-Sieving" Algorithm

// compile with:  v -cflags -march=native -cflags -O3 -prod MagicCount.v

import time
import math { sqrt }

const masks = [ u8(1), 2, 4, 8, 16, 32, 64, 128 ] // faster than bit twiddling!

[inline]
fn half(n u64) i64 { return i64((n - 1) >>> 1) } // convenience function!
[inline] // floating point divide faster than integer divide!
fn divide(nm u64, d u64) u64 { return u64(f64(nm) / f64(d)) }

[direct_array_access]
fn count_primes_to(n u64) i64 {
  if n < u64(9) { return if n < i64(2) { i64(0) } else { i64((n + 1) / 2) } }  
  rtlmt := u64(sqrt(f64(n)))
  mxndx := i64((rtlmt - 1) / 2)
  arrlen := int(mxndx + 1)
  mut smalls := []u32{ len: arrlen, cap: arrlen, init: u32(it) }
  mut roughs := []u32{ len: arrlen, cap: arrlen, init: u32(it + it + 1) }
  mut larges := []i64{ len: arrlen, cap: arrlen,
                       init: i64(((n / u64(it + it + 1) - 1) / 2)) }
  cullbuflen := int((mxndx + 8) / 8)
  mut cullbuf := []u8{len: cullbuflen, cap: cullbuflen, init: u8(0)}

  mut nbps := i64(0) // do partial sieving for each odd base prime to quad root...
  mut rilmt := arrlen // number of roughs start with full size!
  for i := i64(1); ; i++ { // start with 3; breaks when partial sieving done...
    sqri := (i + i) * (i + 1)
    if sqri > mxndx { break } // breaks here!
    if (cullbuf[i >>> 3] & masks[i & 7]) != u8(0) { continue } // not prime!
    cullbuf[i >>> 3] |= masks[i & 7] // cull bp rep!
    bp := u64(i + i + 1) // partial sieve by bp...
    for c := sqri; c < arrlen; c += i64(bp) { cullbuf[c >>> 3] |= masks[c & 7] }

    mut nri := 0 // transfer from `ori` to `nri` indexes:
    for ori in 0 .. rilmt { // update `roughs` and `larges` for partial sieve...
      r := roughs[ori]
      rci := i64(r >>> 1) // skip recently culled in last partial sieve...
      if (cullbuf[rci >>> 3] & masks[rci & 7]) != u8(0) { continue }
      d := u64(r) * u64(bp)
      larges[nri] = larges[ori] -
                    (if d <= rtlmt { larges[i64(smalls[d >>> 1]) - nbps] }
                     else { i64(smalls[half(divide(n, d))]) })
                     + nbps // adjust for over subtraction of base primes!
      roughs[nri] = r // compress for culled `larges` and `roughs`!
      nri++ // advance output `roughs` index!
    }

    mut si := mxndx // update `smalls` for partial sieve...
    for pm := ((rtlmt / bp) - 1) | 1; pm >= bp; pm -= 2 {
      c := smalls[pm >>> 1]
      e := (pm * bp) >>> 1
      for ; si >= e; si-- { smalls[si] -= (c - u32(nbps)) }
    }

    rilmt = nri // new rough size limit!
    nbps++ // for one partial sieve base prime pass!
  }

  // combine results so far; adjust for over subtraction of base prime count...
  mut ans := larges[0] + i64(((rilmt + 2 * (nbps - 1)) * (rilmt - 1) / 2))
  for ri in 1 .. rilmt { ans -= larges[ri] } // combine results so far!

  // add quotient for product of pairs of primes, quad root to cube root...
  for ri := 1; ; ri++ { // break controlled below
    p := u64(roughs[ri])
    m := n / p
    ei := int(smalls[half(u64(m / p))]) - nbps
    if ei <= ri { break } // break out when only multiple equal to `p`
    ans -= i64((ei - ri) * (nbps + ri - 1)) // adjust for over addition below!
    for sri in ri + 1 .. ei + 1 { // add for products of 1st and 2nd primes...
      ans += i64(smalls[half(divide(m, u64(roughs[sri])))]) }
  }

  return ans + 1 // add one for only even prime of two!
}

start_time := time.now()
mut pow := u64(1)
for i in 0 .. 10 {
  println("π(10**$i) = ${count_primes_to(pow)}")
  pow *= 10 }
duration := (time.now() - start_time).milliseconds()
println("This took $duration milliseconds.")
Output:
π(10**0) = 0
π(10**1) = 4
π(10**2) = 25
π(10**3) = 168
π(10**4) = 1229
π(10**5) = 9592
π(10**6) = 78498
π(10**7) = 664579
π(10**8) = 5761455
π(10**9) = 50847534
This took 2 milliseconds.

The above code is about as fast as the same algorithm in the fastest of languages such as C/C++/Nim due to the greatly reduced computational complexity of O(n**(3/4)/((log n)**2)); it can compute the number of primes to 1e11 in about 22 milliseconds on the same machine as shown where it is run on an Intel i5-6500 (3.6 GHz single-thread boosted). The above code can compute the number of primes to 1e16 in about a minute, although there is a precision/accuracy problem for counting ranges above about 9e15 due to the speed-up optimization of using floating point division operations (precise only to 53-bits, which is about this limit).

Wren

Library: Wren-math

Memoized

This runs in about 4.5 seconds which is not too bad for the Wren interpreter. As map keys cannot be lists, the Cantor pairing function has been used to represent [x, a] which is considerably faster than using a string based approach for memoization.

To sieve a billion numbers and then count the primes up to 10^k would take about 53 seconds in Wren so, as expected, the Legendre method represents a considerable speed up.

import "./math" for Int

var pi = Fn.new { |n|
    if (n < 3) return (n < 2) ? 0 : 1
    var primes = Int.primeSieve(n.sqrt.floor)
    var a = primes.count
    var memoPhi = {}

    var phi // recursive closure