This page should be presumed stale in contexts which need up-to-date details.
Maintaining perspective can be difficult.
"For every polynomial-time algorithm you have, there is an exponential algorithm that I would rather run." - Alan Perlis
"Big O notation is a mathematical notation that describes the limiting behavior of a function when the argument tends towards a particular value or infinity." - Wikipedia
However, Big O treatments of algorithms typically neglect limiting behaviors of the underlying machine architecture -- communication overhead, cache management, etc.
In the context of cache management on a relatively modern cpu design, both the likelihood of a cache miss and the cost of a cache miss depend on the size of the data set. Estimated from a spec sheet for a relatively modern cpu architecture (2014-ish intel), this looks like:
|Size|Latency|Increase|Description|
|32 K|4|||
|64 K|8|4|+8 (L2)|
|128 K|10|2||
|256 K|11|1||
|512 K|24|13|+24 (L3)|
|1 M|30|6||
|2 M|33|3||
|4 M|35|2||
|8 M|36 + 6 ns|1 + 6 ns||
|16 M|36 + 34 ns|28 ns|+57 ns (RAM)|
|32 M|36 + 48 ns|14 ns||
|64 M|36 + 54 ns|6 ns||
|128 M|40 + 56 ns|4 + 2 ns|+8 (TLB miss)|
|256 M|42 + 57 ns|2 + 1 ns||
|512 M|43 + 57 ns|1||
|1024 M|44 + 57 ns|1||
Here, a "cycle" is a clock cycle (latencies above are in cycles unless marked in ns), and depending on the type of instructions being used, a cpu core may execute 1, 2, 4, or even (in carefully limited contexts) 8 instructions per cycle. (A 3.5GHz clock has a 0.286ns clock cycle.)
In other words, when working with a gigabyte of memory on that machine, a single cache-missing access costs roughly 44 cycles + 57 ns, and at 0.286ns per cycle that 57 ns is about 200 cycles -- call it 250 cycles in all. So one instruction with a cache miss might cost the time of almost 250 instructions (at one instruction per cycle) to, in extreme cases, almost 2000 instructions (at eight instructions per cycle).
Of course, the above simplifications focused purely on a single cpu architecture and ignored variations in memory architecture.
And there are other issues -- for example, when two different cores are accessing the same memory, that tends to introduce cache management conflicts which slow things down.
And then there are OS issues.
To simplify this further, though: code which accesses memory sequentially tends to have an order of magnitude speed advantage over code which accesses memory randomly. A rule of thumb here, if you're not prepared to spend a lot of time benchmarking, is that an algorithmic performance advantage which is not at least a factor of 2 is probably a waste of time. (And, in some contexts -- especially where the underlying system was designed to solve a different problem -- factor of 1000 or greater performance improvements might be practical, for reasons independent of cache management.)
Meanwhile, the stuff we do here on rosettacode is pretty much just "dipping our toes in the water". Because of our focus on problems which can be solved by a variety of programmers in a variety of languages, we tend to favor problems which shy away from these performance extremes.
Anyways, beware that all of these details are highly contextual, and an awful approach in one context might or might not be a brilliant approach in an adequately different context.
Once upon a time, I worked directly with assembly and even machine language (mostly 6502, some pdp-11). Before that, I focused more on the underlying electronics. That feels like a very long time ago. Nowadays I guess I have to occasionally work with the stuff, but I doubt I will ever fully comprehend the details of all of the intel architecture(s).
Currently, on Rosetta Code, most (but not all) of my contributions focus on J.
The block below indicating my various language uses is incomplete and perhaps misleading (expertise is context specific -- knowing how to use a language for one purpose does not necessarily imply significant expertise for a different kind of use of that language).
|My Favorite Languages|
|Emacs Lisp|Rusty (ex-expert)|
|Icon|Very Rusty (ex-expert)|