Talk:K-d tree

New task

A couple of people have encouraged me at times to contribute something from my work. While I don't actually maintain any k-d tree code, I do know k-d trees are used in various ways in astronomy, and it seems they have become well-accepted data structures. I found the WP nearest neighbor description a bit too cursory to code from directly, but the Moore pseudocode relatively easy to implement. While Moore acknowledges some inefficiencies in his presented code, I thought its simplicity made it a good starting point for someone coding a k-d tree for the first time in a new language. I first tried a data set of 1e6 points but found the tree construction took a couple of seconds. That sure showed the motivation for the n log n algorithms! Rather than lead the task in that direction, though, I thought I'd initially show the simpler, if slower, algorithm and just scale back the data set. The more interesting part, after all, is the nearest neighbor search, which is log n and returns the answer in a flash. —Sonia 19:38, 6 March 2012 (UTC)

I was reading through the linked WP page and came across this paragraph:

k-d trees are not suitable for efficiently finding the nearest neighbour in high dimensional spaces. As a general rule, if the dimensionality is $k$, the number of points in the data, $N$, should be $N \gg 2^{k}$. Otherwise, when k-d trees are used with high-dimensional data, most of the points in the tree will be evaluated and the efficiency is no better than exhaustive search, and approximate nearest-neighbour methods are used instead.

It would be nice to have some sort of note along those lines here too, as it is a major criterion for algorithm/data-structure selection. (Astronomy's mostly low-dimensional, 2D or occasionally 3D, so k-d trees make plenty of sense there. Alas, the work I've done in an astronomy-allied field recently was all very high-dimensional, with some dimensions not being standard-numeric, so we couldn't make good use of this sort of thing, and anyway didn't need it, as “nearest neighbour” wasn't a problem we had to solve. Instead, we use lots of relational databases. But I'm rambling…) –Donal Fellows 06:10, 7 March 2012 (UTC)

Certainly true. Once I put in the count of nodes visited I found that many searches on the WP data set (N = 6 and k = 2) lead to all nodes being visited. My choice of the point (9, 2) was contrived to give an answer with only 3 nodes visited. It's tough to pick how much information to present in the task description and how much to expect people to do their own homework, but sure, I added the paragraph. —Sonia 19:02, 7 March 2012 (UTC)
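
For anyone wanting to reproduce the node counts discussed above, here is a minimal Python sketch, not the task's reference code. It assumes the six-point WP data set is (2,3), (5,4), (9,6), (4,7), (8,1), (7,2), and it uses the simple construction that sorts at every level and splits at the median. The names `build`, `nearest`, `mean_visited` and `WP_POINTS` are made up for illustration, and the choice of 2000 random points and k = 16 for the high-dimensional comparison is arbitrary.

```python
import random
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple

Point = Tuple[float, ...]


@dataclass
class Node:
    point: Point
    axis: int                      # coordinate this node splits on
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def build(points: Sequence[Point], depth: int = 0) -> Optional[Node]:
    """Simple (slower) construction: sort on the split axis at every level."""
    if not points:
        return None
    axis = depth % len(points[0])
    ordered = sorted(points, key=lambda p: p[axis])
    mid = len(ordered) // 2
    return Node(ordered[mid], axis,
                build(ordered[:mid], depth + 1),
                build(ordered[mid + 1:], depth + 1))


def nearest(root: Optional[Node], target: Point):
    """Return (nearest point, squared distance, number of nodes visited)."""
    best = None
    best_d2 = float("inf")
    visited = 0

    def search(node: Optional[Node]) -> None:
        nonlocal best, best_d2, visited
        if node is None:
            return
        visited += 1
        d2 = sum((a - b) ** 2 for a, b in zip(node.point, target))
        if d2 < best_d2:
            best, best_d2 = node.point, d2
        diff = target[node.axis] - node.point[node.axis]
        near, far = (node.left, node.right) if diff < 0 else (node.right, node.left)
        search(near)
        # Only cross the splitting plane if a closer point could lie beyond it.
        if diff * diff < best_d2:
            search(far)

    search(root)
    return best, best_d2, visited


def mean_visited(k: int, n: int, queries: int = 20, seed: int = 1) -> float:
    """Average nodes visited per query on n uniform random points in k dimensions."""
    rng = random.Random(seed)
    pts = [tuple(rng.random() for _ in range(k)) for _ in range(n)]
    tree = build(pts)
    total = 0
    for _ in range(queries):
        query = tuple(rng.random() for _ in range(k))
        total += nearest(tree, query)[2]
    return total / queries


# Assumed to be the six points used in the WP article's example.
WP_POINTS = [(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)]

if __name__ == "__main__":
    print(nearest(build(WP_POINTS), (9, 2)))   # ((8, 1), 2, 3)
    for k in (2, 16):
        print(k, mean_visited(k, 2000))
```

With this construction, the query (9, 2) finds (8, 1) at squared distance 2 after visiting 3 nodes. The loop at the end prints the average visit count for 2000 random points in 2 and in 16 dimensions. Since 2000 is well below 2^16, the second figure should come out far larger than the first, which is the effect the quoted WP paragraph describes.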