Talk:K-d tree

From Rosetta Code
Revision as of 16:47, 26 April 2012 by rosettacode>Bearophile (+ notes on C entry)

New task

A couple people have encouraged me at times to contribute something from my work. While I don't actually maintain any k-d tree code, I do know k-d trees are used in various ways in astronomy, and it seems they have become well accepted data structures. I found the WP nearest neighbor description a bit to cursory to code from directly, but the Moore psedocode relatively easy to implement. While Moore acknowledges some inefficiencies in his presented code, I thought the simplicity of it made it a good starting point for someone coding a k-d tree for the first time in a new language. I first tried a data set of 1e6 points but found the tree construction took a couple of seconds. That sure showed the motivation for the n log n algorithms! Rather than lead the task in that direction though, I though I'd initially show the simpler, if slower algorithm and just scale back the data set. The more interesting part, after all, is the nearest neighbor search, which is log n and returns the answer in a flash. —Sonia 19:38, 6 March 2012 (UTC)


I was reading through the linked WP page and came across this paragraph:

k-d trees are not suitable for efficiently finding the nearest neighbour in high dimensional spaces. As a general rule, if the dimensionality is k, the number of points in the data, , should be . Otherwise, when k-d trees are used with high-dimensional data, most of the points in the tree will be evaluated and the efficiency is no better than exhaustive search, and approximate nearest-neighbour methods are used instead.

It would be nice to have some sort of note along those lines here too as it is a major criterion for algorithm/data-structure selection. (Astronomy's mostly low-dimensioned, 2D or occasionally 3D, so k-d trees make plenty of sense for them. Alas, the work I've done in an astronomy-allied field recently was all very high dimensioned with some dimensions not being standard-numeric, so we couldn't make good use of this sort of thing and anyway didn't need it as “nearest neighbour” wasn't a problem we had to solve. Instead, we use lots of relational databases. But I'm rambling…) –Donal Fellows 06:10, 7 March 2012 (UTC)

Certainly true. Once I put in the count of nodes visited I found that many searches on the WP data set (N = 6 and k = 2) lead to all nodes being visited. My choice of the point (9, 2) was contrived to give an answer with only 3 nodes visited. It's tough to pick how much information to present in the task description and how much to expect people to do their own homework, but sure, I added the paragraph. —Sonia 19:02, 7 March 2012 (UTC)

The REXX code: wrong?

I'm tempted to slap a second incorrect tag on it, for it seems fundamentally wrong: as far as I can see, it constructs no tree, just bipartitions points once about the average of something. As a consequence, searches visit about half of the points, which isn't better than a straight exhaustive search by much. Someone better check the code, the pink tag is forthcoming. --Ledrug 07:15, 26 April 2012 (UTC)

C Entry

Three suggestions for the C entry:

1) To make it easy to change the coordinate type (float, double, etc) and avoid this to cause bugs in the swap function, I suggest to replace:

<lang c>#define MAX_DIM 3</lang>

With: <lang c>typedef double point_t[3];</lang>

And then use it in kd_node_t and swap.


2) To generally use NULL instead of 0 to denote the null pointer, to increase readability. Because 0 is a literal for both the null and integral zeros, while NULL denotes only a null pointer. Using more specific literals is generally good.


3) Regarding this: <lang c>typedef struct kd_node_t *kd_node, kd_node_t; struct kd_node_t {...}</lang>

See the "Chapter 5: Typedefs" here: http://www.kernel.org/doc/Documentation/CodingStyle

I think a better solution is an intermediate way between your code and that coding standard. So using typedef to mask the "struct" is acceptable, but to use it to mask a pointer is not so good. This means I suggest to use something like this in the C entry (untested):

<lang c>typedef double point_t[3]; typedef struct kd_node_t kd_node_t;

struct kd_node_t {

   point_t x;
   kd_node_t *left;
   kd_node_t *right;

};</lang>