Talk:K-d tree

== New task ==

A couple of people have encouraged me at times to contribute something from my work. While I don't actually maintain any k-d tree code, I do know k-d trees are used in various ways in astronomy, and it seems they have become well-accepted data structures. I found the WP nearest neighbor description a bit too cursory to code from directly, but the Moore pseudocode relatively easy to implement. While Moore acknowledges some inefficiencies in his presented code, I thought the simplicity of it made it a good starting point for someone coding a k-d tree for the first time in a new language. I first tried a data set of 1e6 points but found the tree construction took a couple of seconds. That sure showed the motivation for the n log n algorithms! Rather than lead the task in that direction though, I thought I'd initially show the simpler, if slower, algorithm and just scale back the data set. The more interesting part, after all, is the nearest neighbor search, which is log n and returns the answer in a flash. —Sonia 19:38, 6 March 2012 (UTC)


I was reading through the linked WP page and came across this paragraph:

k-d trees are not suitable for efficiently finding the nearest neighbour in high dimensional spaces. As a general rule, if the dimensionality is k, the number of points in the data, N, should be N ≫ 2^k. Otherwise, when k-d trees are used with high-dimensional data, most of the points in the tree will be evaluated and the efficiency is no better than exhaustive search, and approximate nearest-neighbour methods are used instead.

It would be nice to have some sort of note along those lines here too as it is a major criterion for algorithm/data-structure selection. (Astronomy's mostly low-dimensioned, 2D or occasionally 3D, so k-d trees make plenty of sense for them. Alas, the work I've done in an astronomy-allied field recently was all very high dimensioned with some dimensions not being standard-numeric, so we couldn't make good use of this sort of thing and anyway didn't need it as “nearest neighbour” wasn't a problem we had to solve. Instead, we use lots of relational databases. But I'm rambling…) –Donal Fellows 06:10, 7 March 2012 (UTC)

Certainly true. Once I put in the count of nodes visited I found that many searches on the WP data set (N = 6 and k = 2) led to all nodes being visited. My choice of the point (9, 2) was contrived to give an answer with only 3 nodes visited. It's tough to pick how much information to present in the task description and how much to expect people to do their own homework, but sure, I added the paragraph. —Sonia 19:02, 7 March 2012 (UTC)
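
(To put numbers on that rule of thumb: with k = 2 it asks for N ≫ 2^2 = 4, which the six-point WP set only barely meets, consistent with most searches visiting every node; at k = 20 it already asks for N ≫ 2^20 ≈ 10^6 points before the tree is expected to beat exhaustive search.)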

== The REXX code: wrong? ==

I'm tempted to slap a second incorrect tag on it, for it seems fundamentally wrong: as far as I can see, it constructs no tree, just bipartitions points once about the average of something. As a consequence, searches visit about half of the points, which isn't better than a straight exhaustive search by much. Someone better check the code, the pink tag is forthcoming. --Ledrug 07:15, 26 April 2012 (UTC)

== C Entry ==

Three suggestions for the C entry:

1) To make it easy to change the coordinate type (float, double, etc.) and to keep such a change from causing bugs in the swap function, I suggest replacing:

<lang c>#define MAX_DIM 3</lang>

With: <lang c>typedef double point_t[3];</lang>

And then use it in kd_node_t and swap.
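
For illustration, here is a minimal sketch of how swap could look with that typedef. The node layout shown is an assumption for the sake of the example, not the entry's actual code:

<lang c>#include <string.h>

typedef double point_t[3];

struct kd_node_t {
    point_t x;
    struct kd_node_t *left, *right;
};

/* swap only the coordinates; sizeof(point_t) follows any change to
   the element type or the number of dimensions automatically */
void swap(struct kd_node_t *a, struct kd_node_t *b)
{
    point_t tmp;
    memcpy(tmp,  a->x, sizeof tmp);
    memcpy(a->x, b->x, sizeof tmp);
    memcpy(b->x, tmp,  sizeof tmp);
}</lang>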


2) Generally use NULL instead of 0 to denote the null pointer, to increase readability: 0 is a literal for both null pointers and integer zero, while NULL denotes only a null pointer. Using the more specific literal is generally good.
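
A tiny self-contained illustration of the distinction (not taken from the entry):

<lang c>#include <stddef.h>  /* NULL */

struct kd_node_t { double x[3]; struct kd_node_t *left, *right; };

int main(void)
{
    struct kd_node_t *p = 0;     /* legal, but reads like an integer  */
    struct kd_node_t *q = NULL;  /* says "no node" at a glance        */
    int visited = 0;             /* here 0 really is an integer count */
    return p == q ? visited : 1; /* both pointers are null, so: 0     */
}</lang>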


3) Regarding this: <lang c>typedef struct kd_node_t *kd_node, kd_node_t; struct kd_node_t {...}</lang>

See "Chapter 5: Typedefs" here: http://www.kernel.org/doc/Documentation/CodingStyle

I think a better solution is an intermediate way between your code and that coding standard: using a typedef to hide the "struct" keyword is acceptable, but using it to hide a pointer is not so good. So I suggest something like this in the C entry (untested):

<lang c>typedef double point_t[3];
typedef struct kd_node_t kd_node_t;

struct kd_node_t {
    point_t x;
    kd_node_t *left;
    kd_node_t *right;
};</lang>

1) In terms of decoupling struct defs and usage, the easiest thing is probably
<lang c>void swap(struct kd_node_t *x, struct kd_node_t *y)
{
    struct kd_node_t tmp;
    tmp = *x;
    *x = *y;
    *y = tmp;
}</lang>
which hides the details best, but does unnecessary copying. Another way is

<lang c>struct point_t { double x[3]; };

struct kd_node_t { struct point_t pt; struct kd_node_t *left, *right; };

void swap(struct kd_node_t *x, struct kd_node_t *y)
{
    struct point_t tmp;
    tmp = x->pt;
    x->pt = y->pt;
    y->pt = tmp;
}</lang>
which is arguably better, because it now allows the compiler to figure out the best method to do the copying. The thing is, hiding details from swap() is not right. Being a performance-critical part, it should know the inner workings of the struct involved. You could also argue that one shouldn't do node->x[1], and that some accessor method get_node_coord(node, 1) should be used instead. If we were designing a library and planned for it to be used by total strangers who need an interface decoupled from the data definition, maybe; for a short example on RC and a small routine used internally, no.

2) I'm ambivalent towards the NULL token (it's probably #define NULL (0) anyway), which is better than what I have to say about TRUE or FALSE. If one is reading C code, he'd better know what a null pointer is anyhow. I prefer writing 0, but won't mind if someone changes it to NULL.
3) The typedef is indeed unneeded. My habit is typedef struct {} sometype_t and typedef struct {} *sometype, where _t says it's a struct, and the lack of it means a pointer. This may make pointers to pointers easier to write (sometype *p instead of sometype_t **p), but is certainly not necessary. I'll drop the typedefs here, but I doubt it will make the code more or less readable. --Ledrug 21:34, 26 April 2012 (UTC)
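
For readers unfamiliar with that convention, a minimal illustration (hypothetical names, not the entry's):

<lang c>/* "_t" names the struct type; the bare name is a pointer to it */
typedef struct kd_node kd_node_t;
typedef struct kd_node *kd_node;

struct kd_node {
    double  x[3];
    kd_node left, right;   /* pointer members read without the '*' */
};

/* a pointer to pointer then reads as kd_node *pp rather than kd_node_t **pp */</lang>
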
FWIW, I agree strongly with using NULL; it's more idiomatic and says that “we're thinking about the pointer that does not point”. It's really a code-smell thing; if someone's mixing things up, you've got to examine every use of 0 to figure out what's going on. (This isn't the same as when you're working in your own code, but we want the very best of style here as each language understands that concept.) –Donal Fellows 08:25, 5 June 2012 (UTC)

== Dubious about speed claims here ==

After studying the task and these implementations, I'm feeling dubious about the speed claims here.

If it takes half a second to construct the tree for a million nodes, and it takes a twentieth of a second to visit a million nodes, what kinds of speedups are we getting that justify this use pattern?

More specifically: do any of these implementations outperform an exhaustive search on a dense linear dataset by as much as a factor of 2? --Rdm (talk) 20:05, 5 June 2014 (UTC)
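
For concreteness, this is the kind of exhaustive-search baseline being asked about; a hypothetical sketch for timing comparisons only, with made-up names (dist2, nearest_linear) that do not come from any entry:

<lang c>#include <stdio.h>
#include <stddef.h>

#define DIM 3

/* squared Euclidean distance between two DIM-dimensional points */
static double dist2(const double *a, const double *b)
{
    double d = 0;
    for (int i = 0; i < DIM; i++) {
        double t = a[i] - b[i];
        d += t * t;
    }
    return d;
}

/* index of the point in pts[0..n-1] closest to q, by brute force */
static size_t nearest_linear(const double (*pts)[DIM], size_t n, const double *q)
{
    size_t best = 0;
    double best_d = dist2(pts[0], q);
    for (size_t i = 1; i < n; i++) {
        double d = dist2(pts[i], q);
        if (d < best_d) { best_d = d; best = i; }
    }
    return best;
}

int main(void)
{
    const double pts[][DIM] = { {2, 3, 0}, {5, 4, 0}, {9, 6, 0}, {4, 7, 0} };
    const double q[DIM] = { 9, 2, 0 };
    printf("nearest index: %zu\n", nearest_linear(pts, 4, q));
    return 0;
}</lang>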