Talk:K-d tree: Difference between revisions

Line 92:
 
::: But other algorithms are also possible. For example, a n-dimensional grid (partitioning using evenly sized n-cubes). The grid approach has characteristics in common with both the brute force approach and the kdtree approach. What's best depends on the dataset and how often it gets updated, among other things (but other issues: cache architecture, use patterns, etc... also matter for determining cost, and the relevance of those costs). --[[User:Rdm|Rdm]] ([[User talk:Rdm|talk]]) 18:43, 6 June 2014 (UTC)
 
== C code bug producing incorrect results from k-d tree queries ==
 
Hi all. I'm new to contributing to wikis, so apologies if I'm doing this the wrong way. I couldn't find any email address to send feedback to, or anything like that, so I take it one is supposed to report bugs via a wiki contribution like this...? Anyway, I believe the C code for the k-d tree has a bug. This bug arises when there are different points in the dataset that have exactly the same value for a given coordinate – identical <code>x[idx]</code> values, in the terminology used in the code. The bug is in these lines:
 
/* median has duplicate values */
if (store->x[idx] == md->x[idx])
return md;
 
I found this bug empirically; I am using the RosettaCode k-d tree code in my project, and discovered that a dataset with identical coordinate values resulted in an incorrect k-d tree being built, and then in incorrect values being returned by queries against the k-d tree. Fixing the bug fixed that issue. Note that the bug occurs, not only in cases where two points have exactly the same spatial position (which one might argue is against the assumptions of the algorithm, or something), but also in cases where two points have one coordinate value that is identical between them, but differ in their other coordinate values – clearly a bug.
 
The bug is also apparent if you think about how the quickselect algorithm implemented by <code>find_median()</code> works. It would be nice if one could detect duplicate values and short-circuit further iterations of the algorithm, as these lines claim in their comment to do, but that does not in fact work unless one does a three-way partition around the pivot (less than, equal to, greater than). With the standard quickselect algorithm, as implemented by <code>find_median()</code>, which does a two-way partition (less than, greater than or equal to), this attempt to detect duplicate values and short-circuit can do the wrong thing, because although <code>store->x[idx] == md->x[idx]</code>, there is no guarantee that the values in between <code>store</code> and <code>md</code> are also equal; the dataset has not been sorted, just partitioned.
 
If you look at the Wikipedia article on quickselect (https://en.wikipedia.org/wiki/Quickselect), the algorithm there does not attempt to do this duplicate-detection and early termination. Instead, written in the terminology of the RosettaCode example, it does:
 
if (store == md)
return md;
 
which is quite different, and is correct; that says that if the value chosen for the pivot happens to end up in exactly the median position, then you got lucky and you're done since you found the median.
 
If this bug is fixed, other things need to be changed in the code as well, because the <code>store->x[idx] == md->x[idx]</code> test was providing the termination condition for the loop; changing it to the correct test will then cause the code to hang. As in the Wikipedia code, one must also add a test at the top of the loop:
 
if (end == start + 1)
return start;
 
That is at the top of <code>find_median()</code> as it now stands; it needs to instead be at the top of the <code>while (1)</code> loop, because this case can be hit in later iterations and is the base case of the divide-and-conquer algorithm. Finally, where the code does:
 
else start = store;
 
that needs to be:
 
else start = store + 1;
 
again to avoid a hang, again following the Wikipedia pseudocode. (Other differences from the Wikipedia pseudocode are OK, I think, and have to do with the way the C code uses pointers instead of indexes, and keeps <code>end</code> pointing a position after the last element rather than to the last element itself.)
 
I didn't attempt to edit the example myself, partly because I don't know how :->, partly because I didn't want to be presumptuous or offend anyone. But I'm quite sure that the code as it exists is bad; my empirical testing showed that very clearly. It would be good for the code to more rigorously test itself; once the tree is constructed, it is straightforward enough to recursively check that each left/right subtree obeys the spatial partitioning imposed by the coordinate of the parent node. In fact I wrote such test code in my project, along the path to understanding and fixing this bug, and it did flag the tree constructed by the original RosettaCode version of <code>find_median()</code> as having constructed an incorrect tree, and did pass the tree constructed by the algorithm once fixed as described above. Together with a large test dataset that contains many exact duplicate coordinate values, such test code should allow the bug to be confirmed and then the fix validated.
Anonymous user