This section contains some design notes on the implementation of
the PointKdTree
.
Every point is contained in the bounding box of the kd-tree.
A point is listed in a leaf node if and only if it is contained in the bounding box of the leaf node.
The data structure is traversed using cursors. A cursor is a pointer-like object that points to a node of the kd-tree. As always, this allows for the separation of the data structure and the algorithms that use it.
Advantages:
Disadvantages:
Advantages:
Disadvantages:
Choose Proposition 2.
class PointKdTreePoint
{
public:
virtual ~PointKdTreePoint()
{
}
virtual real position(integer axis) const = 0;
};
Disadvantages:
Performance losses because of dynamic binding
Data has to be wrapped to points which inherit from PointKdTreePoint.
Store arbitrary user points and require to supply a functor that gives projection to an axis given a point.
template <typename Real, integer N, typename Point,
typename ProjectionFunctor>
class PointKdtree
{
//...
};
Advantages:
Efficient for concrete types
Supports sharing of data (for example, store indices to a vertex buffer)
You can still use polymorphic points (for example Point = PointKdTreePoint*)
Disadvantages:
template <typename Real, integer N, typename Locator>
class PointKdTree
{
};
class Locator
{
public:
using Point = int;
real position(const Point& point, integer axis) const;
};
Advantages:
Efficient for concrete types
Supports sharing of data (for example, store indices to a vertex buffer)
You can still use polymorphic points (for example Point = PointKdTreePoint*)
Choose Proposition 3.
The problem with this approach is that the point might not be contained in the node that the cursor is pointing to. This would in turn break our invariant. Also, containment can’t be checked without traversing the tree from top to bottom because nodes do not store node bounds for all dimensions.
The tree is then traversed from top to bottom searching for the correct leaf node to add the point to. Sounds good, however, if there are n points, then the traversals for each point take extra time which could probably be amortized by adding a set of points at the same time.
At each node, the set of points is partitioned into two sets, which are sent down the node hierarchy. This way the traversal cost is amortized among the set of added points.
Choose Proposition 3.
Advantages:
Disadvantages:
Each node requires dynamical allocation of a small memory region. This might waste more memory than needs to be.
It is not easy to access all points of the tree as a sequence.
Leaf nodes would refer to points in this array by integer intervals.
Advantages:
One big memory allocation instead of many small ones.
Random-access to (permuted) point list is possible.
It is possible to hide/unhide points as per semidynamic point sets.
Fast because of good locality of reference.
Disadvantages:
Incremental adding or removing of points is not possible because the integer intervals in the nodes would be invalidated.
When hiding a point, the point must be kept in the tree’s point list. I.e. while the point is hidden in the leaf node’s point list, it is not hidden in the tree’s point list.
Leaf nodes would refer to points in this list by inclusive iterator intervals. Why inclusive intervals? Because the point that is pointed to by an end-iterator of a normal range can get removed.
Advantages:
Sequential access to (permuted) point list possible.
It is possible to hide/unhide points as per semidynamic point sets.
When hiding, the points can actually be removed from the tree’s point list.
It is possible to incrementally add or remove points and do this fast (if one also has leaf node pointers from points).
Disadvantages:
No random access to point list.
Dynamic allocation for each linked list node, and traversal of the point list by following links might not be so efficient.
Random access to the point list is important when parallelizing work that is to be done for each point in the point list (or more properly, for each point iterator in the tree). If one chooses the linked list approach, one can get around this problem by copying the iterators of the point list to a random access array and then using that to carry out the task. One does not have this problem with an array since it already is random-access. However, this advantage of arrays is diminished when considering semidynamic point sets: if most of the points have been hided, then one still has to visit those hided points in the point list. Particularly because of this reason, Proposition 3 is chosen.
Consider that there are n points and that they are all at the same position. If you divide the points to child nodes based purely on location (e.g. always store a point on the splitting plane to the negative child), then all the points necessarily end up at the same leaf node. Furthermore, if n exceeds the splitting threshold for a node, the leaf node is the result of splitting until the depth threshold is reached.
This example underlines the importance of distributing the points on the splitting plane evenly to both sides of the plane. When this is done, the number of points at the same position diminishes exponentially. Thus, for our example the height of the tree will be O(log n).
In combination with the sliding midpoint rule, which guarantees that at least 1 point is separated from the others at each split, distributing equal points evenly guarantees that the number of nodes is O(n) in the worst case.
The removal of points requires an efficient way to find out for each point (its iterator) the leaf node that it is stored in.
This is the only choice if one decides to store the points in an array.
This approach does not work, because one does not know which node to follow in the case of a point exactly on the splitting plane… Unless one bases the positioning purely on location which however leads to other robustness problems, as discussed above.
Advantages:
Disadvantages:
Choose Proposition 3. The increased memory requirement is of no concern. When a lot of points need to be removed, it would be useful to have a function that removes all points, but leaves the subdivision structure intact.
Because you can actually be holding only references to the geometry data, it is possible for the user to change geometry without the kd-tree knowing about it. The end-result is that one can invalidate both invariants of the kdtree by modifying point locations. We have no way to prevent this. This problem is similar to what happens with the std::set (for example) if one uses pointers as keys and compares them by their values. We can only educate the user that whenever he wishes to change geometry, he should do it by removing, changing, and inserting back.
Document that the user should not change geometry, but instead follow the remove-change-insert pattern.
At times the array memory will be reallocated
Therefore indices must be used as child references
But this increases indirection in node traversal
It can be problematic if the reallocation happens in the middle of a recursive function call (subdivision) invalidating node pointers at previous levels. Practically, this means that this technique does not work.
You can avoid allocation overhead by using a PoolAllocator or ArenaAllocator.
Choose Proposition 2 as the only working one.
If there are no points in a child of a split node, you can either create an empty leaf node or give a null reference to the intermediate node. However, empty leaf nodes are helpful when you want to subdivide, because then you can explicitly point to the node you want to subdivide. With null references you have to be indirect from the parent: “subdivide the negative child of this node”. This is very impractical.
A splitting node always refers to two concrete children.
The bounding box of the kd-tree (or its contained box) is often needed. Tracing the bounding box is easily done when adding points to the tree. In contrast, if the bounding box wasn’t traced, then it would have to be recomputed manually and this can mean that it can be done many times redundantly.
If the bounding box is traced, then it would be useful to have a function to enlarge it (but not shrink it).
Trace the bounding box of the tree when adding points. Make it possible to enlarge the bounding box but not shrink it.
This is possible. However:
The shapes have to be abstracted by their bounding aligned boxes. Vector based implementations then suffer abstraction penalty.
The splitting bounds and parent nodes are needed by nearest neighbor searching with points but are not needed in ray tracing. Ray tracing then suffers abstraction penalty.
Nearest neighbor searching does not make sense for general shapes. Or it does but would be hard to implement.
Because of the abstraction penalties in both directions, a better alternative is to separately create a kd-tree that is designed for point sets (PointKdTree) and a kd-tree that is designed for ray tracing (KdTree)
A parent link is a link from a child node to its parent node. In particular, a parent link makes it possible to manage hierarhical information, i.e., information in a node which is dependent on information in its children. This makes it possible to turn the kdtree into a self-adapting tree under removal and addition of points. The addition of parent links allows a considerably expanded design space. Therefore we do not decide the question here, but enumerate a selection of possibilities in a moment.
Description:
Split nodes know their bounds on splitting axis
Leaf nodes know their set of points and their number
Advantages:
Minimal memory use.
Good enough to implement high performance algorithms for static point-sets.
Shortcomings:
Points can’t be removed, except all at once.
No adaptation to smaller point sets (i.e. removing all points without touching the subdivision and inserting back a subset of those points).
Additional features to basic version:
Advantages compared to basic version:
Additional features to removable version:
Advantages compared to removable version:
Additional features to adaptable version:
Comments:
Advantages compared to adaptable version:
Shortcomings:
Additional features to adaptable version:
Comments:
Advantages to adaptable version:
Shortcomings:
Additional features to adaptable version:
Comments:
The points can be ordered in such a way that for each node the set of points can be given by an iterator interval.
The set of points and their number are hierarchical information which are efficiently updated by using the parent links.
When removing points from a leaf node, one does the following to update the set of points and their numbers at nodes. The ‘begin’ iterator is set to the ‘begin’ iterator of the negative child. The ‘last’ iterator is set to the ‘last’ iterator of the positive child. The number of points is set to the sum of the number of points in negative and positive child. This has to be recursed all the way to the root.
When inserting points you update the hierarchical information on the way back from the recursive calls.
Advantages compared to adaptable version:
The data structure is traversed using cursors allowing to separate data structure and algorithms that use it.
Generic programming together with policy based design is used to support different types of point points (and point references as well).
Points are added to the tree as sets, rather than one by one. This is more efficient.
The points in the tree are stored in a linked list which is shared between leaf nodes. The leaf nodes refer to this list by inclusive iterator intervals.
Points can be added and removed freely.
Each point stores its containing leaf node to enable fast removal and fast bottom-up nearest neighbor searching.
Each leaf node stores the corresponding bucket node.
Each node is allocated dynamically.
Each node stores the number of points under its subtree.
Each node stores the set of points under its subtree using an iterator range (which implies that the points have to be ordered in an approriate manner).
The intent is that each node can be considered as a leaf node, but with differing levels of detail.
Each node stores its parent node. This enables hierarchical information.
Each split node refers to exactly two concrete nodes.
Do not use the same data structure for points and general shapes (e.g. ray tracing). Instead, create a specialized data structure for both needs.
The packing of nodes induces some performance degradation.
Efficient approximate nearest neighbors requires to store intermediate node bounds in the split dimension.
Storing the node bounds in the split dimension implies that the user can’t be allowed to subdivide a node through a cursor since the node bounds can not be found out efficiently.
Instead, you implement a member function that refines the kd-tree with a user-defined splitting rule. This way you keep the kd-tree nodes having the correct node bounds.
In nearest neighbor searching and refinement, recursive function calls seem to be faster than iteration based on a stack.
In nearest neighbor searching, performance is enhanced by computing partial distances: whenever the distance computation exceeds the current culling distance, the computation can be aborted.
In nearest neighbor searching, performance is enhanced by computing with norm bijections rather than norms (for example, squared L2-norm).