Performance of find methods, comparison with linear search

Wed Nov 21 01:47:01 UTC 2007

On Nov 20, 2007 9:59 PM, Paul Harris <paulharris at computer.org> wrote:
> attached
>
>
> On Nov 20, 2007 9:38 PM, Renaud Detry <renaudjdetry at airpost.net> wrote:
> > > Then I moved
> > >  triplet nearest = random_points[0];
> > > to above the for loop, and added a
> > >   std::cout << "PCH: " << nearest << std::endl;
> > > below the for loop... I got this result
> > >

Hi,

Trying Renaud's original version I also get:
 - snip -
 <Stopwatch> kdtree [lap_iterative_linear_find_nearest]: 0s (0 clock ticks)
 lap_iterative_linear_find_nearest = [ lap_iterative_linear_find_nearest 0 ];
 <Stopwatch> kdtree [lap_recursive_linear]: 112.57s (112570000 clock ticks)
 lap_recursive_linear = [ lap_recursive_linear 1.1257e+08 ];

That's an amazing result, and when I try Paul's version I get (full log):
N: 100000
N_PROBES: 10000
Range: 2.15443
N = [ N 100000 ];
N_PROBES = [ N_PROBES 10000 ];
<Stopwatch> kdtree [lap_building_vectors]: 0.02s (20000 clock ticks)
lap_building_vectors = [ lap_building_vectors 20000 ];
<Stopwatch> kdtree [lap_building_tree]: 0.21s (210000 clock ticks)
lap_building_tree = [ lap_building_tree 210000 ];
<Stopwatch> kdtree [lap_find_within_range]: 0.35s (350000 clock ticks)
lap_find_within_range = [ lap_find_within_range 350000 ];
Mean count 7
<Stopwatch> kdtree [lap_count_within_range]: 0.35s (350000 clock ticks)
lap_count_within_range = [ lap_count_within_range 350000 ];
<Stopwatch> kdtree [lap_find_nearest]: 14.53s (14530000 clock ticks)
lap_find_nearest = [ lap_find_nearest 1.453e+07 ];
# comparisons: 32907492
<Stopwatch> kdtree [lap_optimizing]: 1.06s (1060000 clock ticks)
lap_optimizing = [ lap_optimizing 1.06e+06 ];
<Stopwatch> kdtree [lap_find_within_range_opt]: 0.23s (230000 clock ticks)
lap_find_within_range_opt = [ lap_find_within_range_opt 230000 ];
Mean count 7
<Stopwatch> kdtree [lap_count_within_range_opt]: 0.23s (230000 clock ticks)
lap_count_within_range_opt = [ lap_count_within_range_opt 230000 ];
<Stopwatch> kdtree [lap_find_nearest_opt]: 13.81s (13810000 clock ticks)
lap_find_nearest_opt = [ lap_find_nearest_opt 1.381e+07 ];
# comparisons: 31189607
<Stopwatch> kdtree [lap_iterative_linear_find_within_range]: 20.09s
(20090000 clock ticks)
lap_iterative_linear_find_within_range = [
lap_iterative_linear_find_within_range 2.009e+07 ];
PCH: 50880000
<Stopwatch> kdtree [lap_iterative_linear_find_nearest]: 19.94s
(19940000 clock ticks)
lap_iterative_linear_find_nearest = [
lap_iterative_linear_find_nearest 1.994e+07 ];
PCH: 70820000
PCH: (26.0063,39.9494,46.6781)
# comparisons: 1000000000
<Stopwatch> kdtree [lap_recursive_linear]: 109.1s (109100000 clock ticks)
lap_recursive_linear = [ lap_recursive_linear 1.091e+08 ];

Wow, it went from 0.0s to 19.94s for the
lap_iterative_linear_find_nearest. This difference in the loop is:

@@ -208,13 +212,17 @@
   }
   sw.lap("iterative_linear_find_within_range");

-  triplet nearest = random_points[0];
+
+  std::cout << "PCH: " << clock() << std::endl;
+ triplet nearest = random_points[0];
+ int num_com_linear = 0;
   for (int i = 0; i < N_PROBES; ++i)
   {
     for (int j = 0; j < N; ++j)
     {
       double d = distance(random_probes[i], random_points[j]);
+      ++num_com_linear;
       if (d < mindist)
       {
         mindist = d;

I was surprised and I tried again without the counter...

@@ -208,13 +212,17 @@
   }
   sw.lap("iterative_linear_find_within_range");

+ // std::cout << "PCH: " << clock() << std::endl;
   triplet nearest = random_points[0];
   int num_com_linear = 0;
   for (int i = 0; i < N_PROBES; ++i)
   {
     for (int j = 0; j < N; ++j)
     {
       double d = distance(random_probes[i], random_points[j]);
+      //++num_com_linear;
       if (d < mindist)
       {
         mindist = d;

And I still get the same results (around 19s). So I don't know how I
can get these 0.0s at first. I'm puzzled.



Contrary to Paul, I do not believe that kdtree find_nearest is
"normally" slow because of memory lookup. We do not allocate many
things in this example. "triplet" only weight 12 octect of memory...
10000 of them does not represent much memory space.

To prove my point, I designed another performance test (in
attachement). This one purely test the performance of the tree. I ran
it with "valgrind --tool=cachegrind". Below is the output without
valgrind:

[sylvain at sapiens examples]$ ./perf_kdtree 100000 1000
For 100000 triplets and 1000 tries:
find_nearest()      takes 1.84
find_within_range() takes 22.48
optimize()          takes 1.55
find_nearest()      takes 2.88
find_within_range() takes 23.47

Where it showes again that after optimisation, it takes longer to get
an average find_nearest() and find_within_range() response. This is
clearly a fault in our algorithm! It should not be the case! Now the
output of valgrind:

[sylvain at sapiens examples]$ valgrind --tool=cachegrind ./perf_kdtree 100000 1000
==7511== Cachegrind, an I1/D1/L2 cache profiler.
==7511== Copyright (C) 2002-2007, and GNU GPL'd, by Nicholas Nethercote et al.
==7511== Using LibVEX rev 1732, a library for dynamic binary translation.
==7511== Copyright (C) 2004-2007, and GNU GPL'd, by OpenWorks LLP.
==7511== Using valgrind-3.2.3, a dynamic binary instrumentation framework.
==7511== Copyright (C) 2000-2007, and GNU GPL'd, by Julian Seward et al.
==7511== For more details, rerun with: -v
==7511==
For 100000 triplets and 1000 tries:
find_nearest()      takes 66.14
find_within_range() takes 1439.35
optimize()          takes 117.32
find_nearest()      takes 183.93
find_within_range() takes 1411.65
==7511==
==7511== I   refs:      83,179,497,189
==7511== I1  misses:             1,893
==7511== L2i misses:             1,885
==7511== I1  miss rate:           0.00%
==7511== L2i miss rate:           0.00%
==7511==
==7511== D   refs:      57,363,357,908  (33,633,704,194 rd + 23,729,653,714 wr)
==7511== D1  misses:       109,279,903  (   107,674,718 rd +      1,605,185 wr)
==7511== L2d misses:        83,157,268  (    82,998,770 rd +        158,498 wr)
==7511== D1  miss rate:            0.1% (           0.3%   +            0.0%  )
==7511== L2d miss rate:            0.1% (           0.2%   +            0.0%  )
==7511==
==7511== L2 refs:          109,281,796  (   107,676,611 rd +      1,605,185 wr)
==7511== L2 misses:         83,159,153  (    83,000,655 rd +        158,498 wr)
==7511== L2 miss rate:             0.0% (           0.0%   +            0.0%  )


If you've never seen a valgrind output before, what it means is that
there is simply NO significant hopping in memory. In other words, when
you build your tree in a block, like in the precedent test, hopping in
memory will not happen. Therefore, the long execution time we get for
find_nearest() and find_within_range() may be due to an error in the
algorithm.


That's when I started to look deeper at find_nearest() and I realized
the following:
1. The algorithm is recursive. Recursive is slower than iterative, so
iterative would improve
    performance.
2. The algorithm seems to behave this way:
       at each node, compare the hyperrectangle formed by the center
point and the node value
       with the current best hyperrectange. Then if the current best
hyperrectangle intersect with
       the left node and the right node value, continue in each node.
    I think it's better to get the approximate best hyperrectangle
during iterative deepening
    search like when you want to insert the point (if it was an
insertion). Now walk up the tree
    from that point on and go into the node that intersect the HYPERSPHERE only.
    This is where the current algorithm looses efficiency IMHO.

Please comment on this.

During the weekend I will try to reimplement find_nearest taking these
two points into consideration. I'll send you the result over the week
end.


Sylvain