Super-super fast Point Inclusion test for Cylinder/Breps?

Generally, this type of problem belongs to the class of “parallel problems” where a multi-threaded programming approach can be applied. But the code that solves this problem for one item (the code inside the loop) is extremely fast, so a parallel implementation will not improve speed for a few hundred thousand items. If there are far more items (hundreds of millions of points), then parallel code can perform faster. Or, if the code inside the loop is heavy and takes a long time to execute (say 1 second per item), then even for a low number of items (a few hundred) multi-threaded code can execute faster.
Anyway, here is code that implements multi-threading - the for loop is replaced with the Parallel.For method, so you can give it a try:

    // Requires: using Rhino.Geometry; using System.Threading.Tasks;
    public static bool[] IsPointInsideCylinder02(Cylinder cylinder, Point3d[] points, double tolerance)
    {
        // pointStatus[i] == true => points[i] is inside the cylinder
        bool[] pointStatus = new bool[points.Length];

        Vector3d N = cylinder.Axis;
        N.Unitize(); // N should already be a unit vector, but just in case...
        Point3d C1 = cylinder.Center; // note: t below is measured from this point, so the base is assumed to sit here (Height1 == 0)

        // base of the cylinder = circle at Height1; only its radius is needed
        Circle baseCircle = cylinder.CircleAt(cylinder.Height1);
        double radiusWithTolerance = baseCircle.Radius + tolerance;
        double SquareRadiusWithTolerance = radiusWithTolerance * radiusWithTolerance;

        // valid parameter range along the axis, expanded by the tolerance
        double t_C1 = 0 - tolerance;
        double t_C2 = cylinder.TotalHeight + tolerance;

        Parallel.For(0, points.Length, i =>
        {
            // t = signed distance of points[i] along the axis, measured from C1
            double t = N.X * (points[i].X - C1.X) + N.Y * (points[i].Y - C1.Y) + N.Z * (points[i].Z - C1.Z);
            // T = projection of points[i] onto the axis
            double TX = C1.X + t * N.X;
            double TY = C1.Y + t * N.Y;
            double TZ = C1.Z + t * N.Z;
            // squared distance from the point to the axis
            double SquareDistancePT = (points[i].X - TX) * (points[i].X - TX) +
                    (points[i].Y - TY) * (points[i].Y - TY) +
                    (points[i].Z - TZ) * (points[i].Z - TZ);
            if (t < t_C1 || t > t_C2 || SquareDistancePT > SquareRadiusWithTolerance)
            {
                pointStatus[i] = false; // point not inside cylinder
            }
            else
            {
                pointStatus[i] = true;
            }
        });

        return pointStatus;
    }
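
For completeness, here is a minimal usage sketch; the cylinder construction and the test values are illustrative assumptions, not part of the original post:

    // Build a cylinder of radius 5 and height 20 around the world Z axis
    Circle baseCircle = new Circle(Plane.WorldXY, 5.0);
    Cylinder cyl = new Cylinder(baseCircle, 20.0);

    Point3d[] pts = new Point3d[]
    {
        new Point3d(1, 1, 10),   // inside
        new Point3d(10, 0, 10),  // outside (too far from the axis)
        new Point3d(0, 0, 30)    // outside (above the top)
    };

    bool[] inside = IsPointInsideCylinder02(cyl, pts, 0.001);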

@RadovanG, I’m deeply impressed.

This is amazing. In the other thread we were talking about “making waves to get the tanker off the cliffs” and someone suggested that another approach is to “wait for the tide to get the tanker off the reef”, but here you mill down the cliffs to grains of sand to set the tanker afloat.

That is the attitude I like! Wow!

// Rolf

RadovanG: awesome, and truly the most direct approach to take.

RIL: well, there is obviously also the problem of measuring. When measuring my approach, for instance, I took everything into account. However, measuring only the algorithm part - using a Stopwatch, flushing the garbage collector and running it multiple times, taking the best result - brought me to 2 ms at 50000 pts. Still slower than Radovan's solution, but working in script components always adds overhead, no matter whether data is passed by value or by reference. However, when talking about performance we also need to say what we measure and how. Hopefully there will be more threads like this here; I enjoyed following it. By the way, I also dismissed multithreading here. What might increase performance further is the use of single-precision floating-point numbers. This would only make sense if the points are imported directly into a float struct or as Point3f; however, Point3f is still a “heavier” struct.
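
A minimal sketch of that kind of measurement, assuming the cyl and pts variables from the usage sketch earlier (the iteration count is an arbitrary choice):

    // Flush the GC so a collection is less likely to land inside a timed run
    GC.Collect();
    GC.WaitForPendingFinalizers();
    GC.Collect();

    long best = long.MaxValue;
    var sw = new System.Diagnostics.Stopwatch();
    for (int run = 0; run < 10; run++)
    {
        sw.Restart();
        bool[] result = IsPointInsideCylinder02(cyl, pts, 0.001); // method under test
        sw.Stop();
        if (sw.ElapsedMilliseconds < best)
            best = sw.ElapsedMilliseconds;
    }
    // 'best' now holds the fastest observed run in milliseconds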


Very true. For this reason I preferred to talk about X ms “in total”.

But also this “total” proved to be a relative thing because, depending on the use case, I can sometimes actually skip parts of the “total”, like the reading of the input (controlled by my “invalidate” port). This is possible only in cases where the input doesn’t change while the algorithm still needs to be executed several times as other parameters change (for example when a stable list of mesh vertices is the input). This happens to be one of my use cases, so for this reason the “isolated fastest algorithm part” also makes a huge difference for me.

In any case, having come this far, what still interests me a (little) bit is whether the reading of the in-port of the point list can be optimized, or perhaps more importantly, whether the GC activity can be minimized or eliminated.

I can’t do any tests right now, but I noticed that every once in a while it looks like the GC kicks in and causes runtimes of 50 ms or even more. That does however NOT make the super-super-fast algorithms meaningless (at least not in my use cases), but in some cases it makes the performance a bit unpredictable.

But since .NET is all new to me I have no idea if the GC can be tweaked, or what strategies people use to avoid choking the GC. As an old Delphi guy I always took care of the cleaning up myself. And of course, the cheapest objects to clean up are the ones that are never created in the first place (“reusing” already instantiated pools of objects was a standard strategy in cases where super-super performance was needed). In Grasshopper I don’t know (yet) if I can get far enough down to the in-port data to skip object instantiation altogether and only use nested arrays from the very beginning, like so (but without the List<Point3d> being instantiated before getting my hands on the incoming point list):

    // Create three (3) nested arrays to hold long lists 
    // of X,Y and Z values of the incoming point list
    double[][] points = new double[3][];
    points[0] = new double[m_Pts.Count];  // X
    points[1] = new double[m_Pts.Count];  // Y
    points[2] = new double[m_Pts.Count];  // Z

    // Copy point list
    for (var i = 0; i < m_Pts.Count; ++i)
    {
        Point3d pt = m_Pts[i];
        points[0][i] = pt.X;
        points[1][i] = pt.Y;
        points[2][i] = pt.Z;
    }

The time for the allocation of this array (with 50.000 pts) is not measurable (less than 1 ms). Worth noting is that “switching” the direction of the arrays, from

double[list.Count][3] to
double[3][list.Count]

(only three (3) long arrays for the X, Y and Z values), minimizes the time for instantiating the (few) arrays. Otherwise a double[3] array would have to be allocated for each point, and those come in the thousands.
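
For contrast, a minimal sketch of the other direction (double[list.Count][3]) is shown below; the names follow the snippet above, and this is the layout being argued against, since it creates one small, short-lived heap object per point:

    // One tiny double[3] allocated per point -> thousands of allocations
    double[][] perPoint = new double[m_Pts.Count][];
    for (var i = 0; i < m_Pts.Count; ++i)
    {
        Point3d pt = m_Pts[i];
        perPoint[i] = new double[] { pt.X, pt.Y, pt.Z };
    }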

Q: Is it possible to tweak an in-port to skip the instantiation of point objects altogether, using these arrays instead?

I assume there’s some kind of subscription laid on the port as well, but if the subscription can be kept while the rest is replaced with the 1+3 arrays, then that could never be heavy on the GC even in cases where the point list must be re-read every time the inclusion algorithm has to be reevaluated.
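
As a side note on the “reuse instead of reallocate” idea mentioned above, here is a minimal sketch of caching the three coordinate arrays between solutions; the field and method names are illustrative, not an existing Grasshopper API:

    // Requires: using System.Collections.Generic; using Rhino.Geometry;
    // Cached between runs; only reallocated when the point count changes
    static double[][] s_coords;

    static double[][] GetCoordinateBuffers(List<Point3d> pts)
    {
        if (s_coords == null || s_coords[0].Length != pts.Count)
        {
            s_coords = new double[3][];
            s_coords[0] = new double[pts.Count];
            s_coords[1] = new double[pts.Count];
            s_coords[2] = new double[pts.Count];
        }
        // Refill the existing buffers instead of allocating new ones
        for (int i = 0; i < pts.Count; ++i)
        {
            Point3d pt = pts[i];
            s_coords[0][i] = pt.X;
            s_coords[1][i] = pt.Y;
            s_coords[2][i] = pt.Z;
        }
        return s_coords;
    }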

// Rolf

Hi
Probably there will be some performance gain from using double[3][n] instead of Point3d[n], but you have to test it in different circumstances.

Anyway, I found time to test sequential against parallel code on my PC (quad-core Intel® Xeon® Processor E3-1241 v3 3.50 GHz, 16GB RAM, Win10 Pro) and here are the results:

  1. 500.000 points
    • sequential for => 5-6 msec
    • parallel.for => 2-3 msec
  2. 5.000.000 points
    • sequential for => 62-66 msec
    • parallel.for => 22-36 msec
  3. 50.000.000 points
    • sequential for => 585-590 msec
    • parallel.for => 190-230 msec

So even if the following photo may tell us something about multi-threaded programming, the results show us that it is working as expected…
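
(The “sequential for” in the comparison is presumably just the same loop body in a plain for loop; a minimal sketch, using the same variables as in the method above:)

    // Sequential version of the same test, for comparison with Parallel.For
    for (int i = 0; i < points.Length; i++)
    {
        double t = N.X * (points[i].X - C1.X) + N.Y * (points[i].Y - C1.Y) + N.Z * (points[i].Z - C1.Z);
        double TX = C1.X + t * N.X;
        double TY = C1.Y + t * N.Y;
        double TZ = C1.Z + t * N.Z;
        double SquareDistancePT = (points[i].X - TX) * (points[i].X - TX) +
                (points[i].Y - TY) * (points[i].Y - TY) +
                (points[i].Z - TZ) * (points[i].Z - TZ);
        pointStatus[i] = t >= t_C1 && t <= t_C2 && SquareDistancePT <= SquareRadiusWithTolerance;
    }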


And what about splitting the inclusion test and checking first if the point is above or below the cylinder?
And then calculating the square distance only for the remaining points?
Could this speed things up a little more WHEN several points don't lie between the cylinder bases?
What do you think? :slight_smile:


Yap, I did that in my algorithm. However, it makes more sense to also move the calculation of TX, TY, TZ to the end.

RIL: but I was talking about using float types, not double

    struct Point3f
    {
        // struct with no constructor; search for "Unions in C#" as an alternative
        public float X;
        public float Y;
        public float Z;
    }

    pointStatus[i] = false;

    float t = ...
    if (t >= t_C1)
    {
        if (t <= t_C2)
        {
            float TX = ...
            float TY = ...
            float TZ = ...
            var sqdist = ...
            if (sqdist <= SquareRadiusWithTolerance)
                pointStatus[i] = true;
        }
    }

TomTom's suggestion with nested ifs will generally result in faster code execution.
But there is no need for this line of code: pointStatus[i] = false;
because if we declare pointStatus[] like this: bool[] pointStatus = new bool[points.Length];
the default value of pointStatus[i] will be false.
A float type instead of double will save some memory space, but I am not sure if it will increase the speed of computation.

And code with nested ifs like this:

    if (t >= t_C1)
    {
        if (t <= t_C2)
        {
            var sqdist = …
            if (sqdist <= SquareRadiusWithTolerance)
                pointStatus[i] = true;
        }
    }

is semantically identical to the following, shorter version (but most probably with no gain in performance):

    if (t >= t_C1 && t <= t_C2)
    {
        var sqdist = …
        if (sqdist <= SquareRadiusWithTolerance)
            pointStatus[i] = true;
    }

But generally, optimization can be tricky. There are a lot of articles on the web about it, and some of them make interesting reading…
https://hackernoon.com/the-rules-of-optimization-why-so-many-performance-efforts-fail-cf06aad89099


Thank you, very interesting article.

Get this book and you’ll know more than enough about the .NET environment and how to use it to your advantage.

http://www.writinghighperf.net/

/Nathan


I just came back from a trip to Uppsala so I only now read the latest replies.

Those parallel puppets were a funny description of the real difference between “actual” and “theory”. :slight_smile:

Thank you all for your hints, code examples and links on optimization. They are all very much appreciated. I have read the code examples, the link on optimization (which I wholeheartedly agree with) and I ordered the book (thanks @nathanletwory).

Having said this, I think I should point out why, in this particular case, I do not deal with the performance of other parts of the total task (things like reading the source data (meshes, point clouds), displaying real-time results, saving final results to file and so on). In some cases the performance of such things doesn’t actually matter, while in other cases it does. I’m saying this from the perspective of having personally designed a XX million dollar dynamic real-time logistics system which was soaked in all kinds of such performance problems (including interacting services and a real-time planning system which people in the industry could not believe was doing real-time calculations when they first saw it). In that case I solved even the most “inconceivable” performance problems by design (read: by the way I designed the core system). The system thus was “inherently performant”, so to speak.

Doing so implies doing the most significant optimizations before even building (implementing) the final system.

I put this last statement on its own line because I know it’s a bit like “swearing in church” to apply optimizations in “the wrong order”. But at the same time, anyone with even a little experience of building complete applications (from scratch) can easily understand the point: some systems are so lousy, so slow, and so hopelessly difficult to fix and optimize that, when that is discovered, we have to admit that sometimes a system “is slow by design”. (Let’s not talk about the speed with which Rhino creates its geometric objects, for example, because “it’s too late” to fix it anyway.) :wink:

And here’s my point: If something can be “slow by design” then also the opposite can be true - namely that something can be very fast, by design.

But that implies focus on the design stage instead of leaving it to trim the hedge after the fact.

So if the very design approach of the core processes in a system (the parts that actually do the bulk of the job) is “inherently fast” (by design, that is), then that can also mean that one doesn’t have to tweak the “business logic” of the system so very much in later stages of development. Which is good, since it is the late optimizations that are to be avoided (because optimizations often obscure business concepts and make the system übercomplex, difficult to maintain, and so on).

With “business logic” I mean the logical flow which solves the specific problem the application was meant to solve in the first place. And with “core processes” I mean such basic and generic concepts and code as can be reused in different systems, and even for different tasks or problem domains, even simplifying the perception of the problem to be solved (in addition to doing the job very fast). This involves finding an early (design) strategy which can also result in code that doesn’t have to be written or executed at all, if the design approach arrives at the same desired result in other, smarter ways.

So performance “by design” (actually “early design”) is definitely an option in some cases as well. In some cases, but not in all cases. This is due to the fact that many applications run a thousand times faster than they have to, from a user perspective.

But the user perspective is not the only perspective. In recent times we have also seen requirements on energy consumption for performing complex calculations, which can be yet another reason for optimization.

Anyway, I appreciate all your contributions in this thread. I’m in the early stage of designing an application and I know exactly why I need to ensure that I have a “performance by design” approach before even starting to build the final application.

Thank you all for your attention and contributions.

// Rolf