GH: Slow serializing to disk - due to compression?

RIL · November 9, 2018, 11:31am

Two questions regarding (my own) Data Output and Data Input components:

Q1: Every time I connect an output wire, the Write method is fired. Is it possible to know which output caused the event?

public override bool Write(GH_IO.Serialization.GH_IWriter writer)
{
   // Multiple values are stored here, so all of them are written every time an
   // output is connected, which happens every time the GH definition is opened
   // (outputs are created on open).
   return base.Write(writer);
}

Q2: Is it possible to tweak the serialization so that it does not compress the data? If I understood correctly, the “Serialize_Binary()” method spends quite some time compressing the data:

var bytes = datachunk.Serialize_Binary(); // byte[]

or is it the WriteAllBytes(...) method which is compressing the data?

System.IO.File.WriteAllBytes(fpath, bytes);

Anyway, I would prefer writing at high speed rather than spending CPU cycles compressing data on “super-fast” NVMe disks…

Any trick I could use to speed up the disk-writes?

// Rolf

DavidRutten · November 9, 2018, 8:11pm

Did someone say this, or did you profile that?

It is not possible to disable the compression, but then it should not represent the bulk of the work anyway.
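One way to answer the "did you profile that?" question is to time the two stages separately. The sketch below is only an illustration: `SaveProfiler` and its `serialize` delegate are hypothetical names, and the delegate stands in for whatever produces the bytes (for example the `datachunk.Serialize_Binary()` call from the first post).

```csharp
using System;
using System.Diagnostics;
using System.IO;

public static class SaveProfiler
{
    // Times serialization and the disk write separately.
    // `serialize` is a stand-in (assumption) for the call that produces the bytes.
    public static (TimeSpan Serialize, TimeSpan Write, int ByteCount) Measure(
        Func<byte[]> serialize, string path)
    {
        var sw = Stopwatch.StartNew();
        byte[] bytes = serialize();
        TimeSpan serializeTime = sw.Elapsed;

        sw.Restart();
        File.WriteAllBytes(path, bytes);
        TimeSpan writeTime = sw.Elapsed;

        return (serializeTime, writeTime, bytes.Length);
    }
}
```

If the first number dominates, the disk is not the problem. Note that the write time can look shorter than it really is, because the operating system may buffer the write.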

DavidRutten · November 9, 2018, 8:13pm

You could try and execute the file serialisation (or at least the write) on a different thread.

What sort of write times are we talking about? And is it because the file contains a lot of internalised data?
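A minimal sketch of moving only the write to another thread, using plain .NET calls. `BackgroundSave` is a hypothetical helper name, and the sketch assumes the byte array is not modified while the write is in flight.

```csharp
using System.IO;
using System.Threading.Tasks;

public static class BackgroundSave
{
    // Starts the disk write on a thread-pool thread and returns immediately.
    // Assumption: the caller does not modify `bytes` until the task completes.
    public static Task WriteAsync(string path, byte[] bytes)
    {
        return Task.Run(() => File.WriteAllBytes(path, bytes));
    }
}
```

The returned task should be waited on (or awaited) before anything else tries to read the file, and any exception from the write surfaces at that point.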

RIL · November 9, 2018, 8:19pm

5-10 seconds (meshes).

No. I avoid that from experience… The heavy lifting is parts of meshes, which are “captured” (cut out); the separated parts are then dumped to disk and picked up by other definitions for analysis. The “capturing” of the part meshes is a speed-up, so that a big mesh doesn’t have to be traversed when a specific analysis only concerns a part of it, etc.

That kind of stuff. Some lines, points and breps go with the dumps as well, but the meshes are the slow ones to read/write from disk.

// Rolf

stephen4 · November 9, 2018, 8:58pm

WriteAllBytes only writes the data as is, without any compression beyond what the serialization itself did, I believe. WriteAllBytes is a .NET method not related to Grasshopper. It simply writes the byte[] passed into it to a file.

The Serialize_Binary() method from the Grasshopper API is a helper method for serializing Grasshopper data trees into a byte[] in a format Grasshopper understands. I have not experienced speed issues due to the Serialize_Binary() method itself, but I will investigate. The bottleneck may be the data size itself, the Grasshopper graph, or a combination thereof.

I don’t know anything about what you are doing in the graph, or which other nodes may cause slowness, but here are some ideas:

If you are doing batch processing inside the graph, your network should be as simple as possible. Every node can add compute time.
Use multi-threaded nodes where possible. If the work is independent enough to multi-thread, then it should help. You may need to rework your graph design.

Work with FileStreams throughout the graph life cycle and write to disk only when needed. Then queue up each file to be written to disk by a background process. You can create a server-style console app, with Grasshopper being the client. It would receive the meshes as a FileStream or just a byte[] and manage the writing of the files to disk, listen for changes from Grasshopper, etc. After converting Grasshopper data into a byte[] we can work with the bytes however we want. If you need speed, then streaming the meshes would be helpful. I’m not sure if the analysis is done on the same computer or on a different one on the same network, but either way it can work.
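The "queue up each file for a background writer" idea can be sketched in-process with a producer/consumer queue. This is a simplification of the separate server-style console app described above, not that design itself; `FileWriteQueue` is a hypothetical class name and only standard .NET types are used.

```csharp
using System;
using System.Collections.Concurrent;
using System.IO;
using System.Threading.Tasks;

public sealed class FileWriteQueue : IDisposable
{
    private readonly BlockingCollection<(string Path, byte[] Bytes)> _jobs =
        new BlockingCollection<(string Path, byte[] Bytes)>();
    private readonly Task _worker;

    public FileWriteQueue()
    {
        // A single long-running consumer writes files in the order they were queued.
        _worker = Task.Factory.StartNew(() =>
        {
            foreach (var job in _jobs.GetConsumingEnumerable())
                File.WriteAllBytes(job.Path, job.Bytes);
        }, TaskCreationOptions.LongRunning);
    }

    // Returns immediately; the write happens on the worker thread.
    public void Enqueue(string path, byte[] bytes) => _jobs.Add((path, bytes));

    // Stops accepting jobs and blocks until everything queued has been written.
    public void Dispose()
    {
        _jobs.CompleteAdding();
        _worker.Wait();
        _jobs.Dispose();
    }
}
```

The producing code only pays for the `Enqueue` call. If a write fails, the exception surfaces from `Dispose`, so a real implementation would need its own error handling.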

Writing 3D data to disk is generally slower than accessing it from memory, so I always try to use an API or protocol to transport data until I need to write to file.

I have noticed that, when speed is the goal, it’s not a great idea to do all processing within the host applications (Grasshopper/Rhino). I do processing from external tools that automate, manage, open and close gh and 3dm files. This is much more robust for batch processing work, and it is the approach I take.

Hope this helps. Provide more insight if possible.

DavidRutten · November 9, 2018, 9:01pm

How does the GH_Chunk performance compare to a straight up ISerializable approach which converts the meshes directly to byte arrays and writes those arrays to disk without compression?

My gut feeling is that the bottleneck here is the conversion from ON_Mesh to byte[] via OpenNurbs, but I may be wrong.
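For comparison purposes, an uncompressed "straight to byte array" dump could look roughly like the sketch below. Everything here is an assumption made for illustration: the mesh is reduced to a flat `float[]` of vertex coordinates and an `int[]` of face indices, the `RawMeshDump` name and the byte layout are made up, and the bytes are in the machine's native byte order. It does not use the Rhino or Grasshopper API.

```csharp
using System;

public static class RawMeshDump
{
    private const int HeaderSize = 2 * sizeof(int);

    // Hypothetical layout: [coord count][index count][coords...][indices...],
    // copied as raw memory without compression.
    public static byte[] ToBytes(float[] coords, int[] indices)
    {
        int coordBytes = coords.Length * sizeof(float);
        int indexBytes = indices.Length * sizeof(int);
        var bytes = new byte[HeaderSize + coordBytes + indexBytes];

        Buffer.BlockCopy(new[] { coords.Length, indices.Length }, 0, bytes, 0, HeaderSize);
        Buffer.BlockCopy(coords, 0, bytes, HeaderSize, coordBytes);
        Buffer.BlockCopy(indices, 0, bytes, HeaderSize + coordBytes, indexBytes);
        return bytes;
    }

    // Reverses ToBytes using the counts stored in the header.
    public static void FromBytes(byte[] bytes, out float[] coords, out int[] indices)
    {
        var counts = new int[2];
        Buffer.BlockCopy(bytes, 0, counts, 0, HeaderSize);

        coords = new float[counts[0]];
        indices = new int[counts[1]];
        int coordBytes = coords.Length * sizeof(float);

        Buffer.BlockCopy(bytes, HeaderSize, coords, 0, coordBytes);
        Buffer.BlockCopy(bytes, HeaderSize + coordBytes, indices, 0, indices.Length * sizeof(int));
    }
}
```

Timing this against the GH_Chunk route on the same mesh data would show how much of the 5-10 seconds comes from conversion and compression and how much from the disk write.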

stephen4 · November 9, 2018, 10:02pm

I use both approaches and have not noticed a big difference, but this all depends on implementation details and graph complexity. When I have time I will benchmark and post a solution.

RIL · November 10, 2018, 2:11pm

Well yes, that’s part of why I’m breaking up big graphs/networks/gh-definitions into smaller “batch-processors” (assuming your terms “graphs” and “networks” allude to the internal component network inside gh-definitions).

It’s not only about speed though. It’s also about complexity and extendability of the overall workflow. Basic requirements are:

1. Part results can later be extended in different directions for different analyses (a bit like super class → sub classes). Starting out with a workflow of a few “basic” analyses, which later will split into a “tree” of variant analyses.
2. Part results must be persisted (dumped to disk). This also has some benefits, despite the cost:
2.1 A failed workflow can be restarted from the latest successfully dumped data.
2.2 Variant analysis processes can start from such dumped part results at any later point in time without having to rerun the entire workflow.
2.3 If the dumped data is saved on a common file system, slave instances of Rhino/Gh can pick up the data to process the following steps (or divert into different sub-workflows from there).

Speed
Now, with the basic requirement to dump part solutions, I of course also want each workflow-step ( = each gh-definition) to do its job as fast as possible, which would include the save operation itself. But since saving the resulting data would be the last operation in each workflow step, then in my understanding, threading the save operation wouldn’t save any overall time for each workflow step. This is why I started to pay attention to the save operation itself.

My current gh-definitions are too big, too difficult to modify or to customize into processing variants (different analyses), and as monoliths they are not very scalable. Taken together, this motivates splitting the workflow into workflow-steps (separate gh-definitions) “connected” via dumped disk data.

If the bottleneck is CPU cycles, then not even an in-memory disk would provide better speed. I’m also not certain that an in-memory disk would survive a workflow crash, but perhaps it could work in “isolation” (I have to do some homework on that subject).

Inside the definitions I have tried to make very efficient solutions (packing component networks into C# code, threaded where possible, and so on). So splitting up the gh definitions into workflow-steps seems unavoidable.

One last thing: If Elefront had been open source I would have tested going down that route long ago. I need to be in charge of all critical functionality so as not to corner myself and the project, so using 3rd party components comes last on my list of alternatives (although I’d love to use many of those fantastic plugins that already exist). If I had a budget I’d offer to buy a source code licence (critical if support and maintenance were to stop; I can’t take that risk).

Hope this explains my goal, where I’m coming from, and why.

// Rolf

RIL · November 10, 2018, 2:54pm

Reading your post again, I should perhaps mention explicitly that I save only “final results” from each gh-definition ( = workflow-step).

Some of these wf-steps save several part-meshes (or sub-meshes cut out of the big main meshes), and saving multiple such part-meshes may perhaps benefit from threading rather than saving them in one go, or in sequence.
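A rough sketch of saving several part-meshes in parallel instead of in sequence. `ParallelSave` and the `serialize` delegate are hypothetical, and the sketch assumes the serializer can safely be called from several threads at once, which would have to be verified for the real Grasshopper/OpenNurbs conversion.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Threading.Tasks;

public static class ParallelSave
{
    // Serializes and writes each part on a worker thread.
    // Keys are target file paths; `serialize` is a stand-in (assumption)
    // for the real part-to-bytes conversion and must be thread-safe.
    public static void WriteAll<T>(
        IReadOnlyDictionary<string, T> parts, Func<T, byte[]> serialize)
    {
        Parallel.ForEach(parts, part =>
        {
            byte[] bytes = serialize(part.Value);
            File.WriteAllBytes(part.Key, bytes);
        });
    }
}
```

If the conversion to bytes is the CPU-bound part, this is where several cores could help; if the disk is the limit, parallel writes are unlikely to gain much.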

// Rolf

RIL · November 10, 2018, 3:05pm

I’m currently trying to “rescue” my computer, which is struggling with hardware issues, so I can’t try a “straight up ISerializable” just now. I’ll dig into this next week, if I still have a working dev machine by then.

// Rolf

DavidRutten · November 10, 2018, 7:02pm

Good luck.
