Python: Parse ghx files with ElementTree?

chanley · November 3, 2017, 7:58pm

I am experimenting with parsing ghx files using Python. The end goal is something that can loop through all of the ghx files on our server (after first converting all gh files to ghx) and list the components used. NOTE: I am testing this from inside a Grasshopper Python component, reading a single ghx file, to find the correct approach.

I was able to find a post that described some parsing here (Access XML data), but I was not terribly successful in using that method against a ghx file. With the code below, I am able to get all “items” (which is perhaps wrong to begin with), but I can’t figure out how to get only the items that contain component names. I thought that maybe I could use ElementTree, but I haven’t had much success.

import clr
clr.AddReference("System.Xml")
import System.Xml
import xml.etree.ElementTree as ET

def XMLTest():
    filename = r"C:\CutFillrevised.ghx"
    xmldoc = System.Xml.XmlDocument()
    xmldoc.Load(filename)
    # returns every <items> element in the document, at any depth
    items = xmldoc.GetElementsByTagName('items')

    # the loop has to live inside the function, where items is defined
    for item in items:
        if item is None: #without this, the first line of the file returns blank, which triggers an error when using the
                         #parse function in ElementTree.
            continue
        itemstr = item.InnerXml.ToString()
        print(itemstr)


# x is an input on the Grasshopper Python component (presumably a boolean toggle)
if x == True:
    XMLTest()

I have attached an example ghx file that I have been using to try and read. If anyone has any suggestions, it would be greatly appreciated!

Thanks,
Chris
CutFillrevised.ghx (202.6 KB)
(edited to update GHX file, original file had <?xml version="1.0" encoding="utf-8" standalone="yes"?> removed for testing purposes, new upload is complete file).

nathanletwory · November 3, 2017, 8:29pm

You probably want to first find the <chunk> with name DefinitionObjects (line 82 of your attached ghx). You can see from its first <items> section that the definition contains 27 objects (components). This corresponds to the <chunks> section starting at line 86, which should contain 27 children of type <chunk>. For instance, line 87 starts with a <chunk name="Object" index="0">. This <chunk> contains an <items> with two <item> - one for the GUID of the component, one with the name.

It is a bit unwieldy, but by iterating over the 27 <chunk> elements and looking for the correct grand-children you can find all the component names used.
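
A minimal sketch of the traversal described above, written against a tiny made-up document rather than the attached file. The chunk name "DefinitionObjects", the item names "ObjectCount" and "Name", and the helper names `component_names` and `declared_count` are assumptions for illustration; check them against a real ghx before relying on them.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily trimmed stand-in for a ghx file (not the attached one).
SAMPLE_GHX = """<Archive name="Root">
  <chunks count="1">
    <chunk name="Definition">
      <chunks count="1">
        <chunk name="DefinitionObjects">
          <items count="1">
            <item name="ObjectCount">2</item>
          </items>
          <chunks count="2">
            <chunk name="Object" index="0">
              <items count="2">
                <item name="GUID">00000000-0000-0000-0000-000000000001</item>
                <item name="Name">Panel</item>
              </items>
            </chunk>
            <chunk name="Object" index="1">
              <items count="2">
                <item name="GUID">00000000-0000-0000-0000-000000000002</item>
                <item name="Name">Multiplication</item>
              </items>
            </chunk>
          </chunks>
        </chunk>
      </chunks>
    </chunk>
  </chunks>
</Archive>"""

# Assumed name of the chunk that holds one child chunk per component.
OBJECTS_XPATH = ".//chunk[@name='DefinitionObjects']"


def declared_count(root):
    """Return the object count stated in the first <items> section, or None."""
    objects = root.find(OBJECTS_XPATH)
    if objects is None:
        return None
    item = objects.find("items/item[@name='ObjectCount']")
    return int(item.text) if item is not None else None


def component_names(root):
    """Return the Name item of every Object chunk, in document order."""
    objects = root.find(OBJECTS_XPATH)
    if objects is None:
        return []
    names = []
    # Each component is a <chunk name="Object"> inside the <chunks> section.
    for obj in objects.findall("chunks/chunk[@name='Object']"):
        # The name is a grandchild of the chunk: items -> item[@name='Name'].
        item = obj.find("items/item[@name='Name']")
        if item is not None and item.text:
            names.append(item.text)
    return names


root = ET.fromstring(SAMPLE_GHX)
print(component_names(root), declared_count(root))
```

Comparing `len(component_names(root))` with `declared_count(root)` is a cheap sanity check that the traversal found every object the file claims to contain.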

chanley · November 5, 2017, 6:35pm

Thanks Nathan. I spent a little time working on a rough script this morning. This was done completely outside of a rhino environment, and I’m sharing only to show a proof of concept, not a “best practice”. (this is a bit of a learning exercise for me.) Anyway, this is a “brute force” approach. Not very elegant as it explicitly defines the “path”. I am hoping to explore xml file enumeration/parsing a bit more and figure out how to search for the path, then start iterating as you suggested.

import xml.etree.ElementTree as ET
import collections

filename = "CutFillrevised.xml" #renamed ghx...should work with ghx
tree = ET.parse(filename) #ET.parse returns an ElementTree; the path below is relative to its root element
#doc = tree.getroot()

comps = []

#check if path is consistent in other files..maybe not?
for items in tree.iterfind('chunks/chunk/chunks/chunk/chunks/chunk/chunks/chunk/items'):
    #print (items.attrib)
    #print (items.text)
    for item in items: #nested for loop...work around because I couldn't figure out how to access key values directly in ET
        key = item.attrib.get('name') #these return strings, so use string manipulation?
        value = item.text
        if key == "Name":
            #print (key, value)
            comps.append(value)

counter = collections.Counter(comps) #creates dictionary from list with repeated items...
print(counter)
#print(counter.keys())
#print(counter.values())
print(counter.most_common(1))
#print() with several arguments assumes Python 3
print("Total Components used in", filename, ":", len(comps))

Above code returns the following against the test file.
0 - Counter({'Multiplication': 3, 'Division': 2, 'ValuePill': 2, 'Surface': 2, 'Flatten Tree': 2, 'List Length': 2, 'IntSlider': 2, 'Divide Surface': 2, 'Mass Addition': 2, 'Panel': 2, 'Larger Than': 1, 'Vector 2Pt': 1, 'Average': 1, 'Deconstruct Vector': 1, 'Square': 1, 'Dispatch': 1})
1 - [('Multiplication', 3)]
2 - Total Components used in CutFillrevised.xml : 27
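
One way to check whether the hard-coded path holds across files is to let the script report where the "Name" items actually sit. The sketch below walks the tree and counts the distinct element paths that lead to an `<item name="Name">`. The sample document, the "Container" chunk, and the helper name `name_item_paths` are hypothetical and only illustrate the idea.

```python
import collections
import xml.etree.ElementTree as ET

# Hypothetical miniature document with Name items at two different depths.
SAMPLE = """<Archive name="Root">
  <chunks>
    <chunk name="Definition">
      <chunks>
        <chunk name="DefinitionObjects">
          <chunks>
            <chunk name="Object" index="0">
              <items>
                <item name="Name">Panel</item>
              </items>
              <chunks>
                <chunk name="Container">
                  <items>
                    <item name="Name">Panel</item>
                    <item name="NickName">Notes</item>
                  </items>
                </chunk>
              </chunks>
            </chunk>
            <chunk name="Object" index="1">
              <items>
                <item name="Name">Division</item>
              </items>
            </chunk>
          </chunks>
        </chunk>
      </chunks>
    </chunk>
  </chunks>
</Archive>"""


def name_item_paths(root, item_name="Name"):
    """Count the element paths at which <item name=item_name> occurs."""
    paths = collections.Counter()

    def walk(element, trail):
        # Label chunks with their name attribute so the paths are readable.
        label = element.tag
        if element.tag == "chunk" and element.get("name"):
            label = "chunk[%s]" % element.get("name")
        trail = trail + [label]
        if element.tag == "item" and element.get("name") == item_name:
            paths["/".join(trail)] += 1
        for child in element:
            walk(child, trail)

    walk(root, [])
    return paths


for path, hits in name_item_paths(ET.fromstring(SAMPLE)).items():
    print(hits, path)
```

Running this over a handful of real files and comparing the reported paths would show whether one fixed path is safe, or whether matching on chunk names is the more robust choice.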

andheum · November 24, 2017, 5:48pm

If you want, metahopper allows you to do this without custom scripting - just supply a list of filepaths to the DocumentInfo component and you’ll get a listing of all components and params from each file.

chanley · November 25, 2017, 8:30pm

I’m afraid I’m not seeing how to feed a file path to the DocumentInfo component. Perhaps I don’t have a current version? I am only seeing boolean inputs for AU and R?

andheum · November 25, 2017, 8:31pm

MetaHopper.gha (259 KB)
try this one!

chanley · November 28, 2017, 12:27am

@andheum, thanks for the updated component! That’s super handy! For this exploration, I’m not sure that gets me what I’m shooting for. For example, a python component returns as a ZuiPythonComponent, as opposed to the “Name”? A fair number of our custom components are python based…(maybe this is an indicator that we should move beyond python?!)

One of the goals of this little side-project of mine, is to ultimately try to operate from outside of a rhino/GH environment. (of course, that logic immediately breaks down because my current efforts are dependent on using a gh component that @DavidRutten wrote to convert GH files to GHX). I also acknowledge that this side project is a little outside of the bounds of this forum.

I am hoping that, if we can assemble a reasonably accurate data set of what components people in our office are using the most, we can get a better understanding of common tasks/workflows that our designers are using. Then we can adapt our training focus and workflows based on what people are trying to do. I also have an image in my mind of some type of dataviz of this dataset. At any rate, I spent a little time on it this evening and came up with a version that outputs to a csv. Not entirely sure it’s the best structure yet, but ultimately, we can dump it into a database and then let the DB server handle the heavy lifting of responding to queries.

"""Testing method to read GHX files outside of a rhino/GH environment
Successfully tested with Python Ver 3.6.3, (NOT IRON PYTHON).
GOALS:          1 - Iterate project folders on network server to find grasshopper files and write to a data set of used components
                2 - Visualize/chart data
REQUIRES:       conversion of GH file to GHX (ideally we should write our own version of this....)
                http://www.grasshopper3d.com/forum/topics/batch-convert-gh-ghx-files
ASSUMPTIONS:    Parsing raw xml files is faster than methods dependent on Rhino/GH environment.
                We can use existing data visualization frameworks, (highcharts), to visualize/drill down into data
"""
import csv
import os
from fnmatch import fnmatch
import xml.etree.ElementTree as ET
import collections

root = "D:/Documents/02_MyITProjects/_Grasshopper/GHSTuff_Hanley/GHCONVERTED"
pattern = "*.ghx"
comps = []

try:
    with open("D:/Documents/02_MyITProjects/_Grasshopper/GHSTuff_Hanley/GHCONVERTED/output.csv", "w", newline='' ) as output:
        writer = csv.writer(output)
        writer.writerow(['Name', 'Count', 'Filename'])
        for path, subdirs, files in os.walk(root):
            for name in files:
                if fnmatch(name, pattern):
                    f = os.path.join(path, name)
                    tree = ET.parse(f)  # separate name so the folder variable root is not overwritten mid-walk
                    for i in tree.iterfind('chunks/chunk/chunks/chunk/chunks/chunk/chunks/chunk/items'):
                        for line in i:
                            thing = line.attrib.get('name')
                            thing2 = line.text
                            if thing == "Name":
                                comps.append(thing2) 
                    counter = collections.Counter(comps) 
                    for key, val in counter.items():
                        writer.writerow([key, val, f])
                    print ("Processing: ", counter)
                    del comps[:]
except (OSError, ET.ParseError) as err:  # unreadable file or malformed XML; neither is a ValueError
    print("something went wrong....sorry about that:", err)
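
The idea of handing the csv to a database could be sketched with SQLite, which ships with Python. This is only an illustration under stated assumptions: the table name `component_counts`, the column names, the helper names, and the sample rows are all made up, and a real server database would need its own connection code.

```python
import csv
import io
import sqlite3

# Hypothetical rows in the same Name/Count/Filename layout as the csv output.
SAMPLE_CSV = """Name,Count,Filename
Panel,2,projectA/site.ghx
Multiplication,3,projectA/site.ghx
Panel,4,projectB/facade.ghx
"""


def load_counts(connection, csv_file):
    """Create the table if needed and insert one row per csv line."""
    connection.execute(
        "CREATE TABLE IF NOT EXISTS component_counts "
        "(name TEXT, uses INTEGER, filename TEXT)"
    )
    reader = csv.DictReader(csv_file)
    rows = [(r["Name"], int(r["Count"]), r["Filename"]) for r in reader]
    connection.executemany(
        "INSERT INTO component_counts VALUES (?, ?, ?)", rows
    )
    connection.commit()
    return len(rows)


def totals_by_component(connection):
    """Return (name, total uses, number of files) ordered by total uses."""
    cursor = connection.execute(
        "SELECT name, SUM(uses) AS total, COUNT(DISTINCT filename) AS files "
        "FROM component_counts GROUP BY name ORDER BY total DESC, name"
    )
    return cursor.fetchall()


connection = sqlite3.connect(":memory:")  # swap for a file path to persist
load_counts(connection, io.StringIO(SAMPLE_CSV))
for name, total, files in totals_by_component(connection):
    print(name, total, files)
```

To load a real export, pass an open file handle for the csv (opened with `newline=''`) instead of the `io.StringIO` sample.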

Lastly, I am posting this more as a record of my efforts using this particular method. At this point, I am not expecting any assistance from the folks at McNeel. Although, I do think some type of built-in method to generate this data would be interesting…

Attached is the csv output I am getting at this point… apparently we can’t upload csv files? Here’s a screenshot.
component name | count per file | file name

andheum · November 28, 2017, 2:08am

Metahopper can give you the nickname too, just attach “componentinfo” to the component list from the files. Nickname is the visible name you see on the component on the canvas.

andheum · November 28, 2017, 2:10am

Also there really shouldn’t be any need to convert a .gh to a .ghx - the GH_Archive class can read from a .gh file and give you back all the “chunks” you need to traverse the file if you want to do it that way. But having done that for a long time - doing it programmatically from a GH_Document loaded in memory (the metahopper approach) is way easier than trying to parse xml chunks.

chanley · November 28, 2017, 4:33pm

Thanks Andrew! Metahopper certainly does do what I was looking for, from within a GH/RHINO environment.

One thing that might be a bit of a hiccup is that this method needs to “open” the file, which triggers the native “missing plugin” warning. When I tested the metahopper approach on a folder with 100-some files, I had to close/cancel for each file that used a plugin I did not have on my machine. Maybe one addition to the metahopper approach could be to somehow suppress the native Missing Plugin window during its parsing operations? (the example screen shot only used 16 files).

The original thought (which I may need to let go!) was to iterate over all of our project folders and rip through the files in a scheduled task. It is potentially quite a large number of files. The decision to try the non-GH/RHINO approach was made in the hope of being able to read a raw data format and just scrape the data.
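
For an unattended run over many folders, one malformed or locked file should not stop the whole scrape. A rough sketch, assuming the component names can be reached with the XPath in `NAME_XPATH` (an assumption about the ghx layout, as are the helper names `count_components` and `scan_folder`):

```python
import collections
import os
import xml.etree.ElementTree as ET

# Assumed location of component names; adjust to match real files.
NAME_XPATH = ".//chunk[@name='Object']/items/item[@name='Name']"


def count_components(path, name_xpath=NAME_XPATH):
    """Parse one ghx file and count the component names found in it."""
    tree = ET.parse(path)
    names = [item.text for item in tree.iterfind(name_xpath) if item.text]
    return collections.Counter(names)


def scan_folder(root_folder):
    """Walk root_folder, returning per-file counts and a list of failures."""
    results = {}
    failures = []
    for folder, _subdirs, files in os.walk(root_folder):
        for filename in files:
            if not filename.lower().endswith(".ghx"):
                continue
            path = os.path.join(folder, filename)
            try:
                results[path] = count_components(path)
            except (OSError, ET.ParseError) as error:
                # Record the problem and keep going with the next file.
                failures.append((path, str(error)))
    return results, failures
```

A scheduled task could call `scan_folder` on the project root, write `results` to the csv, and log `failures` for later inspection. No Rhino or Grasshopper session is involved, so no missing-plugin dialog can appear.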

Both approaches are giving me a better understanding of what is involved, in what started as a “side project”! Thanks again for your help. It’s very much appreciated!

The attached example file is done in rhino 6 beta, Nov 21 release. Requires Metahopper and Human plugin.
Metahopper_CompsInFiles.gh (24.4 KB)