Parsing string with repeating pattern with regex?

Hi,

I read a text file line by line with a GHPython script. What I get is a list of strings, one string per line. I now need to parse each string into more manageable data (i.e. strings, integers) that I can use to place geometry and annotate it.

The strings look like this:

  • “the description (number)” (e.g. “door (0)”)
  • “the description (number|number|number)” (e.g. “window (1|22|4)”)
  • “the description (number|number|number|number)” (e.g. “toilet (2|6|5|10)”)

The description refers to the geometry type, the first integer to the floor number, and the remaining integers to the rooms in which one of the described objects is to be placed.
And no, the structuring of the data in the text file was not my idea! :wink:

Now what I want is a list of split/parsed strings for each line from the text file that I can process further, for instance:

  • “window (1|22|4)” -> [ “window”, “1”, “22”, “4” ]

I guess regular expressions are best fit to accomplish this and I already managed to come up with this:

  • (.+)\s+\((\d+)\), which perfectly matches [ “door”, “0” ] for “door (0)”

However, some items have more data to parse:

  • (.+)\s\((\d+)+\|\), which matches only [ “window”, “1” ] for “window (1|22|4)”

How can I repeat the pattern matching for the part (\d+)+\| (i.e. “1|”) up to the closing parenthesis, for an arbitrary number of repetitions of this pattern? The last item to match would be an integer, which could be caught separately with (\d+)\).

Also is there a way to match either the simple or the extended case with a single regular expression?

Thanks!

Do you have access to some good references and tutorials like:

https://docs.microsoft.com/en-us/dotnet/standard/base-types/regular-expression-language-quick-reference

https://www.regular-expressions.info/tutorial.html

The Microsoft topic covers regex in .NET (i.e. C#, VB, etc.). Python may have some additions and omissions.

I’ve found that using regular expressions is truly a top-notch way of dealing with tasks like yours, but each time I need to use them it’s a major brain exercise with lots of trial and error to get it right (probably because I’m not doing it every day :grin:).

I also found this book very helpful because it presents increasingly more complicated tasks with real-world examples:

“Learning Regular Expressions” by Ben Forta, Addison-Wesley 2018

Yes, I know how to navigate and search the internet! :wink: But thanks!

Same here! They are super nice for parsing strings, but without regular exercise, pretty hard to master. I’ve worked with regex in the past, but never dove in really deep!

Well I don’t have the time, nor the patience to read a whole book in order to find a solution to this problem. :wink: It might be great to check it out for future reference though.

Hello,

You should be able to add some (\d*) to catch more digits which may or may not be present.

I’ve found the parse module to be handy for simple pattern matching in the past, but it is probably CPython only.

Does this help?

>>> import re
>>> re.search(r'(\d+).?(\d*).?(\d*)', 'window (1|22|4)').group(1)
'1'
>>> re.search(r'(\d+).?(\d*).?(\d*)', 'window (1|22|4)').group(2)
'22'
>>> re.search(r'(\d+).?(\d*).?(\d*)', 'window (1|22|4)').group(3)
'4'
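
When the number of integers is not known in advance, one alternative to stacking optional groups is `re.findall`, which returns every non-overlapping match in one go. This is only a minimal sketch: the helper name `split_entry` is made up for illustration, and it assumes the description ends at the first opening parenthesis.

```python
import re

def split_entry(line):
    """Return [description, number, number, ...] as strings (hypothetical helper)."""
    # Everything before the first "(" is treated as the description (assumption).
    description, _, rest = line.partition("(")
    # findall collects every run of digits after the "(", however many there are.
    return [description.strip()] + re.findall(r"\d+", rest)

print(split_entry("door (0)"))         # ['door', '0']
print(split_entry("window (1|22|4)"))  # ['window', '1', '22', '4']
```

Because only the part after the parenthesis is searched, digits inside the description itself would not be picked up as numbers.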

I’d tried that, but it didn’t catch all integers somehow, mainly because of the special character “|”, I imagine.

Wow, the parse module looks super straightforward and easy to use. Thanks for recommending it, Dancer! :slight_smile:
Unfortunately, as you may know, CPython is still off limits in the current iteration of GHPython, unless you are willing to fiddle with PythonRemote or similar plugins. :frowning:
This project is not worth that. I want to be done with it as soon as possible!

Yes, sure. Thanks! I didn’t know about the group() thing yet.

Here’s the solution that finally worked, if anybody else is interested or ever comes across a similar problem:

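import re

data = []
with open("textfile.txt") as f:
    for line in f:
        match = re.search(r"(\w+) \(([^)]+)\)", line)
        if match is None:
            continue  # skip lines that don't fit the pattern
        description, numbers = match.groups()
        # Python 2 (GHPython runs IronPython 2.7) has no *-unpacking inside a tuple, so concatenate
        data.append(tuple([description] + numbers.split("|")))

print(data)  # e.g. [("door", "0"), ("window", "1", "22", "4")]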

You can further simplify your regexp:

import re
p = re.compile(r"(.+) \((.+)\)")  # better to compile if you have lots of data to regex-voodoo on

# with open etc. as before
# just now p.search(line).groups()
# and your split of the numbers on pipe

Your group capture with [^)]+ doesn’t make much sense; that is where you can simplify. Also, I don’t see a reason to use \w, in case you get multi-word descriptions.
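
Put together, the suggestion might look roughly like the sketch below. The names `ENTRY` and `parse_line` are hypothetical, and the sample lines stand in for the text file. Lines that don't match are skipped rather than raising an error, which is an assumption about what you'd want.

```python
import re

# Compile once, then reuse the pattern object for every line.
ENTRY = re.compile(r"(.+) \((.+)\)")

def parse_line(line):
    """Return (description, number, ...) as strings, or None if the line doesn't match."""
    match = ENTRY.search(line)
    if match is None:
        return None
    description, numbers = match.groups()
    # List concatenation instead of *-unpacking keeps this valid in Python 2.7 as well.
    return tuple([description] + numbers.split("|"))

lines = ["door (0)", "window velux (1|22|4)", "not an entry"]
data = [parsed for parsed in map(parse_line, lines) if parsed is not None]
print(data)  # [('door', '0'), ('window velux', '1', '22', '4')]
```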


Wow, great suggestions, Nathan!
Thanks a lot, especially for the improved format string and suggesting re.compile(). Compiling the expression into an object sped things up quite a bit! :fire:

In my opinion, \w is necessary, since there are a couple of descriptions that are composed of two words or more, separated by whitespace (e.g. “window velux”, “front door local”, etc.). For the sake of simplicity, I’d simplified the descriptions in my example above.

If you have multi-word descriptions then you really should use .+ instead of \w+.

I always like to use online regexp editors/verifiers. Since we’re doing Python here I suggest

Use as input text:

one (1)
one two (0)
one two three (0|1|123|34)

In the regular expression box first click the flag and unselect multiline.

Now first test with (\w+) \((.+)\), then with (.+) \((.+)\). You’ll find that the latter will give you the entire line as description, as you want. The former gives only one word - the last one. That is because with \w you tell the pattern there should be no spaces in there. The latter pattern gives two useable groups.
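
If you'd rather check this in a script than in an online tester, a quick comparison could look like this. It is only a sketch: the variable names are arbitrary, and it uses `re.search` on each sample line separately.

```python
import re

samples = ["one (1)", "one two (0)", "one two three (0|1|123|34)"]

word_only = re.compile(r"(\w+) \((.+)\)")  # description limited to word characters
anything = re.compile(r"(.+) \((.+)\)")    # description may contain spaces

# Collect the description (group 1) that each pattern captures for every sample.
results = [(word_only.search(s).group(1), anything.search(s).group(1)) for s in samples]

for text, (w, a) in zip(samples, results):
    print("%-28s \\w+ gives %r, .+ gives %r" % (text, w, a))
# \w+ captures 'one', 'two', 'three'; .+ captures 'one', 'one two', 'one two three'
```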

So you really want each and any character up until the space and opening parenthesis. If there is anything in the description you need to further parse you can do that with simple string manipulation, just like you do for the numbers splitting on pipe.
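
As a rough illustration of that further processing, the sketch below converts the numbers to integers and separates the floor from the rooms, following the layout described in the first post (first integer is the floor, the rest are rooms). The function name `parse_entry` and the dictionary keys are made up for this example, and it assumes every value between the pipes is a plain integer.

```python
import re

ENTRY = re.compile(r"(.+) \((.+)\)")

def parse_entry(line):
    """Return a dict with description, floor and rooms, or None if the line doesn't match."""
    match = ENTRY.search(line)
    if match is None:
        return None
    description, numbers = match.groups()
    # int() raises ValueError on anything that is not an integer (assumed not to occur).
    values = [int(n) for n in numbers.split("|")]
    return {"description": description, "floor": values[0], "rooms": values[1:]}

entry = parse_entry("toilet (2|6|5|10)")
print("%s: floor %d, rooms %r" % (entry["description"], entry["floor"], entry["rooms"]))
# toilet: floor 2, rooms [6, 5, 10]
```

An entry like “door (0)” simply ends up with an empty list of rooms.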


Oh my bad, I confused .+ with \.+, which escapes special characters. Should have read your first post more carefully. :wink:

I do too! If not, I wouldn’t have been able to come up with my first try. RegExr is great, although it’s not specifically for Python.

Thanks for the explanation.

That’s why I like regex101.com, since you can choose Python there as well :slight_smile:
