Regex spiffification: obstacles

By Filip Salomonsson; published on September 21, 2006.

Okay, so I can already do concatenation and union on simple regular expressions, but - of course - a few obstacles have shown up on the radar that I may need to deal with.

Flags

First off, what happens if one of the expressions has one or more flags set? Say, the first is case insensitive or locale dependent (or even worse, verbose!) while the other is not. As far as I know, flags can't be toggled mid-expression (as they can in Perl), so I don't have the option of just incorporating them in the subexpression using the (?...) syntax.

I'm leaning heavily towards just using the flags of the first operand, and tossing the rest. Another possibility is to just use the union of flags from both operands. I'm not sure which I'd really prefer.
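
To make the two options concrete, here is a minimal sketch. The concat helper and its merge_flags parameter are hypothetical names made up for illustration, not part of the actual wrapper module; the only real API used is re.compile and the pattern/flags attributes of compiled patterns.

```python
import re

def concat(a, b, merge_flags=False):
    """Concatenate two compiled patterns (hypothetical helper).

    By default only the flags of the first operand survive; with
    merge_flags=True the flags of both operands are OR-ed together.
    """
    flags = (a.flags | b.flags) if merge_flags else a.flags
    # Wrap each operand in a non-capturing group before joining.
    return re.compile("(?:%s)(?:%s)" % (a.pattern, b.pattern), flags)

insensitive = re.compile(r"foo", re.IGNORECASE)
sensitive = re.compile(r"bar")

# Flags of the first operand win: the whole result is case insensitive.
first_wins = concat(insensitive, sensitive)

# Swap the operands and the IGNORECASE flag is silently dropped...
flag_lost = concat(sensitive, insensitive)

# ...unless the flags of both operands are merged.
flag_merged = concat(sensitive, insensitive, merge_flags=True)
```

Either way the flag ends up applying to the whole combined expression or to none of it, which is exactly the problem described above.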

If the flags were set with the (?...) syntax to begin with, the behaviour might be a bit unpredictable, but I'm set on leaving that as a clear case of "then don't do that" for now.

Capturing groups

If there are backreferences, shit can definitely hit the fan. Consider the concatenation of r"f(oo)+" and r"(bar|baz)\1". The \1 reference will then falsely refer to (oo) instead of (bar|baz). I'm not overly keen on parsing and rewriting the expressions at the moment, so it'll probably end up on the "then don't" list, too.
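
A small sketch of that failure, assuming the two patterns are simply wrapped in non-capturing groups and glued together (the variable names are just for illustration):

```python
import re

a = r"f(oo)+"
b = r"(bar|baz)\1"

# On its own, the second pattern matches a doubled "bar" or "baz".
b_alone = re.fullmatch(b, "barbar")

# Naive concatenation: (oo) becomes group 1 and (bar|baz) becomes group 2,
# but the backreference still says \1.
combined = re.compile("(?:%s)(?:%s)" % (a, b))

# The string we would expect to match no longer does...
expected = combined.fullmatch("foobarbar")

# ...while this one does, because \1 now repeats "oo".
surprise = combined.fullmatch("foobaroo")
```

Fixing this properly would mean renumbering the references in the second operand, which brings us back to parsing and rewriting.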

Needless grouping

Mostly a cosmetic issue, but still...

I'm cheating a bit to avoid problems with operator precedence. I'm enclosing both subexpressions in non-capturing (?:...) groups to avoid having, for example, r"a|b" + r"c|d" end up as a|bc|d.
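
The difference is easy to see with a quick sketch (plain string formatting here; the real wrapper may build its patterns differently):

```python
import re

a, b = r"a|b", r"c|d"

# Plain string concatenation gives a|bc|d: "a", "bc" or "d".
naive = re.compile(a + b)

# Wrapping each operand first gives (?:a|b)(?:c|d): "ac", "ad", "bc" or "bd".
grouped = re.compile("(?:%s)(?:%s)" % (a, b))

naive_ac = naive.fullmatch("ac")      # no match
naive_a = naive.fullmatch("a")        # matches, which is wrong
grouped_ac = grouped.fullmatch("ac")  # matches, as intended
grouped_a = grouped.fullmatch("a")    # no match
```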

This, too, could be handled by some parsing to check if there are union operators that are not inside subgroups in the expression, but as noted above, reparsing is not a prioritized activity at the moment.
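
For what it's worth, such a check doesn't need a full parser. Below is a rough sketch of a scanner that looks for a | outside of any group or character class. The function names are hypothetical, and it deliberately ignores verbose-mode comments and (?#...) comments, so treat it as an illustration rather than something to rely on.

```python
def has_top_level_union(pattern):
    """Return True if pattern has a '|' outside all groups and classes."""
    depth = 0          # current nesting depth of parentheses
    in_class = False   # inside a [...] character class?
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "\\":
            i += 2     # skip the escaped character entirely
            continue
        if in_class:
            if ch == "]":
                in_class = False
        elif ch == "[":
            in_class = True
            # A ']' directly after '[' or '[^' is a literal, not a closer.
            if pattern[i + 1:i + 2] == "^":
                i += 1
            if pattern[i + 1:i + 2] == "]":
                i += 1
        elif ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        elif ch == "|" and depth == 0:
            return True
        i += 1
    return False

def wrap_if_needed(pattern):
    """Add a non-capturing group only when a top-level union requires it."""
    if has_top_level_union(pattern):
        return "(?:%s)" % pattern
    return pattern

joined = wrap_if_needed(r"a|b") + wrap_if_needed(r"cd")
```

With this, r"a|b" still gets wrapped, but r"cd" or r"(a|b)c" is left alone.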

Multi-union

More cosmetics, closely related to needless grouping. Given three regular expressions A, B and C, the union A | B | C will be constructed as ((A|B)|C), while it could just as well be (A|B|C).
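
A sketch of the two shapes, using a hypothetical pairwise union function on plain pattern strings. The nested form is what repeated pairwise unions produce; the flat form is what one would write by hand. Both accept the same strings, so the difference really is cosmetic.

```python
import re
from functools import reduce

def union(a, b):
    """Pairwise union of two pattern strings (hypothetical helper)."""
    return "(?:%s)|(?:%s)" % (a, b)

parts = [r"foo", r"ba[rz]", r"qu+x"]

# (A|B)|C: each pairwise step wraps the previous result once more.
nested = reduce(union, parts)

# A|B|C: every operand wrapped once, all joined in a single pass.
flat = "|".join("(?:%s)" % p for p in parts)

samples = ["foo", "bar", "baz", "quux", "fo", "foobar", ""]
same = all(
    bool(re.fullmatch(nested, s)) == bool(re.fullmatch(flat, s))
    for s in samples
)
```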

Again, one solution is parsing the expressions, so it too will be left untouched for now.

What I'm not doing

At least a few, possibly all (I don't know enough under-the-hood stuff to tell for sure), of these obstacles could probably be overcome by working on the underlying compiled state objects instead of fiddling with the patterns and compiling from scratch.

That's also how I would really like to be doing this, at least in the end, but I really don't feel like digging into the sre code just now, so I'm going to try to work around the obstacles instead. Or, to look at it another way, I'll keep prototyping it as a wrapper module, and we'll see where it goes from there.

Next up: free stuff!