Regex spiffification: obstacles
Okay, so I can already do concatenation and union on simple regular expressions, but - of course - a few obstacles have shown up on the radar that I may need to deal with.
Flags
First off, what happens if one of the expressions has one or more flags set? Say, the first is case insensitive or locale dependent (or even worse, verbose!) while the other is not. As far as I know, flags can't be toggled mid-expression (as they can in perl), so I don't have the option of just incorporating them in the subexpression using the (?...) syntax.
I'm leaning heavily towards just using the flags of the first operand, and tossing the rest. Another possibility is to just use the union of flags from both operands. I'm not sure which I'd really prefer.
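To make the two options concrete, here is a minimal sketch. The concat helper and its merge_flags switch are made up for illustration and are not the actual module; it only relies on the pattern and flags attributes of compiled patterns.

```python
import re

def concat(first, second, merge_flags=False):
    """Hypothetical helper: concatenate two compiled patterns.

    merge_flags=False keeps only the flags of the first operand,
    merge_flags=True uses the union of both operands' flags.
    """
    flags = (first.flags | second.flags) if merge_flags else first.flags
    source = "(?:%s)(?:%s)" % (first.pattern, second.pattern)
    return re.compile(source, flags)

plain = re.compile(r"bar")
nocase = re.compile(r"foo", re.IGNORECASE)

first_only = concat(plain, nocase)                # IGNORECASE is tossed
merged = concat(plain, nocase, merge_flags=True)  # IGNORECASE now covers "bar" too
```

With first-operand flags, "barFOO" no longer matches even though the second operand was case insensitive. With the union, "BARfoo" suddenly matches even though the first operand was not. Neither is obviously right, which is pretty much the dilemma.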
If the flags were set with the (?...) syntax to begin with, the behaviour might be a bit unpredictable, but I'm set on leaving that as a clear case of "then don't do that" for now.
Capturing groups
If there are backreferences, shit can definitely hit the fan. Consider the concatenation of r"f(oo)+" and r"(bar|baz)\1". The \1 reference will then falsely refer to (oo) instead of (bar|baz).
I'm not overly keen on parsing and rewriting the expressions at the
moment, so it'll probably end up on the "then don't" list, too.
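The renumbering problem is easy to reproduce with nothing but the standard re module. The sketch below uses plain string formatting as a stand-in for the concatenation; the variable names are only for illustration.

```python
import re

first = r"f(oo)+"
second = r"(bar|baz)\1"

# On its own, \1 in the second pattern refers to (bar|baz).
alone = re.fullmatch(second, "barbar")

# Naive concatenation: (oo) becomes group 1 and (bar|baz) becomes group 2,
# but the \1 in the second half is left untouched.
combined = re.compile("(?:%s)(?:%s)" % (first, second))

intended = combined.fullmatch("foobarbar")  # what we wanted to match
accident = combined.fullmatch("foobaroo")   # what actually matches
```

Here intended comes back as None, while accident is a match, because \1 now means "oo".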
Needless grouping
Mostly a cosmetic issue, but still...
I'm cheating a bit to avoid problems with operator precedence. I'm enclosing both subexpressions in non-capturing (?:...) groups to avoid having, for example, r"a|b" + r"c|d" end up as a|bc|d.
This, too, could be handled by some parsing to check if there are union operators that are not inside subgroups in the expression, but as noted above, reparsing is not a prioritized activity at the moment.
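For the record, this is what the precedence problem looks like in practice, and what the (?:...) wrapping buys. It is a quick sketch with plain string operations standing in for the real concatenation.

```python
import re

a, b = r"a|b", r"c|d"

# Bare concatenation gives a|bc|d: three alternatives, "a", "bc" and "d".
bare = re.compile(a + b)

# Wrapping each operand gives (?:a|b)(?:c|d): one of a/b followed by one of c/d.
grouped = re.compile("(?:%s)(?:%s)" % (a, b))

bare_hits = [s for s in ("a", "bc", "d", "ac", "bd") if bare.fullmatch(s)]
grouped_hits = [s for s in ("a", "bc", "d", "ac", "bd") if grouped.fullmatch(s)]
```

The bare version accepts "a", "bc" and "d", while the grouped version accepts "bc", "ac" and "bd", which is what a concatenation should do. The price is a pair of groups that are redundant whenever an operand has no top-level union.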
Multi-union
More cosmetics, closely related to needless grouping. Given three regular expressions A, B and C, the union A | B | C will be constructed as ((A|B)|C), while it could just as well be (A|B|C).
Again, one solution is parsing the expressions, so it too will be left untouched for now.
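A small sketch of the nesting, plus one possible way around it that needs no parsing: a variadic helper that joins all operands in one go. Both function names are hypothetical, and the variadic trick only helps when all operands are handed over at once, so it does nothing for a chained binary | operator.

```python
import re

def union(left, right):
    """Hypothetical pairwise union, wrapping each operand in (?:...)."""
    return "(?:%s)|(?:%s)" % (left, right)

def union_all(*patterns):
    """Hypothetical variadic union: one flat alternation, no nesting."""
    return "|".join("(?:%s)" % p for p in patterns)

nested = union(union("A", "B"), "C")  # (?:(?:A)|(?:B))|(?:C)
flat = union_all("A", "B", "C")       # (?:A)|(?:B)|(?:C)

# Both accept exactly the same strings; only the pattern text differs.
same = all(
    bool(re.fullmatch(nested, s)) == bool(re.fullmatch(flat, s))
    for s in ("A", "B", "C", "AB", "")
)
```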
What I'm not doing
At least a few, possibly all (I don't know enough under-the-hood stuff to tell for sure), of these obstacles could probably be overcome by working on the underlying compiled state objects instead of fiddling with the patterns and compiling from scratch.
That's also how I would really like to be doing this, at least in the end, but I really don't feel like digging into the sre code just now, so I'm going to try to work around them instead. Or, to look at it another way, I'll keep prototyping it as a wrapper module, and we'll see where it goes from there.
Next up: free stuff!