Tuesday, November 11, 2008

Lazy Quantifiers and Negative Lookaheads


The time has come,' the walrus said, 'to talk of many things: of shoes and ships - and sealing wax - of cabbages and kings. 'Lazy Quantifiers' and 'Negative Lookaheads'. The names somehow remind me of characters Lewis Carrol would think up or insults Safire would dream up for Agnew. These are the Tweedle Dee and Tweedle Dum of regexes. Today I had a regex challenge and it was a doosie! I wanted to use a regex to verify certain HTML elements all showed up in a specific HTML table. The problem boiled down to matching a < TR > to a < /TR > without getting hung up on any tags inbetween.

Let's make this concrete and offer the text we want to match (extra spaces are added between angle brackets so blogger won't swallow up my html):
< TR > < TD id = "tweedledee"> tweedledum < /TD > < TD > Error < /TD >< /TR >
Given that HTML, I want a regex that could identify a table row that has a column with tweedledee in it and an error message. The first thing I thought of didn't work. I imagined a regex something like this
/< TR[^(<\/TR)*tweedledee[^(<\/TR)*error/mi
Which to me meant match something that starts out like a table row, then has some arbitrary number of characters that are NOT the end of a table row, then match tweedledee then match some other arbitrary number of characters that are NOT the end of a table row, then match error. Ignoring newlines and case. The problem there is with my understanding of character classes. As you know [] denote a character class. One important feature of a regular expression character class is that it can only match one character at a time. Ohhhh Noooo! So how do we make this work? At first I thought the answer was the lazy quantifier (as opposed to standard greedy quantifiers), but according to my trusty book, "Mastering Regular Expressions," by Friedl the appropriate Tweedle brother is the Negative Lookahead. A Negative Lookahead, denoted (?! ...), is a positional matcher that is successful if at a given point in a regex it cannot match what is to the right of it. After confirming that Ruby does indeed support the Negative Lookahead I came up with the following solution.

/< TR>((?!<\/TR).)*?tweedledee((?!<\/TR).)*?error/mi


Which says match something that starts out like a table row, then any character which is not the end of a table row, but only up to the point that tweedledee occurs, then any character which is not the end of a table row, but only up to the point that error occurs. Ignoring newlines and case. So it turns out I needed to use both tweedle brothers to solve my delimma. Now when am I going to get a chance to use the other lookarounds, the positive and negative lookbehinds? And will I ever be able to look at a pair of TD tags again without thinking of Tweedledee and Tweedledum? Let's ask the king. 'Begin at the beginning,' the King said, very gravely, 'and go on till you come to the end: then stop.'

No comments: