06 February 2016

From Regular Expressions to Grammars: Pt. 1

Old-School Text Matching


This tutorial is based around Perl6-Regex-To-Javascript, a talk I gave at FOSDEM 2016. We'll start out writing simple regular expressions, and slowly build the expressions out into a full-blown compiler. Our starting point will be the humble ISO-8601 timestamp, used worldwide in all sorts of logging programs. There are a few minor variants, but this is the one we're most concerned with:

2016-02-06T14:36+02:00
Regular expressions, for those of you that don't know, are a way to find partial matches in a potentially very large document. For the moment, you can consider it the equivalent of the 'Find' dialog in your favorite text editor.

Suppose you want to find this particular timestamp in... say, a blog posting, much like this one. Most languages have a function like Perl 6's index method, which searches for a substring inside a larger string, and returns either the offset if the string matches, or -1 if it can't find a match. In Perl 6, using that would look like this:

say index( $logfile, '2016-02-06T14:36+02:00 );
And it would print out either -1 for no match, or a positive number telling us where to look in the logfile for our match. This is efficient and fast, and works with any string, even something your user has given as input. If you were searching for timestamps on a given day, you could even cut down your search string to just '2016-02-06T' to find all timestamps on a given day.

We've added the trailing 'T' to the timestamp because it's entirely possible that somewhere in your logfile there may be a REST query that looks like '/post/2016-02-06/' that you wouldn't want to match. That's just a user request for a posting from a given day, which could be a user in March requesting a post from last month.

You could even use the same technique to search for all timestamps in a given timezone, just looking for '+02:00'. You'd likely get even more false positives, because '+02:00' could easily appear in other places in our web logfile. One way to cut down on the number of false positives is to do 10 searches, one for '0+02:00', one for '1+02:00' and so on up to '9+02:00'.

Into the Breach


Another way is to use a regular expression, that will search for exactly the timestamp format you're looking for with no false positives. For this, instead of the index function, we'll ask Perl 6 to perform smart matching on our string, like so:

say $logfile ~~ /2016-02-06T14:36+02:00/;
This will do exactly what the index function did above, but uses Perl 6's smartmatch operator.  Well, not quite. While it should return whether we've found a match in our logfile, what we actually get is this:

===SORRY!===
Unrecognized regex metacharacter - (must be quoted to match literally)
at /home/jgoff/Talks/Javascript/slide03.pl6:12
------> /2016-01-29T13:25+01:00/;
Unable to parse regex; couldn't find final '/' at /home/jgoff/Talks/Javascript/slide03.pl6:12
------> /2016-01-29T13:25+01:00/;

If you're used to languages like C where you get a terse one-line syntax error, or Java where you can get page-long stacktraces, this may seem like a mystery.  For the moment let's ignore the "Unrecognized..." and "Unable..." bits and look at "(must be quoted to match literally)".

This is telling us that we need to quote the dashes in our expression. In fact, this is an instance of a general rule in Perl 6 regular expressions. Anything that's not an alphanumeric character ('a'..'z', 'A'..'Z', '0'..'9') has to be quoted. So, let's do just that:

say $logfile ~~ /2016 '-' 02 '-' 06T14 ':' 36 '+' 02 ':' 00/;
And now we get back this equally strange expression:

 「2016-01-29T13:25+01:00」
This is just telling us that the ~~ operator matched some text, and here's the text that it matched. The 「」 are Japanese quotation marks, deliberately designed to stand out from the rest of the text. When you're experimenting with regular expressions, what used to be common in Perl 5 was printing out the matched string surrounded by '' so you could see exactly where the match boundaries lie. This isn't always obvious, and whitespace and return characters could cause a problem.

So now, Perl 6 defaults to printing the matched object with unambiguous markers that tell you exactly where the start and end of a match lie, without making you go back and add debugging statements to your code.

Generalizing

Earlier I promised that we would make our timestamp matching more general, so it's time to make good on that promise. This logfile might have been written last year, so one thing we might want to do is match on a timestamp from either 2015 or 2016. Luckily, regular expressions like what we're using have a built-in 'or' operator, called '|'.

At first blush, you might think that:

say $logfile ~~ /2015 | 2016 '-' 02 '-' 06T14 ':' 36 '+' 02 ':' 00/;

would do the trick. And it will, for timestamps in 2016. It will also match anything in 2015, and by 'anything' I mean any string that happens to have '2015' in it. The '|' operator says "Match anything on the left of me or anything on my right", so while it still matches '2016-02-06T14:36+02:00' (what's on the right of the '|', it will also match anything to its left, which includes '2015', '/post/2015/02' or even '/number/120153'.

So, basically we need to stop the operator from running amok, and tell it to just apply the 'or' bit to the '2015' and '2016' portion of the string, and not the entire expression. We can do that by bracketing the extent of the match, quite literally:

say $logfile ~~ / [ 2015 | 2016 ] '-' 02 '-' 06T14 ':' 36 '+' 02 ':' 00/;
This now will match both '2015-02-06T14...' and '2016-02-06T14...'. Which is fine if you want to match a timestamp from 2015 or 2016, but this logfile goes all the way back to 1997, and who wants to type '[ 1997 | 1998 | 1999 | 2000... 2015 ]' out in full? You could apply what we've learned above and try to be sneaky, like so:

say $logfile ~~ / [1|2] [9|0] [9|0|1] [0|1|2|3|4|5|6|7|8|9].../;
But thankfully there's a shortcut.

Learning Shorthnd

The [0|..|9] expression there is so commonly used that there's a convenient shortcut. Instead of writing out the range of '0'..'9' in full, we'll just tell Perl 6 that we want to match digits. While we're at it, we don't know how far back in time this logfile goes, so let's just match 4 digits instead of worrying about whether it's between 1997 and now:

say $logfile ~~ / \d \d \d \d '-' 02 '-' 06T14.../;
Of course, this works for *any* of the digits in our string, so we can take what we have and rewrite it using this shorthand like so:

say $logfile ~~ / \d\d\d\d '-' \d\d - \d\d T \d\d ':' \d\d '+' \d\d ':' \d\d/;

Which now matches any timestamp that looks like <digits> - <digits> - <digits> etcetera. Almost any timestamp. The '+' <digits> : <digits> will only match timezones between +01 and +12, the other timezones are between -11 and -01, so we'll use the 'or' trick we learned above to match either a '+' or '-' sign, like so:


say $logfile ~~ / \d\d\d\d '-' \d\d - \d\d T \d\d ':' \d\d [ '+' | '-' ] \d\d ':' \d\d/; 
 There, negative and positive timezone offsets accounted for. Well, not quite. Negative and positive timezone offsets are specified relative to Greenwich, but it has its own timezone that's not even a number, it's just called 'Z', for historical reasons. (Timezones used to be assigned letters, and Z was the end of the alphabet.)

So, one more change, and nesting timezone statements:

say $logfile ~~ / \d\d\d\d '-' \d\d - \d\d T \d\d ':' \d\d [ [ '+' | '-' ] \d\d ':' \d\d | Z ] /;

Refactoring

But that '[ '+' ... Z ]' expression is getting pretty long in the tooth, it'd be nice to be able to factor that out somehow. The regex object comes to the rescue, and helps us clean up the code. It's basically a way to treat our regular expression as a separate object we can use later on, so if we ever find a timestamp that has a '+01:00', 'Z' and '-01:00' offset format we'll be ready for it.

The regex object looks almost like the matching expression, except that it uses braces to say when to start and stop:

my regex Timezone { Z | [ '+' | '-' ] \d\d ':' \d\d };

say $logfile ~~ / \d\d\d\d '-' \d\d '-' \d\d T \d\d ':' \d\d <Timezone>/;
The <..> visually sets off the refactored expression from the main text, and having the expression for the Timezone separated out means that we can use this elsewhere in the code, or even put it into our own library of expressions if we wanted to. In fact, let's do that for all of the separate bits here, like so:

my regex Date { \d\d\d\d '-' \d\d '-' \d\d };
my regex Time { \d\d ':' \d\d };
my regex Timezone { Z | [ '+' | '-' ] \d\d ':' \d\d };

say $logfile ~~ / <Date> T <Time> <Timezone> /;
Having all of those \d\d sitting in a row is really a bit visually disturbing, so let's factor that out:

my regex Integer { \d+ };
my regex Date { <Integer> '-' <Integer> '-' <Integer> };
my regex Time { <Integer> ':' <Integer> };
my regex Timezone { Z | [ '+' | '-' ] <Integer> ':' <Integer> };

say $logfile ~~ / <Date> T <Time> <Timezone> /;
The '+' character in the Integer regular expression is new. It's not surrounded by quotes, so it has a special meaning. It just means "Match whatever is before me at least once." In this case "Whatever is before me" is '\d', which is shorthand for "a digit from 0 to 9" (This applies to digits in other languages as well, like ٤ in Arabic.)

Review Time

  • Regular expressions match portions of a string containing arbitrary text.
  • They can match literal text, such as /Jane Bloggs/.
  • They can also match more complicated expressions, like /\d+ Main Street/.
  • You can take pieces out of an expression, like /<Integer> Main Street/.
  • These pieces are named, and can be put into their own libraries.
We've only talked here about the shortcut for digits, of course there are many others, and for those you'll want to read Regexes at docs.perl6.org, or your local manpages. The most common shorthand expressions are \d (which we've used above), \w which is any "word" character like 'a'..'z', 'A'..'Z' and '0'..'9', and \s which is any whitespace such as ' ', <tab> or <carriage returm>. These all work with the '+' special character, along with a few others we haven't mentioned.

If you want to make something optional, add '?' after it. For instance, if you were looking for someone's name, you might want to check for a leading honorific, like this: / 'Dr. '? John Fredericks/ which would match 'Dr. John Fredericks' or just 'John Fredericks'.

Floating-point numbers can sometimes get you into interesting circumstances. In most languages, '123.456' is a legitimate floating-point number, which you can match with the tools you've learned with /\d+ '.' \d+/. This is to say "One or more digits, followed by a decimal, followed by one or more digits."

The catch is that some languages require there to be digits after the decimal point, in other languages they're optional, like '123.'. Given what we've talked about above, you might think that one solution would be / \d+ '.' [ \d+ | ] /, and just leaving the "after" part of the '|' blank. This leaves behind a null (empty) regular expression, which is illegal in Perl 6. Does this mean we can't match it?

Of course we can. The trick is to use the other special character, '*'. This can be read as "Match nothing, or whatever is before me at least once." So instead of /\d+/, we use /\d*/, which we can read as "Zero or more digits."

Confusion abounds

Using this special character can seem counterintuitive or ever confusing. Writing our full regular expression to match a floating-point number now looks like / \d+ '.' \d* /, which can be thought of as "Match at least one digit, a decimal point, and any number of digits, or even no digits." This matches '123.', '1.23', '0.001' and of course '1.1', all of which have zero or more digits on the right side of the decimal.

It's pretty clear when matching digits, but when matching strings, the water gets muddier. For instance, suppose you're a biochemist searching for a sequence 'ATTT...' in a gene. Matching against /AT+/ lets you match 'AT', 'ATT', 'ATTTTTT' etcetera. All seems fine, but in reality the 'TTTTT' part is optional, so you change your expression to /AT*/. 'A' matches, 'AT', 'ATTTTT' all match, as they should. But 'AGT' matches. So does 'AGTTTTT' and 'AGGTA'. How did that 'G' get in the middle?

The simple answer is that it didn't. When you asked it to look for "A followed by zero or more Ts", that doesn't mean the same as "A followed by nothing or a string of T's." It searched for "A", found no "T" after it, so it stopped the search and reported a match found. The special characters '?', '+' and '*' don't "look ahead" any more than they have to.

This is one reason why expressions like /'"' .+ '"'/ should be viewed with a bit of caution, especially if you're dealing with strings that could have nested strings inside them. We'll talk about those expressions and more in part II of this tutorial series on expressions and grammars.

Stay Tuned for Part II


2 comments:

  1. Please format your code with code tags. It will make your code so much more readable.

    ReplyDelete
  2. Thanks for providing this informative information. it is very useful you may also refer- http://www.s4techno.com/blog/category/perl/

    ReplyDelete