13 February 2016

From Regular Expressions to Grammars: Pt. 2

If you're not already familiar with regular expressions, please read Part 1 of this tutorial first, to get caught up on the basic idea of what regular expressions are, and why you would want to use them in the first place.

Recap

When we last left our heroes [sorry, the Game of Thrones soundtrack is on in the background] they were learning about how to refactor Perl 6 regular expressions into easy-to-digest bits. We started off with the task of searching for '2016-02-06T14:36+02:00' in a sample logfile. This slowly evolved into searching for any ISO-8601 timestamp in our sample file, by replacing the explicit digits (0 .. 9) with a generic placeholder '\d' meaning any digit.

This fixes the problem of only being able to match a single timestamp but leaves behind an ugly string of '\d\d\d\d' when matching the year (I.E. four digits in a row.) We solved this by replacing '\d\d\d\d' with '\d+', which simply means "Any number of digits in a row."

This declutters our regular expression, but at a slight cost. Our expression matches '2016-02-13T13:10+02:00', but it also matches '19116-02-13T13:10+02:00' because '\d+' doesn't tell the regular expression engine how many digits to match, it just says "Match as many as you can." This may be what you want in some situations, but not here.

A person has got to know their limitations.

Luckily, Perl 6 has just the expression we need. A quick scan through Regexes on docs.perl6.org leads us to the quantifier section, and near the end of the list we find "min .. max" which tells the regular expression engine to match what's before it at least min times, and at most max times. There's even a shortcut which tells us the regex engine to match what's before it exactly N times.

my regex Date { \d ** 4 '-' \d ** 2 '-' \d ** 2 }
This also stops the regular expression from matching '19116-02-100', which, while the sort of thing you worry about if you work at the Long Now foundation, isn't the sort of time problem that comes up in general.

By way of recapping, /literal string here/ matches a sequence of alphanumerics. Anything that's not alphanumeric (by the way, alphanumeric here isn't restricted to US ASCII, any character with the 'Letter' or 'Number' Unicode property qualifies) has to be quoted or escaped in some fashion, lest it be confused with a current or future metacharacter.

If you want to make something optional, follow it with '?', like in:

"Skyfall" ~~ /Sky 'fall'?/;
This matches both 'Sky' and 'Skyfall', thus treating 'fall' as an option when matching 'Sky'. If you're being particularly diligent in writing tests, you might notice that 'Skyfalling' also matches this expression, as does 'Skyfail'.

Perl 6 regular expressions, like most RE engines, stop when they've found a match. Going from left to right, in the case of 'Skyfalling', the process looks like this, roughly:

  • "Skyfalling" ~~ /Sky 'fall'?/ # Try 'Sky', succeed
  • "Skyfalling" ~~ /Sky 'fall'?/ # Try 'fall', succeed, report success
The RE engine has matched all of the terms in the regular expression, so after matching 'Skyfall', it's done, so even though there's more of the string left over, the match succeeds. If only there were a way to tell it the match isn't quite done yet.

Lucky for us, there is. In fact, true to Perl's nature, there are several ways, each appropriate for different situations. The most common case is that 'Skyfall' appears in a sentence, and you want to look for it in "Come to Skyfall, Mr. Bond", "Mr. Bond, come to Skyfall." or "The Sky is Falling, Mr. Bond."

As you can see, the term 'Skyfall' (or just 'Sky', remember the optional clause) is followed by either a ',', ' ' or '.'. You could try to match all of those combinations, and possibly miss many more because of the wealth of Unicode punctuation, or you could use the built-in '>>' shortcut. This stops the match at the end of the word, so this expression will now fail as you expect:

"Skyfalling" ~~ /Sky 'fall'? >>/

For this to match, 'Skyfall' or 'Sky' have to be followed by either the end of a string, or something that's not an alphanumeric character, like '.' or ';' (which we left out in the example above, to prove a point.) So now, this regular expression matches "Sky diving today!" or "Skyfall, as it crumbles..." but not "Skyfail" or "Isle of Skye".

I want my M(atch)TV

Regular expressions are pretty powerful on their own, there are entire UNIX tools dedicated to nothing but regular expressions (I.E. grep.) But sometimes it's helpful to be able to do more than just tell whether you've found a match in your input.

Suppose, for instance, that you're doing some DBA work, and you've been given 2 different SQL files that both have to be merged into the same database table. One looks like this:

INSERT INTO tag VALUES( 1, 'Perl 6' );
INSERT INTO tag VALUES( 2, 'Moose' );
And the other set looks almost the same:
INSERT INTO tag VALUES( 1, 'Pearlbee' );
INSERT INTO tag VALUES( 2, 'Dancer' );
All of this data has to get into your tag table somehow, and the tag IDs have to be unique to boot. This sort of thing happens in real life when trying to reconstruct tables from sharded data, so it's not a theoretical problem. The solution is simple once you stumble across it - modify the first file to insert at IDs 1, 3, 5, 7, 9... and modify the other file to insert at IDs 2, 4, 6, 8 and so on.

So, let's dig in. We already have most of the tools we need, so let's try to match our first statement:

say "INSERT INTO tag VALUES( 1, 'Perl 6' );" ~~
/'( ' \d+ ','/
Previously we've matched the entire statement, but here we'll just match the "( 1," portion. Regular expressions don't have to match the entire string in order to find a hit, they just have to find enough to be a unique match.

It's looking for '(' followed by an integer followed by a comma, which is exactly what '( 1,' is in our input. So we've found a match, wonderful. We can even do this in a loop like so:
for @lines -> $line {
  say $line ~~ /'( ' \d+ ','/
}
which is wonderful, and tells us that we've matched all of our input, confirming that we can read our SQL file. Great. Now what?

It's time to capture our data so we can do something with it. Let's introduce the capturing group, '( .. )'. This captures what the regular expression engine matches, and puts it into a variable that we can access later, like so:
for @lines -> $line {
  if $line ~~ /'( ' ( \d+ ) ','/ {
    say "Found ID $0"
  }
}
We can even capture both the ID and the tag name like so, modify the ID and write the file back out, like this:
for @lines -> $line {
  if $line ~~ /'( ' ( \d+ ) ',' ( \' .+ \' )/ {
    say "INSERT INTO tag VALUES( {2 * $0 - 1}, $1 );"
  }
}
This recreates the old file with "INSERT INTO tag VALUES(  1, 'foo' ); INSERT INTO tag VALUES( 3, 'bar' );" and so on, skipping every other ID. Change the formula from {2 * $0 - 1} to just {2 * $0}, and you now have two new files, one with odd IDs:
INSERT INTO tag VALUES( 1, 'Perl 6' ); -- 1 -> (2 * 1 - 1) == 1
INSERT INTO tag VALUES( 3, 'Moose' ); -- 2 -> (2 * 2 - 1) == 3
And the other file with even IDs:
INSERT INTO tag VALUES( 2, 'Pearlbee' ); -- 1 -> (2 * 1) == 2
INSERT INTO tag VALUES( 4, 'Dancer' ); -- 2 -> (2 * 2) == 4<
So now you can load the two files in any old order, and since one has the odd-numbered IDs and the other has even-numbered IDs, at the end you'll have both sets of data inserted into the database, collision-free. Incidentally, just change the constant 2 to 3 or more, and this trick generalizes to any number of files - rewrite 1,2,3 to 1, 4, 7, the next file to 2, 5, 8, and the last file to 3, 6, 9. Again, the IDs won't collide because each file's IDs goes exactly in the space left by the other two files.

You might be wondering to yourself, though, what that '.+' is doing there. We've used the '+' above, when dealing with dates, so you remember that means "One or more of what's before it." In this case, what's "before it" is just a bare '.', which we haven't seen before. This is another placeholder, and it stands for "any character", so the entire expression "( \' .+ \' )" can be read as 'Capture (...) everything that starts with a single quote, has one or more characters (we don't care what) inside, and ends with another single quote."

 Coverage

We've now covered the basic regular expression metacharacters: +, *, ? and '.'. At the end we also talked about how to actually use the captured text in Perl 6 code, and combined what we've learned along the way into a final expression that uses grouping, metacharacters and actually processing the data using a real-world example of munging a database table together from independent shards.

Stay tuned for the next installment in this series, where we pull what we've learned together into a Perl 6 grammar that reads and interprets JavaSrcript code in a JIT compiler.

6 comments:

  1. Thanks for these tutorials, I'm learning a lot!

    ReplyDelete
    Replies
    1. You're more than welcome, I'm just glad to see people actually reading them :)

      Delete
  2. Perl 6 has substitution too of course. A substitution solution might look something like:

    my @sql_files = <upd-db-id-1.sql upd-db-id-2.sql>;

    for 0 .. 1 –> $i {
        for @sql_files[$i].IO.lines –> $_ is copy {
            .say if s/'(' \s* ( \d+ ) \s* ',' /( {2 * $0 + $i -1},/;
        }
    }

    ReplyDelete
    Replies
    1. Of course, and thank you for bringing up the possibility. That portion of the tutorial focused on capturing results, and while your code is a more canonical way to accomplish the same task, introducing both file I/O and substitution in a section devoted to matching felt like too much.

      Delete