07 January 2016

Bottoms Up

After an evening of playing with bottom-up parsing, I can now explain some more of the decision-making process that goes into a grammar. Yesterday was a bit long-winded, so I'm going to gloss over a few steps. Today's PHP selection is this:

function simple() {
  $a = 0;
  for ($i = 0; $i < 1000000; $i++)
    $a++;

  $thisisanotherlongname = 0;
  for ($thisisalongname = 0; $thisisalongname < 1000000; $thisisalongname++) {
    $thisisanotherlongname++;
  }
}
This very quickly tokenizes to:

<FUNCTION> <FUNCTION-NAME> '(' ')' '{'
  <SCALAR> '=' <DIGITS> ';'
  <FOR> '(' <SCALAR> '=' <DIGITS> ';' <SCALAR> '<' <DIGITS> ';' <SCALAR> '++' ')'
    <SCALAR> '++' ';'

<SCALAR> '=' <DIGITS> ';'
<FOR> '(' <SCALAR> '=' <DIGITS> ';' <SCALAR> '<' <DIGITS> ';' <SCALAR> '++' ')' '{'
  <SCALAR> '++' ';'
  '}'
'}'

The first thing I'd like to point out is that I've left structural details such as braces, parens and semicolons as text strings. In a more traditional lex/yacc (or flex/bison, if you prefer GNU) file we'd have to give all of these characters unwieldy names, so you'd end up with <SCALAR> <POSTINCREMENT> <SEMICOLON> or some equivalent nonsense. It's much easier to just keep the characters as they are, plus we're going to lose most of those characters as we go along.

I'm going to assume a test-driven development style, so the first test (and only test, for a while) will be to see that our grammar matches our sole test file:

use Test; my $g = PHP.new; ok $g.parsefile( "benchmark.php" )

This assumes that the PHP code at the start is in a file called "benchmark.php" somewhere that your test code can run it. And of course that you're running Perl 6.

This test should pass at all times, and you should be able to regularly commit your changes without having to undo too far down the path. In order to make sure that the test passes at all times, it's critical that we aren't too greedy. So we wouldn't start off creating a `rule function { .. }`, as you might in a more top-down design.

This is also a psychological measure to keep frustration levels to a minimum. By taking small bites, we can commit small deltas at a time, and even document each rule as we go along.

Taking a bite out of crime

Rarely has the advice "Do the least that could possibly work" been more apt. We'll start by creating a tiny, tiny rule that just factors out the '$a++', '$i++' and '$thisisalongname++' in our code. It'll look like:

rule postincrement-expression { <SCALAR> '++' }

and our first change to our grammar will look like:

 grammar PHP {
        rule postincrement-expression { <SCALAR> '++' }
        rule TOP {
    <FUNCTION> <FUNCTION-NAME> '(' ')' '{'
      <SCALAR> '=' <DIGITS> ';'
      <FOR> '(' <SCALAR> '=' <DIGITS> ';' <SCALAR> '<' <DIGITS> ';' <postincrement-expression> ')'
        <postincrement-expression> ';'

    <SCALAR> '=' <DIGITS> ';'
    <FOR> '(' <SCALAR> '=' <DIGITS> ';' <SCALAR> '<' <DIGITS> ';' <postincrement-expression>')' '{'
      <postincrement-expression> ';'
      '}'
    '}'
        }
}
You might have noticed that we've ignored the trailing semicolon entirely, even though we could have made it optional. We could have chosen to do just that, and the rule would have looked like:

rule postincrement-expression { <SCALAR> '++' ':'? }

Consider this in a larger context, for the moment. We're going to want to have some way to match entire statements, eventually. Our `postincrement-expression` could just as well be part of a much larger rule, where it could look like:

rule postincrement-operator { <SCALAR> '++' ';'? }
rule postdecrement-operator { <SCALAR> '--' ';'? }
rule statement
        { <postincrement-operator>
        | <preincrement-operator>
        }
rule statements
        { <statement>+ }
This will work, and `statements` will correctly parse '$a++;$b--;' - '$a++;' is a valid postincrement-operator, and '$b--;' is a valid postdecrement-operator. It will also correctly parse '$a++$b--;' which is not legal.

Also, if we keep the ';' out of the expression we can reuse this all over the place. So, the two things to take away from this short lesson are to keep your grammar rules short, and keep the structural elements like ';' out of play for as long as possible.

Assignments

Keeping to our rule, we'll turn our attention to '$i = 0', '$a = 0' and '$isalongname = 0'. This gives us a new rule, and replacing every instance of `<SCALAR> '=' <DIGITS>` gives us:

grammar PHP {
        rule assignment-expression { <SCALAR> '=' <DIGITS> }
        rule TOP {
    <FUNCTION> <FUNCTION-NAME> '(' ')' '{'
      <assignment-expression> ';'
      <FOR> '(' <assignment-expression> ';' <SCALAR> '<' <DIGITS> ';' <postincrement-expression> ')'
        <postincrement-expression> ';'
       
    <assignment-expression> ';'
    <FOR> '(' <assignment-expression> ';' <SCALAR> '<' <DIGITS> ';' <postincrement-expression>')' '{'
      <postincrement-expression> ';'
     
      '}'
    '}'
        }
}       

 The Virtues of the Approach

Perl 6 is still a Perl, so the normal 'laziness' virtue still applies here. If you miss an instance of `<SCALAR> '=' <DIGITS>` somewhere in your test file, the test will still pass. You can replace just enough to make sure your test passes, and it's still completely valid.

This way, too, there's no "flag day" where you have to change every test because you found some tiny detail and have to backtrack changes in the grammar. If we were designing the grammar top-down, changing the top-level `function-declaration` would cascade throughout the entire test suite.

Comparisons

Similar reasoning applies to `<SCALAR> '<' <DIGITS>`, because this fragment too can be used elsewhere, say in an 'if ( $a < 3 )' comparison. I'll just show the modified grammar, because the rule is fairly trivial:

grammar PHP {
        rule comparison-expression { <SCALAR> '<' <DIGITS> }
        rule TOP {
    <FUNCTION> <FUNCTION-NAME> '(' ')' '{'
      <assignment-expression> ';'
      <FOR> '(' <assignment-expression> ';' <comparison-expression> ';' <postincrement-expression> ')'
        <postincrement-expression> ';'
       
    <assignment-expression> ';'
    <FOR> '(' <assignment-expression> ';' <comparison-expression> ';' <postincrement-expression>')' '{'
      <postincrement-expression> ';'
     
      '}'
    '}'
        }
}   

Naked Blocks

We've got two styles of for loop here, one with a block and one with just a single statement after it. The usual "shortest string" principle applies here, so we're going to create a `for-expression` rule that we can use before a statement or the start of a block, like so:

rule for-expression { <FOR> '(' <assignment-expression> ';' <comparison-expression> ';' <postincrement-expression> ')' }

And the resultant grammar looks like:

grammar PHP {
        rule for-expression
                { <FOR> '(' <assignment-expression> ';' <comparison-expression> ';' <postincrement-expression> ')'
                }
        rule TOP {
    <FUNCTION> <FUNCTION-NAME> '(' ')' '{'
      <assignment-expression> ';'
      <for-expression>
        <postincrement-expression> ';'
       
    <assignment-expression> ';'
      <for-expression> '{'
        <postincrement-expression> ';'

      '}'
    '}'
        }

The TOP rule looks much simpler now, and we can almost see our goal. Let's consolidate the expressions into one place:
grammar PHP {
        rule expression
                { <assignment-expression>
                | <postincrement-expression>
                }
        rule TOP {
    <FUNCTION> <FUNCTION-NAME> '(' ')' '{'
      <expression> ';'
      <for-expression>
        <expression> ';'
       
      <expression> ';'
      <for-expression> '{'
        <expression> ';'
      '}'
    '}'
        }
}   
Since '$a++' and '$a = 0' can occur in regular statements, we're going to group them into one `expression` rule for later use. Now, iterating on the "Don't repeat yourself" principle, we'll create one last trivial rule and make a `statement` rule.

grammar PHP {
        rule statement { <expression> ';' }
        rule TOP {
    <FUNCTION> <FUNCTION-NAME> '(' ')' '{'
      <statement>
      <for-expression>
        <statement>
     
      <statement>
      <for-expression> '{'
        <statement>
      '}'
    '}'
        }
}

Grand Unification

Let's create a `for-statement` rule that collapses both kinds of 'for expressions, the ones with and without blocks:

grammar PHP {
        rule for-statement
                { <for-expression <statement>
                | <for-expression> '{' <statement> '}'
                }
        rule TOP {
    <FUNCTION> <FUNCTION-NAME> '(' ')' '{'
      <statement>
      <for-statement>
       
      <statement>
      <for-statement>
    '}'
        }
}   

And finally, we'll create a `line` rule that unifies `statement` and `for-statement`:

grammar PHP {
        rule line
                { <statement>
                | <for-expression>
                }
        rule TOP {
    <FUNCTION> <FUNCTION-NAME> '(' ')' '{'
      <line>
      <line>

      <line>
      <line>
    '}'
        }
}    

Lo and behold, we've unified all of our code into a single rule, so we can trivially replace the four instances of `<line>` with just one, and account for the possibility that there can be no lines in a function:

grammar PHP {
        rule TOP {
    <FUNCTION> <FUNCTION-NAME> '(' ')' '{'
      <line>*
    '}'
        }

And there you have it. Rename 'TOP' to 'function', and we have a complete grammar:
grammar PHP
        {
        rule assignment-expression { <SCALAR> '=' <DIGITS> }
        rule comparison-expression { <SCALAR> '<' <DIGITS> }
        rule for-expression
                           
                { <FOR> '(' <assignment-expression> ';' <comparison-expression> ';' <postincrement-expression> ')'
       
                }
        rule expression
                { <assignment-expression>
                | <postincrement-expression>
                }
        rule statement { <expression> ';' }
        rule for-statement
                { <for-expression <statement>
                | <for-expression> '{' <statement> '}'
                }
        rule line
                { <statement>
                | <for-expression>
                }
        rule function
                {
                <FUNCTION> <FUNCTION-NAME> '(' ')' '{'
                <line>*
                '}'
                }
        rule TOP { <function> }
        }    

This will parse our original benchmark.php and so much more. If you want to explore this, there are some obvious starting points, for instance you could add different comparison types like this:

        rule comparison-expression
                { <SCALAR> '<' <DIGITS>
                | <SCALAR> '>=' <DIGITS>
                }
Or instead of assignment, add bitshifts with `<SCALAR> '<<' <DIGITS>`. Be well, and write good code.
 

05 January 2016

Parsing the easy way

Caveat Emptor

There are many Perl 6 tutorials out there, this is not one of them. It assumes some familiarity with the language and some familiarity with regular expressions, the mini-language that Perl 6 grammars derived from. You should probably also have an idea of why someone would want to write one.

I'm going to simply talk about being able to take a file in a given language (PHP is my chosen victim, as being able to interpret PHP code in Perl 6 seems like a useful goal) and create a grammar (giant regular expression) that turns your PHP code into a Perl 6 match object which you can then walk.

If this doesn't sound useful to you, well, it isn't without a bit of additional work. In Perl 6 you can combine this matching object with a set of actions that trigger when a match is made, and those can do many wondrous things. For instance, an action could be made that triggers when '$a = 5;' is matched, and that could set the Perl 6 variable $a to 5, thus setting a Perl 6 variable from inside PHP 5 code.

Inspiration

I'm working on some talks for upcoming conferences  (FOSDEM, OSCON to name a few), and was stuck on how to explain a clean way to write and debug a parser. I'd rather not go the CS 401 generic route of creating the usual expression grammar thing that everyone uses to prove what a pain left-recursion can be, so I figured that PHP would be worthwhile, and if I can get a full PHP grammar out of a talk, so be it.

Now, PHP being a Perl derivative means that the grammar is a bit ... dodgy. If you've ever seen the Perl lexer, you know the challenges we face. There's also the additional challenge of Perl 6 being relatively new, so it's even more important that we work in a stepwise fashion.

When I created perl6-ANTLR4 I had the benefit of a grammar to work from, and a reasonable idea of how I was going to convert it. With PHP I really had no such guarantee, so through trial-and-error I came across the following approach. It might not scale to the full language, but it's one way of approaching the problem.

I'll put the full PHP source up on GitHub once I'm convinced this is a workable approach, but the basic idea I've come up with is as follows:
  1. Collect the sample source into a corpus/ directory.
  2. Write a test file that parses each file in the corpus.
  3. Write a stub grammar whose TOP rule consists of each file in the corpus separated by an '|' alternation symbol.
  4. Run the test file over the corpus.
  5. For each token you find, create a 'token FOO { ... }' rule. Most of these should be formulaic, and if you're creative with an edge-case test case the rest should fall out.
  6. Search/replace each occurrence of the token with '<FOO>' (and the appropriate quotes, this does get tedious.)
  7. Repeat steps 5 and 6 until the grammar is tokenized.
  8. For each common sequence of tokens you find, create a 'rule bar { ... }' rule that matches them.
  9. Replace the common sequence of tokens with '<bar>' as need be.
  10. Repeat steps 8 and 9 until the language is parsed, or you're out of patience.
This won't make much sense without some examples, so I'll draw on my current tackling of PHP for inspiration. Incidentally I'm impressed that the existing Perl 6 grammar construct can handle this approach, which I've already applied to sequences of upwards of several thousand tokens in a row.

Here's a sample (suitably truncated) Perl 6 test file:

use v6;
use PHP::Grammar;
use Test;      

plan 1;

my $g = PHP::Grammar.new;

ok $g.parsefile('corpus-5/addglob.php');
ok $g.parsefile('corpus-5/addpattern.php';
In a nutshell, this simply tests that our PHP grammar matches the content of addglob.php and addpattern.php in our corpus-5/ directory. It should do so trivially, because the grammar file consists of almost nothing but these two files verbatim.

The starting grammar file should look like this:
use v6;
grammar PHP::Grammar
        {
        rule TOP
                {
# corpus-5/addglob.php below, just stick it in quotes.               
'<?php
class a {
        public function test($arg = c::TESTCONSTANT) {
                echo __METHOD__ . "($arg)\n";
        }      
       
        static public function staticTest() {
        }
}'     

|
# corpus-5/addpattern.php below, remember to escape "'<'.'?php';?>"
'#!/usr/bin/php
<?php
echo \'<\'.\'?php\';?>

foreach(array("SPL", "Reflection", "Phar") as $ext) {
        if (!extension_loaded($ext)) {
                echo "$argv[0] requires PHP extension $ext.\n";
                exit(1);
        }      
} ?>'
                }

        }
 
This is, of course, a horrible abuse of the Perl 6 grammar system, but it's a testament to the fact that the system works, and you can incrementally develop your grammar from this baseline. This "parser" just checks its input string to see if it matches the content of addglob.php or addpattern.php, and if so, returns True. You can also ask it for the match contents, and lo and behold, it will return one or the other of the match strings.

Of course, this is only useful if you want your grammar to match the contents of these two files. We're aspiring to something a little greater than this, so bear with me. Incidentally, if you're copy/pasting the source code from this posting, please make sure the "} ?>'" is flush-left in your editor.

Enclosing the entire string in quotes means that the match is open to the vagaries of whitespace, so if you're going to follow me on this journey I'd recommend copying and pasting the files yourself rather than relying on the browser to keep whitespace intact.

Bottoms Up!

We're going to start by giving each term (like '3', 'foreach' or '"Hello world!"') in our program a name. This is backwards from the usual process, because ordinarily we'd have a grammar description in another language to guide us. The problem is that, well, most grammars like yacc, bison and ANTLR simply aren't Perl, and don't quite map onto the tools we have at our disposal.

Basically, you can think of this process as refactoring a regular expression, but an RE that's far more powerful than what you've had at your disposal with Perl 5 or its brethren and sistren.

Let's start this process by looking at '!extension_loaded($ext)' in the source. We'll say that everything that corresponds to a single string in the source gets labeled in all UPPER CASE. Later on this will keep us from confusing things.

So, in this context, `!` is the negation operator, so we'll name it NEGATION-OP. `extension_loaded` is a function name, so we'll call it FUNCTION-NAME. Likewise `$ext` is a scalar variable, so we'll just call it SCALAR.

In the grammar, we can replace `!` with `<NEGATION-OP>`, `extension_loaded` with `<FUNCTION-NAME>` and `$ext` gets replaced with `<SCALAR>`. You'll notice that `(` and `)` didn't get names. They're really more placeholders than anything, so we'll keep `(`, `)', `{`, `}` and so on just as raw text.

Break it Down

For the moment, though, we'll just give the simple things in the program names, like `static`, `function` and `class`. You can already start to see how things might group though, just looking at this new version:
use v6;
grammar PHP::Grammar
        {
        token STATIC { 'static' }
        token CLASS { 'class' }
        token PUBLIC { 'public' }
        token FUNCTION { 'function' }
        token FOREACH { 'foreach' }
        token AS { 'as' }
        token IF { 'if' }
        token ECHO { 'echo' }
        token EXIT { 'exit' }
        rule TOP
                {
      
'<?php'
<CLASS> 'a {'
        <PUBLIC> <FUNCTION> 'test($arg = c::TESTCONSTANT) {'
                <ECHO> '__METHOD__ . "($arg)\n";
        }'     

        <STATIC> <PUBLIC> <FUNCTION> 'staticTest() {
        }
}'     

|

'#!/usr/bin/php
<?php'
<ECHO> '\'<\'.\'?php\';?>'

<FOREACH> '(array("SPL", "Reflection", "Phar")' <AS> '$ext) {'
        <IF> '(!extension_loaded($ext)) {'
                <ECHO> '"$argv[0] requires PHP extension $ext.\n";'
                <EXIT> '(1);
        }      
} ?>'  

                }
        }
The `token ECHO {...}` is how you assign the string 'echo' its own name, and instead of the raw text 'echo' in the program, you replace it with `<ECHO>`. Don't forget to replace the quotes, because you're effectively interpolating `<ECHO>` into the string.

We've only replaced the static text here, so it's it's still only going to be able to match the original file contents, but that will change soon enough. You can start to see a bit of how we might be able to refactor some of this at a later date.

Keeping in mind that Perl 6 grammars are essentially your old regular expressions on steroids, look at `<STATIC> <PUBLIC> <FUNCTION>`. PHP doesn't require that all functions be statically declared, so we could change that to `<STATIC>?` to make both 'static public function foo(...)' and 'public function foo(...)' legal.

Careful about how far you take this though. If you make both 'static' and 'public' optional like so, `<STATIC>? <PUBLIC>? <FUNCTION>`, this makes 'static function foo(...)' a legal expression. Luckily it is legal in PHP, but other languages might not allow that.

Functionality

The goal is to give every term here a label, so let's pick on something else. `test(...)` and `selfTest()` seem like the next obvious candidates. Now, our ultimate goal is not to just match these two files, but any PHP program text, and my years of programming experience has led me to deduce that people are going to want to name functions something other than `test` and `selfTest`, so we need to accommodate those people.

We need a way to match a whole range of identifier characters, and lucky for us, this looks just like it did in Perl 5, using the `\w` character class.

 grammar PHP::Grammar
        {
        token FUNCTION-NAME { \w+ }
        rule TOP
                {
               
'<?php'        
<CLASS> 'a {'  
        <PUBLIC> <FUNCTION> <FUNCTION-NAME> '($arg = c::TESTCONSTANT) {'
                <ECHO> '__METHOD__ . "($arg)\n";
        }'     
       
        <STATIC> <PUBLIC> <FUNCTION> <FUNCTION-NAME> '() {
        }
}'     

|

'#!/usr/bin/php
<?php'
<ECHO> '\'<\'.\'?php\';?>'

<FOREACH> '(' <FUNCTION-NAME> '("SPL", "Reflection", "Phar")' <AS> '$ext) {'
        <IF> '(!' <FUNCTION-NAME> '($ext)) {'
                <ECHO> '"$argv[0] requires PHP extension $ext.\n";'
                <EXIT> '(1);
        }
} ?>'
                }
        }

A brief aside

You might be thinking "Whoa there, if we're just going to match any old string of letters, haven't we lost something here? Where does this go?" Well, it's really a named capture in a slightly different guise. Back in Perl 5 you were used to doing:

"2+3" =~ /(\d+) \+ (\d+)/x
 in order to capture the values `2` and `3` in $1 and $2 respectively, and may even have used the named capture feature. In Perl 6, these get renumbered to $0 and $1 respectively, and even better, you get a hash that you can play with, so a statement like:

"doStuff()" ~~ /<FUNCTION-NAME> '(' ')'/
 will capture 'doStuff' in $/{'FUNCTION-NAME'}. Not only do you have a hash to play with, but later on we'll exploit the fact that this $/ object is actually a multi-tiered data structure, so you can put named captures inside named captures to create a hash of hashes for you, and this all happens automatically.

Syntax and Semantics

There's one other thing that got swept under the rug here. Since `test` and `staticTest` both got abstracted away under `<FUNCTION-NAME>`, it's entirely possible that once the parser is fully complete, `function dup(){} function dup(){}` will actually match, even though it might not be legal PHP code.

What we're building doesn't distinguish between legal code and syntactically correct code. Generally speaking, if the compiler immediately returns a syntax error, then your parser should be able to catch it. Otherwise don't worry about it.

For testing purposes it is a good idea to add some corner cases like these, but in general the parser should be very laid-back about what it accepts. `my $x; my $x;` may throw a "redefined value" warning, but the parser should just see two variable declarations in a row, put them into a syntax tree, and go on about its business.

And back to our story.

Now, let's factor out the strings. For our purposes, strings are nothing but a `"`, some stuff that's not a `"`, then a final `"` to cap it all off. In Perl 6 regex terms, that looks like this (with single-quoted strings thrown in as well):

grammar PHP::Grammar
        {
        token DQ-STRING { '"' <-[ " ]> '"' }
        token SQ-STRING { '\'' <-[ ' ]> '\'' }
        rule TOP
                {

'<?php'
<CLASS> 'a {'
        <PUBLIC> <FUNCTION> <FUNCTION-NAME> '($arg = c::TESTCONSTANT) {'
                <ECHO> '__METHOD__ .' <DQ-STRING> ';
        }'

        <STATIC> <PUBLIC> <FUNCTION> <FUNCTION-NAME> '() {
        }
}'

|

'#!/usr/bin/php
<?php'
<ECHO> <SQ-STRING> ';?>'

<FOREACH> '(' <FUNCTION-NAME> '(' <DQ-STRING> ',' <DQ-STRING> ',' <DQ-STRING> ')' <AS> '$ext) {'
        <IF> '(!' <FUNCTION-NAME> '($ext)) {'
                <ECHO> <DQ-STRING> ';'
                <EXIT> '(1);
        }
} ?>'
                }
        }
Let's now do the big one, scalars. They're just like function names, but with a leading `$`.

grammar PHP::Grammar
        {
        token SCALAR { '$' \w+ }
        rule TOP
                {              
'<?php'        
<CLASS> 'a {'
        <PUBLIC> <FUNCTION> <FUNCTION-NAME> '(' <SCALAR> '= c::TESTCONSTANT) {'
                <ECHO> '__METHOD__ .' <DQ-STRING> ';
        }'     
       
        <STATIC> <PUBLIC> <FUNCTION> <FUNCTION-NAME> '() {
        }
}'     

|

'#!/usr/bin/php
<?php'
<ECHO> <SQ-STRING> ';>?'

<FOREACH> '(' <FUNCTION-NAME> '(' <DQ-STRING> ',' <DQ-STRING> ',' <DQ-STRING> ')' <AS> <SCALAR> ') {'
        <IF> '(!' <FUNCTION-NAME> '(' <SCALAR> ')) {'
                <ECHO> <DQ-STRING> ';'
                <EXIT> '(1);
        }
} ?>'

I'd like to point out that if you've been following along in your own Perl 6 terminal window, that the test cases we've been running haven't failed, or had to be rewritten, yet we've been refactoring the grammar like mad. We may be able to continue this way until the end, running more test files through our corpus of parsing until it handles PHP in all its glory. We may not. At each stage, though, the diffs between files make it clear what we've factored out, and what might have gone wrong.

New Rules

Now that a lot of the underlying confusion has been stripped away, let's think a little bigger. A lot of dealing with grammars is repeating the mantra "Don't repeat yourself." See how `<PUBLIC> <FUNCTION> <FUNCTION-NAME>` repeats? Let's refactor that out by creating a new rule. We'll also create another convention that when we clump UPPER-CASE names together, we'll give the clump a lower-case name.

Since it's almost a function declaration, let's call it that, like so:

grammar PHP::Grammar
        {
        rule function-declaration { <PUBLIC> <FUNCTION> <FUNCTION-NAME> }
        rule TOP
                {
'<?php'
<CLASS> 'a {'
        <function-declaration> '(' <SCALAR> '= c::TESTCONSTANT) {'
                <ECHO> '__METHOD__ .' <DQ-STRING> ';
        }'

        <STATIC> <function-declaration> '() {
        }
}'

|

'#!/usr/bin/php
<?php'
<ECHO> <SQ-STRING> ';?>'

<FOREACH> '(' <FUNCTION-NAME> '(' <DQ-STRING> ',' <DQ-STRING> ',' <DQ-STRING> ')' <AS> <SCALAR> ') {'
        <IF> '(!' <FUNCTION-NAME> '(' <SCALAR> ')) {'
                <ECHO> <DQ-STRING> ';'
                <EXIT> '(1);
        }
} ?>'
                }
        }
And that's how you create your own rules. Nothing to it, really. You'll notice that we've sort of left `<STATIC>` behind, though. When you're just starting out, it's best to keep your new rules simple, which means they should be just common sequences of terms.

But again, you can remember too that these rules are just regular expressions in disguise, so you can use your old friend `?` to declare something optional:

grammar PHP::Grammar
        {
        rule function-declaration { <STATIC>? <PUBLIC> <FUNCTION> <FUNCTION-NAME> }
        rule TOP
                {

'<?php'
<CLASS> 'a {'
        <function-declaration> '(' <SCALAR> '= c::TESTCONSTANT) {'
                <ECHO> '__METHOD__ .' <DQ-STRING> ';
        }'

        <function-declaration> '() {
        }
}'

|

'#!/usr/bin/php
<?php'
<ECHO> <SQ-STRING> ';?>'

<FOREACH> '(' <FUNCTION-NAME> '(' <DQ-STRING> ',' <DQ-STRING> ',' <DQ-STRING> ')' <AS> <SCALAR> ') {'
        <IF> '(!' <FUNCTION-NAME> '(' <SCALAR> ')) {'
                <ECHO> <DQ-STRING> ';'
                <EXIT> '(1);
        }
} ?>'
                }
        }
Just like in regular expressions, `?` makes a name optional. You could make `<PUBLIC>` optional as well, since having a `function foo()` with no `static` or `public` declaration is probably legal, but it might be better to verify that with another corpus file.

Expressionless


Now we're going to focus our attention on the `<ECHO> <DQ-STRING>` and `<ECHO> <SQ-STRING>` elements. Since most of the syntax we're using has much in common with regular expressions, you might guess that we can say "either <DQ-STRING> or <SQ-STRING>" with `( <DQ-STRING> | <SQ-STRING> )`, and you'd be right:

grammar PHP::Grammar
        {
        rule echo-expression { <ECHO> ( <SQ-STRING> | <DQ-STRING> ) }
        rule TOP
                {

'<?php'
<CLASS> 'a {'
        <function-declaration> '(' <SCALAR> '= c::TESTCONSTANT) {'
                <ECHO> '__METHOD__ .' <DQ-STRING> ';
        }'

        <function-declaration> '() {
        }
}'

|

'#!/usr/bin/php
<?php'
<echo-expression> ';?>'

<FOREACH> '(' <FUNCTION-NAME> '(' <DQ-STRING> ',' <DQ-STRING> ',' <DQ-STRING> ')' <AS> <SCALAR> ') {'
        <IF> '(!' <FUNCTION-NAME> '(' <SCALAR> ')) {'
                <echo-expression> ';'
                <EXIT> '(1);
        }      
} ?>'
                }
        }
There can be a little trap hiding here, though. This alternation works because the two string types differ in their first character, so the parser doesn't have to look ahead to choose which term to follow. If they differ further down, especially if there are `+` or `*` involved, there could be issues.

Generally you can get rid of these issues by reordering the alternatives, but sometimes it can be something of a waterbed problem. Let's slightly refactor the text after the `<function-declaration>` so we can see another DRY hiding:

 grammar PHP::Grammar
        {
        rule echo-expression { <ECHO> ( <SQ-STRING> | <DQ-STRING> ) }
        rule TOP
                {

'<?php'
<CLASS> 'a {'
        <function-declaration> '(' <SCALAR> '= c::TESTCONSTANT' ')' {'
                <ECHO> '__METHOD__ .' <DQ-STRING> ';
        }'

        <function-declaration> '(' ')' {
        }
}'

|

'#!/usr/bin/php
<?php'
<echo-expression> ';?>'

<FOREACH> '(' <FUNCTION-NAME> '(' <DQ-STRING> ',' <DQ-STRING> ',' <DQ-STRING> ')' <AS> <SCALAR> ') {'
        <IF> '(!' <FUNCTION-NAME> '(' <SCALAR> ')) {'
                <echo-expression> ';'
                <EXIT> '(1);
        }
} ?>'
                }
        }

Getting argumentative

Separating the `(` `)` makes it a little clearer that we have `<function-declaration> '(' {stuff} ')'` and `<function-declaration> '(' ')'`, which kind of makes {stuff} an option, doesn't it?

grammar PHP::Grammar
        {
        rule function-declaration
                {
                <STATIC>? <PUBLIC> <FUNCTION> <FUNCTION-NAME>
                '('
                ( <SCALAR> '= c::TESTCONSTANT' )?
                ')'
                }
        rule TOP
                {
               
'<?php'
<CLASS> 'a {'
        <function-declaration> '{'
                <ECHO> '__METHOD__ .' <DQ-STRING> ';
        }'     

        <function-declaration> '{'
        '}
}'

|

'#!/usr/bin/php
<?php'
<echo-expression> ';?>'

<FOREACH> '(' <FUNCTION-NAME> '(' <DQ-STRING> ',' <DQ-STRING> ',' <DQ-STRING> ')' <AS> <SCALAR> ') {'
        <IF> '(!' <FUNCTION-NAME> '(' <SCALAR> ')) {'
                <echo-expression> ';'
                <EXIT> '(1);
        }
} ?>'
                }
        }
So, by treating the `<SCALAR> '=...'` as an option, we've factored even more code out, and I hope that by now you can see where to take this. Basically, from here the idea is to take a file, add it to your test suite, boil it down as far as you can, then repeat with a new test file. After two or three iterations of this process, you don't really even need to add the new test file, you'll just be able to add a new test case (please, please don't forget the test case and corner cases) and plug in the new bit of grammar.

And of course most of this really isn't necessary if you have a grammar to work from, but I've found that even when you do have a grammar to work from, sometimes it helps to tear it down and rebuild it from scratch.

Denoument

I'll be the first to admit that going through an entire corpus of PHP files this way is impractical. But I've also tried going at this problem from the other end, namely taking a large grammar and beating on it until it managed to process the files I wanted it to read. (At least until the GLR, that is, but that's a separate issue.)

Now, even for a fairly large language, you don't have to take each file and tokenize them by hand, it soon becomes apparent what are tokens and what aren't, and Perl 6 has convenient whitespace rules to help the boundaries fall where they may.

I think the issue is the grammar debugger/tracer isn't quite tooled up to where I'd like it to be, and I'm going to take a crack at it after or maybe during FOSDEM.

The last time I checked, the tracer spit out the entire AST in raw .perl format, albeit with some indentation. What I really need to get out of that is simply the text with an <eject> character where the parser bailed, and a traceback of what rules it was running at the time it failed. That would help me more than the raw .perl format, and again I can probably do it, and want to take some time to straighten out the issue. I think the tools give me enough access to do what I want, and at that point I'd love to be able to throw a live debugger into a tmux session, and be able to tweak/rerun the debugging, say "Oh, it was in the expression-core rule at 'a := foo' and that binding expression hasn't been written yet."

In my copious free time, I guess. In any case, I hope I've laid bare some of the mysteries awaiting you in the depths of the Perl 6 grammars.

Thanks and until next time,

Happy coding!