07 January 2016

Bottoms Up

After an evening of playing with bottom-up parsing, I can now explain some more of the decision-making process that goes into a grammar. Yesterday was a bit long-winded, so I'm going to gloss over a few steps. Today's PHP selection is this:

function simple() {
  $a = 0;
  for ($i = 0; $i < 1000000; $i++)
    $a++;

  $thisisanotherlongname = 0;
  for ($thisisalongname = 0; $thisisalongname < 1000000; $thisisalongname++) {
    $thisisanotherlongname++;
  }
}
This very quickly tokenizes to:

<FUNCTION> <FUNCTION-NAME> '(' ')' '{'
  <SCALAR> '=' <DIGITS> ';'
  <FOR> '(' <SCALAR> '=' <DIGITS> ';' <SCALAR> '<' <DIGITS> ';' <SCALAR> '++' ')'
    <SCALAR> '++' ';'

<SCALAR> '=' <DIGITS> ';'
<FOR> '(' <SCALAR> '=' <DIGITS> ';' <SCALAR> '<' <DIGITS> ';' <SCALAR> '++' ')' '{'
  <SCALAR> '++' ';'
  '}'
'}'

The first thing I'd like to point out is that I've left structural details such as braces, parens and semicolons as text strings. In a more traditional lex/yacc (or flex/bison, if you prefer GNU) file we'd have to give all of these characters unwieldy names, so you'd end up with <SCALAR> <POSTINCREMENT> <SEMICOLON> or some equivalent nonsense. It's much easier to just keep the characters as they are, plus we're going to lose most of those characters as we go along.

I'm going to assume a test-driven development style, so the first test (and only test, for a while) will be to see that our grammar matches our sole test file:

use Test; my $g = PHP.new; ok $g.parsefile( "benchmark.php" )

This assumes that the PHP code at the start is in a file called "benchmark.php" somewhere that your test code can run it. And of course that you're running Perl 6.

This test should pass at all times, and you should be able to regularly commit your changes without having to undo too far down the path. In order to make sure that the test passes at all times, it's critical that we aren't too greedy. So we wouldn't start off creating a `rule function { .. }`, as you might in a more top-down design.

This is also a psychological measure to keep frustration levels to a minimum. By taking small bites, we can commit small deltas at a time, and even document each rule as we go along.

Taking a bite out of crime

Rarely has the advice "Do the least that could possibly work" been more apt. We'll start by creating a tiny, tiny rule that just factors out the '$a++', '$i++' and '$thisisalongname++' in our code. It'll look like:

rule postincrement-expression { <SCALAR> '++' }

and our first change to our grammar will look like:

 grammar PHP {
        rule postincrement-expression { <SCALAR> '++' }
        rule TOP {
    <FUNCTION> <FUNCTION-NAME> '(' ')' '{'
      <SCALAR> '=' <DIGITS> ';'
      <FOR> '(' <SCALAR> '=' <DIGITS> ';' <SCALAR> '<' <DIGITS> ';' <postincrement-expression> ')'
        <postincrement-expression> ';'

    <SCALAR> '=' <DIGITS> ';'
    <FOR> '(' <SCALAR> '=' <DIGITS> ';' <SCALAR> '<' <DIGITS> ';' <postincrement-expression>')' '{'
      <postincrement-expression> ';'
      '}'
    '}'
        }
}
You might have noticed that we've ignored the trailing semicolon entirely, even though we could have made it optional. We could have chosen to do just that, and the rule would have looked like:

rule postincrement-expression { <SCALAR> '++' ':'? }

Consider this in a larger context, for the moment. We're going to want to have some way to match entire statements, eventually. Our `postincrement-expression` could just as well be part of a much larger rule, where it could look like:

rule postincrement-operator { <SCALAR> '++' ';'? }
rule postdecrement-operator { <SCALAR> '--' ';'? }
rule statement
        { <postincrement-operator>
        | <preincrement-operator>
        }
rule statements
        { <statement>+ }
This will work, and `statements` will correctly parse '$a++;$b--;' - '$a++;' is a valid postincrement-operator, and '$b--;' is a valid postdecrement-operator. It will also correctly parse '$a++$b--;' which is not legal.

Also, if we keep the ';' out of the expression we can reuse this all over the place. So, the two things to take away from this short lesson are to keep your grammar rules short, and keep the structural elements like ';' out of play for as long as possible.

Assignments

Keeping to our rule, we'll turn our attention to '$i = 0', '$a = 0' and '$isalongname = 0'. This gives us a new rule, and replacing every instance of `<SCALAR> '=' <DIGITS>` gives us:

grammar PHP {
        rule assignment-expression { <SCALAR> '=' <DIGITS> }
        rule TOP {
    <FUNCTION> <FUNCTION-NAME> '(' ')' '{'
      <assignment-expression> ';'
      <FOR> '(' <assignment-expression> ';' <SCALAR> '<' <DIGITS> ';' <postincrement-expression> ')'
        <postincrement-expression> ';'
       
    <assignment-expression> ';'
    <FOR> '(' <assignment-expression> ';' <SCALAR> '<' <DIGITS> ';' <postincrement-expression>')' '{'
      <postincrement-expression> ';'
     
      '}'
    '}'
        }
}       

 The Virtues of the Approach

Perl 6 is still a Perl, so the normal 'laziness' virtue still applies here. If you miss an instance of `<SCALAR> '=' <DIGITS>` somewhere in your test file, the test will still pass. You can replace just enough to make sure your test passes, and it's still completely valid.

This way, too, there's no "flag day" where you have to change every test because you found some tiny detail and have to backtrack changes in the grammar. If we were designing the grammar top-down, changing the top-level `function-declaration` would cascade throughout the entire test suite.

Comparisons

Similar reasoning applies to `<SCALAR> '<' <DIGITS>`, because this fragment too can be used elsewhere, say in an 'if ( $a < 3 )' comparison. I'll just show the modified grammar, because the rule is fairly trivial:

grammar PHP {
        rule comparison-expression { <SCALAR> '<' <DIGITS> }
        rule TOP {
    <FUNCTION> <FUNCTION-NAME> '(' ')' '{'
      <assignment-expression> ';'
      <FOR> '(' <assignment-expression> ';' <comparison-expression> ';' <postincrement-expression> ')'
        <postincrement-expression> ';'
       
    <assignment-expression> ';'
    <FOR> '(' <assignment-expression> ';' <comparison-expression> ';' <postincrement-expression>')' '{'
      <postincrement-expression> ';'
     
      '}'
    '}'
        }
}   

Naked Blocks

We've got two styles of for loop here, one with a block and one with just a single statement after it. The usual "shortest string" principle applies here, so we're going to create a `for-expression` rule that we can use before a statement or the start of a block, like so:

rule for-expression { <FOR> '(' <assignment-expression> ';' <comparison-expression> ';' <postincrement-expression> ')' }

And the resultant grammar looks like:

grammar PHP {
        rule for-expression
                { <FOR> '(' <assignment-expression> ';' <comparison-expression> ';' <postincrement-expression> ')'
                }
        rule TOP {
    <FUNCTION> <FUNCTION-NAME> '(' ')' '{'
      <assignment-expression> ';'
      <for-expression>
        <postincrement-expression> ';'
       
    <assignment-expression> ';'
      <for-expression> '{'
        <postincrement-expression> ';'

      '}'
    '}'
        }

The TOP rule looks much simpler now, and we can almost see our goal. Let's consolidate the expressions into one place:
grammar PHP {
        rule expression
                { <assignment-expression>
                | <postincrement-expression>
                }
        rule TOP {
    <FUNCTION> <FUNCTION-NAME> '(' ')' '{'
      <expression> ';'
      <for-expression>
        <expression> ';'
       
      <expression> ';'
      <for-expression> '{'
        <expression> ';'
      '}'
    '}'
        }
}   
Since '$a++' and '$a = 0' can occur in regular statements, we're going to group them into one `expression` rule for later use. Now, iterating on the "Don't repeat yourself" principle, we'll create one last trivial rule and make a `statement` rule.

grammar PHP {
        rule statement { <expression> ';' }
        rule TOP {
    <FUNCTION> <FUNCTION-NAME> '(' ')' '{'
      <statement>
      <for-expression>
        <statement>
     
      <statement>
      <for-expression> '{'
        <statement>
      '}'
    '}'
        }
}

Grand Unification

Let's create a `for-statement` rule that collapses both kinds of 'for expressions, the ones with and without blocks:

grammar PHP {
        rule for-statement
                { <for-expression <statement>
                | <for-expression> '{' <statement> '}'
                }
        rule TOP {
    <FUNCTION> <FUNCTION-NAME> '(' ')' '{'
      <statement>
      <for-statement>
       
      <statement>
      <for-statement>
    '}'
        }
}   

And finally, we'll create a `line` rule that unifies `statement` and `for-statement`:

grammar PHP {
        rule line
                { <statement>
                | <for-expression>
                }
        rule TOP {
    <FUNCTION> <FUNCTION-NAME> '(' ')' '{'
      <line>
      <line>

      <line>
      <line>
    '}'
        }
}    

Lo and behold, we've unified all of our code into a single rule, so we can trivially replace the four instances of `<line>` with just one, and account for the possibility that there can be no lines in a function:

grammar PHP {
        rule TOP {
    <FUNCTION> <FUNCTION-NAME> '(' ')' '{'
      <line>*
    '}'
        }

And there you have it. Rename 'TOP' to 'function', and we have a complete grammar:
grammar PHP
        {
        rule assignment-expression { <SCALAR> '=' <DIGITS> }
        rule comparison-expression { <SCALAR> '<' <DIGITS> }
        rule for-expression
                           
                { <FOR> '(' <assignment-expression> ';' <comparison-expression> ';' <postincrement-expression> ')'
       
                }
        rule expression
                { <assignment-expression>
                | <postincrement-expression>
                }
        rule statement { <expression> ';' }
        rule for-statement
                { <for-expression <statement>
                | <for-expression> '{' <statement> '}'
                }
        rule line
                { <statement>
                | <for-expression>
                }
        rule function
                {
                <FUNCTION> <FUNCTION-NAME> '(' ')' '{'
                <line>*
                '}'
                }
        rule TOP { <function> }
        }    

This will parse our original benchmark.php and so much more. If you want to explore this, there are some obvious starting points, for instance you could add different comparison types like this:

        rule comparison-expression
                { <SCALAR> '<' <DIGITS>
                | <SCALAR> '>=' <DIGITS>
                }
Or instead of assignment, add bitshifts with `<SCALAR> '<<' <DIGITS>`. Be well, and write good code.
 

No comments:

Post a Comment