07 January 2018

Tree Surgery

HTML and I tend to get along like a house on fire. I can work with it when the need arises, but I still prefer to do semantic markup and let CSS do with it what the end-user wants to see, which these days is shiny sidebars, blogroll posts and code blocks that neatly let the user just drag-and-select the text they want and copy it into their editor of choice.

As you can see by this posting, I haven't quite gotten there yet. Equally, you can see that I haven't given up altogether, because I'm still here, posting and tweaking things. A couple of months ago (okay, it just feels that way) I figured out that it'd be much simpler for me to take a Perl 6 POD document and add the markup that I need to that.

After some pondering I realized "Wait, isn't there already a Perl 6 POD converter out there? Called perl6 --doc? And can't I re-purpose what it does to generate Blogspot-ready HTML?"

Pod::To::HTML is already out there, but the way the code is written made it hard if not impossible to subclass, because most of the important stuff lived in plain subs, not out in methods where a subclass could easily override it.

So I started rewriting it. A few days later, after grumbling and wondering why this particular bit had been written the way it was, I went onto the usual suspect Perl IRC channel where the author likely hung out, so I could get a useful answer. He wasn't around, but someone pointed me to another module, Pod::TreeWalker, and told me that he too was interested in fixing this.

So, as things happen, conversation started. I rewrote the guts of what I had with Pod::TreeWalker, and found that it did some things I really didn't want it to. All I wanted... hoo boy. I know that phrase all too well, it usually means I'm going to end up rewriting some module only to run into the problems they encountered. Well, once more into the breach, and all that.

So, I decided that while I liked Pod::TreeWalker's style (it's an event-driven system), it did place some limitations on how you could work with it. For example, when a pod-table-start event came along, it was pretty easy to figure out "Hey, this is the start of a POD table, so I can generate a '<table>' string and add it to my running HTML." And this worked fine, for all of about 20 minutes. The trouble was that it also generated events in places where I didn't think it should, such as paragraphs inside a list item, which made the generated HTML look odd.

For instance,

=item foo

generates this sequence, when marshalled back out to HTML:

<li><p>foo</p></li>

What's happening here is that the library sends out:
  • An 'item-start' event, so I tack on '<li>' to the HTML.
    • A 'paragraph-start' event, so I tack on '<p>' to the HTML.
    • A 'text' event, with 'foo' as the text, so I tack on 'foo'.
    • A 'paragraph-end' event, so I tack on '</p>' to the HTML.
  • An 'item-end' event, so I tack on '</li>' to the HTML.
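To make that sequence concrete, here is a minimal sketch that replays the five events and accumulates the HTML. The @events list and %tag table are hypothetical stand-ins written for this illustration; they are not the Pod::TreeWalker API.

```perl6
# Hypothetical stand-in for the event stream: (event name => payload) pairs.
my @events =
    'item-start'      => '',
    'paragraph-start' => '',
    'text'            => 'foo',
    'paragraph-end'   => '',
    'item-end'        => '';

# What gets tacked onto the HTML for each structural event.
my %tag =
    'item-start'      => '<li>',
    'paragraph-start' => '<p>',
    'paragraph-end'   => '</p>',
    'item-end'        => '</li>';

my $html = '';
for @events -> $event {
    # Text events contribute their payload, everything else contributes a tag.
    $html ~= $event.key eq 'text' ?? $event.value !! %tag{ $event.key };
}
say $html;   # <li><p>foo</p></li>
```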
Now, as I've mentioned earlier, I didn't particularly like the fact that the sequencer created a paragraph event, when there's no particular need to do so. But I'm stuck with it. I still have possibilities, though. I can...
  1. Post-process the HTML and do a quick regex to remove the <p>..</p> tags, but, and repeat after me, friends don't let friends run regex on HTML.
  2. When a 'paragraph-start' event is encountered, see if the HTML has <li> already there, and ignore it if so. But see #1.
  3. Hack the module to pass along the "parent" event when firing off an event, so I could look at the "parent" of each paragraph event and ignore the paragraph if its parent is a list item.
  4. Wait a minute, parent... <digs_through_source/> it's already got a tree of POD it's walking, if I pull out just the tree, then it's actually less code to walk the tree, and when I encounter the <p> node I can tell it to look at its parent... right.
Armed with this realization, I sally forth. Lucky for me, Perl 6 POD is already laid out as a tree, so it's pretty simple to start walking it. Now, there are a bunch of straightforward ways to write a walker like this, but I rather prefer to use the multiple-dispatch features of Perl 6 for this purpose.

method walk( $node ) {
  self.visit( $node );
  if $node.^can('contents') {
    for @( $node.contents ) -> $child {
      self.walk( $child );
    }
  }
}

POD nodes are laid out in a pretty simple fashion. They're either "leaves" like text blocks (our infamous paragraph contents) or they're shells, like a list of items. An easy way to tell whether a node is a leaf or not is whether it has 'contents', and we check that with the "^can('contents')" test. We could skip the test and just call the method directly, but every time we called it on a leaf node, we'd get a runtime error. Not good.
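As a rough illustration of the leaf-versus-shell distinction, the sketch below builds by hand the tree I'd expect '=item foo' to parse into (an item holding a paragraph holding a string) and runs the same test on each level. The variable names are made up for this example.

```perl6
# A hand-built tree for '=item foo': item -> paragraph -> text.
my $item = Pod::Item.new(
    level    => 1,
    contents => [ Pod::Block::Para.new( contents => ['foo'] ) ],
);
my $para = $item.contents[0];
my $leaf = $para.contents[0];

# Shells answer to 'contents'; the Str leaf does not.
say $item.^name, ' has contents: ', $item.^can('contents').so;
say $para.^name, ' has contents: ', $para.^can('contents').so;
say $leaf.^name, ' has contents: ', $leaf.^can('contents').so;

# Calling the method blindly on the leaf throws instead.
try $leaf.contents;
say 'leaf.contents failed with ', $!.^name if $!;
```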

Once you know that bit, the code sort of falls into place.
  • Visit $node (AKA "do something with it", like print its text)
  • If it's got children:
    • For each child (that's the "@( $node.contents ) -> $child" bit)
      • Walk over that child.
So your user-declared visit() method will get called once on every node in the entire tree, in depth-first order (each parent before its children), so it's in the perfect order to return HTML to you. Well, almost, but the exceptions aren't worth talking about.
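Here is a small self-contained sketch of that visiting order. The Walker class and its @.seen attribute are hypothetical names for this example; visit() just records each node's type so the order is easy to see.

```perl6
class Walker {
    has @.seen;

    # "Do something" with the node: here, just remember its type name.
    method visit( $node ) { @!seen.push: $node.^name }

    method walk( $node ) {
        self.visit( $node );
        if $node.^can('contents') {
            for @( $node.contents ) -> $child {
                self.walk( $child );
            }
        }
    }
}

# Hand-built tree for '=item foo'.
my $tree = Pod::Item.new(
    level    => 1,
    contents => [ Pod::Block::Para.new( contents => ['foo'] ) ],
);

my $w = Walker.new;
$w.walk( $tree );
say $w.seen.join(', ');   # Pod::Item, Pod::Block::Para, Str
```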

Great, we can walk the tree in depth-first order, and we've got a handy visit() method that'll do something. We can even add a $.html attribute that we can accumulate our HTML into as we go along, problem solved!

has Str $.html;
method visit( $node ) {
  if $node ~~ Pod::Block::Table {
    $!html ~= '<table>'; # .. hey, wait a minute...
  }
}

Hold the phone, this just tells me when we've encountered, say, a Table node. I wanted to be able to write something when a table starts, and when it ends. And I wanted to know what the table's parent was, like we talked about lo those many paragraphs ago.

No worries, we're really close, honest. I'll change the walk() method just to show you.

method walk( $node, $parent = Nil ) {
  self.start-node( $node, $parent );
  if $node.^can('contents') {
    for @( $node.contents ) -> $child {
      self.walk( $child, $node );
    }
  }
  self.end-node( $node, $parent );
}

The '= Nil' gives $parent a default value, so you can call walk() without having to specify a parent. In your code you can just call walk($pod) without anything special, and Perl 6 will fill in the missing argument for you.
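A tiny sketch of the default in action, using a hypothetical describe() sub rather than the real walker: when the second argument is left out, $parent ends up undefined, which is how the top of the tree can be recognized.

```perl6
# Hypothetical helper, only here to show the default parameter.
sub describe( $node, $parent = Nil ) {
    $parent.defined
        ?? "$node is a child of $parent"
        !! "$node is the root";
}

say describe( 'doc' );           # doc is the root
say describe( 'para', 'doc' );   # para is a child of doc
```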

Also, you'll see that the generic visit() call is gone; in its place there's now a start-node($node,$parent) and an end-node($node,$parent) call. We can easily use them like this:

has $.html;
method start-node( $node, $parent ) {
  given $node {
    when Pod::Block::Table { $!html ~= '<table>' }
  }
}
method end-node( $node, $parent ) {
  given $node {
    when Pod::Block::Table { $!html ~= '</table>' }
  }
}

And voilà! start-node() gets called when a table starts, and its companion end-node() gets called after all of the table's contents have been walked, so we can write in the '</table>' tag at the right time. And we can even check out the table's parent node at $parent. If there isn't one, then we're at the top of the tree!
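The same pattern is what finally takes care of the stray paragraph inside a list item. Below is a minimal sketch, assuming a hypothetical ItemWalker class that only knows about items, paragraphs and text, and that skips HTML escaping entirely: a paragraph checks its $parent and stays quiet when that parent is a Pod::Item.

```perl6
class ItemWalker {
    has Str $.html = '';

    method walk( $node, $parent = Nil ) {
        self.start-node( $node, $parent );
        if $node.^can('contents') {
            for @( $node.contents ) -> $child {
                self.walk( $child, $node );
            }
        }
        self.end-node( $node, $parent );
    }

    method start-node( $node, $parent ) {
        given $node {
            when Pod::Item        { $!html ~= '<li>' }
            # Only emit <p> when the paragraph is not directly inside an item.
            when Pod::Block::Para { $!html ~= '<p>' unless $parent ~~ Pod::Item }
            when Str              { $!html ~= $node }
        }
    }

    method end-node( $node, $parent ) {
        given $node {
            when Pod::Item        { $!html ~= '</li>' }
            when Pod::Block::Para { $!html ~= '</p>' unless $parent ~~ Pod::Item }
        }
    }
}

# Hand-built tree for '=item foo'.
my $item = Pod::Item.new(
    level    => 1,
    contents => [ Pod::Block::Para.new( contents => ['foo'] ) ],
);

my $in-item = ItemWalker.new;
$in-item.walk( $item );
say $in-item.html;   # <li>foo</li>

# A paragraph at the top of the tree still gets its tags.
my $alone = ItemWalker.new;
$alone.walk( Pod::Block::Para.new( contents => ['bar'] ) );
say $alone.html;     # <p>bar</p>
```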

There are a few minor downsides to this, though. For one, every time we learn about a new Pod node, we're going to have to update both the start-node() and end-node() methods. But we can fix that simply. Perl 6 lets us dispatch methods by type, using the multi keyword. So, let's try that.

has $.html;
multi method start-node( Pod::Block::Table $node, $parent ) { $!html ~= '<table>' }
multi method end-node( Pod::Block::Table $node, $parent ) { $!html ~= '</table>' }

Much less noise, and Perl 6 will know exactly how to dispatch our types. But when the code out in the wild encounters a new Pod node that we didn't know about, it'll break with a horrible stacktrace, so let's fix that right now.

has $.html;
multi method start-node( $node, $parent ) { die "Unknown node " ~ $node.^name }
multi method end-node( $node, $parent ) { die "Unknown node " ~ $node.^name }

There, now our code will gracefully die when it encounters a node that it's never seen before, and report exactly what the node is so that when someone makes a bug report on GitHub we'll know what to do.
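Put together, the pieces might look something like the sketch below. MultiWalker is a hypothetical class for this post only (it is not the upcoming module), it handles just list items and plain text, and it leans on the untyped candidates as the catch-all for everything else.

```perl6
class MultiWalker {
    has Str $.html = '';

    method walk( $node, $parent = Nil ) {
        self.start-node( $node, $parent );
        if $node.^can('contents') {
            for @( $node.contents ) -> $child {
                self.walk( $child, $node );
            }
        }
        self.end-node( $node, $parent );
    }

    # One candidate per node type we know about...
    multi method start-node( Pod::Item $node, $parent ) { $!html ~= '<li>' }
    multi method start-node( Str $node, $parent )       { $!html ~= $node }
    # ...and an untyped fallback that reports anything else.
    multi method start-node( $node, $parent ) { die "Unknown node " ~ $node.^name }

    multi method end-node( Pod::Item $node, $parent ) { $!html ~= '</li>' }
    multi method end-node( Str $node, $parent )       { }
    multi method end-node( $node, $parent ) { die "Unknown node " ~ $node.^name }
}

# A known shape: an item holding text directly.
my $known = MultiWalker.new;
$known.walk( Pod::Item.new( level => 1, contents => ['foo'] ) );
say $known.html;   # <li>foo</li>

# An unknown shape: this walker has no candidate for paragraphs.
my $unknown = MultiWalker.new;
try $unknown.walk( Pod::Block::Para.new( contents => ['bar'] ) );
say $!.message;    # Unknown node Pod::Block::Para
```

The typed candidates win over the untyped one whenever they match, so adding support for a new node type means adding two one-line methods instead of editing a growing given block.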

Now, I should reveal that my upcoming Pod::To::HTMLBody module doesn't quite work like this. I do use some of these techniques behind the scenes, and ultimately I walk the tree almost exactly in the same way, but I've done things differently for several different reasons. I guess you'll have to wait for the next part of this article to learn what's going on, and what new challenges I faced making this particular module.

Until then, this is your humble author signing off.
