Parsing with Riders

A Plain English compiler, as you might expect, needs to parse text simply, flexibly, and quickly. We do that with a thing we call a rider. It’s a generalized version of an idea (with the same name) that Niklaus Wirth taught us about in his magnificent Oberon system. To understand our implementation and use of riders, however, you must first understand the way we implement…

Strings

The prototype string record, in Plain English, is defined like this:

A string has a first byte pointer and a last byte pointer.

The data bytes of a string are dynamically allocated in a contiguous sequence on the Heap. A string is considered blank if the first byte pointer is nil (or greater than the last byte pointer). When a string consists of just one-byte, the pointers are equal. The length of a string is thus the last byte’s address minus the first byte’s address plus one. Our compiler generates all the code and calls necessary to manage string memory.

And that brings us to…

Substrings

substring is any contiguous part of a string, up to and including the whole string. In Plain English it is defined like this:

A substring is a string.

In other words, the substring type not only looks like a string (with a first byte pointer and a last byte pointer), but is also compatible with string type, which allows most of our string routines to operate on either type.

And now we’re ready to talk about…

Riders

rider, in Plain English, is defined like this:

A rider has an original substring, a source substring and a token substring.

When we…

Slap a rider on a string.

…the pointers in the original and source substrings are set to span the entire string. At the same time, the token’s first byte pointer is set to the first byte of the string, and the token’s last byte pointer is set to nil (making the token substring initially blank, but ready to be extended).

Now when we…

Bump a rider.

…we add 1 to the source’s first byte pointer, and add 1 to the token’s last byte pointer. It thus appears that we have “moved” a byte of the source into the token, though no physical movement of string data has actually taken place. Very fast, even on huge strings. We keep the original substring pointers intact so we can check, as necessary, to make sure we don’t fall off either end of the source.

Note that we can peek back at previous bytes in the source string simply by subtracting from the source’s first byte pointer.

Note also that we can have as many riders on a string as we need, so we can parse different parts of the source at the same time.

Note, thirdly, that we can save substrings of the source of any length (as just two pointers) so we can process them later without having to hunt them down again in the source.

Finally, note that we can code up a wide variety of “Move a rider” routines to extract any kind of token from any kind of source. Here, from our compiler, are some…

Examples

Our compiler ignores spaces, tabs, linefeeds, carriage returns, and other noise between meaningful characters. When we find ourselves sitting on a character that doesn’t interest us, we use this routine to move past it:

To move a rider (code rules – noise):
Bump the rider.
If the rider’s source is blank, exit.
If the rider’s source’s first’s target is noise, repeat.

comment, in Plain English, starts with a backslash and ends at the end of a line. When we find ourselves sitting on a comment (ie, the first remaining byte of the rider’s source is a backslash), we suck it up into a token using this routine:

To move a rider (code rules – comment):
Bump the rider.
If the rider’s source is blank, exit.
If the rider’s source’s first’s target is not the return byte, repeat.

remark, in Plain English, is an “inline” comment surrounded by square brackets. When we find ourselves sitting on a remark, we call this guy:

To move a rider (code rules – remark):
If the rider’s source is blank, exit.
If the rider’s source’s first’s target is the return byte, break.
If the rider’s source’s first’s target is the left-bracket byte,
add 1 to a count.

If the rider’s source’s first’s target is the right-bracket byte,
subtract 1 from the count.

Bump the rider.
If the count is 0, break.
Repeat.

Remarks can be nested, so that routine is a little more complex.

When we find ourselves at the beginning of a literal string (ie, the first remaining byte of the source is a double-quote mark), we call this routine to suck the string into a token:

To move a rider (code rules – string):
Bump the rider.
If the rider’s source is blank, exit.
If the rider’s source’s first’s target is the return byte, exit.
If the rider is on any nested double-quote, bump the rider; repeat.
If the rider’s source’s first’s target is the double-quote byte,
bump the rider; exit.

Repeat.

Note that we don’t allow strings to span multiple lines (to avoid common errors), and that we use doubled-up double-quotes in string to allow for double quotes within strings. For example…

“This is a string with the next word “”in”” double quote marks.”

…is interpreted like this:

This is a string with the next word “in” double quote marks.

Qualifiers are used to distinguish special cases of similar routines. In Plain English, they’re enclosed in parentheses. This is the “move a rider” routine that handles qualifiers:

To move a rider (code rules – qualifier):
If the rider’s source is blank, exit.
If the rider’s source’s first’s target is the return byte, break.
If the rider’s source’s first’s target is the left-paren byte,
add 1 to a count.

If the rider’s source’s first’s target is the right-paren byte,
subtract 1 from the count.

Bump the rider.
If the count is 0, break.
Repeat.

Qualifiers can also be nested, so we have to take that into account as we did with remarks.

Punctuation marks in Plain English are all single characters, so we just suck ’em up into the token:

To move a rider (code rules – mark):
Bump the rider.

Possessives typically come at the end of names and can either be an apostrophe followed by the letter “s”, or an apostrophe all by itself, if the preceding letter is “s”. This is the routine that parses possessives:

To move a rider (code rules – possessive):
Bump the rider.
If the rider’s source is blank, exit.
If the rider’s source starts with “s”, bump the rider.

Below is the higher-level routine that calls the above routines, and one more at the end:

To move a rider (code rules):
Position the rider’s token on the rider’s source.
If the rider’s source is blank, exit.
If the rider’s source’s first’s target is noise,
move the rider (code rules – noise); exit.
If the rider’s source’s first’s target is the backslash byte,
move the rider (code rules – comment); exit.
If the rider’s source’s first’s target is the left-bracket byte,
move the rider (code rules – remark); exit.
If the rider’s source’s first’s target is the double-quote byte,
move the rider (code rules – string); exit.
If the rider’s source’s first’s target is the left-paren byte,
move the rider (code rules – qualifier); exit.
If the rider’s source’s first’s target is any mark,
move the rider (code rules – mark); exit.
If the rider is on any possessive,
move the rider (code rules – possessive); exit.
Move the rider (code rules – glom).

glom is any character or collection of characters that can be processed at the next level up — the semantic, rather than the syntactic level. Since noise, comments, remarks, strings, qualifiers, punctuation marks and possessives have been weeded out at this point, the “move a rider” routine for gloms is quite simple:

To move a rider (code rules – glom):
Bump the rider.
If the rider’s source is blank, exit.
If the rider is on any possessive, exit.
If the rider’s source’s first’s target is any glom byte, repeat.

We have to check for possessives a second time in case one pops up at the end of a glom.

Glom bytes are defined as follows:

To decide if a byte is any glom byte:
If the byte is any letter, say yes.
If the byte is any digit, say yes.
If the byte is the tilde byte, say yes.
If the byte is the at-sign byte, say yes.
If the byte is the number-sign byte, say yes.
If the byte is the percent-sign byte, say yes.
If the byte is the ampersand byte, say yes.
If the byte is the underscore byte, say yes.
If the byte is the single-quote byte, say yes.
If the byte is the dash byte, say yes.
If the byte is the cross byte, say yes.
If the byte is the slash byte, say yes.
Say no.

I imagine you get the idea. Using riders in this way, we can simply, flexibly, and quickly parse our source files without having to think about too much at any one time.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s