modgrammar – The Modular Grammar Engine

This module provides a full-featured pure-python framework for building tokenizing LR language parsers and interpreters for context-free grammars. (The modgrammar parsing engine is implemented as a recursive-descent parser with backtracking, using an object-oriented grammar model.)

The modgrammar parser is designed such that language grammars can be defined in python modules using standard python syntax. To create a new grammar, simply create a new class definition (or multiple class definitions) derived from the Grammar base class, and set its grammar attribute to a list of sub-grammars to match. (Such definitions can be combined together into full grammar trees.) Several basic pre-defined grammar constructs are also available in this module which larger grammars can be built up from.

Once a grammar is defined, the parser() method can be called on the toplevel grammar class to obtain a GrammarParser object, which can be used to parse text against the defined grammar.

Grammar Classes

class modgrammar.Grammar

This class is not intended to be instantiated directly. Instead, it is a base class to be used for defining your own custom grammars.

Subclasses of Grammar serve two purposes. The first is to actually define the grammar used for parsing. The second is to serve as a base type for result objects created as the result of parsing.

To define a new grammar, you should create a new class definition, descended from Grammar. In this class definition, you can override several class attributes and class methods to customize the behavior of the grammar.

Class Attributes

Grammar.grammar

A list of sub-grammars used to make up this grammar. This attribute almost always needs to be specified when defining a new grammar class, and is the way in which full grammars can be built up from smaller grammar elements. In order for this grammar to match an input text, it must completely match each of the sub-grammars listed in its grammar attribute, in order.

Grammar.grammar_tags

A list of “tags” to be associated with match result objects produced by this grammar. These tags can be used with the find_tag() and find_tag_all() methods to extract specific elements from a parse tree after a successful match.

Grammar.grammar_collapse

If set to True, indicates that this grammar element should be “collapsed” when constructing the final parse tree. Any place in the parse tree where an instance of this grammar would normally occur will instead be replaced by the list of its sub-elements. This can be used to make result parse trees simpler in cases where a grammar element has been defined for convenience of the grammar definition, but does not provide a lot of useful information in the resulting parse tree.

Grammar.grammar_name

A string used to identify this grammar. This is used in a variety of places, including repr(), str(), generate_ebnf(), etc. (Defaults to the name of the grammar class.)

Grammar.grammar_desc

A descriptive string for the grammar to be used in ParseError error messages. (Defaults to the same value as grammar_name.)

Grammar.grammar_error_override

If set, this grammar will report all match failures by its subgrammars as if it had failed itself. This effectively “hides” the subgrammars in any ParseError (which will use this grammar’s location and grammar_desc instead when constructing error messages, etc).

Grammar.grammar_whitespace

If set to True (the default), this grammar will automatically skip over any whitespace found between its sub-grammars (it will be “whitespace consuming”). If set to False, whitespace between elements will not be treated specially.

In the case where you want a grammar to be “whitespace consuming” but want something other than the normal definition of “whitespace”, you can also set grammar_whitespace to a custom regular expression object to be used instead. This regular expression should attempt to match as much whitespace as possible, starting at the specified position in the string (the actual match result is not used, except that its length is used to determine how far to skip ahead in the string).

Note: In general, you will want to set this universally for your whole grammar. The best way to do this is to define a grammar_whitespace module-level variable in the same module as your grammar classes are defined. If this is present, it will be used as the default for all grammar classes in that module.

There are also a few less-commonly-used class attributes which may be useful when inspecting grammars, or may be overridden in special cases:

Grammar.grammar_terminal

Determines whether this grammar is considered to be a “terminal” for the purposes of terminals(), tokens(), and generate_ebnf(). By default, any grammar which contains no sub-grammars (an empty grammar attribute) is considered to be a terminal, and any grammar which has sub-grammars is not.

Grammar.grammar_greedy

If set to True (default), indicates that in cases where this grammar could match multiple instances of a sub-text (i.e. for grammars that match repetitions), it should attempt to match the longest possible string first. By contrast, if set to False, the grammar will attempt to match the shortest repetition first.

Note: This attribute does not have any affect on most custom grammars (because most custom grammars are not themselves repetition grammars (instances of Repetition)). If you are looking to change this behavior in your own grammar definitions, you likely want to use the collapse parameter of REPETITON() (and related functions) instead. Changing this attribute is mainly useful if for some reason you want to make a custom subclass of Repetition, or if you are making a custom grammar element (with a custom grammar_parse() definition) for which this setting might be significant.

Grammar.grammar_collapse_skip

Specifies that, if an enclosing grammar is set to collapse, and this grammar is in its sub-grammar list, instances of this sub-grammar should also be left out of the resulting parse tree.

Note: There is usually no reason to set this attribute. (It is enabled by default for LITERAL() grammars, as it is often desirable to leave literal matches out when collapsing grammars since they usually provide no information which isn’t already known to the grammar designer.)

Overridable Class Methods

The following methods may be overridden in subclasses to change the default behavior of a grammar class:

classmethod Grammar.grammar_details(depth=-1, visited=None)

Returns a string containing a description of the contents of this grammar definition (such as used by repr()).

depth specifies a recursion depth to use when constructing the string description. If depth is nonzero, this method will recursively call grammar_details() for each of its sub-grammars in turn to construct the final description. If depth is zero, this method will just return this grammar’s name (grammar_name). If depth is negative, recursion depth is not limited.

visited is used for detecting circular references during recursion. It can contain a tuple of grammars which have already been seen and should not be descended into again.

classmethod Grammar.grammar_ebnf_lhs(opts)

Determines the string to be used for this grammar when it occurs in the left-hand-side (LHS) of an EBNF definition. This can be overridden to customize how this grammar is represented by generate_ebnf().

Returns a tuple (string, grammars), where string is the EBNF LHS string to use, and grammars is a list of other grammars on which this one depends (i.e. grammars whose names are referenced in string).

classmethod Grammar.grammar_ebnf_rhs(opts)

Determines the string to be used to describe this grammar when it occurs in the right-hand-side (RHS) of an EBNF definition. This can be overridden to customize how this grammar is represented by generate_ebnf().

Returns a tuple (string, grammars), where string is the EBNF RHS string to use, and grammars is a list of other grammars on which this one depends (i.e. grammars whose names are referenced in string).

classmethod Grammar.grammar_parse(text, index, sessiondata)

This method is called by the modgrammar parser system to actually attempt to match this grammar against a piece of text. This method is not intended to be called directly by an application (use the parser() method to obtain a GrammarParser object and use that). In advanced cases, this method can be overridden to provide custom parsing behaviors for a particular grammar type.

NOTE: Overriding this method is very complicated and currently beyond the scope of this documentation. This is not recommended for anyone who does not understand the modgrammar parser design very well. (Someday, with luck, there will be some more documentation written on this advanced topic.)

Grammar.grammar_collapsed_elems(sessiondata)

Note: This is an instance method, not a classmethod

Return the list of elements to be used in place of this one when collapsing (this is only used if grammar_collapse is True).

Useful Class Methods

The following methods are intended to be called on grammar classes by the application:

classmethod Grammar.parser(sessiondata=None, tabs=1)

Return a GrammarParser associated with this grammar.

If provided, sessiondata can contain data which should be provided to the elem_init() method of each result object created during parsing.

The tabs parameter indicates the width of “tab stops” in the input (i.e. how far a “tab” character will advance the column position when encountered). This is only used to correctly report column numbers in ParseErrors. If you don’t care about that, or your input does not contain tabs, you can ignore this parameter.

classmethod Grammar.grammar_resolve_refs(refmap={}, recurse=True, follow=False, missing_ok=False, skip=None)

Resolve any REF() declarations within the grammar and replace them with the actual sub-grammars they refer to. The following optional arguments can be provided:

refmap
If provided, contains a dictionary of reference-name to grammar mappings to use. If a reference’s name is found in this dictionary, the dictionary value will be used to replace it, instead of using the standard name-lookup procedure.
recurse
If set to True, will perform a recursive search into each of this grammar’s sub-grammars, calling grammar_resolve_refs() on each with the same parameters.
follow
If set to True (and recurse is also True), will also call grammar_resolve_refs() on the result of each REF() after it is resolved.
missing_ok
If True, it is not considered an error if a REF() construct cannot be resolved at this time (it will simply be left as a REF() in the resulting grammar). If False, then all references must be resolvable or an UnresolvedReference exception will be raised.
skip
An optional list of grammars which should not be searched for REF() constructs (useful in conjunction with recurse to exclude certain parts of the grammar).

Result Objects

As match result objects are actually instances of the grammar class which produced the match, it is also possible, when defining a new grammar class, to override or add new instance methods which will affect the behavior of any associated result objects. Result objects also posess a number of attributes and methods which can be useful when examining parse results.

Overridable Instance Methods

Grammar.elem_init(sessiondata)

This method is called on each result object after it is fully initialized, before the resulting parse tree is returned to the caller. It can be overridden to perform any custom initialization desired (the default implementation does nothing).

Useful Instance Attributes

Grammar.string

Contains the portion of the text string which this match corresponds to.

Grammar.elements

Contains result objects for each of the sub-grammars that make up this grammar match. There is typically one entry in elements for each entry in grammar (though there may not be a direct correspondence if things like grammar_collapse are used)

Useful Instance Methods

Grammar.get(*type_path)

Return the first immediate sub-element of the given type (or by descending through multiple types, in the same way as get_all()).

This is equivalent to .get_all(*type_path)[0] except that it is more efficient, and will return None if there are no such objects (instead of raising IndexError).

Grammar.get_all(*type_path)

Return all immediate sub-elements of the given type.

If more than one type parameter is provided, it will treat the types as a “path” to traverse: for all sub-elements matching the first type, retrieve all sub-elements of those matching the second type, and so on, until it reaches the last type in the list. (It will thus return all elements of the parse tree which are of the final type, which can be reached by traversing the previous types in order.)

Grammar.find(*type_path)

Return the first element anywhere in the parse tree matching the given type (or by descending through multiple types, in the same way as find_all()).

This is equivalent to .find_all(*type_path)[0] except that it is more efficient, and will return None if there are no such objects (instead of raising IndexError).

Grammar.find_all(*type_path)

Return all elements anywhere in the parse tree matching the given type.

Similar to get_all(), if more than one type parameter is provided, it will treat the types as a “path” to traverse in order. The difference from get_all() is that, for each step in the path, the elements found do not have to be direct sub-elements, but can be anywhere in the sub-tree.

Grammar.find_tag(*tag_path)

Return the first element anywhere in the parse tree with the given tag (or by descending through multiple tags, in the same way as find_tag_all()).

This is equivalent to .find_tag_all(*tag_path)[0] except that it is more efficient, and will return None if there are no such objects (instead of raising IndexError).

Grammar.find_tag_all(*tag_path)

Return all elements anywhere in the parse tree with the given tag.

This functions identically to find_all(), except that the criteria for matching is based on tags, rather than object types.

Grammar.terminals()

Return an ordered list of all result objects in the parse tree which are terminals (that is, where grammar_terminal is True).

Grammar.tokens()

Return the parsed string, broken down into its smallest grammatical components. (Another way of looking at this is that it returns the string values of all of the terminals().)

Parser Objects

class modgrammar.GrammarParser

Parser objects are the way in which an application can actually make use of a grammar definition. They provide the core interface to take input texts and attempt to match them against an associated grammar definition.

GrammarParser objects are not generally instantiated directly. Instead, to obtain one, call the parser() method on the appropriate grammar class.

Parser objects have the following useful attributes:

char

The number of characters we’ve successfully parsed since the beginning of parsing (or the last reset()).

line

The number of lines we’ve successfully parsed since the beginning of parsing (or the last reset()). This is measured based on the number of line-end sequences we’ve seen thus far.

col

The position of the current line we’re at.

Methods:

parse_string(string, bol=None, eof=None, reset=False, multi=False, data=None, matchtype='first')

Attempt to match string against the associated grammar. If successful, returns a corresponding match object. If there is an incomplete match (or it is impossible to determine yet whether the match is complete or not), save the current text in the match buffer and return None to indicate more text is required. If the text does not match any valid grammar construction, raise ParseError.

Optional parameters:
reset
Call reset() before starting to parse the supplied text.
multi
Instead of returning a single match result, keep matching as many times as possible before returning, and return a list of matches, in sequence.
eof
Indicates that no more text will be coming, and the parser should return the best match it can with the supplied text instead of asking for more. (If eof is set, the parser will never return a None result, unless the buffer is completely empty.)
data
Use the provided data instead of the default sessiondata during this parse run.
matchtype
If a grammar could match multiple ways, determine how the best match is chosen:
“first” (default)
The first successful match the grammar comes up with (as determined by normal grammar test ordering).
“last”
The last successful match.
“longest”
The match which uses up the longest portion of the input text.
“shortest”
The match which uses up the shortest portion of the input text.
“all”
Return all possible matches, in a list. Note that in this case the buffer position will not be automatically advanced. You must call skip() manually.
bol
Treat the input text as starting at the beginning of a line (for the purposes of matching the BOL grammar element). It is not usually necessary to specify this explicitly.
parse_lines(lines, bol=False, eof=False, reset=False, data=None, matchtype='first')

(generator method)

Attempt to match a list (or actually any iterable) of strings against the associated grammar. This is effectively the same as calling parse_string() repeatedly for each string in the list to obtain all matches in sequence.

Return values, exceptions, and optional parameters are all exactly the same as for parse_string().

Note: Be careful using matchtype="all" with parse_lines/parse_file. You must manually call skip() after each yielded match, or you will end up with an infinite loop!

parse_file(file, bol=False, eof=True, reset=False, data=None, matchtype='first')

(generator method)

Open and process the contents of a file using the associated grammar. This is basically the same as opening the specified file, and passing the resulting file object to parse_lines().

Return values, exceptions, and optional parameters are all exactly the same as for parse_string().

Note: Be careful using matchtype="all" with parse_lines/parse_file. You must manually call skip() after each yielded match, or you will end up with an infinite loop!

remainder()

Return the remaining contents of the parse buffer. After parsing, this will contain whatever portion of the original text was not used by the parser up to this point.

clear_remainder()

Clear any un-matched text left in the buffer.

reset()

Reset this parser back to its initial state.

This will clear any remainder in the buffer and reset all (line, column, etc) counters to zero.

skip(count)

Skip forward the specified number of characters in the input buffer (discarding the text skipped over).

Built-In Grammars

The following basic grammar classes/factories are provided from which more complicated grammars can be constructed. For those that take arguments, in addition to the arguments listed, there are a number of standard keyword arguments which can also be provided to alter the default behaviors:

  • For any grammars which involve repetition, the min and max parameters can be used to change the minimum and maximum number of repetitions which are allowed. count can also be used to set min and max to the same value.

  • There are also several standard keyword parameters which correspond to the standard class attributes for the Grammar class. Setting these keyword arguments will have the same effect as if the corresponding class attribute had been specified in a class definition:

    Keyword

    Class Attribute

    collapse

    grammar_collapse

    collapse_skip

    grammar_collapse_skip

    greedy

    grammar_greedy

    tags

    grammar_tags

    whitespace

    grammar_whitespace

modgrammar.GRAMMAR(*subgrammars, **kwargs)

Allows the construction of “anonymous grammars”, that is, creating a grammar without explicitly defining a named class derived from the Grammar superclass. This can be useful for some simple grammars where a full class definition is not necessary.

subgrammars is a list of other grammars which the new grammar should be made up of, the same as would be given as the grammar attribute in a grammar class definition.

modgrammar.G(*subgrammars, **kwargs)

This is a synonym for GRAMMAR()

modgrammar.LITERAL(string, **kwargs)

Create a simple grammar that only matches the specified literal string. Literal matches are case-sensitive.

modgrammar.L(string, **kwargs)

This is a synonym for LITERAL()

modgrammar.WORD(startchars, restchars=None, **kwargs)

Match any text consisting of a sequence of the specified characters. If restchars is not provided, all characters in the sequence must be in the set specified by startchars. If restchars is provided, then startchars specifies the valid options for the first character of the sequence, and restchars specifies the valid options for all following characters.

startchars and restchars are each strings containing a sequence of individual characters, or character ranges, in the same format used by python regular expressions for character-range ([]) operations (i.e. "0123456789" or "A-Za-z"). If the first character of startchars or restchars is ^, the meaning is also inverted, just as in regular expressions, so "^A-Z" would match anything except an upper-case ascii alphabet character.

modgrammar.ANY_EXCEPT(charlist, **kwargs)

Match a string of any characters except those listed in charlist.

This is functionally equivalent to WORD("^"+charlist).

modgrammar.OR(*grammars, **kwargs)

An either-or grammar that will successfully match if any of its subgrammars matches. OR() grammars can also be created by combining other grammars in python expressions using the or operator (|).

Note: Each of the possible grammars are attempted in left-to-right order. This means that if more than one of the listed grammars could potentially match, the leftmost one will always match first.

modgrammar.EXCEPT(grammar, exc_grammar, **kwargs)

Match grammar, but only if it does not also match exception_grammar. (This is equivalent to the - (exception) operator in EBNF) EXCEPT() grammars can also be created by combining other grammars in python expressions using the except operator (-).

Note: In many cases there are more efficient ways to design a particular grammar than using this construct. It is provided mostly for full EBNF compatibility.

modgrammar.REPEAT(*grammar, **kwargs)

Match (by default) one-or-more repetitions of grammar, one right after another. If the min or max keyword parameters are provided, the number of matches can be restricted to a particular range.

modgrammar.OPTIONAL(*grammar, **kwargs)

Specify that grammar is optional. It will match if present, or it will match the empty string if grammar cannot be matched.

If grammar is present in the matched input text, this element of the parse tree will contain a single grammar match object (the same as it would have if GRAMMAR(*grammar) had matched). If grammar was not found, this element of the parse tree will contain None.

This construct is functionally equivalent to OR(grammar, EMPTY). (It is is also functionally similar to REPEAT(*grammar, min=0, max=1, collapse=True), except that in the case of REPEAT(), an empty-match produces no elements at all in the resulting parse tree (not even None).)

modgrammar.ZERO_OR_MORE(*grammar, **kwargs)

This is a synonym for REPEAT(*grammar, min=0)

modgrammar.ONE_OR_MORE(*grammar, **kwargs)

This is a synonym for REPEAT(*grammar, min=1)

modgrammar.LIST_OF(*grammar, sep=", ", **kwargs)

Match a list consisting of repetitions of grammar separated by sep. As with other repetition grammars, the min and max keywords can also be used to restrict the number of matches to a certain range.

Note: Although this is most commonly used with a literal separator (such as the default ","), actually any (arbitrarily-complex) subgrammar can be specified for sep if desired.

modgrammar.NOT_FOLLOWED_BY(*grammar, **kwargs)

Returns a successful match as long as the next text after this point does NOT match the specified grammar.

When successful (that is, the next text in the input does not match the specified grammar), this element of the parse tree will contain None, and no input text will be consumed. When unsuccessful (that is, the next text does match), a ParseError will be raised.

modgrammar.ANY

Match any single character.

modgrammar.EMPTY

Match the empty string.

Note: In most cases, None is also equivalent to EMPTY

modgrammar.BOL

Match the beginning of a line.

This grammar does not actually consume any of the input text, but can be used to ensure that the next token must occur at the beginning of a new line (i.e. either the beginning of the file, or following an EOL).

modgrammar.EOL

Match the end of a line.

This grammar will match most common forms of line-end (newline, carriage-return, carriage-return + newline, or newline + carriage-return). If you need something more specific, you may just want to use LITERAL() instead.

modgrammar.EOF

Match the end of the file.

Note: This grammar will only match if the parse function is called with eof=True to indicate the end-of-file has been encountered.

modgrammar.REST_OF_LINE

Match everything up to (but not including) the next EOL.

modgrammar.SPACE

Match any string of whitespace.

Note: This may not match as you expect if your grammar is whitespace-consuming (see the grammar_whitespace attribute).

The modgrammar.extras module also contains some additional built-in grammars which can be useful in some contexts.

References

In some cases, it is necessary to refer to a portion of your grammar before it has actually been defined (for example, for recursive grammar definitions). In this case, the REF() function can be used to refer to a grammar by name, which will be resolved to an actual grammar later. (This construct can also be used to define a grammar which includes some “user-defined” sub-grammar, which the calling application can then provide at runtime.)

modgrammar.REF(ref_name, module=DEFAULT, default=None)

Create a reference to a grammar named ref_name, to be resolved later.

This can either be resolved by calling grammar_resolve_refs() prior to parsing, or, alternately, modgrammar will automatically attempt to resolve any REF() whenever it is used in parsing, and will treat it the same as if it were actually an occurrence of the resolved grammar.

By default, resolving a reference involves searching for a grammar class with the same name in the same python module. The python module is determined based on the location where the REF() call occurred. If you wish to use a different module to look for the grammar this REF() refers to, it can be provided in the module parameter. If module is given as None, then no module will be searched.

If provided, default should contain a grammar which will be used if the given reference cannot be resolved.

Exceptions

exception modgrammar.ParseError

Raised by the parser when the provided text cannot be matched against the grammar.

This exception has several useful attributes:
grammar

The top-level grammar the parser was attempting to match.

buffer

The contents of the text buffer the parser was attempting to match against the grammar.

pos

The position within the buffer at which the problem occurred.

char

The (total parsing) character position at which the problem occurred (similar to the GrammarParser.char attribute).

line

The line at which the problem occurred (similar to the GrammarParser.line attribute).

col

The column position within the line at which the problem occurred (similar to the GrammarParser.col attribute).

expected

A list of possible sub-grammars which the parser expected to find at this position (but didn’t).

message

The text message which would be printed if this exception were printed. (This is of the form “Expected ...: Found ...”)

exception modgrammar.ReferenceError

This is the base class for UnknownReferenceError and BadReferenceError. It can be used to easily catch either exception.

exception modgrammar.UnknownReferenceError

An attempt was made to resolve a REF() reference, but no grammar with the given name could be found, and no default was provided in the REF() declaration.

exception modgrammar.BadReferenceError

An attempt was made to resolve a REF() reference, and the reference name was resolved to an object, but the object is not a valid grammar object.

exception modgrammar.InternalError

This exception is raised by the parser if something happens which should never happen. It usually indicates that a grammar with a custom grammar_parse() definition has done something it shouldn’t.

Miscellaneous

modgrammar.generate_ebnf(grammar, **opts)

(generator function)

Take a given grammar and produce a description of the grammar in Extended Backus-Naur Form (EBNF). This generator produces fully-formatted output lines suitable for writing to a file, etc.

As there are a few different variants of EBNF in common use, as well as some aspects which could be considered a matter of preference when producing such descriptions, this function also accepts a variety of configurable options, specified as keyword parameters:

wrap (default 80)
Wrap the output text at wrap columns.
align (default True)
Align each entry so that all of the RHSes start on the same column.
indent (default True)
The number of characters that subsequent (wrapped) lines should be indented. Can be set to either a number or to True. If set to True, the indent will be auto-calculated to line up with the position of the RHS in the first line.
expand_terminals (default False)
If grammars have subgrammars, show their expansion even if grammar_terminal is true.
special_style (default “desc”)
How some grammars (which can’t be easily represented as EBNF) should be represented inside EBNF “special sequences”. Valid options are “desc” (use the (human-readable) grammar_desc text), “name” (just use the grammar’s name), or “python” (use a repr-like syntax similar to the python syntax used to create them).

Additional options may also be offered by certain individual grammars.