
General Specification Syntax

We use a simple Extended Backus-Naur Form (EBNF) notation, based on the one specified by the W3C and extended with additional regular expression conventions.

This particular notation was chosen because it is:

  1. Mostly well-specified, by a respected organization.
  2. Syntactically very similar to common Regular Expression syntax.

Each Production in the grammar defines one symbol, using the following form:

symbol ::= expression

To improve clarity for the language implementer, we use the following capitalization convention, which is guided by the production’s role in the Abstract Syntax Tree (AST):

Within the expression on the right-hand side of a rule, the following expressions are used to match strings of one or more characters:

These symbols may be combined to match more complex patterns as follows, where A and B represent simple expressions:

Other notations used in the productions are:

/* ... */ comment.

[ wfc: ... ] well-formedness constraint; this identifies by name a constraint on well-formed documents associated with a production.

Lookahead Assertions

To improve precision, we extend the W3C’s EBNF notation with lookahead assertions. This syntax represents “looking ahead” in the source stream: it attempts to match the subsequent input against the given pattern, but it does not consume any of the input. Whether or not the match succeeds, the current position in the input stays the same.

This extension is taken directly from existing regular expression notations, such as the web standard documented by MDN.
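As an illustration of the "match without consuming" behavior, here is a minimal sketch using Python's standard `re` module, which supports the same `(?=...)` and `(?!...)` forms. It demonstrates the regular expression convention only; the patterns and variable names are made up for this example and are not part of the grammar.

```python
import re

# Positive lookahead: match "foo" only when "bar" follows.
# The lookahead checks for "bar" but does not consume it.
positive = re.compile(r"foo(?=bar)")
match = positive.match("foobar")
matched_text = match.group()   # only "foo" is part of the match
position = match.end()         # the position stops before "bar"

# Negative lookahead: match a single quote only when the next
# character is not another single quote.
negative = re.compile(r"'(?!')")
quote_then_letter = negative.match("'a")   # succeeds
quote_then_quote = negative.match("''")    # fails: a quote follows
```

In the first case the match ends at index 3, so a later pattern still sees `bar` as unread input.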

Character Classes

To improve readability and properly handle Unicode, we further extend the W3C’s EBNF notation with regex-style character classes based on Unicode properties.

Common regex character class shortcuts, such as \n, are used. The \p{...} syntax matches characters belonging to a specific Unicode general category or script.

More information about character classes can be found on MDN and TC39.

Short-code Legend

Abb.  Description
\n    ASCII newline character 0A
\d    ASCII digit character [0-9]
\s    Unicode whitespace other than \n

\p-code Legend

Abb.  Long form               Abb.  Long form               Abb.  Long form
L     Letter                  S     Symbol                  Z     Separator
Lu    Uppercase Letter        Sm    Math Symbol             Zs    Space Separator
Ll    Lowercase Letter        Sc    Currency Symbol         Zl    Line Separator
Lt    Titlecase Letter        Sk    Modifier Symbol         Zp    Paragraph Separator
Lm    Modifier Letter         So    Other Symbol            C     Other
Lo    Other Letter            P     Punctuation             Cc    Control
M     Mark                    Pc    Connector Punctuation   Cf    Format
Mn    Non-Spacing Mark        Pd    Dash Punctuation        Cs    Surrogate
Mc    Spacing Combining Mark  Ps    Open Punctuation        Co    Private Use
Me    Enclosing Mark          Pe    Close Punctuation       Cn    Unassigned
N     Number                  Pi    Initial Punctuation
Nd    Decimal Digit Number    Pf    Final Punctuation
Nl    Letter Number           Po    Other Punctuation
No    Other Number
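
To make the \p-codes concrete, the following sketch looks up a character's Unicode general category with Python's standard `unicodedata` module. The helper `in_category` is a hypothetical name introduced for this example. It covers general categories only, not scripts, and it is not how Ribbon itself implements character classes.

```python
import unicodedata


def in_category(ch: str, code: str) -> bool:
    """Return True if ch belongs to the general category named by code.

    A one-letter code such as "L" matches every subcategory (Lu, Ll, ...),
    mirroring how \\p{L} covers all letters. Hypothetical helper.
    """
    return unicodedata.category(ch).startswith(code)


upper_is_lu = in_category("A", "Lu")        # uppercase letter
lower_is_letter = in_category("a", "L")     # any letter
digit_is_nd = in_category("5", "Nd")        # decimal digit number
space_is_zs = in_category(" ", "Zs")        # space separator
underscore_is_pc = in_category("_", "Pc")   # connector punctuation
digit_is_letter = in_category("5", "L")     # a digit is not a letter
```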

Lexical vs. Syntactic Grammar

It is important to understand the relationship between the lexer (tokenizer) and the parser in the context of this grammar. Ribbon’s design deliberately keeps the lexer’s role minimal.

Therefore, once we move beyond the basic tokens defined in the Lexical Grammar, the EBNF productions in this document define the syntactic grammar. They describe the valid sequences of tokens that the parser accepts, so each production should be read from the parser’s perspective as it consumes its token stream.

Example
Symbol ::= "'" Sequence (?! "'")

This production describes the following parser behavior:

Match a Punctuation token containing a single quote, followed by a Sequence token, but only if the next token in the stream is not another single-quote Punctuation token.

This approach allows the grammar to precisely specify parsing logic, including lookahead, while keeping the lexical analysis phase simple and fast.
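
The parser behavior described for the example production can be sketched as a small function over a token list. This is an illustrative sketch only: the `Token` class, the token kind strings, and `parse_symbol` are assumptions made for this example and do not reflect Ribbon's actual parser.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Token:
    kind: str  # hypothetical token kinds: "Punctuation", "Sequence"
    text: str


def is_quote(tokens: List[Token], pos: int) -> bool:
    """True if the token at pos is a single-quote Punctuation token."""
    return (
        pos < len(tokens)
        and tokens[pos].kind == "Punctuation"
        and tokens[pos].text == "'"
    )


def parse_symbol(tokens: List[Token], pos: int = 0) -> Optional[Tuple[str, int]]:
    """Sketch of: Symbol ::= "'" Sequence (?! "'")

    Returns (sequence text, next position) on success, or None on failure.
    """
    if not is_quote(tokens, pos):
        return None
    if pos + 1 >= len(tokens) or tokens[pos + 1].kind != "Sequence":
        return None
    # Negative lookahead: inspect the next token without consuming it.
    if is_quote(tokens, pos + 2):
        return None
    # Only the quote and the Sequence were consumed.
    return tokens[pos + 1].text, pos + 2


quote = Token("Punctuation", "'")
name = Token("Sequence", "foo")
accepted = parse_symbol([quote, name])          # no token follows
rejected = parse_symbol([quote, name, quote])   # a closing quote follows
```

On success the returned position points at the token after the Sequence, which shows that the lookahead inspected the following token without consuming it.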