It may be useful to allow comments in expressions. Since these are, by definition, ignored by the parser, adding comments only requires a change in the tokenizer. Defining the syntax of comments to ensure they are recognized and then treating them as white space will take care of this addition to the parser.
Three mathematical operations will be added. Integer division and modulo division require minimal changes to the grammar. The more useful addition to the parser, exponents, 123 for example, requires substantial changes to the grammar so that the parser will need to be overhauled. An additional, fourth, function to be called Exponent
will be added to handle the new syntactic element and some changes to the previous three methods Term
, Expression
and Factor
must be made.
Multi-character operators
The start of a comment will be marked with a ‘//’ and the end of the comment will be the end of the line. The tokenizer must recognize operators that may be two characters long. This is easily done by introducing the Peek
method that returns the character following the current token.
The Peek
function is easily implemented because the tokenizer’s private field FNextCharPos
already points to the first character following the current token. The only subtlety involved is the possibility that the end of the source has already been found. In that case, Peek
will return a #0 character which is fine because that can never be matched with the second character of an operator.
When the operator ‘/’ is found, the procedure GetOperator
uses Peek
to check if the following characters is also a ‘/’. If it is not, then a ttDiv
operator was found and that is what is returned. If Peek
finds a ‘/’ then the token type returned is ttComment
and the comment is parsed to the end of the current line.
The MoveToEndOfLine
method is invoked just after the comment deliminator ‘//’ has been found. Its task is to move the FNextCharPos
to the first character following the comment to the end of the current line.
This task is somewhat complicated by the various sequences of characters that are used on different systems to mark the end of a line. The following routine should work when the end of lines are marked by any combination of a carriage return (CR) and line feed (LF).
Initially, all characters that are not end of line markers are accumulated in the token and then all characters that are either a CR or LF are added to the token. This means that all empty lines following the comment will be included in the latter.
It could be useful to keep track of the line and column count of each token when parsing long mathematical expressions over multiple lines. That way the user could be informed of the line and column where a syntax error is found. In that case, it would be necessary to count the number of end of line markers found in the second while loop. This could be done in the MoveByOneChar
routine.
Note the addition of two fields: FLineCount
and FColCount
. This is by no means the complete story with respect to end of lines. The routine would work on Windows, Linux and OS X systems but not on older Apple systems. Here is quick survey of the ways end of lines are identified:
- a carriage return (CR) (as in Apple II and Mac OS up to version 9)
- a line feed (LF) (Unix and similar systems including Linux, OS X) or by
- the sequence CR LF (CP/M, MS-DOS, Windows, internet protocols etc.).
Unicode defines other codes beside CR and LF that can indicate an end of line (see Mark Davis’ Unicode Newline Guidelines).
To date, the parsers presented in this series of paper have not been used to parse multiple lines of input. The readers should consider the above as suggestions that need improvement. Also note that in Windows, character positioning in a control is done by an index from the first character without reference to line numbers which means there is no real need to keep track of them.
Wirth Syntax Notation
Adding exponentiation requires modification of the simple grammar of allowed mathematical expressions. Since the changes to the code are short, this is a good time to introduce a formal syntax notation called Wirth syntax notation (WSN), proposed by Niklaus Wirth in 1977.
Here is an example of a "production", basically a definition using WSN.
Four symbols are used in the above definition. The equals sign (=) indicates a production where the element to the left of the = is defined by the elements to the right. The terminal period (.) denotes the end of the definition. Items enclosed in quotation marks (") are literals. The vertical bar (|) stands for disjunction or. A digit is thus any one character from 0 to 9.
To simplify things, Wirth adds three constructs, curly brackets that denote repetition, square brackets that denote optional elements, and parenthesis that serve to group items.
- Repetition is denoted by curly brackets: {a} stands for <none> | a | aa | aaa | ….
- Optionality is expressed by square brackets: [a]b stands for ab | b.
- Groupings are indicated by parenthesis: (a|b)c stands for ac | bc.
Here’s the definition of a number in WSN:
Here is an alternative definition:
The sequence Digit {Digit}, may be slightly unsettling at first blush. It is necessary to specify that a number, or an integer, must start with at least one digit. The rule {Digit} does represent a sequence of digits containing at least one digit since it could be empty.
Using WSN, here is a definition of our simple grammar as developed so far:
Division
The result of a division such as 7 ÷ 3 yields a quotient 2 and a remainder 1 such that the dividend is equal to sum of the remainder and the quotient multiplied by the divisor: 7 = 3*2 + 1. It may be useful to obtain the quotient or the remainder. Two operators will be introduced to return these results. The quotient will be represented by the infix operator ‘:’ , the remainder by the infix operator ‘%’. Hence 7:3 = 2 and 7%3 = 1.
These operations have the same precedence as the normal division. Hence the production of a term is now
They will be handled in the Term routine.
The usual names for the functions are div and mod. They are found in Delphi. However, the domain of these functions in Pascal is limited to integers. It was decided to implement the functions instead of using Pascal’s primitive functions. For one, there are a number of definitions and implementations of the functions and it appears as if Delphi does not follow the specified definition in the ISO standard for Pascal. That is probably in itself not a bad thing as the definition found in the Pascal standard has been questioned (Boote (1992) and Leijen (2001)). But it does mean that the result would depend on the compiler used. It would also be nice to be able to obtain a quotient of 2 when 8.2 is divided by 4.1. Specially because the parser does not have a specific integer type, all numbers are real values.
Both Boote and Leijen seem to favour what they call the Euclidean definition of the functions, but testing shows that Delphi now implements what both authors call the truncated definition. The table below shows the difference between the two definition. D is the dividend or numerator, d is the divisor or numerator, q the quotient (an integer) and r the remainder.
Euclidean | Truncated | ||||||
D | d | q | q×d | r | q | q×d | r |
8 | 3 | 2 | 6 | 2 | 2 | 6 | 2 |
8 | −3 | −2 | 6 | 2 | −2 | 6 | 2 |
−8 | 3 | −3 | −9 | 1 | −2 | −6 | −2 |
−8 | −3 | 3 | −9 | 1 | 2 | −6 | −2 |
According to the Euclidean definition, q×d ≤ D whatever the signs of D and d. With the truncated definition q×d ≤ D if D > 0 and q×d ≥ D if D < 0 whatever the sign of d. Of course, in all cases, q and r are such that D = q×d + r. Which is better? While there are mathematical reasons to prefer the Euclidean definition, a programmer may very well prefer the truncated definition which will agree with Delphi for integer arguments. In any case, the latter was chosen be default but that behaviour can be modified by removing the definition of the compiler directive DELPHI_DIV_MOD
in the file Parser.inc
.
Exponentiation
As in LibreOffice.Calc, the caret "^" will be used to denote exponentiation which is the raising of a number to a given power. So ‘3^2’ will denote 32 which is 9. Users from with more intimate knowledge of Fortran or Perl might prefer the symbol "**". The tokenizer will be modified to return both strings as a token of type ttRaise
.
Wirth’s syntax notation will now be used to add exponents to the simple grammar developed in the previous paper.
There is no change to the definitions of a factor and an expression. A term which was a sequence of factors is now a sequence of exponents separated by either a "*" or "/" instead of a sequence of factors. That is a simple change to make :
While the expected definition of an exponent could have been
it was not chosen. The reason is that, according to that definition, 4^3^2 would yield (4^3)^2 = 4 096. Using the usual mathematical notation leads to a different answer
.
This is really a personal choice, because both definitions are used: in Excel, 4^3^2 = 4 096 and in Perl 4**3**2 yields 262 144. Here is the function for exponents. It’s the simplest of the functions implementing the syntactic elements.
If Excel like behaviour is preferred, then the directive EXPONENT_RIGHT_ASSOCIATIVE can be undefined in the file Parser.inc. The effect will be to change the line
to
Precedence
There is another ambiguity that needs to be addresses: the meaning of the expression -3^2. Written in standard mathematical notation −32 is usually interpreted as meaning −(32) = −9. However the parser will actually yield 9 as a result. This is because the unary − has the highest precedence. Looking at the grammar, the unary minus is handled in the Factor routine.
Operator | Operation | Priority |
---|---|---|
() | parenthesis | 5 |
- | negation | 4 |
^ ( or ** ) | exponentiation | 3 |
* | multiplication | 2 |
/ | division | 2 |
: | integer division | 2 |
% | modulo | 2 |
+ | addition | 1 |
- | subtraction | 1 |
Operators with higher priority are done first.
Except for exponentiation, when operations of the same priority are done in succession, they are grouped left to right. Hence 2+3−1 is actually computed as (2+3)−1 and 2−3+1 is computed as (2−3)+1. Similarly 8*2/4 is interpreted as (8*2)/4 = 4 and 8/2*4 is calculated as (8/2)*4 = 16.
Again it is best to use parenthesis to control the precedence of operations when in doubt.
Some Coding Details
The function Power in the implementation of Exponent is not defined in Pascal. It is however in the unit Math.pas which unfortunately was not delivered with all versions of Delphi. A simple implementation of Power is provided in the unit MiniMath.pas. Define or undefine the directive USE_MINIMATH in Parser.inc to control which unit will be used.