Adding Comments and Exponentiation

Part 2 of Parsing and Evaluating Mathematical Expressions

It may be useful to allow comments in expressions. Since these are, by definition, ignored by the parser, adding comments only requires a change in the tokenizer. Defining the syntax of comments to ensure they are recognized and then treating them as white space will take care of this addition to the parser.

Three mathematical operations will be added. Integer division and modulo division require minimal changes to the grammar. The more useful addition to the parser, exponents, 12³ for example, requires substantial changes to the grammar so that the parser will need to be overhauled. An additional, fourth, function to be called Exponent will be added to handle the new syntactic element and some changes to the previous three methods Term, Expression and Factor must be made.

Multi-character operators

The start of a comment will be marked with a ‘//’ and the end of the comment will be the end of the line. The tokenizer must recognize operators that may be two characters long. This is easily done by introducing the Peek method that returns the character following the current token.

The Peek function is easily implemented because the tokenizer’s private field FNextCharPos already points to the first character following the current token. The only subtlety involved is the possibility that the end of the source has already been found. In that case, Peek will return a #0 character which is fine because that can never be matched with the second character of an operator.

function TMdTokenizer.Peek: char; begin if FNextCharPos > length(FSource) then result := #0 else result := FSource[FNextCharPos]; end;

When the operator ‘/’ is found, the procedure GetOperator uses Peek to check if the following characters is also a ‘/’. If it is not, then a ttDiv operator was found and that is what is returned. If Peek finds a ‘/’ then the token type returned is ttComment and the comment is parsed to the end of the current line.

procedure GetOperator(c: char); begin MoveByOneChar; {skip c} case c of '-': result := ttMinus; '+': result := ttPlus; '/': if Peek <> '/' then result := ttDiv else begin result := ttComment; MoveToEndOfLine; end; ...

The MoveToEndOfLine method is invoked just after the comment deliminator ‘//’ has been found. Its task is to move the FNextCharPos to the first character following the comment to the end of the current line. This task is somewhat complicated by the various sequences of characters that are used on different systems to mark the end of a line. The following routine should work when the end of lines are marked by any combination of a carriage return (CR) and line feed (LF).

procedure TMdTokenizer.MoveToEndOfLine; begin while (FNextCharPos <= length(FSource)) and not (FSource[FNextCharPos] in [LF, CR]) do MoveByOneChar; while (FNextCharPos <= length(FSource)) and (FSource[FNextCharPos] in [LF, CR]) do MoveByOneChar; end;

Initially, all characters that are not end of line markers are accumulated in the token and then all characters that are either a CR or LF are added to the token. This means that all empty lines following the comment will be included in the latter.

It could be useful to keep track of the line and column count of each token when parsing long mathematical expressions over multiple lines. That way the user could be informed of the line and column where a syntax error is found. In that case, it would be necessary to count the number of end of line markers found in the second while loop. This could be done in the MoveByOneChar routine.

procedure TMdTokenizer.MoveByOneChar; begin if FSource[FNextCharPos] = LF then begin inc(FLineCount); FColCount := 0; end; inc(FNextCharPos); inc(FTokenLength); end;

Note the addition of two fields: FLineCount and FColCount. This is by no means the complete story with respect to end of lines. The routine would work on Windows, Linux and OS X systems but not on older Apple systems. Here is quick survey of the ways end of lines are identified:

a carriage return (CR) (as in Apple II and Mac OS up to version 9)
a line feed (LF) (Unix and similar systems including Linux, OS X) or by
the sequence CR LF (CP/M, MS-DOS, Windows, internet protocols etc.).

Unicode defines other codes beside CR and LF that can indicate an end of line (see Mark Davis’ Unicode Newline Guidelines).

To date, the parsers presented in this series of paper have not been used to parse multiple lines of input. The readers should consider the above as suggestions that need improvement. Also note that in Windows, character positioning in a control is done by an index from the first character without reference to line numbers which means there is no real need to keep track of them.

Wirth Syntax Notation

Adding exponentiation requires modification of the simple grammar of allowed mathematical expressions. Since the changes to the code are short, this is a good time to introduce a formal syntax notation called Wirth syntax notation (WSN), proposed by Niklaus Wirth in 1977.

Here is an example of a "production", basically a definition using WSN.

Digit = "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9" .

Four symbols are used in the above definition. The equals sign (=) indicates a production where the element to the left of the = is defined by the elements to the right. The terminal period (.) denotes the end of the definition. Items enclosed in quotation marks (") are literals. The vertical bar (|) stands for disjunction or. A digit is thus any one character from 0 to 9.

To simplify things, Wirth adds three constructs, curly brackets that denote repetition, square brackets that denote optional elements, and parenthesis that serve to group items.

Repetition is denoted by curly brackets: {a} stands for <none> | a | aa | aaa | ….
Optionality is expressed by square brackets: [a]b stands for ab | b.
Groupings are indicated by parenthesis: (a|b)c stands for ac | bc.

Here’s the definition of a number in WSN:

Number = Digit{Digit} [ "." {Digit} [("E" | "e") ["-" | "+"] Digit{Digit}]].

Here is an alternative definition:

Integer = Digit {Digit} . SignedInteger = ["-" | "+"] Integer . Number = Integer [ "." {Digit} [ ("E" | "e") SignedInteger]] .

The sequence Digit {Digit}, may be slightly unsettling at first blush. It is necessary to specify that a number, or an integer, must start with at least one digit. The rule {Digit} does represent a sequence of digits containing at least one digit since it could be empty.

Using WSN, here is a definition of our simple grammar as developed so far:

Term = Factor { ("*" | "/" ) Factor } . Expression = Term { ("-" | "+") Term } . Factor = Number | "(" Expression ")" | ("-" | "+") Factor .

Division

The result of a division such as 7 ÷ 3 yields a quotient 2 and a remainder 1 such that the dividend is equal to sum of the remainder and the quotient multiplied by the divisor: 7 = 3*2 + 1. It may be useful to obtain the quotient or the remainder. Two operators will be introduced to return these results. The quotient will be represented by the infix operator ‘:’ , the remainder by the infix operator ‘%’. Hence 7:3 = 2 and 7%3 = 1.

These operations have the same precedence as the normal division. Hence the production of a term is now

Term = Factor { ("*" | "/" | ":" | "%" ) Factor } .

They will be handled in the Term routine.

function Term: float; begin result := Exponent; repeat if Tokenizer.TokenType = ttMult then begin Tokenizer.NextTrueToken; result := result * Exponent; end else if Tokenizer.TokenType = ttDiv then begin Tokenizer.NextTrueToken; result := result / Exponent; end else if Tokenizer.TokenType = ttColon then begin Tokenizer.NextTrueToken; result := DivE(result, Exponent); end else if Tokenizer.TokenType = ttPercent then begin Tokenizer.NextTrueToken; result := ModE(result, Exponent); end else if Tokenizer.TokenType = ttUnknown then RaiseParseException(peUnexpectedElement) else begin exit; end; until false; end;

The usual names for the functions are div and mod. They are found in Delphi. However, the domain of these functions in Pascal is limited to integers. It was decided to implement the functions instead of using Pascal’s primitive functions. For one, there are a number of definitions and implementations of the functions and it appears as if Delphi does not follow the specified definition in the ISO standard for Pascal. That is probably in itself not a bad thing as the definition found in the Pascal standard has been questioned (Boote (1992) and Leijen (2001)). But it does mean that the result would depend on the compiler used. It would also be nice to be able to obtain a quotient of 2 when 8.2 is divided by 4.1. Specially because the parser does not have a specific integer type, all numbers are real values.

Both Boote and Leijen seem to favour what they call the Euclidean definition of the functions, but testing shows that Delphi now implements what both authors call the truncated definition. The table below shows the difference between the two definition. D is the dividend or numerator, d is the divisor or numerator, q the quotient (an integer) and r the remainder.

		Euclidean			Truncated
D	d	q	q×d	r	q	q×d	r
8	3	2	6	2	2	6	2
8	−3	−2	6	2	−2	6	2
−8	3	−3	−9	1	−2	−6	−2
−8	−3	3	−9	1	2	−6	−2

According to the Euclidean definition, q×d ≤ D whatever the signs of D and d. With the truncated definition q×d ≤ D if D > 0 and q×d ≥ D if D < 0 whatever the sign of d. Of course, in all cases, q and r are such that D = q×d + r. Which is better? While there are mathematical reasons to prefer the Euclidean definition, a programmer may very well prefer the truncated definition which will agree with Delphi for integer arguments. In any case, the latter was chosen be default but that behaviour can be modified by removing the definition of the compiler directive DELPHI_DIV_MOD in the file Parser.inc.

Exponentiation

As in LibreOffice.Calc, the caret "^" will be used to denote exponentiation which is the raising of a number to a given power. So ‘3^2’ will denote 3² which is 9. Users from with more intimate knowledge of Fortran or Perl might prefer the symbol "**". The tokenizer will be modified to return both strings as a token of type ttRaise.

procedure GetOperator(c: char); begin MoveByOneChar; {skip c} case c of '%': result := ttPercent; '(': result := ttLeftParenthesis; ')': result := ttRightParenthesis; '*': if Peek <> '*' then result := ttMult else begin result := ttRaise; MoveByOneChar; end; '+': result := ttPlus; '-': result := ttMinus; '/': if Peek <> '/' then result := ttDiv else begin result := ttComment; MoveToEndOfLine; end; ':': result := ttColon; '^': result := ttRaise; else result := ttUnknown; end; end;

Wirth’s syntax notation will now be used to add exponents to the simple grammar developed in the previous paper.

Exponent = Factor { ("^" | "**") Exponent } . Term = Exponent { ("*" | "/" ) Exponent } . Expression = Term { ("-" | "+") Term } . Factor = Number | "(" Expression ")" | ("-" | "+") Factor .

There is no change to the definitions of a factor and an expression. A term which was a sequence of factors is now a sequence of exponents separated by either a "*" or "/" instead of a sequence of factors. That is a simple change to make :

While the expected definition of an exponent could have been

Exponent = Factor { ("^" | "**") Factor }

it was not chosen. The reason is that, according to that definition, 4^3^2 would yield (4^3)^2 = 4 096. Using the usual mathematical notation leads to a different answer

4^3^2 = 4^9 = 262 144 not = (4^3)^2 .

This is really a personal choice, because both definitions are used: in Excel, 4^3^2 = 4 096 and in Perl 4**3**2 yields 262 144. Here is the function for exponents. It’s the simplest of the functions implementing the syntactic elements.

function Exponent: float; begin result := Factor; while Tokenizer.TokenType = ttRaise do begin Tokenizer.NextTrueToken; result := Power(result, Exponent); end; end;

If Excel like behaviour is preferred, then the directive EXPONENT_RIGHT_ASSOCIATIVE can be undefined in the file Parser.inc. The effect will be to change the line

result := Power(result, Exponent);

result := Power(result, Factor);

Precedence

There is another ambiguity that needs to be addresses: the meaning of the expression -3^2. Written in standard mathematical notation −32 is usually interpreted as meaning −(32) = −9. However the parser will actually yield 9 as a result. This is because the unary − has the highest precedence. Looking at the grammar, the unary minus is handled in the Factor routine.

Operator	Operation	Priority
()	parenthesis	5
-	negation	4
^ ( or ** )	exponentiation	3
*	multiplication	2
/	division	2
:	integer division	2
%	modulo	2
+	addition	1
-	subtraction	1

Operators with higher priority are done first.

Except for exponentiation, when operations of the same priority are done in succession, they are grouped left to right. Hence 2+3−1 is actually computed as (2+3)−1 and 2−3+1 is computed as (2−3)+1. Similarly 8*2/4 is interpreted as (8*2)/4 = 4 and 8/2*4 is calculated as (8/2)*4 = 16.

Again it is best to use parenthesis to control the precedence of operations when in doubt.

Some Coding Details

The function Power in the implementation of Exponent is not defined in Pascal. It is however in the unit Math.pas which unfortunately was not delivered with all versions of Delphi. A simple implementation of Power is provided in the unit MiniMath.pas. Define or undefine the directive USE_MINIMATH in Parser.inc to control which unit will be used.

{The Math unit is not included with some versions of Delphi; define this directive to use MiniMath.pas instead. Note that the implementation in Math.pas is probably better.} {$DEFINE USE_MINIMATH}

Homework

2. Modify the tokenizer to allow for inline comments that are delineated
a) { comment } - Pascal style
b) (* comment *) - older Pascal style
c) /* comment */ - C style.
Of course 2.a) cannot be combined with question 1.

3. Modify the tokenizer or parser to use the words div and mod instead of the operators ‘:’ and ‘%’.

Source code

The demonstration program compiles with Delphi version 2 and up. It should also compile with Lazarus/FreePascal with minimal changes. Unit tests require DUnit. The tests were done in Delphi 2010.

Bibliography

Boute, Raymond T. (1992) "The Euclidean Definition of the Functions Div and Mod", ACM Transactions on Programming Languages and Systems Volume 14 Issue 2, pp 127-144. http://dl.acm.org/citation.cfm?id=128862

Davis, Mark Unicode Newline Guidelines, http://unicode.org/reports/tr13/tr13-9.html

Leijen, Daan (2001) "Division and Modulus for Computer Scientists.", http://legacy.cs.uu.nl/daan/pubs.html

Wirth, Niklaus (1977), "What Can We Do about the Unnecessary Diversity of Notation for Syntactic Definitions?", Communications of the ACM, Vol 20, No 11 pp 822, 823.