2014-05-04, 13:50
md
Adding Constants and Functions
Part 4 – National Language Support-> <-Part 2 – Adding Comments and Exponentiation
 Part 3 of Parsing and Evaluating Mathematical Expressions 

Scientific calculators tend to have a plethora of built in functions which usually include the trigonometric functions (sin, cos, tan, etc.), hyperbolic functions (sinh, cosh, tanh, etc.) , their inverses, and some mathematical functions (abs, sqrt, ln etc.). Also constants such as pi and Euler’s number e have a dedicated key. Perhaps many more are included especially if the calculator is used by scientists.

Adding such functions and constants is the subject of this third article on parsing and evaluating mathematical expressions. However, there is no question of adding buttons here. To paraphrase GNU Octave’s documentation, functions are named calculations and constants are named numbers. The parser will be modified to recognize these names and to call the routine that performs the calculation or retrieves the number.

The addition of built-in functions and constants requires a small change to the tokenizer so that it will recognize the names of the functions. It also requires the addition of a unit that will contain the definitions of the built-in functions. Finally, the grammar and parser must be modified to allow calling the built-in functions. These changes are considered in turn.

Identifiers

It is no longer possible to use single letter operators to identify all the possible functions that could be added to the parser. Indeed, functions and constants will be named. Thus a new type of syntactic element called an identifier must be introduced. It is actually simply defined, using Wirth’s syntax notation:

Digit = "0" | ... | "9" . Letter = "a" | ... | "z"|"A" | ... | "Z" . Identifier = Letter { Letter | Digit } .

According to that definition, an identifier is any sequence of letters and digits that begins with a letter. Some languages include the underscore "_" as a letter and some allow the period "." to be included in an identifier as long as it is not the first character. Since no other token type begins with a letter, lexical analysis of the source remains simple. The tokenizer can still use the first letter to identify the token type and then accumulate the subsequent letters and digits.

identifier.

Only a couple of lines need to be added to the NextToken method. After it is determined that the first character of a token is not a space, nor a digit (which would be the start of a number) a test is made to see if it is a letter. In the affirmative, the letter marks the beginning of an identifier which is fetched with the GetIdentifier procedure.

function NextToken: TMdTokenType; procedure GetWhite; ... remains unchanged procedure GetNumber; ... remains unchanged procedure GetOperator(c: char); ... remains unchanged procedure GetIdentifier; begin result := ttIdentifier; MoveByOneChar; while (FNextCharPos <= length(FSource)) and isAlphaNumeric(FSource[FNextCharPos]) do MoveByOneChar; end var c: char; begin // move past current token FTokenStart := FNextCharPos; FTokenLength := 0; if FNextCharPos > length(FSource) then result := ttEOS else begin c := FSource[FTokenStart]; if isWhite(c) then GetWhite else if isDigit(c) then GetNumber else if isLetter(c) then GetIdentifier else GetOperator(c); end; FTokenType := result; end;

That procedure merely accumulates in the token all subsequent letters and digits. The Boolean function isAlphaNumeric(c: char) returns true if c is a digit or a letter. As can be seen, there really is not much involved in the lexical analysis of identifiers.

Built-in Functions

Most of the functions that will be added to the parser are simple transformations: real functions of a real variable. But one, log, is defined over two variables another, clamp, has three arguments. There are also some functions such as avg or var that have an undetermined number of variables.

The grammar will allow constructs such as min(sin(pi/2), cos(pi/2)). However when it is time to calculate the value of the functions, expressions such as pi/2 must be calculated and the result will be supplied to sin and cos. Similarly, min will be called with two numerical values.

All built-in functions and constants are descendants of an abstract class called TTarget. The name was chosen because these classes will be the targets of a Call method of the parser. There is only one property in the abstract class, Count which is the number of parameters or arguments of the target. For constants this will be equal to 0. For most functions, Count is a fixed value, usually 1. For that reason the virtual method SetCount in TTarget does nothing. The abstract method Calc will be called each time the value of a function or constant is needed.

Type TTarget = class private FCount: integer; protected procedure SetCount(value: integer); virtual; public function Calc: float; virtual; abstract; property Count: integer read FCount write SetCount; end; TConstant = class(TTarget) private FValue: float; public constructor create(value: float); function Calc: float; override; end;

The simplest targets are constants. They have an additional private field called FValue which is set when the constant is created. The Calc method merely returns that value.

function TConstant.Calc: float; begin result := FValue; end;

Most built-in functions are implementations of the class TFunction.

Type TFunction = class(TTarget) private FArgs: pFloatArray; function GetArg(index: integer): float; public constructor create(aCount: integer); property Args[index: integer]: float read GetArg; property ArgArray: pFloatArray read FArgs write FArgs; end;

When created, the specified number of expected arguments, aCount, is copied into the internal field FCount and cannot be changed. The property ArgArray is a pointer set by the calling method to an array of real numbers created on the heap by the latter. ArgArray must point to aCount = Count values. The property Args which uses the private GetArg function to access the values in the array is used by the descendants. Strictly speaking this is not necessary and FArgs^[i] could be used just as well as Args[i] when calculating the function’s value. Its just not as pretty.

Here is the implementation of the log function. By default, the name of all built-in functions and constants begins with an underscore instead of the traditional T. Built-in functions and constants are all defined in the implementation part of the unit Targets.pas. None of the details concerning built-in functions need to be exposed. When looking at that lengthy unit, it is useful to remember the simple convention used. If the identifier for a function is ‘foo’, the associated class is called _foo.

Type _log = class(TFunction) public function Calc: float; override; end; function _log.Calc: float; begin result := logn(Args[1], Args[0]); end;

When creating this function, the constructor will be called with aCount = 2:

_log.create(2)

Most functions with a fixed number of arguments are just as simple or even simpler. Functions with a variable number of arguments are descendants of the type TVariableFunction.

TVariableFunction = class(TFunction) private FMinCount: integer; protected procedure SetCount(value: integer); override; public constructor create(aMinCount: integer); property MinCount: integer read FMinCount; end;

There is an extra field in this class called MinCount. It is set to the minimum number of arguments the function requires. The class' constructor sets the number of expected arguments FCount to 0. The class also overrides the SetCount method and allows the caller to specify the number of arguments that were found. Here is the implementation of the average function which shows how Count is used in those functions.

function _avg.Calc: float; var i: integer; begin result := 0; for i := 0 to Count-1 do result := result + args[i]/count; end;

All built-in functions have to be registered, as it were, so that the parser knows their name and the address of the function that calculates the value of the function. This is a simple matter of storing the names of the functions in a string list along with a record which contains the number of parameters for the function and the address of the function to execute when it needs to be evaluated.

Type TTargetList = class private FList: TStrings; public constructor Create; destructor Destroy; override; procedure Add(const aName: string; target: TTarget); procedure Clear; function Delete(const aName: string): boolean; function Find(const aName: string; var target: TTarget): boolean; function HasTarget(const aName: string): boolean; end; var InternalTargets: TTargetList;

The most used methods of this class are Add and Find. An internal function is registered by calling the Add method with the function’s name and it’s instantiated class. Adding the _log function defined above is done with the following statement:

InternalTargets.add('log', _log.create(2));

If an error occurs, because a function with the given name is already in the list, the target is freed and an exception is raised.

Use of the list is straight forward. When the parser finds an identifier, it searches for it with the method InternalTargets.Find. If the identifier is found, Find will return true and the matching function or constant in the variable target. Otherwise it returns false. It is up to the parser to raise an exception in that case.

While the usual functions are added to InternalTargets in Targets’ initialization code, it is possible to add or delete a function to the list at any time. The demonstration program shows how to do so. However this is not the same as having the user add a function that he defines at run time. That complication is considered in a later article.

Parsing a built-in Function

Very little needs to be added to the parser. Two productions need to be added to the grammar, the optional list of arguments of a function which will be called Params is defined as follows in Wirth syntax notation :

Params = "(" Expression { ListChar Expression} ")"

where ListChar is a locale dependant list separator. In English language systems this is usually the comma, in French language systems it is the semi-colon. A call to a function is a statement of the following form

Call = Identifier [ Params ]

A call to a function without any parameters is actually a call to a constant. A call is equivalent to a number. While it may require very convoluted calculations, eventually a called function must resolve into a single number and the call itself can occur only in a factor:

Factor = Number | "(" Expression ")" | Call | ("-" | "+") Factor .

So the only method of the parser that needs to be changed is Factor.

... if Tokenizer.TokenType = ttNumber then begin result := StrToNumber(Tokenizer.Token); Tokenizer.NextTrueToken; end else if Tokenizer.TokenType = ttIdentifier then begin result := Call; end else if Tokenizer.TokenType = ttPlus then begin Tokenizer.NextTrueToken; // skip and ignore leading '+' result := Factor; ...

The function Call must take care of parsing the argument list, if there is one, and evaluate the function and return it’s value. This is easily the most complicated routine in the whole parser. To simplify the work is divided into two parts, parsing the argument list is done by another function called Params. That method must handle an unknown number of arguments. So the idea is relatively simple, parse expressions separated by the ListChar until a right parenthesis is found and return the number of number of arguments found as well as their values. Here is its code.

function Params(var FoundValues: pFloatArray): integer; var list: TList; i: integer; aValue: float; apfloat: pFloat; procedure FreeList; var i: integer; begin for I := 0 to List.Count - 1 do dispose(pFloat(List[I])); list.Free; list := nil; end; begin if Tokenizer.NextTrueToken <> ttLeftParenthesis then RaiseParseException(peExpectedLeftPar); Tokenizer.NextTrueToken; FoundValues := nil; result := 0; list := TList.Create; try repeat avalue := Expression; new(aPFloat); aPFloat^ := aValue; list.Add(aPFloat); if Tokenizer.TokenType = ttRightParenthesis then begin // skipping ')' will be done in Call. result := list.Count; if result > 0 then begin getmem(FoundValues, result*sizeof(float)); for i := 0 to result - 1 do FoundValues^[i] := pFloat(List[i])^; end; FreeList; exit; end; if Tokenizer.TokenType <> ttSeparator then RaiseParseException(peExpectedSeparator); Tokenizer.NextTrueToken; // skip ',' until false; except FreeList; freemem(FoundValues); raise; end; end;

The major complication is that the number of arguments is not known because the grammar allows for functions with variable numbers of parameters. The best solution would probably be to use a dynamic array that grows to hold the values of arguments as they are found. However, Delphi 2 did not have such animals so a TList was used. As each expression is found, it is evaluated and its result is stored on the heap and a pointer to it is added to the list :

avalue := Expression; new(aPFloat); aPFloat^ := aValue; list.Add(aPFloat);

Once a right parenthesis is found the process ends and an array of floats, called FoundValues, is allocated on the heap. Then the values stored in the TList are copied into the array. The function ends by destroying the TList which is no longer required. If an error occurs not only the TList must be destroyed, but the array FoundValues must be freed also assuming it was allocated on the heap.

It is now possible to examine the function Call.

function Call: float; var ArgList: pFloatArray; ArgCount: integer; Target: TTarget; begin if not InternalTargets.Find(Tokenizer.Token, Target) then RaiseParseException(peUnknownIdentifier); ArgList := nil; try if Target is TFunction then begin ArgCount := Params(ArgList); if Target is TVariableFunction then begin if ArgCount < TVariableFunction(Target).MinCount then RaiseParseException(peTooFewParams); Target.Count := ArgCount end else if ArgCount < Target.Count then RaiseParseException(peTooFewParams) else if ArgCount > Target.Count then RaiseParseException(peTooManyParams); TFunction(Target).ArgArray := ArgList; end; result := Target.Calc; Tokenizer.NextTrueToken; // skip ')' finally freemem(ArgList); end; end;

The first task of Call is to check that the identifier found in Factor refers to an actual built-in function or constant. If not, Call it raises an exception and the whole process is abandoned. If it finds it, then the associated target is returned in the variable Target and Call knows if it is a function (with fixed or variable number of arguments). Because all functions have at least one parameter, Call invokes Params to get the number of parameters found. If the target is a function with a variable number of arguments, Call checks that at least the minimum number of arguments have been found and if so, sets the count of arguments in the target. If the target is a function with a fixed number of arguments, Call checks that the number of arguments found is correct. Then the target’s argument array is set to the array allocated on the heap by Params. Finally, Call returns the value returned by the target’s Calc function and disposes of the array of float values.

The "magic" in these definitions flows from the fact that Params uses Expression to obtain the values of the called function’s arguments. Thus, it is possible to write convoluted statements such as max(min(5, e^10), pi) or even recursive things such as sin(max(min(5, 10*sin(e)), 32*sin(pi))) = -0,823 which we can verify:

32*sin(pi) = -1,735E-18 10*sin(e) = 4,108 so max(min(5, 10*sin(e)), 32*sin(pi)) = max(min(5, 4,108), -1,735E-18) = 4,108

And the sinus of that number is -0,823.

The reader will notice that each call to log(x, b) will allocate and then dispose of an array of numbers. It would be tempting to have the target _log allocate the array once and have Call/Params fill that array instead of allocating it on the heap at each call. It’s not a good idea although the author of this article did do that. It makes the functions non recursive. If the following expression is evaluated log(log(9, 2), 4) the correct answer is approximately 0.832:

log(9, 2) = 3.16992500144231 log(3.16992500144231, 4) = 0.832224353726945.

But if the parameter array was an element of _log, this is what would be calculated

starting with the outer log:ArgArray filled with [??, 4]
then with the inner log:ArgArray filled with [9, 2] to calculate log(9, 2)
then back to the outer log:ArgArray filled with [0.832, 2]

which will calculate the outer log but with the wrong base and give 1,66.

Coding Details

As mentioned above, adding functions with multiple parameters requires a new operator, the list separator. In accordance with the decision concerning the decimal symbol, the symbol to separate arguments of a function is locale dependant.

The first step is to add a public property to the tokenizer.

{: Character that separates a function's arguments (floating values). This is set to the system's locale value when the object is created.} property ListChar: char read FListChar write FListChar;

When the tokenizer is created, ListChar is set to a default value.

constructor TMdTokenizer.create; begin inherited Create; FDecimalChar := DecimalSeparator; {$IFDEF HAS_LISTSEPARATOR} FListChar := ListSeparator; {$ELSE} if FDecimalChar = '.' then FListChar := ',' else FListChar := ';'; {$ENDIF} Reset; end;

It is slightly more complicated to implement because early versions of SysUtils did not include ListSeparator. if the directive, HAS_LISTSEPARATOR is defined in Parser.inc, then SysUtilsListSeparator is assigned by default to FListChar. Otherwise, a choice is made between ‘,’ and ‘;’ according to the decimal symbol. This is correct for English and French, but may need to be adjusted for other languages.

As can be seen, ListChar can be set to any character. Thus it is possible to make a mess by setting ListChar equal to DecimalChar with unpredictable results. As with the decimal symbol, access functions to the private tokenizer’s ListChar are made available in SimpleParser.pas.

{: Character used as list separator symbol. By default this is defined in the computer's locale (usually a comma for English language systems and a semi-colon for French language systems).} function ParserListChar: char; {: Sets the character to be used as the list separator symbol. Returns the previous list separator symbol.} function SetParserListChar(value: char): char;

The warning about having the same character for the decimal symbol and the list separator applies when using these functions.

Some versions of Delphi do not include the Math.pas unit which contains the implementation of many of the functions that are built into the parser. Simple implementations of the needed functions are supplied in an augmented version of MiniMath.pas included with the source code provided with this text. The implementation in Math.pas is better, use that unit if possible.

However there was an error in the function ArcCscH found in Math.pas included in versions of Delphi prior to Rad Studio 8.0 (Delphi XE). If using an older Math.pas be sure to define the compiler directive FIX_ARCCSCH in Parser.inc so that a correct version of that function is used to compute values for the internal function acsch (the inverse hyperbolic cosecant function). MiniMath.pas evaluates this function correctly.

Users of newer versions of Delphi would not necessarily have handled internal functions quite in the way done here. They would probably use a dynamic array of floats to pass parameters to an internal function instead of a pointer to a fixed array of floats allocated on the heap. Then the Params procedure could be called without the number of expected parameters. The method would just add floats to a dynamic array whose length would be increased as needed. Functions such as average or variance, that have a variable number of parameters, could be handled just as functions with fixed numbers of variables with that approach. And it would be possible to replace the TFunction record with a simple pointer to the evaluator of each internal function because the field argcount would no longer be needed.

Of course using dynamic arrays would not be compatible with the earlier versions of Delphi. Furthermore, the field argcount will be used for used to distinguish constants from built in variables later on which will be important when trying to optimize evaluation. And of course when evaluating a function it would be almost mandatory to check that the dynamic array passed to it contains the correct number of parameters. This check would have to be done at the start of every built in function instead of once when parsing the source. So while not the most elegant approach, the method used here may be more efficient.

Homework

1. Modify the syntax so that an identifier can begin and include underscore "_".

2. Modify the syntax so that an identifier can include but not begin with a period ".".

3. Is it possible to forgo the use of the list separator so that the min function could be called as follows : min(4 5) ?

4. Can the multiplication symbol be made optional so that 3 pi would mean the same as 3*pi?

Source code

The source code for the tokenizer and parser is available along with a simple demonstration programme and test code. It is in the archive ParsingMathExpr-3.zip.

The demonstration program compiles with Delphi version 2 and up. It should also compile with Lazarus/FreePascal with minimal changes. As before, unit tests require DUnit.pas which works with newer versions of Delphi only. Unit testing was done with Delphi 2010.

Part 4 – National Language Support-> <-Part 2 – Adding Comments and Exponentiation