Skip to content

How Numbas' Judged Mathematical Expression system works

christianp edited this page Sep 20, 2011 · 1 revision

How Numbas’ Judged Mathematical Expression system works

This document will describe the various algorithms used in the process of interpreting, manipulating and evaluating mathematical expressions written in JME syntax. There will also be some discussion of the design decisions made about data structures and interfaces.

Parsing a JME expression

Numbas.jme.compile takes a string representing a mathematical expression and returns a syntax tree.

Compilation consists of:

  1. tokenize the string (convert string of characters to a list of tokens representing syntactical elements)
  2. shunt the tokens into an evaluation tree corresponding to the expression
  3. typecheck all operations in the tree, if required

Tokenizing step

Tokenizing an expression string consists of scanning across it, recognising substrings as tokens, and storing them in a list in the same order as they were found.

The algorithm: (parameter: expr – a string representing the expression)

While expr is not the empty string:

  • trim whitespace off the beginning of expr
  • test the string against regular expressions for each type of token. They match only from the beginning of the string, and are greedy (they match as long a string as possible)
  • create a token object of the appropriate type, and add in any implicit tokens (e.g. a multiplication operation between adjacent variable names)
  • chop the matched string off the start of expr

Every token object has a type property. Additionally, data types (string, number, etc.) have a value property, while operations, functions and variables have a name property.

The objects used to represent tokens are the same as the data objects used when evaluating expressions, in order to avoid a conversion step.

The recognised token types are these:

  • number: a number, real or complex.
  • string: a string of characters.
  • boolean: true or false.
  • list: an array of data objects of any type.
  • range: a closed range of numbers, either continuous or discrete.
  • name: a variable name. Function names are first tokenised as this type of object, before the shunt phase.
  • function: a function. This type can not be produced by the tokenizer, but is created during the shunt phase.
  • op: a binary or unary operation.
  • special: a special character. (I think this is just a relic from i-assess, in order to produce things like greek symbols. MathJax and unicode are used now instead)
  • conc: a virtual operation representing concatenation of two special characters. Again, no longer used.
  • ( ) , [ ]: punctuation characters.

The constructors for these data types are stored in Numbas.jme.types

Here is a (potentially incomplete) list of the extra fiddling that tokenize does with expressions, to be more accepting of student input:

  • extraneous whitespace is removed, so you can put as many spaces between operators and operands as you like
  • right bracket followed by a number, eg )3, has a multiplication operator inserted in the middle: )*3
  • number or right bracket followed by a left bracket has a multiplication operator inserted in the middle
  • + preceded by ( , or another operator, or at the start of an expression, is interpreted as unary plus.
  • the name of a builtin function prepended with a single-letter variable name, eg xsin, is interpreted as a multiplication, like x*sin
  • built-in constants e, pi and i are replaced with their respective number values.

Syntactical errors are caught at this stage — invalid characters, etc. cause the tokenising process to fail. Grammatical errors, such as unmatched brackets, are not caught yet.

Shunting step

Once the expression has been converted into a list of tokens, it is “shunted” into an abstract syntax tree.

The algorithm began life as the shunting yard algorithm, which produces an expression in RPN, but I eventually decided it was better just to go straight to a tree structure.

The output of the shunt algorithm is an “expression” object with two properties: tok and args.

tok is the corresponding token object created in the tokenising step, so either a primitive data type (eg number), or an operation or function.

args is a list of arguments, if the token is an operation or function. The arguments are also expression objects, of the same form.

Typecheck step

The final step is to check whether the expression can be evaluated. This consists of looking at every operation and function call in the expression, checking that there is a corresponding definition taking the right number and type of arguments.

As a side effect of this process, the relevant javascript code is attached to the nodes on the tree, so they don’t have to be looked up for each evaluation.

If there are free variables in the expression, they are marked as unbound, which makes substituting in values a bit quicker when the expression finally does get evaluated.

The typecheck step is optional – if the compiled expression is only ever going to be displayed and not evaluated, it is unnecessary.

Substituting values into a JME expression

The function Numbas.jme.substituteTree takes a compiled JME syntax tree and a dictionary of variable values to substitute in, and returns a copy of the tree with the given values substituted in appropriately.

The values to be substituted should be either syntax trees themselves, or single data objects created from Numbas.jme.types.

A deep copy of the tree is made, so that the original can be re-used, and then the copy is traversed depth-first. Whenever an object of type name is encountered, the corresponding value from the dictionary of values is inserted at that point.

If the dictionary does not contain a value for a particular variable, the operation fails, unless the parameter allowUnbound is true, in which case the name token is left in place.

Evaluating a JME expression

Numbas.jme.evaluate takes either a string expression or a syntax tree, and dictionaries of variables and functions to substitute into the expression.

If the given expression is a string, it is compiled to a syntax tree. The variables and functions parameters can be omitted if not needed.

First of all, the given variables are subsituted into the tree, as described above. Then, Numbas.jme.bind is run on the tree to attach any new functions or operations substituted in.

Finally, the tree is evaluated to a single value. The token at the top of the tree is inspected, and what happens depends on what type it is.

  • Tokens of type number, boolean and range are returned without modification.
  • string tokens are passed through Numbas.jme.contentsubvars before being returned. Since contentsubvars applies the Textile parser, you might want to remove this step if you just want to deal with mathematical expressions.
  • list tokens have each item evaluated before being returned.
  • If the top of the tree is an op or a function token, the relevant bit of code is evaluated, and should return a token object from Numbas.jme.types. Note that the arguments to the operation are not evaluated in advance; this allows for “lazy evaluation” of things like conditional statements or recursive algorithms.

The function which evaluates an operation should take three arguments: args, variables and functions. The args are the arguments of the operation, and the other two parameters are needed in case some string substitution needs to happen.

Functions are free to evaluate all, some, or none of their arguments, using Numbas.jme.evaluate.

Extending and customising the JME system

The JME system has been designed to be fairly easily customised. In particular, new functions can be added very easily. It should also be quite straightforward to add new data types.

Adding a new function

Exam authors can define new functions in the .exam file using JME syntax. These are passed through to the compile and evaluate steps by the Numbas runtime.

When writing an extension, however, it’s easier just to write a function in pure javascript and add it to the list of builtin functions. To do that, write your function in javascript, and then create a funcObj object to go with it:

function addtwo(n) { return n+2; }

new Numbas.jme.funcObj('addtwo', [TNum], TNum, addtwo);

The arguments for the funcObj constructor are:

  • name – the name used to call the function in JME expressions
  • intype – an array of Numbas.jme.types constructors, saying which types the arguments to the function must be. You can use the string '?' instead of a type constructor if an argument can be of any type, or if you want to do your own typechecking.
  • outcons – the constructor to use on the value your function returns. Again, this should be one of the constructors from Numbas.jme.types.
  • fn – your javascript function

Adding a new data type

The constructors for data types are stored in Numbas.jme.types. They’re just normal javascript object constructors (see the MDN page on the new operation), with an extra property called type. Normally, a data element’s value is stored in a property called value, though this is not required: for example, TFunc elements have a property called name.

Once you’ve created a new data type, you can add functions to construct and use elements of that type. If you’re adventurous, you could add some code to the tokeniser to parse literals of your new type.