
## Intermediate representation

### Desired qualities

* Compact: O(N) in size, where N is the pattern length in characters
* Directly executable with minimal overhead by both types of engine (backtracking NFA & Thompson NFA)
* Single-pass generation
* Easy to patch and/or optimize

### Format

In general it is a plain dense array of bytecode obtained by transforming each element of the regex pattern. Concatenation is thus implicit, as the engine fetches instructions sequentially from this array. Currently it's assumed that compilation and execution of the bytecode happen on the same machine, so values are stored in native endianness and no extra checks are performed.

Most instructions are one 32-bit word, where the top 8 bits are the operation code and the low 24 bits hold additional data:

    +---8----+----------24----------+
    | opcode |         data         |
    +--------+----------------------+
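
To make the layout concrete, here is a minimal packing/unpacking sketch (assumed helper code, not the library's actual implementation):

```cpp
// Minimal sketch of the 32-bit instruction word layout described above:
// top 8 bits = opcode, low 24 bits = data. Helper names are assumptions.
#include <cstdint>

using Word = uint32_t;

constexpr Word pack(uint8_t opcode, uint32_t data) {
    return (static_cast<Word>(opcode) << 24) | (data & 0x00FFFFFFu);
}

constexpr uint8_t  opcodeOf(Word w) { return static_cast<uint8_t>(w >> 24); }
constexpr uint32_t dataOf(Word w)   { return w & 0x00FFFFFFu; }
```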

In the case of simple instructions like Char, data just holds the codepoint. Complex instructions like {n,m} repetitions need additional arguments, which are stored as plain words right after the instruction. To discern between raw data and IR instructions, the top bit of the opcode is always 1. Among other things this makes it easy to implement backwards execution of the bytecode without touching the buffer.

Another special case worth noting is the OrChar instruction, or more generally a bundled series of instructions. In this case the top two bits of data contain a sequence field; the total number of instructions in the series is 2 + sequence, since there is no point in a series of 0 or 1 instructions. Each instruction in the series has the same opcode and sequence field (to preserve the bidirectional trait of the IR), and the series as a whole is conceptually treated as one instruction. Going further with the OrChar case, it holds a few codepoints with the intent to match the input against any one of them.
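
As a rough illustration of these conventions (the instruction-vs-raw-data bit and the 2-bit sequence field of bundled instructions such as OrChar), here is a hedged decoding sketch; the helper names are placeholders, not the engine's API:

```cpp
// Sketch of decoding the conventions described above; names are assumptions.
#include <cstddef>
#include <cstdint>

using Word = uint32_t;

// IR instructions always have the top bit of the opcode set; raw argument
// words stored inline keep it clear, so a scan in either direction can tell
// them apart without extra bookkeeping.
inline bool isInstruction(Word w) { return (w & 0x80000000u) != 0; }

inline uint32_t dataOf(Word w) { return w & 0x00FFFFFFu; }

// For bundled instructions (e.g. OrChar) the top 2 bits of the data field
// hold the 'sequence' value; the whole bundle spans 2 + sequence words.
inline uint32_t sequenceOf(Word w)   { return dataOf(w) >> 22; }
inline size_t   bundleLength(Word w) { return 2 + sequenceOf(w); }

// The remaining 22 bits of each OrChar word carry one codepoint to try.
inline uint32_t codepointOf(Word w)  { return dataOf(w) & 0x003FFFFFu; }
```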

Character sets can be stored in a number of ways (depending on charset size and topology), ranging from emitting a series of fixed OrChar instructions to placing a 'link' opcode to an external data structure, where data contains the index of the external table that holds these objects.
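
A hedged sketch of that choice might look as follows; the opcode values, the small-set threshold, and the external-table handling are illustrative assumptions only:

```cpp
// Illustrative only: small sets become an OrChar bundle, larger ones a
// single 'link' instruction whose data field indexes an external table.
#include <cstdint>
#include <vector>

enum : uint8_t { OpOrChar = 0x81, OpCharsetLink = 0x82 }; // hypothetical opcodes (top bit set)

void emitCharset(std::vector<uint32_t>& ir,
                 const std::vector<uint32_t>& codepoints,            // the set, already flattened
                 std::vector<std::vector<uint32_t>>& externalTables) // side storage for big sets
{
    // (A set of exactly one codepoint would compile to a plain Char, omitted here.)
    if (codepoints.size() >= 2 && codepoints.size() <= 5) {
        // A bundle is 2 + sequence words long, so sequence = size - 2.
        uint32_t seq = static_cast<uint32_t>(codepoints.size()) - 2;
        for (uint32_t cp : codepoints)
            ir.push_back((uint32_t(OpOrChar) << 24) | (seq << 22) | (cp & 0x3FFFFFu));
    } else {
        externalTables.push_back(codepoints);
        uint32_t index = static_cast<uint32_t>(externalTables.size() - 1);
        ir.push_back((uint32_t(OpCharsetLink) << 24) | (index & 0xFFFFFFu));
    }
}
```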

Now let's go through the other syntax constructs. Our notation is name(data) for operation 'name' stored with a 24-bit value of 'data'. First:

(?:abc) is stored as:

char(a), char(b), char(c)

No special handling or optimizations needed. In the next sections a, b, c... are assumed to be arbitrary subexpressions.

(?:a|b|c)x

OrStart, Option(full length up to x), Option(length of a), a, GotoOrEnd, Option(length of b), b, GotoOrEnd, Option(length of c), c, OrEnd, x

GotoOrEnd contains a relative jump to OrEnd.
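
Assuming each Option's data gives the number of words to skip to reach the next Option (or OrEnd), an engine could locate every alternative without decoding the branch bodies, roughly as in this sketch (opcode values and names are placeholders):

```cpp
// Heavily-hedged sketch: collect the entry point of every alternative using
// only the Option words, assuming each Option's data gives the number of
// words to skip to reach the next Option (or OrEnd).
#include <cstdint>
#include <vector>

enum : uint8_t { OpOption = 0x90 }; // hypothetical opcode (top bit set)

inline uint8_t  opcodeOf(uint32_t w) { return static_cast<uint8_t>(w >> 24); }
inline uint32_t dataOf(uint32_t w)   { return w & 0x00FFFFFFu; }

// 'pc' points at the first per-branch Option word (after OrStart and the
// Option holding the full length).
std::vector<size_t> branchStarts(const std::vector<uint32_t>& ir, size_t pc) {
    std::vector<size_t> starts;
    while (opcodeOf(ir[pc]) == OpOption) {
        starts.push_back(pc + 1);     // the branch body begins right after its Option
        pc += 1 + dataOf(ir[pc]);     // hop over the body to the next Option / OrEnd
    }
    return starts;
}
```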

(a (b))

startgroup(0), a, startgroup(1), b, endgroup(1), endgroup(0)
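
One plausible way to execute these (a sketch, not the actual engine code) is to record the current input position in per-thread slots indexed by the 24-bit group number from the data field:

```cpp
// Illustrative group bookkeeping; structure and function names are assumptions.
#include <cstdint>
#include <vector>

struct ThreadState {
    std::vector<size_t> groupStart;
    std::vector<size_t> groupEnd;
};

inline void onStartGroup(ThreadState& t, uint32_t groupIndex, size_t inputPos) {
    if (t.groupStart.size() <= groupIndex) t.groupStart.resize(groupIndex + 1);
    t.groupStart[groupIndex] = inputPos;   // remember where group 'groupIndex' opened
}

inline void onEndGroup(ThreadState& t, uint32_t groupIndex, size_t inputPos) {
    if (t.groupEnd.size() <= groupIndex) t.groupEnd.resize(groupIndex + 1);
    t.groupEnd[groupIndex] = inputPos;     // remember where it closed
}
```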

a*

start infinite(length of a), a, infinite(length of a)

a{n,m}

This one is tougher, as it involves the only non-single-word instruction: counted repeat. All extra operands are stored as 32-bit words right after 'repeat':

start repeat(length of a), a, repeat(length of a, step, nstep, mstep)

The step is used to implement nested loops: a single counter is used as a "compressed" stack of counters.

a{n,} is translated as a{n,n}a*
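
A hedged sketch of emitting the counted repeat, under the assumption that nstep and mstep are the bounds n and m pre-multiplied by step, and that raw operand words keep their top bit clear (opcode names/values are placeholders):

```cpp
// Illustrative emission of a{n,m}: the compiled body of 'a' is bracketed by
// a leading and a trailing counted-repeat word; the trailing word is
// followed by three raw operand words (step, n*step, m*step).
#include <cstdint>
#include <vector>

enum : uint8_t { OpRepeatStart = 0xA0, OpRepeatEnd = 0xA1 }; // hypothetical

inline uint32_t instr(uint8_t op, uint32_t data) {
    return (uint32_t(op) << 24) | (data & 0x00FFFFFFu);
}

void emitCountedRepeat(std::vector<uint32_t>& ir,
                       const std::vector<uint32_t>& body, // already-compiled 'a'
                       uint32_t n, uint32_t m, uint32_t step) {
    uint32_t len = static_cast<uint32_t>(body.size());
    ir.push_back(instr(OpRepeatStart, len));
    ir.insert(ir.end(), body.begin(), body.end());
    ir.push_back(instr(OpRepeatEnd, len));
    ir.push_back(step & 0x7FFFFFFFu);           // raw operands: top bit stays clear,
    ir.push_back((n * step) & 0x7FFFFFFFu);     // so they can't be mistaken for
    ir.push_back((m * step) & 0x7FFFFFFFu);     // instructions when scanning backwards
}
```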

### Processing repetitions & Thompson VM

There are at least two ways to process fixed-size counted loops:

* Go cowboy and unroll them, i.e. a{n,m} --> aa...a (n copies) followed by a?a?...a? (m-n copies)
* Use special instructions and additional state per execution thread, namely a stack of counters

The second is more general and scalable but involves extra operations and requires carrying a stack of counters around. There is a way to get rid of the stack if one uses scaled counters (counters with different steps) to represent nested loops: e.g. (?:a{3,4}){1,2} will use the same counter with a step of 1 for the innermost loop and a step of 5 for the outermost (upscaling the min & max accordingly). This is exactly what we'll do; of course, it limits the maximum count of nested repetitions.
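
The arithmetic behind that example, written out as a sketch (the modulo/division decoding illustrates the idea only and is not the engine's code):

```cpp
// Scaled counters for (?:a{3,4}){1,2}: the outer step is innerMax + 1 = 5,
// and one combined counter replaces the per-loop stack of counters.
#include <cassert>
#include <cstdint>

int main() {
    const uint32_t innerStep = 1, outerStep = 5;   // 5 = inner max (4) + 1

    // Example counter values for the two loops, encoded into one number.
    uint32_t inner = 3, outer = 1;
    uint32_t combined = inner * innerStep + outer * outerStep;  // = 8

    // Each loop tests and bumps the combined counter using only its own step.
    assert(combined % outerStep == inner);         // recover the inner count
    assert(combined / outerStep == outer);         // recover the outer count

    // The outer bounds {1,2} are upscaled to {1*5, 2*5} at compile time.
    const uint32_t outerMinScaled = 1 * outerStep, outerMaxScaled = 2 * outerStep;
    (void)outerMinScaled; (void)outerMaxScaled;
    return 0;
}
```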

This all might be good and well, but to implement the Thompson VM approach one needs a way to quickly identify whether two threads have the same execution state. The proposed way is to mark every instruction that has more than one entry point (via jump or sequential flow) as a hotspot. Now for each hotspot instruction we should provide space to store traces of threads as they go through. To that end we fit another word right after the hotspot instruction with an index into the 'merge' table, where we prepare 0..maxCounterAtThisPoint words of storage to distinguish threads with different counter values. With this setup each thread hitting a hotspot instruction looks at the corresponding slot in the merge table and checks whether its index in the input is greater than the one stored there. If greater, it overwrites the slot and goes on; otherwise it's removed and its memory block is reused in subsequent matching.
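
A sketch of that check, with all names and the empty-slot sentinel being assumptions:

```cpp
// Hedged sketch of the hotspot/merge-table check described above.
#include <cstdint>
#include <limits>
#include <vector>

constexpr size_t kEmpty = std::numeric_limits<size_t>::max(); // "no thread seen yet"

struct MergeTable {
    // One slot per possible counter value at this hotspot; sized to
    // maxCounterAtThisPoint when the hotspot is laid out.
    std::vector<size_t> lastIndex;
};

// Returns true if the thread may continue past the hotspot, false if it is
// subsumed and its memory block should be recycled.
bool passHotspot(MergeTable& table, uint32_t counterSlot, size_t inputIndex) {
    size_t& seen = table.lastIndex[counterSlot];
    if (seen == kEmpty || inputIndex > seen) {
        seen = inputIndex;   // record the furthest progress for this counter value
        return true;
    }
    return false;
}
```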
