In its most simple form, a parsing run only returns whether (a portion of) the input matches the grammar. To actually do something useful during a parsing run it is necessary to attach (user-defined) actions to one or more grammar rules.
Actions are essentially functions that are called during the parsing run whenever the rule they are attached to successfully matched. When an action is applied, the corresponding function receives the states, an arbitrary list of (user-defined) objects, as arguments.
- Overview
- Example
- States
- Apply
- Apply0
- Inheriting
- Specialising
- Changing Actions
- Changing States
- Changing Actions and States
- Match
- Nothing
- Backtracking
- Troubleshooting
- Legacy Actions
Actions are implemented as static member functions called apply()
or apply0()
of specialisations of custom class templates (which is not quite as difficult as it sounds).
States are additional function arguments to tao::pegtl::parse()
that are forwarded to all actions.
To use actions during a parsing run they first need to be implemented.
- Define a custom action class template.
- Specialise the action class template for every rule for which a function is to be called and
- either implement an
apply()
orapply0()
static member function, - or derive from a class that implements the desired function.
- either implement an
The very first step, defining a custom action class template, usually looks like this.
template< typename Rule >
struct my_action
: tao::pegtl::nothing< Rule > {};
Instantiations of the primary template for my_action< Rule >
inherit from tao::pegtl::nothing< Rule >
to indicate that, by default, neither my_action< Rule >::apply()
nor my_action< Rule >::apply0()
are to be called when Rule
is successfully matched during a parsing run, or, in short, that no action is to be applied to Rule
.
You then specialise the action class template for those rules that you do want to call an action on. An example for a simple action for a specific state might look like this.
template<>
struct my_action< my_rule >
{
template< typename ActionInput >
static void apply( const ActionInput& in, my_state& s )
{
// ... implement
}
};
Then the parsing run needs to be set up with the actions and any required states.
For this, the initial action can be passed as the second template parameter and the initial states can be passed as additional arguments to tao::pegtl::parse()
.
In order to manage the complexity in larger parsers and/or compose multiple grammars that each bring their own actions which in turn expect certain states, it can be useful to change the actions and/or change the states within a parsing run.
Here is a very short example that shows the basic way to put together a parsing run with actions and states.
// Define a simple grammar consisting of a single rule.
struct my_grammar
: tao::pegtl::star< tao::pegtl::any > {};
// Primary action class template.
template< typename Rule >
struct my_action
: tao::pegtl::nothing< Rule > {};
// Specialise the action class template.
template<>
struct my_action< tao::pegtl::any >
{
// Implement an apply() function that will be called by
// the PEGTL every time tao::pegtl::any matches during
// the parsing run.
template< typename ActionInput >
static void apply( const ActionInput& in, std::string& out )
{
// Get the portion of the original input that the
// rule matched this time as string and append it
// to the result string.
out += in.string();
}
};
template< typename ParseInput >
std::string as_string( ParseInput& in )
{
// Set up the states, here a single std::string as that is
// what our action requires as additional function argument.
std::string out;
// Start the parsing run with our grammar, action and state.
tao::pegtl::parse< my_grammar, my_action >( in, out );
// Do something with the result.
return out;
}
All together the as_string()
function is a convoluted way of turning an input into a std::string
byte-by-byte.
In the following we will take a more in-depth look at states and apply()
and apply0()
before diving into more advanced subjects.
There is not much more to say about the states other than what has already been mentioned, namely that they are a list (colloquial list, not std::list
) of objects that are
-
passed by the user as additional arguments to
tao::pegtl::parse()
, and then -
passed by the PEGTL as additional arguments to all actions'
apply()
orapply0()
static member functions.
The additional arguments to apply()
and apply0()
can be chosen freely, however all actions must accept the same list of states since they are all called with the same arguments by default.
States are not forwarded with "perfect forwarding" since r-value references don't make much sense when they will be used as action arguments many times.
The parse()
function still uses universal references to bind to the state arguments in order to allow temporary objects.
As seen above, the actual functions that are called when an action is applied are static member functions named apply()
of the specialisations of the action class template.
template<>
struct my_action< my_rule >
{
template< typename ActionInput >
static void apply( const ActionInput& in, /* all the states */ )
{
// Called whenever matching my_rule during a parsing run
// succeeds (and actions are not disabled). The argument
// named 'in' represents the matched part of the input.
// Can also return bool instead of void.
}
}
The first argument is not the input used in the parsing run, but rather a separate object of distinct type that represents the portion of the input that the rule to which the action is attached just matched. The remaining arguments to apply()
are the current state arguments.
The exact type of the input class passed to apply()
is not specified.
It is best practice to "template over" the type of the input as shown above.
Actions can then assume that the input provides (at least) the following interface.
The Input
template parameter is set to the class of the input used as input in the parsing run at the point where the action is applied.
For illustrative purposes, we will assume that the input passed to apply()
is of type action_input
.
Any resemblance to real classes is not a coincidence, see include/tao/pegtl/internal/action_input.hpp
.
template< typename ParseInput >
class action_input
{
public:
using input_t = ParseInput;
using iterator_t = typename ParseInput::iterator_t;
bool empty() const noexcept;
std::size_t size() const noexcept;
const char* begin() const noexcept; // Non-owning pointer!
const char* end() const noexcept; // Non-owning pointer!
std::string string() const; // std::string( begin(), end() )
std::string_view string_view() const noexcept; // std::string_view( begin(), size() )
char peek_char( const std::size_t offset = 0 ) const noexcept; // begin()[ offset ]
std::uint8_t peek_uint8( const std::size_t offset = 0 ) const noexcept; // similar
pegtl::position position() const noexcept; // Not efficient with tracking_mode::lazy.
const ParseInput& input() const noexcept;
const iterator_t& iterator() const noexcept;
};
Note that input()
returns the input from the parsing run which will be at the position after what has just been parsed, i.e. for an action input ai
the assertion ai.end() == ai.input().current()
will always hold true.
Conversely iterator()
returns a pointer or iterator to the beginning of the action input's data, i.e. where the successful match attempt to the rule the action called with the action input is attached to started.
More importantly the action_input
does not own the data it points to, it belongs to the original input used in the parsing run.
Therefore the validity of the pointed-to data might not extend (much) beyond the call to apply()
!
When the original input has tracking mode eager
, the iterator_t
returned by action_input::iterator()
will contain the byte
, line
and column
counters corresponding to the beginning of the matched input represented by the action_input
.
When the original input has tracking mode lazy
, then action_input::position()
is not efficient because it calculates the line number etc. by scanning the complete original input from the beginning
Actions often need to store and/or reference portions of the input for after the parsing run, for example when an abstract syntax tree is generated. Some of the syntax tree nodes will contain portions of the input, for example for a variable name in a script language that needs to be stored in the syntax tree just as it occurs in the input data.
The default safe choice is to copy the matched portions of the input data that are passed to an action by storing a deep copy of the data as std::string
, as obtained by the input class' string()
member function, in the data structures built while parsing.
When the return type of an action, i.e. its apply()
, is bool
, it can retro-actively let the library consider the attempt to match the rule to which the action is attached a (local) failure.
For the overall parsing run, there is no difference between a rule returning false
and an attached action returning false
, however the action is only called when the rule returned true
.
When an action returns false
, the library rewinds the input to where it was when the rule to which the action was attached started its successful match.
This is unlike match()
static member functions that have to rewind the input themselves.
In cases where the matched part of the input is not required, an action can implement a static member function called apply0()
instead of apply()
.
What changes is that apply0()
will be called without an input as first argument, i.e. only with all the states.
template<>
struct my_action< my_rule >
{
static void apply0( /* all the states */ )
{
// Called whenever matching my_rule during a parsing run
// succeeds (and actions are not disabled). Can also return
// bool instead of void.
}
};
Using apply0()
is never necessary, it is "only" an optimisation with minor benefits at compile time, and potentially more noteworthy benefits at run time.
We recommend implementing apply0()
over apply()
whenever both are viable.
Though an infrequently used feature, apply0()
can also return bool
instead of void
, just like apply()
and with the same implications.
We will use an example to show how to use existing actions via inheritance. The grammar for this example consists of a couple of simple rules.
struct plain
: tao::pegtl::utf8::range< 0x20, 0x10FFFF > {};
struct escaped
: tao::pegtl::one< '\'', '"', '?', '\\', 'a', 'b', 'f', 'n', 'r', 't', 'v' > {};
struct character
: tao::pegtl::if_must_else< tao::pegtl::one< '\\' >, escaped, plain > {};
struct text
: tao::pegtl::must< tao::pegtl::star< character >, tao::pegtl::eof > {};
Our goal is for a parsing run with the text
rule to produce a copy of the input where the backslash escape sequences are replaced by the character they represent.
When the plain
rule matches, the bytes of the matched UTF-8-encoded code-point can be appended to the result.
When the escaped
rule matches, the bytes corresponding to the character represented by the escape sequence must be appended to the result.
This can be achieved with appropriate specialisations of my_action
using some contrib classes from tao/pegtl/contrib/unescape.hpp
.
template<>
struct my_action< plain >
: tao::pegtl::append_all {};
template<>
struct my_action< escaped >
: tao::pegtl::unescape_c< escaped, '\'', '"', '?', '\\', '\a', '\b', '\f', '\n', '\r', '\t', '\v' > {};
For step three the input for the parsing run is set up as usual.
In addition, the actions are passed as second template parameter, and a std::string
as second argument to parse()
.
Here unescaped
is the state that is required by the append_all
and unescape_c
actions; all additional arguments passed to parse()
are forwarded to all actions.
std::string unescape( const std::string& escaped )
{
std::string unescaped;
tao::pegtl::memory_input in( result, __FUNCTION__ );
tao::pegtl::parse< text, my_action >( in, unescaped );
return unescaped;
}
At the end of the parsing run, the complete unescaped string can be found in the aptly named variable unescaped
.
A more complete example of how to unescape strings can be found in src/example/pegtl/unescape.cpp
.
The rule class for which an action class template is specialised must exactly match the definition of the rule in the grammar. For example consider the following rule.
struct foo
: tao::pegtl::plus< tao::pegtl::alpha > {};
Now an action class template can be specialised for foo
, or for tao::pegtl::alpha
, but not for tao::pegtl::plus< tao::pegtl::alpha >
.
This is because base classes are not taken into consideration by the C++ language when choosing a specialisation, which might be surprising when being used to pointer arguments to functions where conversions from pointer-to-derived to pointer-to-base are performed implicitly and silently.
So although the function called by the library to match foo
is the inherited tao::pegtl::plus< tao::pegtl::alpha >::match()
, the rule class is foo
and the function known as foo::match()
, wherefore an action needs to be specialised for foo
instead of tao::pegtl::plus< tao::pegtl::alpha >
.
While it is possible to specialise the action class template for tao::pegtl::alpha
, it might not be a good idea since the action would be applied for all occurrences of tao::pegtl::alpha
in the grammar.
To circumvent this issue a new name can be given to the tao::pegtl::alpha
, a name that will not be "randomly" used in other places of the grammar.
struct bar
: tao::pegtl::alpha {};
struct foo
: tao::pegtl::plus< bar > {};
Now an action class template can be specialised for foo
and bar
, but again not for tao::pegtl::plus< bar >
or tao::pegtl::alpha
.
More precisely, it could be specialised for the latter two rules, but wouldn't ever be called unless these rules were used elsewhere in the grammar, a different kettle of fish.
Note that this is also the reason why you should not use type aliases instead of inheritance when defining your grammars.
The action class template can be changed, and actions enabled or disabled, in ways beyond supplying, or not, an action to tao::pegtl::parse()
at the start of a parsing run.
The tao::pegtl::enable<>
and tao::pegtl::disable<>
rules behave just like seq<>
but, without touching the current action, enable or disable calling of actions within their sub-rules, respectively.
The tao::pegtl::action<>
rule also behaves similarly to seq<>
but takes an action class template as first template parameter and, without enabling or disabling actions, uses its first template parameter as action for the sub-rules.
The following two lines effectively do the same thing, namely parse with my_grammar
as top-level parsing rule without invoking actions (unless actions are enabled again somewhere else).
tao::pegtl::parse< my_grammar >( ... );
tao::pegtl::parse< tao::pegtl::disable< my_grammar >, my_action >( ... );
Similarly the following two lines both start parsing my_grammar
with my_action
(again only unless something changes somewhere else).
tao::pegtl::parse< my_grammar, my_action >( ... );
tao::pegtl::parse< tao::pegtl::action< my_action, my_grammar > >( ... );
User-defined parsing rules can use action<>
, enable<>
and disable<>
just like any other combinator rules.
For example to disable actions in LISP-style comments the following rule could be used as per src/example/pegtl/s_expression.cpp
.
struct comment
: tao::pegtl::seq< tao::pegtl::one< '#' >, tao::pegtl::disable< cons_list > > {};
The ability to change the actions during a parsing run also allows using the same rules multiple times with different action class templates within a grammar.
The action classes tao::pegtl::disable_action
and tao::pegtl::enable_action
can be used to disable and enable actions, respectively, for any rule (and its sub-rules).
For example actions can be disabled for my_rule
in a parsing run using my_action
as follows.
template<>
struct my_action< my_rule >
: tao::pegtl::disable_action {};
tao::pegtl::parse< my_grammar, my_action >( ... );
Conversely tao::pegtl::change_action<>
takes a new action class template as its only template parameter and changes the current action in a parsing run to it.
Note that parsing proceeds with the rule to which the action changing action is attached to "as if" the new action had been the current action all along. The new action can even perform an action change on the same rule, however care should be taken to not introduce infinite cycles of changes.
The states, too, can be changed in ways beyond supplying them, or not, to tao::pegtl::parse()
at the start of a parsing run.
The state
rule behaves similarly to seq
but uses the first template parameter as type of a new object.
This new object is used replaces the current state(s) for the remainder of the implicit seq
.
The new object is constructed with a const-reference to the current input of the parsing run, and all previous states, if any, as arguments.
If no such constructor exists, the new object is default constructed.
If the implicit seq
of the sub-rules succeeds, then, by default, a member function named success()
is called on this "new" object, receiving the same arguments as the constructor.
At this point the input will be advanced by whatever the sub-rules have consumed in the meantime.
Please consult include/tao/pegtl/internal/state.hpp
to see how the default behaviour on success can be changed by overriding tao::pegtl::state<>::success()
in a derived class when using that class instead.
Embedding a state change into the grammar with state<>
is only recommended when some state is used by custom parsing rules.
The actions tao::pegtl::change_state<>
and tao::pegtl::change_states<>
can be used to change from the current to a new set of states while parsing the rules they are attached to.
The differences are summarised in this table; note that change_state
is more similar to the legacy change_state
control class as included with the 2.z versions of the PEGTL.
Feature | change_state |
change_states |
---|---|---|
Number of new states | one | any |
Construction of new states | optionally with input and old states | default |
Success function on action | if not on new state | required |
With change_state
only a single new state type can be given as template parameter, and only a single new state will be created.
The constructor of the new state receives the same arguments as per tao::pegtl::state<>
, the current input from the parsing run and all previous states.
A success()
static member function is supplied that calls the success()
member function on the new state, again with the current input from the parsing run and all previous states.
The supplied success()
can of course be overridden in a derived class.
With change_states
, being a variadic template, any number of new state types can be given and an appropriate set of new states will be created (nearly) simultaneously.
All new states are default-constructed, if something else is required the reader is encouraged to copy and modify the implementation of change_states
in their project.
The user must implement a custom success()
static member function that takes the current input from the parsing run, the new states, and the old states as arguments.
Note that, unlike the tao::pegtl::state<>
combinator, the success functions are only called when actions are currently enabled!
Using the changing actions is again done via inheritance as shown in the following example for change_states
.
template<>
struct my_action< my_rule >
: tao::pegtl::change_states< new_state_1, new_state_2 >
{
template< typename ParseInput >
static void success( const ParseInput&, new_state_1&, new_state_2&, /* the previous states*/ )
{
// Do whatever with both the new and the old states...
}
};
For a more complete example of how to build a generic JSON data structure with change_state
and friends see src/example/pegtl/json_build.cpp
.
The actions change_action_and_state<>
and change_action_and_states<>
combine change_action
with one of the change_state<>
or change_states<>
actions, respectively.
For change_action_and_state<>
and change_action_and_states<>
the new action class template is passed as first template parameter as for change_action
, followed by the new state(s) as given to change_state<>
and change_states<>
.
Note that change_action_and_state<>
and change_action_and_states<>
behave like change_action<>
in that they proceed to match the rule to which the changing action is attached to "as if" the new action had been the current action all along.
Besides apply()
and apply0()
, an action class specialization can also have a match()
static member function.
The default control class template normal
will detect the presence of a suitable match()
function and call this function instead of tao::pegtl::match()
.
template<>
struct my_action< my_rule >
{
template< typename Rule,
apply_mode A,
rewind_mode M,
template< typename... > class Action,
template< typename... > class Control,
typename ParseInput,
typename... States >
static bool match( ParseInput& in, States&&... st )
{
// Call the function that would have been called otherwise,
// in this case without changing anything...
return tao::pegtl::match< Rule, A, M, Action, Control >( in, st... );
}
}
Implementing a custom match()
for an action is considered a rather advanced feature that is not used directly very often.
All "changing" action classes mentioned in this document are implemented as actions with match()
.
Their implementations can be found in <tao/pegtl/change_*.hpp>
and should be studied before implementing a custom action with match()
.
Letting the primary template of an action class template derive from tao::pegtl::nothing
is recommended, but not necessary.
When using nothing
, some assertions are enabled that are usually very helpful while developing a parser.
When not using nothing
, simply by never mentioning it (as base class), these assertions are disabled and it is possible for an action's apply()
or apply0()
implementation to be silently ignored.
In the following let a
be an action template class, i.e. the instantiation of an action class template action
for some rule r
, or using a = action< r >
for short.
We say that apply()
is callable when it is the name of a static member function of a
that returns either void
or bool
and can be called with an input and the current states.
We say that apply0()
is callable when it is the name of a static member function of a
that returns either void
or bool
and can be called with the current states.
The following assertions are always enabled.
- There must be at most one callable
apply
orapply0()
ina
. - If
nothing< r >
is an accessible base class ofa
thena
must not have a callableapply()
. - If
nothing< r >
is an accessible base class ofa
thena
must not have a callableapply0()
. - If
require_apply
is an accessible base class ofa
then it must have a callableapply()
. - If
require_apply0
is an accessible base class ofa
then it must have a callableapply0()
.
The classes require_apply
and require_apply0
are also explained in the State Mismatch section.
The following assertion is only enabled when std::is_base_of_v< tao::pegtl::nothing< void >, action< void > >
is true
.
- Either
nothing
must be an accessible base class ofa
, or maybe_nothing
must be an accessible base class ofa
, ora
must have a callableapply()
orapply0()
.
The class tao::pegtl::maybe_nothing
is an accessible base class of all the changing actions explained above.
This make is possible, but not necessary, to implement apply()
or apply0()
for actions derived from them.
Note that maybe_nothing
can be combined, through multiple inheritance, with one of nothing< r >
, require_apply
or require_apply0
.
For example when a class b
is derived from change_state
, it also gains that class' maybe_nothing
as accessible base class.
At this point b
is allowed to either have or not have an apply()
or apply0()
.
By letting b
also derive from one of the three mentioned classes, the maybe_nothing
will be ignored and b
will be checked to have or not have the functions as dictated by the respective additional base class.
Sometimes there can be backtracking during a parsing run which can lead to Actions being called in places where their effects are undesired. While it might be intuitively clear what backtracking is, for the purpose of the following discussion we give a slightly more formal definition.
We speak of backtracking across a rule S
when there is a rule R
of which S
is a (direct or indirect) sub-rule and during a parsing run
R
returns local failure afterS
succeeded and its success is a requirement for the success ofR
and- it is "still possible" for the top-level grammar rule of the parsing run to succeed.
In this case the input will have been rewound to the point at which R
was attempted to match and all effects of S
on the Input will have been undone, however, and this is the subject of this section, any action attached to S
will have been already performed without there being an automatic "undo".
In some cases it is easy to rewrite the grammar in a way that prevents backtracking. This simultaneously removes the issue of having to undo actions and improves parsing performance.
The prototypical case for which such a rewrite can be done is R = sor< seq< A, B >, seq< A, C > >
where A
, B
and C
are arbitrary rules.
If during a parsing run there are actions attached to A
and C
, and the input matches seq< A, C >
but not seq< A, B >
, then the action for A
will be called twice before the action for C
, which gives this problem its "AAC" name, given that what happens is:
- Begin
sor< seq< A, B >, seq< A, C > >
- Begin
seq< A, B >
- Begin
A
- Success
A
with action called - Begin
B
- Failure
B
- Failure
seq< A, B >
- Begin
seq< A, C >
- Begin
A
at the same position as the beginA
above - Success
A
with action called again on the same input - Begin
C
- Success
C
- Success
seq< A, C >
- Success
sor< seq< A, B >, seq< A, C > >
In practice the structure of the rule might be more complicated than the pure AAC-problem which will make it harder to recognise the pattern.
One solution is to rewrite R
as R' = seq< A, sor< B, C > >
where of course any action for A
will be called at most once for every successful match of R'
.
Another solution is to undo the effects of the Action attached to A
in case the encompassing seq< A, B >
(or seq< A, C >
) fail.
The advantage of this approach is that the implementation of the Action for A
can pretend that is only called when really needed.
The disadvantage is that there is no function on the Action that is called in the case of failure which requires the user to either write a custom match()
function in the Action for seq< A, B >
or to implement the failure()
function in a custom Control class.
A further solution is to let the Action for A
perform its job "to the side", and only "commit" the effects to the target data structure in the Action for seq< A, B >
.
For example if the Action attached to A
takes the matched portion of the Input as std::string
and appends it to std::vector< std::string >
one could change said Action for A
to only fill some temporary string in one of the States, and create an Action for seq< A, B >
that, after it is called on success of that rule, appends the aforementioned temporary string to the target vector.
When everything else fails and a quick-and-dirty solution to Actions being called too often in the presence of backtracking is required and/or performance is not of prime importance it is relatively easy to solve the problem by employing the infinite look-ahead capability of PEGs.
When backtracking across S
is a problem because an Action attached to S
can be called when S
succeeds even though there is a higher-up rule R
that can still fail then simply replace R
with seq< at< R >, R >
in the grammar.
Remembering that at
disables all Actions explains how this solves the problem; we first verify without Actions that R
will indeed match at this point and only then match R
again with Actions enabled.
The following lists a couple of frequently encountered Action-related errors and how to fix them.
Actions returning bool
are an advanced use case that should be used with caution.
They prevent some internal optimisations, in particular when used with apply0()
.
They can also have weird effects on the semantics of a parsing run, for example at< rule >
can succeed for the same input for which rule
fails when there is a bool
-action attached to rule
that returns false
(remember that actions are disabled within at<>
).
When an action's apply()
or apply0()
expects different states than those present in the parsing run there will either be a possibly not very helpful compiler error, or it will compile without a call to the action, depending on whether tao::pegtl::nothing<>
is used as base class of the primary action class template.
By deriving an action specialisation from either tao::pegtl::require_apply
or tao::pegtl::require_apply0
, as appropriate, a -- potentially more helpful -- compiler error can be provoked, so when the grammar contains my_rule
and the action is my_action
then silently compiling without a call to apply0()
is no longer possible.
template<>
struct my_action< my_rule >
: require_apply0
{
static void apply0( double )
{
// ...
}
}
Note that deriving from require_apply
or require_apply0
is optional and usually only used for troubleshooting.
See the section on legacy-style action rules.
This document is part of the PEGTL.
Copyright (c) 2014-2023 Dr. Colin Hirsch and Daniel Frey
Distributed under the Boost Software License, Version 1.0
See accompanying file LICENSE_1_0.txt or copy at https://www.boost.org/LICENSE_1_0.txt