Skip to content

Commit

Permalink
Do not quietly swallow literals.
Browse files Browse the repository at this point in the history
  • Loading branch information
matthewwardrop committed Dec 20, 2023
1 parent fd8cd40 commit 3f5b1bc
Show file tree
Hide file tree
Showing 3 changed files with 109 additions and 9 deletions.
19 changes: 10 additions & 9 deletions docsite/docs/guides/grammar.md
Original file line number Diff line number Diff line change
Expand Up @@ -31,18 +31,18 @@ unless otherwise indicated.
| ** | 2 | Includes all n-th order interactions of the terms in the left operand, where n is the (integral) value of the right operand, e.g. `(a+b+c)**2` is equivalent to `a + b + c + a:b + a:c + b:c`. ||||
| ^ | 2 | Alias for `**`. ||[^3] ||
|-----|
| `:` | 2 | Adds a new term that corresponds to the interaction of its operands (i.e. their elementwise product). ||||
| `:` | 2 | Adds a new term that corresponds to the interaction of its operands (i.e. their elementwise product). |[^4] |||
|-----|
| `*` | 2 | Includes terms for each of the additive and interactive effects of the left and right operands, e.g. `a * b` is equivalent to `a + b + a:b`. ||||
| `/` | 2 | Adds terms describing nested effects. It expands to the addition of a new term for the left operand and the interaction of all left operand terms with the right operand, i.e `a / b` is equivalent to `a + a:b`, `(a + b) / c` is equivalent to `a + b + a:b:c`, and `a/(b+c)` is equivalent to `a + a:b + a:c`.[^4] ||||
| `/` | 2 | Adds terms describing nested effects. It expands to the addition of a new term for the left operand and the interaction of all left operand terms with the right operand, i.e `a / b` is equivalent to `a + a:b`, `(a + b) / c` is equivalent to `a + b + a:b:c`, and `a/(b+c)` is equivalent to `a + a:b + a:c`.[^5] ||||
| `%in%` | 2 | As above, but with arguments inverted: e.g. `b %in% a` is equivalent to `a / b`. ||||
|-----|
| `+` | 2 | Adds a new term to the set of features. ||||
| `-` | 2 | Removes a term from the set of features (if present). ||||
| `+` | 1 | Returns the current term unmodified (not very useful). ||||
| `-` | 1 | Negates a term (only implemented for 0, in which case it is replaced with `1`). ||||
|-----|
| `|` | 2 | Splits a formula into multiple parts, allowing the simultaneous generation of multiple model matrices. When on the right-hand-side of the `~` operator, all parts will attract an additional intercept term by default. |||[^5] |
| `|` | 2 | Splits a formula into multiple parts, allowing the simultaneous generation of multiple model matrices. When on the right-hand-side of the `~` operator, all parts will attract an additional intercept term by default. |||[^6] |
|-----|
| `~` | 1,2 | Separates the target features from the input features. If absent, it is assumed that we are considering only the the input features. Unless otherwise indicated, it is assumed that the input features implicitly include an intercept. ||||

Expand All @@ -60,8 +60,8 @@ that have *not* been implemented by `formulaic` are explicitly noted also.
| `Q('<column_name>')` | Look up feature by potentially exotic name, e.g. `Q('wacky name!')`. Note that in `formulaic`, it is more idiomatic to use ``` `wacky name!` ```. ||||
| `C(...)` | Categorically encode a column, e.g. `C(x)` ||||
| `center(...)` | Shift column data so mean is zero. ||||
| `scale(...)` | Shift column so mean is zero and variance is 1. ||[^6] ||
| `standardize(...)` | Alias of `scale`. |[^7] |||
| `scale(...)` | Shift column so mean is zero and variance is 1. ||[^7] ||
| `standardize(...)` | Alias of `scale`. |[^8] |||
| `poly(...)` | Generates a polynomial basis, allowing non-linear fits. ||||
| `bs(...)` | Generates a B-Spline basis, allowing non-linear fits. ||||
| `cr(...)` | Generates a natural cubic spline basis, allowing non-linear fits. ||||
Expand Down Expand Up @@ -116,7 +116,8 @@ and conventions of which you should be aware.
[^1]: This "operator" is actually part of the tokenisation process.
[^2]: Formulaic additionally supports quoted fields with special characters, e.g. `` my_func(`my|special+column`) ``.
[^3]: The caret operator is not supported, but will not cause an error. It is ignored by the patsy formula parser, and treated as XOR Python operation on column.
[^4]: This somewhat confusing operator is useful when you want to include hierachical features in your data, and where certain interaction terms do not make sense (particularly in ANOVA contexts). For example, if `a` represents countries, and `b` represents cities, then the full product of terms from `a * b === a + b + a:b` does not make sense, because any value of `b` is guaranteed to coincide with a value in `a`, and does not independently add value. Thus, the operation `a / b === a + a:b` results in more sensible dataset. As a result, the `/` operator is right-distributive, since if `b` and `c` were both nested in `a`, you would want `a/(b+c) === a + a:b + a:c`. Likewise, the operator is not left-distributive, since if `c` is nested under both `a` and `b` separately, then you want `(a + b)/c === a + b + a:b:c`. Lastly, if `c` is nested in `b`, and `b` is nested in `a`, then you would want `a/b/c === a + a:(b/c) === a + a:b + a:b:c`.
[^5]: Implemented by an R package called [Formula](https://cran.r-project.org/web/packages/Formula/index.html) that extends the default formula syntax.
[^6]: Patsy uses the `rescale` keyword rather than `scale`, but provides the same functionality.
[^7]: For increased compatibility with patsy, we use patsy's signature for `standardize`.
[^4]: Note that Formulaic also allows you to use this to scale columns, for example: `2.5:a` (this scaling happens after factor coding).
[^5]: This somewhat confusing operator is useful when you want to include hierachical features in your data, and where certain interaction terms do not make sense (particularly in ANOVA contexts). For example, if `a` represents countries, and `b` represents cities, then the full product of terms from `a * b === a + b + a:b` does not make sense, because any value of `b` is guaranteed to coincide with a value in `a`, and does not independently add value. Thus, the operation `a / b === a + a:b` results in more sensible dataset. As a result, the `/` operator is right-distributive, since if `b` and `c` were both nested in `a`, you would want `a/(b+c) === a + a:b + a:c`. Likewise, the operator is not left-distributive, since if `c` is nested under both `a` and `b` separately, then you want `(a + b)/c === a + b + a:b:c`. Lastly, if `c` is nested in `b`, and `b` is nested in `a`, then you would want `a/b/c === a + a:(b/c) === a + a:b + a:b:c`.
[^6]: Implemented by an R package called [Formula](https://cran.r-project.org/web/packages/Formula/index.html) that extends the default formula syntax.
[^7]: Patsy uses the `rescale` keyword rather than `scale`, but provides the same functionality.
[^8]: For increased compatibility with patsy, we use patsy's signature for `standardize`.
62 changes: 62 additions & 0 deletions formulaic/parser/parser.py
Original file line number Diff line number Diff line change
Expand Up @@ -5,6 +5,8 @@
from dataclasses import dataclass, field
from typing import List, Iterable, Sequence, Tuple, Union, cast

from formulaic.parser.types.factor import Factor

from .algos.tokenize import tokenize
from .types import (
FormulaParser,
Expand Down Expand Up @@ -114,6 +116,66 @@ def get_tokens(self, formula: str) -> Iterable[Token]:

return tokens

def get_terms(self, formula: str) -> Structured[List[Term]]:
"""
Assemble the `Term` instances for a formula string. Depending on the
operators involved, this may be an iterable of `Term` instances, or
an iterable of iterables of `Term`s, etc.
This implementation also verifies that the formula is well-formed, in
that it does not have any literals apart from 1 or numeric scaling of
other terms.
Args:
formula: The formula for which an AST should be generated.
"""
terms = super().get_terms(formula)

def check_terms(terms: Iterable[Term]) -> None:
seen_terms = set()
for term in terms:
if len(term.factors) == 1:
factor = term.factors[0]
if (
factor.eval_method is Factor.EvalMethod.LITERAL
and factor.expr != "1"
):
raise exc_for_token(
factor.token or Token(),
"Numeric literals other than `1` can only be used "
"to scale other terms. (tip: Use `:` rather than "
"`*` when scaling terms)"
if factor.expr.replace(".", "", 1).isnumeric()
else "String literals are not valid in formulae.",
)
else:
for factor in term.factors:
if (
factor.eval_method is Factor.EvalMethod.LITERAL
and not factor.expr.replace(".", "", 1).isnumeric()
):
raise exc_for_token(
factor.token or Token(),
"String literals are not valid in formulae.",
)

term_hash = tuple(
factor.expr
for factor in term.factors
if factor.eval_method != Factor.EvalMethod.LITERAL
)
if term_hash in seen_terms:
raise exc_for_token(
term.factors[0].token or Token(),
"Term already seen with a different numerical scaling. "
"(tip: Use `:` rather than `*` when scaling terms)",
)
seen_terms.add(term_hash)

terms._map(check_terms)

return terms


class DefaultOperatorResolver(OperatorResolver):
"""
Expand Down
37 changes: 37 additions & 0 deletions tests/parser/test_parser.py
Original file line number Diff line number Diff line change
Expand Up @@ -73,6 +73,7 @@
"(a+b)**2": ["1", "a", "a:b", "b"],
"(a+b)^2": ["1", "a", "a:b", "b"],
"(a+b)**3": ["1", "a", "a:b", "b"],
"50:a": ["1", "50:a"],
# Nested products
"a/b": ["1", "a", "a:b"],
"(b+a)/c": ["1", "b", "a", "b:a:c"],
Expand Down Expand Up @@ -162,6 +163,42 @@ def test_long_formula(self):
terms = PARSER_NO_INTERCEPT.get_terms(expr)
assert {str(term) for term in terms} == names

def test_invalid_literals(self):
with pytest.raises(
FormulaSyntaxError,
match=re.escape(
"Numeric literals other than `1` can only be used to scale other terms."
),
):
PARSER.get_terms("50")
with pytest.raises(
FormulaSyntaxError,
match=re.escape("String literals are not valid in formulae."),
):
PARSER.get_terms("'asd'")
with pytest.raises(
FormulaSyntaxError,
match=re.escape("Term already seen with a different numerical scaling."),
):
PARSER.get_terms("1 * a")
with pytest.raises(
FormulaSyntaxError,
match=re.escape(
"Numeric literals other than `1` can only be used to scale other terms."
),
):
PARSER.get_terms("50*a")
with pytest.raises(
FormulaSyntaxError,
match=re.escape("String literals are not valid in formulae."),
):
PARSER.get_terms("'asd':a")
with pytest.raises(
FormulaSyntaxError,
match=re.escape("Term already seen with a different numerical scaling."),
):
PARSER.get_terms("50:a + 100:a")


class TestDefaultOperatorResolver:
@pytest.fixture
Expand Down

0 comments on commit 3f5b1bc

Please sign in to comment.