Do not quietly swallow literals.

matthewwardrop · Dec 20, 2023 · 3f5b1bc · 3f5b1bc
1 parent fd8cd40
commit 3f5b1bc
Show file tree

Hide file tree

Showing 3 changed files with 109 additions and 9 deletions.
diff --git a/docsite/docs/guides/grammar.md b/docsite/docs/guides/grammar.md
@@ -31,18 +31,18 @@ unless otherwise indicated.
 | ** | 2 | Includes all n-th order interactions of the terms in the left operand, where n is the (integral) value of the right operand, e.g. `(a+b+c)**2` is equivalent to `a + b + c + a:b + a:c + b:c`. | ✓ | ✓ | ✓ |
 | ^ | 2 | Alias for `**`. | ✓ | ✗[^3] | ✓ |
 |-----|
-| `:` | 2 | Adds a new term that corresponds to the interaction of its operands (i.e. their elementwise product). | ✓ | ✓ | ✓ |
+| `:` | 2 | Adds a new term that corresponds to the interaction of its operands (i.e. their elementwise product). | ✓[^4] | ✓ | ✓ |
 |-----|
 | `*` | 2 | Includes terms for each of the additive and interactive effects of the left and right operands, e.g. `a * b` is equivalent to `a + b + a:b`. | ✓ | ✓ | ✓ |
-| `/` | 2 | Adds terms describing nested effects. It expands to the addition of a new term for the left operand and the interaction of all left operand terms with the right operand, i.e `a / b` is equivalent to `a + a:b`, `(a + b) / c` is equivalent to `a + b + a:b:c`, and `a/(b+c)` is equivalent to `a + a:b + a:c`.[^4] | ✓ | ✓ | ✓ |
+| `/` | 2 | Adds terms describing nested effects. It expands to the addition of a new term for the left operand and the interaction of all left operand terms with the right operand, i.e `a / b` is equivalent to `a + a:b`, `(a + b) / c` is equivalent to `a + b + a:b:c`, and `a/(b+c)` is equivalent to `a + a:b + a:c`.[^5] | ✓ | ✓ | ✓ |
 | `%in%` | 2 | As above, but with arguments inverted: e.g. `b %in% a` is equivalent to `a / b`. | ✓ | ✗ | ✓ |
 |-----|
 | `+` | 2 | Adds a new term to the set of features. | ✓ | ✓ | ✓ |
 | `-` | 2 | Removes a term from the set of features (if present). | ✓ | ✓ | ✓ |
 | `+` | 1 | Returns the current term unmodified (not very useful). | ✓ | ✓ | ✓ |
 | `-` | 1 | Negates a term (only implemented for 0, in which case it is replaced with `1`). | ✓ | ✓ | ✓ |
 |-----|
-| `|` | 2 | Splits a formula into multiple parts, allowing the simultaneous generation of multiple model matrices. When on the right-hand-side of the `~` operator, all parts will attract an additional intercept term by default. | ✓ | ✗ | ✓[^5] |
+| `|` | 2 | Splits a formula into multiple parts, allowing the simultaneous generation of multiple model matrices. When on the right-hand-side of the `~` operator, all parts will attract an additional intercept term by default. | ✓ | ✗ | ✓[^6] |
 |-----|
 | `~` | 1,2 | Separates the target features from the input features. If absent, it is assumed that we are considering only the the input features. Unless otherwise indicated, it is assumed that the input features implicitly include an intercept. | ✓ | ✓ | ✓ |
 
@@ -60,8 +60,8 @@ that have *not* been implemented by `formulaic` are explicitly noted also.
 | `Q('<column_name>')` | Look up feature by potentially exotic name, e.g. `Q('wacky name!')`. Note that in `formulaic`, it is more idiomatic to use ``` `wacky name!` ```. | ✓ | ✓ | ✗ |
 | `C(...)` | Categorically encode a column, e.g. `C(x)` | ✓ | ✓ | ✓ |
 | `center(...)` | Shift column data so mean is zero. | ✓ | ✓ | ✗ |
-| `scale(...)` | Shift column so mean is zero and variance is 1. | ✓ | ✓[^6] | ✓ |
-| `standardize(...)` | Alias of `scale`. | ✓[^7] | ✓ | ✗ |
+| `scale(...)` | Shift column so mean is zero and variance is 1. | ✓ | ✓[^7] | ✓ |
+| `standardize(...)` | Alias of `scale`. | ✓[^8] | ✓ | ✗ |
 | `poly(...)` | Generates a polynomial basis, allowing non-linear fits. | ✓ | ✗ | ✓ |
 | `bs(...)` | Generates a B-Spline basis, allowing non-linear fits. | ✓ | ✓ | ✓ |
 | `cr(...)` | Generates a natural cubic spline basis, allowing non-linear fits. | ✗ | ✓ | ✓ |
@@ -116,7 +116,8 @@ and conventions of which you should be aware.
 [^1]: This "operator" is actually part of the tokenisation process.
 [^2]: Formulaic additionally supports quoted fields with special characters, e.g. `` my_func(`my|special+column`) ``.
 [^3]: The caret operator is not supported, but will not cause an error. It is ignored by the patsy formula parser, and treated as XOR Python operation on column.
-[^4]: This somewhat confusing operator is useful when you want to include hierachical features in your data, and where certain interaction terms do not make sense (particularly in ANOVA contexts). For example, if `a` represents countries, and `b` represents cities, then the full product of terms from `a * b === a + b + a:b` does not make sense, because any value of `b` is guaranteed to coincide with a value in `a`, and does not independently add value. Thus, the operation `a / b === a + a:b` results in more sensible dataset. As a result, the `/` operator is right-distributive, since if `b` and `c` were both nested in `a`, you would want `a/(b+c) === a + a:b + a:c`. Likewise, the operator is not left-distributive, since if `c` is nested under both `a` and `b` separately, then you want `(a + b)/c === a + b + a:b:c`. Lastly, if `c` is nested in `b`, and `b` is nested in `a`, then you would want `a/b/c === a + a:(b/c) === a + a:b + a:b:c`.
-[^5]: Implemented by an R package called [Formula](https://cran.r-project.org/web/packages/Formula/index.html) that extends the default formula syntax.
-[^6]: Patsy uses the `rescale` keyword rather than `scale`, but provides the same functionality.
-[^7]: For increased compatibility with patsy, we use patsy's signature for `standardize`.
+[^4]: Note that Formulaic also allows you to use this to scale columns, for example: `2.5:a` (this scaling happens after factor coding).
+[^5]: This somewhat confusing operator is useful when you want to include hierachical features in your data, and where certain interaction terms do not make sense (particularly in ANOVA contexts). For example, if `a` represents countries, and `b` represents cities, then the full product of terms from `a * b === a + b + a:b` does not make sense, because any value of `b` is guaranteed to coincide with a value in `a`, and does not independently add value. Thus, the operation `a / b === a + a:b` results in more sensible dataset. As a result, the `/` operator is right-distributive, since if `b` and `c` were both nested in `a`, you would want `a/(b+c) === a + a:b + a:c`. Likewise, the operator is not left-distributive, since if `c` is nested under both `a` and `b` separately, then you want `(a + b)/c === a + b + a:b:c`. Lastly, if `c` is nested in `b`, and `b` is nested in `a`, then you would want `a/b/c === a + a:(b/c) === a + a:b + a:b:c`.
+[^6]: Implemented by an R package called [Formula](https://cran.r-project.org/web/packages/Formula/index.html) that extends the default formula syntax.
+[^7]: Patsy uses the `rescale` keyword rather than `scale`, but provides the same functionality.
+[^8]: For increased compatibility with patsy, we use patsy's signature for `standardize`.
diff --git a/formulaic/parser/parser.py b/formulaic/parser/parser.py
@@ -5,6 +5,8 @@
 from dataclasses import dataclass, field
 from typing import List, Iterable, Sequence, Tuple, Union, cast
 
+from formulaic.parser.types.factor import Factor
+
 from .algos.tokenize import tokenize
 from .types import (
     FormulaParser,
@@ -114,6 +116,66 @@ def get_tokens(self, formula: str) -> Iterable[Token]:
 
         return tokens
 
+    def get_terms(self, formula: str) -> Structured[List[Term]]:
+        """
+        Assemble the `Term` instances for a formula string. Depending on the
+        operators involved, this may be an iterable of `Term` instances, or
+        an iterable of iterables of `Term`s, etc.
+
+        This implementation also verifies that the formula is well-formed, in
+        that it does not have any literals apart from 1 or numeric scaling of
+        other terms.
+
+        Args:
+            formula: The formula for which an AST should be generated.
+        """
+        terms = super().get_terms(formula)
+
+        def check_terms(terms: Iterable[Term]) -> None:
+            seen_terms = set()
+            for term in terms:
+                if len(term.factors) == 1:
+                    factor = term.factors[0]
+                    if (
+                        factor.eval_method is Factor.EvalMethod.LITERAL
+                        and factor.expr != "1"
+                    ):
+                        raise exc_for_token(
+                            factor.token or Token(),
+                            "Numeric literals other than `1` can only be used "
+                            "to scale other terms. (tip: Use `:` rather than "
+                            "`*` when scaling terms)"
+                            if factor.expr.replace(".", "", 1).isnumeric()
+                            else "String literals are not valid in formulae.",
+                        )
+                else:
+                    for factor in term.factors:
+                        if (
+                            factor.eval_method is Factor.EvalMethod.LITERAL
+                            and not factor.expr.replace(".", "", 1).isnumeric()
+                        ):
+                            raise exc_for_token(
+                                factor.token or Token(),
+                                "String literals are not valid in formulae.",
+                            )
+
+                term_hash = tuple(
+                    factor.expr
+                    for factor in term.factors
+                    if factor.eval_method != Factor.EvalMethod.LITERAL
+                )
+                if term_hash in seen_terms:
+                    raise exc_for_token(
+                        term.factors[0].token or Token(),
+                        "Term already seen with a different numerical scaling. "
+                        "(tip: Use `:` rather than `*` when scaling terms)",
+                    )
+                seen_terms.add(term_hash)
+
+        terms._map(check_terms)
+
+        return terms
+
 
 class DefaultOperatorResolver(OperatorResolver):
     """

diff --git a/tests/parser/test_parser.py b/tests/parser/test_parser.py
@@ -73,6 +73,7 @@
     "(a+b)**2": ["1", "a", "a:b", "b"],
     "(a+b)^2": ["1", "a", "a:b", "b"],
     "(a+b)**3": ["1", "a", "a:b", "b"],
+    "50:a": ["1", "50:a"],
     # Nested products
     "a/b": ["1", "a", "a:b"],
     "(b+a)/c": ["1", "b", "a", "b:a:c"],
@@ -162,6 +163,42 @@ def test_long_formula(self):
         terms = PARSER_NO_INTERCEPT.get_terms(expr)
         assert {str(term) for term in terms} == names
 
+    def test_invalid_literals(self):
+        with pytest.raises(
+            FormulaSyntaxError,
+            match=re.escape(
+                "Numeric literals other than `1` can only be used to scale other terms."
+            ),
+        ):
+            PARSER.get_terms("50")
+        with pytest.raises(
+            FormulaSyntaxError,
+            match=re.escape("String literals are not valid in formulae."),
+        ):
+            PARSER.get_terms("'asd'")
+        with pytest.raises(
+            FormulaSyntaxError,
+            match=re.escape("Term already seen with a different numerical scaling."),
+        ):
+            PARSER.get_terms("1 * a")
+        with pytest.raises(
+            FormulaSyntaxError,
+            match=re.escape(
+                "Numeric literals other than `1` can only be used to scale other terms."
+            ),
+        ):
+            PARSER.get_terms("50*a")
+        with pytest.raises(
+            FormulaSyntaxError,
+            match=re.escape("String literals are not valid in formulae."),
+        ):
+            PARSER.get_terms("'asd':a")
+        with pytest.raises(
+            FormulaSyntaxError,
+            match=re.escape("Term already seen with a different numerical scaling."),
+        ):
+            PARSER.get_terms("50:a + 100:a")
+
 
 class TestDefaultOperatorResolver:
     @pytest.fixture