Merge branch 'any-code-point'
pdubroy committed Mar 3, 2023
2 parents c28cd81 + b328e96 commit 5c8926c
Showing 6 changed files with 48 additions and 5 deletions.
1 change: 1 addition & 0 deletions CHANGELOG.md
@@ -4,6 +4,7 @@

### Breaking changes:

- [#424]: `any` now consumes an entire code point (i.e., a full Unicode character), not just a single, 16-bit code unit.
- [55c787b]: The namespace helpers (`namespace`, `extendNamespace`) have been removed. (These were always optional.)
- [bea0be9]: When used as an ES module, the main 'ohm-js' module now has _only_ named exports (i.e., no default export). The same is true for `ohm-js/extras`.
- [#395]: In generated type definitions, action dictionary types now inherit from `BaseActionDict<T>`, a new supertype of `ActionDict<T>`.
18 changes: 18 additions & 0 deletions doc/releases/ohm-js-17.0.md
@@ -6,6 +6,24 @@ This version also has experimental support for indentation-sensitive grammars.

## Upgrading

### `any` now consumes a full code point

In JavaScript, a string is a sequence of 16-bit code units. Some Unicode characters, such as emoji, are encoded as pairs of 16-bit values (surrogate pairs). For example, the string `'😆'` has length 2, but contains a single Unicode code point. Previously, `any` matched a single 16-bit code unit, even if that unit was one half of a surrogate pair. In v17, `any` matches a full code point (i.e., a full Unicode character).

Old behaviour:

```js
const g = ohm.grammar('OneChar { start = any }');
g.match('😆').succeeded(); // false
```

New behaviour (Ohm v17+):

```js
const g = ohm.grammar('OneChar { start = any }');
g.match('😆').succeeded(); // true
```
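
The difference between code units and code points can be checked in plain JavaScript, without Ohm. The following is a minimal sketch; the helper names `countCodeUnits` and `countCodePoints` are made up for illustration and are not part of any Ohm API:

```js
// Hypothetical helpers (not part of Ohm) for inspecting a string.
const countCodeUnits = s => s.length; // UTF-16 code units
const countCodePoints = s => [...s].length; // string iteration yields code points

const laugh = '😆';
console.log(countCodeUnits(laugh)); // 2
console.log(countCodePoints(laugh)); // 1
console.log(laugh.codePointAt(0).toString(16)); // '1f606'
console.log(laugh.charCodeAt(0).toString(16)); // 'd83d', the high surrogate
```

A grammar that relied on `any` matching exactly one code unit may need revisiting for inputs where these two counts differ.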

### Namespace helpers removed

The top-level `namespace` and `extendNamespace` functions have been removed. They were never required — it was always possible to use a plain old object in any API that asked for a namespace.
4 changes: 3 additions & 1 deletion doc/syntax-reference.md
@@ -146,7 +146,9 @@ as well as multiline (`/* */`) comments like:

(See [src/built-in-rules.ohm](https://github.com/harc/ohm/blob/main/packages/ohm-js/src/built-in-rules.ohm).)

-`any`: Matches the next character in the input stream, if one exists.
+`any`: Matches the next Unicode character (i.e., a single code point) in the input stream, if one exists.
+
+**NOTE:** A JavaScript string is a sequence of 16-bit _code units_. Some Unicode characters, such as emoji, are encoded as pairs of 16-bit values. For example, the string `'😆'` has length 2, but contains a single Unicode code point. Prior to Ohm v17, `any` always consumed a single 16-bit code unit, rather than a full Unicode character.

`letter`: Matches a single character which is a letter (either uppercase or lowercase).

Expand Down
2 changes: 1 addition & 1 deletion packages/ohm-js/package.json
@@ -1,6 +1,6 @@
 {
   "name": "ohm-js",
-  "version": "17.0.0",
+  "version": "17.0.1",
   "description": "An object-oriented language for parsing and pattern matching",
   "repository": "https://github.com/harc/ohm",
   "keywords": [
6 changes: 3 additions & 3 deletions packages/ohm-js/src/pexprs-eval.js
@@ -29,9 +29,9 @@ pexprs.PExpr.prototype.eval = common.abstract('eval'); // function(state) { ...
 pexprs.any.eval = function(state) {
   const {inputStream} = state;
   const origPos = inputStream.pos;
-  const ch = inputStream.next();
-  if (ch) {
-    state.pushBinding(new TerminalNode(ch.length), origPos);
+  const cp = inputStream.nextCodePoint();
+  if (cp !== undefined) {
+    state.pushBinding(new TerminalNode(String.fromCodePoint(cp).length), origPos);
     return true;
   } else {
     state.processFailure(origPos, this);
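
The body of `nextCodePoint` is not part of this diff. As a rough illustration of what such a method could do, here is a minimal sketch built on the standard `String.prototype.codePointAt`; the class `SketchInputStream` is hypothetical and is not Ohm's actual input stream:

```js
// Hypothetical stand-in for an input stream; NOT Ohm's actual implementation.
class SketchInputStream {
  constructor(source) {
    this.source = source;
    this.pos = 0; // index into the string, measured in UTF-16 code units
  }

  // Returns the next code point as a number, or undefined at end of input.
  nextCodePoint() {
    const cp = this.source.codePointAt(this.pos);
    if (cp !== undefined) {
      // Advance by 1 or 2 code units, depending on the code point's width.
      this.pos += String.fromCodePoint(cp).length;
    }
    return cp;
  }
}

const stream = new SketchInputStream('😆!');
console.log(stream.nextCodePoint(), stream.pos); // 128518 2
console.log(stream.nextCodePoint(), stream.pos); // 33 3
console.log(stream.nextCodePoint()); // undefined
```

Because a code point is a number, a truthiness check would treat U+0000 as end of input, which is presumably why the new code compares against `undefined` instead of testing `if (cp)`.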
22 changes: 22 additions & 0 deletions packages/ohm-js/test/test-ohm-syntax.js
@@ -256,6 +256,28 @@ test('ranges w/ code points > 0xFFFF, special cases', t => {
  assertSucceeds(t, g2.match('\u{D83D}x'));
});

test('any consumes an entire code point', t => {
  const g = ohm.grammar('G { start = any any }');
  const re = /../u; // The regex equivalent of `any any`.

  t.is('😇'.length, 2);
  t.is('😇!'.length, 3);
  t.is('😇😇'.length, 4);

  t.is(g.match('😇😇').succeeded(), true);
  t.truthy(re.exec('😇😇'));

  t.is(g.match('😇!').succeeded(), true);
  t.truthy(re.exec('😇!'));

  t.is(g.match('!😇').succeeded(), true);
  t.truthy(re.exec('!😇'));

  t.is('👋🏿'.length, 4); // Skin color modifier is a separate code point.
  t.is(g.match('👋🏿').succeeded(), true);
  t.truthy(re.exec('👋🏿'));
});
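
The test above leans on the regex `u` flag as a reference for code point matching. A small standalone sketch of that analogy follows, using anchored patterns so the match must cover the whole string; the variable names are illustrative only:

```js
// With the `u` flag, `.` matches one code point; without it, one code unit.
const twoCodePoints = /^..$/u;
const twoCodeUnits = /^..$/;

console.log(twoCodePoints.test('😇😇')); // true: two code points
console.log(twoCodeUnits.test('😇😇')); // false: four code units
console.log(twoCodeUnits.test('😇')); // true: one code point, two code units
console.log(twoCodePoints.test('👋🏿')); // true: base emoji plus modifier
```

The last line shows that a code point is not the same as a user-perceived character: the waving hand with a skin tone modifier displays as one glyph but is two code points, so it takes two `any`s to consume it.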

describe('alt', test => {
  const m = ohm.grammar('M { altTest = "a" | "b" }');
  const s = m.createSemantics().addAttribute('v', {