Simplify character start position tracking
Fixes #2175

Moves the char start update to the token emit (on token completion), rather than to tokeniser state transitions. Multiple state transitions may happen for the same token as the tokeniser unwinds itself from an invalid state, which would inadvertently clear the char start position.
jhy committed Jul 29, 2024
1 parent dcf190c commit dc3b6c5
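
The behaviour described in the commit message can be illustrated with a small, self-contained toy model. This is a sketch, not jsoup code: the `Tracker` class, its method names, and the exact order of events after `<p/>` (a second transition to Data followed by a synthetic end-tag emit) are assumptions made for illustration. Only the two bookkeeping rules are taken from the diff: the previous approach cleared the char start on emit and set it again on entering Data, while the new approach sets it to the reader position on every emit.

```java
public class CharStartSketch {
    static final int UNSET = -1;

    /** Toy stand-in for the tokeniser's position bookkeeping; not jsoup code. */
    static final class Tracker {
        final boolean updateOnEmit; // true: approach in this commit; false: previous approach
        int pos = 0;                // stands in for reader.pos()
        int charStartPos = 0;

        Tracker(boolean updateOnEmit) {
            this.updateOnEmit = updateOnEmit;
        }

        void advance(int chars) {
            pos += chars;
        }

        void emitToken() {
            // previous approach cleared the marker; the new one moves it to the current position
            charStartPos = updateOnEmit ? pos : UNSET;
        }

        void transitionToData() {
            // previous approach only: set the marker when entering Data, if it was cleared
            if (!updateOnEmit && charStartPos == UNSET)
                charStartPos = pos;
        }
    }

    /** Replays an assumed event order for "foo<p/>bar" and returns the recorded start of the trailing text. */
    static int charStartAfterSelfClose(boolean updateOnEmit) {
        Tracker t = new Tracker(updateOnEmit);
        t.advance(7);          // consume "foo<p/>"
        t.emitToken();         // start tag <p/> emitted
        t.transitionToData();  // tokeniser returns to Data
        t.transitionToData();  // assumed: Data is forced again for the invalid self-close...
        t.emitToken();         // ...and a synthetic end tag is emitted
        t.advance(3);          // consume "bar" with no further state transition
        return t.charStartPos;
    }

    public static void main(String[] args) {
        System.out.println("previous: " + charStartAfterSelfClose(false)); // -1 (unset)
        System.out.println("updated:  " + charStartAfterSelfClose(true));  // 7
    }
}
```

In this model the previous rule leaves the marker unset once the second emit has happened, because nothing re-enters Data before the text is read. Tying the update to the emit itself removes the dependence on how many transitions occur.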
Showing 3 changed files with 16 additions and 12 deletions.
2 changes: 2 additions & 0 deletions CHANGES.md
@@ -10,6 +10,8 @@

 * `Element.cssSelector()` would fail if the element's class contained a `*`
   character. [2169](https://github.com/jhy/jsoup/issues/2169)
+* When tracking source ranges, a text node following an invalid self-closing element may be left
+  untracked.[2175](https://github.com/jhy/jsoup/issues/2175)

 ## 1.18.1 (2024-Jul-10)

17 changes: 5 additions & 12 deletions src/main/java/org/jsoup/parser/Tokeniser.java
@@ -49,8 +49,7 @@ final class Tokeniser {
     @Nullable private String lastStartTag; // the last start tag emitted, to test appropriate end tag
     @Nullable private String lastStartCloseSeq; // "</" + lastStartTag, so we can quickly check for that in RCData

-    private static final int Unset = -1;
-    private int markupStartPos, charStartPos = 0; // reader pos at the start of markup / characters. updated on state transition. Initialized to start (0), but set to Unset after emissions.
+    private int markupStartPos, charStartPos = 0; // reader pos at the start of markup / characters. markup updated on state transition, char on token emit.

     Tokeniser(TreeBuilder treeBuilder) {
         tagPending = startPending = new Token.StartTag(treeBuilder);
@@ -90,7 +89,7 @@ void emit(Token token) {
         isEmitPending = true;
         token.startPos(markupStartPos);
         token.endPos(reader.pos());
-        charStartPos = Unset;
+        charStartPos = reader.pos(); // update char start when we complete a token emit

         if (token.type == Token.TokenType.StartTag) {
             Token.StartTag startTag = (Token.StartTag) token;
@@ -158,15 +157,9 @@ TokeniserState getState() {
     }

     void transition(TokeniserState newState) {
-        // track markup / data position on state transitions
-        switch (newState) {
-            case TagOpen:
-                markupStartPos = reader.pos();
-                break;
-            case Data:
-                if (charStartPos == Unset) // don't reset when we are jumping between e.g data -> char ref -> data
-                    charStartPos = reader.pos();
-        }
+        // track markup position on state transitions
+        if (newState == TokeniserState.TagOpen)
+            markupStartPos = reader.pos();

         this.state = newState;
     }
9 changes: 9 additions & 0 deletions src/test/java/org/jsoup/parser/PositionTest.java
@@ -487,6 +487,15 @@ private void printRange(Node node) {
         assertEquals("h1:0-9~12-17; id:4-6=7-8; #text:9-12; #text:17-18; h2:18-27~30-35; id:22-24=25-26; #text:27-30; h10:35-40~43-49; #text:40-43; ", track.toString());
     }

+    @Test void tracksAfterPSelfClose() {
+        // https://github.com/jhy/jsoup/issues/2175
+        String html = "foo<p/>bar &amp; 2";
+        Document doc = Jsoup.parse(html, TrackingHtmlParser);
+        StringBuilder track = new StringBuilder();
+        doc.body().forEachNode(node -> accumulatePositions(node, track));
+        assertEquals("body:0-0~18-18; #text:0-3; p:3-7~3-7; #text:7-18; ", track.toString());
+    }
+
     @Test void tracksFirstTextnode() {
         // https://github.com/jhy/jsoup/issues/2106
         String html = "foo<p></p>bar<p></p><div><b>baz</b></div>";
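
The offsets asserted in `tracksAfterPSelfClose` can be checked by hand against the raw input string. The sketch below recomputes the text and `p` portions of the expected value from plain character offsets. It assumes the format reads as `name:start-end~endTagStart-endTagEnd`, as the neighbouring `h1:0-9~12-17` assertion suggests. `RangeOffsets` and `expectedRanges` are hypothetical names and not part of the jsoup test suite, and the `body` entry is left out.

```java
public class RangeOffsets {
    /** Builds the text / element portion of the expected tracking string from raw character offsets. */
    static String expectedRanges(String html, String selfClosedTag) {
        int tagStart = html.indexOf(selfClosedTag);      // where "<p/>" begins
        int tagEnd = tagStart + selfClosedTag.length();  // just past "<p/>"
        return "#text:0-" + tagStart + "; "
            // the self-closed element has no separate end tag, so both ranges coincide
            + "p:" + tagStart + "-" + tagEnd + "~" + tagStart + "-" + tagEnd + "; "
            // the trailing text runs to the end of the input, entity included
            + "#text:" + tagEnd + "-" + html.length() + "; ";
    }

    public static void main(String[] args) {
        System.out.println(expectedRanges("foo<p/>bar &amp; 2", "<p/>"));
    }
}
```

The trailing text range ends at 18 because source positions count the characters of the original input, so `&amp;` contributes five characters rather than one.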
