Merge branch 'master' into Element-stream
Isira-Seneviratne authored Dec 4, 2024
2 parents 2fde33f + 33d0d46 commit 7c96316
Showing 27 changed files with 271 additions and 99 deletions.
1 change: 1 addition & 0 deletions .gitignore
@@ -9,3 +9,4 @@ target/
*Thrash*
bin/
.vscode/
.java-version
28 changes: 24 additions & 4 deletions CHANGES.md
@@ -1,12 +1,19 @@
# jsoup Changelog

## 1.18.2 (Pending)
## 1.18.3 (PENDING)

### Bug Fixes

* When serializing to XML, attribute names containing `-`, `.`, or digits were incorrectly marked as invalid and
removed. [2235](https://github.com/jhy/jsoup/issues/2235)
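
As an illustration of the rule behind this fix, the sketch below checks an attribute name against the pattern quoted in the `Attribute.isValidXmlKey` comment, `[a-zA-Z_:][-a-zA-Z0-9_:.]*`. It is a standalone approximation using `java.util.regex`, not jsoup's implementation (which scans characters directly); the class name `XmlKeySketch` is hypothetical.

```java
import java.util.regex.Pattern;

final class XmlKeySketch {
    // First character: letter, '_' or ':'. Remaining characters may also be digits, '-' or '.'.
    private static final Pattern VALID_XML_KEY = Pattern.compile("[a-zA-Z_:][-a-zA-Z0-9_:.]*");

    /** Returns true if the whole key matches the pattern; an empty key never matches. */
    static boolean isValidXmlKey(String key) {
        return VALID_XML_KEY.matcher(key).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValidXmlKey("data-id.v2")); // true
        System.out.println(isValidXmlKey("1st"));        // false: must not start with a digit
    }
}
```

Under the pre-fix check, a name such as `data-id.v2` would have been rejected because `-`, `.`, and digits were missing from the set allowed after the first character.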

## 1.18.2 (2024-Nov-27)

### Improvements

* Optimized the throughput and memory use throughout the input read and parse flows, with heap allocations and GC
down between -6% and -89%, and throughput improved up to +143% for small inputs. Most input sizes will see
throughput increases of ~ 20%. These performance improvements come through recycling the backing byte[] and char[]
throughput increases of ~ 20%. These performance improvements come through recycling the backing `byte[]` and `char[]`
arrays used to read and parse the input. [2186](https://github.com/jhy/jsoup/pull/2186)
* Speed optimized `html()` and `Entities.escape()` when the input contains UTF characters in a supplementary plane, by
around 49%. [2183](https://github.com/jhy/jsoup/pull/2183)
@@ -15,6 +22,8 @@
* In the `TreeBuilder`, the `onNodeInserted()` and `onNodeClosed()` events are now also fired for the outermost /
root `Document` node. This enables source position tracking on the Document node (which was previously unset). And
it also enables the node traversor to see the outer Document node. [2182](https://github.com/jhy/jsoup/pull/2182)
* Selected Elements can now be position swapped inline using
`Elements#set()`. [2212](https://github.com/jhy/jsoup/issues/2212)

### Bug Fixes

@@ -29,8 +38,19 @@
children. [2187](https://github.com/jhy/jsoup/issues/2187)
* A selector query that included multiple `:has()` components in a nested `:has()` might incorrectly
execute. [2131](https://github.com/jhy/jsoup/issues/2131)
* Updated the simple view of cookies available via `Connection.Response#cookies()` to reflect the contents of the
current cookie jar for the current URL. [1831](https://github.com/jhy/jsoup/issues/1831)
* When cookie names in a response are duplicated, the simple view of cookies available via
`Connection.Response#cookies()` will provide the last one set. Generally it is better to use
the [Jsoup.newSession](https://jsoup.org/cookbook/web/request-session) method to maintain a cookie jar, as that
applies appropriate path selection on cookies when making requests. [1831](https://github.com/jhy/jsoup/issues/1831)
* When parsing named HTML entities, base entities should resolve if they are a prefix of the input token (and not in an
attribute). [2207](https://github.com/jhy/jsoup/issues/2207)
* Fixed incorrect tracking of source ranges for attributes merged from late-occurring elements that were implicitly
created (`html` or `body`). [2204](https://github.com/jhy/jsoup/issues/2204)
* Follow the current HTML specification in the tokenizer to allow `<` as part of a tag name, instead of emitting it as a
character node. [2230](https://github.com/jhy/jsoup/issues/2230)
* Similarly, allow a `<` as the start of an attribute name, vs creating a new element. The previous behavior was
intended to parse closer to what we anticipated the author's intent to be, but that does not align to the spec or to
how browsers behave. [1483](https://github.com/jhy/jsoup/issues/1483)

## 1.18.1 (2024-Jul-10)

15 changes: 8 additions & 7 deletions pom.xml
@@ -5,7 +5,7 @@

<groupId>org.jsoup</groupId>
<artifactId>jsoup</artifactId>
<version>1.18.2-SNAPSHOT</version><!-- remember to update previous version below for japicmp -->
<version>1.19.1-SNAPSHOT</version><!-- remember to update previous version below for japicmp -->
<url>https://jsoup.org/</url>
<description>jsoup is a Java library that simplifies working with real-world HTML and XML. It offers an easy-to-use API for URL fetching, data parsing, extraction, and manipulation using DOM API methods, CSS, and xpath selectors. jsoup implements the WHATWG HTML5 specification, and parses HTML to the same DOM as modern browsers.</description>
<inceptionYear>2009</inceptionYear>
@@ -98,6 +98,7 @@
<ignore>java.io.UncheckedIOException</ignore>
<ignore>java.util.Comparator</ignore> <!-- Comparator.comparingInt() -->
<ignore>java.util.List</ignore> <!-- List#stream() -->
<ignore>java.util.ArrayList</ignore> <!-- List / ArrayList #sort() -->
<ignore>java.util.LinkedHashMap</ignore> <!-- LinkedHashMap#computeIfAbsent() -->
<ignore>java.util.Map</ignore> <!-- Map#computeIfAbsent() -->
<ignore>java.util.Objects</ignore>
@@ -118,7 +119,7 @@
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-javadoc-plugin</artifactId>
<version>3.10.0</version>
<version>3.11.1</version>
<configuration>
<doclint>none</doclint>
<source>8</source>
@@ -203,15 +204,15 @@
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-surefire-plugin</artifactId>
<version>3.5.0</version>
<version>3.5.2</version>
<configuration>
<!-- smaller stack to find stack overflows. Was 256, but Zulu on MacOS ARM needs >= 640 -->
<argLine>-Xss640k</argLine>
</configuration>
</plugin>
<plugin>
<artifactId>maven-failsafe-plugin</artifactId>
<version>3.5.0</version>
<version>3.5.2</version>
<executions>
<execution>
<goals>
@@ -236,7 +237,7 @@
<dependency>
<groupId>org.jsoup</groupId>
<artifactId>jsoup</artifactId>
<version>1.18.1</version>
<version>1.18.2</version>
<type>jar</type>
</dependency>
</oldVersion>
@@ -372,7 +373,7 @@
<plugins>
<plugin>
<artifactId>maven-failsafe-plugin</artifactId>
<version>3.5.0</version>
<version>3.5.2</version>
<executions>
<execution>
<goals>
@@ -393,7 +394,7 @@
<dependency>
<groupId>org.junit.jupiter</groupId>
<artifactId>junit-jupiter</artifactId>
<version>5.11.0</version>
<version>5.11.3</version>
<scope>test</scope>
</dependency>

11 changes: 7 additions & 4 deletions src/main/java/org/jsoup/Connection.java
@@ -376,7 +376,7 @@ <li>returns the appropriate credentials (username and password)</li>
<code><pre>
Connection session = Jsoup.newSession()
.proxy("proxy.example.com", 8080)
.auth(auth -> {
.auth(auth -&gt; {
if (auth.isServer()) { // provide credentials for the request url
Validate.isTrue(auth.url().getHost().equals("example.com"));
// check that we're sending credentials where we expect, and not redirected out
@@ -632,9 +632,12 @@ interface Base<T extends Base<T>> {
T removeCookie(String name);

/**
* Retrieve all of the request/response cookies as a map
* @return cookies
* @see #cookieStore()
Retrieve the request/response cookies as a map. For response cookies, if duplicate cookie names were sent, the
last one set will be the one included. For session management, rather than using these response cookies, prefer
to use {@link Jsoup#newSession()} and related methods.
@return simple cookie map
@see #cookieStore()
*/
Map<String, String> cookies();
}
23 changes: 16 additions & 7 deletions src/main/java/org/jsoup/helper/CookieUtil.java
@@ -2,6 +2,7 @@

import org.jsoup.Connection;
import org.jsoup.internal.StringUtil;
import org.jsoup.parser.TokenQueue;

import java.io.IOException;
import java.net.CookieManager;
@@ -91,13 +92,21 @@ static void storeCookies(HttpConnection.Request req, HttpConnection.Response res
URI uri = CookieUtil.asUri(url);
manager.put(uri, resHeaders); // stores cookies for session

// set up the simple cookie(name, value) map:
Map<String, List<String>> cookieMap = manager.get(uri, resHeaders); // get cookies for url; may have been set on this or earlier requests. the headers here are ignored other than a null check
for (List<String> values : cookieMap.values()) {
for (String headerVal : values) {
List<HttpCookie> cookies = HttpCookie.parse(headerVal);
for (HttpCookie cookie : cookies) {
res.cookie(cookie.getName(), cookie.getValue());
// set up the simple cookies() map
// the response may include cookies that are not relevant to this request, but users may require them if they are not using the cookie manager (setting request cookies only from the simple cookies() response):
for (Map.Entry<String, List<String>> entry : resHeaders.entrySet()) {
String name = entry.getKey();
List<String> values = entry.getValue();
if (name.equalsIgnoreCase("Set-Cookie")) {
for (String value : values) {
if (value == null)
continue;
TokenQueue cd = new TokenQueue(value);
String cookieName = cd.chompTo("=").trim();
String cookieVal = cd.consumeTo(";").trim();
// ignores path, date, domain, validateTLSCertificates et al. full details will be available in cookiestore if required
// name not blank, value not null
res.cookie(cookieName, cookieVal); // if duplicate names, last set will win
}
}
}
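
The "last one set wins" behavior of the simple cookie map can be sketched without jsoup. The example below is a rough stdlib-only approximation: it splits each raw `Set-Cookie` value on the first `=` and the following `;` using `indexOf`, rather than the `TokenQueue` used in the change above. The class and method names are hypothetical, and skipping values that have no name is a choice made for this sketch, not something taken from the diff.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class SimpleCookieSketch {
    /** Builds a name -> value map from raw Set-Cookie values; a later duplicate name overwrites an earlier one. */
    static Map<String, String> simpleCookies(List<String> setCookieValues) {
        Map<String, String> cookies = new LinkedHashMap<>();
        for (String value : setCookieValues) {
            if (value == null) continue;
            int eq = value.indexOf('=');
            if (eq <= 0) continue; // no name before '=' (sketch-only rule)
            String name = value.substring(0, eq).trim();
            if (name.isEmpty()) continue;
            int semi = value.indexOf(';', eq);
            // Everything after the first ';' (Path, Domain, Expires, ...) is ignored here.
            String val = (semi == -1 ? value.substring(eq + 1) : value.substring(eq + 1, semi)).trim();
            cookies.put(name, val);
        }
        return cookies;
    }

    public static void main(String[] args) {
        Map<String, String> cookies = simpleCookies(Arrays.asList(
            "token=abc; Path=/", "lang=en", "token=xyz; Path=/admin"));
        System.out.println(cookies.get("token")); // xyz
    }
}
```

Because path and domain attributes are discarded, a map like this cannot tell which cookie applies to which URL, which is why the changelog recommends a session-backed cookie jar for anything beyond simple reads.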
5 changes: 5 additions & 0 deletions src/main/java/org/jsoup/helper/HttpConnection.java
@@ -1136,6 +1136,11 @@ private Response(HttpURLConnection conn, HttpConnection.Request request, HttpCon
CookieUtil.storeCookies(req, this, url, resHeaders); // add set cookies to cookie store

if (previousResponse != null) { // was redirected
// map previous response cookies into this response cookies() object
for (Map.Entry<String, String> prevCookie : previousResponse.cookies().entrySet()) {
if (!hasCookie(prevCookie.getKey()))
cookie(prevCookie.getKey(), prevCookie.getValue());
}
previousResponse.safeClose();

// enforce too many redirects:
8 changes: 4 additions & 4 deletions src/main/java/org/jsoup/nodes/Attribute.java
@@ -199,13 +199,13 @@ else if (syntax == Syntax.html && !isValidHtmlKey(key)) {
private static boolean isValidXmlKey(String key) {
// =~ [a-zA-Z_:][-a-zA-Z0-9_:.]*
final int length = key.length();
if (length ==0) return false;
if (length == 0) return false;
char c = key.charAt(0);
if (!((c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') || c == '_' || c == ':'))
return false;
for (int i = 1; i < length; i++) {
c = key.charAt(i);
if (!((c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') || c == '_' || c == ':'))
if (!((c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') || (c >= '0' && c <= '9') || c == '-' || c == '_' || c == ':' || c == '.'))
return false;
}
return true;
@@ -214,10 +214,10 @@ private static boolean isValidXmlKey(String key) {
private static boolean isValidHtmlKey(String key) {
// =~ [\x00-\x1f\x7f-\x9f "'/=]+
final int length = key.length();
if (length ==0) return false;
if (length == 0) return false;
for (int i = 0; i < length; i++) {
char c = key.charAt(i);
if (c <= 0x1f || c >= 0x7f && c <= 0x9f || c == ' ' || c == '"' || c == '\'' || c == '/' || c == '=')
if ((c <= 0x1f) || (c >= 0x7f && c <= 0x9f) || c == ' ' || c == '"' || c == '\'' || c == '/' || c == '=')
return false;
}
return true;
19 changes: 19 additions & 0 deletions src/main/java/org/jsoup/nodes/Attributes.java
@@ -389,6 +389,25 @@ public Range.AttributeRange sourceRange(String key) {
return (Map<String, Range.AttributeRange>) userData(AttrRangeKey);
}

/**
Set the source ranges (start to end position) from which this attribute's <b>name</b> and <b>value</b> were parsed.
@param key the attribute name
@param range the range for the attribute's name and value
@return these attributes, for chaining
@since 1.18.2
*/
public Attributes sourceRange(String key, Range.AttributeRange range) {
Validate.notNull(key);
Validate.notNull(range);
Map<String, Range.AttributeRange> ranges = getRanges();
if (ranges == null) {
ranges = new HashMap<>();
userData(AttrRangeKey, ranges);
}
ranges.put(key, range);
return this;
}


@Override
public Iterator<Attribute> iterator() {
24 changes: 24 additions & 0 deletions src/main/java/org/jsoup/nodes/Entities.java
@@ -11,7 +11,9 @@
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;

import static org.jsoup.nodes.Document.OutputSettings.*;
@@ -36,6 +38,9 @@ public class Entities {
private static final char[] codeDelims = {',', ';'};
private static final HashMap<String, String> multipoints = new HashMap<>(); // name -> multiple character references

private static final int BaseCount = 106;
private static final ArrayList<String> baseSorted = new ArrayList<>(BaseCount); // names sorted longest first, for prefix matching

public enum EscapeMode {
/**
* Restricted entities suitable for XHTML output: lt, gt, amp, and quot only.
@@ -50,6 +55,12 @@ public enum EscapeMode {
*/
extended(EntitiesData.fullPoints, 2125);

static {
// sort the base names by length, for prefix matching
Collections.addAll(baseSorted, base.nameKeys);
baseSorted.sort((a, b) -> b.length() - a.length());
}

// table of named references to their codepoints. sorted so we can binary search. built by BuildEntities.
private String[] nameKeys;
private int[] codeVals; // limitation is the few references with multiple characters; those go into multipoints.
@@ -134,6 +145,19 @@ public static int codepointsForName(final String name, final int[] codepoints) {
return 0;
}

/**
Finds the longest base named entity that is a prefix of the input. That is, input "notit" would return "not".
@return longest entity name that is a prefix of the input, or "" if no entity matches
*/
public static String findPrefix(String input) {
for (String name : baseSorted) {
if (input.startsWith(name)) return name;
}
return emptyName;
// if perf critical, could look at using a Trie vs a scan
}

/**
HTML escape an input string. That is, {@code <} is returned as {@code &lt;}. The escaped string is suitable for use
both in attributes and in text data.
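
The longest-prefix lookup added in `findPrefix` can be tried in isolation. The sketch below follows the same approach (sort names longest first, return the first one the input starts with) over a tiny hand-picked list of names; that list is an assumption for illustration only and is not jsoup's base entity table. The class name `EntityPrefixSketch` is hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class EntityPrefixSketch {
    // Illustrative names only; the real table is built from jsoup's entity data.
    private static final List<String> NAMES = new ArrayList<>(Arrays.asList("not", "sup", "sup1", "amp"));

    static {
        // Longest first, so "sup1" is tested before its own prefix "sup".
        NAMES.sort((a, b) -> b.length() - a.length());
    }

    /** Returns the longest listed name that is a prefix of the input, or "" if none matches. */
    static String findPrefix(String input) {
        for (String name : NAMES) {
            if (input.startsWith(name)) return name;
        }
        return "";
    }

    public static void main(String[] args) {
        System.out.println(findPrefix("notit")); // not
        System.out.println(findPrefix("sup1x")); // sup1
    }
}
```

Sorting by descending length is what makes the linear scan return the longest match; without it, `sup` could win over `sup1` depending on list order.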
1 change: 1 addition & 0 deletions src/main/java/org/jsoup/nodes/Node.java
@@ -509,6 +509,7 @@ void nodelistChanged() {
*/
public void replaceWith(Node in) {
Validate.notNull(in);
if (parentNode == null) parentNode = in.parentNode; // allows old to have been temp removed before replacing
Validate.notNull(parentNode);
parentNode.replaceChild(this, in);
}
3 changes: 1 addition & 2 deletions src/main/java/org/jsoup/parser/CharacterReader.java
@@ -489,7 +489,7 @@ String consumeRawData() {

String consumeTagName() {
// '\t', '\n', '\r', '\f', ' ', '/', '>'
// NOTE: out of spec, added '<' to fix common author bugs; does not stop and append on nullChar but eats
// NOTE: out of spec; does not stop and append on nullChar but eats
bufferUp();
int pos = bufPos;
final int start = pos;
@@ -505,7 +505,6 @@ String consumeTagName() {
case ' ':
case '/':
case '>':
case '<':
break OUTER;
}
pos++;
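
To see the effect of dropping `'<'` from the terminator set, here is a simplified scan over a plain `String`. It is only a sketch of the terminator logic shown in this hunk: jsoup's `CharacterReader` works on a refillable buffer, and the class name and method signature here are hypothetical.

```java
final class TagNameSketch {
    /** Reads a tag name starting at {@code start}, stopping at whitespace, '/' or '>'. A '<' is kept as part of the name. */
    static String consumeTagName(String input, int start) {
        int pos = start;
        OUTER:
        while (pos < input.length()) {
            switch (input.charAt(pos)) {
                case '\t':
                case '\n':
                case '\r':
                case '\f':
                case ' ':
                case '/':
                case '>':
                    break OUTER; // terminator reached; it is not consumed
            }
            pos++;
        }
        return input.substring(start, pos);
    }

    public static void main(String[] args) {
        System.out.println(consumeTagName("div<span>", 0)); // div<span
        System.out.println(consumeTagName("p class=x>", 0)); // p
    }
}
```

With the previous behavior, the first call would have stopped at `<` and returned `div`.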
31 changes: 18 additions & 13 deletions src/main/java/org/jsoup/parser/HtmlTreeBuilderState.java
@@ -7,6 +7,7 @@
import org.jsoup.nodes.Document;
import org.jsoup.nodes.DocumentType;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Range;

import java.util.ArrayList;

@@ -371,12 +372,7 @@ private boolean inBodyStartTag(Token t, HtmlTreeBuilder tb) {
stack = tb.getStack();
if (stack.size() > 0) {
Element html = tb.getStack().get(0);
if (startTag.hasAttributes()) {
for (Attribute attribute : startTag.attributes) {
if (!html.hasAttr(attribute.getKey()))
html.attributes().put(attribute);
}
}
mergeAttributes(startTag, html);
}
break;
case "body":
@@ -388,13 +384,8 @@ private boolean inBodyStartTag(Token t, HtmlTreeBuilder tb) {
} else {
tb.framesetOk(false);
// will be on stack if this is a nested body. won't be if closed (which is a variance from spec, which leaves it on)
Element body;
if (startTag.hasAttributes() && (body = tb.getFromStack("body")) != null) { // we only ever put one body on stack
for (Attribute attribute : startTag.attributes) {
if (!body.hasAttr(attribute.getKey()))
body.attributes().put(attribute);
}
}
Element body = tb.getFromStack("body");
if (body != null) mergeAttributes(startTag, body);
}
break;
case "frameset":
@@ -1841,6 +1832,20 @@ boolean processAsHtml(Token t, HtmlTreeBuilder tb) {
}
};

private static void mergeAttributes(Token.StartTag source, Element dest) {
if (!source.hasAttributes()) return;
for (Attribute attr : source.attributes) { // only iterates public attributes
Attributes destAttrs = dest.attributes();
if (!destAttrs.hasKey(attr.getKey())) {
Range.AttributeRange range = attr.sourceRange(); // need to grab range before its parent changes
destAttrs.put(attr);
if (source.trackSource) { // copy the attribute range
destAttrs.sourceRange(attr.getKey(), range);
}
}
}
}

private static final String nullString = String.valueOf('\u0000');

abstract boolean process(Token t, HtmlTreeBuilder tb);
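
The merge rule in `mergeAttributes` (copy an attribute from a late `html` or `body` start tag only if the existing element lacks it, and carry its source range along) can be modeled with plain maps. This is a simplified sketch under stated assumptions: attributes are a `Map<String, String>`, ranges are `int[]{start, end}` pairs, and the class name is hypothetical. None of these types are jsoup's.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class MergeAttributesSketch {
    /** Copies attributes (and their ranges, when present) from source to dest, never overwriting dest. */
    static void mergeAttributes(Map<String, String> srcAttrs, Map<String, int[]> srcRanges,
                                Map<String, String> destAttrs, Map<String, int[]> destRanges) {
        for (Map.Entry<String, String> attr : srcAttrs.entrySet()) {
            String key = attr.getKey();
            if (destAttrs.containsKey(key)) continue; // the value already on the element wins
            destAttrs.put(key, attr.getValue());
            int[] range = srcRanges.get(key);
            if (range != null) destRanges.put(key, range); // keep the position of the merged attribute
        }
    }

    public static void main(String[] args) {
        Map<String, String> dest = new LinkedHashMap<>();
        dest.put("class", "a");
        Map<String, String> src = new LinkedHashMap<>();
        src.put("class", "b");
        src.put("lang", "en");
        Map<String, int[]> srcRanges = new LinkedHashMap<>();
        srcRanges.put("lang", new int[]{10, 19});
        Map<String, int[]> destRanges = new LinkedHashMap<>();
        mergeAttributes(src, srcRanges, dest, destRanges);
        System.out.println(dest); // {class=a, lang=en}
    }
}
```

Copying the range together with the attribute is the point of the fix: before it, the merged attribute arrived on the element without the position it was parsed from.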