Merge branch 'master' into Element-stream
Isira-Seneviratne authored Dec 30, 2023
2 parents 5e873b7 + 15558b4 commit cb74941
Showing 21 changed files with 295 additions and 325 deletions.
59 changes: 37 additions & 22 deletions CHANGES.md
@@ -1,35 +1,50 @@
# jsoup Changelog

## 1.17.2 (Pending)
## 1.18.1 (Pending)

### Improvements

* Added `Element.attribute(String)` and `Attributes.attribute(String)` to more simply obtain an `Attribute` object.
[2069](https://github.com/jhy/jsoup/issues/2069)
* If source tracking is on, and an Attribute's key is changed (via `Attribute.setKey(String)`), the source range is
now still tracked in `Attribute.sourceRange()`. [2070](https://github.com/jhy/jsoup/issues/2070)
* Added support for the `[*]` selector, which matches elements that have any attribute, and restored support for
selecting by an empty attribute name prefix (`[^]`). [2079](https://github.com/jhy/jsoup/issues/2079)
* Added `Path`-accepting parse methods: `Jsoup.parse(Path)`, `Jsoup.parse(path, charsetName, baseUri, parser)`,
etc. [2055](https://github.com/jhy/jsoup/pull/2055)
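
A minimal sketch (not part of this commit) of how the shorter `Path` overloads pick a base URI when the caller supplies none. It mirrors the `path.toAbsolutePath().toString()` expression used in the `Jsoup.java` hunk of this commit. `PathBaseUriSketch` and `defaultBaseUri` are hypothetical names, and only JDK calls are used so the example runs without jsoup on the classpath.

```java
import java.io.File;
import java.nio.file.Path;
import java.nio.file.Paths;

final class PathBaseUriSketch {
    // Hypothetical helper: the same expression the diff passes as baseUri
    // for parse(Path) and parse(Path, String).
    static String defaultBaseUri(Path path) {
        return path.toAbsolutePath().toString();
    }

    public static void main(String[] args) {
        Path relative = Paths.get("pages", "index.html");
        // Resolved against the current working directory.
        System.out.println(defaultBaseUri(relative));
        // File-based callers end up with the same value via File#toPath().
        System.out.println(defaultBaseUri(new File("pages/index.html").toPath()));
    }
}
```

Note that the value is a file system path rather than a `file:` URL, so callers that need relative links resolved against a web location would likely still want the overload that takes an explicit `baseUri`.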

### Changes

* Removed previously deprecated internal classes and methods. [2094](https://github.com/jhy/jsoup/pull/2094)

---

## 1.17.2 (2023-Dec-29)

### Improvements

* **Attribute object accessors**: Added `Element.attribute(String)` and `Attributes.attribute(String)` to more simply
obtain an `Attribute` object. [2069](https://github.com/jhy/jsoup/issues/2069)
* **Attribute source tracking**: If source tracking is on, and an Attribute's key is changed (
via `Attribute.setKey(String)`), the source range is now still tracked
in `Attribute.sourceRange()`. [2070](https://github.com/jhy/jsoup/issues/2070)
* **Wildcard attribute selector**: Added support for the `[*]` selector, which matches elements that have any
attribute, and restored support for selecting by an empty attribute name prefix (`[^]`). [2079](https://github.com/jhy/jsoup/issues/2079)

### Bug Fixes

* When tracking the source position of attributes, if the source attribute name was mixed-case but the parser was
lower-case normalizing attribute names, the source position for that attribute was not tracked
correctly. [2067](https://github.com/jhy/jsoup/issues/2067)
* When tracking the source position of a body fragment parse, a null pointer exception was
thrown. [2068](https://github.com/jhy/jsoup/issues/2068)
* A multi-point encoded emoji entity may be incorrectly decoded to the replacement
* **Mixed-case source position**: When tracking the source position of attributes, if the source attribute name was
mixed-case but the parser was lower-case normalizing attribute names, the source position for that attribute was not
tracked correctly. [2067](https://github.com/jhy/jsoup/issues/2067)
* **Source position NPE**: When tracking the source position of a body fragment parse, a null pointer
exception was thrown. [2068](https://github.com/jhy/jsoup/issues/2068)
* **Multi-point emoji entity**: A multi-point encoded emoji entity may be incorrectly decoded to the replacement
character. [2074](https://github.com/jhy/jsoup/issues/2074)
* (Regression) In a selector like `parent [attr=va], other`, the `, OR` was binding to `[attr=va]` instead of
`parent [attr=va]`, causing incorrect selections. The fix includes an `EvaluatorDebug` class that generates an
s-expression (sexpr) to represent the query, allowing simpler and more thorough query parse
* **Selector sub-expressions**: (Regression) In a selector like `parent [attr=va], other`, the `, OR` was binding
to `[attr=va]` instead of `parent [attr=va]`, causing incorrect selections. The fix includes an `EvaluatorDebug` class
that generates an s-expression (sexpr) to represent the query, allowing simpler and more thorough query parse
tests. [2073](https://github.com/jhy/jsoup/issues/2073)
* When generating XML-syntax output from parsed HTML, script nodes containing (pseudo) CData sections would have an
extraneous CData section added, causing script execution errors. Now, the data content is emitted in an HTML/XML/XHTML
polyglot format, if the data is not already within a CData section. [2078](https://github.com/jhy/jsoup/issues/2078)
* The `:has` evaluator held a non-thread-safe Iterator, and so if an Evaluator object was shared across multiple
concurrent threads, a `NoSuchElementException` may be thrown, and the selected results may be incorrect. Now, the
iterator object is a thread-local. [2088](https://github.com/jhy/jsoup/issues/2088)
* **XML CData output**: When generating XML-syntax output from parsed HTML, script nodes containing (pseudo) CData
sections would have an extraneous CData section added, causing script execution errors. Now, the data content is
emitted in an HTML/XML/XHTML polyglot format, if the data is not already within a CData
section. [2078](https://github.com/jhy/jsoup/issues/2078)
* **Thread safety**: The `:has` evaluator held a non-thread-safe Iterator, and so if an Evaluator object was
shared across multiple concurrent threads, a `NoSuchElementException` may be thrown, and the selected results may be
incorrect. Now, the iterator object is a thread-local. [2088](https://github.com/jhy/jsoup/issues/2088)
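
The thread-safety entry above describes moving shared iterator state into a thread-local. The following is a rough, self-contained illustration of that general pattern, not jsoup's actual evaluator code: `Cursor`, `SharedMatcher`, and `runConcurrently` are hypothetical names, and the "matching" is a plain string comparison.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

final class ThreadLocalCursorSketch {
    /** Reusable, resettable cursor: cheap to reuse, but unsafe to share between threads. */
    static final class Cursor {
        private List<String> items;
        private int pos;

        void restart(List<String> items) { this.items = items; this.pos = 0; }
        boolean hasNext() { return pos < items.size(); }
        String next() { return items.get(pos++); }
    }

    /** A matcher that may be shared across threads; each thread lazily gets its own Cursor. */
    static final class SharedMatcher {
        private final ThreadLocal<Cursor> cursor = ThreadLocal.withInitial(Cursor::new);

        int countMatches(List<String> items, String wanted) {
            Cursor c = cursor.get(); // per-thread instance, so threads cannot disturb each other
            c.restart(items);
            int count = 0;
            while (c.hasNext()) {
                if (c.next().equals(wanted)) count++;
            }
            return count;
        }
    }

    /** Runs the same shared matcher from several pool threads and collects each result. */
    static int[] runConcurrently(int tasks) {
        SharedMatcher matcher = new SharedMatcher();
        List<String> items = Arrays.asList("a", "b", "a", "c", "a");
        Callable<Integer> task = () -> matcher.countMatches(items, "a");
        ExecutorService pool = Executors.newFixedThreadPool(4);
        try {
            List<Future<Integer>> futures = new ArrayList<>();
            for (int i = 0; i < tasks; i++) futures.add(pool.submit(task));
            int[] results = new int[tasks];
            for (int i = 0; i < tasks; i++) results[i] = futures.get(i).get();
            return results;
        } catch (Exception e) {
            throw new RuntimeException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(runConcurrently(8)));
    }
}
```

Had `SharedMatcher` kept a single `Cursor` field instead, two threads could interleave `restart` and `next` calls on it, which is the kind of failure the changelog entry reports.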

---
Older changes for versions 0.1.1 (2010-Jan-31) through 1.17.1 (2023-Nov-27) may be found in
17 changes: 13 additions & 4 deletions pom.xml
@@ -5,7 +5,7 @@

<groupId>org.jsoup</groupId>
<artifactId>jsoup</artifactId>
<version>1.17.2-SNAPSHOT</version><!-- remember to update previous version below for japicmp -->
<version>1.18.1-SNAPSHOT</version><!-- remember to update previous version below for japicmp -->
<url>https://jsoup.org/</url>
<description>jsoup is a Java library that simplifies working with real-world HTML and XML. It offers an easy-to-use API for URL fetching, data parsing, extraction, and manipulation using DOM API methods, CSS, and xpath selectors. jsoup implements the WHATWG HTML5 specification, and parses HTML to the same DOM as modern browsers.</description>
<inceptionYear>2009</inceptionYear>
@@ -42,7 +42,7 @@
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-compiler-plugin</artifactId>
<version>3.12.0</version>
<version>3.12.1</version>
<configuration>
<encoding>UTF-8</encoding>
<compilerArgs>
@@ -88,13 +88,21 @@
<version>2.3.3_r2</version>
</signature>
<ignores>
<ignore>java.io.File</ignore> <!-- File#toPath() -->
<ignore>java.nio.file.*</ignore>
<ignore>java.nio.channels.SeekableByteChannel</ignore>
<ignore>java.util.function.*</ignore>
<ignore>java.util.stream.*</ignore>
<ignore>java.lang.Throwable</ignore> <!-- Throwable#addSuppressed(Throwable) -->
<ignore>java.lang.ThreadLocal</ignore>
<ignore>java.io.UncheckedIOException</ignore>
<ignore>java.util.Comparator</ignore> <!-- Comparator.comparingInt() -->
<ignore>java.util.List</ignore> <!-- List#stream() -->
<ignore>java.util.LinkedHashMap</ignore> <!-- LinkedHashMap#computeIfAbsent() -->
<ignore>java.util.Map</ignore> <!-- Map#computeIfAbsent() -->
<ignore>java.util.Objects</ignore>
<ignore>java.util.Optional</ignore>
<ignore>java.util.Set</ignore> <!-- Set#stream() -->
<ignore>java.util.Spliterator</ignore>
<ignore>java.util.Spliterators</ignore>

@@ -227,7 +235,7 @@
<dependency>
<groupId>org.jsoup</groupId>
<artifactId>jsoup</artifactId>
<version>1.16.2</version>
<version>1.17.1</version>
<type>jar</type>
</dependency>
</oldVersion>
@@ -237,7 +245,8 @@
<breakBuildOnBinaryIncompatibleModifications>true</breakBuildOnBinaryIncompatibleModifications>
<breakBuildOnSourceIncompatibleModifications>true</breakBuildOnSourceIncompatibleModifications>
<excludes>
<!-- <exclude>@java.lang.Deprecated</exclude> -->
<exclude>@java.lang.Deprecated</exclude>
<exclude>org.jsoup.UncheckedIOException</exclude>
</excludes>
<overrideCompatibilityChangeParameters>
<!-- allows new default and move to default methods. compatible as long as existing binaries aren't making calls via reflection. if so, they need to catch errors anyway. -->
67 changes: 67 additions & 0 deletions src/main/java/org/jsoup/Jsoup.java
@@ -13,6 +13,7 @@
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.nio.file.Path;

/**
The core public access point to the jsoup functionality.
@@ -183,6 +184,72 @@ public static Document parse(File file, @Nullable String charsetName, String bas
return DataUtil.load(file, charsetName, baseUri, parser);
}

/**
Parse the contents of a file as HTML.
@param path file to load HTML from. Supports gzipped files (ending in .z or .gz).
@param charsetName (optional) character set of file contents. Set to {@code null} to determine from {@code http-equiv} meta tag, if
present, or fall back to {@code UTF-8} (which is often safe to do).
@param baseUri The URL where the HTML was retrieved from, to resolve relative links against.
@return sane HTML
@throws IOException if the file could not be found, or read, or if the charsetName is invalid.
@since 1.18.1
*/
public static Document parse(Path path, @Nullable String charsetName, String baseUri) throws IOException {
return DataUtil.load(path, charsetName, baseUri);
}

/**
Parse the contents of a file as HTML. The location of the file is used as the base URI to qualify relative URLs.
@param path file to load HTML from. Supports gzipped files (ending in .z or .gz).
@param charsetName (optional) character set of file contents. Set to {@code null} to determine from {@code http-equiv} meta tag, if
present, or fall back to {@code UTF-8} (which is often safe to do).
@return sane HTML
@throws IOException if the file could not be found, or read, or if the charsetName is invalid.
@see #parse(Path, String, String) parse(path, charset, baseUri)
@since 1.18.1
*/
public static Document parse(Path path, @Nullable String charsetName) throws IOException {
return DataUtil.load(path, charsetName, path.toAbsolutePath().toString());
}

/**
Parse the contents of a file as HTML. The location of the file is used as the base URI to qualify relative URLs.
The charset used to read the file will be determined by the byte-order-mark (BOM), or a {@code <meta charset>} tag,
or if neither is present, will be {@code UTF-8}.
<p>This is the equivalent of calling {@link #parse(Path, String) parse(path, null)}</p>
@param path the file to load HTML from. Supports gzipped files (ending in .z or .gz).
@return sane HTML
@throws IOException if the file could not be found or read.
@see #parse(Path, String, String) parse(path, charset, baseUri)
@since 1.18.1
*/
public static Document parse(Path path) throws IOException {
return DataUtil.load(path, null, path.toAbsolutePath().toString());
}

/**
Parse the contents of a file as HTML.
@param path file to load HTML from. Supports gzipped files (ending in .z or .gz).
@param charsetName (optional) character set of file contents. Set to {@code null} to determine from {@code http-equiv} meta tag, if
present, or fall back to {@code UTF-8} (which is often safe to do).
@param baseUri The URL where the HTML was retrieved from, to resolve relative links against.
@param parser alternate {@link Parser#xmlParser() parser} to use.
@return sane HTML
@throws IOException if the file could not be found, or read, or if the charsetName is invalid.
@since 1.18.1
*/
public static Document parse(Path path, @Nullable String charsetName, String baseUri, Parser parser) throws IOException {
return DataUtil.load(path, charsetName, baseUri, parser);
}

/**
Read an input stream, and parse it to a Document.
2 changes: 1 addition & 1 deletion src/main/java/org/jsoup/UncheckedIOException.java
@@ -6,7 +6,7 @@
* @deprecated Use {@link java.io.UncheckedIOException} instead. This class acted as a compatibility shim for Java
* versions prior to 1.8.
*/
// todo annotate @Deprecated in next release (after previous @Deprecations clear)
@Deprecated
public class UncheckedIOException extends java.io.UncheckedIOException {
public UncheckedIOException(IOException cause) {
super(cause);
83 changes: 53 additions & 30 deletions src/main/java/org/jsoup/helper/DataUtil.java
@@ -2,7 +2,6 @@

import org.jsoup.internal.ControllableInputStream;
import org.jsoup.internal.Normalizer;
import org.jsoup.internal.SharedConstants;
import org.jsoup.internal.StringUtil;
import org.jsoup.nodes.Comment;
import org.jsoup.nodes.Document;
@@ -16,7 +15,6 @@
import java.io.BufferedReader;
import java.io.CharArrayReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
@@ -25,8 +23,12 @@
import java.nio.Buffer;
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.channels.Channels;
import java.nio.channels.SeekableByteChannel;
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Locale;
import java.util.Random;
import java.util.regex.Matcher;
@@ -63,7 +65,7 @@ private DataUtil() {}
* @throws IOException on IO error
*/
public static Document load(File file, @Nullable String charsetName, String baseUri) throws IOException {
return load(file, charsetName, baseUri, Parser.htmlParser());
return load(file.toPath(), charsetName, baseUri);
}

/**
@@ -81,18 +83,48 @@ public static Document load(File file, @Nullable String charsetName, String base
* @since 1.14.2
*/
public static Document load(File file, @Nullable String charsetName, String baseUri, Parser parser) throws IOException {
InputStream stream = new FileInputStream(file);
String name = Normalizer.lowerCase(file.getName());
if (name.endsWith(".gz") || name.endsWith(".z")) {
// unfortunately file input streams don't support marks (why not?), so we will close and reopen after read
boolean zipped;
try {
zipped = (stream.read() == 0x1f && stream.read() == 0x8b); // gzip magic bytes
} finally {
stream.close();
return load(file.toPath(), charsetName, baseUri, parser);
}

/**
* Loads and parses a file to a Document, with the HtmlParser. Files that are compressed with gzip (and end in {@code .gz} or {@code .z})
* are supported in addition to uncompressed files.
*
* @param path file to load
* @param charsetName (optional) character set of input; specify {@code null} to attempt to autodetect. A BOM in
* the file will always override this setting.
* @param baseUri base URI of document, to resolve relative links against
* @return Document
* @throws IOException on IO error
*/
public static Document load(Path path, @Nullable String charsetName, String baseUri) throws IOException {
return load(path, charsetName, baseUri, Parser.htmlParser());
}

/**
* Loads and parses a file to a Document. Files that are compressed with gzip (and end in {@code .gz} or {@code .z})
* are supported in addition to uncompressed files.
*
* @param path file to load
* @param charsetName (optional) character set of input; specify {@code null} to attempt to autodetect. A BOM in
* the file will always override this setting.
* @param baseUri base URI of document, to resolve relative links against
* @param parser alternate {@link Parser#xmlParser() parser} to use.
* @return Document
* @throws IOException on IO error
* @since 1.18.1
*/
public static Document load(Path path, @Nullable String charsetName, String baseUri, Parser parser) throws IOException {
final SeekableByteChannel byteChannel = Files.newByteChannel(path);
InputStream stream = Channels.newInputStream(byteChannel);
String name = Normalizer.lowerCase(path.getFileName().toString());
if (name.endsWith(".gz") || name.endsWith(".z")) {
final boolean zipped = (stream.read() == 0x1f && stream.read() == 0x8b); // gzip magic bytes
byteChannel.position(0); // reset to start of file
if (zipped) {
stream = new GZIPInputStream(stream);
}
stream = zipped ? new GZIPInputStream(new FileInputStream(file)) : new FileInputStream(file);
}
return parseInputStream(stream, charsetName, baseUri, parser);
}
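
The new `load(Path, ...)` body above reads two bytes to check for the gzip magic number and then rewinds the channel instead of reopening the file. Below is a small standalone sketch of that idea using only JDK classes. `GzipSniffSketch`, `open`, `readText`, and `roundTrip` are hypothetical names. Unlike the diff, the sketch sniffs every file rather than only names ending in `.gz` or `.z`, and it closes the channel if setup fails; both are simplifications of this example, not behavior claimed for jsoup.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.channels.Channels;
import java.nio.channels.SeekableByteChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

final class GzipSniffSketch {
    /** Opens a file, unwrapping gzip if the first two bytes are the gzip magic number. */
    static InputStream open(Path path) throws IOException {
        SeekableByteChannel channel = Files.newByteChannel(path);
        try {
            InputStream stream = Channels.newInputStream(channel); // unbuffered view over the channel
            boolean zipped = stream.read() == 0x1f && stream.read() == 0x8b;
            channel.position(0); // rewind so the next read starts at byte 0 again
            return zipped ? new GZIPInputStream(stream) : stream;
        } catch (IOException e) {
            channel.close(); // sketch-only addition: do not leak the channel on failure
            throw e;
        }
    }

    /** Reads the whole (possibly gzipped) file as UTF-8 text. */
    static String readText(Path path) {
        try (InputStream in = open(path)) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[1024];
            int n;
            while ((n = in.read(buf)) != -1) out.write(buf, 0, n);
            return new String(out.toByteArray(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Writes text to a temp file (optionally gzipped), reads it back, then deletes the file. */
    static String roundTrip(String text, boolean gzip) {
        try {
            Path tmp = Files.createTempFile("sniff", gzip ? ".gz" : ".html");
            try {
                try (OutputStream raw = Files.newOutputStream(tmp);
                     OutputStream out = gzip ? new GZIPOutputStream(raw) : raw) {
                    out.write(text.getBytes(StandardCharsets.UTF_8));
                }
                return readText(tmp);
            } finally {
                Files.deleteIfExists(tmp);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(roundTrip("<p>plain</p>", false));
        System.out.println(roundTrip("<p>gzipped</p>", true));
    }
}
```

The rewind only works because the stream returned by `Channels.newInputStream` is documented as unbuffered; a buffered wrapper placed before the sniff would keep its own read-ahead and ignore the channel's new position.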
@@ -139,16 +171,15 @@ static void crossStreams(final InputStream in, final OutputStream out) throws IO
static Document parseInputStream(@Nullable InputStream input, @Nullable String charsetName, String baseUri, Parser parser) throws IOException {
if (input == null) // empty body
return new Document(baseUri);
input = ControllableInputStream.wrap(input, DefaultBufferSize, 0);

@Nullable Document doc = null;

// read the start of the stream and look for a BOM or meta charset
try {
input.mark(DefaultBufferSize);
ByteBuffer firstBytes = readToByteBuffer(input, firstReadBufferSize - 1); // -1 because we read one more to see if completed. First read is < buffer size, so can't be invalid.
boolean fullyRead = (input.read() == -1);
input.reset();
try (InputStream wrappedInputStream = ControllableInputStream.wrap(input, DefaultBufferSize, 0)) {
wrappedInputStream.mark(DefaultBufferSize);
ByteBuffer firstBytes = readToByteBuffer(wrappedInputStream, firstReadBufferSize - 1); // -1 because we read one more to see if completed. First read is < buffer size, so can't be invalid.
boolean fullyRead = (wrappedInputStream.read() == -1);
wrappedInputStream.reset();

// look for BOM - overrides any other header or input
BomCharset bomCharset = detectCharsetFromBom(firstBytes);
@@ -189,9 +220,8 @@ else if (first instanceof Comment) {
if (comment.isXmlDeclaration())
decl = comment.asXmlDeclaration();
}
if (decl != null) {
if (decl.name().equalsIgnoreCase("xml"))
foundCharset = decl.attr("encoding");
if (decl != null && decl.name().equalsIgnoreCase("xml")) {
foundCharset = decl.attr("encoding");
}
}
foundCharset = validateCharset(foundCharset);
@@ -208,8 +238,7 @@ else if (first instanceof Comment) {
if (doc == null) {
if (charsetName == null)
charsetName = defaultCharsetName;
BufferedReader reader = new BufferedReader(new InputStreamReader(input, Charset.forName(charsetName)), DefaultBufferSize); // Android level does not allow us try-with-resources
try {
try (BufferedReader reader = new BufferedReader(new InputStreamReader(wrappedInputStream, Charset.forName(charsetName)), DefaultBufferSize)) {
if (bomCharset != null && bomCharset.offset) { // creating the buffered reader ignores the input pos, so must skip here
long skipped = reader.skip(1);
Validate.isTrue(skipped == 1); // WTF if this fails.
@@ -227,14 +256,8 @@ else if (first instanceof Comment) {
doc.charset(UTF_8);
}
}
finally {
reader.close();
}
}
}
finally {
input.close();
}
return doc;
}
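
The `parseInputStream` hunks above replace explicit `finally { ...close(); }` blocks with nested try-with-resources. As a hedged illustration of the resulting close order, the sketch below uses a hypothetical `Tracked` resource that logs when it is closed; the names `stream` and `reader` only echo the roles in the diff and do not touch jsoup code.

```java
import java.util.ArrayList;
import java.util.List;

final class TryWithResourcesSketch {
    /** Hypothetical resource that records when it is closed. */
    static final class Tracked implements AutoCloseable {
        private final String name;
        private final List<String> log;

        Tracked(String name, List<String> log) {
            this.name = name;
            this.log = log;
        }

        @Override
        public void close() {
            log.add("close " + name);
        }
    }

    /** Returns the sequence of events for a successful or failing "parse". */
    static List<String> run(boolean fail) {
        List<String> log = new ArrayList<>();
        try (Tracked stream = new Tracked("stream", log)) {
            try (Tracked reader = new Tracked("reader", log)) {
                log.add("parse");
                if (fail) throw new IllegalStateException("simulated parse failure");
            }
        } catch (IllegalStateException e) {
            // Both resources are already closed by the time the handler runs.
            log.add("caught");
        }
        return log;
    }

    public static void main(String[] args) {
        System.out.println(run(false)); // [parse, close reader, close stream]
        System.out.println(run(true));  // [parse, close reader, close stream, caught]
    }
}
```

The inner resource is closed first and the outer one second, on both the normal and the exceptional path, which is the guarantee the removed `finally` blocks were providing by hand.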
