Merge branch 'master' into Element-stream

jhy · Dec 30, 2023 · cb74941
2 parents 5e873b7 + 15558b4
Showing 21 changed files with 295 additions and 325 deletions.
diff --git a/CHANGES.md b/CHANGES.md
@@ -1,35 +1,50 @@
 # jsoup Changelog
 
-## 1.17.2 (Pending)
+## 1.18.1 (Pending)
 
 ### Improvements
 
-* Added `Element.attribute(String)` and `Attributes.attribute(String)` to more simply obtain an `Attribute` object.
-  [2069](https://github.com/jhy/jsoup/issues/2069)
-* If source tracking is on, and an Attribute's key is changed (via `Attribute.setKey(String)`), the source range is
-  now still tracked in `Attribute.sourceRange()`. [2070](https://github.com/jhy/jsoup/issues/2070)
-* Added support for the `[*]` element with any attribute selector. And also restored support for selecting by an empty
-  attribute name prefix (`[^]`). [2079](https://github.com/jhy/jsoup/issues/2079)
+* Added `Path` accepting parse methods: `Jsoup.parse(Path)`, `Jsoup.parse(path, charsetName, baseUri, parser)`,
+  etc. [2055](https://github.com/jhy/jsoup/pull/2055)
+
+### Changes
+
+* Removed previously deprecated internal classes and methods. [2094](https://github.com/jhy/jsoup/pull/2094)
+
+---
+
+## 1.17.2 (2023-Dec-29)
+
+### Improvements
+
+* **Attribute object accessors**: Added `Element.attribute(String)` and `Attributes.attribute(String)` to more simply
+  obtain an `Attribute` object. [2069](https://github.com/jhy/jsoup/issues/2069)
+* **Attribute source tracking**: If source tracking is on, and an Attribute's key is changed (
+  via `Attribute.setKey(String)`), the source range is now still tracked
+  in `Attribute.sourceRange()`. [2070](https://github.com/jhy/jsoup/issues/2070)
+* **Wildcard attribute selector**: Added support for the `[*]` element with any attribute selector. And also restored
+  support for selecting by an empty attribute name prefix (`[^]`). [2079](https://github.com/jhy/jsoup/issues/2079)
 
 ### Bug Fixes
 
-* When tracking the source position of attributes, if source attribute name was mix-cased but the parser was
-  lower-case normalizing attribute names, the source position for that attribute was not tracked
-  correctly. [2067](https://github.com/jhy/jsoup/issues/2067)
-* When tracking the source position of a body fragment parse, a null pointer exception was
-  thrown. [2068](https://github.com/jhy/jsoup/issues/2068)
-* A multi-point encoded emoji entity may be incorrectly decoded to the replacement
+* **Mixed-cased source position**: When tracking the source position of attributes, if the source attribute name was
+  mix-cased but the parser was lower-case normalizing attribute names, the source position for that attribute was not
+  tracked correctly. [2067](https://github.com/jhy/jsoup/issues/2067)
+* **Source position NPE**: When tracking the source position of a body fragment parse, a null pointer
+  exception was thrown. [2068](https://github.com/jhy/jsoup/issues/2068)
+* **Multi-point emoji entity**: A multi-point encoded emoji entity may be incorrectly decoded to the replacement
   character. [2074](https://github.com/jhy/jsoup/issues/2074)
-* (Regression) in a selector like `parent [attr=va], other`, the `, OR` was binding to `[attr=va]` instead of
-  `parent [attr=va]`, causing incorrect selections. The fix includes a EvaluatorDebug class that generates a sexpr
-  to represent the query, allowing simpler and more thorough query parse
+* **Selector sub-expressions**: (Regression) in a selector like `parent [attr=va], other`, the `, OR` was binding
+  to `[attr=va]` instead of `parent [attr=va]`, causing incorrect selections. The fix includes an EvaluatorDebug class
+  that generates a sexpr to represent the query, allowing simpler and more thorough query parse
   tests. [2073](https://github.com/jhy/jsoup/issues/2073)
-* When generating XML-syntax output from parsed HTML, script nodes containing (pseudo) CData sections would have an
-  extraneous CData section added, causing script execution errors. Now, the data content is emitted in a HTML/XML/XHTML
-  polyglot format, if the data is not already within a CData section. [2078](https://github.com/jhy/jsoup/issues/2078)
-* The `:has` evaluator held a non-thread-safe Iterator, and so if an Evaluator object was shared across multiple
-  concurrent threads, a NoSuchElement exception may be thrown, and the selected results may be incorrect. Now, the
-  iterator object is a thread-local. [2088](https://github.com/jhy/jsoup/issues/2088)
+* **XML CData output**: When generating XML-syntax output from parsed HTML, script nodes containing (pseudo) CData
+  sections would have an extraneous CData section added, causing script execution errors. Now, the data content is
+  emitted in a HTML/XML/XHTML polyglot format, if the data is not already within a CData
+  section. [2078](https://github.com/jhy/jsoup/issues/2078)
+* **Thread safety**: The `:has` evaluator held a non-thread-safe Iterator, and so if an Evaluator object was
+  shared across multiple concurrent threads, a NoSuchElement exception may be thrown, and the selected results may be
+  incorrect. Now, the iterator object is a thread-local. [2088](https://github.com/jhy/jsoup/issues/2088)
 
 ---
 Older changes for versions 0.1.1 (2010-Jan-31) through 1.17.1 (2023-Nov-27) may be found in

diff --git a/pom.xml b/pom.xml
@@ -5,7 +5,7 @@
 
   <groupId>org.jsoup</groupId>
   <artifactId>jsoup</artifactId>
-  <version>1.17.2-SNAPSHOT</version><!-- remember to update previous version below for japicmp -->
+  <version>1.18.1-SNAPSHOT</version><!-- remember to update previous version below for japicmp -->
   <url>https://jsoup.org/</url>
   <description>jsoup is a Java library that simplifies working with real-world HTML and XML. It offers an easy-to-use API for URL fetching, data parsing, extraction, and manipulation using DOM API methods, CSS, and xpath selectors. jsoup implements the WHATWG HTML5 specification, and parses HTML to the same DOM as modern browsers.</description>
   <inceptionYear>2009</inceptionYear>
@@ -42,7 +42,7 @@
       <plugin>
         <groupId>org.apache.maven.plugins</groupId>
         <artifactId>maven-compiler-plugin</artifactId>
-        <version>3.12.0</version>
+        <version>3.12.1</version>
         <configuration>
           <encoding>UTF-8</encoding>
           <compilerArgs>
@@ -88,13 +88,21 @@
                 <version>2.3.3_r2</version>
               </signature>
               <ignores>
+                <ignore>java.io.File</ignore> <!-- File#toPath() -->
+                <ignore>java.nio.file.*</ignore>
+                <ignore>java.nio.channels.SeekableByteChannel</ignore>
                 <ignore>java.util.function.*</ignore>
                 <ignore>java.util.stream.*</ignore>
+                <ignore>java.lang.Throwable</ignore> <!-- Throwable#addSuppressed(Throwable) -->
                 <ignore>java.lang.ThreadLocal</ignore>
                 <ignore>java.io.UncheckedIOException</ignore>
+                <ignore>java.util.Comparator</ignore> <!-- Comparator.comparingInt() -->
                 <ignore>java.util.List</ignore> <!-- List#stream() -->
+                <ignore>java.util.LinkedHashMap</ignore> <!-- LinkedHashMap#computeIfAbsent() -->
+                <ignore>java.util.Map</ignore> <!-- Map#computeIfAbsent() -->
                 <ignore>java.util.Objects</ignore>
                 <ignore>java.util.Optional</ignore>
+                <ignore>java.util.Set</ignore> <!-- Set#stream() -->
                 <ignore>java.util.Spliterator</ignore>
                 <ignore>java.util.Spliterators</ignore>
 
@@ -227,7 +235,7 @@
             <dependency>
               <groupId>org.jsoup</groupId>
               <artifactId>jsoup</artifactId>
-              <version>1.16.2</version>
+              <version>1.17.1</version>
               <type>jar</type>
             </dependency>
           </oldVersion>
@@ -237,7 +245,8 @@
             <breakBuildOnBinaryIncompatibleModifications>true</breakBuildOnBinaryIncompatibleModifications>
             <breakBuildOnSourceIncompatibleModifications>true</breakBuildOnSourceIncompatibleModifications>
             <excludes>
-              <!-- <exclude>@java.lang.Deprecated</exclude> -->
+              <exclude>@java.lang.Deprecated</exclude>
+              <exclude>org.jsoup.UncheckedIOException</exclude>
             </excludes>
             <overrideCompatibilityChangeParameters>
               <!-- allows new default and move to default methods. compatible as long as existing binaries aren't making calls via reflection. if so, they need to catch errors anyway. -->

diff --git a/src/main/java/org/jsoup/Jsoup.java b/src/main/java/org/jsoup/Jsoup.java
@@ -13,6 +13,7 @@
 import java.io.IOException;
 import java.io.InputStream;
 import java.net.URL;
+import java.nio.file.Path;
 
 /**
  The core public access point to the jsoup functionality.
@@ -183,6 +184,72 @@ public static Document parse(File file, @Nullable String charsetName, String bas
         return DataUtil.load(file, charsetName, baseUri, parser);
     }
 
+    /**
+     Parse the contents of a file as HTML.
+
+     @param path        file to load HTML from. Supports gzipped files (ending in .z or .gz).
+     @param charsetName (optional) character set of file contents. Set to {@code null} to determine from {@code http-equiv} meta tag, if
+     present, or fall back to {@code UTF-8} (which is often safe to do).
+     @param baseUri     The URL where the HTML was retrieved from, to resolve relative links against.
+     @return sane HTML
+
+     @throws IOException if the file could not be found, or read, or if the charsetName is invalid.
+     @since 1.18.1
+     */
+    public static Document parse(Path path, @Nullable String charsetName, String baseUri) throws IOException {
+        return DataUtil.load(path, charsetName, baseUri);
+    }
+
+    /**
+     Parse the contents of a file as HTML. The location of the file is used as the base URI to qualify relative URLs.
+
+     @param path        file to load HTML from. Supports gzipped files (ending in .z or .gz).
+     @param charsetName (optional) character set of file contents. Set to {@code null} to determine from {@code http-equiv} meta tag, if
+     present, or fall back to {@code UTF-8} (which is often safe to do).
+     @return sane HTML
+
+     @throws IOException if the file could not be found, or read, or if the charsetName is invalid.
+     @see #parse(Path, String, String) parse(path, charset, baseUri)
+     @since 1.18.1
+     */
+    public static Document parse(Path path, @Nullable String charsetName) throws IOException {
+        return DataUtil.load(path, charsetName, path.toAbsolutePath().toString());
+    }
+
+    /**
+     Parse the contents of a file as HTML. The location of the file is used as the base URI to qualify relative URLs.
+     The charset used to read the file will be determined by the byte-order-mark (BOM), or a {@code <meta charset>} tag,
+     or if neither is present, will be {@code UTF-8}.
+
+     <p>This is the equivalent of calling {@link #parse(Path, String) parse(path, null)}.</p>
+
+     @param path the file to load HTML from. Supports gzipped files (ending in .z or .gz).
+     @return sane HTML
+     @throws IOException if the file could not be found or read.
+     @see #parse(Path, String, String) parse(path, charset, baseUri)
+     @since 1.18.1
+     */
+    public static Document parse(Path path) throws IOException {
+        return DataUtil.load(path, null, path.toAbsolutePath().toString());
+    }
+
+    /**
+     Parse the contents of a file as HTML.
+
+     @param path        file to load HTML from. Supports gzipped files (ending in .z or .gz).
+     @param charsetName (optional) character set of file contents. Set to {@code null} to determine from {@code http-equiv} meta tag, if
+     present, or fall back to {@code UTF-8} (which is often safe to do).
+     @param baseUri     The URL where the HTML was retrieved from, to resolve relative links against.
+     @param parser alternate {@link Parser#xmlParser() parser} to use.
+     @return sane HTML
+
+     @throws IOException if the file could not be found, or read, or if the charsetName is invalid.
+     @since 1.18.1
+     */
+    public static Document parse(Path path, @Nullable String charsetName, String baseUri, Parser parser) throws IOException {
+        return DataUtil.load(path, charsetName, baseUri, parser);
+    }
+
      /**
      Read an input stream, and parse it to a Document.
 

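The one- and two-argument `parse(Path, ...)` overloads above pass `path.toAbsolutePath().toString()` as the base URI. The sketch below uses only the JDK to show what that string looks like for a relative path. The class and method names (`BaseUriSketch`, `defaultBaseUri`) are hypothetical and not part of jsoup, and the example file name is made up.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

public class BaseUriSketch {
    // Hypothetical helper mirroring the expression used in the diff:
    // the absolute file system location, as a plain string.
    public static String defaultBaseUri(Path path) {
        return path.toAbsolutePath().toString();
    }

    public static void main(String[] args) {
        // A relative path is resolved against the current working directory.
        Path relative = Paths.get("docs", "index.html");
        System.out.println(defaultBaseUri(relative));
    }
}
```

The result is a file system path rather than a `file:` URL, so callers who need a web origin for resolving relative links would pass an explicit `baseUri` through the three- or four-argument overloads.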
diff --git a/src/main/java/org/jsoup/UncheckedIOException.java b/src/main/java/org/jsoup/UncheckedIOException.java
@@ -6,7 +6,7 @@
  * @deprecated Use {@link java.io.UncheckedIOException} instead. This class acted as a compatibility shim for Java
  * versions prior to 1.8.
  */
-// todo annotate @Deprecated in next release (after previous @Deprecations clear)
+@Deprecated
 public class UncheckedIOException extends java.io.UncheckedIOException {
     public UncheckedIOException(IOException cause) {
         super(cause);

diff --git a/src/main/java/org/jsoup/helper/DataUtil.java b/src/main/java/org/jsoup/helper/DataUtil.java
@@ -2,7 +2,6 @@
 
 import org.jsoup.internal.ControllableInputStream;
 import org.jsoup.internal.Normalizer;
-import org.jsoup.internal.SharedConstants;
 import org.jsoup.internal.StringUtil;
 import org.jsoup.nodes.Comment;
 import org.jsoup.nodes.Document;
@@ -16,7 +15,6 @@
 import java.io.BufferedReader;
 import java.io.CharArrayReader;
 import java.io.File;
-import java.io.FileInputStream;
 import java.io.IOException;
 import java.io.InputStream;
 import java.io.InputStreamReader;
@@ -25,8 +23,12 @@
 import java.nio.Buffer;
 import java.nio.ByteBuffer;
 import java.nio.CharBuffer;
+import java.nio.channels.Channels;
+import java.nio.channels.SeekableByteChannel;
 import java.nio.charset.Charset;
 import java.nio.charset.IllegalCharsetNameException;
+import java.nio.file.Files;
+import java.nio.file.Path;
 import java.util.Locale;
 import java.util.Random;
 import java.util.regex.Matcher;
@@ -63,7 +65,7 @@ private DataUtil() {}
      * @throws IOException on IO error
      */
     public static Document load(File file, @Nullable String charsetName, String baseUri) throws IOException {
-        return load(file, charsetName, baseUri, Parser.htmlParser());
+        return load(file.toPath(), charsetName, baseUri);
     }
 
     /**
@@ -81,18 +83,48 @@ public static Document load(File file, @Nullable String charsetName, String base
      * @since 1.14.2
      */
     public static Document load(File file, @Nullable String charsetName, String baseUri, Parser parser) throws IOException {
-        InputStream stream = new FileInputStream(file);
-        String name = Normalizer.lowerCase(file.getName());
-        if (name.endsWith(".gz") || name.endsWith(".z")) {
-            // unfortunately file input streams don't support marks (why not?), so we will close and reopen after read
-            boolean zipped;
-            try {
-                zipped = (stream.read() == 0x1f && stream.read() == 0x8b); // gzip magic bytes
-            } finally {
-                stream.close();
+        return load(file.toPath(), charsetName, baseUri, parser);
+    }
+
+    /**
+     * Loads and parses a file to a Document, with the HtmlParser. Files that are compressed with gzip (and end in {@code .gz} or {@code .z})
+     * are supported in addition to uncompressed files.
+     *
+     * @param path file to load
+     * @param charsetName (optional) character set of input; specify {@code null} to attempt to autodetect. A BOM in
+     *     the file will always override this setting.
+     * @param baseUri base URI of document, to resolve relative links against
+     * @return Document
+     * @throws IOException on IO error
+     */
+    public static Document load(Path path, @Nullable String charsetName, String baseUri) throws IOException {
+        return load(path, charsetName, baseUri, Parser.htmlParser());
+    }
 
+    /**
+     * Loads and parses a file to a Document. Files that are compressed with gzip (and end in {@code .gz} or {@code .z})
+     * are supported in addition to uncompressed files.
+     *
+     * @param path file to load
+     * @param charsetName (optional) character set of input; specify {@code null} to attempt to autodetect. A BOM in
+     *     the file will always override this setting.
+     * @param baseUri base URI of document, to resolve relative links against
+     * @param parser alternate {@link Parser#xmlParser() parser} to use.
+
+     * @return Document
+     * @throws IOException on IO error
+     * @since 1.17.2
+     */
+    public static Document load(Path path, @Nullable String charsetName, String baseUri, Parser parser) throws IOException {
+        final SeekableByteChannel byteChannel = Files.newByteChannel(path);
+        InputStream stream = Channels.newInputStream(byteChannel);
+        String name = Normalizer.lowerCase(path.getFileName().toString());
+        if (name.endsWith(".gz") || name.endsWith(".z")) {
+            final boolean zipped = (stream.read() == 0x1f && stream.read() == 0x8b); // gzip magic bytes
+            byteChannel.position(0); // reset to start of file
+            if (zipped) {
+                stream = new GZIPInputStream(stream);
             }
-            stream = zipped ? new GZIPInputStream(new FileInputStream(file)) : new FileInputStream(file);
         }
         return parseInputStream(stream, charsetName, baseUri, parser);
     }
@@ -139,16 +171,15 @@ static void crossStreams(final InputStream in, final OutputStream out) throws IO
     static Document parseInputStream(@Nullable InputStream input, @Nullable String charsetName, String baseUri, Parser parser) throws IOException  {
         if (input == null) // empty body
             return new Document(baseUri);
-        input = ControllableInputStream.wrap(input, DefaultBufferSize, 0);
 
         @Nullable Document doc = null;
 
         // read the start of the stream and look for a BOM or meta charset
-        try {
-            input.mark(DefaultBufferSize);
-            ByteBuffer firstBytes = readToByteBuffer(input, firstReadBufferSize - 1); // -1 because we read one more to see if completed. First read is < buffer size, so can't be invalid.
-            boolean fullyRead = (input.read() == -1);
-            input.reset();
+        try (InputStream wrappedInputStream = ControllableInputStream.wrap(input, DefaultBufferSize, 0)) {
+            wrappedInputStream.mark(DefaultBufferSize);
+            ByteBuffer firstBytes = readToByteBuffer(wrappedInputStream, firstReadBufferSize - 1); // -1 because we read one more to see if completed. First read is < buffer size, so can't be invalid.
+            boolean fullyRead = (wrappedInputStream.read() == -1);
+            wrappedInputStream.reset();
 
             // look for BOM - overrides any other header or input
             BomCharset bomCharset = detectCharsetFromBom(firstBytes);
@@ -189,9 +220,8 @@ else if (first instanceof Comment) {
                         if (comment.isXmlDeclaration())
                             decl = comment.asXmlDeclaration();
                     }
-                    if (decl != null) {
-                        if (decl.name().equalsIgnoreCase("xml"))
-                            foundCharset = decl.attr("encoding");
+                    if (decl != null && decl.name().equalsIgnoreCase("xml")) {
+                        foundCharset = decl.attr("encoding");
                     }
                 }
                 foundCharset = validateCharset(foundCharset);
@@ -208,8 +238,7 @@ else if (first instanceof Comment) {
             if (doc == null) {
                 if (charsetName == null)
                     charsetName = defaultCharsetName;
-                BufferedReader reader = new BufferedReader(new InputStreamReader(input, Charset.forName(charsetName)), DefaultBufferSize); // Android level does not allow us try-with-resources
-                try {
+                try (BufferedReader reader = new BufferedReader(new InputStreamReader(wrappedInputStream, Charset.forName(charsetName)), DefaultBufferSize)) {
                     if (bomCharset != null && bomCharset.offset) { // creating the buffered reader ignores the input pos, so must skip here
                         long skipped = reader.skip(1);
                         Validate.isTrue(skipped == 1); // WTF if this fails.
@@ -227,14 +256,8 @@ else if (first instanceof Comment) {
                         doc.charset(UTF_8);
                     }
                 }
-                finally {
-                    reader.close();
-                }
             }
         }
-        finally {
-            input.close();
-        }
         return doc;
     }
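
The gzip handling in `load(Path, ...)` above replaces the old close-and-reopen approach with a seekable channel: read two bytes, compare them to the gzip magic bytes, then rewind. The following is a minimal JDK-only sketch of that idea, not the jsoup implementation. `GzipSniffSketch`, `openMaybeGzipped`, and `readAll` are hypothetical names, and unlike the diff the sketch does not check the file extension first.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.channels.Channels;
import java.nio.channels.SeekableByteChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class GzipSniffSketch {
    // Hypothetical helper: opens a file and gunzips it only if it starts with the gzip magic bytes.
    public static InputStream openMaybeGzipped(Path path) throws IOException {
        SeekableByteChannel channel = Files.newByteChannel(path);
        try {
            InputStream stream = Channels.newInputStream(channel);
            // gzip files begin with the bytes 0x1f 0x8b
            boolean zipped = stream.read() == 0x1f && stream.read() == 0x8b;
            channel.position(0); // rewind, so the caller sees the file from the start
            return zipped ? new GZIPInputStream(stream) : stream;
        } catch (IOException e) {
            channel.close(); // do not leak the channel if sniffing or the gzip header read fails
            throw e;
        }
    }

    // Reads the whole (possibly gzipped) file as UTF-8 text; closing the stream closes the channel.
    public static String readAll(Path path) throws IOException {
        try (InputStream in = openMaybeGzipped(path)) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buffer = new byte[4096];
            int read;
            while ((read = in.read(buffer)) != -1) {
                out.write(buffer, 0, read);
            }
            return new String(out.toByteArray(), StandardCharsets.UTF_8);
        }
    }

    public static void main(String[] args) throws IOException {
        Path gz = Files.createTempFile("sketch", ".gz");
        try (OutputStream out = new GZIPOutputStream(Files.newOutputStream(gz))) {
            out.write("<p>zipped</p>".getBytes(StandardCharsets.UTF_8));
        }
        Path plain = Files.createTempFile("sketch", ".html");
        Files.write(plain, "<p>plain</p>".getBytes(StandardCharsets.UTF_8));

        System.out.println(readAll(gz));
        System.out.println(readAll(plain));

        Files.delete(gz);
        Files.delete(plain);
    }
}
```

Because the stream reads through the channel, repositioning the channel is enough to restart reading, which avoids opening the file a second time.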