bug/certain htmls cannot be parsed #3697

AraiYuno · 2024-10-04T16:35:31Z

Describe the bug
Certain HTML pages scraped from the GCP docs, such as the URL below, come back from partition_html as empty elements or as elements containing only newline characters.

To Reproduce

import requests
from unstructured.partition.html import partition_html

url = 'https://cloud.google.com/generative-ai-app-builder/docs/update-schemas'
# Send a GET request to the URL
response = requests.get(url)
elements = partition_html(text=response.content)
for el in elements:
    print(el.text)

outputs

\n \n \n
\n
\n \n \n \n
\n
\n \n \n
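
One thing that may be worth ruling out in the reproduction above: `response.content` is a `bytes` object, while the output consists of what look like literal `\n` sequences. The sketch below uses only plain Python and a made-up stand-in for the page body. It shows what happens if bytes are turned into a string with `str()` instead of being decoded. Whether the `text=` argument is actually handled this way inside the library is an assumption here, not something confirmed from its source.

```python
# Hypothetical stand-in for `response.content`; the real page body is not reproduced here.
raw: bytes = b"<html>\n<body>\n<p>Hello</p>\n</body>\n</html>\n"

# str() on bytes yields the repr of the object ("b'...'"), not the markup itself.
stringified = str(raw)

# Decoding yields the actual markup, with real newline characters.
decoded = raw.decode("utf-8")

print(stringified[:8])            # starts with the b' prefix
print("\n" in stringified)        # no real newlines survive
print(stringified.count("\\n"))   # each one became a backslash followed by 'n'
print(decoded.count("\n"))        # the decoded text keeps real newlines
```

If this does turn out to be related, passing `response.text` (or `response.content.decode(...)` with the right encoding) instead of `response.content` would be the first thing to try.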

Expected behavior

partition_html should parse the HTML file and return a properly populated list of Elements.

Screenshots

Environment Info

$ python --version
Python 3.11.8

$ pip show unstructured
Name: unstructured
Version: 0.15.12
Summary: A library that prepares raw documents for downstream ML tasks.
Home-page: https://github.com/Unstructured-IO/unstructured
Author: Unstructured Technologies
Author-email: [email protected]
License: Apache-2.0
Location: /Users/kyleahn/Desktop/contextual_ai/repos/core/.venv/lib/python3.11/site-packages
Requires: backoff, beautifulsoup4, chardet, dataclasses-json, emoji, filetype, langdetect, lxml, nltk, numpy, psutil, python-iso639, python-magic, python-oxmsg, rapidfuzz, requests, tabulate, tqdm, typing-extensions, unstructured-client, wrapt
Required-by:

Additional context

I did some minor debugging and found that lxml returns a bad root.

unstructured/unstructured/partition/html/partition.py

Line 193 in 4711a8d

root = etree.fromstring(html_text, html_parser)


AraiYuno · 2024-10-24T22:34:16Z

Suspicious link has been deleted

wdormann · 2024-10-29T13:34:15Z

That's a malicious link.

AraiYuno · 2024-10-29T13:51:37Z

> That's a malicious link.

Figured it out. Thanks!

PhorstenkampFuzzy · 2024-10-30T09:14:49Z

The user that added that link was banned. @AraiYuno Could you delete the citation of the malicious link?

AraiYuno · 2024-10-30T09:44:34Z

> The user that added that link was banned. @AraiYuno Could you delete the citation of the malicious link?

deleted!

AraiYuno added the bug label Oct 4, 2024

scanny added the html label Oct 4, 2024
