Describe the bug
Certain HTML files scraped from GCP docs, such as the URL in the reproduction below, yield empty elements or elements containing only newline characters when parsed with partition_html.
To Reproduce
import requests
from unstructured.partition.html import partition_html

url = 'https://cloud.google.com/generative-ai-app-builder/docs/update-schemas'
# Send a GET request to the URL
response = requests.get(url)
elements = data = partition_html(text=response.content)
for el in elements:
    print(el.text)
outputs
\n \n \n
\n
\n \n \n \n
\n
\n \n \n
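To quantify the symptom in the output above, a small helper can tally how many element texts are empty, whitespace-only, or carry real content. This is only a sketch: the function name `summarize_element_texts` and the sample strings are hypothetical and are not part of unstructured.

```python
from collections import Counter


def summarize_element_texts(texts):
    """Count how many element texts are empty, whitespace-only, or carry content."""
    counts = Counter()
    for text in texts:
        if text == "":
            counts["empty"] += 1
        elif text.strip() == "":
            counts["whitespace_only"] += 1
        else:
            counts["has_content"] += 1
    return counts


# Sample texts shaped like the reported output, plus one made-up
# non-blank entry for contrast (hypothetical data, not from the real page).
sample = ["\n \n \n", "\n", "", "Update a schema"]
summary = summarize_element_texts(sample)
print(dict(summary))
```

With the reproduction script, the same helper could be called as `summarize_element_texts([el.text for el in elements])` to see whether any element carries content at all.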
Expected behavior
partition_html should parse the HTML file and return a list of Elements that contain the page's text.
Screenshots
Environment Info
$ python --version
Python 3.11.8
$ pip show unstructured
Name: unstructured
Version: 0.15.12
Summary: A library that prepares raw documents for downstream ML tasks.
Home-page: https://github.com/Unstructured-IO/unstructured
Author: Unstructured Technologies
Author-email: [email protected]
License: Apache-2.0
Location: /Users/kyleahn/Desktop/contextual_ai/repos/core/.venv/lib/python3.11/site-packages
Requires: backoff, beautifulsoup4, chardet, dataclasses-json, emoji, filetype, langdetect, lxml, nltk, numpy, psutil, python-iso639, python-magic, python-oxmsg, rapidfuzz, requests, tabulate, tqdm, typing-extensions, unstructured-client, wrapt
Required-by:
Additional context
Did some minor debugging and found that lxml returns a bad root.
unstructured/unstructured/partition/html/partition.py
Line 193 in 4711a8d