bug/certain htmls cannot be parsed #3697

Open
AraiYuno opened this issue Oct 4, 2024 · 5 comments
Labels
bug (Something isn't working), html

Comments

AraiYuno commented Oct 4, 2024

Describe the bug
For certain HTML pages scraped from the GCP docs, such as the URL below, partition_html returns empty elements or elements containing only newline characters.

To Reproduce

import requests
from unstructured.partition.html import partition_html

url = 'https://cloud.google.com/generative-ai-app-builder/docs/update-schemas'
# Fetch the page and pass the raw response body to partition_html
response = requests.get(url)
elements = partition_html(text=response.content)
for el in elements:
    print(el.text)

Output:

\n \n \n
\n
\n \n \n \n
\n
\n \n \n
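
A possible explanation, offered as an assumption and not a confirmed diagnosis: the reproduction passes response.content, which is bytes, while the text parameter of partition_html is typed as str. If the bytes were coerced with str() somewhere along the way, every newline would turn into the two literal characters backslash and n, which would match the output above. Whether unstructured 0.15.12 actually does this has not been verified here. The sketch below uses only the standard library and a made-up HTML snippet to show the difference between coercing and decoding:

```python
# Stand-in for response.content: raw bytes containing real newline characters.
# The markup is a made-up example, not the GCP page from the report.
raw = b"<html>\n<body>\n<p>Hello</p>\n</body>\n</html>"

# Coercing bytes with str() yields the bytes repr: a "b'...'" wrapper
# with each newline escaped as a literal backslash followed by "n".
as_repr = str(raw)

# Decoding yields the actual markup, with real newline characters.
as_text = raw.decode("utf-8")

print(as_repr)
print(as_text)
```

If that is what happens, passing response.text (already a str) instead of response.content may be worth trying.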

Expected behavior

  • partition_html should parse the HTML and return a list of Elements with their text populated.

Screenshots

Environment Info

$ python --version
Python 3.11.8

$ pip show unstructured
Name: unstructured
Version: 0.15.12
Summary: A library that prepares raw documents for downstream ML tasks.
Home-page: https://github.com/Unstructured-IO/unstructured
Author: Unstructured Technologies
Author-email: [email protected]
License: Apache-2.0
Location: /Users/kyleahn/Desktop/contextual_ai/repos/core/.venv/lib/python3.11/site-packages
Requires: backoff, beautifulsoup4, chardet, dataclasses-json, emoji, filetype, langdetect, lxml, nltk, numpy, psutil, python-iso639, python-magic, python-oxmsg, rapidfuzz, requests, tabulate, tqdm, typing-extensions, unstructured-client, wrapt
Required-by: 

Additional context

AraiYuno added the bug label Oct 4, 2024
scanny added the html label Oct 4, 2024

AraiYuno commented Oct 24, 2024

Suspicious link has been deleted

@wdormann

That's a malicious link.

@AraiYuno

> That's a malicious link.

Figured it out. Thanks!

@PhorstenkampFuzzy

The user that added that link was banned. @AraiYuno Could you delete the citation of the malicious link?

@AraiYuno

> The user that added that link was banned. @AraiYuno Could you delete the citation of the malicious link?

deleted!

5 participants
@scanny @wdormann @AraiYuno @PhorstenkampFuzzy and others