feat(community): Adds an HTML loader for URLS #7184
Thanks for this! Looks great.
Sorry, noticed one small thing. Can you also run |
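@jacoblee93 I moved the `WebBaseLoader` and `WebBaseLoaderParams` to the HTML loader. I then imported and re-exported the `WebBaseLoaderParams` from the Cheerio loader. Hope that's all ok!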
I recently discovered the same issue that was reported in langchain-ai#2467. The documentation and examples suggest chaining the `CheerioWebBaseLoader` with `MozillaReadabilityTransformer` to load and transform HTML documents. However, the `CheerioWebBaseLoader` uses [Cheerio's `text` method to extract the text](https://github.com/langchain-ai/langchainjs/blob/05e5813715150cd69d9e384924818562e3b7c1fa/libs/langchain-community/src/document_loaders/web/cheerio.ts#L144C34-L144C40) from the HTML document, or provided selector. This is great if you want the text, but the MozillaReadabilityTransformer needs to act on HTML. I have added an `HTMLWebBaseLoader` that simply uses `fetch` to get an HTML document and returns the full HTML content. I've also updated the `MozillaReadabilityTransformer` example to use the `HTMLWebBaseLoader` instead of the `CheerioWebBaseLoader`.
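To illustrate the idea described above, here is a minimal sketch of a loader that fetches a page and returns the raw HTML untouched. It is not the code from this PR: `SketchHtmlLoader`, `HtmlDocument`, and `FetchLike` are hypothetical names chosen for illustration, and the injectable `fetchFn` parameter is an assumption added so the class can be exercised without network access.

```typescript
// Hypothetical sketch, not the HTMLWebBaseLoader implementation from this PR.

// Minimal document shape: the page content plus where it came from.
interface HtmlDocument {
  pageContent: string;
  metadata: { source: string };
}

// The subset of the fetch response that this sketch relies on.
type FetchLike = (url: string) => Promise<{
  ok: boolean;
  status: number;
  text(): Promise<string>;
}>;

class SketchHtmlLoader {
  private readonly url: string;
  private readonly fetchFn: FetchLike;

  // fetchFn defaults to the global fetch; passing a stub is an assumption
  // made here purely for testability.
  constructor(url: string, fetchFn: FetchLike = (u) => fetch(u)) {
    this.url = url;
    this.fetchFn = fetchFn;
  }

  // Returns the full HTML so that a downstream transformer which needs
  // markup (rather than extracted text) still has the tags to work with.
  async load(): Promise<HtmlDocument[]> {
    const response = await this.fetchFn(this.url);
    if (!response.ok) {
      throw new Error(
        `Request for ${this.url} failed with status ${response.status}`
      );
    }
    const html = await response.text();
    return [{ pageContent: html, metadata: { source: this.url } }];
  }
}
```

The key design point is that `pageContent` holds the markup as received, in contrast to a loader that extracts text before handing the document on.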
I moved the WebBaseLoader and WebBaseLoaderParams to the HTML loader as they were relevant and it seems that loaders export their own types. I also added the `selector` property back to WebBaseLoaderParams and deprecated it from there. I then imported and re-exported the WebBaseLoaderParams from the Cheerio loader with deprecation, so that it doesn't break for anyone that was importing it from here.
Force-pushed to fix the merge conflict.
@jacoblee93 can we trigger the jobs again to see if this is looking good? Thanks!
I recently discovered the same issue that was reported in #2467. The documentation and examples suggest chaining the `CheerioWebBaseLoader` with `MozillaReadabilityTransformer` to load and transform HTML documents. However, the `CheerioWebBaseLoader` uses Cheerio's `text` method to extract the text from the HTML document, or provided selector. This is great if you want the text, but the `MozillaReadabilityTransformer` needs to act on HTML.

I have added an `HTMLWebBaseLoader` that simply uses `fetch` to get an HTML document and returns the full HTML content.

I've also updated the `MozillaReadabilityTransformer` example to use the `HTMLWebBaseLoader` instead of the `CheerioWebBaseLoader`.

Fixes #2467
I'm @philnash if you're still doing shout outs :)