While there is a lot of public API, the core function seems to be `readability::extractor::extract`, with `scrape` as a small wrapper around it. However, this function makes a lot of assumptions. Currently it:
1. Extracts candidates.
2. Ranks candidates.
3. Cleans the DOM.
4. Fixes up `img` links.
5. Extracts a title.
6. Generates a plain-text version.
1 and 2 seem pretty core to the library, but the others all seem to be side concerns that could be done separately. Based on my reading, 4 is basically free because it is done while iterating the DOM, but 3, 5 and 6 are just run on the output, so they don't need to be bundled. For example, in my use case I am not using the text, I have my own process to fix up links more reliably, and I don't need the title, so that is all wasted work.
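For reference, a minimal sketch of what a caller pays for today (this assumes the crate's `Product` output struct with `title`, `content` and `text` fields; the URL is a placeholder):

```rust
use readability::extractor;

fn main() {
    // scrape fetches the page and runs every stage unconditionally:
    // candidate extraction, ranking, DOM cleaning, img link fixing,
    // title extraction and plain-text generation.
    let product = extractor::scrape("https://example.com/article").unwrap();

    // Even a caller that only wants the cleaned HTML has already paid
    // for the title and the plain-text rendering.
    println!("{}", product.content); // cleaned HTML
    let _ = product.title;           // computed whether or not it is used
    let _ = product.text;            // computed whether or not it is used
}
```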
I wonder if a better API would be something like:
- `extractv2()` that returns some sort of opaque result object, `Extracted`.
- `Extracted#html` with a settings object that controls sanitizing, simplifying and URL rewriting.
- `Extracted#text`, `Extracted#title`.
The current `extract` would then be implemented as something like:
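(A sketch only; this reuses the hypothetical `extractv2` and `HtmlSettings` from the previous snippet, plus the crate's existing `Product` output struct.)

```rust
pub fn extract<R: Read>(input: &mut R, url: &Url) -> std::io::Result<Product> {
    let extracted = extractv2(input)?;
    let settings = HtmlSettings {
        sanitize: true,
        simplify: true,
        rewrite_urls: Some(url.clone()), // preserves the current img fix-up
    };
    Ok(Product {
        title: extracted.title().unwrap_or_default(),
        content: extracted.html(&settings),
        text: extracted.text(),
    })
}
```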
But importantly, all of the pieces can be run independently for efficiency, and the different output formats can be naturally parametrized with output settings.
I managed to get a stripped-down form working for my case. Basically, if you inline the `readability::extractor::extract` function you can then strip it down as required, so no new public API is needed. However, doing it this way means that you have to interact with some implementation details. My result looks like this:
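(In outline; this is a sketch rather than literal code. It assumes the relevant parts of the crate's private `scorer` module have been vendored locally, the `find_candidates`/`get_link_density` signatures and the `Candidate` layout are approximations of those internals, and the imports assume the `markup5ever_rcdom` split of html5ever.)

```rust
use std::collections::BTreeMap;
use std::io::Read;
use std::path::Path;

use html5ever::parse_document;
use html5ever::serialize::{serialize, SerializeOpts};
use html5ever::tendril::TendrilSink;
use markup5ever_rcdom::{RcDom, SerializableHandle};

// `scorer` is the crate's scorer module, inlined/vendored locally;
// its items are referenced below with approximate signatures.
use crate::scorer::{self, Candidate};

// Inlined from readability::extractor::extract, stripped down to just
// candidate extraction + ranking + serialization. The clean(),
// fix_img_path(), title and plain-text steps are gone, and so is
// scorer::preprocess (see the note below).
fn extract_html<R: Read>(input: &mut R) -> std::io::Result<String> {
    let mut dom = parse_document(RcDom::default(), Default::default())
        .from_utf8()
        .read_from(input)?;
    let handle = dom.document.clone();

    // Step 1: extract candidates (vendored internal).
    let mut candidates = BTreeMap::new();
    let mut nodes = BTreeMap::new();
    scorer::find_candidates(&mut dom, Path::new("/"), handle, &mut candidates, &mut nodes);

    // Step 2: rank by link-density-adjusted score and keep the winner.
    let adjusted = |c: &Candidate| {
        c.score.get() * (1.0 - scorer::get_link_density(c.node.clone()))
    };
    let top = candidates
        .values()
        .max_by(|a, b| adjusted(a).partial_cmp(&adjusted(b)).expect("finite scores"))
        .expect("no candidates found");

    // Serialize the winning subtree straight to HTML; nothing else runs.
    let mut bytes = Vec::new();
    serialize(
        &mut bytes,
        &SerializableHandle::from(top.node.clone()),
        SerializeOpts::default(),
    )?;
    Ok(String::from_utf8_lossy(&bytes).into_owned())
}
```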
Not only does this remove a lot of the unnecessary work for my use case of just getting the raw HTML, it also let me remove a lot of unnecessary clones (although they were `Rc`s, so the cost was probably minimal).
One note: I removed the call to `readability::scorer::preprocess`, which does modify the document and so may slightly affect scoring. I did this because it did some cleaning that I didn't want, and the effect on scoring appears to be minimal. This is also where the title extraction was done, but I didn't need that.