Separate extraction+selection from cleaning, serializing and other features. #16

kevincox · 2022-02-12T20:58:46Z

While there is a lot of public API, it seems that the core method is readability::extractor::extract, with scrape as a small wrapper around it. However, this function makes a lot of assumptions. Currently it does:

1. Extracts candidates.
2. Ranks candidates.
3. Cleans the DOM.
4. Fixes up img links.
5. Extracts a title.
6. Generates a plain-text version.

Steps 1 and 2 seem pretty core to the library. However, the others all seem to be more on the side, or could be done separately. Based on my reading, 4 is basically free because it is done while iterating the DOM, but 3, 5 and 6 are just run on the output, so they don't need to be bundled. For example, in my use case I am not using the text, I have my own process to fix up links more reliably, and I don't need the title, so that is all wasted work.

I wonder if a better API would be something like:

extractv2() that returns some sort of opaque result object Extracted.
Extracted#html with a settings object that controls sanitizing, simplifying and URL rewriting.
Extracted#text, Extracted#title

The current extract would then be implemented as something like:

fn extract<R: Read>(input: &mut R, url: &Url) -> Result<Product, Error> {
  let extracted = extractv2(input, url)?;
  Ok(Product {
    title: extracted.title(),
    html: extracted.html(Default::default()),
    text: extracted.text(),
  })
}

Importantly, each of the pieces could then be run independently for efficiency, and the different output formats could naturally be parameterized by output settings.
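To make the proposed shape a bit more concrete, here is a minimal, self-contained sketch. Everything in it is hypothetical: `Extracted`, `HtmlSettings` and the `rewrite_base` field are names made up for illustration, and a plain string stands in for the selected DOM node so the example needs no external crates. It only aims to show that each output is computed on demand and that `html` takes its options as a settings value.

```rust
/// Hypothetical output settings; the field is illustrative, not part of the crate.
#[derive(Debug, Default)]
pub struct HtmlSettings {
    /// When set, root-relative `src="/..."` attributes are prefixed with this origin.
    pub rewrite_base: Option<String>,
}

/// Stand-in for the opaque result of the hypothetical `extractv2()`.
/// A real version would hold the selected DOM node rather than a string.
pub struct Extracted {
    fragment: String,
}

impl Extracted {
    /// Pretend extraction and selection already happened and produced `fragment`.
    pub fn new(fragment: &str) -> Self {
        Extracted { fragment: fragment.to_string() }
    }

    /// Serialize to HTML; optional processing is driven entirely by `settings`.
    pub fn html(&self, settings: HtmlSettings) -> String {
        match settings.rewrite_base {
            Some(base) => self
                .fragment
                .replace("src=\"/", &format!("src=\"{}/", base)),
            None => self.fragment.clone(),
        }
    }

    /// Plain text, computed only when requested.
    /// Naive on purpose: drops everything between '<' and '>'.
    pub fn text(&self) -> String {
        let mut out = String::new();
        let mut in_tag = false;
        for ch in self.fragment.chars() {
            match ch {
                '<' => in_tag = true,
                '>' => in_tag = false,
                _ if !in_tag => out.push(ch),
                _ => {}
            }
        }
        out
    }
}
```

A caller that only wants the raw HTML would call `html(Default::default())` and never pay for `text()`; a caller that wants links rewritten passes a non-default `HtmlSettings`.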


kevincox · 2022-02-21T18:35:04Z

I managed to get a stripped-down version working for my use case. Basically, if you inline the readability::extractor::extract function you can then strip it down as required; everything it needs is already exposed publicly. However, doing it this way means that you have to interact with some implementation details. My result looks like this:

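	// `from_utf8` is a `TendrilSink` trait method, so the trait must be in scope.
	use html5ever::tendril::TendrilSink;

	let data: &str = unimplemented!("Your data here");

	// Parse the input into an RcDom.
	let parser = html5ever::parse_document(
		markup5ever_rcdom::RcDom::default(),
		Default::default())
		.from_utf8();
	let mut dom = html5ever::tendril::TendrilSink::read_from(
		parser,
		&mut std::io::Cursor::new(data))?;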

	let mut candidates = std::collections::BTreeMap::new();
	let root = dom.document.clone();
	readability::scorer::find_candidates(
		&mut dom,
		std::path::Path::new("/"),
		root,
		&mut candidates,
		&mut Default::default());

	let candidate = candidates.into_iter()
		.map(|(_, c)| c)
		.max_by_key(|c| {
			let link_density = readability::scorer::get_link_density(c.node.clone());
			let fscore = c.score.get() * (1.0 - link_density);
			(fscore * 1000.0) as u32
		})
		.ok_or_else(|| anyhow::Error::msg("No candidates found."))?;

	let mut bytes = Vec::with_capacity(1024);
	html5ever::serialize(
		&mut bytes,
		&markup5ever_rcdom::SerializableHandle::from(candidate.node),
		Default::default()).ok();
	let html = String::from_utf8(bytes)?;

Not only does this remove a lot of unnecessary work for my use case of just getting the raw HTML, it also let me remove a lot of unnecessary clones (although they were Rcs, so the cost was probably minimal).

One note: I removed the call to readability::scorer::preprocess, which modifies the document and so may slightly affect scoring. I did this because it did some cleaning that I didn't want, and the effect on scoring appears to be minimal. This is also where the title extraction was done, but I didn't need that.
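For anyone adapting the selection step, here it is in isolation as a small sketch. The `Candidate` struct and the `best` function are stand-ins invented for this example (the real candidate type holds a DOM node and its score), and the link density is stored as a plain field instead of being computed from the node. The ranking expression itself is the one from the snippet above.

```rust
/// Stand-in for the crate's candidate type; only what ranking needs.
struct Candidate {
    name: &'static str,
    score: f32,
    link_density: f32,
}

/// Pick the candidate with the highest `score * (1 - link_density)`.
/// `f32` is not `Ord`, so the key is scaled by 1000 and cast to `u32`.
fn best(candidates: Vec<Candidate>) -> Option<Candidate> {
    candidates.into_iter().max_by_key(|c| {
        let fscore = c.score * (1.0 - c.link_density);
        (fscore * 1000.0) as u32
    })
}
```

Two details of this approach are worth knowing. Float-to-integer `as` casts saturate in Rust, so every candidate with a negative adjusted score gets the key 0 and they all tie. And `max_by_key` returns the last of several equally maximal elements, so iteration order decides such ties.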
