Separate extraction+selection from cleaning, serializing and other features. #16

Open
kevincox opened this issue Feb 12, 2022 · 1 comment

Comments

@kevincox

While there is a lot of public API, it seems that the core method is readability::extractor::extract, with scrape as a small wrapper. However, this function makes a lot of assumptions. Currently it does the following:

  1. Extracts candidates.
  2. Ranks candidates.
  3. Cleans the DOM.
  4. Fixes up img links.
  5. Extracts a title.
  6. Generates a plain-text version.

Steps 1 and 2 seem pretty core to the library. However, the others all seem to be peripheral or can be done separately. Based on my reading, 4 is basically free because it is done while iterating the DOM, but 3, 5 and 6 are just run on the output, so they don't need to be bundled. For example, in my use case I am not using the text, I have my own process to fix up links more reliably, and I don't need the title, so that is all wasted work.

I wonder if a better API would be something like:

  • extractv2() that returns some sort of opaque result object Extracted.
  • Extracted#html with a settings object that controls sanitizing, simplifying and URL rewriting.
  • Extracted#text, Extracted#title

The current extract would then be implemented as something like:

fn extract<R: Read>(input: &mut R, url: &Url) -> Result<Product, Error> {
  let extracted = extractv2(input, url)?;
  Ok(Product {
    title: extracted.title(),
    html: extracted.html(Default::default()),
    text: extracted.text(),
  })
}

But importantly, all of the bits can be done independently for efficiency, and the different output formats can be naturally parametrized by output settings.
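
A minimal sketch of how such a split could look, to make the proposal concrete. Everything here is hypothetical: `HtmlSettings`, its `rewrite_src_origin` field, and this `extractv2` signature (taking a `&str` instead of a reader and URL) are illustrative assumptions, not part of the readability crate. The string handling is a deliberately naive stand-in for real DOM work, and the title accessor is omitted for brevity.

```rust
/// Hypothetical settings for `Extracted::html`; the field is illustrative only.
#[derive(Clone, Debug, Default)]
pub struct HtmlSettings {
    /// When set, root-relative `src="/..."` attributes get this origin prefixed.
    pub rewrite_src_origin: Option<String>,
}

/// Hypothetical opaque result of extraction + selection.
/// A real version would hold the selected DOM node; a String stands in here.
pub struct Extracted {
    html: String,
}

/// Stand-in for steps 1 and 2 (extract and rank); here it only wraps the input.
pub fn extractv2(input: &str) -> Extracted {
    Extracted { html: input.to_string() }
}

impl Extracted {
    /// Produces HTML on demand, so callers who never ask for it pay nothing.
    pub fn html(&self, settings: &HtmlSettings) -> String {
        match &settings.rewrite_src_origin {
            // Naive textual rewrite; a real version would walk the DOM.
            Some(origin) => self.html.replace("src=\"/", &format!("src=\"{}/", origin)),
            None => self.html.clone(),
        }
    }

    /// Naive plain-text rendering: drops everything between `<` and `>`.
    pub fn text(&self) -> String {
        let mut out = String::new();
        let mut in_tag = false;
        for ch in self.html.chars() {
            match ch {
                '<' => in_tag = true,
                '>' => in_tag = false,
                c if !in_tag => out.push(c),
                _ => {}
            }
        }
        out
    }
}
```

With this shape, a caller that only wants HTML calls `extractv2(...)` followed by `html(&settings)`, and the plain-text pass never runs.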

@kevincox
Author

I managed to get a stripped-down form working for my use case. Basically, if you inline the readability::extractor::extract function, you can then strip it down as required; it doesn't require any private APIs. However, doing it this way means that you have to interact with some implementation details. My result looks like this:

	let data: &str = unimplemented!("Your data here");

	// Parse the input into an RcDom.
	let parser = html5ever::parse_document(
		markup5ever_rcdom::RcDom::default(),
		Default::default())
		.from_utf8();
	let mut dom = html5ever::tendril::TendrilSink::read_from(
		parser,
		&mut std::io::Cursor::new(data))?;

	// Step 1: collect scored candidates from the whole document.
	let mut candidates = std::collections::BTreeMap::new();
	let root = dom.document.clone();
	readability::scorer::find_candidates(
		&mut dom,
		std::path::Path::new("/"),
		root,
		&mut candidates,
		&mut Default::default());

	// Step 2: pick the candidate with the best link-density-adjusted score.
	let candidate = candidates.into_iter()
		.map(|(_, c)| c)
		.max_by_key(|c| {
			let link_density = readability::scorer::get_link_density(c.node.clone());
			let fscore = c.score.get() * (1.0 - link_density);
			(fscore * 1000.0) as u32
		})
		.ok_or_else(|| anyhow::Error::msg("No candidates found."))?;

	// Serialize the selected node as-is, without cleaning.
	let mut bytes = Vec::with_capacity(1024);
	html5ever::serialize(
		&mut bytes,
		&markup5ever_rcdom::SerializableHandle::from(candidate.node),
		Default::default())?;
	let html = String::from_utf8(bytes)?;

Not only does this remove a lot of the unnecessary work for my use case of just getting the raw HTML, I also managed to remove a lot of unnecessary clones (although they were Rcs, so the cost was probably minimal).

One note: I removed the call to readability::scorer::preprocess, which modifies the document and so may slightly affect scoring. I did this because it did some cleaning that I didn't want, and the effect on scoring appears to be minimal. This is also where the title extraction was done, but I didn't need that.
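
The selection step in the snippet above ranks candidates by `score * (1.0 - link_density)` and turns that into an integer key with `as u32`. The sketch below isolates that logic on plain data. The `Scored` struct and both function names are hypothetical stand-ins, not types from the crate. It also shows one possible alternative, comparing the floats directly with `f32::total_cmp`, which differs from the integer key in one case: an `as u32` cast saturates negative values to 0, so if scores can go negative, all such candidates tie.

```rust
/// Hypothetical stand-in for a scored candidate (the crate stores a DOM node
/// and a score cell instead of plain fields).
#[derive(Debug, Clone, PartialEq)]
pub struct Scored {
    pub name: &'static str,
    pub score: f32,
    pub link_density: f32,
}

/// Same weighting as the snippet: the score discounted by link density.
pub fn weighted(c: &Scored) -> f32 {
    c.score * (1.0 - c.link_density)
}

/// Selection as in the snippet: quantize the weighted score to an integer key.
/// Negative values saturate to 0 under `as u32`, so they all compare equal.
pub fn pick_quantized(cands: &[Scored]) -> Option<&Scored> {
    cands.iter().max_by_key(|c| (weighted(c) * 1000.0) as u32)
}

/// Alternative (not what the snippet does): compare the floats directly
/// using the total ordering provided by `f32::total_cmp`.
pub fn pick_total(cands: &[Scored]) -> Option<&Scored> {
    cands.iter().max_by(|a, b| weighted(a).total_cmp(&weighted(b)))
}
```

For positive scores the two functions agree. They diverge only when every candidate has a negative weighted score: the integer key then returns whichever tied candidate comes last, while the float comparison returns the least negative one.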
