replace()-ing or remove()-ing self-closing tags breaks outputted markup #215
Comments
OK, this is looking like a more serious problem. I haven't made a minimal reproducer yet, but it looks like calling ...and using |
OK, here's a minimal reproducer based on the README example:

use lol_html::{element, HtmlRewriter, Settings};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let mut output = vec![];
    let mut rewriter = HtmlRewriter::new(
        Settings {
            element_content_handlers: vec![element!("hello", |el| {
                eprintln!("Is Self-Closing: {}", el.is_self_closing());
                el.replace("Hello,", lol_html::html_content::ContentType::Text);
                Ok(())
            })],
            ..Settings::default()
        },
        |c: &[u8]| output.extend_from_slice(c),
    );
    rewriter.write(b"<hello></hello> House!\n<hello /> Mouse!\nHello, World!\n")?;
    rewriter.end()?;
    assert_eq!(
        String::from_utf8(output)?,
        "Hello, House!\nHello, Mouse!\nHello, World!\n"
    );
    Ok(())
}
👋 I'm not a maintainer here, but jumping in with some info. (you can see a similar case and reply from me here: #207 (comment)) The crux is that in HTML, "self-closing" elements aren't denoted by the Even more specifically, the All of that to say, in this case lol-html sees you open a Browsers have more error protection and will force an end tag onto the tl;dr |
I'd find that more convincing if lol-html didn't have a functioning In its current state, that just comes across as "lol-html knows what you mean, but it's going to engage in malicious spec-compliance and not even warn you that it will do so because it just doesn't like you". I know it's functioning because I currently use it to implement |
Note the example output from my reproducer. It's returning |
Oh, got you. Yes, it looks like it is using the slash. That's not a great implementation from lol-html.
Tracing this back a little more: lol-html is pretty directly implementing the tokenizer state machine, which is where that flag is coming from: https://www.w3.org/TR/2011/WD-html5-20110113/tokenization.html#self-closing-start-tag-state — which does "Set the self-closing flag of the current tag token" — though that flag should only be for tokenization afaict. In lol-html it looks to be setting that flag on the element, which is then exposed via |
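To make the distinction between the tokenizer flag and the resulting element concrete, here is a minimal sketch in plain Rust (standard library only). It is not lol-html code: `element_stays_open` and `is_void_element` are hypothetical names invented for illustration, and the sketch only covers elements in the HTML namespace (SVG and MathML elements are treated differently by the parser). The list of void elements is taken from the current HTML Standard.

```rust
/// Void elements per the HTML Standard: they never take an end tag.
const VOID_ELEMENTS: [&str; 13] = [
    "area", "base", "br", "col", "embed", "hr", "img", "input", "link", "meta", "source",
    "track", "wbr",
];

/// Hypothetical helper (not a lol-html API): true if `tag_name` is a void element.
fn is_void_element(tag_name: &str) -> bool {
    let lower = tag_name.to_ascii_lowercase();
    VOID_ELEMENTS.contains(&lower.as_str())
}

/// Hypothetical helper: does a start tag leave an element open, waiting for
/// content and an end tag? The tokenizer's self-closing flag is accepted as a
/// parameter but deliberately ignored, because for HTML-namespace elements only
/// the tag name decides.
fn element_stays_open(tag_name: &str, _self_closing_flag: bool) -> bool {
    !is_void_element(tag_name)
}

fn main() {
    // `<hello />`: the flag is set, but the element still stays open.
    println!("hello stays open: {}", element_stays_open("hello", true));
    // `<br>`: no flag, but the element is void and never stays open.
    println!("br stays open: {}", element_stays_open("br", false));
}
```

Under these assumptions, `<hello />` opens an element exactly as `<hello>` would, which is consistent with the stray output seen in the reproducer.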
As far as lol-html goes, my biggest concern is ensuring that it's not such a footgun. Given that my own use is working around Markdown's lack of a generic inline "lang extension" syntax akin to what reStructuredText supports (I already use custom "languages" like Hell, I know it can work that way without taking advantage of how pulldown-cmark's token stream lets me run it only on chunks of embedded HTML without having to deal with the HTML generated from the Markdown constructs. (I just wish I could hook into the token stream before the built-in optional syntax extensions run so I don't have to choose between the smart punctuation and strikethrough extensions, the ability to implement Compose key shortcodes like |
In HTML, the The
That's how HTML parsing works. Lol-html is an HTML parser, so it won't give the I've updated the documentation to make it clear that When you try to use the invalid syntax |
I'm using lol-html as a way to postprocess some custom tags in pulldown-cmark output that's less fragile than quick-xml with pairing enforcement turned off (which is, itself, an order of magnitude faster than markup5ever) and I find it very frustrating that lol-html is so wedded to "once a self-closing tag, always a self-closing tag".
Even if I pull in my templating engine early and manually generate a properly escaped <span class="currency" title="$77.84 CAD, including tax">87.95 CAD + tax</span> string to .replace() my <price amount="77.84" currency="cad_tax" />, I wind up with broken markup and a spurious > at the next occurrence of the tag.
Isn't this a bit of a hazard? Shouldn't it assume that all bets are off about whether something is self-closing when you replace() it?
Hell, if it weren't so wedded to that and had elem.set_self_closing(false); (or just automatically toggled it when you do elem.set_inner_content), then I wouldn't need the templating engine at all and could do something like this:
...or, if it were smart enough to parse the contents of .replace() and re-attach elem to the first element in the replacement string:
As-is, my only choice appears to be to bail on the input file with an error message that blames lol-html and points to this issue (which is really terrible UX) or switch back to quick-xml and pay the runtime cost to run an Ammonia pass first to ensure that the user hasn't included any <script>, <style>, or <textarea> tags that would confuse it.
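One possible workaround, not proposed anywhere in this thread and offered only as a sketch, is to expand the custom self-closing tags into explicit open/close pairs before the markup reaches lol-html. The code below uses only the standard library. `expand_self_closing`, `tag_end`, and `tag_name` are hypothetical helpers, not lol-html APIs, and the scanner is deliberately naive: it skips quoted attribute values but knows nothing about comments, <script>, <style>, or <textarea> content, and it assumes attribute values on the custom tags are quoted.

```rust
/// Byte index of the `>` that closes the tag starting at `tag[0]`,
/// skipping any `>` inside quoted attribute values.
fn tag_end(tag: &str) -> Option<usize> {
    let mut quote: Option<char> = None;
    for (i, c) in tag.char_indices() {
        match quote {
            Some(q) if c == q => quote = None,
            Some(_) => {}
            None if c == '"' || c == '\'' => quote = Some(c),
            None if c == '>' => return Some(i),
            None => {}
        }
    }
    None
}

/// Name of a start tag such as `<price amount="1" />`.
fn tag_name(whole: &str) -> &str {
    let inner = &whole[1..];
    let end = inner
        .find(|c: char| c.is_whitespace() || c == '/' || c == '>')
        .unwrap_or(inner.len());
    &inner[..end]
}

/// Hypothetical pre-processing step: rewrites `<name ... />` into
/// `<name ...></name>` for every tag name listed in `custom_tags`.
fn expand_self_closing(input: &str, custom_tags: &[&str]) -> String {
    let mut out = String::with_capacity(input.len());
    let mut rest = input;
    while let Some(lt) = rest.find('<') {
        out.push_str(&rest[..lt]);
        let tag = &rest[lt..];
        // Only start tags are candidates; end tags, comments, and a bare `<` pass through.
        let starts_tag = tag[1..].starts_with(|c: char| c.is_ascii_alphabetic());
        let end = match tag_end(tag) {
            Some(end) if starts_tag => end,
            _ => {
                out.push('<');
                rest = &tag[1..];
                continue;
            }
        };
        let whole = &tag[..=end];
        let name = tag_name(whole);
        if whole.ends_with("/>") && custom_tags.iter().any(|t| t.eq_ignore_ascii_case(name)) {
            // Drop the trailing `/>` and emit an explicit end tag instead.
            out.push_str(whole[..whole.len() - 2].trim_end());
            out.push_str("></");
            out.push_str(name);
            out.push('>');
        } else {
            out.push_str(whole);
        }
        rest = &tag[end + 1..];
    }
    out.push_str(rest);
    out
}

fn main() {
    let input = "<price amount=\"77.84\" currency=\"cad_tax\" /> each";
    println!("{}", expand_self_closing(input, &["price"]));
}
```

The expanded string could then be passed to the rewriter as usual, so that the element handler sees a properly paired element. Whether this is cheaper or more robust than the quick-xml route would need to be measured on real input.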