Inline <p> not handled well #92

Open
mirabilos opened this issue Jul 7, 2023 · 1 comment · May be fixed by #108 or #120

Comments

@mirabilos

Input:

Wie nennt man jemanden, der gegen Covid-Politik Proteste organisiert, und sich dann mit einem Covid-Testcenter an der Zitze des Staates labt?<p><a href="https://www.mdr.de/nachrichten/sachsen-anhalt/dessau/bitterfeld/buergermeister-raguhn-jessnitz-loth-afd-100.html">Bürgermeister und AfD-Abgeordneter</a>.

Proposed patch:

-        return '%s\n\n' % text if text else ''
+        return '\n\n%s\n\n' % text if text else ''

I'm unsure whether the trailing newlines actually need to be there, but why not: people have to postprocess the output to clean up newlines anyway.

@chrispy-snps
Collaborator

Smaller testcase:

>>> from markdownify import markdownify as md
>>> md('TEXT1<p>TEXT2</p>')
'TEXT1TEXT2\n\n'

There is no line break before the TEXT2 paragraph content.
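
To see what the proposed patch would change for this testcase, here is a minimal standalone sketch. It does not call markdownify; `convert_p_current` and `convert_p_patched` are hypothetical stand-ins that only reuse the two format strings from the patch above, and the plain string concatenation is an assumption about how the surrounding text gets joined to the paragraph output.

```python
def convert_p_current(text):
    # Stand-in for the unpatched line: newlines are only appended.
    return '%s\n\n' % text if text else ''


def convert_p_patched(text):
    # Stand-in for the patched line: newlines are also prepended.
    return '\n\n%s\n\n' % text if text else ''


# Assumption: inline text and the paragraph output are simply concatenated.
before = 'TEXT1' + convert_p_current('TEXT2')
after = 'TEXT1' + convert_p_patched('TEXT2')

print(repr(before))  # 'TEXT1TEXT2\n\n'
print(repr(after))   # 'TEXT1\n\nTEXT2\n\n'
```

With the leading newlines, TEXT2 starts its own paragraph. The cost is that two adjacent paragraphs would then be separated by four newlines unless something trims them.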

mirabilos added a commit to mirabilos/python-markdownify that referenced this issue Jan 31, 2024
@mirabilos mirabilos linked a pull request Jan 31, 2024 that will close this issue
jsm28 added a commit to jsm28/python-markdownify that referenced this issue Apr 9, 2024
There are various cases in which inline text fails to be separated by
(sufficiently many) newlines from adjacent block content.  A paragraph
needs a blank line (two newlines) separating it from prior text, as
does an underlined header; an ATX header needs a single newline
separating it from prior text.  A list needs at least one newline
separating it from prior text, but in general two newlines (for an
ordered list starting other than at 1, which will only be recognized
given a blank line before).

To avoid accumulation of more newlines than necessary, take care when
concatenating the results of converting consecutive tags to remove
redundant newlines (keeping the greater of the number ending the prior
text and the number starting the subsequent text).

This is thus an alternative to matthewwithanm#108 that tries to avoid the excess
newline accumulation that was a concern there, as well as fixing more
cases than just paragraphs, and updating tests.

Fixes matthewwithanm#92

Fixes matthewwithanm#98
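
The rule described in the commit message above (keep the greater of the newline count ending the prior text and the count starting the next text) can be sketched roughly as follows. This is an illustration only, not the code from the commit; `join_chunks` is a hypothetical helper name, and it assumes each converted tag yields a plain string.

```python
def join_chunks(chunks):
    """Concatenate converted fragments, merging newlines at each seam.

    At every join, keep max(trailing newlines of the text so far,
    leading newlines of the next chunk) instead of their sum.
    """
    result = ''
    for chunk in chunks:
        if not chunk:
            continue
        # Count newlines at the end of what we have and at the start of the chunk.
        trailing = len(result) - len(result.rstrip('\n'))
        leading = len(chunk) - len(chunk.lstrip('\n'))
        keep = max(trailing, leading)
        result = result.rstrip('\n') + '\n' * keep + chunk.lstrip('\n')
    return result


# Inline text followed by a paragraph: a blank line is inserted.
print(repr(join_chunks(['TEXT1', '\n\nTEXT2\n\n'])))   # 'TEXT1\n\nTEXT2\n\n'
# Two paragraphs in a row: the seam stays at two newlines, not four.
print(repr(join_chunks(['\n\nA\n\n', '\n\nB\n\n'])))   # '\n\nA\n\nB\n\n'
```

Taking the maximum rather than the sum lets every block ask for the separation it needs on both sides without the output accumulating extra blank lines.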
@jsm28 jsm28 linked a pull request Apr 9, 2024 that will close this issue