div mis-converted #107

mirabilos · 2024-01-29T01:17:42Z

>>> from markdownify import MarkdownConverter
>>> MarkdownConverter().convert('<div>foo</div><div>bar<span>baz</span><span>meow</span></div>')
'foobarbazmeow'

Expected: 'foo  \nbarbazmeow'


mirabilos · 2024-01-29T01:21:59Z

Looking at the code, this is probably very hard to do: the second div would have to look behind and check that the preceding text did not already end with a \n\n hard paragraph break.

mirabilos · 2024-01-29T01:35:14Z

Workaround (this relies on the fix for #92 to be applied):

>>> import re
>>> import bs4
>>> from markdownify import MarkdownConverter
>>> text = '<div>foo</div><div>bar<span>baz</span><span>meow</span></div>'
>>> html = bs4.BeautifulSoup(text, 'html.parser')
>>> for e in html.find_all('div'):
...     e.insert_before(html.new_tag('br'))
... 
>>> text = MarkdownConverter().convert_soup(html)
>>> text = re.sub('  \n  \n', '\n\n', text)
>>> text = re.sub(' *\n\n+', '\n\n', text).strip()
>>> text
'foo  \nbarbazmeow'

Maybe it helps someone.
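
The two re.sub calls at the end of the workaround can be tried in isolation. The following is a minimal sketch using only the standard library; the function name collapse_breaks and the sample strings are made up for illustration, and it assumes the converter emits a Markdown hard line break (two spaces, then a newline) for each inserted <br>:

```python
import re


def collapse_breaks(text: str) -> str:
    """Post-process converter output (hypothetical helper, not part of markdownify)."""
    # Two hard line breaks in a row are merged into one paragraph break.
    text = re.sub('  \n  \n', '\n\n', text)
    # Trailing spaces before a paragraph break, and runs of blank lines,
    # are normalised to exactly one paragraph break.
    text = re.sub(' *\n\n+', '\n\n', text)
    # Drop the hard break that the <br> before the first <div> leaves behind.
    return text.strip()


if __name__ == '__main__':
    print(repr(collapse_breaks('  \nfoo  \nbarbazmeow')))
    print(repr(collapse_breaks('a  \n  \nb')))
```

A single hard line break between two divs survives, while doubled breaks and surplus blank lines end up as one paragraph break.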

For the sake of completeness, here is my current complete example of how I clean up HTML from RSS feeds to post it, as Markdown, to the Fediverse. It is called as cleanup(post.x), where x is title, summary, content, …, and it includes a number of workarounds for bad input and for limits of the conversion tools:

import re

import bs4
from markdownify import MarkdownConverter

def _cleanup_tablish(tag):
    for e in tag.contents:
        if isinstance(e, bs4.element.NavigableString) and str(e).strip() == '':
            e.extract()
            return True
    return False

def _cleanup_table(top):
    tag = top
    while isinstance(tag, bs4.element.Tag) and \
      tag.name in ('table', 'tbody', 'tr', 'th', 'td'):
        while _cleanup_tablish(tag):
            pass
        have_tablish = False
        have_nontablish = False
        have_elts = 0
        for e in tag.contents:
            if isinstance(e, bs4.element.NavigableString):
                have_nontablish = True
            elif e.name in ('table', 'tbody', 'tr', 'th', 'td'):
                have_tablish = True
            else:
                have_nontablish = True
            have_elts = have_elts + 1
        if have_elts == 0:
            top.extract()
            return
        if have_nontablish:
            if have_tablish:
                # huh?
                return
            tag.name = 'div'
            tag.attrs.clear()
            e = tag.contents[0]
            if have_elts == 1 and isinstance(e, bs4.element.Tag) and \
              e.name in bs4.builder.HTMLTreeBuilder.block_elements:
                tag = e
            if tag is not top:
                top.replace_with(tag)
            return
        if have_elts > 1:
            return
        tag = tag.contents[0]

_cleanup_traildots = re.compile('\\.\\.\\.$')
def cleanup(text):
    text = re.sub('\r+\n?', '\n', text)
    html = bs4.BeautifulSoup(text, 'html.parser', multi_valued_attributes=None)
    # remove <!-- comments -->
    for e in html.find_all(string=lambda e: isinstance(e, bs4.element.Comment)):
        e.extract()
    # flatten tables with only one cell (Goodreads)
    for e in html.find_all('table'):
        _cleanup_table(e)
    # expand shortened links
    for e in html.find_all('a', href=True, string=_cleanup_traildots):
        href = str(e['href'])
        if href.startswith(str(e.string).rstrip('.')):
            e.string.replace_with(href)
    # temporarily move <pre>s aside
    pres = []
    npres = 0
    for pre in html.find_all('pre'):
        pres.append(pre.replace_with(html.new_tag('rpre', num=npres)))
        npres = npres + 1
    # clean whitespace except in the extracted <pre>s
    text = str(html)
    text = re.sub(' *\n *', '\n', text)
    text = text.replace('\n', '\1')
    text = re.sub('\1\1\1+', '\n\n', text)
    text = re.sub('\1+ *', ' ', text).strip()
    text = re.sub('[\t ]+', ' ', text)
    # bring back the extracted <pre>s
    html = bs4.BeautifulSoup(text, 'html.parser')
    for pre in html.find_all('rpre'):
        pre.replace_with(pres[int(pre.attrs['num'])])
    # work around https://github.com/matthewwithanm/python-markdownify/issues/107
    for e in html.find_all('div'):
        e.insert_before(html.new_tag('br'))
    # convert and clean up
    text = MarkdownConverter(strip=['img']).convert_soup(html)
    text = re.sub('  \n  \n', '\n\n', '\n' + text + '\n')
    text = re.sub('(\n> )+\n', '\n> \n', '\n' + text + '\n')
    text = re.sub(' *\n\n+', '\n\n', text)
    return text.strip()

chrispy-snps · 2024-06-10T17:50:19Z

We're hitting this too. It is a difficult fix in the current architecture.

AlexVonB · 2024-11-24T20:36:01Z

Thanks for reporting this! Indeed it is quite hard to fix. In particular, we would have to decide if divs always behave as paragraphs. If we do, we could just handle divs the same as p and it would be fixed. But I think this would break different stuff. I'm open to suggestions, tho.

mirabilos · 2024-11-24T21:11:35Z

On Sun, 24 Nov 2024, AlexVonB wrote:

> Especially we would have to decide if divs always behave as paragraphs.

Uhm… no need to decide it, there’s already a spec for that ;-)

Basically, a div forces the browser to begin at a new line, thinking in terms of what text browsers such as lynx would do. If the current position is already at the start of a new line (e.g. because there was a </p> before it), there is no need to do anything; otherwise, a forced line break is needed.

So they specifically _don’t_ behave like paragraphs, which also have inter-paragraph spacing. Hence the example and expected text in the submission above.

bye, //mirabilos
-- 
"Using Lynx is like wearing a really good pair of shades: cuts out the glare and harmful UV (ultra-vanity), and you feel so-o-o COOL." -- Henry Nelson, March 1999
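
The rule described above can be sketched in a few lines of plain Python. This is only an illustration of the intended behaviour, not how markdownify works internally; the names append_block and join_blocks and the "div"/"p" labels are assumptions made up for the example:

```python
HARD_BREAK = "  \n"  # Markdown hard line break: two spaces, then a newline
PARA_BREAK = "\n\n"  # Markdown paragraph break


def append_block(out: str, kind: str, text: str) -> str:
    """Append one block's text to the Markdown built so far.

    kind is "div" or "p" (hypothetical labels used only in this sketch).
    """
    text = text.strip()
    if not text:
        return out
    if kind == "p":
        # Paragraphs get inter-paragraph spacing on both sides.
        if out:
            out = out.rstrip(" \n") + PARA_BREAK
        return out + text + PARA_BREAK
    # A div only has to start on a fresh line: force a line break
    # unless the output already ends with a newline.
    if out and not out.endswith("\n"):
        out += HARD_BREAK
    return out + text


def join_blocks(blocks):
    """Join (kind, text) pairs into one Markdown string."""
    out = ""
    for kind, text in blocks:
        out = append_block(out, kind, text)
    return out.strip()


if __name__ == "__main__":
    print(repr(join_blocks([("div", "foo"), ("div", "barbazmeow")])))
    print(repr(join_blocks([("p", "foo"), ("div", "bar")])))
```

The first call corresponds to the example in the submission; the second shows that a div following a paragraph needs no extra break, because the output is already at the start of a line.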

chrispy-snps · 2024-12-29T16:20:20Z

We previously used a source-modifying workaround like @mirabilos posted. However, we recently switched to a subclass-based workaround:

import markdownify


class CustomMarkdownConverter(markdownify.MarkdownConverter):
    """
    Create a custom MarkdownConverter that fixes some issues.
    """

    def convert_div(self, el, text, convert_as_inline):
        if convert_as_inline:
            return " " + text.strip() + " "
        else:
            return "\n\n" + text + "\n\n"

Our application prefers treating <div> boundaries as paragraph breaks, but you can modify it to use Markdown line breaks if preferred.
