Bug: Docling misinterprets linebreaks in markdown input as paragraph breaks #822

bbrowning · 2025-01-28T01:55:49Z

Bug

When Docling is given a markdown file with hard-wrapped content (i.e., a text editor or human author has wrapped all lines at 72, 80, or some other number of characters), it misinterprets each single linebreak between lines as a paragraph break.

Steps to reproduce

Create an input markdown file with single linebreaks in it for word wrapping purposes. Here's an example that I'll refer to in commands below as living at the path input/phoenix.md:

    **Phoenix** is a minor [constellation](constellation "wikilink") in the
    [southern sky](southern_sky "wikilink"). Named after the mythical
    [phoenix](Phoenix_(mythology) "wikilink"), it was first depicted on a

Docling-generated markdown output

Convert that input phoenix.md to markdown with: docling --from md --to md input/phoenix.md

    Phoenix is a minor constellation in the

    southern sky. Named after the mythical

    phoenix, it was first depicted on a

Docling-generated json output

Convert that input phoenix.md to json with: docling --from md --to json input/phoenix.md. This is just a small excerpt of the JSON output for a larger input file, but it illustrates the point:

    {
      "self_ref": "#/texts/1",
      "parent": {
        "$ref": "#/body"
      },
      "children": [],
      "label": "paragraph",
      "prov": [],
      "orig": "Phoenix is a minor constellation in the",
      "text": "Phoenix is a minor constellation in the"
    },
    {
      "self_ref": "#/texts/2",
      "parent": {
        "$ref": "#/body"
      },
      "children": [],
      "label": "paragraph",
      "prov": [],
      "orig": "southern sky. Named after the mythical",
      "text": "southern sky. Named after the mythical"
    },
    {
      "self_ref": "#/texts/3",
      "parent": {
        "$ref": "#/body"
      },
      "children": [],
      "label": "paragraph",
      "prov": [],
      "orig": "phoenix, it was first depicted on a",
      "text": "phoenix, it was first depicted on a"
    },
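A rough way to spot this symptom in JSON output is to look for paragraph items whose text stops without sentence-final punctuation. The sketch below is only a heuristic and not part of Docling: `suspicious_paragraphs` is a hypothetical helper, and the inline data is the excerpt above trimmed to the `label` and `text` fields.

```python
import json

# Excerpt shaped like the JSON above, trimmed to the two fields used here.
texts = json.loads("""
[
  {"label": "paragraph", "text": "Phoenix is a minor constellation in the"},
  {"label": "paragraph", "text": "southern sky. Named after the mythical"},
  {"label": "paragraph", "text": "phoenix, it was first depicted on a"}
]
""")


def suspicious_paragraphs(items):
    """Return paragraph texts that do not end in sentence-final punctuation.

    Hypothetical helper: a heuristic only. Legitimate paragraphs can also
    end without punctuation, so treat hits as hints, not proof.
    """
    terminal = (".", "!", "?", ":")
    return [
        item["text"]
        for item in items
        if item.get("label") == "paragraph"
        and not item["text"].rstrip().endswith(terminal)
    ]


flagged = suspicious_paragraphs(texts)
print(f"{len(flagged)} of {len(texts)} paragraphs end mid-sentence")
```

On the excerpt above, all three items are flagged, which matches the three wrapped lines of the input.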

Docling version

Docling version: 2.16.0
Docling Core version: 2.15.1
Docling IBM Models version: 3.3.0
Docling Parse version: 3.1.2

Python version

Python 3.11.9


bbrowning · 2025-01-28T14:20:47Z

The problem with treating linebreaks as paragraph breaks shows up the most, at least in my applications, when using the HybridChunker. Because Docling treats each wrapped line as a separate paragraph, my chunks frequently get split on the original newline boundaries, which regularly splits text in the middle of a sentence. The HybridChunker is doing its job and trying to split on paragraph boundaries; it's just that we're misidentifying paragraph boundaries in the source content.

As a workaround, users can ensure their markdown contains no newlines except those meant to signify paragraph breaks. In practice, though, hard line wrapping is fairly common in markdown files because it gets ignored during rendering. It comes down to the personal preference of the person writing the markdown and their editor settings, but in the wild markdown files are frequently hard-wrapped at some number of characters per line.
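The workaround can be sketched as a small preprocessing step that joins wrapped lines before the file is handed to Docling. This is a naive, stdlib-only sketch, not anything Docling provides: `unwrap_paragraphs` is a hypothetical helper, and it assumes the input is plain prose. It would also flatten lists, tables, headings, fenced code, and intentional hard breaks, so it is not safe for arbitrary markdown.

```python
import re


def unwrap_paragraphs(markdown: str) -> str:
    """Join hard-wrapped lines so each paragraph sits on a single line.

    Blank lines are kept as paragraph separators; every other newline
    is replaced by a single space. Prose-only assumption (see above).
    """
    paragraphs = re.split(r"\n\s*\n", markdown.strip())
    joined = (
        " ".join(line.strip() for line in paragraph.splitlines())
        for paragraph in paragraphs
    )
    return "\n\n".join(joined) + "\n"


# The first paragraph is the wrapped sample from the report (links removed);
# the second paragraph is made up to show that blank lines survive.
wrapped = (
    "Phoenix is a minor constellation in the\n"
    "southern sky. Named after the mythical\n"
    "phoenix, it was first depicted on a\n"
    "\n"
    "A second paragraph.\n"
)
print(unwrap_paragraphs(wrapped))
```

The unwrapped text could then be written to a temporary file and converted with the same docling command as before.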

bbrowning · 2025-01-28T19:44:36Z

I pulled the new 2.17.0 release, tried this again on my sample input file, and the output looks much better. There aren't any extra newlines added and the chunks created from HybridChunker are splitting on actual paragraph boundaries from my source text a lot more often. Thanks for the quick fix!
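For anyone scripting around this, a version guard is one way to decide whether the workaround is still needed. The sketch below assumes, based only on the comment above, that 2.17.0 is the first release containing the fix; `FIXED_IN`, `version_tuple`, and `has_newline_fix` are hypothetical names.

```python
import re
from importlib import metadata

# Assumption: taken from the comment above, not from release notes.
FIXED_IN = (2, 17, 0)


def version_tuple(version: str) -> tuple:
    """Parse the leading numeric components, e.g. '2.17.0' -> (2, 17, 0)."""
    numbers = [int(n) for n in re.findall(r"\d+", version)]
    # Pad short versions such as '2.17' so tuple comparison behaves.
    return tuple((numbers + [0, 0, 0])[:3])


def has_newline_fix(version: str) -> bool:
    """True if the given Docling version is at or after FIXED_IN."""
    return version_tuple(version) >= FIXED_IN


if __name__ == "__main__":
    try:
        installed = metadata.version("docling")
    except metadata.PackageNotFoundError:
        print("docling is not installed")
    else:
        print(installed, "includes fix:", has_newline_fix(installed))
```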

bbrowning added the bug Something isn't working label Jan 28, 2025

bbrowning mentioned this issue Jan 28, 2025

Docling treating every line break in markdown files as new paragraph instructlab/sdg#514

Open

PeterStaar-IBM added the markdown issue related to markdown backend label Jan 28, 2025

ceberam assigned ceberam and vagenas and unassigned ceberam Jan 28, 2025

vagenas mentioned this issue Jan 28, 2025

fix: fix single newline handling in MD backend #824

Merged

vagenas closed this as completed in #824 Jan 28, 2025

aakankshaduggal mentioned this issue Feb 3, 2025

Upgrade to Docling 2.9.0 or newer for HybridChunker instructlab/sdg#505

Open
