Bug: Docling misinterprets linebreaks in markdown input as paragraph breaks #822

Closed
bbrowning opened this issue Jan 28, 2025 · 2 comments · Fixed by #824
Labels: bug, markdown

@bbrowning

Bug

When given a markdown file with hard-wrapped content (i.e., a text editor or human user has wrapped all lines at 72, 80, or some other number of characters), Docling misinterprets each single linebreak between lines as a paragraph break.

Steps to reproduce

Create an input markdown file with single linebreaks in it for word wrapping purposes. Here's an example that I'll refer to in commands below as living at the path input/phoenix.md:

    **Phoenix** is a minor [constellation](constellation "wikilink") in the
    [southern sky](southern_sky "wikilink"). Named after the mythical
    [phoenix](Phoenix_(mythology) "wikilink"), it was first depicted on a

Docling-generated markdown output

Convert that input phoenix.md to markdown with: docling --from md --to md input/phoenix.md

    Phoenix is a minor constellation in the

    southern sky. Named after the mythical

    phoenix, it was first depicted on a

Docling-generated json output

Convert that input phoenix.md to json with: docling --from md --to json input/phoenix.md. The excerpt below is a small piece of the JSON generated from a larger input file, but it illustrates the point:

    {
      "self_ref": "#/texts/1",
      "parent": {
        "$ref": "#/body"
      },
      "children": [],
      "label": "paragraph",
      "prov": [],
      "orig": "Phoenix is a minor constellation in the",
      "text": "Phoenix is a minor constellation in the"
    },
    {
      "self_ref": "#/texts/2",
      "parent": {
        "$ref": "#/body"
      },
      "children": [],
      "label": "paragraph",
      "prov": [],
      "orig": "southern sky. Named after the mythical",
      "text": "southern sky. Named after the mythical"
    },
    {
      "self_ref": "#/texts/3",
      "parent": {
        "$ref": "#/body"
      },
      "children": [],
      "label": "paragraph",
      "prov": [],
      "orig": "phoenix, it was first depicted on a",
      "text": "phoenix, it was first depicted on a"
    },
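
For anyone who wants to check their own output for the same symptom, here is a minimal sketch that lists the paragraph items in a JSON export. It assumes the export has a top-level `texts` array (inferred from the `#/texts/N` references above) and uses an inline sample in place of a real file; the helper name `paragraph_texts` is made up for illustration.

```python
import json


def paragraph_texts(doc: dict) -> list[str]:
    """Return the text of every item labelled 'paragraph'.

    Assumes a top-level "texts" array, inferred from the "#/texts/N"
    self_ref values in the excerpt above.
    """
    return [
        item["text"]
        for item in doc.get("texts", [])
        if item.get("label") == "paragraph"
    ]


# Inline stand-in for loading a real export from disk.
sample = json.loads("""
{
  "texts": [
    {"self_ref": "#/texts/1", "label": "paragraph",
     "text": "Phoenix is a minor constellation in the"},
    {"self_ref": "#/texts/2", "label": "paragraph",
     "text": "southern sky. Named after the mythical"},
    {"self_ref": "#/texts/3", "label": "paragraph",
     "text": "phoenix, it was first depicted on a"}
  ]
}
""")

# One source paragraph shows up as three separate paragraph items.
for text in paragraph_texts(sample):
    print(repr(text))
```

Paragraph items that end mid-sentence, as all three do here, are a quick hint that wrapped lines were split apart.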

Docling version

Docling version: 2.16.0
Docling Core version: 2.15.1
Docling IBM Models version: 3.3.0
Docling Parse version: 3.1.2

Python version

Python 3.11.9

@bbrowning
Author

The problem with treating linebreaks as paragraph breaks shows up the most, at least in my applications, when using the HybridChunker. Because each wrapped line is treated as a separate paragraph, my chunks frequently get split on the original newline boundaries, which regularly splits text in the middle of a sentence. The HybridChunker is doing its job and trying to split on paragraph boundaries; it's just that we're misidentifying paragraph boundaries in the source content.

As a workaround, users can ensure their markdown contains no newlines except those meant to signify paragraph breaks. In practice, though, hard line wrapping is fairly common in markdown files because it gets ignored during rendering. It's a personal preference of the person writing the markdown and their editor settings, but in the wild it's pretty frequent to see markdown files hard-wrapped at some number of characters per line.
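
A rough sketch of that workaround as a preprocessing step, using only the Python standard library. The function name `unwrap_paragraphs` is hypothetical, and the approach is deliberately naive: it joins every run of non-blank lines into one line, so it would also flatten lists, tables, headings, fenced code blocks, and intentional hard breaks. It is only reasonable for plain prose like the sample above.

```python
import re


def unwrap_paragraphs(markdown: str) -> str:
    """Join hard-wrapped lines so only blank lines separate paragraphs.

    Naive by design: any run of non-blank lines is treated as one
    paragraph, so lists, tables, and code blocks are NOT preserved.
    """
    # Split on one or more blank (or whitespace-only) lines.
    blocks = re.split(r"\n\s*\n", markdown.strip())
    # Within each block, replace the wrapping newlines with single spaces.
    unwrapped = (
        " ".join(line.strip() for line in block.splitlines())
        for block in blocks
    )
    return "\n\n".join(unwrapped)


wrapped = (
    "**Phoenix** is a minor constellation in the\n"
    "southern sky. Named after the mythical\n"
    "phoenix, it was first depicted on a\n"
    "\n"
    "A second paragraph that is\n"
    "also hard wrapped.\n"
)
print(unwrap_paragraphs(wrapped))
```

The result could be written to a temporary file and passed to docling in place of the original input.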

@bbrowning
Author

I pulled the new 2.17.0 release, tried this again on my sample input file, and the output looks much better. There aren't any extra newlines added and the chunks created from HybridChunker are splitting on actual paragraph boundaries from my source text a lot more often. Thanks for the quick fix!
