Bug
When given a markdown file with hard-wrapped content (i.e. a text editor or a human has wrapped all lines at 72, 80, or some other number of characters), Docling misinterprets each single line break between wrapped lines as a paragraph break.
Steps to reproduce
Create an input markdown file that uses single line breaks for word wrapping. Here's an example, which the commands below assume lives at the path input/phoenix.md:
**Phoenix** is a minor [constellation](constellation"wikilink") in the
[southern sky](southern_sky"wikilink"). Named after the mythical
[phoenix](Phoenix_(mythology)"wikilink"), it was first depicted on a
Docling-generated markdown output
Convert that input phoenix.md to markdown with: docling --from md --to md input/phoenix.md
Phoenix is a minor constellation in the
southern sky. Named after the mythical
phoenix, it was first depicted on a
Docling-generated json output
Convert that input phoenix.md to json with: docling --from md --to json input/phoenix.md. This is just a small excerpt of the JSON output for a larger input file, but it illustrates the point:
{
"self_ref": "#/texts/1",
"parent": {
"$ref": "#/body"
},
"children": [],
"label": "paragraph",
"prov": [],
"orig": "Phoenix is a minor constellation in the",
"text": "Phoenix is a minor constellation in the"
},
{
"self_ref": "#/texts/2",
"parent": {
"$ref": "#/body"
},
"children": [],
"label": "paragraph",
"prov": [],
"orig": "southern sky. Named after the mythical",
"text": "southern sky. Named after the mythical"
},
{
"self_ref": "#/texts/3",
"parent": {
"$ref": "#/body"
},
"children": [],
"label": "paragraph",
"prov": [],
"orig": "phoenix, it was first depicted on a",
"text": "phoenix, it was first depicted on a"
},
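To check for the symptom without reading the JSON by eye, a small script can list the items labelled as paragraphs. This is only a sketch: the helper name `paragraph_texts` is made up for illustration, and it assumes the exported document keeps its text items in a top-level `texts` array, which is inferred from the `#/texts/1` references above.

```python
def paragraph_texts(doc: dict) -> list[str]:
    """Return the text of every item labelled "paragraph".

    Assumes a top-level "texts" list, inferred from the "#/texts/N"
    self_ref values in the JSON excerpt above.
    """
    return [
        item["text"]
        for item in doc.get("texts", [])
        if item.get("label") == "paragraph"
    ]


# Inline sample mirroring the three items from the excerpt above.
sample = {
    "texts": [
        {"label": "paragraph", "text": "Phoenix is a minor constellation in the"},
        {"label": "paragraph", "text": "southern sky. Named after the mythical"},
        {"label": "paragraph", "text": "phoenix, it was first depicted on a"},
    ]
}

for text in paragraph_texts(sample):
    print(repr(text))
```

With the real output, load the file with `json.load` and pass the resulting dict to the helper. Paragraph items that end mid-sentence, as all three do here, indicate that wrapped lines were split into separate paragraphs.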
Treating line breaks as paragraph breaks causes the most trouble, at least in my applications, when using the HybridChunker. Because it sees each wrapped line as a separate paragraph, my chunks are frequently split on the original newline boundaries, which regularly splits text in the middle of a sentence. The HybridChunker is doing its job and trying to split on paragraph boundaries; it's just that we're misidentifying paragraph boundaries in the source content.
As a workaround, the user can ensure there are no newlines in their markdown except the ones meant to signify paragraph breaks. In practice, hard line wrapping is fairly common in markdown files because it gets ignored during rendering. It comes down to the personal preference of the person writing the markdown and their editor settings, but in the wild it's pretty common to find markdown files hard-wrapped at some number of characters per line.
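The workaround can be scripted as a preprocessing step. The sketch below is not part of Docling, and `unwrap_paragraphs` is a hypothetical helper name. It joins consecutive non-blank lines into one line per paragraph, and leaves blank lines, headings, and fenced code blocks alone. It is deliberately naive: tables, block quotes, and hard breaks made with trailing spaces are not handled, so check the result before relying on it.

```python
import re

FENCE = re.compile(r"^\s*(```|~~~)")                # start/end of a fenced code block
HEADING = re.compile(r"^\s{0,3}#{1,6}(\s|$)")       # ATX heading such as "## Title"
LIST_ITEM = re.compile(r"^\s*([-*+]|\d+[.)])\s+")   # bullet or numbered list marker


def unwrap_paragraphs(markdown: str) -> str:
    """Join hard-wrapped lines so each paragraph sits on a single line."""
    out: list[str] = []   # finished output lines
    buf: list[str] = []   # lines of the paragraph currently being collected
    in_fence = False

    def flush() -> None:
        if buf:
            out.append(" ".join(buf))
            buf.clear()

    for line in markdown.splitlines():
        if FENCE.match(line):
            flush()
            in_fence = not in_fence
            out.append(line)
        elif in_fence:
            out.append(line)            # keep code exactly as written
        elif not line.strip():
            flush()
            out.append("")              # blank line: a real paragraph break
        elif HEADING.match(line):
            flush()
            out.append(line)            # headings always stand alone
        elif LIST_ITEM.match(line):
            flush()
            buf.append(line.rstrip())   # new list item; keep its indentation
        else:
            buf.append(line.strip())    # continuation of the current paragraph
    flush()
    return "\n".join(out)


wrapped = (
    "**Phoenix** is a minor constellation in the\n"
    "southern sky. Named after the mythical\n"
    "phoenix, it was first depicted on a\n"
)
print(unwrap_paragraphs(wrapped))
```

The unwrapped text can then be saved to a file and passed to the same docling commands as above, so that each paragraph reaches the converter as a single line.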
I pulled the new 2.17.0 release, tried this again on my sample input file, and the output looks much better. There aren't any extra newlines added and the chunks created from HybridChunker are splitting on actual paragraph boundaries from my source text a lot more often. Thanks for the quick fix!
Docling version
Docling version: 2.16.0
Docling Core version: 2.15.1
Docling IBM Models version: 3.3.0
Docling Parse version: 3.1.2
Python version
Python 3.11.9