Nested quotes inside of the multi-line strings bugs #230

vors · 2019-02-22T19:52:36Z

This is somewhat similar to #123 .
Tested on 0.10.0

I found 2 issues with quotes inside a multiline file.

Repro 1

>>> toml.loads('''
... 'foo' = """
... yield "Don't do this"
... """
... 'bar' = 'baz'
... ''')
{}

That should not be empty.

Note that 'bar' = 'baz' is significant - if you drop it, then everything works

>>> toml.loads('''
... 'foo' = """
... yield "Don't do this"
... """
... ''')
{'foo': 'yield "Don\'t do this"\n '}

Repro 2

Also dropping the yield part leads to exception, but I don't think that the quotes are unbalanced there

>>> toml.loads('''
... 'foo' = """
... "Don't do this"
... """
... ''')
Traceback (most recent call last):
  File "<stdin>", line 5, in <module>
  File "/usr/local/lib/python3.7/site-packages/toml/decoder.py", line 301, in loads
    raise TomlDecodeError("Unbalanced quotes", original, i)
toml.decoder.TomlDecodeError: Unbalanced quotes (line 3 column 16 char 28)

The text was updated successfully, but these errors were encountered:

JiayiXu · 2019-02-22T19:57:44Z

This is important. I hit the same exact. Will really appreciate it if this is fixed.

hipe · 2019-06-05T22:56:07Z

Sorry to be like this, but i've stepped thru the code in the debugger w/ step debugging, and the whole thing needs a rewrite. the parsing algorithm is fundamentally flawed and it needs an overhaul, if not a parser-generator. Pretty much every line i've stepped over has something .. not-best-practicey about it. I could elaborate 🤪

Specifically the part about how multiline strings are parsed is like, a mostly nonworking hack. Every couple of lines I step over I can come up with another edge case for a multi-line string that trips up the parser. like, parsing a multi-line literal string (''', in contrast to a multi-line basic( """)) is supposed to be easy because there's no escaping you have to scan for at all. you just have to keep reading until you find the first closing ('''). This is not how it works currently and this is why there are so many bug reports around multi-line strings.

I'm exploring various options now, including trying to get C-bindings to the (unofficially authoritative) Go library .. that could be called from python.

I'd be curious to hear others' thoughts of course!

Also i find the lack of unit tests troubling. Yes there's the official test suite, but that doesn't cover all these edge cases that people are running in to. When you fix it for these edge cases, there should always be a new test case covering this fix. There isn't

vors · 2019-06-05T23:41:40Z

@hipe thank you for taking a look! Out of curiosity, is go parser more authoritative than the rust one?

hipe · 2019-06-06T10:43:04Z

@vors thanks for your response! Offhand i don't know but i (i as in me but also i as in whoever) would certainly look at the rust one too when shopping around for possibilities for a python-go or python-rust interoperability for a possible new python package to work on.

Also i would give the rust [crate ?] a close look too when I'm looking for test cases to put in a test suite with better coverage to compliment the "official" (but apparently anemic) test suite.

My workaround for my current project ..

(i have a bug where, with this ☝️ python implementation of toml, multiline strings that have URL's with html anchors in them '#' they GET REMOVED LIKE THEY'RE COMMENTS LOL 😭 (but only if the number of single quotes that appear before that '#' IS ODD 💀))

..my workaround for my current projects might be to migrate from toml to this new "eno"

https://news.ycombinator.com/item?id=17765426

because it sounds like a better fit for my use case anyway. But there is still this place in my heart for toml and also ☝️ this toml library should "just work" because of how supposedly simple it is and because this module ships in the python standard distribution, right?

more later hopefully..

vors · 2019-06-06T18:36:36Z

this module ships in the python standard distribution, right?

I don't think so - I got it from pip.

uiri · 2019-06-07T05:08:03Z

Hi Sergei -- thank you for reporting this issue! Unfortunately, I have not yet had the chance to look in depth at fixing it.

I am going to add a bit of context for Chip as to why the string parsing in particular is so fucked up and the sorry state of this repository.

@hipe This codebase is almost as old as the TOML standard itself. I originally wrote it over 6 years ago after seeing the spec posted on Hacker News. The format at that time was much, much simpler. For example, there was one type of string (not 4!) and there were no inline tables to create mutual nesting of arrays and tables. Since I was so quick to implement it, I was able to snag the toml name in Pip.

The multiline string and literal string stuff was added in v0.3.0, around a year and a half after I had implemented the original parser. So the code suffers from the standard problem of scope creep plus a lack of foresight as to how the spec would expand in the future when the codebase was first written.

Beyond the scope creep issue, I have been the primary maintainer of the codebase since I wrote it. This project has never been my day job. I've had less and less spare time to devote to it as I've gone from high school senior to undergrad student to software professional. The primary mode of development is drive-by patches, whether written by someone else or by myself.

I have been meaning to go back and add test cases to the test suite to cover all the past issues that I've patched, but it is difficult to justify prioritizing that over fixing new and existing issues as they are reported. And by "meaning to", I mean this is something that has been in the back of my mind as a TODO for months (years?) at this point. If you want to take a stab at it, by all means, please do. Patches welcome 😃 That applies to both this repo and the test suite repo.

hipe · 2019-06-07T08:53:59Z

Thank you @uiri! I feel your pain 🙏 and I hope I can help out eventually 🤗

i just re-discovered that Pipfiles are in toml, and so we have:

https://github.com/pypa/pip/blob/master/src/pip/_vendor/pytoml/parser.py

(and pip certainly isn't going anywhere!)

and so it would be a fun compare-and-contrast to see if one can hackishly reach that module and see how it does against our imaginary test suite i keep talking about.

but when you look for a "pytoml" library proper, it says it's deprecated and points us back to the topic project https://github.com/avakar/pytoml

more later!

simplesteph mentioned this issue Mar 9, 2019

Can't parse multiline #237

Open

miccoli mentioned this issue Oct 31, 2019

Broken parsing of triple-quoted strings with dotted keys #265

Open

pradyunsg added type: bug A confirmed bug or unintended behavior component: decoder Related to parsing in `toml.load` syntax: strings Related to string literals labels Apr 20, 2022

pirate mentioned this issue Sep 24, 2024

Bug: Many serious TOML parsing/loading bugs caused by underlying toml library, consider switching? fabiocaccamo/python-benedict#439

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Nested quotes inside of the multi-line strings bugs #230

Nested quotes inside of the multi-line strings bugs #230

vors commented Feb 22, 2019 •

edited

Loading

JiayiXu commented Feb 22, 2019

hipe commented Jun 5, 2019 •

edited

Loading

vors commented Jun 5, 2019 •

edited

Loading

hipe commented Jun 6, 2019

vors commented Jun 6, 2019

uiri commented Jun 7, 2019

hipe commented Jun 7, 2019

Nested quotes inside of the multi-line strings bugs #230

Nested quotes inside of the multi-line strings bugs #230

Comments

vors commented Feb 22, 2019 • edited Loading

Repro 1

Repro 2

JiayiXu commented Feb 22, 2019

hipe commented Jun 5, 2019 • edited Loading

vors commented Jun 5, 2019 • edited Loading

hipe commented Jun 6, 2019

vors commented Jun 6, 2019

uiri commented Jun 7, 2019

hipe commented Jun 7, 2019

vors commented Feb 22, 2019 •

edited

Loading

hipe commented Jun 5, 2019 •

edited

Loading

vors commented Jun 5, 2019 •

edited

Loading