Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Nested quotes inside of the multi-line strings bugs #230

Open
vors opened this issue Feb 22, 2019 · 7 comments
Open

Nested quotes inside of the multi-line strings bugs #230

vors opened this issue Feb 22, 2019 · 7 comments
Labels
component: decoder Related to parsing in `toml.load` syntax: strings Related to string literals type: bug A confirmed bug or unintended behavior

Comments

@vors
Copy link

vors commented Feb 22, 2019

This is somewhat similar to #123 .
Tested on 0.10.0

I found 2 issues with quotes inside a multiline file.

Repro 1

>>> toml.loads('''
... 'foo' = """
... yield "Don't do this"
... """
... 'bar' = 'baz'
... ''')
{}

That should not be empty.

Note that 'bar' = 'baz' is significant - if you drop it, then everything works

>>> toml.loads('''
... 'foo' = """
... yield "Don't do this"
... """
... ''')
{'foo': 'yield "Don\'t do this"\n '}

Repro 2

Also dropping the yield part leads to exception, but I don't think that the quotes are unbalanced there

>>> toml.loads('''
... 'foo' = """
... "Don't do this"
... """
... ''')
Traceback (most recent call last):
  File "<stdin>", line 5, in <module>
  File "/usr/local/lib/python3.7/site-packages/toml/decoder.py", line 301, in loads
    raise TomlDecodeError("Unbalanced quotes", original, i)
toml.decoder.TomlDecodeError: Unbalanced quotes (line 3 column 16 char 28)
@JiayiXu
Copy link

JiayiXu commented Feb 22, 2019

This is important. I hit the same exact. Will really appreciate it if this is fixed.

@hipe
Copy link

hipe commented Jun 5, 2019

Sorry to be like this, but i've stepped thru the code in the debugger w/ step debugging, and the whole thing needs a rewrite. the parsing algorithm is fundamentally flawed and it needs an overhaul, if not a parser-generator. Pretty much every line i've stepped over has something .. not-best-practicey about it. I could elaborate 🤪

Specifically the part about how multiline strings are parsed is like, a mostly nonworking hack. Every couple of lines I step over I can come up with another edge case for a multi-line string that trips up the parser. like, parsing a multi-line literal string (''', in contrast to a multi-line basic( """)) is supposed to be easy because there's no escaping you have to scan for at all. you just have to keep reading until you find the first closing ('''). This is not how it works currently and this is why there are so many bug reports around multi-line strings.

I'm exploring various options now, including trying to get C-bindings to the (unofficially authoritative) Go library .. that could be called from python.

I'd be curious to hear others' thoughts of course!

Also i find the lack of unit tests troubling. Yes there's the official test suite, but that doesn't cover all these edge cases that people are running in to. When you fix it for these edge cases, there should always be a new test case covering this fix. There isn't

@vors
Copy link
Author

vors commented Jun 5, 2019

@hipe thank you for taking a look! Out of curiosity, is go parser more authoritative than the rust one?

@hipe
Copy link

hipe commented Jun 6, 2019

@vors thanks for your response! Offhand i don't know but i (i as in me but also i as in whoever) would certainly look at the rust one too when shopping around for possibilities for a python-go or python-rust interoperability for a possible new python package to work on.

Also i would give the rust [crate ?] a close look too when I'm looking for test cases to put in a test suite with better coverage to compliment the "official" (but apparently anemic) test suite.

My workaround for my current project ..

(i have a bug where, with this ☝️ python implementation of toml, multiline strings that have URL's with html anchors in them '#' they GET REMOVED LIKE THEY'RE COMMENTS LOL 😭 (but only if the number of single quotes that appear before that '#' IS ODD 💀))

..my workaround for my current projects might be to migrate from toml to this new "eno"

https://news.ycombinator.com/item?id=17765426

because it sounds like a better fit for my use case anyway. But there is still this place in my heart for toml and also ☝️ this toml library should "just work" because of how supposedly simple it is and because this module ships in the python standard distribution, right?

more later hopefully..

@vors
Copy link
Author

vors commented Jun 6, 2019

this module ships in the python standard distribution, right?

I don't think so - I got it from pip.

@uiri
Copy link
Owner

uiri commented Jun 7, 2019

Hi Sergei -- thank you for reporting this issue! Unfortunately, I have not yet had the chance to look in depth at fixing it.

I am going to add a bit of context for Chip as to why the string parsing in particular is so fucked up and the sorry state of this repository.

@hipe This codebase is almost as old as the TOML standard itself. I originally wrote it over 6 years ago after seeing the spec posted on Hacker News. The format at that time was much, much simpler. For example, there was one type of string (not 4!) and there were no inline tables to create mutual nesting of arrays and tables. Since I was so quick to implement it, I was able to snag the toml name in Pip.

The multiline string and literal string stuff was added in v0.3.0, around a year and a half after I had implemented the original parser. So the code suffers from the standard problem of scope creep plus a lack of foresight as to how the spec would expand in the future when the codebase was first written.

Beyond the scope creep issue, I have been the primary maintainer of the codebase since I wrote it. This project has never been my day job. I've had less and less spare time to devote to it as I've gone from high school senior to undergrad student to software professional. The primary mode of development is drive-by patches, whether written by someone else or by myself.

I have been meaning to go back and add test cases to the test suite to cover all the past issues that I've patched, but it is difficult to justify prioritizing that over fixing new and existing issues as they are reported. And by "meaning to", I mean this is something that has been in the back of my mind as a TODO for months (years?) at this point. If you want to take a stab at it, by all means, please do. Patches welcome 😃 That applies to both this repo and the test suite repo.

@hipe
Copy link

hipe commented Jun 7, 2019

Thank you @uiri! I feel your pain 🙏 and I hope I can help out eventually 🤗

i just re-discovered that Pipfiles are in toml, and so we have:

https://github.com/pypa/pip/blob/master/src/pip/_vendor/pytoml/parser.py

(and pip certainly isn't going anywhere!)

and so it would be a fun compare-and-contrast to see if one can hackishly reach that module and see how it does against our imaginary test suite i keep talking about.

but when you look for a "pytoml" library proper, it says it's deprecated and points us back to the topic project https://github.com/avakar/pytoml

more later!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
component: decoder Related to parsing in `toml.load` syntax: strings Related to string literals type: bug A confirmed bug or unintended behavior
Projects
None yet
Development

No branches or pull requests

5 participants