Bug "Found character that cannot start any token" with trailing tab character. #450
Comments
Yes, it's a bug in PyYAML. The reference parser behind ypaste is built for YAML 1.2. If the use case is YAML files maintained in a git repo, for example, it can be a good idea to enforce a good style with yamllint.
Indeed this would be perfect, but there are a lot of issues to solve both in SnakeYAML and PyYAML (and others) until this becomes reality:
Thank you for confirming! The YAML spec wasn't exactly a model of clarity on this point.
Indeed. A common use for YAML is hand-coded configuration files.
Oh, we don't recommend them either. But as you note, they are hard to spot. And since the YAML specification allows them, we don't expect our downstream Python YAML processing to break if someone accidentally leaves in a tab. (Yes, this happened.) So do you have a timeline for when this could be fixed, or could you point me to the area of the code where I could look into fixing it?
Linting is a good idea, thanks. (Although if the YAML specification were less ambiguous and the parser implementations were more robust, as I'm used to in the XML world, then simply parsing with any parser would be sufficient "linting" to work with other parsers.) Our Python processing is downstream. We would need to do the linting in Java. I did a search and found Java YAML Lint, but I already see that its support for basic UTF-8 encoding is broken, so that doesn't give me much confidence. Is there a more robust YAML linter in Java you would recommend?
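For what it's worth, the blunt kind of pre-check discussed here can be sketched in a few lines. The Python below is only an illustration of the idea, not a replacement for yamllint, and the same scan should be easy to port to Java. The function name `find_tabs` and the "trailing"/"inline" labels are made up for this sketch. It flags every tab, including tabs that are legitimate inside quoted or block scalars, so treat its output as a hint rather than a verdict.

```python
def find_tabs(text):
    """Return (line_number, column, kind) for every tab character in text.

    kind is "trailing" when only spaces/tabs follow the tab on its line,
    otherwise "inline". Line and column numbers are 1-based.
    """
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        # Length of the line once trailing spaces and tabs are removed.
        content_len = len(line.rstrip(" \t"))
        for col, ch in enumerate(line, start=1):
            if ch == "\t":
                kind = "trailing" if col > content_len else "inline"
                hits.append((lineno, col, kind))
    return hits


if __name__ == "__main__":
    sample = "a: 1\t\nb: x\ty\nc: 3\n"  # hypothetical document for the demo
    for lineno, col, kind in find_tabs(sample):
        print(f"line {lineno}, column {col}: {kind} tab")
```

Running something like this before the files are committed would at least catch the accidental trailing tab before it reaches a stricter parser downstream.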
I really appreciate this additional information; it's all news to me. Perhaps I came into the YAML world with overly high expectations, having worked with XML since it came out. But regardless of whether the expectations were too high, in the project I'm working on we absolutely must be sure that our YAML documents don't break downstream when they reach PyYAML. Let me know if you think it would be worth my time to try to find this bug, seeing that I'm not a Python or a YAML expert (yet).
Example of this bug in the wild: docker/compose#5662
This bug may also be causing errors with tabs separating values and comments (allowed by the spec).
Create test.yml with a tab (\t):
Attempt to load test.yml (pyyaml 5.4.1/python3.10)
I'm not sure if this is allowed by the spec, but I also encountered this with a tab in an unquoted string. A simpler example that exhibits the same behavior is: foo: bar→baz
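To make the variants reported in this thread easy to compare, here is a minimal reproduction sketch. The three sample documents are my own stand-ins for the cases described (a trailing tab, a tab before a comment, and a tab inside an unquoted string), not the exact files from the reports, and `try_load` is a hypothetical helper. It only uses `yaml.safe_load` and `yaml.YAMLError` from PyYAML; what it prints will depend on the PyYAML version installed.

```python
try:
    import yaml  # PyYAML (third-party); optional so the script still runs without it
except ImportError:
    yaml = None

# Stand-in documents for the cases described in the thread; "\t" is U+0009.
SAMPLES = {
    "trailing tab": "foo: bar\t\n",
    "tab before comment": "foo: bar\t# note\n",
    "tab inside unquoted string": "foo: bar\tbaz\n",
}


def try_load(text):
    """Parse text with PyYAML and return a one-line summary of the outcome."""
    if yaml is None:
        return "skipped: PyYAML is not installed"
    try:
        return f"ok: {yaml.safe_load(text)!r}"
    except yaml.YAMLError as exc:
        # Collapse the multi-line parser message into a single line.
        return "error: " + " ".join(str(exc).split())


if __name__ == "__main__":
    for label, text in SAMPLES.items():
        print(f"{label}: {try_load(text)}")
```

On the versions mentioned in this thread, I would expect each case to report the "cannot start any token" error, but I have not checked every release.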
I know this is arguably an improper use, but in case it is of interest: this bug also appears when using pyyaml to import a tab-indented JSON file (YAML being a superset of JSON), while a JSON file without indentation loads correctly. Other tools, like ruamel.yaml, work fine in both cases.
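A small sketch of the JSON case, assuming a made-up tab-indented document: the standard-library `json` module accepts tabs as insignificant whitespace, so the same text can be fed to both parsers for comparison. The helper names are hypothetical, and the PyYAML half is guarded so the script still runs when PyYAML is not installed.

```python
import json

# Hypothetical tab-indented JSON document; "\t" is U+0009.
TAB_INDENTED_JSON = '{\n\t"name": "demo",\n\t"ports": [80, 443]\n}\n'


def load_with_json(text):
    """Parse with the standard-library JSON parser, which permits tabs."""
    return json.loads(text)


def load_with_pyyaml(text):
    """Parse with PyYAML if available; return a one-line summary string."""
    try:
        import yaml  # PyYAML (third-party)
    except ImportError:
        return "skipped: PyYAML is not installed"
    try:
        return f"ok: {yaml.safe_load(text)!r}"
    except yaml.YAMLError as exc:
        return "error: " + " ".join(str(exc).split())


if __name__ == "__main__":
    print("json:  ", load_with_json(TAB_INDENTED_JSON))
    print("pyyaml:", load_with_pyyaml(TAB_INDENTED_JSON))
```

If the two lines disagree, the document is valid JSON that this PyYAML version does not accept, which is the situation described above.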
I'm working on a project in which we have to parse our YAML files with several programming languages. We have some YAML 1.1 files that work with SnakeYAML (Java) but break with PyYAML (Python). I've contacted SnakeYAML and their team has informed me that this is a PyYAML bug. Wherever the bug lies, we need to get to the bottom of it so that we can be confident that the YAML parsers we use for all our programming languages follow the YAML specification and can reliably process the same files.
The file causing the problem has a line ending in a tab. Here is a test case, using → to represent the tab character, Unicode U+0009, as in the YAML specification. The actual file contains the U+0009 character.
PyYAML (I tested with v3.13) will produce an error message like this:
I provided an extensive analysis in SnakeYAML Issue 487, finding a section of the YAML 1.1 specification saying that "spaces [which] are used for indentation and separation between tokens" cannot contain a tab character. However, a SnakeYAML developer pointed out that in this case the tab is used neither for indentation nor to separate anything, and so it should be allowed.
Note also that the official YAML reference parser allows this test case, which you can see in YPaste #2085. (I'm not sure, however, whether YPaste is parsing according to YAML 1.1 or YAML 1.2 rules, or whether there would be a difference in this case.)
So both the SnakeYAML development team and the YPaste YAML reference parser seem to indicate that this document should parse correctly, but PyYAML flags it as an error.
Can you confirm that this is indeed a PyYAML bug and provide some timeline for its fix? I would be happy to contribute if it is within my expertise, although I am not yet what I would consider an "expert" in Python or YAML.
The important thing is that we absolutely require consistent handling of YAML documents across Python and Java, so we need to ascertain the source of the problem and get a fix either in PyYAML or SnakeYAML as appropriate. Thanks in advance for your time in looking at this.