Bug "Found character that cannot start any token" with trailing tab character. #450

Open
garretwilson opened this issue Oct 22, 2020 · 6 comments
@garretwilson
garretwilson commented Oct 22, 2020

I'm working on a project in which we have to parse our YAML files with several programming languages. We have some YAML 1.1 files that work with SnakeYAML (Java) but break with PyYAML (Python). I've contacted SnakeYAML and their team has informed me that this is a PyYAML bug. Wherever the bug lies, we need to get to the bottom of it so that we can be confident that the YAML parsers we use for all our programming languages follow the YAML specification and can reliably process the same files.

The file causing the problem has a line ending in a tab. Here is a test case, using → to represent the tab character, Unicode U+0009, as in the YAML specification. The actual file contains the U+0009 character.

test:
  - foo →
  - bar

PyYAML (I tested with v3.13) will produce an error message like this:

Found character '\t' that cannot start any token.
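
The failure can be reproduced from a string, without a file on disk. This is a minimal sketch, assuming PyYAML is installed; `try_load` and `DOC_WITH_TAB` are names made up for illustration, and whether the tab case is rejected may depend on the PyYAML version and loader in use.

```python
import yaml  # PyYAML (third-party); assumed to be installed

# The test case from this report, with the tab written as an escape.
DOC_WITH_TAB = "test:\n  - foo \t\n  - bar\n"


def try_load(text):
    """Return ("ok", data) or ("error", message) for a YAML string."""
    try:
        return "ok", yaml.safe_load(text)
    except yaml.YAMLError as exc:
        # ScannerError is a subclass of YAMLError.
        return "error", str(exc)


if __name__ == "__main__":
    # With the trailing tab (the case reported here).
    print(try_load(DOC_WITH_TAB))
    # Same document with the tab removed, for comparison.
    print(try_load(DOC_WITH_TAB.replace("\t", "")))
```

Returning a status instead of raising makes it easy to run the same document through several PyYAML versions and compare the results.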

I provided an extensive analysis in SnakeYAML Issue 487, finding a section of the YAML 1.1 specification saying that "spaces [which] are used for indentation and separation between tokens" cannot contain a tab character. However, a SnakeYAML developer pointed out that in this case the tab is used neither for indentation nor to separate anything, and so it should be allowed.

Note also that the official YAML reference parser allows this test case, which you can see in YPaste #2085. (I'm not sure however whether YPaste is parsing according to YAML 1.1 or YAML 1.2 rules, or whether there would be a difference in this case.)

So both the SnakeYAML development team and the YPaste YAML reference parser seem to indicate that this document should parse correctly, but PyYAML flags it as an error.

Can you confirm that this is indeed a PyYAML bug and provide some timeline for its fix? I would be happy to contribute if it is within my expertise, although I am not yet what I would consider an "expert" in Python or YAML.

The important thing is that we absolutely require consistent handling of YAML documents across Python and Java, so we need to ascertain the source of the problem and get a fix either in PyYAML or SnakeYAML as appropriate. Thanks in advance for your time in looking at this.

@perlpunk
Member

Yes, it's a bug in PyYAML. The reference parser behind ypaste is built for YAML 1.2.
I'm not sure how easy it would be to fix.
But I would like to point out that trailing tabs like this are not produced by YAML dumpers, so they should appear only in handwritten YAML. Since tabs can be hard to spot, they are not recommended outside of block scalars.

If the use case is YAML files maintained in a git repo, for example, it can be a good idea to enforce a good style with yamllint.
I'm also working on a new tool yamltidy that can automatically remove such things, but it is very new and doesn't have a lot of configuration options yet.
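
For hand-edited files, one stopgap along the same lines is to strip trailing spaces and tabs before parsing. This is a rough sketch using only the standard library; `strip_trailing_whitespace` is a hypothetical helper, not part of yamllint, yamltidy, or PyYAML. It would also alter lines inside block scalars, where trailing whitespace is content, so it only suits files that do not rely on that.

```python
import re

# Runs of spaces/tabs that sit directly before a line break (LF or CRLF)
# or before the end of the text.
_TRAILING_WS = re.compile(r"[ \t]+(?=\r?$)", re.MULTILINE)


def strip_trailing_whitespace(text):
    """Remove trailing spaces and tabs from every line of `text`.

    Caveat: this is a blunt text filter, not a YAML-aware tool. It will
    also strip trailing whitespace inside block scalars.
    """
    return _TRAILING_WS.sub("", text)


if __name__ == "__main__":
    original = "test:\n  - foo \t\n  - bar\n"
    print(repr(strip_trailing_whitespace(original)))
```

The cleaned text can then be handed to `yaml.safe_load` in place of the raw file contents.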

> absolutely require consistent handling of YAML documents across Python and Java

Indeed this would be perfect, but there are a lot of issues to solve both in SnakeYAML and PyYAML (and others) until this becomes reality:
https://matrix.yaml.io/
We hope that more and more authors become aware of the test suite and that matrix, so behaviours get more consistent.

@garretwilson
Author

> Yes, it's a bug in PyYAML.

Thank you for confirming! The YAML spec wasn't exactly a model of clarity on this point.

> But I would like to point out that trailing tabs like this are not produced by YAML dumpers, so they should appear only in handwritten YAML.

Indeed. A common use for YAML is hand-coded configuration files.

> Since tabs can be hard to spot, they are not recommended outside of block scalars.

Oh, we don't recommend them either. But as you note, they are hard to spot. And since the YAML specification allows them, we don't expect our downstream Python YAML processing to break if someone accidentally leaves in a tab. (Yes, this happened.)

So do you have a timeline for when this could be fixed, or could you point me to the area of the code where I could look into fixing it?

> If the use case is YAML files maintained in a git repo, for example, it can be a good idea to enforce a good style with yamllint.

Linting is a good idea, thanks. (Although if the YAML specification were less ambiguous and the parser implementations were more robust, such as I'm used to in the XML world, then simply parsing with any parser would be sufficient "linting" to work with other parsers.)

Our Python processing is downstream. We would need to do the linting in Java. I did a search and found Java YAML Lint, but I already see that its support for basic UTF-8 encoding is broken, so that doesn't give me much confidence.

Is there a more robust YAML linter in Java you would recommend?

> there are a lot of issues to solve both in SnakeYAML and PyYAML (and others) until this becomes reality: https://matrix.yaml.io/

I really appreciate this additional information; it's all news to me. Perhaps I came into the YAML world with expectations that were too high, having worked with XML since it came out. But regardless of whether the expectations were too high, in the project I'm working on we absolutely must be sure that our YAML documents don't break downstream when they reach PyYAML.

Let me know if you think it would be worth my time to try to find this bug, seeing that I'm not a Python or a YAML expert (yet).

@earonesty

example of this bug in the wild: docker/compose#5662

@danielbakken

danielbakken commented May 24, 2023

This bug may also be causing errors with tabs separating values and comments (allowed by the spec).

https://yaml.org/spec/1.2.2/#separation-spaces
> Outside indentation and scalar content, YAML uses white space characters for separation between tokens within a line. Note that such white space may safely include tab characters.

Create test.yml with a tab (\t):

---

some_key:
  - value1\t# comment
  - value2

Attempt to load test.yml (PyYAML 5.4.1, Python 3.10):

>>> import yaml
>>> f = open("test.yml", "r")
>>> yaml.safe_load(f)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python3/dist-packages/yaml/__init__.py", line 162, in safe_load
    return load(stream, SafeLoader)
  File "/usr/lib/python3/dist-packages/yaml/__init__.py", line 114, in load
    return loader.get_single_data()
  File "/usr/lib/python3/dist-packages/yaml/constructor.py", line 49, in get_single_data
    node = self.get_single_node()
  File "/usr/lib/python3/dist-packages/yaml/composer.py", line 36, in get_single_node
    document = self.compose_document()
  File "/usr/lib/python3/dist-packages/yaml/composer.py", line 55, in compose_document
    node = self.compose_node(None, None)
  File "/usr/lib/python3/dist-packages/yaml/composer.py", line 84, in compose_node
    node = self.compose_mapping_node(anchor)
  File "/usr/lib/python3/dist-packages/yaml/composer.py", line 133, in compose_mapping_node
    item_value = self.compose_node(node, item_key)
  File "/usr/lib/python3/dist-packages/yaml/composer.py", line 82, in compose_node
    node = self.compose_sequence_node(anchor)
  File "/usr/lib/python3/dist-packages/yaml/composer.py", line 110, in compose_sequence_node
    while not self.check_event(SequenceEndEvent):
  File "/usr/lib/python3/dist-packages/yaml/parser.py", line 98, in check_event
    self.current_event = self.state()
  File "/usr/lib/python3/dist-packages/yaml/parser.py", line 379, in parse_block_sequence_first_entry
    return self.parse_block_sequence_entry()
  File "/usr/lib/python3/dist-packages/yaml/parser.py", line 384, in parse_block_sequence_entry
    if not self.check_token(BlockEntryToken, BlockEndToken):
  File "/usr/lib/python3/dist-packages/yaml/scanner.py", line 116, in check_token
    self.fetch_more_tokens()
  File "/usr/lib/python3/dist-packages/yaml/scanner.py", line 258, in fetch_more_tokens
    raise ScannerError("while scanning for the next token", None,
yaml.scanner.ScannerError: while scanning for the next token
found character '\t' that cannot start any token
  in "test.yml", line 4, column 11
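
To find such tabs before a file reaches PyYAML, a small standard-library check can report where each tab sits. This is only a sketch: `find_tabs` and `TEST_YML` are hypothetical names, the function flags every tab regardless of context (including ones inside block scalars), and it assumes LF or CRLF line endings.

```python
def find_tabs(text):
    """Return a list of (line, column) pairs, 1-based, for every tab in `text`."""
    positions = []
    # Splitting on "\n" keeps any "\r" at the end of a line, which does not
    # shift the column of a tab that precedes it.
    for line_no, line in enumerate(text.split("\n"), start=1):
        for col_no, ch in enumerate(line, start=1):
            if ch == "\t":
                positions.append((line_no, col_no))
    return positions


# Same content as test.yml above, with a real tab before the comment.
TEST_YML = "---\n\nsome_key:\n  - value1\t# comment\n  - value2\n"

if __name__ == "__main__":
    for line_no, col_no in find_tabs(TEST_YML):
        print(f"tab at line {line_no}, column {col_no}")
```

For the test.yml content above, the reported position matches the line and column in the ScannerError message.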

@michaelmior

michaelmior commented Aug 17, 2023

I'm not sure if this is allowed by the spec, but I also encountered this with a tab in an unquoted string.

A simpler example that exhibits the same behavior is:

foo: bar→baz
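
Where the tab is meant to be part of the value, one way to sidestep the question is a double-quoted scalar with the `\t` escape, which PyYAML reads as a tab. A small sketch, assuming PyYAML is installed; the variable names are only for illustration.

```python
import yaml  # PyYAML (third-party); assumed to be installed

# YAML source text: foo: "bar\tbaz"  (backslash + t inside double quotes)
QUOTED_DOC = 'foo: "bar\\tbaz"\n'

loaded = yaml.safe_load(QUOTED_DOC)

# Dumping a value that contains a tab and loading it back should give the
# same value; the exact text the dumper emits is not assumed here.
round_tripped = yaml.safe_load(yaml.safe_dump({"foo": "bar\tbaz"}))

if __name__ == "__main__":
    print(repr(loaded["foo"]))
    print(round_tripped == loaded)
```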

@pgp

pgp commented Sep 6, 2023

I know this is an improper use, but in case it is of interest: this bug also appears when using PyYAML to load a tab-indented JSON file (YAML being a superset of JSON), whereas a JSON file without indentation loads correctly. Other tools, such as ruamel.yaml, work fine in both cases.
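
If the input is known to be JSON, one workaround is to normalize it with the standard `json` module before it goes anywhere near a YAML parser. A minimal sketch; `to_single_line_json` is a hypothetical helper and the sample data is made up.

```python
import json


def to_single_line_json(text):
    """Re-serialize JSON text on one line, without indentation.

    Raw tabs can only appear in JSON as whitespace between tokens (tabs
    inside strings must be escaped), so the output contains no tab characters.
    """
    return json.dumps(json.loads(text))


# A tab-indented JSON document, like the kind of file described above.
TAB_INDENTED = json.dumps({"test": ["foo", "bar"]}, indent="\t")

if __name__ == "__main__":
    print(repr(TAB_INDENTED))
    print(to_single_line_json(TAB_INDENTED))
```

The single-line form is the "JSON file without indentation" case, which is reported above to load correctly.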
