quoting strings composed of digits #98

Open
drtconway opened this issue Nov 20, 2017 · 31 comments

@drtconway

Hi,
Consider the following program:

import sys
import yaml

w = {'userName': 'scientist ', 'userEmail': '[email protected]', 'sampleName': '08063075', 'fastqDir': '/foo/bar/baz/171108_M00139_0253_000000000-BHFD5/ProjectFolders/Project_Qux-Wombat-/Sample_08063075', 'analysis': 'Somatic_Run375_MY_DY_08112017', 'laneNo': 1, 'panel': 'Somatic_Panel_Manifest_v4_23_10_2015'}

yaml.dump(w, sys.stdout, default_flow_style=False)

The value for the key 'sampleName' is composed of digits, but is a string. When run the program produces the following output:

$ python quoting.py
analysis: Somatic_Run375_MY_DY_08112017
fastqDir: /foo/bar/baz/171108_M00139_0253_000000000-BHFD5/ProjectFolders/Project_Qux-Wombat-/Sample_08063075
laneNo: 1
panel: Somatic_Panel_Manifest_v4_23_10_2015
sampleName: 08063075
userEmail: [email protected]
userName: 'scientist '

This leads to problems downstream because the sequence of digits gets interpreted as an integer with the effect that when reserialized, the leading zero is lost.

Is this actually a bug? If not, how do I arrange for the value to be quoted?

Version information:

$ pip2 show pyyaml
Name: PyYAML
Version: 3.12
Summary: YAML parser and emitter for Python
Home-page: http://pyyaml.org/wiki/PyYAML
Author: Kirill Simonov
Author-email: [email protected]
License: MIT
Location: /Users/conwaythomas/Library/Python/2.7/lib/python/site-packages
Requires: 
$ python --version
Python 2.7.10
$ uname -a
Darwin MA41192 16.7.0 Darwin Kernel Version 16.7.0: Wed Oct  4 00:17:00 PDT 2017; root:xnu-3789.71.6~1/RELEASE_X86_64 x86_64
@BigJMoney

I'm not at my desk to attempt, but use yaml.load() on a document that is formatted as you desire with an example int-string inside single quotes. Then view the python object to see how yaml formats it. That should be your answer. I.E. reverse engineer what you're trying to do.

@jk2l

jk2l commented Mar 5, 2018

nope not going to work


>>> yaml_dump = """
... testing:
...   testing: "033677994240"
... """
>>> yaml.load(yaml_dump)
{'testing': {'testing': '033677994240'}}
>>> yaml.dump(yaml.load(yaml_dump))
'testing: {testing: 033677994240}\n'

@justin8

justin8 commented May 10, 2018

it seems to convert all strings starting with a zero to int, but other ints stored as a string it keeps the quote marks around it. This is quite annoying

@perlpunk
Member

perlpunk commented May 10, 2018

The plain scalar 08063075 should not be interpreted as an integer because the specification for integers does not allow leading zeros:
http://yaml.org/type/int.html (Spec for YAML 1.1 int)
http://yaml.org/spec/1.2/spec.html#id2805071 (10.3.2. Tag Resolution for YAML 1.2 Core Schema)

A leading zero in YAML 1.1 indicates an octal number, which can only have digits 0-7, that's why the following strings are dumped differently:

import sys
import yaml

data = [ '012', '08' ]

yaml.dump(data, sys.stdout)

# Output:
# ['012', 08]

So it's safe to dump 08 without quotes, because it can be neither a base-10 int nor a base-8 one.

The bug seems to be in the library that interprets it as an integer.
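
To make the octal rule concrete, here is a minimal sketch that uses only Python's `re` module. The pattern is modelled on the YAML 1.1 int type (binary, octal, decimal, hex, sexagesimal) and is an approximation for illustration; the name `is_yaml11_int` is made up and is not part of PyYAML.

```python
import re

# Pattern modelled on the YAML 1.1 int type; an approximation for illustration.
YAML11_INT = re.compile(
    r"""^(?:[-+]?0b[0-1_]+                   # base 2
        |[-+]?0[0-7_]+                       # base 8 (leading zero)
        |[-+]?(?:0|[1-9][0-9_]*)             # base 10 (no leading zero)
        |[-+]?0x[0-9a-fA-F_]+                # base 16
        |[-+]?[1-9][0-9_]*(?::[0-5]?[0-9])+  # base 60
        )$""",
    re.VERBOSE,
)


def is_yaml11_int(text):
    """Return True if a plain scalar would resolve to an int under YAML 1.1."""
    return YAML11_INT.match(text) is not None


for scalar in ["012", "08", "08063075", "123"]:
    print(scalar, is_yaml11_int(scalar))
```

Under this pattern `012` and `123` count as ints, so a string with that content has to be quoted on output, while `08` and `08063075` match no branch and can stay plain.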

@perlpunk
Member

If I read the related issues cloudtools/troposphere#994 and awslabs/aws-cfn-template-flip#41 correctly, the issues have been resolved there, so I guess this can be closed.

@justin8

justin8 commented May 11, 2018

The problem is, a leading 0 without quotes indicates octal. But PyYAML is converting the string "012345" to an octal, even though it was a string.

In order to get around this I had to make a custom dumper and specifically not do anything with strings starting with a zero. Similar to https://github.com/awslabs/aws-cfn-template-flip/pull/43/files

@perlpunk
Member

perlpunk commented May 11, 2018

@justin8 if the string "012345" is dumped without quotes, then this is indeed a bug.
In this thread and the related issues I only saw examples including digits greater than 7.

Can you please post the output of the following script?

import sys
import yaml

data = [ '012345', '012348' ]

print(sys.version_info)
print(yaml.__version__)
dump = yaml.dump(data)
print(dump)
print(yaml.load(dump))

I get:

sys.version_info(major=2, minor=7, micro=14, releaselevel='final', serial=0)
3.12
['012345', 012348]
['012345', '012348']

As you can see, the quotes are omitted only if it can't be interpreted as an octal because 8 is not a valid digit for an octal number, as I explained above.
When loading again, both items get loaded as strings.
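
The round trip can also be checked programmatically. This is a small sketch assuming PyYAML is installed; it uses `safe_dump`/`safe_load` instead of the plain `dump`/`load` from the script above, which should not change how strings are resolved.

```python
import yaml

data = ["012345", "012348"]

# Dump and load again with the same library; both items should come back as str.
dumped = yaml.safe_dump(data)
reloaded = yaml.safe_load(dumped)

print(dumped)
print(reloaded == data)
```

The leading zeros only get lost when a different loader, one that treats `012348` as a number, reads the dumped text.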

@justin8

justin8 commented May 11, 2018

I would have to check on my other computer if my sample data had 8s or 9s in them. It may well have.

But how would I store a number that is an identifier? In this case it was AWS account IDs, which can start with zero, but converting to octal or an int is not desirable since the leading zero is a part of the descriptor; and hence why it is being stored as a string, until PyYAML decides to convert it to a different format.

@perlpunk
Member

Actually, looking at the regex for int/float in YAML 1.2, it allows leading zeros.
http://yaml.org/spec/1.2/spec.html#id2805071 10.3.2. Tag Resolution

Sorry I was forgetting about these.

PyYAML only implements YAML 1.1, so it's behaving correctly.
It would be nice to implement the 1.2 Schemas at some point, but this is a bigger task.

Since 1.2 allows leading zeros, it would maybe be a good idea for PyYAML to quote such strings, as it wouldn't break anything and would still be 1.1 compatible.
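
As a rough sketch of what "quote such strings" could key on: the YAML 1.2 Core Schema lists `[-+]? [0-9]+` as its base-10 integer pattern, so leading zeros are accepted. The helper name `needs_quotes_for_yaml12` is hypothetical, and it only covers integers.

```python
import re

# Base-10, octal and hex integer patterns as listed for the YAML 1.2 Core Schema.
YAML12_INT = re.compile(r"^(?:[-+]?[0-9]+|0o[0-7]+|0x[0-9a-fA-F]+)$")


def needs_quotes_for_yaml12(text):
    """Return True if a string would be read as an int by a YAML 1.2 loader."""
    return YAML12_INT.match(text) is not None


for value in ["08063075", "09", "0o17", "Sample_08063075"]:
    print(value, needs_quotes_for_yaml12(value))
```

A dumper that quoted every string for which this returns True would still emit valid YAML 1.1, since quoting a string is always allowed.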

@justin8

justin8 commented May 11, 2018

Ah, this would explain it. At least it's fixed in the next version of the spec.

@DVGeoffrey

Any word on this? The 1.2 spec is quite old. If we're not going to upgrade, then we can at least be forwards-compatible right?

@ingydotnet
Member

We should in general be moving pyyaml and libyaml closer towards 1.2 (in ways that don't conflict with 1.3 plans).

We should probably move forward with a fix here. Any takers? :)

Don't forget we'll need to make sure libyaml is in sync.

@maikelvl

maikelvl commented Sep 5, 2018

I also found a bug that is sort of related to this.

Successfully installed pyyaml-3.13
/ # python
Python 3.7.0 (default, Sep  5 2018, 03:33:35)
[GCC 6.4.0] on linux
>>> import yaml
>>> print(yaml.dump({
...     'int_123': 123,
...     'int_123e1': 123e1,
...     'str_123': '123',
...     'str_123e1': '123e1',
... }, default_flow_style=False))

output:

int_123: 123
int_123e1: 1230.0
str_123: '123'
str_123e1: 123e1

The value of that last one should be in quotes because it is converted into an int/float. That did not happen to the '123' string.
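
A possible explanation, sketched with two regular expressions. The first is a simplified YAML 1.1-style float pattern (it leaves out `.inf`, `.nan`, sexagesimal and leading-dot forms) and is assumed here to reflect how a 1.1 dumper decides; the second follows the YAML 1.2 Core Schema float pattern.

```python
import re

# Simplified YAML 1.1-style float: a decimal point is required and the
# exponent must carry an explicit sign.
YAML11_FLOAT = re.compile(r"^[-+]?[0-9][0-9_]*\.[0-9_]*(?:[eE][-+][0-9]+)?$")

# YAML 1.2 Core Schema float: decimal point and exponent sign are optional.
YAML12_FLOAT = re.compile(
    r"^[-+]?(?:\.[0-9]+|[0-9]+(?:\.[0-9]*)?)(?:[eE][-+]?[0-9]+)?$"
)

for text in ["123e1", "1.23e+3"]:
    print(text, bool(YAML11_FLOAT.match(text)), bool(YAML12_FLOAT.match(text)))
```

Under these patterns `123e1` is not a float in the 1.1 sense, so a 1.1 dumper sees no need to quote it, while a 1.2 loader would read it as a number.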

@bxnxiong

bxnxiong commented Nov 7, 2018

I managed to get around this issue by following @justin8 's PR: https://github.com/awslabs/aws-cfn-template-flip/pull/43/files

import six
import yaml
from yaml import Dumper

TAG_STR = "tag:yaml.org,2002:str"


def string_representer(dumper, value):
    # Force single quotes around any string that starts with a zero.
    if value.startswith("0"):
        return dumper.represent_scalar(TAG_STR, value, style="'")
    return dumper.represent_scalar(TAG_STR, value)


# six.text_type is unicode on Python 2 and str on Python 3.
Dumper.add_representer(six.text_type, string_representer)

with open('output1.yaml', 'w') as outfile:
    yaml.dump({'1':'001','2':'008','3':'009'}, outfile, default_flow_style=True)
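
A variation on the same idea, as a sketch for Python 3 with PyYAML installed: register the representer on a `SafeDumper` subclass so the global `Dumper` is left untouched, and quote every string made up only of ASCII digits rather than only those starting with `0`. The names `QuotingDumper` and `represent_str` are made up for this example.

```python
import re

import yaml

DIGITS_ONLY = re.compile(r"[0-9]+")


class QuotingDumper(yaml.SafeDumper):
    """Hypothetical dumper subclass; representers added here stay local to it."""


def represent_str(dumper, value):
    # Single-quote strings that consist solely of digits; leave the rest alone.
    if DIGITS_ONLY.fullmatch(value):
        return dumper.represent_scalar("tag:yaml.org,2002:str", value, style="'")
    return dumper.represent_scalar("tag:yaml.org,2002:str", value)


QuotingDumper.add_representer(str, represent_str)

text = yaml.dump(
    {"sampleName": "08063075", "laneNo": 1},
    Dumper=QuotingDumper,
    default_flow_style=False,
)
print(text)
```

Real integers such as `laneNo` are not affected, because the representer is registered for `str` only.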

Thanks @justin8 !

@justin8

justin8 commented Nov 8, 2018

@bxnxiong I forgot about this thread.

I've actually found a much nicer way to get around this. The answer was to stop using pyyaml. The below code snippets will work exactly like pyyaml, but with the YAML 1.2 spec, so you don't have to mess with this stuff:
pip install ruamel.yaml

from ruamel.yaml import YAML
yaml=YAML()

with open("foo.yml") as f:
    q = yaml.load(f)

There's no loads/load difference: it reads strings or file-like objects with the same function, but otherwise behaves almost identically, aside from complying with the modern standard.

@drzraf

drzraf commented Jun 25, 2019

The same applies when generating object keys:
yaml.dump({"374e949253": "foo"}) should output: "374e949253": foo
so that the key is not interpreted as 374 × 10⁹⁴⁹²⁵³.
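
To illustrate what is at stake, assume a loader that resolves the unquoted key as a float and converts it roughly the way Python's `float()` does (an assumption; loaders differ):

```python
import math

key = "374e949253"

# float() accepts the same mantissa/exponent notation; a value this large
# overflows to infinity instead of raising an error.
as_float = float(key)

print(as_float)
print(math.isinf(as_float))
```

In that case the original identifier cannot be recovered from the loaded value at all.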

westonsteimel added a commit to westonsteimel/chalice-shrubbery that referenced this issue Dec 17, 2019
There is an issue with the YAML 1.1 spec which causes some
strings to be interpreted incorrectly as integers
(see yaml/pyyaml#98), so we'll
use the json template to try and prevent this

Signed-off-by: Weston Steimel <[email protected]>
surfkansas pushed a commit to surfkansas/chalice-shrubbery that referenced this issue Dec 17, 2019
There is an issue with the YAML 1.1 spec which causes some
strings to be interpreted incorrectly as integers
(see yaml/pyyaml#98), so we'll
use the json template to try and prevent this

Signed-off-by: Weston Steimel <[email protected]>
@ecmonsen

ecmonsen commented Feb 5, 2020

+1 -- I vote to fix this

Or should I say +"01"

@amandahla

amandahla commented Apr 1, 2020

+1

I had to do this:

conteudo_yaml = yaml.dump(conteudo,default_flow_style=False, sort_keys=False)

conteudo_yaml = conteudo_yaml.replace("'","")

:-|

@ckwillling

This is still a problem (or a bug):

yaml.dump(["018", "07", "08", "010117"], allow_unicode=True,default_flow_style=False, indent=4,sort_keys=False)
"- 018\n- '07'\n- 08\n- '010117'\n"

It's really annoying.

@Frodox

Frodox commented May 19, 2020

same issue...

thx @justin8 ! I had to switch to ruamel also :|

@perlpunk
Member

It's an issue regarding YAML 1.1 and 1.2.
Just adding quoting here and there doesn't fix the general problem.
There must be real support for YAML 1.2.
See https://perlpunk.github.io/yaml-test-schema/schemas.html

But there won't be any support for 1.2 before @ingydotnet gives feedback on #116

@perlpunk
Member

perlpunk commented May 19, 2020

Btw, I just tested ruamel.yaml (0.16.10), and it resolved the following items as numbers/timestamps/etc., although that doesn't match the spec. They should all be loaded as strings when using the default loader:

- 85_230.15  # no underscores in numbers anymore
- 100_200    # no underscores in numbers anymore
- 0b100      # no numbers base 2 any more
- -0x30      # no negative hexadecimal numbers
- 2020-05-20 # no timestamp anymore by default
- <<         # no merge keys by default

Edit: I tested with this script: https://github.com/yaml/yaml-runtimes/blob/master/docker/python/testers/py-ruamel-py
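
A rough way to check that list against the spec, using only `re`: the patterns below follow the tag resolution table of the YAML 1.2 Core Schema. `resolve_core` is a hypothetical helper, not part of PyYAML or ruamel.yaml.

```python
import re

# Patterns following the tag resolution table of the YAML 1.2 Core Schema.
CORE_SCHEMA = [
    ("null", re.compile(r"^(?:null|Null|NULL|~|)$")),
    ("bool", re.compile(r"^(?:true|True|TRUE|false|False|FALSE)$")),
    ("int", re.compile(r"^(?:[-+]?[0-9]+|0o[0-7]+|0x[0-9a-fA-F]+)$")),
    ("float", re.compile(
        r"^(?:[-+]?(?:\.[0-9]+|[0-9]+(?:\.[0-9]*)?)(?:[eE][-+]?[0-9]+)?"
        r"|[-+]?\.(?:inf|Inf|INF)|\.(?:nan|NaN|NAN))$")),
]


def resolve_core(text):
    """Return the first matching type name for a plain scalar, else 'str'."""
    for name, pattern in CORE_SCHEMA:
        if pattern.match(text):
            return name
    return "str"


for item in ["85_230.15", "100_200", "0b100", "-0x30", "2020-05-20", "<<"]:
    print(repr(item), resolve_core(item))
```

None of the six items matches a null, bool, int or float pattern here, so each falls through to `str`.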

@TheSantacloud

+1

@Ati59

Ati59 commented Oct 19, 2020

+1

@andreamaruccia

+1

@Yop-La

Yop-La commented Feb 28, 2022

You can use ruamel.yaml instead

it is better documented than pyyaml: https://pypi.org/project/ruamel.yaml/

And with this library, you can at least dump digits without quotes and with no extra configuration

@perlpunk
Member

perlpunk commented Feb 28, 2022

You can use ruamel.yaml instead

Except that ruamel.yaml is wrong in some other cases (tested with 0.17.20) as I wrote here:
#98 (comment)

@igorsantos07

igorsantos07 commented Mar 24, 2024

Just landed here and was quite surprised that such a problem has never been solved by the "most common YAML implementation in Python".

My scenario is not about octal numbers (not mentioned in the title, but later discovered by OP): I need to force an integer to be represented as a string (simply because the application reading the final YAML requires it, ahem, Docker Compose's interpretation of group_add group IDs), and I found no way to do that. There are tags to force types when reading YAML, but nothing for the other way around. Yikes.

The solution found was to work around that with a "proprietary" tag syntax based off this previous comment.

Not to mention the documentation is, sadly, horribly written and looks very dated - there are no argument descriptions for dump(), for instance, where I would expect to solve my problems............. 👀
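
For the force-a-string case, one alternative to a custom tag syntax is a marker type with its own representer. This is only a sketch assuming PyYAML is installed; `Quoted`, `ComposeDumper` and `represent_quoted` are invented names, and the `group_add` document is a made-up example.

```python
import yaml


class Quoted(str):
    """Hypothetical marker type: a string that is always dumped in quotes."""


class ComposeDumper(yaml.SafeDumper):
    """Hypothetical dumper subclass so the stock SafeDumper stays unchanged."""


def represent_quoted(dumper, value):
    # Emit the value as a double-quoted string scalar.
    return dumper.represent_scalar("tag:yaml.org,2002:str", str(value), style='"')


ComposeDumper.add_representer(Quoted, represent_quoted)

doc = {"group_add": [Quoted(1000), "video"]}
text = yaml.dump(doc, Dumper=ComposeDumper, default_flow_style=False)
print(text)
```

Only values wrapped in `Quoted` are affected, so the rest of the document keeps the dumper's normal quoting decisions.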

@perlpunk
Member

perlpunk commented May 1, 2024

You can use https://pypi.org/project/yamlcore/ on top of PyYAML for YAML 1.2 support:

>>> from yamlcore import CoreDumper
>>> yaml.dump(w, sys.stdout, Dumper=CoreDumper)
analysis: Somatic_Run375_MY_DY_08112017
fastqDir: /foo/bar/baz/171108_M00139_0253_000000000-BHFD5/ProjectFolders/Project_Qux-Wombat-/Sample_08063075
laneNo: 1
panel: Somatic_Panel_Manifest_v4_23_10_2015
sampleName: '08063075'
userEmail: [email protected]
userName: 'scientist '

@ericabrauer

For anyone else having this issue - I submitted the following PR.

#812

I'm not sure I fully understand why it can't be fixed, since some ints with leading zeroes load and dump fine and others do not (see attached test-yaml-dump-str.txt); it seems like it fails if the number contains or ends in 8 or 9.

If it doesn't get fixed and anyone else is looking for an easy solution, this is what's working for me for now.

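import re
import yaml

# Make strings that begin with "0" followed by at least one more digit resolve
# to a non-str tag, so the dumper no longer emits them as plain scalars.
yaml.SafeDumper.add_implicit_resolver(
    "tag:yaml.org,2002:object",
    re.compile(r"\d{2,}"),
    first=list("0")
)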

@perlpunk
Member

There is nothing to be fixed, as it ain't broken.
07 is an octal number in YAML 1.1. 09 is just a string in YAML 1.1.
PyYAML implements YAML 1.1.

If for some reason you need to have quotes around a 09 because otherwise your other YAML loader would read it as an integer, then your other YAML loader has a bug or it possibly implements YAML 1.2, where 09 is indeed an integer.
In that case just use a dumper that implements YAML 1.2, e.g. https://pypi.org/project/yamlcore/

>>> import yaml
>>> import yamlcore
>>> d = {'string': '09' }
>>> y = yaml.dump(d, Dumper=yamlcore.CoreDumper)
>>> print(y)
string: '09'

If your other loader has a bug, then monkeypatching the PyYAML dumper like in #98 (comment) is probably the best idea.
