Stable identifiers for examples #763
Comments
Solution to link issue: always link to a specific version of the spec. Solution to classification issue: use the sha1 hash of the example text as an identifier rather than the example number. |
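A minimal sketch of the second suggestion, assuming the identifier is derived from the Markdown half of an example only and truncated for readability. The function name `example_id` and the 10-character length are hypothetical choices, not anything the spec tooling defines.

```python
import hashlib


def example_id(markdown: str, length: int = 10) -> str:
    """Return a short hex identifier derived from an example's Markdown source.

    `length` is an arbitrary illustration value; shorter prefixes collide more easily.
    """
    digest = hashlib.sha1(markdown.encode("utf-8")).hexdigest()
    return digest[:length]


# The same text always yields the same identifier, whatever its position in the spec.
same = example_id("# foo\n") == example_id("# foo\n")
# Any edit to the text, even one added space, yields a different identifier.
changed = example_id("# foo\n") == example_id("# foo \n")
print(same, changed)
```

The identifier survives renumbering but, being one-way, it does not by itself tell you where in the spec the example lives.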
Not a very user-friendly solution. I have various links to the standard in the API docs, and you don't really want to link to different versions of the standard in the docs; it's confusing.
Again, I'd like to link back to the specification. Hashes are one-way functions ;-) |
I agree that having example numbers change between versions is a bit annoying, but I also have no good solution to this. For links, you could try using text fragment links. |
That looks quite brittle, and again, linking to different versions of the specification is not user-friendly. You lure users into reading outdated specification material. CommonMark is already confusing enough to understand :-) Honestly, now that the specification is mostly stable, I don't think it would be such a chore to maintain alphanumerical identifiers manually along the scheme I suggested above. I'm not sure anyone really cares about this being a linear sequence of numbers. That would entail finding a way to specify the id whenever you define an example. (Bonus point: you can now cross-reference examples from the specification text itself if you need to.) |
You can also solve most of this by changing how you think about CommonMark: it’s a living standard that gets commits. Same with HTML or Unicode or whatnot: you follow the latest version and link to the text. The numbers of versions here are meaningless anyway. Because the grammar of markdown (and HTML) is
Practically, you can improve this when crawling the spec for test cases, by ignoring the numbers, and looking at headings and then the relative number of each test case in them. You could also use an MD5 or so hash of the markdown.
If we do anything here, I recommend using unique IDs that are not ordered but instead describe the test case. Or auto-generated MD5 hashes. And/or stop using versions (no problem with date snapshots) |
I don't care about changes to the spec, I care about being able to point people to the right information bits as the specification changes without losing them in outdated information. And that:
won't help at all I'm afraid.
That's a curious point of view. Versions are precisely meaningful since they entail semantic changes on how your markdown is going to be interpreted. |
I think you describe Git hashes / snapshots of Git hashes more than a particular number. Can you elaborate on what you mean by “semantic changes”? Do you mean the text that is in commits/PRs?
Perhaps you’d be interested in reading more about it: http://trevorjim.com/a-specification-for-markdown/. See also the “With HTML5” link there. And, https://xkcd.com/1172/.
Even if we used IDs, tests change, they’re removed or split up. Characters are added/removed. Are those the same test? How much difference is a new test case?
I think @jgm correctly inferred that you’re asking about two problems. You are responding about the first here, where John answered the second. You can hash each example in each snapshot of the spec. Then you know whether an exact example existed in a different snapshot? From what I can see, you have only shown the case of the hard to maintain exception list? |
I know you very much like the trope that the BNF of markdown is
As long as you keep the spirit of the test (what it conceptually shows) yes.
Your HTML renderings are going to change from one version to another. A CommonMark implementation abiding by Anyways, I don't want to spend too much energy trying to convince people about the value of stable identifiers across specification versions. @jgm just tell me whether I will have to cope with this state of affairs or if you are willing to try to do something about it. Otherwise let's just close this and move on. |
This discussion touches on versioning and on how markdown works.
HTML being a living standard or JavaScript using yearly snapshots does not make them moot or not-semantic. I have not received this question over the years.
I understand open source as discussing problems and consensus seeking. I don’t understand why jgm’s solution 2. for classification is not acceptable. |
I don't see a very good alternative, yet. I'm not interested in spending my time going through and creating unique ids for hundreds of examples. |
By the way, you could make things work with hashes, with a bit of scripting. You just need a script that goes through the latest spec and associates hashes of examples with their numbers. Then you can map your hashes to example numbers. |
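A rough sketch of such a script, under the assumption that examples in spec.txt open with a line of 32 backticks followed by ` example`, separate Markdown from HTML with a line containing only `.`, and close with a line of 32 backticks. The helper names (`extract_examples`, `remap`) and the tiny in-memory specs are made up for illustration.

```python
import hashlib

FENCE = "`" * 32  # assumed fence length around examples in spec.txt


def extract_examples(spec_text):
    """Return [(number, markdown)] for each example block, numbered from 1."""
    examples, state, buf = [], "text", []
    for line in spec_text.splitlines():
        if state == "text" and line.startswith(FENCE + " example"):
            state, buf = "markdown", []
        elif state == "markdown" and line == ".":
            state = "html"
        elif state == "markdown":
            buf.append(line)
        elif state == "html" and line == FENCE:
            examples.append((len(examples) + 1, "\n".join(buf) + "\n"))
            state = "text"
    return examples


def digest(markdown):
    return hashlib.sha1(markdown.encode("utf-8")).hexdigest()


def remap(old_numbers, old_spec, new_spec):
    """Map example numbers of old_spec to numbers in new_spec (None if the text is gone)."""
    old = {n: digest(md) for n, md in extract_examples(old_spec)}
    new = {}
    for n, md in extract_examples(new_spec):
        new.setdefault(digest(md), n)  # first occurrence wins if a text repeats
    return {n: new.get(old[n]) for n in old_numbers}


def block(md, html):
    # Hypothetical helper that builds one example block for the demo below.
    return f"{FENCE} example\n{md}.\n{html}{FENCE}\n"


old_spec = "intro\n" + block("a\n", "<p>a</p>\n") + block("b\n", "<p>b</p>\n")
new_spec = block("a\n", "<p>a</p>\n") + block("c\n", "<p>c</p>\n") + block("b\n", "<p>b</p>\n")
print(remap([1, 2], old_spec, new_spec))
```

With this, a list of non-round-tripping examples could be stored as hashes and translated to the current numbers whenever a new spec is published; examples whose text changed come back as `None` and need a manual look.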
Well, actually, I suppose we could simply autogenerate identifiers by hashing the example's contents, instead of using identifiers based on numbers. |
I mentioned both too. What I worry about is that it’s essentially the same as those text fragment URLs, in that any one-character change to an example changes the hash. One more variation is to take the first letter of the first 3 words and the first letters of the last 3 words as a “hash”. It has to be a bit more involved, but something like it is more resistant to changes. |
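To make the trade-off concrete, here is a toy version of that first-letters idea, taken literally. The name `fuzzy_key` is hypothetical, and as noted above a real scheme would have to be more involved, since keys this short collide easily.

```python
def fuzzy_key(markdown: str) -> str:
    """First letters of the first 3 and last 3 whitespace-separated words."""
    words = markdown.split()
    return "".join(word[0] for word in words[:3] + words[-3:])


original = fuzzy_key("The quick brown fox jumps over the lazy dog")
middle_edit = fuzzy_key("The quick brown cat jumps over the lazy dog")
tail_edit = fuzzy_key("The quick brown fox jumps over the lazy cat")
print(original, middle_edit, tail_edit)
```

An edit in the middle of the text keeps the key, while an edit that changes the initial of one of the first or last three words changes it, so stability depends on where the change lands.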
I think changing with any one character change is not a bad thing. Currently, one has no guarantees of link stability across versions. With this change, one would have a guarantee that links would point to the same examples, unless the examples have changed. That doesn't seem so bad. Presumably you don't want to link to an example whose content might have changed? |
It’s unclear to me whether the OP deems it acceptable. One more alternative that would also solve this: add a sort of “message” to each test case that can be used by parsers for assertions: assert(fn(markdown), html, message). Either is fine with me! |
Yes, unfortunately quite a bit of work, with hundreds of examples. |
Let's be pragmatic. I think we agree the specification is not going to change much at this point. Here's a script that numbers the current examples by adding their current number in their info string:
cat << 'EOF' > number-examples.sh
#!/bin/bash
ex=0
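# Print the line unchanged unless it opens an example block, in which
# case append the running example number to its info string.
function process_line()
{
  if ! grep -q '``` example' <<< "$1"; then
    printf '%s\n' "$1"
  else
    ((ex++))
    printf '%s %s\n' "$1" "${ex}"
  fi
}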
while IFS='' read -r line || [[ -n "${line}" ]]; do
process_line "${line}"
done < "$1"
EOF
chmod +x number-examples.sh
./number-examples.sh spec.txt
Then it's a matter of tweaking these lines of And then just use the alphanumerical convention when changes are made to the spec. That is, if someone wants to add examples between 45 and 46, these new examples should be manually numbered 45a, 45b, etc. If people are ok with this scheme, I'm willing to do the work so that this is established for future potential versions of the standard. (I don't mind hashes, but when you write comments in code, talk to people, mention example numbers in regression tests, etc., I much prefer to say example 45 or 45a of the specification than a gibberish of hex digits. Also, I don't mind if the content of one example becomes a 404 or changes under the hood now and then, if the example shows the same thing. It's the current number mixup from one version to another that I find annoying to work with as an implementer of the standard.) |
Not sure I agree. |
Same. I personally would not want to maintain a unique list of example IDs. One more alternative that is quick and improves the use case: autogenerated IDs, still numbered, but including the current heading in it, so "atx-heading-6" and such. |
@wooorm - are you proposing that the numbering in the generated IDs starts over for every section? That would make the example ids much more stable, in that changes would only change example ids in the same section. But it would have the drawback that the IDs no longer match the displayed example numbers. Unless the proposal is that we display "Example atx-heading-6" instead of "Example 123" as now...? |
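For reference, section-relative ids could be generated along these lines. This is only a sketch: it assumes sections are ATX headings in spec.txt and that examples are delimited by lines of 32 backticks, and the slug rule (lowercase, runs of non-alphanumerics become `-`) is an arbitrary choice, so it yields `atx-headings-6` rather than `atx-heading-6`.

```python
import re

FENCE = "`" * 32  # assumed fence length around examples in spec.txt


def slug(heading):
    return re.sub(r"[^a-z0-9]+", "-", heading.lower()).strip("-")


def section_ids(spec_text):
    """Return one id per example, numbered from 1 within its enclosing section."""
    ids, section, count, in_example = [], "preamble", 0, False
    for line in spec_text.splitlines():
        if in_example:
            # Headings inside an example body are example content, not sections.
            in_example = line != FENCE
        elif line.startswith(FENCE + " example"):
            in_example = True
            count += 1
            ids.append(f"{section}-{count}")
        elif re.match(r"#{1,6} ", line):
            section, count = slug(line.lstrip("#")), 0
    return ids


def block(md, html):
    # Hypothetical helper that builds one example block for the demo below.
    return f"{FENCE} example\n{md}.\n{html}{FENCE}\n"


ex = block("foo\n", "<p>foo</p>\n")
heading_ex = block("# foo\n", "<h1>foo</h1>\n")
before = section_ids("# Tabs\n" + ex + "## ATX headings\n" + heading_ex + ex)
after = section_ids("# Tabs\n" + ex + ex + "## ATX headings\n" + heading_ex + ex)
print(before, after)
```

Adding an example under "Tabs" leaves every id under "ATX headings" untouched, which is the stability gain; ids within the edited section still shift.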
Right! a) perhaps we don’t need numbers next to examples, HTML doesn’t either, and from a quick scroll through CM, those numbers (because they are unstable) aren’t used in the text. I checked through long headings. They’re not very long so I think it’s fine.
|
A section like 'List items' that I'm perusing now has almost 50 examples. At that point I prefer the current status quo, which will be easier to work around. At least there is a single linear shift to consider, rather than having to consider resets at each new heading. Let's not turn a simple problem into a more complicated one :-) I'm not sure I understand the resistance to the simple solution I propose. I don't think it introduces any kind of daunting overhead for people working on the spec (and it has been shown to work relatively well in other standards). |
You several times now have expressed the spec doesn’t change much and so it’s easy. This thing existed for 10 years. It’ll exist for 10 more. A lot happens in 10 years. For me, it’s that you’re asking to introduce a semver-like versioning scheme, where numbers can’t be reused, with a lot of undecided factors. I see the following unknowns: What if there’s example 50 and 51 already, add 50a? Now what if 50 is removed? What if something is added between 49 and 50a? What if we reorder a section? What if we change an example a little. What if it’s completely rewritten? What if it’s dropped? Where do we document how it works? How to explain it to folks PRing an example?
Why? With this proposal you have relative numbers that at most shift 50.
Which ones? |
I will also add that the scheme is easy even if the spec changes. All the numbers are written in the spec it's just a matter of picking a nearby non-conflicting alphanumerical identifier (and the extracting program can easily check they are unique).
I have already mentioned these things earlier: I don't think the sequentiality matters much to perusers of the spec. You are not supposed to recycle numbers, but then I'm also not asking for a bulletproof formal solution. It's not a drama if one gets reused or if an example changes. The current scheme is annoying because everything changes when there's little change.
It's mentioned in my first message.
Because the problem remains and is even worse to track automatically than the status quo. In any case, there are likely more important issues for the spec than this one. Let's not lose too much time on this. (I remain available to do the work if people want to move to something along the lines I proposed). |
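The uniqueness check mentioned above could look roughly like this. It is a sketch that assumes identifiers are written after `example` in the info string (as the numbering script in this thread would produce) and consist of digits plus an optional lowercase suffix; all names here are hypothetical.

```python
import re
from collections import Counter

FENCE = "`" * 32  # assumed fence length around examples in spec.txt
INFO_RE = re.compile(re.escape(FENCE) + r" example (\S+)")
ID_RE = re.compile(r"(\d+)([a-z]*)")


def extract_ids(spec_text):
    """Collect the identifier written on each example's opening line."""
    return [m.group(1) for line in spec_text.splitlines() if (m := INFO_RE.fullmatch(line))]


def sort_key(example_id):
    """Order ids as 45 < 45a < 45b < 46; shorter suffixes sort first."""
    m = ID_RE.fullmatch(example_id)
    if m is None:
        raise ValueError(f"malformed example id: {example_id!r}")
    return (int(m.group(1)), len(m.group(2)), m.group(2))


def duplicates(ids):
    """Return ids that occur more than once, in spec order."""
    return sorted((i for i, n in Counter(ids).items() if n > 1), key=sort_key)


def block(example_id):
    # Hypothetical helper that builds one identified example for the demo below.
    return f"{FENCE} example {example_id}\nfoo\n.\n<p>foo</p>\n{FENCE}\n"


spec = "".join(block(i) for i in ["45", "45a", "46", "45a"])
print(extract_ids(spec), duplicates(spec and extract_ids(spec)))
print(sorted(["46", "45b", "45", "45a", "9"], key=sort_key))
```

A check like this only catches identifiers reused within one version of the spec; detecting a recycled identifier across versions would need the list of ids from earlier releases as well.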
When a new specification is published, examples get renumbered if a new example is added or if one is suppressed.
This is a bit annoying for referring to examples when the specification evolves, since linking to an example by, say,
printf ("https://spec.commonmark.org/%s/#example-354", version)
is not doable reliably.
For example, in the cmarkit implementation, which includes a layout-preserving CommonMark renderer, I have a classification of those examples of the specification which do not round trip. It's a bit annoying that it is now desynchronized with the latest specification.
I'm not sure I have a good answer for how to provide that, except perhaps to automatically insert the current numbers in the current text and then make sure new additions get an unused identifier (e.g. if an example is inserted between examples 1 and 2 you could simply identify it by 1a) and make sure the identifiers of deleted examples do not get recycled. That's, for example, how the Unicode Text Segmentation standard proceeds when it adds new rules.