Programmatically replace HTML-escaped characters? #314

Closed
bradleeedmondson opened this issue May 12, 2016 · 18 comments
@bradleeedmondson
Contributor

Opening this issue to confirm that @myndzi plans to programmatically replace HTML-escaped characters like &quot;, &apos;, etc., on a later pass (i.e., there is no need to replace these by hand). I believe I recall correctly (though I could always be mistaken) that he planned to do so at some point and waved us off modifying these manually.

If so, however, do we need to worry about straight quotes in ASCII vs. UTF-8 curly quotes? Or is that all taken care of?

@myndzi
Contributor

myndzi commented May 16, 2016

I assume curly quotes will be left curly and straight quotes will be left straight. This presents a problem for matching, but leaving them as they were in the source file helps reduce this for now. Later, they might be added to the synonym/equivalence list to allow them to match each other when building a match...

@bradleeedmondson
Contributor Author

Note that we need to leave greater-than and less-than characters escaped, but can replace quotes, apostrophes, and other characters with their ASCII characters.

@bradleeedmondson
Contributor Author

Brad to find example of UTF-8 character and check schemadev branch to see whether tag-conversion tool has taken care of this already and let Gary know.

@bradleeedmondson bradleeedmondson self-assigned this Jun 8, 2017
@jlovejoy jlovejoy added this to the Immediate Release milestone Jun 8, 2017
@silverhook
Collaborator

Keep in mind that some of the curly and straight quotes seem to stem from different ways and sources of gathering the original texts in the first place.

Obvious examples are GPL-3.0 and AGPL-3.0, which in both this and the license-list repository use different quotation marks. I’d suspect that one was copied in plaintext format and the other was a copy-paste from the HTML website. This might also explain the difference in word wrapping between the two otherwise near-identical licenses.

For consistency’s sake and easier diff-ing, I’d prefer simply choosing one type of quotation mark.

@bradleeedmondson
Contributor Author

We do have a synonyms/equivalents list already, including mostly words (Programme = Program) but also (C) = ©, so I think the plan to reduce these to ASCII and then mark curly/straight quotes as synonyms should probably work well.

https://spdx.org/spdx-license-list/matching-guidelines
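As a rough illustration of how such an equivalents list could be applied before comparing two texts, here is a small Python sketch. The `EQUIVALENTS` table holds only the two pairs mentioned above, and the helper names `apply_equivalents` and `equivalent` are hypothetical; this is not how the SPDX tools implement matching.

```python
import re

# Variant -> canonical form. Only the two pairs named in the comment above;
# the real list in the matching guidelines is longer.
EQUIVALENTS = {
    "programme": "program",
    "(c)": "\u00a9",  # the copyright sign
}

# Match each variant only when it is not embedded in a longer word,
# so "programmer" is left alone.
_PATTERNS = [
    (re.compile(r"(?<!\w)" + re.escape(variant) + r"(?!\w)", re.IGNORECASE), canonical)
    for variant, canonical in EQUIVALENTS.items()
]


def apply_equivalents(text: str) -> str:
    """Rewrite every known variant to its canonical form."""
    for pattern, canonical in _PATTERNS:
        text = pattern.sub(canonical, text)
    return text


def equivalent(a: str, b: str) -> bool:
    """Compare two strings after applying equivalents, ignoring case."""
    return apply_equivalents(a).casefold() == apply_equivalents(b).casefold()
```

Curly and straight quotes could be added to the same table, which is the approach suggested in the comment above.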

@silverhook
Collaborator

@bradleeedmondson Right, but the license texts in the SPDX list may also be used by less sophisticated tools, so reducing them to whatever makes sense sounds great :)

@goneall
Member

goneall commented Jun 10, 2017

I added an issue to the SPDX tools to normalize the quotes when we produce the license list data and website from the XML input format: spdx/tools#95

I'll try to include this improvement before the next release of the license list.

@goneall
Member

goneall commented Nov 24, 2017

I now normalize the quotes in the SPDX tools when doing the compares. I don't normalize them for the license text, but retain the original form (whatever is in the license text XML files).

I did find out that there are some licenses that use two single quotes '' to represent a single double quote. These are also normalized by the tools.
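The normalization described above might look roughly like the following Python sketch. It is not the SPDX tools' code (those are a separate project), and the character sets, the function name `normalize_quotes`, and the decision to keep single and double quotes distinct are all assumptions made for illustration. Backticks are mapped to single quotes so that a doubled backtick is also collapsed into one double quote.

```python
import re

# Quote-like characters mapped to ASCII (illustrative, not exhaustive).
SINGLE_QUOTES = "\u2018\u2019\u201a\u201b`"
DOUBLE_QUOTES = "\u201c\u201d\u201e\u201f"

_TABLE = str.maketrans(
    {**{c: "'" for c in SINGLE_QUOTES}, **{c: '"' for c in DOUBLE_QUOTES}}
)


def normalize_quotes(text: str) -> str:
    """Return text with quote variants reduced to ASCII, for comparison only."""
    text = text.translate(_TABLE)
    # Two consecutive single quotes stand for one double quote.
    return re.sub(r"'{2}", '"', text)
```

Two license texts would then be compared on `normalize_quotes(a) == normalize_quotes(b)`, while the stored text keeps its original quotes.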

@bradleeedmondson does this resolve this issue or do we need to normalize the text rendered on the spdx.org/licenses website?

@wking
Contributor

wking commented Nov 26, 2017 via email

@zvr
Member

zvr commented Nov 26, 2017

Per our matching guidelines:

5.1.3 Guideline: Quotes
Any variation of quotations (single, double, curly, etc.) should be considered equivalent.

Do we believe that this covers the case matching two single quotes with another quote symbol, or do we need to update the text?

@wking
Contributor

wking commented Nov 26, 2017 via email

@bradleeedmondson
Contributor Author

bradleeedmondson commented Nov 29, 2017

I agree with Trevor that we ought to normalize our text in the source XML, and that ``foo" and ``foo'' should be considered encoding errors.

For maintainability on the XML side, I would also suggest that we continue to replace all ampersand-encoded characters (except gt and lt) with UTF-8 literals, in the source, even though tools for generating the lists will also convert these if we leave them. However, I am not opposed to splitting this into a second issue and calling it later release, or at least not immediate release. We could then close this issue, assuming that you're comfortable confirming that the tools do, in fact, replace all the encoded characters when the XML is processed into the output formats.

That would let us close this and move forward with the release while still planning on tidying up the XML later. The more I think about this, the more I think we should do it. Everyone on board with this split?

@wking
Contributor

wking commented Nov 29, 2017

I think the quote issue is already covered by the matching guidelines. I like replacing character entities with UTF-8 where that's legal (making the source easier to read and plain-text versions easier to generate), and normalizing on curly quotes (making the source prettier, and plain-text, ASCII versions slightly harder to generate). And I'm fine splitting that into two separate issues. But I don't see why either of those changes would need to be in the “immediate release” milestone.

@bradleeedmondson
Contributor Author

Fair point -- if the target formats are all correct, this can probably be moved to later release (unblock immediate release). I'd be fine with that.

@goneall
Member

goneall commented Nov 30, 2017

I'm pretty sure the target formats are OK - so I'll update this issue to a future release.

@goneall goneall modified the milestones: Immediate Release, Later Release Nov 30, 2017
@wking wking mentioned this issue Nov 30, 2017
wking added a commit to wking/license-list-XML that referenced this issue Dec 26, 2017
We prefer UTF-8 where possible [1].

[1]: spdx#314
@wking
Contributor

wking commented Mar 26, 2018

Is this fixed? I don't see anything in master besides things that need XML escaping:

$ grep -hor '&[a-z]*;' src | sort | uniq -c
     47 &amp;
    203 &gt;
    200 &lt;

It looks like the cleanup happened in 378fe01, which landed without a PR.
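For anyone who wants to repeat the check above without grep, a roughly equivalent count can be sketched in Python. The function name `count_entities` is hypothetical, and the regular expression is the one from the grep command; the directory to scan is passed in by the caller.

```python
import re
from collections import Counter
from pathlib import Path

# Same pattern as the grep command: "&", lowercase letters, ";".
ENTITY = re.compile(r"&[a-z]*;")


def count_entities(root: str) -> Counter:
    """Count every entity-like match in all files under root."""
    counts = Counter()
    for path in Path(root).rglob("*"):
        if path.is_file():
            text = path.read_text(encoding="utf-8", errors="replace")
            counts.update(ENTITY.findall(text))
    return counts
```

Calling `count_entities("src")` returns a `Counter` keyed by entity. A cleaned-up tree should show only the entities that XML itself requires.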

@jlovejoy
Member

@goneall - I think this has been dealt with and we can close this issue, can you confirm when you have a chance?

@goneall
Member

goneall commented Oct 26, 2018

Agree - looks like it is fixed. I'll go ahead and close.

@goneall goneall closed this as completed Oct 26, 2018