Programmatically replace HTML-escaped characters? #314
I assume curly quotes will be left curly and straight quotes will be left straight. This presents a problem for matching, but leaving them as they were in the source file helps reduce this for now. Later, they might be added to the synonym/equivalence list to allow them to match each other when building a match... |
Note that we need to leave greater-than and less-than characters escaped, but can replace quotes, apostrophes, and other characters with their ASCII characters. |
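A minimal sketch of the selective replacement described above, assuming it is applied to text content rather than attribute values. The function name `unescape_except_markup` is hypothetical, and keeping `&amp;` escaped is an added assumption (the comment names only greater-than and less-than). Note that `html.unescape` produces the literal Unicode character for each reference, so curly-quote entities stay curly rather than becoming ASCII.

```python
import html
import re

# Matches named (&quot;), decimal (&#39;) and hexadecimal (&#x27;) references.
ENTITY = re.compile(r"&(#[0-9]+|#[xX][0-9a-fA-F]+|[A-Za-z][A-Za-z0-9]*);")

# Characters that must stay escaped so the XML remains well-formed.
# Including "&" is an assumption beyond what the comment above states.
MUST_STAY_ESCAPED = {"<", ">", "&"}


def unescape_except_markup(text: str) -> str:
    """Replace character references with literals, except <, > and &."""

    def replace(match: re.Match) -> str:
        literal = html.unescape(match.group(0))
        if literal in MUST_STAY_ESCAPED:
            # Leave the reference exactly as written in the source.
            return match.group(0)
        return literal

    return ENTITY.sub(replace, text)


print(unescape_except_markup("a &lt; b and &quot;c&quot; &gt; d"))
```

Checking the decoded character rather than the entity name means numeric forms such as `&#60;` are also left alone.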
Brad to find example of UTF-8 character and check schemadev branch to see whether tag-conversion tool has taken care of this already and let Gary know. |
Keep in mind that some of the curly and straight quotes seem to stem from the different ways and sources used to gather the original texts in the first place. Obvious examples are GPL-3.0 and AGPL-3.0, which use different quotation marks in both this and the license-list repository. I’d suspect that one was copied in plaintext format and the other was a copy-paste from the HTML website. This might also explain the difference in word wrapping between the two otherwise near-identical licenses. For consistency’s sake and easier |
We do have a synonyms/equivalents list already, including mostly words (Programme=Program) but also (C) = ©, so I think the plan to reduce these to ascii and then mark curly/straight quotes as synonyms should probably work well. |
@bradleeedmondson Right, but the license texts in the SPDX list may be used also by less sophisticated tools, so reducing them to whatever makes sense, sounds great :) |
I added an issue to the SPDX tools to normalize the quotes when we produce the license list data and website from the XML input format: spdx/tools#95. I'll try to include this improvement before the next release of the license list. |
I now normalize the quotes in the SPDX tools when doing the compares. I don't normalize them for the license text, but retain the original form (whatever is in the license text XML files). I did find out that there are some licenses that use two single quotes '' to represent a single double quote. These are also normalized by the tools. @bradleeedmondson does this resolve this issue or do we need to normalize the text rendered on the spdx.org/licenses website? |
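As a rough illustration of the comparison-time normalization described above (not the actual SPDX tools code), the sketch below maps curly quotes to straight ones and then collapses two single quotes into one double quote. The name `normalize_quotes` and the character list are hypothetical; treating a backtick as a single quote is an assumption made to cover the LaTeX-style form.

```python
import re

# Illustrative mapping only; the real tools may cover more characters.
QUOTE_MAP = str.maketrans({
    "\u2018": "'",  # left single quotation mark
    "\u2019": "'",  # right single quotation mark
    "\u201c": '"',  # left double quotation mark
    "\u201d": '"',  # right double quotation mark
    "`": "'",       # backtick (assumption: treated as a single quote)
})


def normalize_quotes(text: str) -> str:
    """Return a copy of text with quote variants reduced to ASCII forms."""
    text = text.translate(QUOTE_MAP)
    # Two consecutive single quotes stand for one double quote.
    return text.replace("''", '"')


# The original text is left untouched; only the compared copies change.
print(normalize_quotes("``foo''") == normalize_quotes("\u201cfoo\u201d"))
```

Only the strings being compared are normalized, which matches the stated approach of retaining the original form in the license text itself.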
On Fri, Nov 24, 2017 at 11:58:13PM +0000, goneall wrote:
I did find out that there are some licenses that use two single
quotes '' to represent a single double quote.
Some of these look like our source was copied from LaTeX or a related
language [1], e.g. [2]. I'm in favor of replacing those with Unicode
curly quotes (“”) in our XML with an <alt> to allow our original form.
[1]: https://en.wikibooks.org/wiki/LaTeX/Text_Formatting#Quote-marks
[2]: https://github.com/spdx/license-list-XML/blob/9f4432fbb660510859417b3d78a795beeeb8279b/src/ErlPL-1.1.xml#L20
|
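Per our matching guidelines:
> 5.1.3 Guideline: Quotes Any variation of quotations (single, double, curly, etc.) should be considered equivalent.
Do we believe that this covers the case of matching two single quotes with another quote symbol, or do we need to update the text? |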
On Sun, Nov 26, 2017 at 09:02:43AM +0000, Alexios Zavras (zvr) wrote:
Per our matching guidelines:
> 5.1.3 Guideline: Quotes Any variation of quotations (single,
> double, curly, etc.) should be considered equivalent.
Do we believe that this covers the case matching two single quotes
with another quote symbol…
I think it does…
… or do we need to update the text?
… but I'd like to update the text anyway ;). I consider ``foo'' in
our XML to basically be an encoding error, which we should fix in our
XML rendering of the upstream source.
|
I agree with Trevor that we ought to normalize our text in the source XML, and that ``foo" and ``foo'' should be considered encoding errors. For maintainability on the XML side, I would also suggest that we continue to replace all ampersand-encoded characters (except gt and lt) with UTF-8 literals in the source, even though the tools for generating the lists will also convert these if we leave them. However, I am not opposed to splitting this into a second issue and assigning it to later release, or at least not immediate release. We could then close this issue, assuming that you're comfortable confirming that the tools do, in fact, replace all the encoded characters when the XML is processed into the output formats. That would let us close this and move forward with the release while still planning on tidying up the XML later. The more I think about this, the more I think we should do it. Is everyone on board with this split? |
I think the quote issue is already covered by the matching guidelines. I like replacing character entities with UTF-8 where that's legal (making the source easier to read and plain-text versions easier to generate), and normalizing on curly quotes (making the source prettier, and plain-text, ASCII versions slightly harder to generate). And I'm fine splitting that into two separate issues. But I don't see why either of those changes would need to be in the “immediate release” milestone. |
Fair point -- if the target formats are all correct, this can probably be moved to later release (unblock immediate release). I'd be fine with that. |
I'm pretty sure the target formats are OK - so I'll update this issue to a future release. |
We prefer UTF-8 where possible [1]. [1]: spdx#314
Is this fixed? I don't see anything in master besides things that need XML escaping:
$ grep -hor '&[a-z]*;' src | sort | uniq -c
47 &amp;
203 &gt;
200 &lt;
It looks like the cleanup happened in 378fe01, which landed without a PR. |
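For anyone who wants to repeat this check without a shell, here is a rough Python equivalent of the grep pipeline above. It uses the same pattern; the helper names `count_entities` and `count_entities_in_tree` are hypothetical, and `src` is the directory named in the command.

```python
import re
from collections import Counter
from pathlib import Path

# Same pattern as the grep command: "&", lowercase letters, ";".
ENTITY = re.compile(r"&[a-z]*;")


def count_entities(texts):
    """Count each entity found across an iterable of strings."""
    counts = Counter()
    for text in texts:
        counts.update(ENTITY.findall(text))
    return counts


def count_entities_in_tree(root):
    """Count entities in every regular file below root."""
    paths = (p for p in Path(root).rglob("*") if p.is_file())
    return count_entities(
        p.read_text(encoding="utf-8", errors="replace") for p in paths
    )


if __name__ == "__main__" and Path("src").is_dir():
    for entity, count in sorted(count_entities_in_tree("src").items()):
        print(count, entity)
```

Anything other than the entities required for XML escaping appearing in the output would indicate that some references still need replacing.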
@goneall - I think this has been dealt with and we can close this issue, can you confirm when you have a chance? |
Agree - looks like it is fixed. I'll go ahead and close. |
Opening this issue to confirm that @myndzi plans to programmatically replace HTML-escaped characters (the entity forms of ", ', etc.) on a later pass (i.e. there is no need to replace these by hand). I believe I recall correctly (though I could always be mistaken) that he planned to do so at some point and waved us off modifying these manually. If so, however, do we need to worry about straight quotes in ASCII vs. UTF-8 curly quotes? Or is that all taken care of?