Programmatically replace HTML-escaped characters? #314

bradleeedmondson · 2016-05-12T19:48:40Z

Opening this issue to confirm that @myndzi plans to programmatically replace HTML-escaped characters like &quot;, &apos;, etc., on a later pass (i.e. there is no need to replace these by hand). I believe I recall correctly (though could always be mistaken) that he planned to do so at some point and waved us off modifying these manually.

If so, however, do we need to worry about straight quotes in ascii vs. UTF-8 curly quotes? Or is that all taken care of?


myndzi · 2016-05-16T20:23:10Z

I assume curly quotes will be left curly and straight quotes will be left straight. This presents a problem for matching, but leaving them as they were in the source file helps reduce this for now. Later, they might be added to the synonym/equivalence list to allow them to match each other when building a match...

bradleeedmondson · 2017-06-08T17:29:36Z

Note that we need to leave greater-than and less-than characters escaped, but can replace quotes, apostrophes, and other characters with their ASCII characters.

bradleeedmondson · 2017-06-08T17:34:24Z

Brad to find example of UTF-8 character and check schemadev branch to see whether tag-conversion tool has taken care of this already and let Gary know.

silverhook · 2017-06-09T06:41:54Z

Keep in mind that some of the curly and straight quotes seem to stem from different ways and sources of gathering the original texts in the first place.

Obvious examples are GPL-3.0 and AGPL-3.0, which use different quotation marks in both this and the license-list repository. I’d suspect that one was copied in plaintext format and the other was a copy-paste from the HTML website. This might also explain the difference in word wrapping between the two, otherwise near-identical licenses.

For consistency’s sake and easier diffing, I’d prefer simply choosing one type of quotation mark.

bradleeedmondson · 2017-06-09T14:23:50Z

We do have a synonyms/equivalents list already, including mostly words (Programme=Program) but also (C) = ©, so I think the plan to reduce these to ascii and then mark curly/straight quotes as synonyms should probably work well.

https://spdx.org/spdx-license-list/matching-guidelines

silverhook · 2017-06-09T14:29:05Z

@bradleeedmondson Right, but the license texts in the SPDX list may also be used by less sophisticated tools, so reducing them to whatever makes sense sounds great :)

goneall · 2017-06-10T16:58:07Z

I added an issue to the SPDX tools to normalize the quotes when we produce the license list data and website from the XML input format: spdx/tools#95

I'll try to include this improvement before the next release of the license list.

goneall · 2017-11-24T23:58:12Z

I now normalize the quotes in the SPDX tools when doing the compares. I don't normalize them for the license text, but retain the original form (whatever is in the license text XML files).

I did find out that there are some licenses that use two single quotes '' to represent a single double quote. These are also normalized by the tools.
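As a rough illustration of compare-time normalization (not the SPDX tools code itself), the sketch below folds every quote variant, including the doubled `` and '' forms, to one ASCII double quote before comparing. The names `normalize_quotes` and `quotes_equivalent` are hypothetical, and the list of variants is an assumption rather than the tools' actual list.

```python
import re

# Doubled forms are listed first so that '' is folded as a unit rather than
# as two separate single quotes.
_QUOTE_VARIANTS = re.compile("``|''|[\"'`\u2018\u2019\u201c\u201d]")


def normalize_quotes(text):
    """Fold all quote variants to a plain ASCII double quote (comparison only)."""
    return _QUOTE_VARIANTS.sub('"', text)


def quotes_equivalent(a, b):
    """Return True if the two strings differ only in their quote style."""
    return normalize_quotes(a) == normalize_quotes(b)


print(quotes_equivalent("\u201cSoftware\u201d", "``Software''"))
```

Apostrophes are folded too, so the normalized string is only suitable for comparison. The stored license text keeps its original form, as described above.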

@bradleeedmondson does this resolve this issue or do we need to normalize the text rendered on the spdx.org/licenses website?

wking · 2017-11-26T06:05:43Z

On Fri, Nov 24, 2017 at 11:58:13PM +0000, goneall wrote:
> I did find out that there are some licenses that use two single quotes '' to represent a single double quote.

Some of these look like our source was copied from LaTeX or a related language [1], e.g. [2]. I'm in favor of replacing those with Unicode curly quotes (“”) in our XML with an <alt> to allow our original form.

[1]: https://en.wikibooks.org/wiki/LaTeX/Text_Formatting#Quote-marks
[2]: https://github.com/spdx/license-list-XML/blob/9f4432fbb660510859417b3d78a795beeeb8279b/src/ErlPL-1.1.xml#L20

zvr · 2017-11-26T09:02:42Z

Per our matching guidelines:

5.1.3 Guideline: Quotes Any variation of quotations (single, double, curly, etc.) should be considered equivalent.

Do we believe that this covers the case matching two single quotes with another quote symbol, or do we need to update the text?

wking · 2017-11-26T11:08:41Z

On Sun, Nov 26, 2017 at 09:02:43AM +0000, Alexios Zavras (zvr) wrote:
> Per our matching guidelines:
>
> > 5.1.3 Guideline: Quotes Any variation of quotations (single,
> > double, curly, etc.) should be considered equivalent.
>
> Do we believe that this covers the case matching two single quotes with another quote symbol…

I think it does…

> … or do we need to update the text?

… but I'd like to update the text anyway ;). I consider ``foo'' in our XML to basically be an encoding error, which we should fix in our XML rendering of the upstream source.
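One way such a source cleanup could look, as a sketch only: the hypothetical helper `latex_quotes_to_curly` rewrites LaTeX-style ``foo'' (and the mixed ``foo" form) to Unicode curly quotes. It assumes the opening `` is always paired with a later closing '' or ", and it does not attempt the <alt> wrapping mentioned earlier in the thread.

```python
import re

# An opening `` followed by the shortest run of text that ends in '' or ".
_LATEX_QUOTED = re.compile(r"``(.+?)(?:''|\")", re.DOTALL)


def latex_quotes_to_curly(text):
    """Rewrite ``foo'' and ``foo" as curly-quoted text."""
    return _LATEX_QUOTED.sub("\u201c\\1\u201d", text)


print(latex_quotes_to_curly("the ``Software'' and the ``Licensor\""))
```

Text without an opening `` is left untouched, so ordinary straight quotes are not affected by this pass.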

bradleeedmondson · 2017-11-29T18:19:37Z

I agree with Trevor that we ought to normalize our text in the source-XML, and in considering ``foo" or ``foo'' to be encoding errors.

For maintainability on the XML side, I would also suggest that we continue to replace all ampersand-encoded characters (except gt and lt) with UTF-8 literals, in the source, even though tools for generating the lists will also convert these if we leave them. However, I am not opposed to splitting this into a second issue and calling it later release, or at least not immediate release. We could then close this issue, assuming that you're comfortable confirming that the tools do, in fact, replace all the encoded characters when the XML is processed into the output formats.

That would let us close this and move forward with the release while still planning on tidying up the XML later. The more I think about this, the more I think we should do it. Everyone on board with this split?

wking · 2017-11-29T18:39:12Z

I think the quote issue is already covered by the matching guidelines. I like replacing character entities with UTF-8 where that's legal (making the source easier to read and plain-text versions easier to generate), and normalizing on curly quotes (making the source prettier, and plain-text, ASCII versions slightly harder to generate). And I'm fine splitting that into two separate issues. But I don't see why either of those changes would need to be in the “immediate release” milestone.

bradleeedmondson · 2017-11-29T19:54:19Z

Fair point -- if the target formats are all correct, this can probably be moved to later release (unblock immediate release). I'd be fine with that.

goneall · 2017-11-30T04:34:56Z

I'm pretty sure the target formats are OK - so I'll update this issue to a future release.

wking · 2018-03-26T17:27:02Z

Is this fixed? I don't see anything in master besides things that need XML escaping:

$ grep -hor '&[a-z]*;' src | sort | uniq -c
     47 &amp;
    203 &gt;
    200 &lt;

It looks like the cleanup happened in 378fe01, which landed without a PR.
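For anyone who wants to repeat the check above without grep, here is a hedged Python equivalent. The function names are hypothetical, and the pattern is copied from the command, so like the original it only counts lowercase named entities and ignores numeric references such as &#8220;.

```python
import re
from collections import Counter
from pathlib import Path

# Same pattern as the grep command: "&", lowercase letters, ";".
_NAMED_ENTITY = re.compile(r"&[a-z]*;")


def count_entities(texts):
    """Count each named entity across an iterable of strings."""
    counts = Counter()
    for text in texts:
        counts.update(_NAMED_ENTITY.findall(text))
    return counts


def count_entities_in_tree(root):
    """Count named entities in every file under root (for example, src)."""
    files = (p for p in Path(root).rglob("*") if p.is_file())
    return count_entities(
        p.read_text(encoding="utf-8", errors="replace") for p in files
    )


print(sorted(count_entities(["a &amp; b &lt;c&gt;", "&amp;"]).items()))
```

If `count_entities_in_tree("src")` returns only `&amp;`, `&gt;`, and `&lt;`, the tree contains no remaining named entities beyond those XML itself requires.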

jlovejoy · 2018-10-26T02:33:25Z

@goneall - I think this has been dealt with and we can close this issue, can you confirm when you have a chance?

goneall · 2018-10-26T04:38:18Z

Agree - looks like it is fixed. I'll go ahead and close.

bradleeedmondson self-assigned this Jun 8, 2017

jlovejoy added this to the Immediate Release milestone Jun 8, 2017

goneall modified the milestones: Immediate Release, Later Release Nov 30, 2017

wking mentioned this issue Nov 30, 2017

Add the EPL-2.0 #499

Merged

wking added a commit to wking/license-list-XML that referenced this issue Dec 26, 2017

EPL-2.0: Literal apostrophes (vs. XML entity)

773f38a

We prefer UTF-8 where possible [1]. [1]: spdx#314

goneall closed this as completed Oct 26, 2018
