Encoding license information (or lack thereof) #336

Open
gouttegd opened this issue Nov 15, 2023 · 4 comments

@gouttegd
Contributor

gouttegd commented Nov 15, 2023

The license under which a mapping set is published is indicated using a license slot. That’s one of the few set-level metadata slots that are REQUIRED, and it must be “a url to the license of the mapping”.

This raises a couple of questions.

A single license can be identified by many URLs

Consider the four following URLs:

  • https://creativecommons.org/licenses/by/4.0/legalcode.en
  • https://creativecommons.org/licenses/by/4.0/
  • https://spdx.org/licenses/CC-BY-4.0.html
  • https://github.com/FlyBase/drosophila-anatomy-developmental-ontology/blob/master/LICENSE

They all point to the Creative Commons Attribution 4.0 license (aka “CC-BY-4.0”), so they all equally meet the requirement to be “a URL to the license”. Yet the URLs themselves are completely different as strings, preventing any meaningful comparison (e.g., if we wanted to check that two mapping sets are published under the same license).
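As a minimal illustration (not taken from any SSSOM implementation), comparing the four URLs above as plain strings treats them as four unrelated values:

```python
# The four URLs listed above; all of them designate CC-BY-4.0.
license_urls = [
    "https://creativecommons.org/licenses/by/4.0/legalcode.en",
    "https://creativecommons.org/licenses/by/4.0/",
    "https://spdx.org/licenses/CC-BY-4.0.html",
    "https://github.com/FlyBase/drosophila-anatomy-developmental-ontology/blob/master/LICENSE",
]

# A naive string comparison cannot tell that these are the same license.
distinct_values = set(license_urls)
same_license = len(distinct_values) == 1
print(len(distinct_values), same_license)  # prints "4 False"
```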

There are legitimate reasons to prefer any of those four URLs: the first one because it points to the project that created the license in the first place; the second one (without the legalcode.en) because it indirectly points to the same license and is the most commonly used form of the link; the third one because it points to a well-known website whose purpose is precisely to catalog the available licenses; and the fourth one because it points to the actual license file of the project that is publishing a mapping set.

I don’t think there is much we can do about that.

So I think we should make clear to both implementers and users that the license slot is for human consumption only. That is, implementations SHOULD NOT try to automatically interpret the contents of that slot in any way, for example to decide whether it is OK to redistribute a given mapping set or whether two datasets are published under compatible licenses that allow them to be merged and to redistribute the merged set. Such questions should only ever be addressed by humans, after having read the license(s) pointed to by the license slots.

Recording the absence of license

As mentioned above, the license slot is required. Upon encountering a mapping set that does not have such a slot, sssom-py automatically injects a fabricated slot with the value https://w3id.org/sssom/license/unspecified. That value is not mentioned anywhere in the spec or the documentation, and the URL does not resolve to anything.

Do we want/need a unique special value to indicate “No license specified”?

I personally don’t see the need for such a value. The absence of the license slot should be enough. Upon encountering a mapping set without a license, implementations can either reject the set outright or proceed anyway (maybe after emitting a warning), but I don’t see what value is added by injecting a pseudo-value.
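The two options for a reader (reject, or proceed with a warning) could be sketched roughly as follows. This is only an illustration: `check_license`, `MissingLicenseError`, the `strict` flag, and the use of a plain dict for set-level metadata are all hypothetical and do not reflect the sssom-py API.

```python
import warnings


class MissingLicenseError(ValueError):
    """Hypothetical error raised in strict mode when no license slot is present."""


def check_license(metadata: dict, strict: bool = False) -> dict:
    """Check for a license slot without ever injecting a pseudo-value.

    `metadata` is assumed to be a plain dict of set-level slots.
    """
    if not metadata.get("license"):
        if strict:
            # Option 1: reject the set outright.
            raise MissingLicenseError("mapping set has no license slot")
        # Option 2: proceed anyway, after emitting a warning.
        warnings.warn("mapping set has no license slot", UserWarning)
    # The metadata is returned untouched in all non-failing cases.
    return metadata
```

Either way, the metadata handed back to the caller still shows that the license slot was absent in the original set.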

If we do want a pseudo-value to indicate the absence of a license: Do we want it to be unique? That is, should all implementations inject the same pseudo-value (like https://w3id.org/sssom/license/unspecified)? Or should implementations be free to inject a value of their choosing, as long as it is a URL that conveys the fact that the license is unknown?

If we do want such a unique value, then it should be specified somewhere in the spec, and ideally, it should be a URL that actually points to something. And instead of https://w3id.org/sssom/license/unspecified, I would like a URL that better highlights the fact that, in the absence of an explicit license, the mapping set must be assumed to fall under the normal copyright rules; so I’d suggest something like https://w3id.org/sssom/license/all-rights-reserved instead.

@matentzn
Collaborator

I agree that standardising the values should not happen at the standard (SSSOM) level, but at the organisation level. Ideally, there would be a database of licenses with standard identifiers to use, but in the absence of that, a URL is the best thing we can do. They are mostly for human consumption, so I am happy to follow your way of thinking, although I know that folks like @cthoyt would disagree and would want to force a more rigorous standard for recording licenses (which we will implement at the organisation level, rather than the standard level).

I am certainly not opposed to changing the default license to https://w3id.org/sssom/license/all-rights-reserved, but I do think that having license required is an incredibly important part of our wider mission. While using an "unspecified" license seems little better than using no license, it at least makes it explicit that you have not specified it, which creates good grounds for criticism during a review, for example. In general, it is too important an element to drop from the required list.

We can make the new default license resolvable quickly.

@cthoyt
Member

cthoyt commented Nov 16, 2023

SPDX (https://bioregistry.io/spdx) is such a semantic space of standard licenses.

You are right that I would like people who are using SSSOM to have to make a decision. SSSOM doesn't come from nowhere, so whoever makes it, either by manual or automated means, should be required to make a choice. I like it being in the standard, but people are used to not having to actually put a license (when using sssom-py) so there's also a social cost to enforcing better standards to consider.

All that being said, the SSSOM spec says license is required, so having sssom-py stop skirting the spec also seems reasonable.

@gouttegd
Contributor Author

Ideally, there would be a database of licenses with standard identifiers to use

Well, there is. The SPDX license list and its short identifiers are precisely that.

But I’d be opposed to using such identifiers in the license slot. They would impose a supplementary lookup on the humans who want to know exactly what the terms of the license are, instead of just having to follow a link. This would needlessly complicate the tasks of humans while not making the tasks of machines any easier (overall, machines cannot perform any task related to licensing issues; you need a human reading the terms in order to resolve any such issue; giving the machine an easily comparable identifier accomplishes nothing).

At best, the SSSOM spec can suggest or even recommend that people use SPDX URLs (e.g. https://spdx.org/licenses/CC-BY-4.0.html), but that won’t solve anything. (Of course that doesn't prevent organisations from enforcing the use of such URLs for the mapping sets they produce.)

I do think that having license required is an incredibly important part of our wider mission […] it is too important an element to drop from the required list.

I have no objection to that slot being required. The spec can mandate that any SSSOM writer MUST include a license (defaulting to a special value like https://w3id.org/sssom/license/all-rights-reserved if the user behind the software did not explicitly say which license he or she wanted to use); it can mandate that any SSSOM reader MUST complain upon parsing a set that does not have such a slot.

My question is about forcefully injecting a default value when reading a set that does not have one (which is the current behaviour of the sssom-py parser). I personally don’t think it is useful and I want to know if this is a behaviour that you think the spec should mandate or recommend. The spec currently says nothing about that (it says “in the absence of license we assume no license”, which does not imply that implementations should automatically fill the slot).
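A rough sketch of the writer-side rule, assuming set-level metadata is held in a plain dict. The function name `metadata_for_writing` is hypothetical, and the default URL is only the value proposed in this thread; it is not currently defined by the spec.

```python
# Value proposed in this thread; not (yet) defined by the SSSOM spec.
PROPOSED_DEFAULT_LICENSE = "https://w3id.org/sssom/license/all-rights-reserved"


def metadata_for_writing(metadata: dict) -> dict:
    """Return a copy of the set-level metadata that always carries a license.

    The default is applied only when the user did not choose a license;
    an explicitly chosen license is never overwritten.
    """
    prepared = dict(metadata)  # leave the caller's dict untouched
    if not prepared.get("license"):
        prepared["license"] = PROPOSED_DEFAULT_LICENSE
    return prepared
```

With this approach the default is applied once, at writing time, so a reader never has to fabricate a value for a set that was already published without one.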

it at least makes it explicit that you have not specified it, which creates a good ground for criticism during a review for example

You do not need the parser to auto-fill the slot with a default value for that.

@jonquet

jonquet commented Nov 16, 2023

Hello. Just to let you know, the most complete list of URIs for licenses I have found is https://rdflicense.linkeddata.es/
But the mapping issue is not completely solved. And I am in favor of using the URI of the original provider of the license (https://creativecommons.org/licenses/by/4.0/), but for harmonisation purposes it's not trivial.
