Change all URIs to IRIs or URLs, depending on context. Resolves #369 #466

Merged
merged 4 commits into from
Nov 9, 2022

Conversation

jakebeal
Contributor

Change all URIs to IRIs or URLs, depending on context. Resolves #369
Also adds mapping information for SBOL2/SBOL3 regarding namespace, identity, and version

Per note on #369, I believe this is a non-SEP change.

@jakebeal
Contributor Author

@udp : I'd like you to look especially at the section on mapping identifiers and versions between SBOL2 and SBOL3, since I think you might want to adjust sbolgraph based on these recommendations.

@cjmyers
Contributor

cjmyers commented Sep 17, 2021

I'm not comfortable with this change without more discussion. I understand the issue from the tracker. However, the only part of the URI where I think this makes a big difference is the displayId, since this is the main part people see, and we have long limited displayIds to alphanumeric and underscore characters. Introducing additional characters to support other alphabets has a high potential to break software. Code in many places checks that displayIds are restricted to English alphanumerics, so some software will declare these SBOL files invalid and refuse to process them. I suggest we delay this change until testing can be done.

@jakebeal
Contributor Author

Is this something that's a libSBOLj restriction?

The SBOL2 document doesn't actually specify English anywhere as a restriction on alphanumeric. A displayId is a string, and the referenced string type includes Unicode. The linked definition of anyURI also allows the full range of IRIs, despite being called "anyURI":

anyURI represents an Internationalized Resource Identifier Reference (IRI)

As a consequence, pySBOL has long supported any Unicode character that tests as alphanumeric, since that's what the specification already required.

@cjmyers
Contributor

cjmyers commented Sep 17, 2021

The issue is I don't know if it is or is not. I've not tested this. If you have test files that we can use to verify that software will not break, then we can validate there are no issues. But before testing, I'm not comfortable with this change.

Try uploading an SBOL2 file with international characters in its displayIds to SBH and see what happens. Try opening it with SBOLCanvas as well. I'm really unsure whether there will be problems, but I would prefer not to make the change until we are sure there will be none.

@jakebeal
Contributor Author

I just tested with SBOL Canvas and SynBioHub. Both of them reject the characters as invalid.

Why is this a problem, though? Both of them are SBOL2, and the draft says that you SHOULD escape these characters when converting from SBOL3 to SBOL2.

@cjmyers
Contributor

cjmyers commented Sep 17, 2021

The library uses simple RegEx to check validity. The RegEx does not include non-English characters, so they are rejected.
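
To make the difference concrete, here is a minimal sketch comparing an ASCII-only regular expression with a Unicode-aware check. The pattern and both function names are hypothetical; this is not libSBOLj's actual RegEx, only an illustration of why a check of that kind would reject non-English letters.

```python
import re

# Hypothetical ASCII-only pattern (an assumption, NOT copied from libSBOLj):
# letters, digits, and underscore, not starting with a digit.
ASCII_DISPLAY_ID = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def is_ascii_display_id(value: str) -> bool:
    """Accept only ASCII alphanumeric/underscore displayIds."""
    return ASCII_DISPLAY_ID.fullmatch(value) is not None


def is_unicode_display_id(value: str) -> bool:
    """Accept any character Python considers alphanumeric, plus '_'."""
    if not value or value[0].isdigit():
        return False
    return all(ch.isalnum() or ch == "_" for ch in value)


if __name__ == "__main__":
    for candidate in ["promoter_1", "Größe", "1abc"]:
        print(candidate,
              is_ascii_display_id(candidate),
              is_unicode_display_id(candidate))
```

Under these assumptions, a displayId such as `Größe` passes the Unicode-aware check but fails the ASCII-only one, which matches the rejection behavior described above.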

Given your experiment and the fact that many tools in the wild use libSBOLj, I think we should hold this change for now. It would break tools. Even if we update libSBOLj, we cannot guarantee developers will update their tools to the new version immediately.

By the way, there are ways to convert special characters into the English alphabet (at least according to my German student). We could consider converting them in an SBOL3 to SBOL2 converter, assuming SBOL3 libraries are ALL okay with this. Have you tested Goksel's libSBOLj3?

@jakebeal jakebeal requested a review from goksel September 17, 2021 19:36
@jakebeal
Contributor Author

I think that we're in agreement that SBOL2 doesn't in practice support IRIs, and that conversion from unicode to ASCII would typically be necessary for SBOL3->SBOL2. That's not a problem.
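
As a rough illustration of the SBOL3-to-SBOL2 direction, one option is to escape each non-ASCII character. The `_uXXXX_` scheme and the function name below are assumptions made for this sketch; neither comes from the draft or from any SBOL library.

```python
def escape_display_id(display_id: str) -> str:
    """Replace each non-ASCII character with a hypothetical _uXXXX_ escape.

    The result contains only ASCII characters. Because every escape starts
    with an underscore, an escaped leading character never becomes a digit.
    """
    parts = []
    for ch in display_id:
        if ch.isascii():
            parts.append(ch)  # ASCII characters pass through unchanged
        else:
            parts.append(f"_u{ord(ch):04X}_")  # code point in uppercase hex
    return "".join(parts)


if __name__ == "__main__":
    print(escape_display_id("Größe"))
```

A scheme like this is not reversible without extra rules, since an original displayId could already contain text that looks like an escape, so a real converter would need to settle that case explicitly.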

For SBOL3, I expect that @udp's library supports IRIs, since he requested the change. I don't know about @goksel, so have added him as a reviewer.

@tcmitchell
Contributor

This is a change to SBOL3, not SBOL2. I don't think we should worry too much about SBOL2 tools (SynBioHub, libSBOLj) and how they handle SBOL2 displayIds. Since SBOL3 is based on tooling that has a broader definition of alphanumeric than "English alphanumeric" (or ASCII), I think SBOL3 should embrace a wider variety of characters than "English alphanumeric". I had interpreted the specification more broadly when I read it.

As far as implementation, pySBOL3 uses Python's isalnum, which relies on isalpha, which uses this definition of alphabetic characters:

Alphabetic characters are those characters defined in the Unicode character database as “Letter”, i.e., those with general category property being one of “Lm”, “Lt”, “Lu”, “Ll”, or “Lo”. Note that this is different from the “Alphabetic” property defined in the Unicode Standard.

I think the spec should be updated with definitions for "alphanumeric" and "underscore" and "digit". Something along the lines of the above definition so that there is less room for interpretation by individual tools and developers.
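
A sketch of what such definitions could look like in Python, using `unicodedata` general categories. Treating "digit" as general category Nd and "underscore" as U+005F are assumptions for illustration, and the function names are hypothetical; this is not pySBOL3's implementation.

```python
import unicodedata

# "Letter" categories, as in the Python definition quoted above.
LETTER_CATEGORIES = {"Lm", "Lt", "Lu", "Ll", "Lo"}


def is_letter(ch: str) -> bool:
    return unicodedata.category(ch) in LETTER_CATEGORIES


def is_digit(ch: str) -> bool:
    # Assumption: "digit" means a decimal digit (general category Nd).
    return unicodedata.category(ch) == "Nd"


def is_underscore(ch: str) -> bool:
    # Assumption: only U+005F counts as "underscore".
    return ch == "\u005F"


def is_valid_display_id(value: str) -> bool:
    """Letters, digits, or underscore only, and not starting with a digit."""
    if not value or is_digit(value[0]):
        return False
    return all(is_letter(c) or is_digit(c) or is_underscore(c) for c in value)


if __name__ == "__main__":
    for candidate in ["Größe_1", "x²", "٣abc"]:
        print(candidate, is_valid_display_id(candidate), candidate.isalnum())
```

Explicit categories matter because `str.isalnum` is broader than "letter or decimal digit": it also accepts characters such as the superscript two in `x²`, which the category-based check above rejects.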

@@ -16,18 +16,18 @@ \subsection{Identified}

\subparagraph{The \sbolheading{displayId} property}
\label{sec:displayId}
-The \sbol{displayId} property is an OPTIONAL identifier with a data type of \sbol{String}. This property is intended to be an intermediate between a URI and the \sbol{name} property that is machine-readable, but more human-readable than the full URI of an object.
+The \sbol{displayId} property is an OPTIONAL identifier with a data type of \sbol{String}. This property is intended to be an intermediate between a IRI and the \sbol{name} property that is machine-readable, but more human-readable than the full IRI of an object.

If the \sbol{displayId} property is used, then its \sbol{String} value MUST be composed of only alphanumeric or underscore characters and MUST NOT begin with a digit.
Contributor

Can definitions be added for "alphanumeric", "underscore", and "digit"?

For example, Python uses the following definition of "alphabetic" (a component of "alphanumeric"):

Alphabetic characters are those characters defined in the Unicode character database as “Letter”, i.e., those with general category property being one of “Lm”, “Lt”, “Lu”, “Ll”, or “Lo”. Note that this is different from the “Alphabetic” property defined in the Unicode Standard.

For "underscore" I think we probably mean Unicode U+005F. Wikipedia's definition of underscore lists three possibilities though.

Contributor Author


That's correct, the one that's equivalent to ASCII underscore 0x5F

@cjmyers
Contributor

cjmyers commented Sep 21, 2021

I would like to discuss this at our next SBOL3 meeting. You are correct that this is an SBOL3 change, but it is potentially going to affect SBOL2 tools as well. For example, it is possible to upload SBOL3 to SynBioHub now, but this would break if you used non-English alphabets in displayIds. Also, we need to ensure that conversion tools are capable of changing non-English characters to English characters when converting from SBOL3 to SBOL2. I would like to propose that this is a 3.1.0 change, so we can have some time to work out these issues, and avoid delaying the release of 3.0.1 as we work them out.

@jakebeal
Contributor Author

I'm fine with pushing this to 3.1 as long as you're OK that pySBOL3 allows the more liberal definition.

@cjmyers
Contributor

cjmyers commented Sep 21, 2021

If pySBOL3 can create content with non-English alphabets, then there will be issues with these files. Is there an urgent need to support this now?

@tcmitchell
Contributor

No, we do not have an urgent need for you to support this now. It will be good to clarify the spec as a first step. Users of SynBioHub will have to be careful to limit themselves to ASCII characters.

@cjmyers
Contributor

cjmyers commented Sep 21, 2021

Actually, my question is do you have an urgent need to have pySBOL3 support this now?

@tcmitchell
Contributor

pySBOL3 has supported Unicode alphanumeric displayIds since at least August, 2020. We're not making a change to support Unicode displayIds, which is probably why I interpreted your question differently. We have supported this for over a year at least. Probably longer than that.

@cjmyers
Contributor

cjmyers commented Sep 21, 2021

I see. Ok, well, hopefully we can get some solution to this soon then. Not sure if many people are using this feature as of yet. Have you seen it being used?

@jakebeal
Contributor Author

Not blatantly, but with Excel-to-SBOL it may be getting used already without being obvious.

# Conflicts:
#	apdx-validation.tex
#	feature.tex
#	location.tex
#	uml/feature.pdf
#	umlet_source/feature.uxf
#	vocabulary.tex
@jakebeal jakebeal mentioned this pull request Aug 30, 2022
@LukasBuecherl LukasBuecherl merged commit 8e762c3 into master Nov 9, 2022
@LukasBuecherl LukasBuecherl deleted the issue-369-IRIs branch November 9, 2022 16:17
Development

Successfully merging this pull request may close these issues.

URIs -> IRIs
4 participants