Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DANS/Local PermaLink PID Provider #8674

Merged

Conversation

qqmyers
Copy link
Member

@qqmyers qqmyers commented May 4, 2022

What this PR does / why we need it: Refactors PID classes and adds a new PermaLink Provider for catalog/replica use

Which issue(s) this PR closes:

This is related to #4106 and related issues, but does not close them.

Closes #

Special notes for your reviewer:
The main purpose of the PR is to add a new 'PermaLink' provider that can provide a URI for resolution (via the Citation servlet) and that looks significantly different from a DOI (no required internal slashes, no required 10.xxxx in initial authority). The class could at some point be extended to use an external resolver (nominally the Citation servlet is separate from the app already, but it does not deal with minting PIDs or keeping any metadata.)

This PR takes significant steps toward making PID Providers plugable and allowing multiple DOI accounts to manage different authority/shoulder combinations. Key changes include:

  • moving all PID minting code into the Provider classes
  • moving all PID recognition from string or protocol/authority/identifier triples into these classes
  • adds a dynamic list of Providers (in PIDUtil) that are scanned to see if they recognize/can manage a specific PID

Other changes include:

  • handling the ToDo to remove the deprecated getGlobalIdAsString method
  • adding an asUrl method that returns a String to avoid having to do URL.toString() everywhere.
  • making the Citation servlet handle both Dataset and DataFile PIDs
  • the Global ID class now caches the separator ('/' for DOI and Handle), Preferred URL prefix for the identifier, and the name of the PIDProvider it is associated with. These are populated during Construction.
  • as a follow-on, GlobalId instances are now created via the PIDproviders rather than directly by Dataset/file services.
  • simplified the Fake provider - now uses the common isGlobalIdUnique method instead of providing call-stack dependent true/false answers as it did.

Things not done that would be needed to handle different or authority/shoulder-specific Providers:

  • change PIDProvider classes to not be @stateless Beans and instead be dynamically loaded/instantiated, e.g. with one DOIProvider per authority/shoulder. The PIDHelper/PIDUtil classes would then be adjusted to read a new set of properties/microconfig settings to decide which PIDProviders to instantiate. (Perhaps it should be called PIDProviderFactory or similar at that point.)
  • the logic to get the relevant PIDProvider when publishing, etc. would change to read the provider name from the Global ID and retrieve the instance by name from the map maintained by the PIDUtil.
  • to support individual PIDs managed by an account that are not in the default authority/shoulder that the account uses for minting, logic would need to be added to allow a PID provider to be configured with an additional list of managed PIDs (or we could lookup client at provider).

Optional things that could be done:

  • move all PID-related classes into the pid package
  • separate the PID recognition functionality into an interface separate from the minting/management functionality
  • restore/expand tests that check for valid PID structure and whether specific PIDs can be managed. (Testing some things became harder when the recognition logic moved to PIDPRoviderBeans - when these are no longer beans, test instances can be created more easily.)
  • add a mechanism to select the PidProvider per collection.

Suggestions on how to test this:

New functionality: Configure the (misnamed) DOIProvider setting to "PERMA", change authority/shoulder to random values, e.g. "dansx","abc", and verify that datasets/files can be created, published, and that the citation URL shown resolves to the dataset/file specified.

Regression: Since the changes are to the overall PID mechanism, broad testing with DataCite/Handles/EZID, with various PID options (random, by stored procedure, dependent/independent) and other processes (harvesting, import/migrate APIs) could all be tested. Given that most of the changes are refactoring, I think if things still work with DataCite, they should still work with the others, so more thorough checking with DataCite and simple tests that EZID/Handles still work for basic create/publish could be sufficient.

Does this PR introduce a user interface change? If mockups are available, please link/include them here:

Is there a release notes update needed for this change?:
Could announce the Perma provider. The other refactoring probably isn't very visible.

Additional documentation:

Will add info on configuring Perma as an alternative to DataCite/EZID DOIs, Handles and Fake.

@coveralls
Copy link

coveralls commented May 24, 2022

Coverage Status

Coverage: 20.153% (+0.04%) from 20.116% when pulling 3ef94eb on GlobalDataverseCommunityConsortium:DANS/local_pid_provider into 1a79717 on IQSS:develop.

qqmyers added 21 commits August 9, 2022 15:03
Note the difference is that the dataset version was not doing a check
for the id at the remote service, just a local check. It now does both
as was the case for files.
As before the methods for DataFiles/Datasets differed in whether there's
a remote service check done which should always be included.

Also removed unused datasetIdentifier variable/setter.
left most new GlobalId(String) calls, just put in mechanism to call
PidProviders in turn and updated just one Pids API call to test it.
@kcondon kcondon assigned kcondon and unassigned qqmyers Mar 13, 2023
@kcondon
Copy link
Contributor

kcondon commented Mar 15, 2023

Issues found:

  1. exception when creating dataset w/ doi configured (fixed)
  2. some noisy/extra logging on dataset creation, file upload, dataset publish (fixed)
  3. Null ptr in server.log when creating a dataset with one file and file pid disabled. Looks like it was during export, both OAI_ORE and JSON_LD formats are missing. See attached stack trace. (fixed, thanks!)
    nofilepidpuberr.txt
  4. Stack trace error in server.log when creating dataset with sequential (storedProcGenerated) style, :IdentifierGenerationStyle. Says FK2/100002/1) already exists. When I checked db, only one identifier, with only 100002 for the dataset exists. I have not yet published so the file pid should not yet exist. Dataset remains locked. (fixed)
    seqpiderroncreateds.txt
  5. Cannot display dataset created using hdls, throws ui exception. (fixed)
  6. Cannot create dataset using EZID, UI and server.log exceptions (fixed)
    ezid_pid_ds_create_err.txt
  7. Perma citation link not redirecting/using configured doi.baseurlstring setting, instead uses default siteurl value. (fixed)
  8. Update permlinks doc section to explain default siteurl and new perma.baseurlstring setting. (fixed)
  9. Permalinks doesn't seem to support file pids, should it? (fixed)

@qqmyers
Copy link
Member Author

qqmyers commented Mar 15, 2023

3 should be fixed - there were a couple places where the change from getGlobalIdAsString() -> getGlobalId().asString() wasn't checking for a possible null for files. The stacktrace above should have caused the ORE export to fail.
Still looking at 4 - including the odd fact that dependent storedProcedure generation method doesn't seem to use the stored procedure (and didn't before this PR either).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
GDCC: DANS related to GDCC work for DANS Size: 80 A percentage of a sprint. 56 hours.
Projects
Status: No status
Development

Successfully merging this pull request may close these issues.

8 participants