GESIS: new endpoint for harvesting GESIS studies #162

Closed
cessda-bitbucket-importer opened this issue Apr 9, 2020 · 36 comments

@cessda-bitbucket-importer

Original report on BitBucket by Wolfgang Zenk-Möltgen.


We now have a new OAI-PMH endpoint available for harvesting the GESIS studies into the CDC:

http://dbkapps.gesis.org/dbkoai3/?verb=Identify

@cessda-bitbucket-importer

Original comment by John Shepherdson (GitHub: john-shepherdson).


Please add an HTTPS option as soon as possible.

@cessda-bitbucket-importer

Original comment by John Shepherdson (GitHub: john-shepherdson).


The BASE Validator (http://oval.base-search.net/) reports some issues with the implementation of the OAI-PMH protocol.

@cessda-bitbucket-importer

Original comment by Wolfgang Zenk-Möltgen.


I have corrected the validation error by including a locally provided XSD for http://purl.org/dc/elements/1.1/; there seems to be an error retrieving it online.

I have also set ListSize, the batch size for ListRecords, to 200 instead of 1000.

@cessda-bitbucket-importer

Original comment by CESSDA Support Team (GitHub: cessda).


@‌sergio.dias BASE Validator results much improved.

Screenshot 2020-04-09 at 17.17.54.png

@cessda-bitbucket-importer

Original comment by John Shepherdson (GitHub: john-shepherdson).


The endpoint doesn't appear to support the ListHeaders request, which is used by the Harvester:

2020-04-10 21:57:11.467 INFO (RepositoryUrlService.java:55) - [http://dbkapps.gesis.org/dbkoai3] Final ListHeaders Handler url [http://cdc-osmh-repo:9091/v0/ListRecordHeaders?Repository=http://dbkapps.gesis.org/dbkoai3] constructed.
2020-04-10 21:57:21.684 ERROR DefaultHarvesterConsumerService.java:75) - ListRecordHeaders failed for repo [Repo(url=http://dbkapps.gesis.org/dbkoai3, name=GESIS, handler=http://cdc-osmh-repo:9091)].%5D.) CDC Handler Error object Msg [Unsuccessful response from CDC Handler [http://cdc-osmh-repo:9091/v0/ListRecordHeaders?Repository=http://dbkapps.gesis.org/dbkoai3].].
2020-04-10 21:57:21.686 INFO efaultHarvesterConsumerService.java:122) - Nothing filterable by date [null].
2020-04-10 21:57:21.686 INFO (ConsumerScheduler.java:173) - Repo [Repo(url=http://dbkapps.gesis.org/dbkoai3, name=GESIS, handler=http://cdc-osmh-repo:9091)].%5D.) Returned with [0] record headers
2020-04-10 21:57:21.687 INFO (ConsumerScheduler.java:186) - Repo Name [repo_name=GESIS] of [repo_endpoint_url=http://dbkapps.gesis.org/dbkoai3] Endpoint. There are [present_cmm_record=0] presentCMMStudies out of [0] totalCMMStudies from [cmm_records_rejected=0] Record Identifiers. Therefore CMMStudiesRejected is 0
2020-04-10 21:57:21.687 INFO (LanguageDocumentExtractor.java:58) - Mapping CMMStudy to CMMStudyOfLanguage for SP[GESIS] with [0] records
2020-04-10 21:57:21.688 WARN (ConsumerScheduler.java:156) - CmmStudies list is empty and henceforth there is nothing to BulkIndex for repo[repo_name=GESIS] with LangIsoCode [lang_code=de].

@cessda-bitbucket-importer

Original comment by Matthew Morris (GitHub: matthew-morris-cessda).


This appears to be caused by the resumption token having illegal characters for a URI. These are now escaped by the repository handler as of https://github.com/cessda/cessda.cdc.osmh-repository-handler.oai-pmh/commit/424661e283b15e51862b3eff79967c1ef75f2f3d.

java.net.URISyntaxException: Illegal character in query at index 80: http://dbkapps.gesis.org/dbkoai3?verb=ListIdentifiers&resumptionToken=31.12.9999|31.12.1000||oai_ddi25|oai:dbk.gesis.org:DBK/ZA0209|2020-04-14T00:03:19Z
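
As an illustration only, here is a minimal Python sketch of the kind of escaping described above. It is not the repository handler's code (that lives in the linked commit), and the function name `build_resume_url` is hypothetical. It percent-encodes the token so that characters such as `|` no longer appear raw in the query string.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

BASE_URL = "http://dbkapps.gesis.org/dbkoai3"  # endpoint quoted in this thread


def build_resume_url(base_url: str, token: str) -> str:
    """Return a ListIdentifiers URL with the resumption token percent-encoded."""
    # urlencode escapes characters such as '|' that are not allowed raw in a URI query.
    query = urlencode({"verb": "ListIdentifiers", "resumptionToken": token})
    return f"{base_url}?{query}"


# Token copied from the exception message above.
token = "31.12.9999|31.12.1000||oai_ddi25|oai:dbk.gesis.org:DBK/ZA0209|2020-04-14T00:03:19Z"
url = build_resume_url(BASE_URL, token)

# Decoding the query gives back the original token, so nothing is lost by escaping.
decoded = parse_qs(urlsplit(url).query)["resumptionToken"][0]
print(url)
```

Because the server decodes the query before interpreting it, the endpoint should receive the same token it issued.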

@cessda-bitbucket-importer

Original comment by Wolfgang Zenk-Möltgen.


If any change is needed on our side regarding the illegal characters in the resumption token, let us know! Thanks so far.

@cessda-bitbucket-importer

Original comment by Matthew Morris (GitHub: matthew-morris-cessda).


We’ve fixed it on our end so no change is necessary.

@cessda-bitbucket-importer

Original comment by John Shepherdson (GitHub: john-shepherdson).


We’ve harvested 5766 GESIS records into the Staging instance.

@cessda-bitbucket-importer

Original comment by Wolfgang Zenk-Möltgen.


Great. Is it possible for me to check the staging instance? Can you tell me the address? Thanks!

@cessda-bitbucket-importer

Original comment by John Shepherdson (GitHub: john-shepherdson).


https://datacatalogue-staging.cessda.eu/

cessda

CESSDA2018

See, for example, "GESIS__oai:dbk.gesis.org:DBK/ZA5521", which shows an encoding issue.

@cessda-bitbucket-importer

Original comment by John Shepherdson (GitHub: john-shepherdson).


Regarding record DBK/ZA1129, there is an empty date field:

2020-04-14 09:11:32.406 ERROR (GetRecordController.java:84) - CustomHandlerException occurred whilst getting record message [eu.cessda.pasc.osmhhandler.oaipmh.exception.InternalSystemException: Unable to parse xml! FullUrl [http://dbkapps.gesis.org/dbkoai3?verb=GetRecord&identifier=oai:dbk.gesis.org:DBK/ZA1129&metadataPrefix=oai_ddi25]], for studyID [oai_cl_dbk_dt_gesis_dt_org_cl_DBK_sl_ZA1129]

2020-04-14 09:11:45.074 ERROR (TimeUtility.java:51) - Cannot parse date string [] using expected date formats [[yyyy-MM-dd'T'HH:mm:ss.SSSXXX, yyyy-MM-dd'T'HH:mm:ss'Z', yyyy-dd-MM HH:mm:ss.SSS, yyyy-MM-dd, yyyy-MM-dd'T'HH:mm:ssZ, yyyy-MM, yyyy]], Exception Message [{}]
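
To make the second log line easier to read, here is a hedged Python sketch of the "try each expected format in turn" approach that the message suggests. It is not the TimeUtility.java code. The `FORMATS` list is only a rough Python counterpart of the Java patterns in the log, and `parse_date` is a hypothetical name. An empty string matches none of the formats, which is why the sketch returns `None` for it.

```python
from datetime import datetime
from typing import Optional

# Rough Python counterparts of the Java patterns listed in the log line (an assumption).
FORMATS = [
    "%Y-%m-%dT%H:%M:%S.%f%z",
    "%Y-%m-%dT%H:%M:%SZ",
    "%Y-%d-%m %H:%M:%S.%f",
    "%Y-%m-%d",
    "%Y-%m-%dT%H:%M:%S%z",
    "%Y-%m",
    "%Y",
]


def parse_date(text: str) -> Optional[datetime]:
    """Return the first successful parse, or None if no format matches."""
    if not text.strip():
        # An empty date field cannot match any pattern.
        return None
    for fmt in FORMATS:
        try:
            return datetime.strptime(text, fmt)
        except ValueError:
            continue
    return None


print(parse_date(""))         # empty field, as in the log message
print(parse_date("1987-02"))  # year-month value
```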

@cessda-bitbucket-importer

Original comment by Wolfgang Zenk-Möltgen.


I can confirm that ZA5521 has an encoding issue; we need to work on that.

@cessda-bitbucket-importer

Original comment by Wolfgang Zenk-Möltgen.


I cannot see where ZA1129 has an empty date field. On the staging server, all dates seem to be present. Can you specify the field?

@cessda-bitbucket-importer

Original comment by John Shepherdson (GitHub: john-shepherdson).


Sorry, I think there was an interleaving issue with the various log messages.

There appear to be 22 records where there are issues parsing one or more dates, but the records are not rejected.

One example is ZA1537.

GESIS catalogue entry shows this:

Date of Collection: 09.1986 - 02.1987

and further down shows this:


CESSDA Catalogue parses and converts to JSON as:

"dataCollectionPeriodStartdate":"","dataCollectionPeriodEnddate":"1987-02","dataCollectionYear":0

Which is displayed by CESSDA Data Catalogue as:


@cessda-bitbucket-importer

Original comment by Wolfgang Zenk-Möltgen.


This is an example where the mapping to DDI causes problems. We have a list of collection dates, and each element is specified not with a start and an end date but with a single month. Can we use the DDI-C attribute event="single" to provide this? We currently provide it only in the event="end" field.

NB: The GESIS catalogue summarises the data collection dates: “Date of Collection: 09.1986 - 02.1987” is the summary calculated from the list below.
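
For discussion, here is a small Python sketch of how a consumer could read collDate elements that use event="single". It is an assumption about one possible mapping, not a description of what the CDC handler does. The XML fragment is a made-up example using the three months of this study, and `collection_periods` is a hypothetical name.

```python
import xml.etree.ElementTree as ET

NS = {"ddi": "ddi:codebook:2_5"}  # DDI Codebook 2.5 namespace

# Hypothetical fragment: three single-month collection dates.
SAMPLE = """<sumDscr xmlns="ddi:codebook:2_5">
  <collDate event="single" date="1986-09"/>
  <collDate event="single" date="1987-01"/>
  <collDate event="single" date="1987-02"/>
</sumDscr>"""


def collection_periods(xml_text: str):
    """Return (start, end) pairs; a 'single' date becomes its own start and end."""
    root = ET.fromstring(xml_text)
    periods = []
    pending_start = None
    for element in root.iterfind(".//ddi:collDate", NS):
        event, date = element.get("event"), element.get("date")
        if event == "single":
            periods.append((date, date))
        elif event == "start":
            pending_start = date
        elif event == "end":
            # An 'end' without a preceding 'start' leaves the start as None.
            periods.append((pending_start, date))
            pending_start = None
    return periods


print(collection_periods(SAMPLE))
```

With this reading, an event="end" element that has no matching start yields an empty start value, which resembles the JSON shown earlier in the thread.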

@cessda-bitbucket-importer

Original comment by John Shepherdson (GitHub: john-shepherdson).


@‌TainaFSD This looks like a question for the CDC User Group

@cessda-bitbucket-importer

Original comment by Wolfgang Zenk-Möltgen.


A question on language: our XML files contain both German and English, marked with the xml:lang attribute. The staging server currently displays only English. How can we make the CDC display both languages? Or do we need to decide on a single language?

@cessda-bitbucket-importer

Original comment by John Shepherdson (GitHub: john-shepherdson).


Must be a glitch, as the DEV server displays both DE and EN versions of the metadata. No action required at your end.

@cessda-bitbucket-importer

Original comment by John Shepherdson (GitHub: john-shepherdson).


GESIS records in German are now visible in the Staging version of the CDC.

@cessda-bitbucket-importer

Original comment by Wolfgang Zenk-Möltgen.


Very good! Thanks!

@cessda-bitbucket-importer

Original comment by Wolfgang Zenk-Möltgen.


I have enabled HTTPS on the endpoint now as well:

https://dbkapps.gesis.org/dbkoai3/?verb=Identify

@cessda-bitbucket-importer

Original comment by John Shepherdson (GitHub: john-shepherdson).


CDC Harvester updated to use HTTPS connection.

@cessda-bitbucket-importer

Original comment by Wolfgang Zenk-Möltgen.


Great. Today I have disabled TLS 1.0/1.1 for security reasons; I hope this does not interfere with the harvesting process.
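
For anyone who wants to check a client against this change, here is a minimal Python sketch that refuses anything older than TLS 1.2. It is only an illustration and says nothing about how the (Java) CDC Harvester is configured. The helper names `make_tls12_context` and `fetch_identify` are hypothetical.

```python
import ssl
import urllib.request


def make_tls12_context() -> ssl.SSLContext:
    """Default certificate checks, but refuse TLS 1.0 and 1.1."""
    context = ssl.create_default_context()
    context.minimum_version = ssl.TLSVersion.TLSv1_2
    return context


def fetch_identify(url: str = "https://dbkapps.gesis.org/dbkoai3/?verb=Identify") -> bytes:
    """Fetch the Identify response over a TLS 1.2+ connection (performs a network call)."""
    with urllib.request.urlopen(url, context=make_tls12_context(), timeout=30) as response:
        return response.read()


ctx = make_tls12_context()
print(ctx.minimum_version)
```

If `fetch_identify()` succeeds with this context, a client limited to TLS 1.2 or newer can still reach the endpoint.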

@cessda-bitbucket-importer

Original comment by Wolfgang Zenk-Möltgen.


Can you say when the GESIS studies will be included in the production system?

@cessda-bitbucket-importer

Original comment by John Shepherdson (GitHub: john-shepherdson).


Just testing that harvesting via HTTPS works as expected.

There is no fix at present for the issue caused by using a range of start and end dates, as discussed above. Are you happy to go live with that as is?

@cessda-bitbucket-importer

Original comment by Wolfgang Zenk-Möltgen.


OK, yes, we can live with this. It affects only 22 studies.

I am thinking about filling the start and end elements with identical values in these cases. Do you think this would be a good workaround for the moment?

@cessda-bitbucket-importer

Original comment by John Shepherdson (GitHub: john-shepherdson).


Would it be possible to use: start date = start of first wave, end date = end of final wave, collection year = year of the end date?

So the example above becomes:

"dataCollectionPeriodStartdate":"1986-09-01","dataCollectionPeriodEnddate":"1987-02-01","dataCollectionYear":1987
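
The rule proposed here can be sketched in a few lines of Python. This is only a sketch under assumptions: the wave months are given as 'YYYY-MM' strings, the day is padded to the first of the month as in the example above, and `summarise_waves` is a hypothetical name.

```python
import json


def summarise_waves(months):
    """Collapse a list of 'YYYY-MM' wave months into one start/end/year summary."""
    ordered = sorted(months)  # 'YYYY-MM' strings sort chronologically
    first, last = ordered[0], ordered[-1]
    return {
        "dataCollectionPeriodStartdate": f"{first}-01",
        "dataCollectionPeriodEnddate": f"{last}-01",
        "dataCollectionYear": int(last[:4]),
    }


# The three wave months reported for this study in the thread.
summary = summarise_waves(["1986-09", "1987-01", "1987-02"])
print(json.dumps(summary, separators=(",", ":")))
```

The intermediate wave (1987-01) does not appear in the result, which is the information loss this summary form implies.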

@cessda-bitbucket-importer

Original comment by Wolfgang Zenk-Möltgen.


For this study, that could make sense. However, we would lose the more detailed information that three waves were carried out at specific points in time. Also, for most studies we have only a single point in time, so the start and end dates would still be missing in some cases.

So is a repeatable dataCollectionPeriodStartdate/dataCollectionPeriodEnddate pair supported at the moment?

If yes, I would go for

  • "dataCollectionPeriodStartdate":"1986-09","dataCollectionPeriodEnddate":"1986-09"
  • "dataCollectionPeriodStartdate":"1987-01","dataCollectionPeriodEnddate":"1987-01"
  • "dataCollectionPeriodStartdate":"1987-02","dataCollectionPeriodEnddate":"1987-02"

If no, your proposal would be the best option.

@cessda-bitbucket-importer

Original comment by John Shepherdson (GitHub: john-shepherdson).


I don't know for certain. Can you change a couple of records to use repeatable start and end dates? I'll harvest them and let you know.

@cessda-bitbucket-importer

Original comment by John Shepherdson (GitHub: john-shepherdson).


Wolfgang, can you check that your endpoint is supporting incremental updates correctly? (see section 3 of OAI-PMH 2.0 spec).

When I do a full harvest, I get the GESIS study records. When I do an incremental harvest ('which records have been added or updated since a specified date and time?'), I get nothing. The log messages show this:

2020-04-23 10:55:23.114 INFO (ConsumerScheduler.java:169) - Processing Repo [Repo(url=https://dbkapps.gesis.org/dbkoai3, name=GESIS, handler=OAI_PMH)]
2020-04-23 10:55:23.114 INFO (RepositoryUrlService.java:54) - [https://dbkapps.gesis.org/dbkoai3] Final ListHeaders Handler url [http://cdc-osmh-repo:9091/v0/ListRecordHeaders?Repository=https%3A%2F%2Fdbkapps.gesis.org%2Fdbkoai3] constructed.
2020-04-23 10:57:58.153 INFO DefaultHarvesterConsumerService.java:117) - Returning [0] filtered recordHeaders by date greater than [2020-04-22T00:00] | out of [5769] unfiltered.
2020-04-23 10:57:58.155 INFO (ConsumerScheduler.java:173) - Repo [Repo(url=https://dbkapps.gesis.org/dbkoai3, name=GESIS, handler=OAI_PMH)]. Returned with [0] record headers
2020-04-23 10:57:58.156 INFO (ConsumerScheduler.java:186) - Repo Name [repo_name=GESIS] of [repo_endpoint_url=https://dbkapps.gesis.org/dbkoai3] Endpoint. There are [present_cmm_record=0] presentCMMStudies out of [0] totalCMMStudies from [cmm_records_rejected=0] Record Identifiers. Therefore CMMStudiesRejected is 0
2020-04-23 10:57:58.156 INFO (LanguageDocumentExtractor.java:58) - Mapping CMMStudy to CMMStudyOfLanguage for SP[GESIS] with [0] records
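
The third log line suggests that the record headers are filtered by date after they have been listed. A minimal Python sketch of such a filter follows. It is an assumption about the general approach, not the Harvester's code. The datestamps in the sample are invented for illustration, and `changed_since` is a hypothetical name.

```python
from datetime import datetime


def changed_since(headers, since):
    """Keep headers whose datestamp is strictly later than `since`."""
    return [h for h in headers if datetime.fromisoformat(h["datestamp"]) > since]


# Invented datestamps; only the identifiers come from this thread.
headers = [
    {"identifier": "oai:dbk.gesis.org:DBK/ZA1129", "datestamp": "2020-04-21T19:05:00"},
    {"identifier": "oai:dbk.gesis.org:DBK/ZA1537", "datestamp": "2020-04-22T19:05:00"},
]

# Same cut-off as in the log: 2020-04-22T00:00.
recent = changed_since(headers, datetime(2020, 4, 22, 0, 0))
print([h["identifier"] for h in recent])
```

OAI-PMH also defines a `from` argument for ListIdentifiers and ListRecords, so the same selection can in principle be requested from the endpoint rather than applied on the client.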

@cessda-bitbucket-importer

Original comment by Wolfgang Zenk-Möltgen.


I assume the problem is the condition: greater than [2020-04-22T00:00]

The update procedure runs once a day, and I remember that it is around or after 7 PM my time. So the current records are all from yesterday evening.

There has always been the additional issue that our process recreates all the files each day, and OAI uses the file creation date. I will look into that, so that we only write new files for changed metadata records.
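
One way to write files only for changed records is to compare the new content with what is already on disk before writing. The Python sketch below illustrates that idea under assumptions. It is not the GESIS export procedure, and the file name and the helper `write_if_changed` are hypothetical.

```python
import tempfile
from pathlib import Path


def write_if_changed(path: Path, content: bytes) -> bool:
    """Write `content` only if it differs from the existing file; report whether it wrote."""
    if path.exists() and path.read_bytes() == content:
        # Unchanged record: leave the file and its timestamps alone.
        return False
    path.write_bytes(content)
    return True


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "ZA1537.xml"  # hypothetical file name
    results = [
        write_if_changed(target, b"<codeBook/>"),              # new file
        write_if_changed(target, b"<codeBook/>"),              # identical content
        write_if_changed(target, b"<codeBook version='2'/>"),  # changed content
    ]

print(results)
```

Skipping the write for identical content keeps the file date tied to the last real change, which is what a date-based incremental harvest relies on.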

@cessda-bitbucket-importer

Original comment by Wolfgang Zenk-Möltgen.


I have updated the procedure so that collDate is now always present with a start and an end date, and it is repeated when there are multiple entries. All studies were updated manually last night.

@cessda-bitbucket-importer

Original comment by Wolfgang Zenk-Möltgen.


Also, in the future only new or changed studies will be updated on the file storage, so that the OAI-PMH date should indicate the actual release date of a study.

@cessda-bitbucket-importer

Original comment by John Shepherdson (GitHub: john-shepherdson).


@‌TainaFSD I want to include GESIS in the catalogue before the end of April. Not all of the requested UI changes have been made yet, but the remaining ones can be rolled-out later. So I propose releasing the version of CDC that is currently on Staging. Are you OK with that? Authorise release via issue #170

@cessda-bitbucket-importer

Original comment by John Shepherdson (GitHub: john-shepherdson).


Deployed to production on 3 May 2020, as CDC v2.2.1
