GESIS: new endpoint for harvesting GESIS studies #162
Comments
Original comment by John Shepherdson (GitHub: john-shepherdson). Please add HTTPS option asap |
Original comment by John Shepherdson (GitHub: john-shepherdson). Some issues with implementation of OAI-PMH protocol, according to BASE Validator (http://oval.base-search.net/) |
Original comment by Wolfgang Zenk-Möltgen. I have corrected the validation error by including a locally provided xsd for http://purl.org/dc/elements/1.1/ - seems to be an error retrieving it online. Also set the ListSize to 200, not 1000 for the ListRecords batch size. |
Original comment by CESSDA Support Team (GitHub: cessda). @sergio.dias BASE Validator results much improved. |
Original comment by John Shepherdson (GitHub: john-shepherdson). Endpoint doesn’t appear to support ListHeaders verb, which is used by Harvester: 2020-04-10 21:57:11.467 INFO (RepositoryUrlService.java:55) - [http://dbkapps.gesis.org/dbkoai3] Final ListHeaders Handler url [http://cdc-osmh-repo:9091/v0/ListRecordHeaders?Repository=http://dbkapps.gesis.org/dbkoai3] constructed. |
Original comment by Matthew Morris (GitHub: matthew-morris-cessda). This appears to be caused by the resumption token having illegal characters for a URI. These are now escaped by the repository handler as of https://github.com/cessda/cessda.cdc.osmh-repository-handler.oai-pmh/commit/424661e283b15e51862b3eff79967c1ef75f2f3d.
|
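As a rough illustration of the fix described above, here is a minimal Python sketch of percent-encoding a resumption token before it is placed in a request URL. It is not the code from the linked commit (the repository handler is a separate Java component); the function name `build_resume_url` and the sample token are hypothetical.

```python
from urllib.parse import quote


def build_resume_url(base_url: str, token: str) -> str:
    """Build a ListRecords continuation URL with the token percent-encoded."""
    # safe="" makes quote() escape every reserved character, including "/".
    return f"{base_url}?verb=ListRecords&resumptionToken={quote(token, safe='')}"


# Hypothetical token containing characters that are not legal in a URI query.
sample_token = "a|b c"
print(build_resume_url("https://dbkapps.gesis.org/dbkoai3", sample_token))
```

Encoding on the client side means the endpoint does not need to change its token format, which matches the outcome reported in this thread.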
Original comment by Wolfgang Zenk-Möltgen. If any change is needed regarding the resumption token having illegal characters, let us know! Thanks so far :slight_smile: |
Original comment by Matthew Morris (GitHub: matthew-morris-cessda). We’ve fixed it on our end so no change is necessary. |
Original comment by John Shepherdson (GitHub: john-shepherdson). We’ve harvested 5766 GESIS records into the Staging instance. |
Original comment by Wolfgang Zenk-Möltgen. Great - is it possible for me to check the staging instance? (can you tell the address ?) Thanks! |
Original comment by John Shepherdson (GitHub: john-shepherdson). https://datacatalogue-staging.cessda.eu/ cessda CESSDA2018 see e.g. "GESIS__oai:dbk.gesis.org:DBK/ZA5521" for an example of an encoding issue. |
Original comment by John Shepherdson (GitHub: john-shepherdson). Re record
|
Original comment by Wolfgang Zenk-Möltgen. I can confirm that ZA5521 has an encoding issue, we need to work on that. |
Original comment by Wolfgang Zenk-Möltgen. I cannot see where ZA1129 has an empty date field? At the staging server, all dates seem to be present. Can you specify the field? |
Original comment by John Shepherdson (GitHub: john-shepherdson). Sorry, I think there was an interleaving issue with the various log messages. There appear to be 22 records where one or more dates cannot be parsed, but the records are not rejected. One example is ZA1537. The GESIS catalogue entry shows this: Date of Collection: 09.1986 - 02.1987 and further down shows this: CESSDA Catalogue parses and converts to JSON as:
Which is displayed by CESSDA Data Catalogue as: |
Original comment by Wolfgang Zenk-Möltgen. This is an example where the mapping to DDI causes problems: we have a list of collection dates, and each element specifies a single month rather than a start and end date. Can we use the attribute event=”single” of DDI-C to provide this? We currently provide it only in the event=”end” field. NB: The GESIS catalogue shows a summary of the data collection dates: “Date of Collection: 09.1986 - 02.1987” is that summary, calculated from the list below. |
Original comment by John Shepherdson (GitHub: john-shepherdson). @TainaFSD This looks like a question for the CDC User Group |
Original comment by Wolfgang Zenk-Möltgen. A question on language: Our XML files contain both German and English, using the attribute “xml:lang”. The staging server currently displays only English. Is it possible to make CDC display both languages, or do we need to decide on a single language? |
Original comment by John Shepherdson (GitHub: john-shepherdson). Must be a glitch, as the DEV server displays both DE and EN versions of the metadata. No action required at your end. |
Original comment by John Shepherdson (GitHub: john-shepherdson). GESIS records in German are now visible in Staging version of CDC. |
Original comment by Wolfgang Zenk-Möltgen. Very good! Thanks! |
Original comment by Wolfgang Zenk-Möltgen. I have enabled HTTPS on the endpoint now as well: https://dbkapps.gesis.org/dbkoai3/?verb=Identify |
Original comment by John Shepherdson (GitHub: john-shepherdson). CDC Harvester updated to use HTTPS connection. |
Original comment by Wolfgang Zenk-Möltgen. Great. Today I have disabled TLS 1.0/1.1 for security reasons, I hope this does not interfere with the harvesting process. |
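For anyone wanting to check a client against an endpoint that no longer accepts TLS 1.0/1.1, a minimal Python sketch follows. It only shows how a client can refuse anything older than TLS 1.2; the CDC Harvester itself is a separate application, and nothing here reflects its actual configuration.

```python
import ssl

# Default context: certificate verification and hostname checking enabled.
ctx = ssl.create_default_context()

# Refuse to negotiate anything older than TLS 1.2.
ctx.minimum_version = ssl.TLSVersion.TLSv1_2

print(ctx.minimum_version.name)
```

The context can then be passed to `urllib.request.urlopen(url, context=ctx)` to request the Identify URL given above; a handshake failure would indicate a protocol mismatch between client and server.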
Original comment by Wolfgang Zenk-Möltgen. Can you say at which point the GESIS studies will be included into the production system? |
Original comment by John Shepherdson (GitHub: john-shepherdson). Just testing that harvesting via HTTPS works as expected. There is no fix at present for the issue caused by using a range of start and end dates, as discussed above. Are you happy to go live with that as is? |
Original comment by Wolfgang Zenk-Möltgen. OK, yes we can live with this. It affects 22 studies only. I am thinking about filling the start and end elements with identical values in these cases. Do you think this would be a good workaround for the moment? |
Original comment by John Shepherdson (GitHub: john-shepherdson). Would it be possible to use Start date = start of first wave. End date = end of final wave. Collection year = End date year. So the example above becomes:
|
Original comment by Wolfgang Zenk-Möltgen. For this study, it could make sense. However, we would lose the more detailed information, that three waves were carried out at certain points of time. Also, for most studies the situation is that we only have one point of time. This would still lack start and end date in some cases. So is a repeatable If yes, I would go for
If no, your proposal would be the best option. |
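The proposal discussed above (start date = start of first wave, end date = end of final wave, collection year = end date year) can be sketched in Python as below. This is an illustration only, not the GESIS export procedure: the function name `summarise_waves` is hypothetical, the middle wave in the sample data is invented for the example, and expanding month-only values to the first and last day of the month is an assumption.

```python
import calendar
from datetime import date


def summarise_waves(waves):
    """Collapse a list of (year, month) waves into one collection period."""
    first = min(waves)
    last = max(waves)
    # Assumption: a month-only value starts on day 1 and ends on its last day.
    start = date(first[0], first[1], 1)
    end = date(last[0], last[1], calendar.monthrange(last[0], last[1])[1])
    return {
        "start": start.isoformat(),
        "end": end.isoformat(),
        "collection_year": end.year,
    }


# First and last wave follow the ZA1537 summary "09.1986 - 02.1987";
# the middle wave (1986, 11) is a made-up placeholder.
print(summarise_waves([(1986, 9), (1986, 11), (1987, 2)]))
```

As noted in the reply above, collapsing the list this way loses the per-wave detail, which is the trade-off against repeating the start and end dates for each wave.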
Original comment by John Shepherdson (GitHub: john-shepherdson). I don’t know for certain. Can you change a couple of records to use repeatable start and end dates, and I’ll harvest them and let you know. |
Original comment by John Shepherdson (GitHub: john-shepherdson). Wolfgang, can you check that your endpoint is supporting incremental updates correctly? (see section 3 of OAI-PMH 2.0 spec). When I do a full harvest, I get the GESIS study records. When I do an incremental harvest ('which records have been added or updated since a specified date and time?'), I get nothing. The log messages show this: 2020-04-23 10:55:23.114 INFO (ConsumerScheduler.java:169) - Processing Repo [Repo(url=https://dbkapps.gesis.org/dbkoai3, name=GESIS, handler=OAI_PMH)] |
Original comment by Wolfgang Zenk-Möltgen. I assume the problem is the condition: greater than [2020-04-22T00:00] The update procedure runs once a day, and I remember that it is around or after 7 PM my time. So the current records are all from yesterday evening. There was always the additional issue that our process creates all the files each day, and OAI uses the file creation date. I will look into that, so that we only write new files for changed metadata records. |
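For reference, an incremental (selective) OAI-PMH harvest passes a `from` argument with the ListRecords verb, and the repository returns records whose datestamp is on or after that value. The Python sketch below builds such a request. It is not the Harvester's code: the function name is hypothetical, `oai_dc` is used only because every OAI-PMH repository must support it, and seconds-level granularity is an assumption (the Identify response states what an endpoint actually supports).

```python
from datetime import datetime, timezone
from urllib.parse import urlencode


def incremental_url(base_url, since, metadata_prefix="oai_dc"):
    """Build a ListRecords URL for records changed on or after `since`."""
    params = {
        "verb": "ListRecords",
        "metadataPrefix": metadata_prefix,
        # OAI-PMH datestamps are expressed in UTC.
        "from": since.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
    }
    return f"{base_url}?{urlencode(params)}"


since = datetime(2020, 4, 22, 0, 0, tzinfo=timezone.utc)
print(incremental_url("https://dbkapps.gesis.org/dbkoai3", since))
```

Because selection depends on each record's datestamp, a process that rewrites every file daily makes every record look changed, which is the side issue raised in the comment above.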
Original comment by Wolfgang Zenk-Möltgen. I have updated the procedure so that collDate is now always present with a start and end date, and it is repeated when there are multiple entries. There was a manual update of all studies last night. |
Original comment by Wolfgang Zenk-Möltgen. Also, in the future only new or changed studies will be updated on the file storage, so that the OAI-PMH date should indicate the actual release date of a study. |
Original comment by John Shepherdson (GitHub: john-shepherdson). @TainaFSD I want to include GESIS in the catalogue before the end of April. Not all of the requested UI changes have been made yet, but the remaining ones can be rolled-out later. So I propose releasing the version of CDC that is currently on Staging. Are you OK with that? Authorise release via issue #170 |
Original comment by John Shepherdson (GitHub: john-shepherdson). Deployed to production on 3 May 2020, as CDC v2.2.1 |
Original report on BitBucket by Wolfgang Zenk-Möltgen.
We now have a new OAI-PMH endpoint available for harvesting the GESIS studies into the CDC:
http://dbkapps.gesis.org/dbkoai3/?verb=Identify