Changelog

All notable changes to the OSMH Consumer Indexer will be documented in this file.

The format is based on Keep a Changelog and this project adheres to Semantic Versioning.

For each release, use the following subsections:

  • Added (for new features)
  • Changed (for changes in existing functionality)
  • Deprecated (for soon-to-be removed features)
  • Removed (for now removed features)
  • Fixed (for any bug fixes)
  • Security (in case of vulnerabilities)

3.7.0 - 2024-11-12

Added

  • Added identifier information, such as an ORCID, to the creators field (#290)
  • Parse Data Access as Open/Restricted according to CV values or free text mappings (#557)
  • Parse series information which includes names, descriptions and URIs (#559)
  • Parse data kind and general data format into the CMMStudy model (#567)

Changed

  • Keep the funding agency even if the grant number is missing (#560)
  • Copy PID, creator and URL elements without an xml:lang attribute to all languages (#674)

Fixed

  • Restart the Elasticsearch client if it shuts down due to an OOM error (PR-121)
  • Fix OOM errors by setting the maximum XML file size to be parsed to 20 MB (#678)
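
A size guard of the kind described in #678 could look like the sketch below. This is an illustration only, not the indexer's actual code: the class and method names are hypothetical, and reading "20 MB" as 20 × 1024 × 1024 bytes is an assumption.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical helper; the names are not taken from the indexer. */
final class XmlSizeGuard {
    /** 20 MB limit from the changelog entry; the binary interpretation is an assumption. */
    static final long MAX_XML_BYTES = 20L * 1024 * 1024;

    private XmlSizeGuard() {
    }

    /** Returns true if a file of the given size is small enough to be parsed. */
    static boolean isWithinLimit(long sizeInBytes) {
        return sizeInBytes <= MAX_XML_BYTES;
    }

    /** Checks the size on disk before any parser is created, so oversized files are never loaded. */
    static boolean shouldParse(Path xmlFile) {
        try {
            return isWithinLimit(Files.size(xmlFile));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Checking the size before parsing keeps the cost of rejecting a file constant, whereas a DOM parser would have to read the whole document into memory first.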

3.6.0 - 2024-06-18

Added

  • Add support for parsing funding information into the CMMStudy model (#560)
  • Add support for parsing DDI 3.2 and 3.3 lifecycle instance documents (#626)
  • Add support for parsing DDI 3.2 and 3.3 lifecycle fragment documents (#652)

Changed

  • Updated OpenJDK to version 21 (#657)
  • Updated SQAAaS badge (#661)

Removed

  • Removed mapping related publications from other languages into the current language if not present (#502)

Fixed

  • Fixed instances where structured logging fields wouldn't be set as expected (PR-81)

3.5.0 - 2024-01-30

Added

  • Parse dataAccessUrl from useStmt/specPerm (#606)
  • Added XPaths for DDI 3.x documents (#622)
  • Implemented basic parsing of DDI 3.x documents (#625)

Changed

  • Edited mappings to normalise the classifications and keywords fields (#609)

3.4.0 - 2023-08-29

Changed

Removed

  • Removed the ability to harvest OAI-PMH endpoints (#533)

3.2.0 - 2022-12-08

Added

  • Parse multipart language codes such as en-GB (#219)
  • Parse relPubl DDI elements into related publications entries (#471)
  • Delete studies from the Elasticsearch index if the source XML is no longer present (#486)
  • Parse universe DDI elements (#499)
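
As an illustration of handling multipart language codes such as en-GB (#219), the sketch below extracts the primary language subtag with the standard `java.util.Locale` API. Whether the indexer reduces `en-GB` to `en` in this way is an assumption, and `LanguageCodes` is a hypothetical name.

```java
import java.util.Locale;

/** Hypothetical helper; not the indexer's actual implementation. */
final class LanguageCodes {
    private LanguageCodes() {
    }

    /** Returns the primary language subtag of a BCP 47 tag, e.g. "en" for "en-GB". */
    static String primaryLanguage(String xmlLang) {
        return Locale.forLanguageTag(xmlLang.trim()).getLanguage();
    }
}
```

For example, `LanguageCodes.primaryLanguage("en-GB")` returns `"en"`, and a plain code such as `"fi"` is returned unchanged.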

Changed

  • Optimise the loading of repository configurations to avoid delays while repositories are being discovered (#409)

3.0.2 - 2022-09-06

Changed

  • Update Elasticsearch to version 7 (#429)

3.0.0 - 2022-06-07

Added

  • Filter out invalid URLs from the study URL field (#390)
  • Add a publisherFilter field (#430)

Changed

  • Updated Elasticsearch imports (#269)
  • Generate identifiers based on the CDC identifier specification (#386)
  • Convert the indexer into a command line application that can be run as a scheduled task (#392)
  • Add support for defining repositories using a pipeline.json file. Update ReadMe file accordingly (#409)
  • Refactor the indexer, simplify configuration (#428)

Removed

  • Remove Spring Data Elasticsearch as a dependency, use the Elasticsearch client directly (#405)

2.5.0 - 2021-11-25

Added

  • Added support for indexing a repository from a folder on disk (#146)
  • Add support for Year-Month dates to Elasticsearch (#352)
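
To illustrate what accepting Year-Month dates (#352) involves, the sketch below normalises both `yyyy-MM` and `yyyy-MM-dd` values to a `LocalDate` using `java.time`. Treating a Year-Month value as the first day of that month is an assumption made for this example, and `PartialDates` is a hypothetical name; the indexer may represent such dates differently.

```java
import java.time.LocalDate;
import java.time.YearMonth;

/** Hypothetical helper; not the indexer's actual implementation. */
final class PartialDates {
    private PartialDates() {
    }

    /**
     * Parses "yyyy-MM" or "yyyy-MM-dd".
     * A Year-Month value is mapped to the first day of the month (an assumption).
     */
    static LocalDate startOf(String value) {
        String trimmed = value.trim();
        if (trimmed.length() == 7) {
            // Seven characters, e.g. "2021-11": parse as a Year-Month
            return YearMonth.parse(trimmed).atDay(1);
        }
        return LocalDate.parse(trimmed);
    }
}
```

With this sketch, `PartialDates.startOf("2021-11")` yields `2021-11-01`, while a full date such as `2021-11-25` is parsed as-is.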

Changed

  • Updated UniData's Endpoint (#366)
  • Updated OpenJDK to 17 (#269)
  • Removed usages of Spring Data Elasticsearch and replaced them with direct use of the Elasticsearch client (#146)

Fixed

  • Fixed the metadata prefix always being null (#385)
  • Fixed some code smells identified by SonarQube (#369)
  • Fixed ElasticsearchSet throwing an ArrayIndexOutOfBoundsException when accessing an Elasticsearch scroll that does not have a scroll ID (#INDEXER-2)

2.4.0 - 2021-06-23

Added

  • Log the time it takes to harvest each repository (#296)
  • Use the autocomplete analyser in Elasticsearch for the abstract, country and term fields (#285)

Changed

  • Update GESIS' repository endpoint URL (#303)
  • Update SoDaNet's repository endpoint URL (#277)
  • Support Elasticsearch 6.8, migrate to the Elasticsearch REST client and remove the Transport Client (#312)
    • This allows the indexer to connect to secured Elasticsearch clusters
  • Update Spring Boot to 2.2.13 (#312)
  • Disable dynamic Elasticsearch mapping (#312)
  • Add support for Elasticsearch security (#321)

Fixed

  • Fix nested fields in Elasticsearch not being searchable (#335)

2.3.1 - 2021-02-11

Added

  • Add ADP to the list of harvested endpoints (#201)

Changed

  • Included DANS twice with different metadata parameters to pick up English and Dutch study versions (#280)
  • Improved the debug logging of studies dropped for having no languages with the minimum required fields (#278)

2.3.0 - 2021-02-09

10.5281/zenodo.4525896

Added

  • Add HTTP compression to the repository handler (#167)
  • Add Code of Conduct file (#174)
  • Add new PROGEDO endpoint (#177)
  • Harvest each repository endpoint with a dedicated thread (#178, #225)
  • Add SODA endpoint (#190)
  • Add option to set default language as part of endpoint specification (#192)
  • Add more details to 'Configured Repos' log output (#195)
  • Add code as an additional field in the indexer model (#199)
  • Add ADP Kuha2 Endpoint (#201)
  • Add stopwords for Hungarian and Portuguese language analysers (#204)
  • Improve the logging of remote repository handlers (#207)
  • Implement a country filter so that only countries with ISO country codes are accepted (#214)
  • Delete inactive records from Elasticsearch (#217)
  • Add a run_type variable to the logs to distinguish different types of harvester runs (#227)

Changed

  • Remove "not available" if no PID agency is present (#156)
  • Revise XML Schema Definition to ensure compliance with system implementation (#59)
  • Search Optimisation (#131)
  • Modify the harvester to output the required logs (#159)
  • Disable access to external XML entities in the repository handlers (#176)
  • Log statistics for created, deleted and updated studies (#181)
  • Clean up the publisher filter (#183)
  • Update Elasticsearch to 5.6 (#188)
  • Support Spring Boot Admin 2 for metrics and remote management (#191, #211)
  • Add more details to 'Configured Repos' log output (#194)
  • Change SODA publisher name (#197)
  • Update SND set spec (#200)
  • Refine the list of fields to be indexed (#238)
  • Map langAvailableIn as a keyword, so that it can be used for sorting and filtering (#241)
  • Add a search field for country metadata (#252)
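
Disabling access to external XML entities (#176) is commonly done by hardening the JAXP parser factory. The sketch below shows one way to do this with the JDK's built-in parser; it is not taken from the indexer, and the `SafeXml` class and its methods are hypothetical. Rejecting DOCTYPE declarations outright is a stricter choice than the changelog entry requires and is assumed here for simplicity.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.XMLConstants;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;

/** Hypothetical helper; not the indexer's actual implementation. */
final class SafeXml {
    private SafeXml() {
    }

    /** Creates a factory that refuses DOCTYPE declarations, and therefore external entities. */
    static DocumentBuilderFactory hardenedFactory() {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            factory.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true);
            // Xerces feature supported by the JDK's default parser: reject any DOCTYPE
            factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
            factory.setXIncludeAware(false);
            factory.setExpandEntityReferences(false);
            return factory;
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException("XML parser does not support the required features", e);
        }
    }

    /** Returns true if the document parses with the hardened factory, false otherwise. */
    static boolean parses(String xml) {
        try {
            hardenedFactory().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            return true;
        } catch (Exception e) {
            // SAXException, IOException or ParserConfigurationException: treat as rejected
            return false;
        }
    }
}
```

With this configuration an ordinary document such as `<a/>` parses normally, while a document that declares an external entity in a DOCTYPE is rejected before the entity can be resolved.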

Fixed

  • Set the study url field from any language before replacing it with the language specific element (#142)
  • Fix alphabetical sorting issues caused by not normalising upper and lower case letters (#171)
  • Fix rejection reason not showing in the logs (#184)
  • Cleanup code (#203)
  • Fix title ascending/descending sort options not functioning (#209)

2.2.1 - 2020-05-04

Added

  • new GESIS endpoint (#162)
  • file appender
  • format error log message for successful indexing
  • implemented correlation ID using MDC.putCloseable
  • correlation ID to the log messages
  • dependency for JSON logging support (logstash-logback-encoder 5.2)

Changed

  • changed GESIS endpoint from HTTP to HTTPS (#162)
  • use Java Time APIs for the PerfRequestSyncInterceptor stopwatch
  • increased test coverage
  • updated SonarQube scanner to 3.7.0
  • updated Spring Boot to 1.5.21
  • unified timeout and SSL verification settings
  • refined error log for unsuccessful indexing
  • marked all utility classes as final
  • close the Elasticsearch client on shutdown
  • revised and re-ordered list of endpoints
  • use Jib to containerise the indexer
  • updated Maven wrapper to 0.5.3
  • refactored the error handling code in DaoBase.postForStringResponse() to better align with Java best practices
  • refactored exception handling to avoid catching RuntimeException and a cast
  • print the config in StatusService.printPaSCHandlerOaiPmhConfig() directly
  • change behaviour when Study PID Agency is not specified. Before: '10.5279/DK-SA-DDA-868 (not available)'. After: '10.5279/DK-SA-DDA-868 (Agency not available)' (#156)
  • log queries at the info level
  • moved recursion out of the try-with-resources block to reduce resource consumption
  • reformatted the message when the record headers could not be parsed (because the parser could have failed at any point and left the InputStream in an inconsistent state)
  • use input streams instead of strings (avoids a double copy)
  • renamed dev profile to gcp
  • improved logging to help determine quality of harvested metadata (#191)

Removed

  • catches of RuntimeException in ESIngestService
  • option to disable HTTPS verification
  • unnecessary null check

Fixed

  • compiler warnings, as recommended by Error Prone
  • time zone bugs
  • logging pattern for the file logger
  • unused micrometer dependencies
  • unused DocumentBuilder bean
  • issues reported by SonarQube
  • register DocumentBuilderFactories as beans instead of DocumentBuilders. DocumentBuilders are not thread safe and need resetting after use. DocumentBuilderFactory.newDocumentBuilder() is thread safe and should be used instead
  • logs not showing in Spring Boot Admin
  • encoded the resumption token in case characters invalid for URIs are returned

Security

  • verify SSL
  • removed the option to disable HTTPS verification