Changelog

All notable changes to the OSMH Consumer Indexer will be documented in this file.

The format is based on Keep a Changelog and this project adheres to Semantic Versioning.

For each release, use the following subsections:

Added (for new features)
Changed (for changes in existing functionality)
Deprecated (for soon-to-be removed features)
Removed (for now removed features)
Fixed (for any bug fixes)
Security (in case of vulnerabilities)

3.7.0 - 2024-11-12

Added

Added identifier information, such as an ORCID, to the creators field (#290)
Parse Data Access as Open/Restricted according to CV values or free text mappings (#557)
Parse series information which includes names, descriptions and URIs (#559)
Parse data kind and general data format into the CMMStudy model (#567)

Changed

Keep the funding agency even if the grant number is missing (#560)
Copy PID, creator and URL elements without an xml:lang attribute to all languages (#674)

Fixed

Restart the Elasticsearch client if it shuts down due to an OOM error (PR-121)
Fix OOM errors by setting the maximum XML file size to be parsed to 20 MB (#678)
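
The size limit above can be pictured as a pre-parse check. This is a minimal sketch, not the indexer's actual code: the class, constant and method names are hypothetical, and reading 20 MB as binary megabytes is an assumption.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical guard; names are illustrative and not taken from the indexer. */
final class XmlSizeGuardSketch {
    // 20 MB as stated in the changelog entry; binary megabytes are assumed here.
    static final long MAX_XML_BYTES = 20L * 1024 * 1024;

    /** Returns true if a file of the given size may be handed to the XML parser. */
    static boolean isWithinLimit(long sizeInBytes) {
        return sizeInBytes <= MAX_XML_BYTES;
    }

    /** Checks the size on disk before any parsing is attempted. */
    static boolean isParsable(Path xmlFile) throws IOException {
        return isWithinLimit(Files.size(xmlFile));
    }

    public static void main(String[] args) {
        System.out.println(isWithinLimit(5L * 1024 * 1024));
    }
}
```

Checking the size before opening the file means an oversized document is skipped without allocating a DOM for it.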

3.6.0 - 2024-06-18

Added

Add support for parsing funding information into the CMMStudy model (#560)
Add support for parsing DDI 3.2 and 3.3 lifecycle instance documents (#626)
Add support for parsing DDI 3.2 and 3.3 lifecycle fragment documents (#652)

Changed

Updated OpenJDK to version 21 (#657)
Updated SQAAaS badge (#661)

Removed

Removed mapping related publications from other languages into the current language if not present (#502)

Fixed

Fixed instances where structured logging fields wouldn't be set as expected (PR-81)

3.5.0 - 2023-01-30

Added

Parse dataAccessUrl from useStmt/specPerm (#606)
Added XPaths for DDI 3.x documents (#622)
Implemented basic parsing of DDI 3.x documents (#625)

Changed

Edited mappings to normalise the classifications and keywords fields (#609)

3.4.0 - 2023-08-29

Changed

Replaced the deprecated Elasticsearch RestHighLevelClient with the new Elasticsearch client (#539)
- See https://www.elastic.co/guide/en/elasticsearch/client/java-api-client/8.7/migrate-hlrc.html for details on what has changed between the old and new clients
Converted all the models to Java records (PR-25)
Refactor the indexing pipeline so that each XML is parsed asynchronously (PR-28)
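
As an illustration of parsing each XML asynchronously, the sketch below submits one task per document and then collects the results. It assumes `CompletableFuture` on the common pool; the changelog does not say which mechanism the indexer uses, and `AsyncParsingSketch` and `parseAll` are hypothetical names.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.function.Function;

/** Hypothetical sketch; not the indexer's actual pipeline. */
final class AsyncParsingSketch {

    /** Starts one asynchronous parse per document, then waits for all of them in input order. */
    static <T> List<T> parseAll(List<String> xmlDocuments, Function<String, T> parser) {
        List<CompletableFuture<T>> futures = xmlDocuments.stream()
            .map(xml -> CompletableFuture.supplyAsync(() -> parser.apply(xml)))
            .toList();
        // join() blocks until each task has finished and rethrows parser failures unchecked.
        return futures.stream().map(CompletableFuture::join).toList();
    }

    public static void main(String[] args) {
        // String::length stands in for a real XML parser here.
        System.out.println(parseAll(List.of("<a/>", "<bb/>"), String::length));
    }
}
```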

Removed

Removed the ability to harvest OAI-PMH endpoints (#533)

3.2.0 - 2022-12-08

Added

Parse multipart language codes such as en-GB (#219)
Parse relPubl DDI elements into related publications entries (#471)
Delete studies from the Elasticsearch index if the source XML is no longer present (#486)
Parse universe DDI elements (#499)

Changed

Optimise the loading of repository configurations to avoid delays while repositories are being discovered (#409)

3.0.2 - 2022-09-06

Changed

Update Elasticsearch to version 7 (#429)

3.0.0 - 2022-06-07

Added

Filter out invalid URLs from the study URL field (#390)
Add a publisherFilter field (#430)

Changed

Updated Elasticsearch imports (#269)
Generate identifiers based on the CDC identifier specification (#386)
Convert the indexer into a command line application that can be run as a scheduled task (#392)
Add support for defining repositories using a pipeline.json file. Update ReadMe file accordingly (#409)
Refactor the indexer, simplify configuration (#428)

Removed

Remove Spring Data Elasticsearch as a dependency, use the Elasticsearch client directly (#405)

2.5.0 - 2021-11-25

Added

Added support for indexing a repository from a folder on disk (#146)
Add support for Year-Month dates to Elasticsearch (#352)
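
For readers unfamiliar with year-month dates, the following sketch shows how a value such as 2021-11 can be recognised with `java.time.YearMonth`. It is an assumption that the indexer handles these values in a similar way; the class and method names are hypothetical.

```java
import java.time.YearMonth;
import java.time.format.DateTimeParseException;
import java.util.Optional;

/** Hypothetical sketch; not the indexer's actual date handling. */
final class YearMonthSketch {

    /** Returns the parsed value for strings in yyyy-MM form, otherwise an empty Optional. */
    static Optional<YearMonth> parseYearMonth(String value) {
        try {
            return Optional.of(YearMonth.parse(value));
        } catch (DateTimeParseException e) {
            // Full dates, bare years and free text all end up here.
            return Optional.empty();
        }
    }

    public static void main(String[] args) {
        System.out.println(parseYearMonth("2021-11").isPresent());
    }
}
```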

Changed

Updated UniData's Endpoint (#366)
Updated OpenJDK to 17 (#269)
Removed usages of Spring Data Elasticsearch and replaced them with direct use of the Elasticsearch client (#146)

Fixed

Fixed the metadata prefix always being null (#385)
Fixed some code smells identified by SonarQube (#369)
Fixed ElasticsearchSet throwing an ArrayIndexOutOfBoundsException when accessing an Elasticsearch scroll that does not have a scroll ID (#INDEXER-2)

2.4.0 - 2021-06-23

Added

Log the time it takes to harvest each repository (#296)
Use the autocomplete analyser in Elasticsearch for the abstract, country and term fields (#285)

Changed

Update GESIS' repository endpoint URL (#303)
Update SoDaNet's repository endpoint URL (#277)
Support Elasticsearch 6.8, migrate to the Elasticsearch REST client and remove the Transport Client (#312)
- This allows the indexer to connect to secured Elasticsearch clusters
Update Spring Boot to 2.2.13 (#312)
Disable dynamic Elasticsearch mapping (#312)
Add support for Elasticsearch security (#321)

Fixed

Fix nested fields in Elasticsearch not being searchable (#335)

2.3.1 - 2021-02-11

Added

Add ADP to the list of harvested endpoints (#201)

Changed

Included DANS twice with different metadata parameters to pick up English and Dutch study versions (#280)
Improved the debug logging of studies dropped for having no languages with the minimum required fields (#278)

2.3.0 - 2021-02-09

DOI: 10.5281/zenodo.4525896

Added

Add HTTP compression to the repository handler (#167)
Add Code of Conduct file (#174)
Add new PROGEDO endpoint (#177)
Harvest each repository endpoint with a dedicated thread (#178, #225)
Add SODA endpoint (#190)
Add option to set default language as part of endpoint specification (#192)
Add more details to 'Configured Repos' log output (#195)
Add code as an additional field in the indexer model (#199)
Add ADP Kuha2 Endpoint (#201)
Add stopwords for Hungarian and Portuguese language analysers (#204)
Improve the logging of remote repository handlers (#207)
Implement a country filter so that only countries with ISO country codes are accepted (#214)
Delete inactive records from Elasticsearch (#217)
Add a run_type variable to the logs to distinguish different types of harvester runs (#227)
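
The country filter in the list above can be sketched with the JDK's own table of ISO 3166 two-letter codes. This is an illustration only: whether the indexer relies on `Locale.getISOCountries()` is an assumption, as is normalising case and whitespace, and `CountryFilterSketch` is a hypothetical name.

```java
import java.util.Locale;
import java.util.Set;

/** Hypothetical sketch; not the indexer's actual filter. */
final class CountryFilterSketch {
    // Two-letter ISO 3166 country codes known to the running JDK.
    private static final Set<String> ISO_CODES = Set.of(Locale.getISOCountries());

    /** Accepts a value only if it is a known ISO country code (case and padding are ignored here). */
    static boolean isAccepted(String code) {
        return code != null && ISO_CODES.contains(code.trim().toUpperCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        System.out.println(isAccepted("FI"));
    }
}
```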

Changed

Remove "not available" if no PID agency is present (#156)
Revise XML Schema Definition to ensure compliance with system implementation (#59)
Search Optimisation (#131)
Modify Harvester to output Required logs (#159)
Disable access to external XML entities in the repository handlers (#176)
Log statistics for created, deleted and updated studies (#181)
Cleaning Publisher filter (#183)
Update Elasticsearch to 5.6 (#188)
Support Spring Boot Admin 2 for metrics and remote management (#191, #211)
Add more details to 'Configured Repos' log output (#194)
Change SODA publisher name (#197)
Update SND set spec (#200)
Refine the list of fields to be indexed (#238)
Map langAvailableIn as a keyword, so that it can be used for sorting and filtering (#241)
Add a search field for country metadata (#252)
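
Disabling access to external XML entities (#176) is commonly done on the JAXP factory. The sketch below shows one way to do that with the JDK's built-in parser; the exact settings used by the repository handlers are not stated in this changelog, so the chosen features and the class name are assumptions.

```java
import javax.xml.XMLConstants;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;

/** Hypothetical sketch; the indexer's real configuration may differ. */
final class SecureXmlFactorySketch {

    /** Creates a factory whose builders refuse to fetch external DTDs and schemas. */
    static DocumentBuilderFactory newFactory() {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        try {
            factory.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true);
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException("Secure processing is not supported by this parser", e);
        }
        // An empty protocol list means no external access is permitted.
        factory.setAttribute(XMLConstants.ACCESS_EXTERNAL_DTD, "");
        factory.setAttribute(XMLConstants.ACCESS_EXTERNAL_SCHEMA, "");
        factory.setExpandEntityReferences(false);
        return factory;
    }

    public static void main(String[] args) {
        System.out.println(newFactory().isExpandEntityReferences());
    }
}
```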

Fixed

Set the study URL field from any language before replacing it with the language specific element (#142)
Fix alphabetical sorting issues caused by not normalising upper and lower case letters (#171)
Fix rejection reason not showing in the logs (#184)
Cleanup code (#203)
Fix title ascending/descending sort options not functioning (#209)

2.2.1 - 2020-05-04

Added

new GESIS endpoint (#162)
file appender
format error log message for successful indexing
implemented correlation id using MDC.putCloseable
correlation ID to the log messages
dependency for JSON logging support (logstash-logback-encoder 5.2)

Changed

changed GESIS endpoint from HTTP to HTTPS (#162)
use Java Time APIs for the PerfRequestSyncInterceptor stopwatch
increased test coverage
updated SonarQube scanner to 3.7.0
updated Spring Boot to 1.5.21
unified timeout and SSL verification settings
refined error log for unsuccessful indexing
marked all utility classes as final
close the Elasticsearch client on shutdown
revised and re-ordered list of endpoints
use Jib to containerise the indexer
updated Maven wrapper to 0.5.3
refactored the error handling code in DaoBase.postForStringResponse() to better align with Java best practices
refactored exception handling to avoid catching RuntimeException and a cast
print the config in StatusService.printPaSCHandlerOaiPmhConfig() directly
change behaviour when Study PID Agency is not specified. Before: '10.5279/DK-SA-DDA-868 (not available)'. After: '10.5279/DK-SA-DDA-868 (Agency not available)' (#156)
log queries at the info level
moved recursion out of the try-with-resources block to reduce resource consumption
reformatted the message when the record headers could not be parsed (because the parser could have failed at any point and left the InputStream in an inconsistent state)
use input streams instead of strings (avoids a double copy)
renamed dev profile to gcp
improved logging to help determine quality of harvested metadata (#191)

Removed

catches of RuntimeException in ESIngestService
option to disable HTTPS verification
unnecessary null check

Fixed

compiler warnings, as recommended by Error Prone
time zone bugs
logging pattern for the file logger
unused micrometer dependencies
unused DocumentBuilder bean
issues reported by SonarQube
register DocumentBuilderFactories as beans instead of DocumentBuilders. DocumentBuilders are not thread safe and need resetting after use. DocumentBuilderFactory.newDocumentBuilder() is thread safe and should be used instead
fixed logs not showing in Spring Boot Admin
encoded the resumption token in case characters invalid for URIs are returned
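
The DocumentBuilder fix in the list above amounts to sharing the factory and creating a fresh builder for every parse. The sketch below illustrates that pattern with plain JAXP rather than Spring beans. Class and method names are hypothetical, and because the JAXP documentation does not guarantee that every factory implementation is thread safe, builder creation is synchronised here as a precaution.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

/** Hypothetical sketch; not the indexer's actual bean configuration. */
final class SharedFactoryParsingSketch {
    // The factory is shared; builders are never shared or reused.
    private static final DocumentBuilderFactory FACTORY = DocumentBuilderFactory.newInstance();

    /** Parses a document with a builder that is created for, and used by, this call only. */
    static Document parse(String xml) {
        try {
            DocumentBuilder builder;
            synchronized (FACTORY) {
                builder = FACTORY.newDocumentBuilder();
            }
            return builder.parse(new InputSource(new StringReader(xml)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(parse("<study/>").getDocumentElement().getTagName());
    }
}
```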

Security

verify SSL
removed the option to disable HTTPS verification
