data-reconciliation

Summary

The data reconciliation app verifies consistency across the different data sets used by Companies House, including Oracle, MongoDB and Elasticsearch.
Groups of comparators are responsible for comparing data sets, aggregating results and then publishing a message to a Kafka topic.
Comparators are configured using environment variables.

System requirements

Building and Running Locally

From the command line, in the same folder as the Makefile, run make clean build.
Configure project environment variables where necessary (see below).
Ensure MongoDB and Elasticsearch are running within the Companies House developer environment.
Start the service in the CHS developer environment.

Architecture

The data reconciliation app is implemented using Apache Camel.
A comparison is triggered when a timer elapses.
The route triggers the desired function, which fetches the required data sets.
Retrieved data sets are marshalled into a suitable model and compared with each other.
Supported comparisons include:
- Count number of resources in a particular data set.
- Calculate symmetric difference between two data sets.
- Identify discrepancies between resources in two data sets.
The result is transformed into a CSV file and uploaded to S3.
A message is sent to Kafka after all required comparisons in the group have run.
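
The three supported comparisons can be illustrated with a minimal Python sketch. It is not the app's Camel implementation: the function names, the dictionary-based model and the sample company numbers are all assumptions made for illustration.

```python
def count_resources(resources):
    """Count the number of resources in a data set."""
    return len(resources)


def symmetric_difference(left_ids, right_ids):
    """Return the IDs that appear in exactly one of the two data sets."""
    return set(left_ids) ^ set(right_ids)


def find_discrepancies(left, right):
    """Return {id: (left_value, right_value)} for IDs present in both
    data sets whose values differ."""
    shared_ids = left.keys() & right.keys()
    return {
        key: (left[key], right[key])
        for key in shared_ids
        if left[key] != right[key]
    }


# Hypothetical sample data: company number -> company status.
mongo = {"00000001": "active", "00000002": "dissolved", "00000003": "active"}
oracle = {"00000001": "active", "00000002": "active", "00000004": "active"}

counts = (count_resources(mongo), count_resources(oracle))
missing = symmetric_difference(mongo, oracle)
mismatched = find_discrepancies(mongo, oracle)
```

With this sample data both counts are 3, the symmetric difference contains 00000003 and 00000004, and the only discrepancy is 00000002 (dissolved in one data set, active in the other).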

Maintenance

Queries used to retrieve data from Oracle and retrieve search hits from Elasticsearch are located on the classpath.

Environment variables

Oracle

Variable	Description	Example
SPRING_DATASOURCE_URL	The URL of the Oracle instance where CHIPS application data is stored	jdbc:oracle:thin:@oraclehost:1521:db
SPRING_DATASOURCE_USERNAME	The username that will be used to connect to Oracle	username
SPRING_DATASOURCE_PASSWORD	The password that will be used to connect to Oracle	password
SPRING_DATASOURCE_DRIVER_CLASS_NAME	The fully qualified class name of the driver that will be used to connect to Oracle	oracle.jdbc.OracleDriver

MongoDB

Variable	Description	Example
SPRING_DATA_MONGODB_URI	The URL of the MongoDB instance where CHS application data is stored	mongodb://mongohost:27017
ENDPOINT_MONGODB_COMPANY_PROFILE_DB_NAME	The name of the MongoDB database used to store company profiles	db_name
ENDPOINT_MONGODB_COMPANY_PROFILE_COLLECTION_NAME	The name of the MongoDB collection used to store company profiles	collection_name
ENDPOINT_MONGODB_READ_PREFERENCE	Determines how the MongoDB client routes read operations to members of a replica set	PRIMARY
ENDPOINT_MONGODB_DSQ_OFFICER_DB_NAME	The name of the MongoDB database used to store disqualified officers	db_name
ENDPOINT_MONGODB_DSQ_OFFICER_COLLECTION_NAME	The name of the MongoDB collection used to store disqualified officers	collection_name
ENDPOINT_MONGODB_INSOLVENCY_DB_NAME	The name of the MongoDB database used to store company insolvency data	db_name
ENDPOINT_MONGODB_INSOLVENCY_COLLECTION_NAME	The name of the MongoDB collection used to store company insolvency data	collection_name

Elasticsearch

Variable	Description	Example
ELASTICSEARCH_ALPHA_HOST	The hostname that will be used to connect to the Elasticsearch alphabetical search cluster	example.com
ELASTICSEARCH_ALPHA_INDEX	The name of the index that alphabetical search hits will be retrieved from	index_name
ELASTICSEARCH_ALPHA_PORT	The port number that will be used to connect to the Elasticsearch alphabetical search cluster	9200
ELASTICSEARCH_ALPHA_PROTOCOL	The protocol that will be used to connect to the Elasticsearch alphabetical search cluster	https
ELASTICSEARCH_ALPHA_SEGMENTS	The number of slices that the scrolling search will be split into	3
ELASTICSEARCH_ALPHA_SLICE_SIZE	The number of hits that the scrolling search will return in each response	10000
ELASTICSEARCH_ALPHA_SLICE_FIELD	The field that will be used to split results of a scrolling search	_uid
ELASTICSEARCH_PRIMARY_HOST	The hostname that will be used to connect to the Elasticsearch primary search cluster	example.com
ELASTICSEARCH_PRIMARY_INDEX	The name of the index that primary search hits will be retrieved from	index_name
ELASTICSEARCH_PRIMARY_PORT	The port number that will be used to connect to the Elasticsearch primary search cluster	9200
ELASTICSEARCH_PRIMARY_PROTOCOL	The protocol that will be used to connect to the Elasticsearch primary search cluster	https
ELASTICSEARCH_PRIMARY_SEGMENTS	The number of slices that the scrolling search will be split into	3
ELASTICSEARCH_PRIMARY_SLICE_SIZE	The number of hits that the scrolling search will return in each response	10000
ELASTICSEARCH_PRIMARY_SLICE_FIELD	The field that will be used to split results of a scrolling search	_uid
ENDPOINT_ELASTICSEARCH_LOG_INDICES	Used to log a tally of the number of Elasticsearch search hits that have been processed	10000
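
As a rough illustration of how the SEGMENTS, SLICE_SIZE and SLICE_FIELD settings could relate to an Elasticsearch sliced scroll, the Python sketch below builds one search request body per slice. The mapping of variables to request fields and the match_all query are assumptions; the queries the app actually uses are located on the classpath.

```python
def sliced_scroll_bodies(segments, slice_size, slice_field):
    """Build one scroll search body per slice.

    segments    -> number of slices ("max" in the slice clause)
    slice_size  -> hits returned per response ("size")
    slice_field -> field used to split the results ("field")
    """
    # Placeholder query, assumed for illustration only.
    query = {"match_all": {}}
    if segments < 2:
        # A single segment needs no slice clause.
        return [{"size": slice_size, "query": query}]
    return [
        {
            "slice": {"id": slice_id, "max": segments, "field": slice_field},
            "size": slice_size,
            "query": query,
        }
        for slice_id in range(segments)
    ]


# Values taken from the example column of the table above.
bodies = sliced_scroll_bodies(segments=3, slice_size=10000, slice_field="_uid")
```

Each body would be sent as its own scrolling search, so the three slices can be read independently of one another.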

AWS

Variable	Description	Example
RESULTS_BUCKET	The S3 bucket to which results will be uploaded	bucket_name
AWS_ACCESS_KEY_ID	The access key that will be used to connect to AWS	access_key
AWS_SECRET_ACCESS_KEY	The secret access key that will be used to connect to AWS	secret_access_key
AWS_REGION	The AWS region that the S3 client will connect to	eu-west-2
RESULTS_EXPIRY_TIME_IN_MILLIS	The duration in milliseconds for which comparison results can be accessed	600000

Kafka

Variable	Description	Example
SCHEMA_REGISTRY_URL	The URL of the Kafka schema registry	example.com
KAFKA_BROKER_ADDR	The URL of the Kafka broker	example.com

Caffeine Cache

Variable	Description	Example
CACHE_EXPIRY_IN_SECONDS	The duration in seconds after which cached results will be evicted	300

Comparison Groups

Description

Each comparator belongs to a comparison group. After all comparators in the comparison group have run, results produced by each comparator will be published to S3 and a message will be sent to a Kafka topic.
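
The "publish once every comparator in the group has run" behaviour can be sketched as follows. This is a simplified Python model, not the app's aggregation code: the ComparisonGroup class, its method names and the comparator names are hypothetical.

```python
class ComparisonGroup:
    """Collects results and publishes them once all comparators have run."""

    def __init__(self, expected_comparators, publish):
        self._pending = set(expected_comparators)
        self._results = {}
        self._publish = publish  # callback standing in for S3 upload + Kafka message
        self._published = False

    def record(self, comparator, result):
        """Store a comparator's result; publish when none are still pending."""
        self._results[comparator] = result
        self._pending.discard(comparator)
        if not self._pending and not self._published:
            self._published = True
            self._publish(dict(self._results))


# Usage: nothing is published until both comparators have reported.
published = []
group = ComparisonGroup(["company_count", "company_number"], published.append)
group.record("company_count", "counts.csv")
group.record("company_number", "numbers.csv")
```

After the second call to record, published holds a single entry containing both results.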

The following tables contain toggles (for enabling/disabling each comparator) and timer delays (after application startup for each comparator).

Note: the application only starts when at least one comparison group toggle is enabled. If none is enabled, the following error message is logged:

No aggregation group models enabled; must be at least one
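
The startup rule above can be sketched in Python as a check over the toggle variables. The function name, the case-insensitive "true" test and the default of "false" for unset variables are assumptions; only the error message comes from the app.

```python
def enabled_toggles(env, toggle_names):
    """Return the toggles set to "true"; raise if none is enabled.

    env is a mapping of environment variable names to values,
    for example os.environ.
    """
    enabled = [
        name for name in toggle_names
        if env.get(name, "false").strip().lower() == "true"
    ]
    if not enabled:
        raise RuntimeError(
            "No aggregation group models enabled; must be at least one"
        )
    return enabled


# Usage with a hypothetical environment in which one comparator is enabled.
toggles = [
    "COMPANY_COUNT_MONGO_ORACLE_ENABLED",
    "COMPANY_NUMBER_MONGO_ORACLE_ENABLED",
]
env = {"COMPANY_COUNT_MONGO_ORACLE_ENABLED": "true"}
active = enabled_toggles(env, toggles)
```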

Company Profile Comparisons - MongoDB-Oracle

Variable	Description	Example
COMPANY_COUNT_MONGO_ORACLE_ENABLED	Company count comparator toggle	"true" / "false"
COMPANY_COUNT_MONGO_ORACLE_DELAY	Company count comparator delay	"30s"
COMPANY_NUMBER_MONGO_ORACLE_ENABLED	Company number comparator toggle	"true" / "false"
COMPANY_NUMBER_MONGO_ORACLE_DELAY	Company number comparator delay	"1m30s"
COMPANY_STATUS_MONGO_ORACLE_ENABLED	Company status comparator toggle	"true" / "false"
COMPANY_STATUS_MONGO_ORACLE_DELAY	Company status comparator delay	"9m30s"
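
The delay examples combine minute and second units, as in "1m30s". Assuming that format (optional hours, minutes and seconds, in that order), a delay can be converted to seconds as in the Python sketch below, which is handy for checking that comparators in a group are staggered as intended. The function name is hypothetical.

```python
import re

# Optional hours, minutes and seconds, in that order, e.g. "1m30s".
_DELAY_PATTERN = re.compile(r"(?:(\d+)h)?(?:(\d+)m)?(?:(\d+)s)?")


def delay_in_seconds(text):
    """Convert a delay such as "30s" or "9m30s" into a number of seconds."""
    match = _DELAY_PATTERN.fullmatch(text)
    if not text or match is None:
        raise ValueError(f"Unrecognised delay: {text!r}")
    hours, minutes, seconds = (int(group or 0) for group in match.groups())
    return hours * 3600 + minutes * 60 + seconds


# Delays from the example column of the table above.
delays = {
    "COMPANY_COUNT_MONGO_ORACLE_DELAY": delay_in_seconds("30s"),
    "COMPANY_NUMBER_MONGO_ORACLE_DELAY": delay_in_seconds("1m30s"),
    "COMPANY_STATUS_MONGO_ORACLE_DELAY": delay_in_seconds("9m30s"),
}
```

With the example values, the three comparators would start 30, 90 and 570 seconds after application startup.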

Disqualified Officer Comparisons - MongoDB-Oracle

Variable	Description	Example
DSQ_OFFICER_ID_MONGO_ORACLE_ENABLED	Disqualified officer comparator toggle	"true" / "false"
DSQ_OFFICER_ID_MONGO_ORACLE_DELAY	Disqualified officer comparator delay	"4m30s"

Elasticsearch Comparisons - MongoDB-Elasticsearch

Variable	Description	Example
COMPANY_NUMBER_MONGO_PRIMARY_ENABLED	Primary index company number comparator toggle	"true" / "false"
COMPANY_NUMBER_MONGO_PRIMARY_DELAY	Primary index company number comparator delay	"2m30s"
COMPANY_NUMBER_MONGO_ALPHA_ENABLED	Alpha index company number comparator toggle	"true" / "false"
COMPANY_NUMBER_MONGO_ALPHA_DELAY	Alpha index company number comparator delay	"3m30s"
COMPANY_NAME_MONGO_PRIMARY_ENABLED	Primary index company name comparator toggle	"true" / "false"
COMPANY_NAME_MONGO_PRIMARY_DELAY	Primary index company name comparator delay	"5m30s"
COMPANY_NAME_MONGO_ALPHA_ENABLED	Alpha index company name comparator toggle	"true" / "false"
COMPANY_NAME_MONGO_ALPHA_DELAY	Alpha index company name comparator delay	"6m30s"
COMPANY_STATUS_MONGO_PRIMARY_ENABLED	Primary index company status comparator toggle	"true" / "false"
COMPANY_STATUS_MONGO_PRIMARY_DELAY	Primary index company status comparator delay	"7m30s"
COMPANY_STATUS_MONGO_ALPHA_ENABLED	Alpha index company status comparator toggle	"true" / "false"
COMPANY_STATUS_MONGO_ALPHA_DELAY	Alpha index company status comparator delay	"8m30s"

Company Insolvency Comparisons

Variable	Description	Example
INSOLVENCY_COMPANY_NUMBER_MONGO_ORACLE_ENABLED	Insolvency company number comparator toggle	"true" / "false"
INSOLVENCY_COMPANY_NUMBER_MONGO_ORACLE_DELAY	Insolvency company number comparator delay	"10m30s"
INSOLVENCY_CASE_COUNT_MONGO_ORACLE_ENABLED	Insolvency case count comparator toggle	"true" / "false"
INSOLVENCY_CASE_COUNT_MONGO_ORACLE_DELAY	Insolvency case count comparator delay	"11m30s"

Output aggregation configuration

Variable	Description	Example
EMAIL_RECIPIENT_LIST	The email accounts that will be notified when results from a comparison are available	[email protected]
EMAIL_APPLICATION_ID	Template configuration for the email sender	application_id
EMAIL_MESSAGE_ID	Template configuration for the email sender	message_id
EMAIL_MESSAGE_TYPE	Template configuration for the email sender	message_type
EMAIL_SENDER	The value of the email's To field	[email protected]

Miscellaneous

Variable	Description	Example
RESULTS_INITIAL_CAPACITY	Used to optimise collections for the number of expected results	1000000

Building the docker image

mvn compile jib:dockerBuild -Dimage=169942020521.dkr.ecr.eu-west-1.amazonaws.com/local/data-reconciliation

Running Locally using Docker

Clone Docker CHS Development and follow the steps in the README.
Enable the data-reconciliation module
Run tilt up and wait for all services to start

To make local changes

Development mode is available for this service in Docker CHS Development.

./bin/chs-dev development enable data-reconciliation

This will clone the data reconciliation app into the repositories folder. Any changes to the code or resources will automatically trigger a rebuild and relaunch.

Terraform ECS

What does this code do?

The code present in this repository is used to define and deploy a dockerised container in AWS ECS. This is done by calling a module from terraform-modules. Application specific attributes are injected and the service is then deployed using Terraform via the CICD platform 'Concourse'.

Application specific attributes	Value	Description
ECS Cluster	data-reconciliation-service	ECS cluster (stack) the service belongs to
Load balancer	not required	The load balancer that sits in front of the service
Concourse pipeline	Pipeline link Pipeline code	Concourse pipeline link in shared services

Contributing

Please refer to the ECS Development and Infrastructure Documentation for detailed information on the infrastructure being deployed.

Testing

Ensure the terraform runner local plan executes without issues. For information on terraform runners please see the Terraform Runner Quickstart guide.
If you encounter any issues or have questions, reach out to the team on the #platform Slack channel.

Vault Configuration Updates

Any secrets required for this service will be stored in Vault. For any updates to the Vault configuration, please consult with the #platform team and submit a workflow request.
