User Story
In order to manage large numbers of harvest jobs, data.gov wants to define a series of queue systems using Redis.
Acceptance Criteria
First we will define the queues themselves:
| queue | purpose |
| --- | --- |
| job | jobs waiting to be picked up by the harvester pipeline |
| extract | a harvest source waiting to have its catalog parsed |
| compare | an incoming record with a unique UUID waiting to be compared with the current record of the same identifier |
| validate | a record in need of validation against the expected schema |
| transform | a record in need of transformation from one schema to another |
| load | a record ready to be uploaded into the current catalog UI (currently created, updated, or deleted in CKAN) |
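As a rough illustration of this half of the story, the sketch below models the six queues as Redis lists using redis-py. The `harvest:` key prefix, the JSON payloads, and the helper names are assumptions for illustration, not a decided API.

```python
# queues.py -- a minimal sketch of the six queues as Redis lists (assumed design).
import json
import redis

# The six queues from the table above; the "harvest:" prefix is an assumption.
QUEUES = ("job", "extract", "compare", "validate", "transform", "load")

r = redis.Redis(host="localhost", port=6379, decode_responses=True)


def enqueue(queue: str, message: dict) -> None:
    """Add a message to the named queue (LPUSH adds at the head of the list)."""
    if queue not in QUEUES:
        raise ValueError(f"unknown queue: {queue}")
    r.lpush(f"harvest:{queue}", json.dumps(message))


def dequeue(queue: str, timeout: int = 0) -> dict | None:
    """Remove the oldest message. BRPOP blocks, then pops from the tail,
    so LPUSH + BRPOP together behave as a FIFO queue."""
    item = r.brpop(f"harvest:{queue}", timeout=timeout)
    if item is None:
        return None
    _key, raw = item
    return json.loads(raw)
```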
Then we will define their lifecycles:
| queue | lifecycle state | definition |
| --- | --- | --- |
| job | create | a job is waiting to be picked up by the harvester |
| job | extract | a harvest source is being extracted to a catalog of records |
| job | compare | a catalog of records is waiting to be compared with its companion in CKAN |
| job | processing | record-level processing of add, update, delete |
| job | completion | the harvest job has finished successfully or in error |
| extract | create | a harvest source in queue |
| extract | processing | a harvest source is being extracted |
| extract | completion | all records extracted from the harvest source and saved to S3 under the appropriate prefix |
| compare | create | a catalog of records in queue |
| compare | processing | a catalog of records is being compared with the harvest source found in the UI |
| compare | completion | individual records have been sent to the next step, determined by whether they need to be added, updated, or deleted |
| validate | create | an individual record in queue |
| validate | processing | validation against the given schema |
| validate | completion | pass/fail parsed against the schema |
| transform | create | an individual record in queue |
| transform | processing | a record in one schema is transformed to another schema |
| transform | completion | the record is sent to the validate queue for final validation of the successful transformation |
| load | create | an individual record in queue |
| load | processing | a RESTful operation against the CKAN catalog based on whether the record should be created, updated, or deleted |
| load | completion | success or failure of that operation |
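One way to track those per-queue lifecycles is sketched below: each queue keeps a Redis hash mapping an item id to its current state, and transitions only move forward through the ordered states in the table above. The key naming, the forward-only rule, and the helper names are assumptions, not part of the acceptance criteria.

```python
# lifecycle.py -- a sketch of lifecycle tracking (assumed design).
import redis

# Ordered lifecycle states per queue, taken from the table above.
LIFECYCLES = {
    "job": ("create", "extract", "compare", "processing", "completion"),
    "extract": ("create", "processing", "completion"),
    "compare": ("create", "processing", "completion"),
    "validate": ("create", "processing", "completion"),
    "transform": ("create", "processing", "completion"),
    "load": ("create", "processing", "completion"),
}

r = redis.Redis(decode_responses=True)


def set_state(queue: str, item_id: str, state: str) -> None:
    """Record an item's lifecycle state in a per-queue hash, moving forward only."""
    states = LIFECYCLES[queue]
    if state not in states:
        raise ValueError(f"{state!r} is not a lifecycle state of the {queue} queue")
    key = f"harvest:{queue}:lifecycle"
    current = r.hget(key, item_id)
    if current is not None and states.index(state) < states.index(current):
        raise ValueError(f"cannot move {item_id} backwards from {current} to {state}")
    r.hset(key, item_id, state)


def get_state(queue: str, item_id: str) -> str | None:
    """Return the last recorded state for an item, or None if never seen."""
    return r.hget(f"harvest:{queue}:lifecycle", item_id)
```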
Background
Multiple harvest jobs running concurrently will consume excessive system resources. Regardless of pipeline speed, we would like to define a definitive FIFO (first in, first out) system to guarantee linear processing of harvest sources.
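To make the FIFO guarantee concrete, the sketch below runs a single blocking consumer against the job queue: because one worker pops jobs in arrival order and finishes each before taking the next, harvest sources are processed linearly. This is an assumed shape, and the `process_job` stub is a placeholder for the real pipeline hand-off.

```python
# worker.py -- a sketch of linear (FIFO) processing under the assumptions above.
import json
import redis

r = redis.Redis(decode_responses=True)


def process_job(job: dict) -> None:
    # Placeholder: hand the job off to the harvester pipeline (extract step).
    print(f"processing harvest job {job.get('id')}")


def run_worker(queue: str = "job") -> None:
    """Consume the queue one job at a time, in arrival order."""
    while True:
        # BRPOP with no timeout blocks until a job is available.
        _key, raw = r.brpop(f"harvest:{queue}")
        process_job(json.loads(raw))
```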
Security Considerations (required)
None
Sketch