Harvesting 2.0 Controller Module #4305

Closed
3 tasks
btylerburton opened this issue May 8, 2023 · 1 comment

Comments

@btylerburton
Contributor

User Story

In order to ensure observability into our Harvesting 2.0 pipeline, data.gov wants a Controller module that orchestrates the harvesting sub-processes.

Acceptance Criteria

[ACs should be clearly demoable/verifiable whenever possible. Try specifying them using BDD.]

  • GIVEN a Python controller
    WHEN a harvesting API event is triggered by Data providers or Datagov admins
    THEN a new harvesting job is initiated by the controller
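The criterion above could be exercised with something like the following. This is only a sketch: the event shape, the role names in `ALLOWED_ROLES`, and the `HarvestJob` and `JobRegistry` classes are all hypothetical and are not taken from the harvesting module.

```python
import uuid
from dataclasses import dataclass
from typing import Dict

# Assumed role names for "Data providers" and "Datagov admins".
ALLOWED_ROLES = {"data_provider", "datagov_admin"}


@dataclass
class HarvestJob:
    job_id: str
    source_id: str
    triggered_by: str
    status: str = "new"


class JobRegistry:
    """Hypothetical entry point: turns an API event into a tracked job."""

    def __init__(self) -> None:
        self.jobs: Dict[str, HarvestJob] = {}

    def handle_event(self, event: dict) -> HarvestJob:
        role = event.get("role")
        if role not in ALLOWED_ROLES:
            raise PermissionError(f"role {role!r} may not trigger a harvest")
        job = HarvestJob(
            job_id=str(uuid.uuid4()),
            source_id=event["source_id"],
            triggered_by=role,
        )
        # Record the job so it can be looked up for the rest of its lifecycle.
        self.jobs[job.job_id] = job
        return job
```

A demo of the AC would then be: post an event, and show that a job with status `new` appears in the registry.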

Background

Data.gov wants observability and resilience built into the Harvesting 2.0 pipeline.

Controller should ensure that a job is tracked throughout the harvesting lifecycle of:

  • Extraction
  • Validation
  • Transformation

This process should be transparent, observable and idempotent.
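As a rough illustration of what "tracked, observable and idempotent" could mean in code, here is a minimal sketch. The `Stage`, `Job` and `Controller` names, the handler signature, and the in-memory job store are assumptions made for the example, not the design of the actual module.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable, Dict, List, Optional


class Stage(Enum):
    # Order follows the lifecycle listed in this issue.
    EXTRACTION = "extraction"
    VALIDATION = "validation"
    TRANSFORMATION = "transformation"


@dataclass
class Job:
    job_id: str
    completed: List[Stage] = field(default_factory=list)
    events: List[str] = field(default_factory=list)  # audit trail for observability
    error: Optional[str] = None


class Controller:
    """Hypothetical controller: runs each stage once per job and records the outcome."""

    def __init__(self, handlers: Dict[Stage, Callable[[str], None]]) -> None:
        self.handlers = handlers
        self.jobs: Dict[str, Job] = {}

    def run(self, job_id: str) -> Job:
        job = self.jobs.setdefault(job_id, Job(job_id))
        job.error = None
        for stage in Stage:
            if stage in job.completed:
                continue  # idempotent: finished stages are not re-run
            try:
                self.handlers[stage](job_id)
            except Exception as exc:
                job.error = f"{stage.value}: {exc}"
                job.events.append(f"{stage.value} failed")
                break  # stop at the first failing stage
            job.completed.append(stage)
            job.events.append(f"{stage.value} ok")
        return job
```

Calling `run` again with the same `job_id` resumes from the first stage that has not completed, so a retry after a failure does not repeat earlier work.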

Security Considerations (required)

All data should be encrypted in transit and at rest.

Sketch

  • Create a Python controller module
  • Create pytest tests that ensure mock output from the modules is traceable and that errors thrown at any step in the process are reported accurately
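The tests in the sketch could look roughly like this. `run_steps` and the `mock_*` functions are stand-ins invented for the example; the real submodule interfaces are not defined in this issue. The test functions use plain `assert` statements, so pytest can collect them as written.

```python
from typing import Callable, Dict, List, Tuple

Step = Tuple[str, Callable[[dict], dict]]


def run_steps(record: dict, steps: List[Step]) -> Dict[str, object]:
    """Run steps in order; report which ones ran and where the first error occurred."""
    report: Dict[str, object] = {
        "trace": [],
        "failed_step": None,
        "error": None,
        "output": record,
    }
    for name, func in steps:
        try:
            report["output"] = func(report["output"])
        except Exception as exc:
            report["failed_step"] = name
            report["error"] = str(exc)
            break
        report["trace"].append(name)
    return report


def mock_extract(record: dict) -> dict:
    return {**record, "extracted": True}


def mock_validate_fails(record: dict) -> dict:
    raise ValueError("missing title")


def mock_transform(record: dict) -> dict:
    return {**record, "transformed": True}


def test_mock_output_is_traceable():
    report = run_steps(
        {"id": "r1"},
        [("extract", mock_extract), ("transform", mock_transform)],
    )
    assert report["trace"] == ["extract", "transform"]
    assert report["output"] == {"id": "r1", "extracted": True, "transformed": True}


def test_error_is_reported_at_failing_step():
    report = run_steps(
        {"id": "r1"},
        [
            ("extract", mock_extract),
            ("validate", mock_validate_fails),
            ("transform", mock_transform),
        ],
    )
    assert report["failed_step"] == "validate"
    assert report["error"] == "missing title"
    assert report["trace"] == ["extract"]
```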
@nickumia-reisys
Contributor

Upon review of this issue, we have decided to combine the core code of this module into the existing harvesting module as another feature. We are creating this capability specifically to interface with the harvesting extract, transform, validate, compare and load submodules. The logical algorithm (the code to run) is abstracted from the implementation of the application/service (running the code): we will create a separate repo for the deployment of the application or service, which will call this core logic from the harvesting module.

As a result, the current layout of the harvesting module should look something like:

harvesting/
├── compare
├── controller
├── extract
├── load
├── transform
└── validate

The features of the controller will be continuously updated in our Wiki doc; they deal with the infrastructure that supports the management of the 'job' and 'record' queues. The next ticket in this sequence is:

@nickumia-reisys nickumia-reisys self-assigned this May 25, 2023
@hkdctol hkdctol moved this from ✔ Done to 🗄 Closed in data.gov team board May 25, 2023
@btylerburton btylerburton removed the H2.0/Harvest-General General Harvesting 2.0 Issues label Dec 13, 2023