Create Initial Compare Function for DCAT-US #4557

Closed
14 tasks
btylerburton opened this issue Dec 13, 2023 · 8 comments
Labels: H2.0/Harvest-Runner (Harvest Source Processing for Harvesting 2.0)

Comments

@btylerburton
Contributor

btylerburton commented Dec 13, 2023

User Story

In order to load test our compare solution, datagovteam wants to develop the initial iteration of our compare app functionality for DCAT-US.

Acceptance Criteria

[ACs should be clearly demoable/verifiable whenever possible. Try specifying them using BDD.]

  • GIVEN I have a DCAT-US harvest source that has been loaded into our Dynamic DAG ETL pipeline and extracted into individual records
    THEN I would like to use a hashing function on each record and store that in an iterable map (harvest_source_map) in the form id: source_hash

  • GIVEN I have loaded the same harvest source from the CKAN DB
    THEN I would like to use the recorded result of the same hashing function stored as metadata in the CKAN record to create an iterable map (catalog_source_map) in the form id: source_hash
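A minimal sketch of the hashing step, assuming each record is a plain dict and that the DCAT-US `identifier` field serves as the id. The function names `hash_record` and `build_source_map` and the choice of SHA-256 are illustrative assumptions, not something this issue specifies.

```python
import hashlib
import json


def hash_record(record: dict) -> str:
    """Return a stable SHA-256 hex digest for one record."""
    # sort_keys plus fixed separators make the serialization deterministic,
    # so key order in the source file does not change the hash.
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def build_source_map(records: list[dict]) -> dict[str, str]:
    """Build a map in the form id: source_hash.

    Assumes every record carries an "identifier" key.
    """
    return {record["identifier"]: hash_record(record) for record in records}
```

Serializing with sorted keys means two records with the same content but different key order hash identically, which avoids spurious updates.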

For the below operations assume a for loop over the harvest_source_map is in progress:

  • GIVEN the ID of the dataset is found in the CKAN catalog but the hash is not the same
    THEN I want to add that dataset to the list of items to update (packages_to_update)

  • GIVEN the ID of the dataset is not found in the CKAN catalog
    THEN I want to add that dataset to the list of items to create (packages_to_create)

  • GIVEN the ID of the dataset is found and the hash is the same
    THEN I know the dataset is unchanged, and I can move on to the next record

  • GIVEN that all records in the harvest source have been traversed
    THEN I know that the IDs that remain in the catalog hashmap (catalog_source_map) can be deleted from the CKAN catalog (packages_to_destroy)
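The loop described by these criteria could be sketched roughly as follows. This assumes both maps are plain dicts of id to hash; the function name `compare` and the list name `packages_to_update` are assumptions made for illustration.

```python
def compare(
    harvest_source_map: dict[str, str],
    catalog_source_map: dict[str, str],
) -> dict[str, list[str]]:
    """Sort dataset IDs into create, update, and destroy lists."""
    # Work on a copy so the caller's catalog map is not mutated.
    remaining = dict(catalog_source_map)
    packages_to_create = []
    packages_to_update = []

    for dataset_id, source_hash in harvest_source_map.items():
        if dataset_id not in remaining:
            # Not in the catalog yet: create it.
            packages_to_create.append(dataset_id)
            continue
        # Found in the catalog: remove it from the remaining set and
        # flag it for update only if the hash changed.
        if remaining.pop(dataset_id) != source_hash:
            packages_to_update.append(dataset_id)

    # Whatever is left in the catalog map no longer exists in the source.
    packages_to_destroy = list(remaining)

    return {
        "packages_to_create": packages_to_create,
        "packages_to_update": packages_to_update,
        "packages_to_destroy": packages_to_destroy,
    }
```

Popping matched IDs as the loop runs is what leaves only the deletable IDs in the catalog map once the harvest source has been traversed.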

Background

[Any helpful contextual notes or links to artifacts/evidence, if needed]

diagram

Security Considerations (required)

[Any security concerns that might be implicated in the change. "None" is OK, just be explicit here!]

Sketch

  • Add new code to datagov-harvesting-logic to satisfy the AC above
  • Write tests in datagov-harvesting-logic that cover:
    • update
    • create
    • destroy
    • pass
  • Push a new version to PyPI
  • Integrate that into our existing Airflow test instance
@btylerburton btylerburton added the H2.0/Harvest-Runner Harvest Source Processing for Harvesting 2.0 label Dec 13, 2023
@btylerburton btylerburton changed the title Create Initial Compare App for DCAT-US Create Initial Compare Function for DCAT-US Dec 14, 2023
@gujral-rei gujral-rei moved this to 📔 Product Backlog in data.gov team board Dec 14, 2023
@gujral-rei gujral-rei moved this from 📔 Product Backlog to 📟 Sprint Backlog [7] in data.gov team board Dec 14, 2023
@gujral-rei gujral-rei moved this from 📟 Sprint Backlog [7] to 📔 Product Backlog in data.gov team board Dec 14, 2023
@gujral-rei gujral-rei moved this from 📔 Product Backlog to 📟 Sprint Backlog [7] in data.gov team board Dec 21, 2023
@rshewitt rshewitt self-assigned this Dec 21, 2023
@rshewitt rshewitt moved this from 📟 Sprint Backlog [7] to 🏗 In Progress [8] in data.gov team board Dec 21, 2023
@rshewitt
Contributor

compare branch. added compare logic and unit test. need to add integration test against real ckan endpoint.

@rshewitt
Contributor

rshewitt commented Jan 2, 2024

CKAN adds things to a dataset that don't, or may not, derive from the catalog itself (e.g. metadata_created defaults to utcnow, license_id appears to default to "notspecified" if unspecified, and information about the dataset's organization is auto-populated; see picture).

Screenshot 2024-01-02 at 9 58 08 AM

if we intend to compare by hash we would need to remove everything ckan adds to ensure we're comparing accurately.
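As a rough illustration of what "removing everything ckan adds" could look like, here is a sketch that drops a set of fields before hashing. The set `CKAN_ADDED_FIELDS` contains only the three fields named above and is almost certainly incomplete; the function names are hypothetical.

```python
import hashlib
import json

# Only the fields mentioned in this comment; a real list would need to be
# checked against an actual CKAN response. Note that license_id can also
# come from the source, so dropping it unconditionally is a simplification.
CKAN_ADDED_FIELDS = {"metadata_created", "license_id", "organization"}


def strip_ckan_fields(package: dict) -> dict:
    """Return a copy of the package without the CKAN-populated fields."""
    return {k: v for k, v in package.items() if k not in CKAN_ADDED_FIELDS}


def hash_package(package: dict) -> str:
    """Hash only the fields that are expected to come from the source."""
    canonical = json.dumps(strip_ckan_fields(package), sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

With this approach, two copies of a dataset that differ only in CKAN-generated values produce the same hash, but the field list has to be maintained by hand.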

@btylerburton
Contributor Author

btylerburton commented Jan 2, 2024

Alternatively, we can hash the dataset prior to pushing it to CKAN and store the hash in S3. Then we compare the incoming hash with the previously recorded one. This also allows us to bypass the CKAN API for fetching the datasets and to control the amount of information that is hashed.

@jbrown-xentity
Contributor

Ah. So this becomes a sticky problem. Currently CKAN houses the original, raw metadata in the harvest_object table, and can report that on demand. The dataset has a link to that item in the harvest_object table. We'll need to recreate something similar, whereby we have CKAN (or S3 per Tyler's suggestion) store the original (maybe original but sorted?) metadata to compare the source against.

@jbrown-xentity
Contributor

After discussing with @rshewitt, we're going to move forward with putting the raw metadata into the catalog. There are a couple of reasons for this, namely that the raw metadata is often referenced and is currently available and used in the API. Since the transformations from CSDGM and/or ISO to DCAT-US are "lossy" (not all fields in CSDGM and ISO have an equivalent in DCAT-US), we want the raw metadata available to end users (it's not just for the harvesting process).
That being said, we now need to build a few components into the load function: tracking what "source" a dataset came from, and having the raw DCAT-US JSON object stored as an extra. Then we can write a "CKAN extract", which pulls all datasets for a given source, hashes the raw object, and then sends the result to the compare function. See diagram above for the full workflow. Tagging @btylerburton for awareness.
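A sketch of the hashing half of such a "CKAN extract", operating on package dicts that have already been fetched. It assumes extras arrive as a list of `{"key": ..., "value": ...}` entries with string values; the extra names `dcat_raw` and `harvest_source` and the function names are hypothetical, chosen only for this example.

```python
import hashlib
import json

RAW_EXTRA_KEY = "dcat_raw"            # hypothetical extra holding the raw JSON
SOURCE_EXTRA_KEY = "harvest_source"   # hypothetical extra naming the source


def get_extra(package: dict, key: str):
    """Return the value of the named extra, or None if it is absent."""
    for extra in package.get("extras", []):
        if extra.get("key") == key:
            return extra.get("value")
    return None


def build_catalog_source_map(packages: list[dict], source_name: str) -> dict[str, str]:
    """Hash the raw object of every package that belongs to one source."""
    catalog_source_map = {}
    for package in packages:
        if get_extra(package, SOURCE_EXTRA_KEY) != source_name:
            continue
        raw = get_extra(package, RAW_EXTRA_KEY)
        if raw is None:
            continue  # nothing to compare against
        record = json.loads(raw)  # the extra value is a JSON string
        canonical = json.dumps(record, sort_keys=True, separators=(",", ":"))
        digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
        catalog_source_map[record["identifier"]] = digest
    return catalog_source_map
```

Re-serializing the stored JSON with sorted keys before hashing keeps the result comparable with a hash computed the same way on the incoming source record.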

@rshewitt
Contributor

rshewitt commented Jan 3, 2024

Package search on dev wasn't working as intended for a recently added dataset. I used catalog-dev.data.gov as the CKAN route. Querying for something like the programCode returned no results when it should have returned something. @jbrown-xentity thinks there could be an issue with Solr for that route. Package creation has to be re-enabled via the nginx config; not sure if this contributes to the issue. @FuhuXia do you know what could be causing this?

@rshewitt
Contributor

rshewitt commented Jan 3, 2024

changing the ckan route to catalog-dev-admin-datagov.app.cloud.gov fixed the issue.

@rshewitt
Contributor

rshewitt commented Jan 3, 2024

pagination is possible using something like...

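import requests

num_rows = 1000   # page size (rows per request)
count = 300000    # upper bound on the number of datasets to page through
for start in range(0, count, num_rows):
    url = f"https://catalog.data.gov/api/action/package_search?q=*:*&rows={num_rows}&start={start}"
    res = requests.get(url)
    res.raise_for_status()
    # do something with the response, e.g. res.json()["result"]["results"]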

confirmed using catalog
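A variation on the loop above, sketched under the assumption that the total is read from the `count` field of the package_search result instead of being hard-coded. The helper `iter_packages` and the `fetch_page` callable are hypothetical; `fetch_page(start, rows)` is expected to return the `result` object of one package_search response.

```python
from typing import Callable, Iterator


def iter_packages(
    fetch_page: Callable[[int, int], dict],
    num_rows: int = 1000,
) -> Iterator[dict]:
    """Yield every package, stopping once the reported count is reached."""
    start = 0
    while True:
        result = fetch_page(start, num_rows)
        yield from result["results"]
        start += num_rows
        # Stop when all reported rows were requested, or when a page comes
        # back empty (guards against a count that changes mid-run).
        if start >= result["count"] or not result["results"]:
            break
```

Passing the fetch step in as a callable keeps the paging logic testable without a live CKAN endpoint; in practice `fetch_page` could wrap a `requests.get` call to the package_search URL and return `res.json()["result"]`.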

@rshewitt rshewitt moved this from 🏗 In Progress [8] to 👀 Needs Review [2] in data.gov team board Jan 4, 2024
@rshewitt rshewitt moved this from 👀 Needs Review [2] to ✔ Done in data.gov team board Jan 4, 2024
@btylerburton btylerburton moved this from ✔ Done to 🗄 Closed in data.gov team board Jan 4, 2024
@github-project-automation github-project-automation bot moved this from 🗄 Closed to ✔ Done in data.gov team board Feb 16, 2024
@btylerburton btylerburton moved this from ✔ Done to 🗄 Closed in data.gov team board Feb 16, 2024