Create Initial Compare Function for DCAT-US #4557

btylerburton · 2023-12-13T23:38:58Z

User Story

In order to load test our compare solution, datagovteam wants to develop the initial iteration of our compare app functionality for DCAT-US.

Acceptance Criteria

[ACs should be clearly demoable/verifiable whenever possible. Try specifying them using BDD.]

GIVEN I have a DCAT-US harvest source that has been loaded into our Dynamic DAG ETL pipeline and extracted into individual records
THEN I would like to use a hashing function on each record and store that in an iterable map (harvest_source_map) in the form id: source_hash
GIVEN I have loaded the same harvest source from the CKAN DB
THEN I would like to use the recorded result of the same hashing function stored as metadata in the CKAN record to create an iterable map (catalog_source_map) in the form id: source_hash
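
A minimal sketch of the hashing step in the criteria above, assuming each extracted record is a JSON-serializable dict that carries the DCAT-US `identifier` field. The hashing scheme (SHA-256 over a key-sorted JSON serialization) and the helper names `hash_record` and `build_source_map` are illustrative assumptions; the issue does not specify them.

```python
import hashlib
import json


def hash_record(record: dict) -> str:
    """Return a SHA-256 hex digest of a record (assumed hashing scheme).

    Keys are sorted so that two records with the same content but a
    different key order produce the same hash.
    """
    serialized = json.dumps(record, sort_keys=True)
    return hashlib.sha256(serialized.encode("utf-8")).hexdigest()


def build_source_map(records: list[dict]) -> dict[str, str]:
    """Build an `id: source_hash` map keyed on each record's `identifier`."""
    return {record["identifier"]: hash_record(record) for record in records}


# Example records are made up for illustration.
records = [
    {"identifier": "a-1", "title": "First dataset"},
    {"title": "Second dataset", "identifier": "b-2"},
]
harvest_source_map = build_source_map(records)
```

Sorting keys before hashing is one way to keep the hash stable when a source reorders fields without changing their values.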

For the below operations assume a for loop over the harvest_source_map is in progress:

GIVEN the ID of the dataset is found in the CKAN catalog but the hash is not the same
THEN I want to add that dataset to the list of items to update (packages_to_update)
GIVEN the ID of the dataset is not found in the CKAN catalog
THEN I want to add that dataset to the list of items to create (packages_to_create)
GIVEN the ID of the dataset is found and the hash is the same
THEN I know the dataset is unchanged, and I can move on to the next record
GIVEN that all records in the harvest source have been traversed
THEN I know that the IDs which remain in the catalog hashmap (catalog_source_map) can be deleted from the CKAN catalog (packages_to_destroy)
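
The loop described above can be sketched as follows. This is only an illustration under stated assumptions: both maps are plain dicts of `id: source_hash`, the function name `compare` is hypothetical, and `packages_to_update` is an assumed name for the list of changed datasets.

```python
def compare(
    harvest_source_map: dict[str, str],
    catalog_source_map: dict[str, str],
) -> tuple[list[str], list[str], list[str]]:
    """Sort dataset IDs into create, update, and destroy lists."""
    packages_to_create = []
    packages_to_update = []
    # Work on a copy so the caller's catalog map is left untouched.
    remaining = dict(catalog_source_map)

    for dataset_id, source_hash in harvest_source_map.items():
        if dataset_id not in remaining:
            # Not in the catalog yet: create it.
            packages_to_create.append(dataset_id)
            continue
        if remaining[dataset_id] != source_hash:
            # In the catalog, but the hash differs: update it.
            packages_to_update.append(dataset_id)
        # Seen in the harvest source, so it must not be deleted.
        del remaining[dataset_id]

    # Whatever is left exists in the catalog but not in the harvest source.
    packages_to_destroy = list(remaining)
    return packages_to_create, packages_to_update, packages_to_destroy


# Example hashes are placeholders, not real digests.
harvest = {"a": "h1", "b": "h2-new", "c": "h3"}
catalog = {"a": "h1", "b": "h2-old", "d": "h4"}
to_create, to_update, to_destroy = compare(harvest, catalog)
```

With these inputs, `c` is new, `b` has changed, `a` is unchanged, and `d` is no longer in the harvest source.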

Background

[Any helpful contextual notes or links to artifacts/evidence, if needed]

Security Considerations (required)

[Any security concerns that might be implicated in the change. "None" is OK, just be explicit here!]

Sketch


rshewitt · 2023-12-27T20:36:06Z

compare branch: added the compare logic and a unit test. still need to add an integration test against a real ckan endpoint.

rshewitt · 2024-01-02T17:00:47Z

ckan adds fields to a dataset that don't, or may not, derive from the catalog itself (e.g. metadata_created defaults to utcnow, license_id appears to default to "notspecified" if unspecified, and information about the dataset organization is auto-populated. see picture)

if we intend to compare by hash we would need to remove everything ckan adds to ensure we're comparing accurately.
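
A rough sketch of what stripping CKAN-added fields before hashing could look like. The key list covers only the three fields named above and is certainly incomplete; the helper names and the SHA-256 over sorted JSON scheme are assumptions for illustration.

```python
import hashlib
import json

# Assumed, incomplete list: only the fields mentioned in this comment.
CKAN_ADDED_KEYS = {"metadata_created", "license_id", "organization"}


def strip_ckan_fields(package: dict) -> dict:
    """Drop top-level keys that CKAN populates on its own."""
    return {k: v for k, v in package.items() if k not in CKAN_ADDED_KEYS}


def hash_package(package: dict) -> str:
    """Hash a package after removing CKAN-added fields."""
    serialized = json.dumps(strip_ckan_fields(package), sort_keys=True)
    return hashlib.sha256(serialized.encode("utf-8")).hexdigest()


# Two made-up views of the same dataset that differ only in CKAN-added fields.
from_ckan = {
    "name": "example",
    "metadata_created": "2024-01-02T17:00:47",
    "license_id": "notspecified",
}
from_source = {"name": "example"}
same = hash_package(from_ckan) == hash_package(from_source)
```

This only handles top-level keys; anything CKAN rewrites inside nested structures would need separate handling.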

btylerburton · 2024-01-02T17:06:32Z

Alternately, we can hash the dataset prior to pushing to CKAN and store it in S3. Then we compare the incoming hash with the previously recorded one. This also allows us to bypass CKAN API for fetching the datasets and to control the amount of information that is hashed.

jbrown-xentity · 2024-01-02T17:08:03Z

Ah. So this becomes a sticky problem. Currently CKAN houses the original, raw metadata in the harvest_object table, and can report that on demand. The dataset has a link to that item in the harvest_object table. We'll need to recreate something similar, whereby we have CKAN (or S3 per Tyler's suggestion) store the original (maybe original but sorted?) metadata to compare the source against.

jbrown-xentity · 2024-01-02T18:22:45Z

After discussing with @rshewitt, we're going to move forward with putting the raw metadata into the catalog. There are a couple of reasons for this, namely that the raw metadata is often referenced and is currently available through the API. Since the transformations from CSDGM and/or ISO to DCAT-US are "lossy" (not all fields in CSDGM and ISO have an equivalent in DCAT-US), we want the raw metadata available to end users (it's not just for the harvesting process).
That being said, we now need to build a few components into the load function: tracking what "source" a dataset came from, and storing the raw DCAT-US JSON object as an extra. Then we can write a "CKAN extract", which pulls all datasets for a given source, hashes the raw object, and sends the result to the compare function. See diagram above for the full workflow. Tagging @btylerburton for awareness.

rshewitt · 2024-01-03T15:57:08Z

package search on dev wasn't working as intended for a recently added dataset. I used catalog-dev.data.gov as the ckan route. querying for something like the programCode returned no results when it should have returned something. @jbrown-xentity thinks there could be an issue with solr for that route. package creation has to be re-enabled via the nginx config; not sure if that contributes to this issue. @FuhuXia do you know what could be causing this?

rshewitt · 2024-01-03T16:10:29Z

changing the ckan route to catalog-dev-admin-datagov.app.cloud.gov fixed the issue.

rshewitt · 2024-01-03T22:05:07Z

pagination is possible using something like...

import requests

num_rows = 1000  # page size (rows per request)
count = 300000   # approximate upper bound on the number of datasets
for start in range(0, count, num_rows):
    url = f"https://catalog.data.gov/api/action/package_search?q=*:*&rows={num_rows}&start={start}"
    res = requests.get(url)
    # do something with the response

confirmed using catalog
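
A possible variation on the pagination above, sketched under assumptions: rather than hard-coding the total, it reads the `count` that `package_search` reports in its `result` and stops once every page has been fetched. It uses the standard library's `urllib` in place of `requests`, and the names `fetch_page` and `iter_packages` are hypothetical.

```python
import json
from typing import Callable, Iterator
from urllib.parse import urlencode
from urllib.request import urlopen

SEARCH_URL = "https://catalog.data.gov/api/action/package_search"


def fetch_page(start: int, rows: int) -> dict:
    """Fetch one page of search results and return the `result` object."""
    query = urlencode({"q": "*:*", "rows": rows, "start": start})
    with urlopen(f"{SEARCH_URL}?{query}", timeout=30) as res:
        return json.load(res)["result"]


def iter_packages(
    rows: int = 1000,
    fetch: Callable[[int, int], dict] = fetch_page,
) -> Iterator[dict]:
    """Yield every package, one page at a time, until `count` is reached."""
    start = 0
    while True:
        result = fetch(start, rows)
        yield from result["results"]
        start += rows
        # Stop on an empty page or once the reported total is covered.
        if not result["results"] or start >= result["count"]:
            break


# Usage (performs network requests, so it is left commented out):
# for package in iter_packages():
#     print(package["name"])
```

Passing `fetch` as a parameter keeps the paging logic testable without hitting the live catalog.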

btylerburton added the H2.0/Harvest-Runner Harvest Source Processing for Harvesting 2.0 label Dec 13, 2023

btylerburton added this to data.gov team board Dec 13, 2023

btylerburton changed the title ~~Create Initial Compare App for DCAT-US~~ Create Initial Compare Function for DCAT-US Dec 14, 2023

gujral-rei moved this to 📔 Product Backlog in data.gov team board Dec 14, 2023

gujral-rei moved this from 📔 Product Backlog to 📟 Sprint Backlog [7] in data.gov team board Dec 14, 2023

gujral-rei moved this from 📟 Sprint Backlog [7] to 📔 Product Backlog in data.gov team board Dec 14, 2023

gujral-rei moved this from 📔 Product Backlog to 📟 Sprint Backlog [7] in data.gov team board Dec 21, 2023

rshewitt self-assigned this Dec 21, 2023

rshewitt moved this from 📟 Sprint Backlog [7] to 🏗 In Progress [8] in data.gov team board Dec 21, 2023

rshewitt mentioned this issue Jan 3, 2024

Feature/compare GSA/datagov-harvester#28

Merged

4 tasks

rshewitt moved this from 🏗 In Progress [8] to 👀 Needs Review [2] in data.gov team board Jan 4, 2024

rshewitt moved this from 👀 Needs Review [2] to ✔ Done in data.gov team board Jan 4, 2024

btylerburton moved this from ✔ Done to 🗄 Closed in data.gov team board Jan 4, 2024

btylerburton closed this as completed Feb 16, 2024

github-project-automation bot moved this from 🗄 Closed to ✔ Done in data.gov team board Feb 16, 2024

btylerburton moved this from ✔ Done to 🗄 Closed in data.gov team board Feb 16, 2024
