Create Initial Compare Function for DCAT-US #4557
Comments
Compare branch: added compare logic and a unit test. Still need to add an integration test against a real CKAN endpoint.
CKAN adds things to a dataset which don't, or may not, derive from the catalog itself (e.g. …). If we intend to compare by hash, we would need to remove everything CKAN adds to ensure we're comparing accurately.
Alternatively, we can hash the dataset prior to pushing it to CKAN and store the hash in S3. Then we compare the incoming hash with the previously recorded one. This also allows us to bypass the CKAN API for fetching the datasets and to control the amount of information that is hashed.
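A minimal sketch of that S3-backed approach, assuming boto3 and a hypothetical bucket and `hashes/{id}` key layout (none of these names come from this issue; per the previous comment, CKAN-added fields would need stripping before hashing):

```python
import hashlib
import json

import boto3


def dataset_hash(dataset: dict) -> str:
    """Hash a dataset deterministically; sorted keys keep the digest stable."""
    # Assumes CKAN-added fields have already been stripped from `dataset`.
    canonical = json.dumps(dataset, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def record_hash(s3, bucket: str, dataset_id: str, digest: str) -> None:
    """Store the hash in S3 before pushing the dataset to CKAN."""
    s3.put_object(Bucket=bucket, Key=f"hashes/{dataset_id}", Body=digest.encode("utf-8"))


def has_changed(s3, bucket: str, dataset_id: str, dataset: dict) -> bool:
    """Compare the incoming hash with the previously recorded one."""
    try:
        obj = s3.get_object(Bucket=bucket, Key=f"hashes/{dataset_id}")
    except s3.exceptions.NoSuchKey:
        return True  # nothing recorded yet; treat as new/changed
    previous = obj["Body"].read().decode("utf-8")
    return dataset_hash(dataset) != previous


# Hypothetical usage:
#   s3 = boto3.client("s3")
#   has_changed(s3, "my-harvest-bucket", dataset["identifier"], dataset)
```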
Ah. So this becomes a sticky problem. Currently CKAN houses the original, raw metadata in the harvest_object table, and can report it on demand; the dataset has a link to that item in the harvest_object table. We'll need to recreate something similar, whereby we have CKAN (or S3, per Tyler's suggestion) store the original (maybe original-but-sorted?) metadata to compare the source against.
After discussing with @rshewitt, we're going to move forward with putting the raw metadata into the catalog. There are a couple of reasons for this, namely that it is often referenced, available, and used in the API currently. Since the transformations from CSDGM and/or ISO to DCAT-US are "lossy" (not all fields in CSDGM and ISO have an equivalent in DCAT-US), we want the raw metadata available to end-users (it's not just for the harvesting process).
Package search on dev wasn't working as intended for a recently added dataset. I used […]
Changing the CKAN route to […]
Pagination is possible using something like...

```python
import requests

num_rows = 1000   # page size per request
count = 300000    # approximate total number of datasets

for start in range(0, count, num_rows):
    url = f"https://catalog.data.gov/api/action/package_search?q=*:*&rows={num_rows}&start={start}"
    res = requests.get(url)
    # do something with the response
```

Confirmed using catalog.
User Story
In order to load test our compare solution, datagovteam wants to develop the initial iteration of our compare app functionality for DCAT-US.
Acceptance Criteria
[ACs should be clearly demoable/verifiable whenever possible. Try specifying them using BDD.]
GIVEN I have a DCAT-US harvest source that has been loaded into our Dynamic DAG ETL pipeline and extracted into individual records
THEN I would like to use a hashing function on each record and store that in an iterable map (harvest_source_map) in the form
`id: source_hash`
GIVEN I have loaded the same harvest source from the CKAN DB
THEN I would like to use the recorded result of the same hashing function stored as metadata in the CKAN record to create an iterable map (catalog_source_map) in the form
`id: source_hash`
For the operations below, assume a for loop over the harvest_source_map is in progress (see the sketch after these ACs):
GIVEN the ID of the dataset is found in the CKAN catalog but the hash is not the same
THEN I want to add that dataset to the list of items to update (packages_to_update)
GIVEN the ID of the dataset is not found in the CKAN catalog
THEN I want to add that dataset to the list of items to create (packages_to_create)
GIVEN the ID of the dataset is found and the hash is the same
THEN I know the dataset is unchanged, and I can move on to the next record
GIVEN that all records in the harvest source have been traversed
THEN I know that the IDs which remain in the catalog hash map can be deleted from the CKAN catalog (packages_to_destroy)
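Taken together, these ACs suggest a compare function along these lines; a minimal sketch (the map and list names follow the ACs above, everything else is illustrative, not the final design):

```python
def compare(harvest_source_map: dict, catalog_source_map: dict):
    """Diff a harvest source against the CKAN catalog by id and hash.

    Both maps are of the form {id: source_hash}.
    """
    packages_to_create = []
    packages_to_update = []

    # Copy the catalog map so matched ids can be popped as we traverse.
    remaining = dict(catalog_source_map)

    for dataset_id, source_hash in harvest_source_map.items():
        catalog_hash = remaining.pop(dataset_id, None)
        if catalog_hash is None:
            # Id not found in the CKAN catalog: create it.
            packages_to_create.append(dataset_id)
        elif catalog_hash != source_hash:
            # Id found but the hash differs: update it.
            packages_to_update.append(dataset_id)
        # Otherwise the dataset is unchanged; move on to the next record.

    # Ids left over exist only in the CKAN catalog and can be deleted.
    packages_to_destroy = list(remaining)
    return packages_to_create, packages_to_update, packages_to_destroy
```

Popping matched ids from a copy of the catalog map means whatever remains after the loop is exactly the delete set, which matches the final AC.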
Background
[Any helpful contextual notes or links to artifacts/evidence, if needed]
Security Considerations (required)
[Any security concerns that might be implicated in the change. "None" is OK, just be explicit here!]
Sketch