This repository has been archived by the owner on Sep 20, 2024. It is now read-only.

Scrape and harvest collections / sources - PROTOTYPE [P1(OCR)] #115

Closed
6 of 8 tasks
nightsh opened this issue Apr 14, 2020 · 0 comments · Fixed by #128

nightsh commented Apr 14, 2020

Depends on #114

Harvesting collections and sources depends on the schema validation allowing groups of different types and package relationships in the data.json source files.

The proposed spec for implementation is located here: Specs For Implementing data.json Validation Schema for Dept of Ed

Format

  • Scraping output
    • existing files (packages) go into a subdirectory called datasets
    • new subdirectories under the scraper output bucket: collections and sources
    • all the code iterating through output buckets needs to be adjusted
  • Datajson transformer
    • package level:
      • collections: new key, list of collection names
    • root level:
      • a new @type value for the collection type
        • same metadata as a package, same rules, plus:
          • sources: list of source names if the type is collection
          • collections: list of collection names if the type is source
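A minimal sketch of the format changes above, for illustration only. The subdirectory names (datasets, collections, sources) and the `collections` / `sources` keys come from this issue; the bucket root, the `.json` file naming, the helper names, and the "Collection" / "Source" `@type` strings are hypothetical placeholders, since the actual values are left to the tech spec.

```python
from pathlib import PurePosixPath

# Subdirectory names are taken from the issue text.
SUBDIRS = {"dataset": "datasets", "collection": "collections", "source": "sources"}

# Placeholder @type strings (assumption: the real values are defined in the spec).
TYPE_VALUES = {"collection": "Collection", "source": "Source"}


def output_path(bucket_root, kind, name):
    """Build the output path for a scraped item (file naming is an assumption)."""
    if kind not in SUBDIRS:
        raise ValueError(f"unknown kind: {kind!r}")
    return str(PurePosixPath(bucket_root) / SUBDIRS[kind] / f"{name}.json")


def package_entry(metadata, collection_names):
    """Package-level entry: existing metadata plus the new `collections` key."""
    entry = dict(metadata)
    entry["collections"] = list(collection_names)
    return entry


def root_entry(kind, metadata, related_names):
    """Root-level collection/source entry: package-like metadata plus one extra list."""
    if kind not in TYPE_VALUES:
        raise ValueError(f"unknown kind: {kind!r}")
    entry = dict(metadata)
    entry["@type"] = TYPE_VALUES[kind]
    # A collection lists its sources; a source lists its collections.
    extra_key = "sources" if kind == "collection" else "collections"
    entry[extra_key] = list(related_names)
    return entry
```

For example, `output_path("bucket", "collection", "some-page")` would place the file under `bucket/collections/`, which is the kind of path the bucket-iterating code would need to handle.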

Scraping rules:

  • If an HTML page contains multiple datasets -> extract the page itself as a collection
  • If an HTML page contains no datasets, but has multiple links to pages that are collections -> extract the page as a source
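The two rules above can be sketched as a small decision function. This is a hedged illustration, not the scraper's actual code: the function name and its inputs are hypothetical, it assumes the counts of datasets and collection links have already been extracted from the page, and it reads "multiple" as "more than one".

```python
from typing import Optional


def classify_page(dataset_count: int, collection_link_count: int) -> Optional[str]:
    """Classify a page as 'collection', 'source', or neither (None)."""
    # Rule 1: a page holding multiple datasets is itself a collection.
    if dataset_count > 1:
        return "collection"
    # Rule 2: a page with no datasets but multiple links to collection pages is a source.
    if dataset_count == 0 and collection_link_count > 1:
        return "source"
    # Anything else is not covered by the rules in this issue.
    return None
```

Pages with exactly one dataset, or with a single collection link, fall through to `None` here; how those should be handled is not stated in the issue.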

CKAN extensions updates:

The datajson extension needs Collection / Source processing capabilities based on the data it finds in the data.json file.
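As a rough sketch of what that processing might start with, the snippet below partitions data.json entries by their `@type` so that collections and sources can be handled separately from regular packages. It does not use any CKAN API; the function name and the "Collection" / "Source" type strings are hypothetical placeholders, and entries with any other `@type` are assumed to be ordinary datasets.

```python
from collections import defaultdict

# Placeholder @type strings (assumption: the real values come from the spec).
TYPE_TO_KIND = {"Collection": "collection", "Source": "source"}


def split_entries(entries):
    """Group data.json entries into 'dataset', 'collection' and 'source' lists."""
    grouped = defaultdict(list)
    for entry in entries:
        # Unknown or missing @type values are treated as plain datasets.
        kind = TYPE_TO_KIND.get(entry.get("@type"), "dataset")
        grouped[kind].append(entry)
    return dict(grouped)
```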

Tasks:

  • Finish up the tech spec
  • Implement the scraping output changes
  • Implement the scraping rules for Collection / Source
  • Add the new items to the datajson schema we are using
  • Load a datajson containing collections and sources into a harvester source and test

Acceptance criteria:

@osahon-okungbowa osahon-okungbowa changed the title [stub] Scrape and harvest collections / sources Scrape and harvest collections / sources- PROTOTYPE P1(OCR) May 3, 2020
@osahon-okungbowa osahon-okungbowa self-assigned this May 3, 2020
@osahon-okungbowa osahon-okungbowa changed the title Scrape and harvest collections / sources- PROTOTYPE P1(OCR) Scrape and harvest collections / sources - PROTOTYPE [P1(OCR)] May 3, 2020