diff --git a/README.MD b/README.MD
index dea32e4..6742244 100644
--- a/README.MD
+++ b/README.MD
@@ -1,23 +1,124 @@
-# Curation-Pipeline
+# Curation pipeline for CORD19 - Allen Institute
+-------------
-Where we receive one or more PDFs, extract the figures and captions (`PDFigCapX`), split the figures into subfigures (`FigSplit`) and store the information in the curation database. While database stores the metadata, we store the PDF, figures, subfigures and extraction logs are stored in an user-determined output folder. For the curation front-end purposes, this locations should be the folder serving the static files.
+This version of the Curation pipeline takes as input the metadata provided in [CORD19](https://ai2-semanticscholar-cord-19.s3-us-west-2.amazonaws.com/historical_releases.html) by the [Allen Institute for AI](https://www.semanticscholar.org/paper/CORD-19%3A-The-Covid-19-Open-Research-Dataset-Wang-Lo/4a10dffca6dcce9c570cb75aa4d76522c34a2fd4).
-## Dependencies
-`PDFigCapX` uses Selenium, Xpdf command line tools, and ImageMagick to extract the document content. Therefore, the environment should have installed a chrome-driver, and Xpdf/bin64 and ImageMagick/convert.exe should be available on a given location (`DEPENDENCIES`).
+## Directory structure
++ **input**: If you are provided with a set of papers to filter the collection by, e.g. a list of CORD UIDs, copy it here so the next stages can take it into account.
++ **log**: All log and tracker files generated to support the pipeline.
++ **output**: Files generated by the different pipeline stages.
++ **src**: Python source code
+  + **(default)**: Main wrappers for download and PDF processing
+  + **image_extraction**: Core functionality of the CORD19 data processing
+    + **init.config**: The configuration file containing the global variables used throughout the pipeline, e.g. home_dir, number of processors, tracker file paths, etc. (see the configuration sketch below)
+  + **test**: Different test scenarios using the wrappers, e.g. a single directory, from a file, parallel processes, etc.
-To store the databse content and create a task on the curation system, we need to specify the endpoints for those services on the configuration file (config.json) or through command line arguments (see `main.py`).
-We wrote the pipeline originally in Python 2.7 but then migrated most of the content to Python 3.X. However, `PDFigCapX` binaries (in the compiled folder) still use Python 2.7. We integrated these components using `execnet` but hopefully we will have a native Python 3.X soon. Finally, `PDFigCapX` relies on opencv 2.4.X which is no longer available as a PiPy package. We solved this problem by building opencv from source. For more details about all the dependencies setup, please refer to the Dockerfile for Ubuntu 18.04.
+## Pipeline stages
+------------------------
+The pipeline consists of a series of Python modules that extract image and text data (image captions) across several stages, each consuming and generating input/output files.
-## Running from command line (legacy)
+![pipeline overview](images/pipeline.png)
-1. Place the PDFs documents on a folder.
-2. Locate your output folder. **Limitation:** the pipeline skips a document if the output folder already contains a document with the same name.
-3. Set up `config.json` with the location of binaries and endpoints. Also, indicate the `groupname` and `organization` to apply the round-robin strategy for task delegation.
-4. Execute the following commands. **MAX_NUMBER_DOCS** indicates the maximum number of documents to process at this time; if you need to process all the documents, enter a big number.
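+All stages read their paths and settings from **_init.config_**. A rough sketch of how a script can load it, mirroring what `preprocess_data.py` does with `configparser` and extended interpolation (the key names are taken from the config file shown later in this change; adjust the path to wherever you run from):
+```python
+import configparser
+from configparser import ExtendedInterpolation
+
+# ${home_dir}-style references inside init.config are expanded by ExtendedInterpolation
+config = configparser.ConfigParser(interpolation=ExtendedInterpolation())
+config.read('src/image_extraction/init.config')
+
+pmcid_file = config['DEFAULT']['pmcid_file']                         # {home}/input/PMCIDS.csv
+processors = int(config['Download']['download_number_processors'])   # parallel workers for Stage 1
+print(pmcid_file, processors)
+```
+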
+#### Stage 0: Preprocessing
+There are two main scenarios:
+1. **Metadata as input**:
+    This is the straightforward method, where we use the CORD19 metadata as-is. Simply copy _metadata.csv_ from the extracted collection to the input directory under the home path.
+    ```sh
+    $ cd src/image_extraction
+    $ python3 Cleaning.py
+    ```
+
+2. **Additional filter file**:
+    Here we are provided with an additional file to work with, namely _cord_uid.csv_, which contains a list of unique paper IDs. The idea is to match these IDs against _metadata.csv_ and keep the corresponding records.
+    Copy _cord_uid.csv_ to the input directory under the home path.
+    ```sh
+    $ cd src/image_extraction/preprocessing/download/
+    $ python3 preprocess_data.py
+    ```
+Important variables to consider in the config file:
+```sh
+  metadata_file
+  home_dir
+  input_dir
+  pmcid_file
+  cord_uid_file
+```
+
+**Output**: In either case, a PMCIDS.csv file is generated in the input directory; it is used by the next stages.
+
+
+#### Stage 1: Download
+**Goal**: Take the PMCIDS.csv file and iterate over the IDs, requesting each paper's compressed archive from the NCBI (1) FTP server and extracting only the PDF files.
+
+(1) PubMed Central® (PMC) at the National Center for Biotechnology Information
+
+**Main Classes**:
+ - image_extraction/Master_Download.py
+ - image_extraction/Download.py
+ - DownloadPaper.py
+
+To use parallel processing, update the number of processors in the **_init.config_** file if needed.
+
+This implementation supports incremental processing: if the process stops at a certain point, you can resume it and pick up the remaining records. This is achieved through the tracking files.
+```sh
+  $ cd src/image_extraction
+  $ python3 Master_Download.py
+```
+It will print the PMCIDs as they are processed, with times and counters.
+
+**Output**:
+ - **_download_track.csv_** records each processed PMCID together with its success or error status. It is located in the {home}/log directory.
+ - Every downloaded PDF is stored in a folder named after its PMCID, e.g. **PMC102030**/main.pdf, under {home}/output.
+
+
+#### Stage 2: Image Extraction
+**Goal**: The downloaded PDF is the input for this stage. This component extracts the images and captions from each paper's PDF file.
+
+**Main Classes**:
+ - image_extraction/Master_Extract.py
+ - image_extraction/Extract.py
+ - PDFigCapX.py
+
+This step can also be executed in parallel processes; see the number of processors in **_init.config_**. Like the previous stage, extraction supports incremental processing through the tracking files.
+
+```sh
+  $ cd src/image_extraction
+  $ python3 Master_Extract.py
 ```
-Xvfb :99 & export DISPLAY=:99 # run chromedriver headless
-python pipeline_runner.py /route/to/config.json /route/to/input/folder /route/to/output/folder MAX_NUMBER_DOCS
+It will print the PMCIDs as they are processed, with times and counters.
+
+**Output**:
+ - **_extract_track.csv_** records each processed PMCID together with its success or error status. It is located in the {home}/log directory.
+ - Each PMCID folder will now also contain a directory with the extracted images.
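+
+The tracking files also make it easy to check how far a run has progressed. A minimal, illustrative sketch (not part of the repository) for summarising one of the *_track.csv files, assuming the PMCID is in the first column and the status code in the second, with 'E' marking an error as in DownloadPaper.py:
+```python
+import csv
+
+# Hypothetical helper: summarise a tracking file such as {home}/log/extract_track.csv.
+# Assumed row layout: pmcid, status ('E' = error), message.
+def summarize_track(track_path):
+    done, failed = set(), set()
+    with open(track_path) as fh:
+        for row in csv.reader(fh):
+            if not row:
+                continue
+            (failed if row[1] == 'E' else done).add(row[0])
+    return done, failed
+
+done, failed = summarize_track('/workspace/allen/Allen-Collection-Curation/log/extract_track.csv')
+print(len(done), 'succeeded,', len(failed), 'errors')
+```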
+
+
+#### Stage 3: Image Segmentation
+**Goal**: Once the individual images have been extracted from a paper, those that are compound figures need to be segmented to obtain the subfigures.
+
+**Main Classes**:
+ - image_extraction/Master_Split.py
+ - image_extraction/Split.py
+ - FigSplitWrapper.py
+
+This step can also be executed in parallel processes; see the number of processors in **_init.config_**. Like the previous stages, splitting supports incremental processing through the tracking files.
+
+```sh
+  $ cd src/image_extraction
+  $ python3 Master_Split.py
 ```
+It will print the PMCIDs as they are processed, with times and counters.
+
+**Output**:
+ - **_split_track.csv_** records each processed PMCID together with its success or error status. It is located in the {home}/log directory.
+ - Each PMCID folder will now also contain a new directory (figsplit_*) with the sub-images.
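+
+Once Stage 0 has produced _PMCIDS.csv_, the three stages above can be run back to back with the same commands shown in each stage section:
+```sh
+  $ cd src/image_extraction
+  $ python3 Master_Download.py
+  $ python3 Master_Extract.py
+  $ python3 Master_Split.py
+```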
\ No newline at end of file
diff --git a/images/pipeline.png b/images/pipeline.png
new file mode 100644
index 0000000..df16cdd
Binary files /dev/null and b/images/pipeline.png differ
diff --git a/src/DownloadPaper.py b/src/DownloadPaper.py
index 0738b36..75b19d0 100644
--- a/src/DownloadPaper.py
+++ b/src/DownloadPaper.py
@@ -76,7 +76,7 @@ def download_and_extract(self, url, _id):
                 self.__track(str(_id), 'E', 'No PDF in the compressed file')
         except Exception as e:
             self.__track(str(_id), 'E', 'Exception {0}'.format(str(e)))
-            self.__logError('Exception {0}'.format(str(e)))
+            self.__logError(str(_id),'Exception {0}'.format(str(e)))
 
     def download_batch_ids(self, ids , output_dir):
         for id in ids:
diff --git a/src/image_extraction/init.config b/src/image_extraction/init.config
index 5771c1e..0d9ea0e 100644
--- a/src/image_extraction/init.config
+++ b/src/image_extraction/init.config
@@ -3,11 +3,13 @@
 metadata_file: /workspace/allen/dataset/2020-10-15/metadata.csv
 home_dir: /workspace/allen/Allen-Collection-Curation
 output_dir: ${home_dir}/output
 log_dir: ${home_dir}/log
-pmcid_file: ${log_dir}/PMCIDS.csv
+input_dir: ${home_dir}/input
+pmcid_file: ${input_dir}/PMCIDS.csv
+cord_uid_file: ${input_dir}/cord_uid.csv
 delta_diff_file: ${log_dir}/diff_PMCID.csv
 
 [Download]
-download_number_processors: 10
+download_number_processors: 20
 download_track_file: ${DEFAULT:log_dir}/download_track.csv
 
 [Image Extraction]
diff --git a/src/image_extraction/preprocessing/download/preprocess_data.py b/src/image_extraction/preprocessing/download/preprocess_data.py
index 6b3fc4c..8584c2f 100644
--- a/src/image_extraction/preprocessing/download/preprocess_data.py
+++ b/src/image_extraction/preprocessing/download/preprocess_data.py
@@ -1,13 +1,24 @@
 import pandas as pd
-METADATA_PATH = "/workspace/allen/dataset/2020-10-15/metadata.csv"
-CORDUID_PATH = "/workspace/allen/Allen-Collection-Curation/input/cord_uid.csv"
+import configparser
+from configparser import ConfigParser, ExtendedInterpolation
+
+config = configparser.ConfigParser(interpolation=ExtendedInterpolation())
+config.read('../../init.config')
+
+METADATA_PATH = config['DEFAULT']['metadata_file']
+CORDUID_PATH = config['DEFAULT']['cord_uid_file']
+PMCID_PATH = config['DEFAULT']['pmcid_file']
 
 metadata = pd.read_csv(METADATA_PATH, low_memory=False)
 corduids = pd.read_csv(CORDUID_PATH, low_memory=False)
 
-#Get document not in the file provided
-pmcids = metadata[~ metadata.cord_uid.isin(corduids.iloc[:,0])]['pmcid']
+# Keep the metadata rows whose cord_uid is listed in cord_uid.csv
+pmcids = metadata[metadata.cord_uid.isin(corduids.iloc[:,0])]['pmcid']
+
+# Previous behaviour: keep the papers NOT listed in cord_uid.csv
+#pmcids = metadata[~ metadata.cord_uid.isin(corduids.iloc[:,0])]['pmcid']
+
 #Filter those with pmcid
 pmcids = pmcids[~pmcids.isna()]
-pmcids.to_csv (r'../../PMCIDS.csv', index = False, header=False)
\ No newline at end of file
+pmcids.to_csv (PMCID_PATH, index = False, header=['pmcid'])
\ No newline at end of file
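
For reference, the selection change in preprocess_data.py works as follows: the removed line kept the papers whose cord_uid was NOT listed in cord_uid.csv, whereas the new line keeps only the papers that ARE listed, before dropping rows without a pmcid. A small illustrative pandas snippet (toy data, not from the repository):

```python
import pandas as pd

# Toy frames, purely illustrative
metadata = pd.DataFrame({'cord_uid': ['a1', 'b2', 'c3'],
                         'pmcid': ['PMC1', None, 'PMC3']})
corduids = pd.DataFrame({'cord_uid': ['a1', 'c3']})

old = metadata[~metadata.cord_uid.isin(corduids.iloc[:, 0])]['pmcid']  # previous selection: only b2
new = metadata[metadata.cord_uid.isin(corduids.iloc[:, 0])]['pmcid']   # new selection: a1 and c3
new = new[~new.isna()]                                                 # rows without a pmcid are dropped
print(list(new))  # ['PMC1', 'PMC3']
```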