The `DataProcessingPipeline` is the core module for orchestrating the components involved in processing data, calculating similarity scores, and generating expert assignments for the Reviewer Matcher system.
This pipeline integrates multiple stages, including:
- Data Loading: Loading and preprocessing data for projects, experts, and publications.
- Metadata Enrichment: Adding contextual information like MeSH terms and summaries.
- Similarity Calculation: Generating similarity scores between projects and experts.
- Expert Ranking: Ranking experts for projects based on multiple criteria.
- Expert Assignment: Assigning experts to projects based on rankings and constraints.
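Taken together, these stages run end to end from a single entry point. As a quick orientation, here is a minimal sketch of a full run; the construction of `config_manager` is illustrative, since its class depends on your project setup:

```python
from data_processing_pipeline import DataProcessingPipeline

# A configuration manager object is required; how you construct it
# depends on your setup (shown here as a placeholder).
config_manager = ...

# Run every stage, from data loading through expert assignment.
pipeline = DataProcessingPipeline(config_manager)
pipeline.run_pipeline()
```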
The pipeline constructor accepts the following parameters:

- `config_manager`: A configuration manager object for handling settings.
- `call` (optional): Call-specific settings to override defaults.
- `all_components` (optional): A list of components to execute in the pipeline.
- `test_mode` (default: `False`): Run the pipeline in test mode with a reduced dataset.
- `test_number` (default: `10`): Number of rows to process in test mode.
- `force_recompute` (default: `False`): Recompute data even if existing outputs are available.
For example:

```python
from data_processing_pipeline import DataProcessingPipeline

# Initialize in test mode, processing only 5 rows.
pipeline = DataProcessingPipeline(config_manager, test_mode=True, test_number=5)
```
You can run the entire pipeline with all components included:
```python
pipeline.run_pipeline()
```
You can specify a list of components to run, or exclude specific ones:
```python
components_to_run = ["project_classification", "similarity_computation"]
pipeline.run_pipeline(components=components_to_run)
```
Exclude components by using the `exclude` parameter:

```python
components_to_exclude = ["publication_data_loading"]
pipeline.run_pipeline(exclude=components_to_exclude)
```
The pipeline supports the following components, which can be run individually:

- `project_data_loading`: Loads project data.
- `expert_data_loading`: Loads expert data.
- `publication_data_loading`: Loads publication data.
- `project_classification`: Classifies projects with research areas and approaches.
- `project_summarization`: Summarizes project content.
- `project_mesh_tagging`: Tags projects with MeSH terms.
- `publication_summarization`: Summarizes publication content.
- `publication_mesh_tagging`: Tags publications with MeSH terms.
- `similarity_computation`: Computes similarity scores between experts and projects.
- `expert_ranking`: Ranks experts based on similarity scores.
- `expert_assignment`: Assigns experts to projects.
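For instance, the data loading and metadata enrichment components can be run together before any similarity work; this is a sketch using the component names above, not a prescribed grouping:

```python
# Load all source data, then enrich it with summaries and MeSH terms.
preprocessing = [
    "project_data_loading",
    "expert_data_loading",
    "publication_data_loading",
    "project_summarization",
    "project_mesh_tagging",
    "publication_summarization",
    "publication_mesh_tagging",
]
pipeline.run_pipeline(components=preprocessing)
```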
Run the pipeline to compute similarity scores between experts and projects:
```python
components_to_run = ["similarity_computation"]
pipeline.run_pipeline(components=components_to_run)
```
Run the pipeline to assign experts to projects:
```python
components_to_run = ["expert_assignment"]
pipeline.run_pipeline(components=components_to_run)
```
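Since assignments are based on rankings (see the stage list above), it can be convenient to run the three matching components in one call; a sketch:

```python
# Compute similarities, rank experts, then produce assignments.
matching_stage = ["similarity_computation", "expert_ranking", "expert_assignment"]
pipeline.run_pipeline(components=matching_stage)
```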
- The `test_mode` flag reduces the dataset size for faster execution during testing or debugging.
- Use the `force_recompute` flag to recompute data even if pre-existing outputs are found.
- All intermediate outputs are saved to directories specified in the configuration.
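For example, to ignore any cached outputs and recompute from scratch:

```python
# Recompute all outputs, even if results from a previous run exist.
pipeline = DataProcessingPipeline(config_manager, force_recompute=True)
pipeline.run_pipeline()
```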
This documentation provides a comprehensive guide to understanding and using the `DataProcessingPipeline` effectively.