Preprocessing Pipeline

This pipeline reads XML files from the official Stack Exchange data dump and extracts normalized text blocks into JSONL files.
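As a rough illustration of the XML-to-JSONL step described above, here is a minimal sketch. It is not the package's actual implementation. It assumes the `Posts.xml` layout of the Stack Exchange dump, where each post is a `<row>` element with `Id` and `Body` attributes. The function names `normalize` and `posts_to_jsonl`, the output keys `id` and `text`, and the normalization rules (strip HTML tags, unescape entities, collapse whitespace) are hypothetical choices made for this example.

```python
import html
import io
import json
import re
import xml.etree.ElementTree as ET

# Hypothetical normalization rules; the real pipeline may differ.
TAG_RE = re.compile(r"<[^>]+>")
WS_RE = re.compile(r"\s+")


def normalize(body):
    """Strip HTML tags, unescape entities, and collapse whitespace."""
    text = TAG_RE.sub(" ", body)
    text = html.unescape(text)
    return WS_RE.sub(" ", text).strip()


def posts_to_jsonl(xml_source, out):
    """Stream <row> elements and write one JSON object per line.

    xml_source: a path or binary file object containing the XML.
    out: a text file object that receives the JSONL output.
    Returns the number of records written.
    """
    count = 0
    for _, elem in ET.iterparse(xml_source, events=("end",)):
        # Assumption: rows without a Body attribute are skipped.
        if elem.tag == "row" and elem.get("Body"):
            record = {"id": elem.get("Id"), "text": normalize(elem.get("Body"))}
            out.write(json.dumps(record, ensure_ascii=False) + "\n")
            count += 1
        elem.clear()  # release memory held by already-processed elements
    return count
```

Streaming with `iterparse` avoids loading a whole dump file into memory, which matters because the larger Stack Exchange sites ship multi-gigabyte XML files.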

To run the pipeline in Google Cloud, set the following environment variable so that it points to your service account key file:

export GOOGLE_APPLICATION_CREDENTIALS="$PWD/google-cloud-key.json"

Before running the pipeline, install the preprocessing_pipeline package:

python3 setup.py install

Then, you can run the pipeline:

preprocessing-pipeline --config_file "$PWD/config.json"
