Preprocessing pipeline to extract and normalize text/code blocks from Stack Exchange forum posts and comments.

sotorrent/preprocessing-pipeline

Preprocessing Pipeline

This pipeline reads XML files from the official Stack Exchange data dump and extracts normalized text blocks into JSONL files.
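The XML-to-JSONL step described above can be sketched roughly as follows. This is a minimal illustration, not the pipeline's actual implementation: it assumes the `<row Id="..." Body="..."/>` layout used by `Posts.xml` in the Stack Exchange data dump, and the splitting rule (`<pre><code>` sections become code blocks, everything else becomes text) as well as the output field names `post_id`, `blocks`, `type`, and `content` are hypothetical.

```python
import html
import io
import json
import re
import xml.etree.ElementTree as ET

# Assumption: code blocks appear as <pre><code>...</code></pre> in the post body.
CODE_RE = re.compile(r"<pre[^>]*><code[^>]*>(.*?)</code></pre>", re.DOTALL)
TAG_RE = re.compile(r"<[^>]+>")


def normalize_text(fragment):
    """Strip HTML tags, unescape entities, and collapse whitespace."""
    return " ".join(html.unescape(TAG_RE.sub(" ", fragment)).split())


def split_blocks(body):
    """Split an HTML post body into an ordered list of text and code blocks."""
    blocks = []
    pos = 0
    for match in CODE_RE.finditer(body):
        text = normalize_text(body[pos:match.start()])
        if text:
            blocks.append({"type": "text", "content": text})
        code = html.unescape(match.group(1)).strip("\n")
        blocks.append({"type": "code", "content": code})
        pos = match.end()
    tail = normalize_text(body[pos:])
    if tail:
        blocks.append({"type": "text", "content": tail})
    return blocks


def posts_to_jsonl(xml_source, out):
    """Stream <row> elements from xml_source and write one JSON object per line."""
    for _, elem in ET.iterparse(xml_source, events=("end",)):
        if elem.tag == "row":
            record = {
                "post_id": int(elem.get("Id")),
                "blocks": split_blocks(elem.get("Body", "")),
            }
            out.write(json.dumps(record) + "\n")
            elem.clear()  # release the row's data once it has been written
```

For example, `posts_to_jsonl("Posts.xml", open("posts.jsonl", "w"))` would stream a local dump file into a JSONL file; both file names are placeholders.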

To run the pipeline in Google Cloud, set the following environment variable to the path of your Google Cloud service account key file:

export GOOGLE_APPLICATION_CREDENTIALS="$PWD/google-cloud-key.json"
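A missing or mistyped key path tends to surface only once a Google Cloud client is created, so it can help to check the variable before starting a run. The helper below is a hypothetical sketch and not part of the preprocessing_pipeline package; the name `resolve_credentials` is an assumption made for illustration.

```python
import os
from pathlib import Path


def resolve_credentials(environ=os.environ):
    """Return the key file path from GOOGLE_APPLICATION_CREDENTIALS or raise."""
    path = environ.get("GOOGLE_APPLICATION_CREDENTIALS")
    if not path:
        raise RuntimeError("GOOGLE_APPLICATION_CREDENTIALS is not set")
    key_file = Path(path)
    if not key_file.is_file():
        raise FileNotFoundError(f"credentials file not found: {key_file}")
    return key_file
```

Calling `resolve_credentials()` at the top of a wrapper script fails fast with a readable message instead of a later authentication error.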

Next, install the preprocessing_pipeline package:

python3 setup.py install

Then run the pipeline, passing the path to your configuration file:

preprocessing-pipeline --config_file "$PWD/config.json"
