This is the repository for the code and data of the VLDB 2023 paper: RECA: Related Tables Enhanced Column Semantic Type Annotation Framework.
$ git clone [link to repo]
$ cd RECA
$ pip install -r requirements.txt
If you are using Anaconda, you can create a virtual environment and install all the packages:
$ conda create --name RECA python=3.7
$ conda activate RECA
$ pip install -r requirements.txt
To reproduce the results on the Semtab2019 dataset, go through the following steps:
- Remove the init files in all the directories (they exist only as placeholders on GitHub).
- Download the pre-trained models and pre-processed data; check the instructions in checkpoints and jsonl_data for details.
- Tokenize the jsonl data by following the suggestions in pre-process. (Alternatively, you can download the raw dataset from here and pre-process the data from scratch; see pre-process for detailed instructions.)
- Run the experiments: either load the pre-trained models and run RECA-semtab-test-from-pre-trained.py, or train from scratch by running RECA-semtab-train+test.py.
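The pre-processed tables ship as JSON Lines (jsonl) files, i.e. one JSON object per line. The per-record schema is repo-specific; a minimal generic loader sketch:

```python
import json

def load_jsonl(path):
    """Read a JSON Lines file: one JSON object per non-blank line."""
    records = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines
                records.append(json.loads(line))
    return records
```

The field names inside each record depend on the pre-processing scripts; inspect a downloaded jsonl file before relying on any particular key.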
To reproduce the results on the WebTables dataset, go through the following steps:
- Remove the init files in all the directories (they exist only as placeholders on GitHub).
- Download the tokenized data or the raw dataset; check the instructions in pre-process.
- If you want to start from the raw dataset, pre-process it by following the steps described in 'Start from scratch' in pre-process; you can skip this step if you use the tokenized data directly.
- Run RECA-webtables-train.py in the experiment folder to start training.
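The WebTables pipeline filters related tables by Jaccard distance (compute_jaccard.py). The repo's exact implementation is not reproduced here; as a generic sketch, the metric over two collections of cell values is:

```python
def jaccard_distance(a, b):
    """Jaccard distance between two collections of cell values:
    1 - |intersection| / |union|, with 0.0 for two empty collections."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)
```

A distance of 0.0 means identical value sets, 1.0 means fully disjoint; the pipeline keeps tables whose distance to the target table falls under a threshold.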
RECA/
├── Semtab
│   ├── checkpoints
│   ├── data
│   │   ├── distance-files (stores the edit distances)
│   │   ├── json (stores the base json files that contain the table content)
│   │   ├── jsonl_data (stores the pre-processed table data)
│   │   ├── raw_data (the raw table dataset)
│   │   └── tokenized_data (the tokenized data used for the experiments)
│   ├── experiment
│   │   ├── semtab_labels.json (class types)
│   │   ├── RECA-semtab-test-from-pre-trained.py (run this file to directly reproduce the results)
│   │   └── RECA-semtab-train+test.py (trains the model from scratch)
│   └── pre-process
│       ├── transform_to_json.py (generates base json files from the raw data)
│       ├── NER_extraction.py (NER tagging)
│       ├── pre-process.py (table finding and alignment)
│       ├── make_json_input.py (generates jsonl data)
│       ├── jaccard_filterjson.py (table filtering)
│       └── semtab-datasets.py (generates tokenized data)
├── WebTables
│   ├── checkpoints
│   ├── data
│   │   ├── distance-files (stores the edit distances)
│   │   ├── json (stores the base json files that contain the table content)
│   │   ├── jaccard (stores the jaccard distances)
│   │   ├── webtables (the raw table dataset)
│   │   ├── out (stores the pre-processed table data)
│   │   └── tokenized_data (the tokenized data used for the experiments)
│   ├── experiment
│   │   ├── label_dict.json (class types)
│   │   └── RECA-webtables-train.py (trains the model from scratch)
│   └── pre-process
│       ├── compute_jaccard.py (computes the jaccard distance between tables)
│       ├── pre-process-webtables.py (table finding, alignment, filtering, and json file generation)
│       └── webtables-datasets.py (generates tokenized data)
└── requirements.txt
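Both pipelines also cache edit distances (the distance-files directories). The repo's exact computation is not shown here; a standard Levenshtein distance sketch, using two-row dynamic programming, is:

```python
def edit_distance(s, t):
    """Levenshtein edit distance between strings s and t,
    computed with a rolling two-row dynamic-programming table."""
    m, n = len(s), len(t)
    prev = list(range(n + 1))  # distances from s[:0] to every prefix of t
    for i in range(1, m + 1):
        cur = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            cur[j] = min(prev[j] + 1,        # deletion
                         cur[j - 1] + 1,     # insertion
                         prev[j - 1] + cost) # substitution / match
        prev = cur
    return prev[n]
```

Smaller distances indicate more similar strings; like the Jaccard step, this is used to rank candidate related tables.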
@article{sun2023reca,
title={Reca: Related tables enhanced column semantic type annotation framework},
author={Sun, Yushi and Xin, Hao and Chen, Lei},
journal={Proceedings of the VLDB Endowment},
volume={16},
number={6},
pages={1319--1331},
year={2023},
publisher={VLDB Endowment}
}