RECA

This repository contains the code and data for the VLDB 2023 paper: RECA: Related Tables Enhanced Column Semantic Type Annotation Framework.

Install

$ git clone [link to repo]
$ cd RECA
$ pip install -r requirements.txt

If you are using Anaconda, you can create a virtual environment and install all the packages:

$ conda create --name RECA python=3.7
$ conda activate RECA
$ pip install -r requirements.txt

Reproduce the results

To reproduce the results on the Semtab2019 dataset, follow these steps:

Remove the init files from all directories (they exist only as placeholders on GitHub).
Download the pre-trained models and pre-processed data; see the instructions in checkpoints and jsonl_data for details.
Tokenize the jsonl data by following the instructions in pre-process. (Alternatively, you can download the raw dataset from here and pre-process the data from scratch; see pre-process for detailed instructions.)
Run the experiments: either load the pre-trained models and run RECA-semtab-test-from-pre-trained.py, or train from scratch by running RECA-semtab-train+test.py.
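
The first step above can be scripted instead of done by hand. The following is a minimal sketch, not part of the repository: it assumes the placeholder files are literally named init (the README does not state the exact file name, so pass a different name if needed), and the helper names find_placeholders and remove_placeholders are hypothetical. It defaults to a dry run so nothing is deleted until you have checked the list.

```python
from pathlib import Path
from typing import List


def find_placeholders(root: str, name: str = "init") -> List[Path]:
    """Return every regular file under `root` whose file name equals `name`.

    The default name "init" is an assumption; adjust it to match the
    placeholder files actually present in your checkout.
    """
    return sorted(p for p in Path(root).rglob(name) if p.is_file())


def remove_placeholders(root: str, name: str = "init", dry_run: bool = True) -> List[Path]:
    """Return the matching placeholder files, deleting them unless `dry_run` is True."""
    matches = find_placeholders(root, name)
    for path in matches:
        if not dry_run:
            path.unlink()
    return matches


if __name__ == "__main__":
    # Dry run from the repository root: only prints what would be deleted.
    for path in remove_placeholders(".", name="init", dry_run=True):
        print(path)
```

Once the printed list looks right, call remove_placeholders(".", dry_run=False) to delete the files.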

To reproduce the results on the WebTables dataset, follow these steps:

Remove the init files from all directories (they exist only as placeholders on GitHub).
Download the tokenized data or the raw dataset; see the instructions in pre-process.
If you want to start from the raw dataset, pre-process it by following the steps described in 'Start from scratch' in pre-process. You can skip this step if you use the tokenized data directly.
Run the experiment file RECA-webtables-train.py in the experiment folder to start training.
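
Both procedures assume that the folders listed under Repository Structure are in place before an experiment script runs. A minimal sketch of a pre-flight check follows. The helper missing_paths and the EXPECTED table are illustrative and not part of the repository; the paths themselves are taken from the structure listed in this README, and which of them a given script actually reads is an assumption.

```python
from pathlib import Path
from typing import Dict, List

# Paths copied from the Repository Structure section of this README.
# Which ones each experiment script needs is an assumption of this sketch.
EXPECTED: Dict[str, List[str]] = {
    "Semtab": [
        "Semtab/checkpoints",
        "Semtab/data/jsonl_data",
        "Semtab/data/tokenized_data",
        "Semtab/experiment/RECA-semtab-test-from-pre-trained.py",
    ],
    "WebTables": [
        "WebTables/checkpoints",
        "WebTables/data/tokenized_data",
        "WebTables/experiment/RECA-webtables-train.py",
    ],
}


def missing_paths(root: str, dataset: str) -> List[str]:
    """Return the expected paths for `dataset` that do not exist under `root`."""
    base = Path(root)
    return [rel for rel in EXPECTED[dataset] if not (base / rel).exists()]


if __name__ == "__main__":
    # Run from the repository root; prints any path that is still missing.
    for dataset in EXPECTED:
        for rel in missing_paths(".", dataset):
            print(f"[{dataset}] missing: {rel}")
```

An empty output means every listed path exists; it does not confirm that the downloaded data or checkpoints are complete.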

Repository Structure

RECA/
└── Semtab
    ├── checkpoints 
    ├── data 
        ├── distance-files (store the edit distances)
        ├── json (store the base json files that contain the table content)
        ├── jsonl_data (store the pre-processed table data)
        ├── raw_data (the raw table dataset)
        └── tokenized_data (the tokenized data used for the experiments)
    ├── experiment
        ├── semtab_labels.json (class types)
        ├── RECA-semtab-test-from-pre-trained.py (directly reproduce the result by running this file)
        └── RECA-semtab-train+test.py (train the model from scratch)
    └── pre-process
        ├── transform_to_json.py (generate base json files from the raw data)
        ├── NER_extraction.py (NER tagging)
        ├── pre-process.py (table finding and alignment)
        ├── make_json_input.py (generate jsonl data)
        ├── jaccard_filterjson.py (table filtering)
        └── semtab-datasets.py (generate tokenized data)
└── WebTables
    ├── checkpoints
    ├── data
        ├── distance-files (store the edit distances)
        ├── json (store the base json files that contain the table content)
        ├── jaccard (store the jaccard distance)
        ├── webtables (the raw table dataset)
        ├── out (store the pre-processed table data)
        └── tokenized_data (the tokenized data used for the experiments)
    ├── experiment
        ├── label_dict.json (class types)
        └── RECA-webtables-train.py (train the model from scratch)
    └── pre-process
        ├── compute_jaccard.py (compute the jaccard distance between tables)
        ├── pre-process-webtables.py (table finding, alignment, filtering, generate json files)
        └── webtables-datasets.py (generate tokenized data)
└── requirements.txt
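
Several of the pre-process scripts above compute a Jaccard distance between tables. For readers unfamiliar with the measure, here is a generic sketch of it over the token sets of two tables. It is not the code in compute_jaccard.py or jaccard_filterjson.py: how the repository represents and tokenizes a table is not described in this README, so the whitespace tokenization and the function names table_tokens and jaccard_distance are assumptions made for illustration.

```python
from typing import Iterable, Set


def table_tokens(rows: Iterable[Iterable[str]]) -> Set[str]:
    """Collect the lower-cased whitespace tokens of every cell in a table.

    Whitespace tokenization is an assumption of this sketch.
    """
    tokens: Set[str] = set()
    for row in rows:
        for cell in row:
            tokens.update(str(cell).lower().split())
    return tokens


def jaccard_distance(a: Set[str], b: Set[str]) -> float:
    """Return 1 - |a & b| / |a | b|; two empty sets are treated as identical."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


if __name__ == "__main__":
    table_a = [["Paris", "France"], ["Rome", "Italy"]]
    table_b = [["Paris", "France"], ["Berlin", "Germany"]]
    # The two tables share 2 of 6 distinct tokens, so the distance is 1 - 2/6.
    print(jaccard_distance(table_tokens(table_a), table_tokens(table_b)))
```

A distance of 0 means the two token sets are identical and 1 means they share nothing, which is what makes the measure usable for filtering out unrelated tables.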

Citations

@article{sun2023reca,
  title={Reca: Related tables enhanced column semantic type annotation framework},
  author={Sun, Yushi and Xin, Hao and Chen, Lei},
  journal={Proceedings of the VLDB Endowment},
  volume={16},
  number={6},
  pages={1319--1331},
  year={2023},
  publisher={VLDB Endowment}
}
