Code and data for the VLDB 2023 paper: RECA: Related Tables Enhanced Column Semantic Type Annotation Framework

ysunbp/RECA-paper

RECA

This is the repository for the code and data of the VLDB 2023 paper: RECA: Related Tables Enhanced Column Semantic Type Annotation Framework.

Overview of RECA

Install

$ git clone [link to repo]
$ cd RECA
$ pip install -r requirements.txt 

If you are using Anaconda, you can create a virtual environment and install all the packages:

$ conda create --name RECA python=3.7
$ conda activate RECA
$ pip install -r requirements.txt

Reproduce the results

To reproduce the results on the Semtab2019 dataset, follow these steps:

  1. Remove the init files from all the directories (they exist only as placeholders on GitHub).
  2. Download the pre-trained models and the pre-processed data; see the instructions in checkpoints and jsonl_data for details.
  3. Tokenize the jsonl data by following the suggestions in pre-process. (Alternatively, download the raw dataset from here and pre-process the data from scratch; see pre-process for detailed instructions.)
  4. Run the experiments: either load the pre-trained models and run RECA-semtab-test-from-pre-trained.py, or train from scratch by running RECA-semtab-train+test.py.
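
Step 1 can be scripted rather than done by hand. The following is a minimal sketch, not part of the repository: it assumes the placeholder files are literally named `init`, which the README does not state, so adjust the `name` argument to whatever the files are actually called. The function names `find_placeholders` and `remove_placeholders` are hypothetical helpers introduced for illustration.

```python
from pathlib import Path
from typing import List


def find_placeholders(root: str, name: str = "init") -> List[Path]:
    """Return every regular file under `root` whose file name equals `name`.

    The default name "init" is an assumption; the README only says "init files".
    """
    return sorted(p for p in Path(root).rglob(name) if p.is_file())


def remove_placeholders(root: str, name: str = "init", dry_run: bool = True) -> List[Path]:
    """Return the placeholder files under `root`, deleting them unless dry_run is True."""
    found = find_placeholders(root, name)
    if not dry_run:
        for path in found:
            path.unlink()  # delete the file; directories are never touched
    return found
```

Calling `remove_placeholders(".")` from the repository root only lists the matches; pass `dry_run=False` once the list looks right. Defaulting to a dry run is a deliberate choice here, since a wrong file name would otherwise delete unrelated files.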

To reproduce the results on the WebTables dataset, follow these steps:

  1. Remove the init files from all the directories (they exist only as placeholders on GitHub).
  2. Download the tokenized data or the raw dataset; see the instructions in pre-process.
  3. If you want to start from the raw dataset, pre-process it by following the steps described in 'Start from scratch' in pre-process. Skip this step if you use the tokenized data directly.
  4. Run the experiment file RECA-webtables-train.py in the experiment folder to start training.

Repository Structure

RECA/
├── Semtab
│   ├── checkpoints
│   ├── data
│   │   ├── distance-files (stores the edit distances)
│   │   ├── json (stores the base json files that contain the table content)
│   │   ├── jsonl_data (stores the pre-processed table data)
│   │   ├── raw_data (the raw table dataset)
│   │   └── tokenized_data (the tokenized data used for the experiments)
│   ├── experiment
│   │   ├── semtab_labels.json (class types)
│   │   ├── RECA-semtab-test-from-pre-trained.py (directly reproduces the results from the pre-trained models)
│   │   └── RECA-semtab-train+test.py (trains the model from scratch)
│   └── pre-process
│       ├── transform_to_json.py (generates base json files from the raw data)
│       ├── NER_extraction.py (NER tagging)
│       ├── pre-process.py (table finding and alignment)
│       ├── make_json_input.py (generates jsonl data)
│       ├── jaccard_filterjson.py (table filtering)
│       └── semtab-datasets.py (generates tokenized data)
├── WebTables
│   ├── checkpoints
│   ├── data
│   │   ├── distance-files (stores the edit distances)
│   │   ├── json (stores the base json files that contain the table content)
│   │   ├── jaccard (stores the Jaccard distances)
│   │   ├── webtables (the raw table dataset)
│   │   ├── out (stores the pre-processed table data)
│   │   └── tokenized_data (the tokenized data used for the experiments)
│   ├── experiment
│   │   ├── label_dict.json (class types)
│   │   └── RECA-webtables-train.py (trains the model from scratch)
│   └── pre-process
│       ├── compute_jaccard.py (computes the Jaccard distance between tables)
│       ├── pre-process-webtables.py (table finding, alignment, filtering, and json file generation)
│       └── webtables-datasets.py (generates tokenized data)
└── requirements.txt
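
Before running an experiment, it can help to confirm that the downloaded data landed in the directories listed in the structure above. The snippet below is a small sketch of such a check, not a script shipped with the repository: the directory names are copied from the tree, while `EXPECTED_DIRS` and `missing_dirs` are hypothetical names introduced for illustration.

```python
from pathlib import Path
from typing import Dict, List

# Sub-directories per dataset, copied from the repository structure in this README.
EXPECTED_DIRS: Dict[str, List[str]] = {
    "Semtab": [
        "checkpoints",
        "data/distance-files",
        "data/json",
        "data/jsonl_data",
        "data/raw_data",
        "data/tokenized_data",
        "experiment",
        "pre-process",
    ],
    "WebTables": [
        "checkpoints",
        "data/distance-files",
        "data/json",
        "data/jaccard",
        "data/webtables",
        "data/out",
        "data/tokenized_data",
        "experiment",
        "pre-process",
    ],
}


def missing_dirs(root: str, dataset: str) -> List[str]:
    """Return the expected sub-directories of `dataset` that do not exist under `root`."""
    base = Path(root) / dataset
    return [rel for rel in EXPECTED_DIRS[dataset] if not (base / rel).is_dir()]
```

For example, `missing_dirs(".", "Semtab")` run from the repository root returns an empty list when every listed directory is present, and otherwise names the ones still to be created or downloaded.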

Citations

@article{sun2023reca,
  title={{RECA}: Related Tables Enhanced Column Semantic Type Annotation Framework},
  author={Sun, Yushi and Xin, Hao and Chen, Lei},
  journal={Proceedings of the VLDB Endowment},
  volume={16},
  number={6},
  pages={1319--1331},
  year={2023},
  publisher={VLDB Endowment}
}
