GAIT: Graph Neural Network Approach to Semantic Type Detection in Tables

This repository contains the source code for the paper "Graph Neural Network Approach to Semantic Type Detection in Tables", published at the Pacific-Asia Conference on Knowledge Discovery and Data Mining 2024.

Structure of the repository

At the root folder of the project, you will see:

├── data  # Contains training and testing data
├── model  # Source files for the GNN models of GAIT
|  ├── configs.py  # Stores the path of the current directory
|  ├── dataset.py  # Manages data loading
|  ├── GAIT_GAT.py  # Applies GAT to the output of the single-column prediction module
|  ├── GAIT_GCN.py  # Applies GCN to the output of the single-column prediction module
|  ├── GAIT_GGNN.py  # Applies GGNN to the output of the single-column prediction module
|  └── gnn.py  # Defines the different GNNs
├── Pre-processing  # Code for preparing RECA's output for the GNN
├── results  # Classification reports
├── saved_models_GAT  # Stores trained GAT models
├── saved_models_GCN  # Stores trained GCN models
├── saved_models_GGNN  # Stores trained GGNN models
├── README.md
├── requirements.txt  # Lists required Python libraries
├── ...

Environment setup

The following instructions were tested on Python 3.10 with an NVIDIA Tesla V100 and an NVIDIA GeForce RTX 4090. We recommend using a Python virtual environment:

mkdir venv
virtualenv --python=python3 venv

# Set $BASEPATH to the path of the repository
export BASEPATH=[path to the repo]
source venv/bin/activate

After activating the virtual environment, install required library packages:

cd $BASEPATH
pip3 install -r requirements.txt

Environment setup is now complete. In later sessions, run the following command to activate the virtual environment:

source venv/bin/activate

Single-column prediction

GAIT uses RECA as its single-column prediction module. The source code for RECA is accessible here. However, GAIT is compatible with any model, such as Sherlock, that can produce a logit vector for each table column.

Dataset and Data Formatting

Obtaining and Preparing Data

The input to the GNN part of GAIT is the set of logit vectors generated by the single-column prediction model (RECA). This data can be obtained in either of the following ways:

  1. Download the Preprocessed Data: You can download the prepared logit outputs directly from:

    • RECA Output. Place these files in the data directory.
  2. Generate Data Using RECA:

    • Modify RECA's Output: To adapt RECA's output for specific datasets such as Semtab and Webtables to the format required by our GNN, we have included Python scripts in the /Pre-processing directory. These scripts are tailored to modify RECA's output and are stored in their corresponding experiment directory within RECA's source code. Use them to ensure that the data is formatted correctly for the GNN's input.
    • Run RECA and Pre-processing Scripts: Execute RECA's source code along with the pre-processing scripts above. This converts RECA's output into the structured logit vectors needed by the GNN part of GAIT.
    • Place the Processed Data: Once processed, place the output files in the data directory.

Data Format

The input data for GAIT should be stored in a serialized (pickle) file containing a NumPy array that represents the tables to be processed. Each element in the array corresponds to one table and is a dictionary whose keys describe the features and properties of that table. Our code opens the file as follows, where data_path stands for the path to the data file:

import pickle

with open(data_path, "rb") as f:  # data_path: path to the pickle file
    data = pickle.load(f)
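Once the file is loaded, it can help to confirm that every entry has the expected layout before training. The following is a minimal sketch and not part of the repository: `check_tables` is a hypothetical helper, the key names are those listed under "Structure of the Input Data", and it assumes that `labels` and `masks` have the same length and that `features` is a 2-D array.

```python
import numpy as np

# Keys named in this README's data format description.
REQUIRED_KEYS = ("features", "labels", "masks", "table_id")


def check_tables(data):
    """Return the number of tables after verifying each entry's layout.

    Hypothetical helper: raises ValueError on the first malformed entry.
    """
    for i, table in enumerate(data):
        missing = [k for k in REQUIRED_KEYS if k not in table]
        if missing:
            raise ValueError(f"table {i} is missing keys: {missing}")
        # Assumption: one logit vector per row, so features must be 2-D.
        if np.asarray(table["features"]).ndim != 2:
            raise ValueError(f"table {i}: features is not a 2-D array")
        # Assumption: labels and masks are padded to the same length.
        if len(table["labels"]) != len(table["masks"]):
            raise ValueError(f"table {i}: labels and masks differ in length")
    return len(data)
```

Calling `check_tables(data)` right after `pickle.load` surfaces format problems early instead of partway through a training run.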

Structure of the Input Data

Each entry in the NumPy array is a dictionary representing a single table, with the following keys:

  • features: A NumPy array of logit vectors output by the single-column prediction model (e.g., RECA). These vectors act as the initial features for the nodes in the graph neural network (GNN).
  • labels: A NumPy array containing one label per column of the table, where each label is the semantic type of that column.
  • masks: A NumPy array of binary values (1 or 0), where 1 indicates a valid column and 0 indicates an invalid column. This makes it possible to handle tables with varying numbers of columns.
  • table_id: A string identifier unique to each table.
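To make the layout concrete, the sketch below builds one table entry from a small set of column logits. It is illustrative only: `make_table_entry` and `max_cols` are hypothetical names that do not come from this repository, and the padding convention (zero feature rows, label -1, mask 0) is an assumption inferred from the example in this README.

```python
import numpy as np


def make_table_entry(logits, labels, table_id, max_cols):
    """Pad one table's column logits and labels to max_cols columns.

    Hypothetical helper. Assumptions: padded feature rows are zeros and
    padded labels are -1, mirroring the example in this README.
    """
    logits = np.asarray(logits, dtype=float)
    n_cols, n_classes = logits.shape
    if n_cols > max_cols:
        raise ValueError("table has more columns than max_cols")

    features = np.zeros((max_cols, n_classes))
    features[:n_cols] = logits

    padded_labels = np.full(max_cols, -1, dtype=int)
    padded_labels[:n_cols] = labels

    masks = np.zeros(max_cols, dtype=int)
    masks[:n_cols] = 1  # 1 marks a valid column, 0 marks padding

    return {"features": features, "labels": padded_labels,
            "masks": masks, "table_id": table_id}


entry = make_table_entry([[0.2, -1.0, 3.1], [1.5, 0.3, -0.7]],
                         labels=[2, 0], table_id="example_table", max_cols=4)
print(entry["masks"])   # [1 1 0 0]
print(entry["labels"])  # [ 2  0 -1 -1]
```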

Example of Input Data

Below is an example demonstrating how a single table's data is structured within the numpy array. This example includes just one table:

[{'features': array([[-2.52733374, ... , -3.81204939],
        [ 2.22598338, ... , -3.21132755],
        [ 0, ..., 0],
        [ 0, ..., 0],
        [ 0, ..., 0],
        [ 0, ..., 0]]),
  'labels': array([ 5, 14, -1, -1, -1, -1]),
  'masks': array([1, 1, 0, 0, 0, 0]),
  'table_id': '0_1438042987171.38;warc;CC-MAIN-20150728002307-00339-ip-10-236-191-2.ec2.internal.json.gz_993-Vanderbilt University | Co_D2SH5A2V3AI2PNTZ'}]
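Because `features` holds the single-column model's logits, taking the argmax over each valid row recovers that model's own prediction, which can serve as a sanity check or baseline before the GNN is applied. The sketch below is illustrative only: `single_column_predictions` and `baseline_accuracy` are hypothetical names, not functions from this repository. It assumes that the index of a logit equals the label id and that `features` has one row per entry in `masks`.

```python
import numpy as np


def single_column_predictions(table):
    """Argmax of the logits for valid columns only (masks == 1)."""
    valid = np.asarray(table["masks"]) == 1
    return np.asarray(table["features"])[valid].argmax(axis=1)


def baseline_accuracy(data):
    """Fraction of valid columns whose argmax matches the label."""
    correct = total = 0
    for table in data:
        valid = np.asarray(table["masks"]) == 1
        preds = single_column_predictions(table)
        labels = np.asarray(table["labels"])[valid]
        correct += int((preds == labels).sum())
        total += int(valid.sum())
    return correct / total if total else 0.0


# Toy data with made-up logits: two valid columns and one padded column.
toy = [{"features": np.array([[0.1, 2.0, -1.0],
                              [3.0, 0.0, 0.5],
                              [0.0, 0.0, 0.0]]),
        "labels": np.array([1, 2, -1]),
        "masks": np.array([1, 1, 0]),
        "table_id": "toy"}]
print(baseline_accuracy(toy))  # 0.5
```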

Model Training and Evaluation

Training Commands

To train the GNN module of GAIT, run the commands below. Each script corresponds to a different variant of the GNN (GAT, GCN, GGNN). For details on the available parameters and their effects, refer to the explanations within each script, and adjust the parameters to suit your dataset and training preferences:

cd $BASEPATH/model
python GAIT_GAT.py --data-name dataset_semtab_4 --classes 275 --epochs 100 --num-heads 1 --num-out-heads 1 --num-layers 1 --mode train
python GAIT_GCN.py --data-name dataset_semtab_4 --classes 275 --epochs 100 --num-layers 1 --mode train
python GAIT_GGNN.py --data-name dataset_semtab_4 --classes 275 --epochs 100 --num-layers 1 --mode train

Evaluation

To evaluate an existing model, pass the --mode eval option to the script for the desired model.
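When running several variants, it may be convenient to assemble the command lines programmatically. The following is a sketch under stated assumptions: `build_command` is a hypothetical helper, the flag values are copied from the training commands in this README, and it assumes that evaluation accepts the same flags as training with only `--mode` changed, which this README does not state explicitly.

```python
import sys

# Flags shared by all variants, copied from the training commands in this README.
COMMON_FLAGS = ["--data-name", "dataset_semtab_4", "--classes", "275",
                "--epochs", "100", "--num-layers", "1"]
# Extra flags that only the GAT command in this README uses.
EXTRA_FLAGS = {"GAT": ["--num-heads", "1", "--num-out-heads", "1"],
               "GCN": [],
               "GGNN": []}


def build_command(variant, mode):
    """Return the argument list for one GAIT script in 'train' or 'eval' mode."""
    if variant not in EXTRA_FLAGS:
        raise ValueError(f"unknown variant: {variant}")
    if mode not in ("train", "eval"):
        raise ValueError(f"unknown mode: {mode}")
    return ([sys.executable, f"GAIT_{variant}.py"] + COMMON_FLAGS
            + EXTRA_FLAGS[variant] + ["--mode", mode])


# Print the evaluation command for each variant (without the interpreter path).
for variant in EXTRA_FLAGS:
    print(" ".join(build_command(variant, "eval")[1:]))
```

Each returned list can be passed to `subprocess.run(cmd, cwd=model_dir, check=True)`, where `model_dir` is the path to the repository's model directory.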

Citing this Work

To cite this work, please use the following BibTeX entry:

@inproceedings{hoseinzade2024graph,
  title={Graph Neural Network Approach to Semantic Type Detection in Tables},
  author={Hoseinzade, Ehsan and Wang, Ke},
  booktitle={Pacific-Asia Conference on Knowledge Discovery and Data Mining},
  pages={121--133},
  year={2024},
  organization={Springer}
}