Semantic parsing, which maps natural language utterances into computer-understandable logical forms, has recently drawn substantial attention as a promising direction for building natural language interfaces to computers. There are many domains (healthcare, finance, IoT, sports, etc.) for which we may want such an interface, which makes portability and scalability a pressing challenge. In other words, natural language interfaces face a cold start problem:
Given a new domain, how can we build a natural language interface for it?
There are three complementary ways to solve the cold start problem:
- Reuse the training data from existing domains via transfer learning (this repo; see the illustration after this list)
- Collect training data for the new domain via crowdsourcing [1] [2]
- Once a natural language interface has been cold-started with reasonable performance, develop a user-friendly interaction mechanism, deploy the system, and let it interact with real users so that it keeps refining itself [3] [4]
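This repo follows the paraphrasing formulation of semantic parsing used in the paper below: each logical form is associated with a canonical utterance, and parsing reduces to paraphrasing the input utterance into the right canonical utterance. The pair below is purely illustrative and is not taken from the dataset:

```python
# Illustrative only -- a made-up calendar-domain pair, not an actual Overnight example.
utterance = "show me meetings that start after 10 am"        # what the user says
canonical = "meeting whose start time is larger than 10 am"  # canonical utterance tied to a logical form
```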
Requirements:
- Python 2.7
- TensorFlow 0.11 (yes, the TF version is a bit old, but it still works reasonably well!)
- PyYAML (for logging)
Install TensorFlow 0.11:
(GPU support)
export TF_BINARY_URL=https://storage.googleapis.com/tensorflow/linux/gpu/tensorflow-0.11.0-cp27-none-linux_x86_64.whl
pip install --ignore-installed --upgrade $TF_BINARY_URL
(CPU only)
export TF_BINARY_URL=https://storage.googleapis.com/tensorflow/linux/cpu/tensorflow-0.11.0-cp27-none-linux_x86_64.whl
pip install --ignore-installed --upgrade $TF_BINARY_URL
Install other dependencies:
pip install pyyaml
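As an optional sanity check, the imports below should succeed and report the expected versions if the installation worked:

```python
# Optional sanity check for the installation above.
import tensorflow as tf
import yaml

print(tf.__version__)    # expected: 0.11.0
print(yaml.__version__)  # PyYAML version
```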
We use the Overnight dataset, which contains 8 domains, including Basketball, Calendar, and Restaurants. The dataset is already pre-processed and can be found under the data directory.
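For a quick look at what is included, you can simply list the contents of the data directory (no particular file layout is assumed here; this just enumerates whatever is shipped):

```python
import os

# Print whatever the repo ships under data/ (one entry per domain or file).
for name in sorted(os.listdir("data")):
    print(name)
```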
Assume we are at the root of the repo. All training and testing can be done with the following command:
sh scripts/batch_run.sh 0 train_grid_unit_var overnight 0
The arguments are:
- GPU ID: which GPU to use for this run
- Training Script: each word embedding initialization has a separate script
- Dataset: for now, the only option is overnight
- Execution Number: a unique number for this execution. A corresponding directory will be created under execs/ to host the trained model and the log of this execution.
The command will run the following tasks for each of the 8 domains (a schematic sketch of this schedule follows the list):
- In-domain: Train and test
- In-domain: Re-train with the full training data (i.e., training+validation) and then test (final results for in-domain setting)
- Cross-domain: Pre-training on the source domains
- Cross-domain: Warm-start on target domain with pre-trained model, fine-tune with in-domain data, and test
- Cross-domain: Re-train with full in-domain training data and then test (final results for cross-domain setting)
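The sketch below is only meant to make the schedule explicit; DOMAINS, train, and evaluate are hypothetical stand-ins, not functions from this repo:

```python
# Schematic sketch of the per-domain schedule above (hypothetical stand-ins,
# not the repo's actual code).

DOMAINS = ["basketball", "calendar", "restaurants"]  # illustrative subset of the 8 domains

def train(data, init=None):
    """Hypothetical trainer: returns a 'model' trained on data, optionally warm-started."""
    return {"data": data, "init": init}

def evaluate(model, test_split):
    """Hypothetical evaluator: would report test accuracy."""
    pass

for target in DOMAINS:
    sources = [d for d in DOMAINS if d != target]

    # In-domain: train/test, then re-train on train+validation for the final numbers.
    model = train(data=target + "/train")
    evaluate(model, target + "/test")
    model_full = train(data=target + "/train+valid")
    evaluate(model_full, target + "/test")

    # Cross-domain: pre-train on the source domains, warm-start and fine-tune on the
    # target domain, then re-train on the full in-domain data for the final numbers.
    pretrained = train(data=sources)
    fine_tuned = train(data=target + "/train", init=pretrained)
    evaluate(fine_tuned, target + "/test")
    final = train(data=target + "/train+valid", init=pretrained)
    evaluate(final, target + "/test")
```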
It's easy to train with another word embedding initialization strategy, e.g., the original word2vec embeddings without standardization. Just change the training script and execution number:
sh scripts/batch_run.sh 0 train_grid_original overnight 1
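If you are curious what standardization means here, the snippet below shows one common form, z-scoring each embedding dimension across the vocabulary; it illustrates the general idea and is not necessarily the exact transformation used in this repo:

```python
import numpy as np

# Stand-in for pre-trained word2vec vectors: |V| x d (random here, for illustration only).
emb = np.random.randn(1000, 300).astype(np.float32)

# Per-dimension standardization across the vocabulary (one common interpretation).
emb_std = (emb - emb.mean(axis=0)) / (emb.std(axis=0) + 1e-8)
```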
We provide a script to make it easy to extract the test results across all domains. For example:
In-domain, exec_num=0, re-training with full training data:
python scripts/extract_test_result.py in-domain overnight 0 1
In-domain, exec_num=0, no re-training:
python scripts/extract_test_result.py in-domain overnight 0 0
Cross-domain, exec_num=5, re-training with full training data:
python scripts/extract_test_result.py cross-domain overnight 5 1
Please refer to the following paper for more details. If you find it useful, please consider citing it:
@InProceedings{su2017cross,
author = {Su, Yu and Yan, Xifeng},
title = {Cross-domain Semantic Parsing via Paraphrasing},
booktitle = {Proceedings of the Conference on Empirical Methods in Natural Language Processing},
pages = {1235--1246},
year = {2017},
address = {Copenhagen, Denmark},
month = {Sept},
publisher = {Association for Computational Linguistics}
}
[1] Yu Su, Ahmed Hassan Awadallah, Madian Khabsa, Patrick Pantel, Michael Gamon, Mark Encarnacion. Building Natural Language Interfaces to Web APIs. CIKM 2017.
[2] Yu Su, Huan Sun, Brian Sadler, Mudhakar Srivatsa, Izzeddin Gur, Zenghui Yan, Xifeng Yan. On Generating Characteristic-rich Question Sets for QA Evaluation. EMNLP 2016.
[3] Izzeddin Gur, Semih Yavuz, Yu Su, Xifeng Yan. DialSQL: Dialogue Based Structured Query Generation. ACL 2018.
[4] Yu Su, Ahmed Hassan Awadallah, Miaosen Wang, Ryen White. Natural Language Interfaces with Fine-Grained User Interaction: A Case Study on Web APIs. SIGIR 2018.