This repository contains the code and models discussed in our paper ["Generate to Understand for Representation"](https://arxiv.org/abs/2306.10056).
Introducing GUR: a pretraining framework that combines language modeling and contrastive learning objectives in a single training stage. We select similar text pairs from raw unlabeled documents based on their longest common substring (LCS) and train the model jointly with masked language modeling and unsupervised contrastive learning. The resulting model, GUR, achieves strong results without any labeled training data, outperforming all other pretrained baselines as a retriever on the recall benchmark in a zero-shot setting. GUR also retains its language modeling ability, as demonstrated in our ablation experiments.
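As a rough illustration of the pair-selection idea (this is a sketch, not the repository's actual `sents2pair.py`), the snippet below scores candidate sentence pairs by the length of their longest common substring and keeps pairs above a hypothetical `min_lcs` threshold:

```python
from difflib import SequenceMatcher

def lcs_len(a: str, b: str) -> int:
    """Length of the longest common substring of a and b."""
    match = SequenceMatcher(None, a, b).find_longest_match(0, len(a), 0, len(b))
    return match.size

def select_pairs(sentences, min_lcs=10):
    """Keep sentence pairs whose longest common substring is long enough.

    Illustrative O(n^2) scan over a small list, not the real pipeline.
    """
    pairs = []
    for i, a in enumerate(sentences):
        for b in sentences[i + 1:]:
            if lcs_len(a, b) >= min_lcs:
                pairs.append((a, b))
    return pairs

if __name__ == "__main__":
    docs = [
        "GUR combines language modeling and contrastive learning.",
        "GUR combines language modeling with a contrastive objective.",
        "The weather is nice today.",
    ]
    print(select_pairs(docs, min_lcs=20))
```

In the actual pipeline the pairs are mined from raw documents at scale; the quadratic scan above is only for clarity.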
Convert the corpus to text pairs:

```bash
python sents2pair.py
```

Train the model with the joint masked-LM and contrastive objective (a sketch of this loss is shown after these steps):

```bash
bash train.sh
```

Run the conversion script:

```bash
python convert.py
```
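For intuition, here is a minimal sketch of the single-stage objective described above, assuming a PyTorch encoder that produces masked-token logits and pooled embeddings for the two texts of each pair; the pooling, temperature, and loss weight `alpha` are illustrative and not necessarily the settings used in `train.sh`:

```python
import torch
import torch.nn.functional as F

def joint_loss(mlm_logits, mlm_labels, emb_a, emb_b, temperature=0.05, alpha=1.0):
    """Combine masked language modeling and in-batch contrastive losses.

    mlm_logits: (batch, seq_len, vocab) predictions over the masked sequence.
    mlm_labels: (batch, seq_len) token ids, with -100 at unmasked positions.
    emb_a, emb_b: (batch, dim) pooled embeddings of the two texts in each pair.
    """
    # Masked language modeling: cross-entropy over masked positions only.
    lm = F.cross_entropy(
        mlm_logits.view(-1, mlm_logits.size(-1)),
        mlm_labels.view(-1),
        ignore_index=-100,
    )

    # Unsupervised contrastive loss (in-batch negatives): each text's positive
    # is its LCS-selected pair; all other texts in the batch act as negatives.
    a = F.normalize(emb_a, dim=-1)
    b = F.normalize(emb_b, dim=-1)
    sim = a @ b.t() / temperature                      # (batch, batch) cosine similarities
    targets = torch.arange(sim.size(0), device=sim.device)
    cl = F.cross_entropy(sim, targets)

    return lm + alpha * cl
```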
If you find this work useful, please cite:

```bibtex
@inproceedings{10438270,
  author    = {Xue, Changshang and Zhong, Xiande and Liu, Xiaoqing},
  title     = {Generate to Understand for Representation in One Pre-training Stage},
  booktitle = {2023 11th International Conference on Information Systems and Computing Technology (ISCTech)},
  year      = {2023},
  pages     = {258-267},
  doi       = {10.1109/ISCTech60480.2023.00054},
  keywords  = {Training;Computational modeling;Self-supervised learning;Benchmark testing;Market research;Natural language processing;Task analysis;self-supervised pre-train;contrastive learning;language model;zero-shot learning;text representation;NLP;NLU;NLG;retrieval}
}
```