This repository contains the code and dataset for the paper Open-world Multi-label Text Classification with Extremely Weak Supervision.
We study open-world multi-label text classification under extremely weak supervision, where the user only provides a brief description for classification objectives without any labels or ground-truth label space.
conda create -n X-MLClass python=3.9
conda activate X-MLClass
python -m pip install -r requirements.txt
If you need to use OpenAI APIs, you will need to obtain an API key here.
export OPENAI_API_KEY=[your OpenAI API Key]
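Before launching any script that calls the OpenAI API, it can help to confirm that the variable is actually visible to Python. The following is a minimal sketch; the helper name `require_openai_key` is hypothetical and not part of this repository.

```python
import os


def require_openai_key(env=None):
    """Return the OpenAI API key from the environment, or raise a clear error.

    `env` defaults to os.environ; passing a plain dict makes the check testable.
    This helper is illustrative only and does not exist in the repository.
    """
    env = os.environ if env is None else env
    key = env.get("OPENAI_API_KEY", "").strip()
    if not key:
        raise RuntimeError(
            "OPENAI_API_KEY is not set; run `export OPENAI_API_KEY=...` first."
        )
    return key


if __name__ == "__main__":
    try:
        require_openai_key()
        print("OPENAI_API_KEY found.")
    except RuntimeError as err:
        print(err)
```

Failing early with an explicit message is usually easier to debug than an authentication error raised partway through a long run.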
All datasets referenced in the paper are available here.
The framework outlined in the paper consists of three steps:
- Construct the initial label space.
- Assign labels using a custom keyphrase-chunk zero-shot textual entailment classifier.
- Improve the label space.
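To make the second step more concrete, here is a minimal sketch of multi-label assignment over text chunks. It assumes that each document is split into chunks, that every (chunk, label) pair receives a score, and that a label is kept when its best chunk score clears a threshold. The function names, the threshold value, and the token-overlap scorer are all hypothetical stand-ins: the repository's classifier relies on a zero-shot entailment model and may aggregate scores differently.

```python
from typing import Callable, Dict, List


def overlap_score(chunk: str, label: str) -> float:
    """Toy stand-in for an entailment model (assumption, not the repo's scorer).

    Returns the fraction of label tokens that also appear in the chunk.
    """
    chunk_tokens = set(chunk.lower().split())
    label_tokens = label.lower().split()
    if not label_tokens:
        return 0.0
    hits = sum(1 for tok in label_tokens if tok in chunk_tokens)
    return hits / len(label_tokens)


def assign_labels(
    chunks: List[str],
    labels: List[str],
    scorer: Callable[[str, str], float] = overlap_score,
    threshold: float = 0.5,  # hypothetical cut-off, chosen for illustration
) -> Dict[str, float]:
    """Keep every label whose best score over all chunks reaches the threshold."""
    assigned: Dict[str, float] = {}
    for label in labels:
        best = max((scorer(chunk, label) for chunk in chunks), default=0.0)
        if best >= threshold:
            assigned[label] = best
    return assigned


if __name__ == "__main__":
    chunks = [
        "we train a neural network for image recognition",
        "the proof uses graph theory",
    ]
    labels = ["neural network", "graph theory", "quantum physics"]
    print(assign_labels(chunks, labels))
```

Because each label is scored independently, a document can receive several labels or none, which is what distinguishes the multi-label setting from single-label classification. Swapping `overlap_score` for a real entailment scorer would leave `assign_labels` unchanged.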
Below, we provide an example of open-world multi-label text classification using the AAPD dataset.
We have placed the keyphrase file we generated in the dataset folder. You can also generate the keyphrases yourself with the following command:
CUDA_VISIBLE_DEVICES=... python llama_keyword.py \
--path ./datasets \
--data_dir train_texts_split_50.txt \
--task AAPD \
--batch_size 32 \
--model meta-llama/Llama-2-13b-chat-hf \
--output_dir llama_label2_50.txt
To generate the initial label space, run the following commands. The output is saved to llama2/init_label_space.txt:
cd OpenWordMLTC/keyword_generator
bash label_space_construct.sh
To assign labels with the zero-shot textual entailment classifier, run:
cd OpenWordMLTC/zero-shot
bash multi_label_classifier.sh
Note that this code runs multi-label text classification with the initial label space, whereas the results reported in the paper use the improved label space produced in the next step.
To improve the label space, run:
cd OpenWordMLTC/self_training
CUDA_VISIBLE_DEVICES=... python self_training.py \
--path ../../datasets \
--data_dir train_texts_split_50.txt \
--keyphrase_dir llama2_label_50.txt \
--task AAPD \
--llama_model llama2 \
--tail_set_size 500 \
--majority_num 350 \
--max_majority_num 5 \
--sim_threshold 0.55 \
--max_add_label 10 \
--model MoritzLaurer/deberta-v3-large-zeroshot-v1.1-all-33
The improved label space is stored in llama2/result/update_labelspace.txt
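The behavior of the `--sim_threshold` and `--max_add_label` flags is not documented here. As an illustration of one plausible reading, the sketch below assumes that a candidate label is added only if its cosine similarity to every label already kept stays below the threshold, and that at most `max_add_label` labels are added. The function names and the toy embedding vectors are hypothetical; `self_training.py` may work differently.

```python
import math
from typing import Dict, List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def filter_new_labels(
    existing: List[str],
    candidates: List[str],
    embeddings: Dict[str, Sequence[float]],
    sim_threshold: float = 0.55,  # value taken from the command above
    max_add_label: int = 10,      # value taken from the command above
) -> List[str]:
    """Return the candidates that are sufficiently different from all kept labels.

    Assumed semantics only: a candidate is rejected when it is at least
    `sim_threshold` similar to an existing or already-accepted label.
    """
    kept: List[str] = []
    for cand in candidates:
        if len(kept) >= max_add_label:
            break
        compared = existing + kept
        if all(
            cosine(embeddings[cand], embeddings[other]) < sim_threshold
            for other in compared
        ):
            kept.append(cand)
    return kept


if __name__ == "__main__":
    # Toy 2-D vectors standing in for real label embeddings.
    toy = {
        "machine learning": [1.0, 0.0],
        "deep learning": [0.9, 0.1],
        "number theory": [0.0, 1.0],
    }
    print(filter_new_labels(["machine learning"], ["deep learning", "number theory"], toy))
```

Under this reading, a lower threshold makes the filter stricter and keeps the label space compact, while a higher one admits near-duplicate labels.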