Introduction

Welcome to the official repository of our paper Self-Pluralising Culture Alignment for Large Language Models.

The data resources used in our experiments are downloaded from the World Values Survey Wave 7 (2017-2022). We implement SFT for CultureSPA based on LLaMA-Factory.

Getting Started

Environments

pip install -r requirements.txt
cd LLaMA-Factory
pip install -e ".[torch,metrics]"
cd ..

Data Resources

The data directory contains several useful data resources (a minimal loading sketch follows the list):

  • wvs_questions.json: 290 questions extracted from the World Values Survey Wave 7 (2017-2022).
  • proportions_group_by_country.json: aggregated answers to WVS questions from participants belonging to different countries. Participants are from 18 countries across five continents:
    • America: USA (American), CAN (Canadian), BOL (Bolivian), BRA (Brazilian);
    • Europe: GBR (British), NLD (Dutch), DEU (German), UKR (Ukrainian);
    • Asia: CHN (Chinese), RUS (Russian), IND (Indian), THA (Thai);
    • Africa: KEN (Kenyan), NGA (Nigerian), ETH (Ethiopian), ZWE (Zimbabwean);
    • Oceania: AUS (Australian), NZL (New Zealand).
  • country_similarity.json: similarities between countries, used by the Cross-Culture Thinking method.
  • self_alignment_examples.json: similarities between questions, used by the Self-Alignment method.
  • Meta-Llama-3-8B-Instruct folder: 13,000 questions on 13 culture topics generated by Meta-Llama-3-8B-Instruct, along with model responses to these questions and the final training data for CultureSPA (joint/specific). Please note that the training data for CultureSPA (CCT) is not included due to limited storage space.
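As a quick sanity check, the JSON resources can be loaded like this (a minimal sketch; the exact schema of each file is an assumption, so adjust the keys to the real data):

import json

# Load the 290 WVS questions and the per-country answer distributions.
with open("data/wvs_questions.json") as f:
    questions = json.load(f)
with open("data/proportions_group_by_country.json") as f:
    proportions = json.load(f)

print(len(questions))              # expected: 290
print(sorted(proportions.keys()))  # expected: 18 country codes, e.g. "CHN", "USA"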

Assessing Cultural Alignment of LLMs

We measure the cultural alignment of LLMs by simulating, on LLMs, surveys that sociologists have conducted across different populations. For each culture, we compute the similarity between the outputs of an LLM and the actual survey responses from that culture to determine the degree of the LLM's alignment to that culture.
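For illustration, below is a minimal sketch of such a comparison, assuming survey responses and LLM outputs are both represented as per-question distributions over answer options; the actual score is computed by result_analysis_run_3.py and may use a different similarity measure:

import numpy as np

def alignment_score(llm_dist, survey_dist):
    # llm_dist / survey_dist: question id -> list of answer-option
    # proportions (each list summing to 1). Per-question similarity is
    # 1 - total variation distance; the final score is the mean.
    scores = []
    for qid, survey_p in survey_dist.items():
        p, s = np.asarray(llm_dist[qid]), np.asarray(survey_p)
        scores.append(1.0 - 0.5 * np.abs(p - s).sum())
    return float(np.mean(scores))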

To complete the assessment, please first run the inference scripts to obtain LLM outputs for the World Values Survey (WVS) questions. Six scripts using different prompting templates are provided below.

  1. Culture-Unaware Prompting:
bash culture_unaware_prompting.sh
  2. Culture-Aware Prompting (P1):
bash culture_aware_prompting.sh
  3. Cross-Culture Thinking (P2):
bash cross_culture_thinking_prompting.sh
  4. Self-Alignment (P3):
bash self_alignment_prompting.sh
  5. P1+P3:
bash culture_aware_self_alignment_prompting.sh
  6. P2+P3:
bash cross_culture_thinking_prompting_self_alignment.sh

Then run python result_analysis_run_3.py to compute the Cultural Alignment Score.

CultureSPA

CultureSPA is a Self-Pluralising Culture Alignment framework that allows LLMs to simultaneously align to pluralistic cultures. It involves 4 key steps: 1. Generating Diverse Culture-Related Questions; 2. Yielding Culture-Unaware/Aware LLM Outputs; 3. Culture-Related QA Pairs Collecting; 4. Culture-Joint/Specific SFT. The script for each step is provided below.

1. Generating Diverse Culture-Related Questions

bash 1.GDCRQ.sh
python data_process/1.merge_data.py
cd data_process
# then open and run 2.question_filtering.ipynb
cd ..
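The notebook step removes duplicate and low-quality generations; a rough sketch of what such filtering could look like (the notebook's actual criteria are not reproduced here and may differ):

def filter_questions(raw_questions):
    # Keep non-empty, case-insensitively unique generated questions.
    seen, kept = set(), []
    for q in raw_questions:
        key = q.strip().lower()
        if key and key not in seen:
            seen.add(key)
            kept.append(q.strip())
    return kept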

2. Yielding Culture-Unaware/Aware LLM Outputs

bash 2.YCULO.sh
bash 2.YCALO.sh
#bash 2.YCALO_CCT.sh

3. Culture-Related QA Pairs Collecting

python data_process/3.CRQPC.py
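Conceptually, this step matches each generated question with the corresponding culture-unaware/aware model output to form supervised QA pairs; a hedged sketch with illustrative field names (not the script's actual format):

def collect_qa_pairs(questions, outputs_by_culture):
    # outputs_by_culture: culture -> {question: model answer}.
    pairs = []
    for culture, answers in outputs_by_culture.items():
        for q in questions:
            if q in answers:
                pairs.append({"culture": culture,
                              "question": q,
                              "answer": answers[q]})
    return pairs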

4. Culture-Joint/Specific SFT

python data_process/4.format_data_alpaca.py
cd LLaMA-Factory
llamafactory-cli train CultureSPA_culture_aware.yaml
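For reference, LLaMA-Factory's Alpaca format stores each sample as instruction/input/output fields; a hypothetical culture-specific record could look like the following (the exact prompt wording produced by 4.format_data_alpaca.py is an assumption):

example = {
    "instruction": "Answer the question as a person from the following culture: Chinese.",  # illustrative prompt
    "input": "How important is family in your life?",  # generated culture-related question
    "output": "Very important; family is the center of social life...",  # culture-aware LLM answer
}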

After these steps, we obtain the CultureSPA model. Please edit these scripts to obtain variants such as CultureSPA (CCT) and CultureSPA (specific).

Citation

@article{CultureSPA,
      title={Self-Pluralising Culture Alignment for Large Language Models},
      author={Shaoyang Xu and Yongqi Leng and Linhao Yu and Deyi Xiong},
      journal={arXiv preprint arXiv:2410.12971},
      year={2024},
      url={https://arxiv.org/abs/2410.12971}
}