CultureSPA

Introduction

Welcome to the official repository of our paper Self-Pluralising Culture Alignment for Large Language Models.

Data resources used in the experiments are downloaded from the World Values Survey Wave 7 (2017-2022). We implement SFT for CultureSPA with LLaMA-Factory.

Getting Started

Environments

pip install -r requirements.txt
cd LLaMA-Factory
pip install -e ".[torch,metrics]"
cd ..
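
To check that the LLaMA-Factory CLI was installed correctly (optional; recent LLaMA-Factory releases provide a version subcommand):

llamafactory-cli version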

Data Resources

The data directory contains several useful data resources (a loading sketch follows this list):

  • wvs_questions.json: 290 questions extracted from the World Values Survey Wave 7 (2017-2022).
  • proportions_group_by_country.json: aggregated answers to the WVS questions, grouped by country. Participants come from 18 countries across five continents:
    • America: USA (American), CAN (Canadian), BOL (Bolivian), BRA (Brazilian);
    • Europe: GBR (British), NLD (Dutch), DEU (German), UKR (Ukrainian);
    • Asia: CHN (Chinese), RUS (Russian), IND (Indian), THA (Thai);
    • Africa: KEN (Kenyan), NGA (Nigerian), ETH (Ethiopian), ZWE (Zimbabwean);
    • Oceania: AUS (Australian), NZL (New Zealand).
  • country_similarity.json: similarities between countries, used by the Cross-Culture Thinking method.
  • self_alignment_examples.json: similarities between questions, used by the Self-Alignment method.
  • Meta-Llama-3-8B-Instruct folder: 13,000 questions on 13 culture topics generated by Meta-Llama-3-8B-Instruct, along with model responses to these questions and the final training data for CultureSPA (joint/specific). Please note that the training data for CultureSPA (CCT) is not included due to storage constraints.
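
As a quick illustration, these JSON resources can be loaded and inspected with a few lines of Python. The field names inside each file are not documented here, so the snippet below (a minimal sketch, assuming the repository's data/ layout) only loads two of the files and prints their top-level structure:

import json

# Load the WVS questions and the per-country answer proportions.
with open("data/wvs_questions.json", encoding="utf-8") as f:
    questions = json.load(f)
with open("data/proportions_group_by_country.json", encoding="utf-8") as f:
    proportions = json.load(f)

# Inspect the top-level structure; the exact keys depend on the actual files.
print(type(questions), len(questions))
if isinstance(proportions, dict):
    print(list(proportions.keys())[:5])  # e.g. country codes such as "USA", "CHN"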

Assessing Cultural Alignment of LLMs

We measure the cultural alignment of LLMs by administering to them surveys that sociologists have conducted across different populations. For each culture, we compute the similarity between the outputs of the LLM and the actual survey responses from that culture to determine how well the LLM aligns with that culture.
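
The exact metric is implemented in result_analysis_run_3.py; as a purely illustrative sketch, one way to compare a model's answer distribution for a question with the survey distribution of a culture is one minus the total variation distance (the function name and metric below are assumptions, not the paper's definition):

import numpy as np

def option_similarity(llm_dist, survey_dist):
    """Hypothetical similarity between two answer-option distributions
    (each summing to 1); the metric actually used is defined in
    result_analysis_run_3.py."""
    llm_dist = np.asarray(llm_dist, dtype=float)
    survey_dist = np.asarray(survey_dist, dtype=float)
    # 1 - total variation distance: 1.0 means identical distributions.
    return 1.0 - 0.5 * np.abs(llm_dist - survey_dist).sum()

# Example for a four-option question.
print(option_similarity([0.1, 0.2, 0.3, 0.4], [0.25, 0.25, 0.25, 0.25]))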

To complete the assessment, please first run the inference scripts to obtain LLM outputs for the World Values Survey (WVS) questions. Six scripts using different prompting templates are provided below (an illustrative prompt sketch follows the list).

  1. Culture-Unaware Prompting:
bash culture_unaware_prompting.sh
  2. Culture-Aware Prompting (P1):
bash culture_aware_prompting.sh
  3. Cross-Culture Thinking (P2):
bash cross_culture_thinking_prompting.sh
  4. Self-Alignment (P3):
bash self_alignment_prompting.sh
  5. P1+P3:
bash culture_aware_self_alignment_prompting.sh
  6. P2+P3:
bash cross_culture_thinking_prompting_self_alignment.sh
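
For intuition, a culture-aware prompt wraps each WVS question with an instruction to answer from the perspective of a given culture. The template below is purely illustrative (the country, question, and wording are placeholders); the templates actually used are defined in the prompting scripts above:

# Illustrative only; see culture_aware_prompting.sh and its inference code
# for the real templates.
country = "Germany"  # hypothetical placeholder
question = "How important is family in your life? A. Very important B. Rather important C. Not very important D. Not at all important"
prompt = (
    f"You are a person from {country}. Answer the following survey question "
    f"by choosing exactly one option.\n\n{question}\nAnswer:"
)
print(prompt)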

Then run python result_analysis_run_3.py to compute the Cultural Alignment Score.

CultureSPA

CultureSPA is a Self-Pluralising Culture Alignment framework that allows LLMs to align with pluralistic cultures simultaneously. It involves 4 key steps: 1. Generating Diverse Culture-Related Questions; 2. Yielding Culture-Unaware/Aware LLM Outputs; 3. Culture-Related QA Pairs Collecting; 4. Culture-Joint/Specific SFT. The scripts for each step are provided below.

1. Generating Diverse Culture-Related Questions

bash 1.GDCRQ.sh
python data_process/1.merge_data.py
cd data_process
jupyter notebook 2.question_filtering.ipynb  # open and run the filtering notebook
cd ..
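
If you prefer to run the filtering notebook non-interactively, nbconvert can execute it in place (assuming jupyter and nbconvert are available in your environment):

jupyter nbconvert --to notebook --execute --inplace 2.question_filtering.ipynb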

2. Yielding Culture-Unaware/Aware LLM Outputs

bash 2.YCULO.sh
bash 2.YCALO.sh
# bash 2.YCALO_CCT.sh  # optional: culture-aware outputs for the CultureSPA (CCT) variant

3. Culture-Related QA Pairs Collecting

python data_process/3.CRQPC.py
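
Conceptually, this step pairs each generated question with the corresponding culture-unaware/aware model output to build training examples. The snippet below is only a hypothetical illustration of that pairing (the variable contents and file name are placeholders; 3.CRQPC.py is authoritative):

import json

questions = ["How important is family in your life?"]        # placeholder
answers = ["Family is extremely important in my culture."]   # placeholder

# Collect question-answer pairs into a simple JSON list.
qa_pairs = [{"question": q, "answer": a} for q, a in zip(questions, answers)]
with open("qa_pairs_example.json", "w", encoding="utf-8") as f:
    json.dump(qa_pairs, f, ensure_ascii=False, indent=2)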

4. Culture-Joint/Specific SFT

python data_process/4.format_data_alpaca.py
cd LLaMA-Factory
llamafactory-cli train CultureSPA_culture_aware.yaml

After these steps, we can obtain the CultureSPA model. Please edit these scripts to obtain variants such as CultureSPA (CCT) and CultureSPA (specific).
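
Step 4 first converts the collected QA pairs into Alpaca-style records (python data_process/4.format_data_alpaca.py) and then fine-tunes the model with LLaMA-Factory. For orientation, a LLaMA-Factory SFT config typically contains fields like the ones sketched below; this is not the repository's CultureSPA_culture_aware.yaml, and the dataset name is a placeholder that would have to be registered in LLaMA-Factory's dataset_info.json:

### model
model_name_or_path: meta-llama/Meta-Llama-3-8B-Instruct

### method
stage: sft
do_train: true
finetuning_type: lora

### dataset (placeholder name)
dataset: culture_spa_joint
template: llama3
cutoff_len: 1024

### output
output_dir: saves/CultureSPA

### train
per_device_train_batch_size: 1
gradient_accumulation_steps: 8
learning_rate: 1.0e-4
num_train_epochs: 3.0
bf16: true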

Citation

@article{CultureSPA,
      title={Self-Pluralising Culture Alignment for Large Language Models},
      author={Shaoyang Xu and Yongqi Leng and Linhao Yu and Deyi Xiong},
      journal={arXiv preprint arXiv:2410.12971},
      year={2024},
      url={https://arxiv.org/abs/2410.12971}
}
