CultureSPA

Introduction

Welcome to the official repository of our paper Self-Pluralising Culture Alignment for Large Language Models.

Data resources used in the experiments are downloaded from the World Values Survey Wave 7 (2017-2022). We implement SFT for CultureSPA with LLaMA-Factory.

Getting Started

Environments

pip install -r requirements.txt
cd LLaMA-Factory
pip install -e ".[torch,metrics]"
cd ..
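
To check that the LLaMA-Factory CLI was installed correctly (optional; recent LLaMA-Factory releases provide a version subcommand):

llamafactory-cli version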

Data Resources

The data directory contains several useful data resources (a loading sketch follows this list):

  • wvs_questions.json: 290 questions extracted from the World Values Survey Wave 7 (2017-2022).
  • proportions_group_by_country.json: aggregated answers to the WVS questions, grouped by country. Participants come from 18 countries across five continents:
    • America: USA (American), CAN (Canadian), BOL (Bolivian), BRA (Brazilian);
    • Europe: GBR (British), NLD (Dutch), DEU (German), UKR (Ukrainian);
    • Asia: CHN (Chinese), RUS (Russian), IND (Indian), THA (Thai);
    • Africa: KEN (Kenyan), NGA (Nigerian), ETH (Ethiopian), ZWE (Zimbabwean);
    • Oceania: AUS (Australian), NZL (New Zealand).
  • country_similarity.json: similarities between countries, used by the Cross-Culture Thinking method.
  • self_alignment_examples.json: similarities between questions, used by the Self-Alignment method.
  • Meta-Llama-3-8B-Instruct folder: 13,000 questions on 13 culture topics generated by Meta-Llama-3-8B-Instruct, along with model responses to these questions and the final training data for CultureSPA (joint/specific). Please note that the training data for CultureSPA (CCT) is not included due to storage constraints.
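
As a quick illustration, these JSON resources can be loaded and inspected with a few lines of Python. The field names inside each file are not documented here, so the snippet below (a minimal sketch, assuming the repository's data/ layout) only loads two of the files and prints their top-level structure:

import json

# Load the WVS questions and the per-country answer proportions.
with open("data/wvs_questions.json", encoding="utf-8") as f:
    questions = json.load(f)
with open("data/proportions_group_by_country.json", encoding="utf-8") as f:
    proportions = json.load(f)

# Inspect the top-level structure; the exact keys depend on the actual files.
print(type(questions), len(questions))
if isinstance(proportions, dict):
    print(list(proportions.keys())[:5])  # e.g. country codes such as "USA", "CHN"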

Assessing Cultural Alignment of LLMs

We measure the cultural alignment of LLMs by administering to them surveys that sociologists have conducted across different populations. For each culture, we compute the similarity between the outputs of the LLM and the actual survey responses from that culture to determine how well the LLM aligns with that culture.
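
The exact metric is implemented in result_analysis_run_3.py; as a purely illustrative sketch, one way to compare a model's answer distribution for a question with the survey distribution of a culture is one minus the total variation distance (the function name and metric below are assumptions, not the paper's definition):

import numpy as np

def option_similarity(llm_dist, survey_dist):
    """Hypothetical similarity between two answer-option distributions
    (each summing to 1); the metric actually used is defined in
    result_analysis_run_3.py."""
    llm_dist = np.asarray(llm_dist, dtype=float)
    survey_dist = np.asarray(survey_dist, dtype=float)
    # 1 - total variation distance: 1.0 means identical distributions.
    return 1.0 - 0.5 * np.abs(llm_dist - survey_dist).sum()

# Example for a four-option question.
print(option_similarity([0.1, 0.2, 0.3, 0.4], [0.25, 0.25, 0.25, 0.25]))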

To complete the assessment, please first run the inference scripts to obtain LLM outputs for the World Values Survey (WVS) questions. Six scripts using different prompting templates are provided below (an illustrative prompt sketch follows the list).

  1. Culture-Unaware Prompting:
bash culture_unaware_prompting.sh
  2. Culture-Aware Prompting (P1):
bash culture_aware_prompting.sh
  3. Cross-Culture Thinking (P2):
bash cross_culture_thinking_prompting.sh
  4. Self-Alignment (P3):
bash self_alignment_prompting.sh
  5. P1+P3:
bash culture_aware_self_alignment_prompting.sh
  6. P2+P3:
bash cross_culture_thinking_prompting_self_alignment.sh
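
For intuition, a culture-aware prompt wraps each WVS question with an instruction to answer from the perspective of a given culture. The template below is purely illustrative (the country, question, and wording are placeholders); the templates actually used are defined in the prompting scripts above:

# Illustrative only; see culture_aware_prompting.sh and its inference code
# for the real templates.
country = "Germany"  # hypothetical placeholder
question = "How important is family in your life? A. Very important B. Rather important C. Not very important D. Not at all important"
prompt = (
    f"You are a person from {country}. Answer the following survey question "
    f"by choosing exactly one option.\n\n{question}\nAnswer:"
)
print(prompt)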

Then run python result_analysis_run_3.py to compute the Cultural Alignment Score.

CultureSPA

CultureSPA is a Self-Pluralising Culture Alignment framework that allows LLMs to align with pluralistic cultures simultaneously. It involves 4 key steps: 1. Generating Diverse Culture-Related Questions; 2. Yielding Culture-Unaware/Aware LLM Outputs; 3. Culture-Related QA Pairs Collecting; 4. Culture-Joint/Specific SFT. The scripts for each step are provided below.

1. Generating Diverse Culture-Related Questions

bash 1.GDCRQ.sh
python data_process/1.merge_data.py
cd data_process
jupyter notebook 2.question_filtering.ipynb  # open and run the filtering notebook
cd ..
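
If you prefer to run the filtering notebook non-interactively, nbconvert can execute it in place (assuming jupyter and nbconvert are available in your environment):

jupyter nbconvert --to notebook --execute --inplace 2.question_filtering.ipynb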

2. Yielding Culture-Unaware/Aware LLM Outputs

bash 2.YCULO.sh
bash 2.YCALO.sh
# bash 2.YCALO_CCT.sh  # optional: culture-aware outputs for the CultureSPA (CCT) variant

3. Culture-Related QA Pairs Collecting

python data_process/3.CRQPC.py
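
Conceptually, this step pairs each generated question with the corresponding culture-unaware/aware model output to build training examples. The snippet below is only a hypothetical illustration of that pairing (the variable contents and file name are placeholders; 3.CRQPC.py is authoritative):

import json

questions = ["How important is family in your life?"]        # placeholder
answers = ["Family is extremely important in my culture."]   # placeholder

# Collect question-answer pairs into a simple JSON list.
qa_pairs = [{"question": q, "answer": a} for q, a in zip(questions, answers)]
with open("qa_pairs_example.json", "w", encoding="utf-8") as f:
    json.dump(qa_pairs, f, ensure_ascii=False, indent=2)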

4. Culture-Joint/Specific SFT

python data_process/4.format_data_alpaca.py
cd LLaMA-Factory
llamafactory-cli train CultureSPA_culture_aware.yaml

After these steps, we can obtain the CultureSPA model. Please edit these scripts to obtain variants such as CultureSPA (CCT) and CultureSPA (specific).
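
Step 4 first converts the collected QA pairs into Alpaca-style records (python data_process/4.format_data_alpaca.py) and then fine-tunes the model with LLaMA-Factory. For orientation, a LLaMA-Factory SFT config typically contains fields like the ones sketched below; this is not the repository's CultureSPA_culture_aware.yaml, and the dataset name is a placeholder that would have to be registered in LLaMA-Factory's dataset_info.json:

### model
model_name_or_path: meta-llama/Meta-Llama-3-8B-Instruct

### method
stage: sft
do_train: true
finetuning_type: lora

### dataset (placeholder name)
dataset: culture_spa_joint
template: llama3
cutoff_len: 1024

### output
output_dir: saves/CultureSPA

### train
per_device_train_batch_size: 1
gradient_accumulation_steps: 8
learning_rate: 1.0e-4
num_train_epochs: 3.0
bf16: true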

Citation

@article{CultureSPA,
      title={Self-Pluralising Culture Alignment for Large Language Models},
      author={Shaoyang Xu and Yongqi Leng and Linhao Yu and Deyi Xiong},
      journal={arXiv preprint arXiv:2410.12971},
      year={2024},
      url={https://arxiv.org/abs/2410.12971}
}
