
QA: kmmlu dataset class #58

Open
h-albert-lee opened this issue Jan 22, 2025 · 4 comments


No description provided.

h-albert-lee (Member, Author) commented

QA Guide

1. Prerequisite test

-> check that the dependencies in requirements.txt are correct

2. Single-pipeline test (a minimal API sketch for steps 2 and 3 follows this list)

-> runner CLI test (w/o scaling law)
-> evaluator CLI test (w/o scaling law)

3. Scaling test

-> best_of_n, self-consistency, beam search

4. llm-as-a-judge QA

Performance and timing check

-> measure the time each option takes and confirm that the results are as intended
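As a reference for steps 2 and 3, a minimal sketch of the Python entry point used in the reports below. The import path and zero-argument constructor are assumptions (only evaluator.run(...) itself appears in this thread; the llm_eval package name comes from the traceback further down), and omitted keyword arguments are assumed to fall back to the None/{} values spelled out in those reports:

from llm_eval.evaluator import Evaluator  # assumed import path

evaluator = Evaluator()  # assumed zero-argument constructor

# Step 2: single pipeline, no scaling method.
baseline = evaluator.run(
    model="huggingface",
    dataset="kmmlu",
    subset=["Accounting"],
    split="dev",
    model_params={"model_name_or_path": "facebook/opt-350m", "device": "cuda"},
    scaling_method=None,
)

# Step 3: the same pipeline with a scaling method enabled.
scaled = evaluator.run(
    model="huggingface",
    dataset="kmmlu",
    subset=["Accounting"],
    split="dev",
    model_params={"model_name_or_path": "facebook/opt-350m", "device": "cuda"},
    scaling_method="best_of_n",  # or "self_consistency", "beam_search"
)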

ksyint moved this from Todo to In Progress in Haerae-eval-toolkit develop on Feb 11, 2025
ksyint (Contributor) commented Feb 11, 2025

(Quoting the QA guide above.)

1. No dependency issues
2. No issues found without scaling
3. Scaling experiments
• best_of_n, self-consistency: no issues found
• beam search: in progress

@h-albert-lee

ksyint (Contributor) commented Feb 12, 2025

Tested with facebook/opt-350m; accounting_dev: 5 rows total; Python 3.10.6.

results = evaluator.run(
    model="huggingface",            # or "vllm", "openai", etc.
    judge_model=None,               # e.g. "huggingface_judge" if needed
    reward_model=None,              # e.g. "huggingface_reward" if needed
    dataset="kmmlu",                # or "qarv", ...
    subset=["Accounting"],          # optional subset(s)
    split="dev",                    # "train" / "validation" / "test"
    dataset_params={},              # extra HF dataset config
    model_params={"model_name_or_path": "facebook/opt-350m", "device": "cuda"},
    judge_params={},                # params for judge model (if judge_model is not None)
    reward_params={},               # params for reward model (if reward_model is not None)
    scaling_method=None,            # or "best_of_n", "beam_search"
    scaling_params={},              # e.g. {"beam_size": 3, "num_iterations": 5}
    evaluator_params={},            # custom evaluation settings
)

batch size: 1, GPU memory: 2030 MB

{'dataset_name': 'kmmlu', 'subset': ['Accounting'], 'split': 'dev', 'model_backend_name': 'huggingface', 'scaling_method_name': None, 'evaluation_method_name': 'string_match', 'elapsed_time_sec': 7.246223211288452}
Metrics: {'accuracy': 0.0}

results = evaluator.run(
    model="huggingface",            # or "vllm", "openai", etc.
    judge_model=None,               # e.g. "huggingface_judge" if needed
    reward_model=None,              # e.g. "huggingface_reward" if needed
    dataset="kmmlu",                # or "qarv", ...
    subset=["Accounting"],          # optional subset(s)
    split="dev",                    # "train" / "validation" / "test"
    dataset_params={},              # extra HF dataset config
    model_params={"model_name_or_path": "facebook/opt-350m", "device": "cuda"},
    judge_params={},                # params for judge model (if judge_model is not None)
    reward_params={},               # params for reward model (if reward_model is not None)
    scaling_method="best_of_n",     # or "beam_search"
    scaling_params={},              # e.g. {"beam_size": 3, "num_iterations": 5}
    evaluator_params={},            # custom evaluation settings
)

batch size: 1, GPU memory: 2030 MB

{'dataset_name': 'kmmlu', 'subset': ['Accounting'], 'split': 'dev', 'model_backend_name': 'huggingface', 'scaling_method_name': 'best_of_n', 'evaluation_method_name': 'string_match', 'elapsed_time_sec': 21.756542444229126}
Metrics: {'accuracy': 0.0}

results = evaluator.run(
    model="huggingface",            # or "vllm", "openai", etc.
    judge_model=None,               # e.g. "huggingface_judge" if needed
    reward_model=None,              # e.g. "huggingface_reward" if needed
    dataset="kmmlu",                # or "qarv", ...
    subset=["Accounting"],          # optional subset(s)
    split="dev",                    # "train" / "validation" / "test"
    dataset_params={},              # extra HF dataset config
    model_params={"model_name_or_path": "facebook/opt-350m", "device": "cuda"},
    judge_params={},                # params for judge model (if judge_model is not None)
    reward_params={},               # params for reward model (if reward_model is not None)
    scaling_method="self_consistency",  # or "best_of_n", "beam_search"
    scaling_params={},              # e.g. {"beam_size": 3, "num_iterations": 5}
    evaluator_params={},            # custom evaluation settings
)

batch size: 1, GPU memory: 2030 MB

{'dataset_name': 'kmmlu', 'subset': ['Accounting'], 'split': 'dev', 'model_backend_name': 'huggingface', 'scaling_method_name': 'self_consistency', 'evaluation_method_name': 'string_match', 'elapsed_time_sec': 23.204894542694092}
Metrics: {'accuracy': 0.0}

beam search fails in llm_eval/scaling_methods/beam_search.py:

    final_candidates = list(set(final_candidates))
TypeError: unhashable type: 'Beam'
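The error means Beam instances are not hashable, so set() cannot deduplicate them. A minimal workaround sketch: deduplicate by the decoded candidate text instead; note that the text attribute name is hypothetical, not taken from the toolkit. Alternatively, Beam could implement __eq__ and __hash__ so the original set() call works.

seen = set()
unique_candidates = []
for beam in final_candidates:
    key = beam.text  # hypothetical attribute; use whatever field identifies a candidate
    if key not in seen:
        seen.add(key)
        unique_candidates.append(beam)
final_candidates = unique_candidates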

results = evaluator.run(
    model="multi",                  # or "vllm", "openai", etc.
    judge_model=None,               # e.g. "huggingface_judge" if needed
    reward_model=None,              # e.g. "huggingface_reward" if needed
    dataset="kmmlu",                # or "qarv", ...
    subset=["Accounting"],          # optional subset(s)
    split="dev",                    # "train" / "validation" / "test"
    dataset_params={},              # extra HF dataset config
    model_params={
        "generate_model": {
            "name": "huggingface",
            "params": {"model_name_or_path": "facebook/opt-350m", "device": "cuda"},
        },
        "judge_model": None,
        "reward_model": {
            "name": "huggingface_reward",
            "params": {"model_name_or_path": "facebook/opt-350m", "device": "cuda"},
        },
    },
    judge_params={},                # params for judge model (if judge_model is not None)
    reward_params={},               # params for reward model (if reward_model is not None)
    scaling_method="best_of_n",     # or "beam_search"
    scaling_params=None,            # e.g. {"beam_size": 3, "num_iterations": 5}
    evaluator_params={},            # custom evaluation settings
)

batch size: 1, GPU memory: 3200 MB

{'dataset_name': 'kmmlu', 'subset': ['Accounting'], 'split': 'dev', 'model_backend_name': 'multi', 'scaling_method_name': 'best_of_n', 'evaluation_method_name': 'string_match', 'elapsed_time_sec': 74.12414789199829}
Metrics: {'accuracy': 0.0}
@h-albert-lee

h-albert-lee (Member, Author) commented

@ksyint Looks good! However, please also share the environment you used for the experiments (Python version, OS, GPU, etc.) and the commands you ran.
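For collecting that report, a quick standard-library + torch sketch (torch is assumed to be installed, since the runs above use device="cuda"):

import platform
import sys

import torch  # assumed present, since the runs above use device="cuda"

# Print the requested environment details in one place.
print("python:", sys.version.split()[0])
print("os    :", platform.platform())
print("torch :", torch.__version__)
print("gpu   :", torch.cuda.get_device_name(0) if torch.cuda.is_available() else "none")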
