
QA: kmmlu dataset class #58

Open
h-albert-lee opened this issue Jan 22, 2025 · 4 comments


No description provided.

h-albert-lee (Member, Author) commented

QA Guide

1. Prerequisite test

-> check that the dependencies in requirements.txt are correct

2. Single-pipeline test (a minimal API sketch for steps 2 and 3 follows this list)

-> runner CLI test (w/o scaling law)
-> evaluator CLI test (w/o scaling law)

3. Scaling test

-> best_of_n, self-consistency, beam search

4. llm-as-a-judge QA

Performance and timing check

-> measure the time each option takes and confirm that the results are as intended
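As a reference for steps 2 and 3, a minimal sketch of the Python entry point used in the reports below. The import path and zero-argument constructor are assumptions (only evaluator.run(...) itself appears in this thread; the llm_eval package name comes from the traceback further down), and omitted keyword arguments are assumed to fall back to the None/{} values spelled out in those reports:

from llm_eval.evaluator import Evaluator  # assumed import path

evaluator = Evaluator()  # assumed zero-argument constructor

# Step 2: single pipeline, no scaling method.
baseline = evaluator.run(
    model="huggingface",
    dataset="kmmlu",
    subset=["Accounting"],
    split="dev",
    model_params={"model_name_or_path": "facebook/opt-350m", "device": "cuda"},
    scaling_method=None,
)

# Step 3: the same pipeline with a scaling method enabled.
scaled = evaluator.run(
    model="huggingface",
    dataset="kmmlu",
    subset=["Accounting"],
    split="dev",
    model_params={"model_name_or_path": "facebook/opt-350m", "device": "cuda"},
    scaling_method="best_of_n",  # or "self_consistency", "beam_search"
)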

ksyint moved this from Todo to In Progress in Haerae-eval-toolkit develop on Feb 11, 2025
ksyint (Contributor) commented Feb 11, 2025

(Quoting the QA guide above.)

1. No dependency issues
2. No issues found without scaling
3. Scaling experiments
• best_of_n, self-consistency: no issues found
• beam search: in progress

@h-albert-lee

ksyint (Contributor) commented Feb 12, 2025

Tested with facebook/opt-350m; accounting_dev: 5 rows total; Python 3.10.6.

results = evaluator.run(
    model="huggingface",            # or "vllm", "openai", etc.
    judge_model=None,               # e.g. "huggingface_judge" if needed
    reward_model=None,              # e.g. "huggingface_reward" if needed
    dataset="kmmlu",                # or "qarv", ...
    subset=["Accounting"],          # optional subset(s)
    split="dev",                    # "train" / "validation" / "test"
    dataset_params={},              # extra HF dataset config
    model_params={"model_name_or_path": "facebook/opt-350m", "device": "cuda"},
    judge_params={},                # params for judge model (if judge_model is not None)
    reward_params={},               # params for reward model (if reward_model is not None)
    scaling_method=None,            # or "best_of_n", "beam_search"
    scaling_params={},              # e.g. {"beam_size": 3, "num_iterations": 5}
    evaluator_params={},            # custom evaluation settings
)

batch size: 1, GPU memory: 2030 MB

{'dataset_name': 'kmmlu', 'subset': ['Accounting'], 'split': 'dev', 'model_backend_name': 'huggingface', 'scaling_method_name': None, 'evaluation_method_name': 'string_match', 'elapsed_time_sec': 7.246223211288452}
Metrics: {'accuracy': 0.0}

results = evaluator.run(
    model="huggingface",            # or "vllm", "openai", etc.
    judge_model=None,               # e.g. "huggingface_judge" if needed
    reward_model=None,              # e.g. "huggingface_reward" if needed
    dataset="kmmlu",                # or "qarv", ...
    subset=["Accounting"],          # optional subset(s)
    split="dev",                    # "train" / "validation" / "test"
    dataset_params={},              # extra HF dataset config
    model_params={"model_name_or_path": "facebook/opt-350m", "device": "cuda"},
    judge_params={},                # params for judge model (if judge_model is not None)
    reward_params={},               # params for reward model (if reward_model is not None)
    scaling_method="best_of_n",     # or "beam_search"
    scaling_params={},              # e.g. {"beam_size": 3, "num_iterations": 5}
    evaluator_params={},            # custom evaluation settings
)

batch size: 1, GPU memory: 2030 MB

{'dataset_name': 'kmmlu', 'subset': ['Accounting'], 'split': 'dev', 'model_backend_name': 'huggingface', 'scaling_method_name': 'best_of_n', 'evaluation_method_name': 'string_match', 'elapsed_time_sec': 21.756542444229126}
Metrics: {'accuracy': 0.0}

results = evaluator.run(
    model="huggingface",            # or "vllm", "openai", etc.
    judge_model=None,               # e.g. "huggingface_judge" if needed
    reward_model=None,              # e.g. "huggingface_reward" if needed
    dataset="kmmlu",                # or "qarv", ...
    subset=["Accounting"],          # optional subset(s)
    split="dev",                    # "train" / "validation" / "test"
    dataset_params={},              # extra HF dataset config
    model_params={"model_name_or_path": "facebook/opt-350m", "device": "cuda"},
    judge_params={},                # params for judge model (if judge_model is not None)
    reward_params={},               # params for reward model (if reward_model is not None)
    scaling_method="self_consistency",  # or "best_of_n", "beam_search"
    scaling_params={},              # e.g. {"beam_size": 3, "num_iterations": 5}
    evaluator_params={},            # custom evaluation settings
)

batch size: 1, GPU memory: 2030 MB

{'dataset_name': 'kmmlu', 'subset': ['Accounting'], 'split': 'dev', 'model_backend_name': 'huggingface', 'scaling_method_name': 'self_consistency', 'evaluation_method_name': 'string_match', 'elapsed_time_sec': 23.204894542694092}
Metrics: {'accuracy': 0.0}

beam search fails in llm_eval/scaling_methods/beam_search.py:

    final_candidates = list(set(final_candidates))
TypeError: unhashable type: 'Beam'
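The error means Beam instances are not hashable, so set() cannot deduplicate them. A minimal workaround sketch: deduplicate by the decoded candidate text instead; note that the text attribute name is hypothetical, not taken from the toolkit. Alternatively, Beam could implement __eq__ and __hash__ so the original set() call works.

seen = set()
unique_candidates = []
for beam in final_candidates:
    key = beam.text  # hypothetical attribute; use whatever field identifies a candidate
    if key not in seen:
        seen.add(key)
        unique_candidates.append(beam)
final_candidates = unique_candidates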

results = evaluator.run(
    model="multi",                  # or "vllm", "openai", etc.
    judge_model=None,               # e.g. "huggingface_judge" if needed
    reward_model=None,              # e.g. "huggingface_reward" if needed
    dataset="kmmlu",                # or "qarv", ...
    subset=["Accounting"],          # optional subset(s)
    split="dev",                    # "train" / "validation" / "test"
    dataset_params={},              # extra HF dataset config
    model_params={
        "generate_model": {
            "name": "huggingface",
            "params": {"model_name_or_path": "facebook/opt-350m", "device": "cuda"},
        },
        "judge_model": None,
        "reward_model": {
            "name": "huggingface_reward",
            "params": {"model_name_or_path": "facebook/opt-350m", "device": "cuda"},
        },
    },
    judge_params={},                # params for judge model (if judge_model is not None)
    reward_params={},               # params for reward model (if reward_model is not None)
    scaling_method="best_of_n",     # or "beam_search"
    scaling_params=None,            # e.g. {"beam_size": 3, "num_iterations": 5}
    evaluator_params={},            # custom evaluation settings
)

batch size: 1, GPU memory: 3200 MB

{'dataset_name': 'kmmlu', 'subset': ['Accounting'], 'split': 'dev', 'model_backend_name': 'multi', 'scaling_method_name': 'best_of_n', 'evaluation_method_name': 'string_match', 'elapsed_time_sec': 74.12414789199829}
Metrics: {'accuracy': 0.0}
@h-albert-lee

h-albert-lee (Member, Author) commented

@ksyint Looks good! However, please also share the environment you used for the experiments (Python version, OS, GPU, etc.) and the commands you ran.
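For collecting that report, a quick standard-library + torch sketch (torch is assumed to be installed, since the runs above use device="cuda"):

import platform
import sys

import torch  # assumed present, since the runs above use device="cuda"

# Print the requested environment details in one place.
print("python:", sys.version.split()[0])
print("os    :", platform.platform())
print("torch :", torch.__version__)
print("gpu   :", torch.cuda.get_device_name(0) if torch.cuda.is_available() else "none")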
