
QA: hrm8k dataset class #65

Open

h-albert-lee opened this issue Jan 25, 2025 · 4 comments
@h-albert-lee (Member)

QA Guide

1. Prerequisite test

-> Verify that the dependencies in requirements.txt are correct (see the sketch after this list)

2. Single-pipeline test

-> runner CLI test (w/o scaling law)
-> evaluator CLI test (w/o scaling law)

3. Scaling test

-> best_of_n, self-consistency, beam search

4. LLM-as-a-judge QA

Performance and timing check

-> Measure the time each option takes and confirm that the intended results are produced
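
For step 1, a minimal sketch of a dependency check, assuming requirements.txt uses simple `name==version` pins; the `check_requirements` helper is an illustration, not part of the toolkit:

```python
# Minimal sketch for step 1: verify that every pin in requirements.txt
# is installed at the pinned version (assumes simple "name==version"
# lines; extras and range specifiers are not handled here).
from importlib.metadata import version, PackageNotFoundError

def check_requirements(path: str = "requirements.txt") -> bool:
    ok = True
    for line in open(path):
        line = line.strip()
        if not line or line.startswith("#") or "==" not in line:
            continue  # skip comments and unpinned entries
        name, _, pinned = line.partition("==")
        try:
            installed = version(name)
        except PackageNotFoundError:
            print(f"MISSING: {name}")
            ok = False
            continue
        if installed != pinned:
            print(f"MISMATCH: {name} installed={installed} pinned={pinned}")
            ok = False
    return ok

if __name__ == "__main__":
    raise SystemExit(0 if check_requirements() else 1)
```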

@ksyint (Contributor) commented Feb 14, 2025

In llm_eval/datasets/hrm8k.py, around lines 87-90:
answer = item.get("answer", "") ---> answer = str(item.get("answer", ""))

Explanation: when item.get("answer", "") returns an int, the code raises an error. A PR will follow once model test time has been measured.
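
A minimal sketch of the proposed fix in context; only the str() coercion comes from the report above, the rest of the function body is assumed:

```python
# Hypothetical reconstruction of _convert_to_list in
# llm_eval/datasets/hrm8k.py; only the str() coercion on the answer
# field comes from the report, the surrounding body is assumed.
def _convert_to_list(items):
    converted = []
    for item in items:
        # HRM8K rows may store the answer as an int, so coerce to str
        # before any string methods are called on it downstream.
        answer = str(item.get("answer", ""))
        converted.append({
            "input": item.get("question", ""),
            "reference": answer.strip(),
        })
    return converted
```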

@h-albert-lee (Member, Author)

Oh nice, please file a bug report.

@ksyint (Contributor) commented Feb 14, 2025

```python
results = evaluator.run(
    model="huggingface",        # or "vllm", "openai", etc.
    judge_model=None,           # specify e.g. "huggingface_judge" if needed
    reward_model=None,          # specify e.g. "huggingface_reward" if needed
    dataset="hrm8k",            # or "kmmlu", "qarv", ...
    subset=["MMMLU"],           # optional subset(s)
    split="test",               # "train"/"validation"/"test"
    dataset_params={},          # example HF config
    model_params={"model_name_or_path": "facebook/opt-350m"},  # example HF Transformers param
    judge_params={},            # params for judge model (if judge_model is not None)
    reward_params={},           # params for reward model (if reward_model is not None)
    scaling_method=None,        # or "beam_search", "best_of_n"
    scaling_params=None,        # e.g., {"beam_size": 3, "num_iterations": 5}
    evaluator_params={},        # e.g., custom evaluation settings
)
```

```
{'dataset_name': 'hrm8k', 'subset': ['MMMLU'], 'split': 'test', 'model_backend_name': 'huggingface', 'scaling_method_name': None, 'evaluation_method_name': 'string_match', 'elapsed_time_sec': 1059.6291382312775}
Metrics: {'accuracy': 0.0}
```

```python
results = evaluator.run(
    model="huggingface",            # or "vllm", "openai", etc.
    judge_model=None,               # specify e.g. "huggingface_judge" if needed
    reward_model=None,              # specify e.g. "huggingface_reward" if needed
    dataset="hrm8k",                # or "kmmlu", "qarv", ...
    subset=["MMMLU"],               # optional subset(s)
    split="test",                   # "train"/"validation"/"test"
    dataset_params={},              # example HF config
    model_params={"model_name_or_path": "facebook/opt-350m"},  # example HF Transformers param
    judge_params={},                # params for judge model (if judge_model is not None)
    reward_params={},               # params for reward model (if reward_model is not None)
    scaling_method="best_of_n",     # or "beam_search", "self_consistency"
    scaling_params=None,            # e.g., {"beam_size": 3, "num_iterations": 5}
    evaluator_params={},            # e.g., custom evaluation settings
)
```

Dataset: hrm8k
Subset: MMMLU
Split: Test
Model Backend: Hugging Face
Scaling Method: Best of N
Evaluation Metric: String Match
Elapsed Time: 3833.72s (~1 hour 4 minutes)
Accuracy: 0.0%

```python
results = evaluator.run(
    model="huggingface",                # or "vllm", "openai", etc.
    judge_model=None,                   # specify e.g. "huggingface_judge" if needed
    reward_model=None,                  # specify e.g. "huggingface_reward" if needed
    dataset="hrm8k",                    # or "kmmlu", "qarv", ...
    subset=["MMMLU"],                   # optional subset(s)
    split="test",                       # "train"/"validation"/"test"
    dataset_params={},                  # example HF config
    model_params={"model_name_or_path": "facebook/opt-350m"},  # example HF Transformers param
    judge_params={},                    # params for judge model (if judge_model is not None)
    reward_params={},                   # params for reward model (if reward_model is not None)
    scaling_method="self_consistency",  # or "beam_search", "best_of_n"
    scaling_params=None,                # e.g., {"beam_size": 3, "num_iterations": 5}
    evaluator_params={},                # e.g., custom evaluation settings
)
```

```
{'dataset_name': 'hrm8k', 'subset': ['MMMLU'], 'split': 'test', 'model_backend_name': 'huggingface', 'scaling_method_name': 'self_consistency', 'evaluation_method_name': 'string_match', 'elapsed_time_sec': 3581.719957590103}
Metrics: {'accuracy': 0.0}
```

@ksyint ksyint moved this from In Progress to Done in Haerae-eval-toolkit develop Feb 28, 2025
@ksyint ksyint closed this as completed by moving to Done in Haerae-eval-toolkit develop Feb 28, 2025
@ksyint ksyint moved this from Done to In Progress in Haerae-eval-toolkit develop Feb 28, 2025
@ksyint (Contributor) commented Feb 28, 2025

File "/home/user/workspaces/Soo/haerae-evaluation-toolkit/llm_eval/datasets/hrm8k.py", line 90, in _convert_to_list
"reference": answer.strip(),
AttributeError: 'float' object has no attribute 'strip'
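
The str() coercion proposed above would also cover this case, since a float answer hits the same .strip() call. A minimal defensive variant (`to_reference` is a hypothetical helper, not the committed fix):

```python
def to_reference(item: dict) -> str:
    # Hypothetical helper: coerce any scalar answer (int, float, str,
    # None) to a string before stripping, so .strip() cannot fail.
    raw = item.get("answer", "")
    return ("" if raw is None else str(raw)).strip()
```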

@ksyint ksyint reopened this Feb 28, 2025