A new standard for evaluating AI on legal & professional tasks.
BigLaw Bench is a comprehensive framework for evaluating the performance of large language models (LLMs) on complex, real-world legal tasks. Developed by Harvey's legal research team, BigLaw Bench aims to supplement existing benchmarks by focusing on tasks that mirror actual billable work performed by lawyers, providing a more accurate measure of an AI model's utility in professional legal settings.
BigLaw Bench Core is a set of tasks for benchmarking baseline legal problem-solving. Core tasks are organized into two primary categories, each encompassing several specific sub-task types:
- Transactional
  - Corporate Strategy & Advising
  - Drafting
  - Legal Research
  - Due Diligence
  - Risk Assessment & Compliance
  - Negotiation Strategy
  - Deal Management
  - Transaction Structuring
  - Regulatory & Advising
- Litigation
  - Analysis of Litigation Filings
  - Case Management
  - Drafting
  - Case Law Research
  - Transcript Analysis
  - Document Review and Analysis
  - Trial Preparations & Oral Argument
BigLaw Bench Workflows is a set of composite legal tasks used to evaluate agentic systems. We currently provide benchmarks for:
- SPA Deal Points: Evaluates the ability of LLM agents to extract a variety of standard deal points from a dataset of Share Purchase Agreements (SPAs).
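
The evaluation harness for this workflow is not described here. As a rough illustration only, the sketch below shows one way extracted deal points could be compared against gold annotations; the field names and the exact-match criterion are assumptions for this example, not part of the benchmark.

```python
# Illustrative only: scoring extracted SPA deal points against gold labels.
# Field names and the exact-match criterion are assumptions for this sketch,
# not BigLaw Bench's actual grading logic.
from typing import Dict

def score_deal_points(extracted: Dict[str, str], gold: Dict[str, str]) -> float:
    """Fraction of gold deal points the agent extracted correctly."""
    if not gold:
        return 0.0
    correct = sum(
        1 for field, value in gold.items()
        if extracted.get(field, "").strip().lower() == value.strip().lower()
    )
    return correct / len(gold)

# Hypothetical deal-point fields:
gold = {"purchase_price": "$250,000,000", "governing_law": "Delaware"}
extracted = {"purchase_price": "$250,000,000", "governing_law": "New York"}
print(score_deal_points(extracted, gold))  # 0.5
```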
BigLaw Bench Retrieval is a set of datasets and tasks for benchmarking the quality of retrieval systems. We currently provide benchmarks for:
- Contracts: Complex documents (e.g., hundreds of pages and potentially hundreds of thousands of tokens of text) with cross-references and defined terms that must be tracked to effectively contextualize relevant text. We currently support two types of contracts: Merger Agreements and SPAs.
- Discovery Emails: Relatively short documents that arrive in high volume, with complex relationships (e.g., email threads) and rich metadata (sender, recipient, attachments) essential to identifying relevant messages.
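
This summary does not specify how retrieval quality is scored. Under that caveat, recall@k over labeled-relevant passages is one common way such systems are evaluated; the minimal sketch below uses hypothetical passage IDs and is not the benchmark's actual metric.

```python
# Illustrative only: recall@k as one common retrieval-quality metric.
# BigLaw Bench's actual scoring may differ; IDs below are hypothetical.
from typing import List, Set

def recall_at_k(retrieved_ids: List[str], relevant_ids: Set[str], k: int) -> float:
    """Fraction of labeled-relevant passages found in the top-k results."""
    if not relevant_ids:
        return 0.0
    top_k = set(retrieved_ids[:k])
    return len(top_k & relevant_ids) / len(relevant_ids)

# Hypothetical passages labeled relevant for a query over a Merger Agreement:
relevant = {"section_8.2", "definition_material_adverse_effect"}
retrieved = ["section_8.2", "section_1.1", "exhibit_a", "definition_material_adverse_effect"]
print(recall_at_k(retrieved, relevant, k=3))  # 0.5
```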
Each task in BigLaw Bench is assessed using custom-designed rubrics that measure:
- Answer Quality: Evaluates the completeness, accuracy, and appropriateness of the model's response based on specific criteria essential for effective task completion.
- Source Reliability: Assesses the model's ability to provide verifiable and correctly cited sources for its assertions, enhancing trust and facilitating validation.
Scores are calculated by combining positive points for meeting task requirements and negative points for errors or missteps (e.g., hallucinations). The final answer score represents: what percentage of a lawyer-quality work product does the model complete for the user?
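
As a concrete illustration of this scoring scheme, the sketch below aggregates positive and negative rubric points into an answer score. The exact aggregation BigLaw Bench uses is not specified in this summary; the sketch assumes earned points (including any deductions) are normalized by the total positive points available, and the rubric items shown are hypothetical.

```python
# Minimal sketch of the positive/negative rubric scoring described above.
# Assumption: score = (earned points, including deductions) / (total positive points).
# Rubric items here are hypothetical examples.
from dataclasses import dataclass
from typing import List

@dataclass
class RubricItem:
    description: str
    points: float   # positive for required elements, negative for errors (e.g., hallucinations)
    met: bool       # whether the model's response triggered this item

def answer_score(items: List[RubricItem]) -> float:
    """Share of a lawyer-quality work product the response represents."""
    available = sum(i.points for i in items if i.points > 0)
    earned = sum(i.points for i in items if i.met)
    return max(0.0, earned / available) if available else 0.0

rubric = [
    RubricItem("Identifies the governing indemnification cap", 2.0, met=True),
    RubricItem("Cites the correct section of the agreement", 1.0, met=True),
    RubricItem("Hallucinates a non-existent clause", -2.0, met=False),
]
print(answer_score(rubric))  # 1.0 -> all positive points earned, no deductions
```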
Sample tasks and grading rubrics can be found at the links below.
For access to the full dataset and additional resources, please contact Harvey directly.
Julio Pereyra, Elizabeth Lebens, Matthew Guillod, Laura Toulme, Cameron MacGregor, David Murdter, Karl de la Roche, Emilie McConnachie, Jeremy Pushkin, Rina Kim, Aaron Chan, Jenny Pan, Boling Yang, Nan Wu, Niko Grupen, Lauren Oh, Aatish Nayak, Gabriel Pereyra