BigLaw Bench

A new standard for evaluating AI on legal & professional tasks.

Overview

BigLaw Bench is a comprehensive framework for evaluating the performance of large language models (LLMs) on complex, real-world legal tasks. Developed by Harvey's legal research team, BigLaw Bench aims to supplement existing benchmarks by focusing on tasks that mirror actual billable work performed by lawyers, providing a more accurate measure of an AI model's utility in professional legal settings.

Benchmarks

1. BigLaw Bench — Core

BigLaw Bench Core is a set of tasks for benchmarking baseline legal problem-solving. Core tasks are organized into two primary categories, each encompassing several specific sub-task types:

Transactional Task Categories

  • Corporate Strategy & Advising
  • Drafting
  • Legal Research
  • Due Diligence
  • Risk Assessment & Compliance
  • Negotiation Strategy
  • Deal Management
  • Transaction Structuring
  • Regulatory & Advising

Litigation Task Categories

  • Analysis of Litigation Filings
  • Case Management
  • Drafting
  • Case Law Research
  • Transcript Analysis
  • Document Review and Analysis
  • Trial Preparations & Oral Argument
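
As a purely illustrative sketch, a single Core task can be pictured as a record pairing one of these categories with a prompt and any supporting materials. The field names below are assumptions, not the repository's actual schema.

```python
# Hypothetical sketch of a BigLaw Bench Core task record; field names and the
# example prompt are illustrative, not the released dataset's schema.
from dataclasses import dataclass, field

@dataclass
class CoreTask:
    practice_area: str                      # "transactional" or "litigation"
    category: str                           # e.g. "Drafting", "Transcript Analysis"
    prompt: str                             # the instruction given to the model
    reference_materials: list[str] = field(default_factory=list)  # attached documents, if any

example = CoreTask(
    practice_area="litigation",
    category="Transcript Analysis",
    prompt="Summarize the key admissions made by the witness in the attached deposition transcript.",
    reference_materials=["deposition_transcript.txt"],
)
```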

2. BigLaw Bench — Workflows

BigLaw Bench Workflows is a set of composite legal tasks used to evaluate agentic systems. We currently provide benchmarks for:

  • SPA Deal Points: Evaluates the ability of LLM agents to extract a variety of standard deal points from a dataset of Share Purchase Agreements (SPAs).
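
To make the SPA Deal Points workflow concrete, the sketch below scores an agent's extracted fields against gold labels. The deal-point names, the values, the commented-out agent call, and the exact-match criterion are all illustrative assumptions rather than the benchmark's published method.

```python
# Hypothetical sketch of scoring an SPA deal-point extraction; field names,
# values, and the exact-match criterion are illustrative assumptions.
GOLD = {
    "purchase_price": "$250,000,000",
    "governing_law": "Delaware",
    "indemnification_cap": "10% of purchase price",
}

def score_extraction(predicted: dict[str, str], gold: dict[str, str]) -> float:
    """Return the fraction of deal points the agent extracted correctly (exact match)."""
    correct = sum(1 for key, value in gold.items() if predicted.get(key, "").strip() == value)
    return correct / len(gold)

# predicted = extract_deal_points(agent, spa_text)   # hypothetical agent call
predicted = {"purchase_price": "$250,000,000", "governing_law": "New York"}
print(f"Deal-point accuracy: {score_extraction(predicted, GOLD):.0%}")   # 33%
```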

3. BigLaw Bench — Retrieval

BigLaw Bench Retrieval is a set of datasets and tasks for benchmarking the quality of retrieval systems. We currently provide benchmarks for:

  • Contracts: Complex documents (e.g., hundreds of pages and potentially hundreds of thousands of tokens of text) with cross-references and defined terms that must be tracked to effectively contextualize relevant text. We currently support two contract types: Merger Agreements and SPAs.

  • Discovery Emails: Relatively short documents that come in high-volume and have complex relationships (e.g., email threads) and rich metadata (sender, recipient, attachments) essential to identifying relevant messages.
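
For intuition, the quality of a retrieval run over documents like these can be checked with a simple recall-style metric, as in the sketch below. The chunk identifiers and the recall@k choice are illustrative assumptions, not the benchmark's published metric.

```python
# Hypothetical recall@k check for a retrieval system; chunk IDs are invented.
def recall_at_k(retrieved_ids: list[str], relevant_ids: set[str], k: int) -> float:
    """Fraction of the relevant passages that appear in the top-k retrieved results."""
    hits = sum(1 for doc_id in retrieved_ids[:k] if doc_id in relevant_ids)
    return hits / len(relevant_ids) if relevant_ids else 0.0

relevant = {"spa_section_4_2", "spa_def_material_adverse_effect"}      # passages a lawyer would need
retrieved = ["spa_section_4_2", "spa_section_7_1", "spa_def_material_adverse_effect", "email_0123"]
print(f"recall@3 = {recall_at_k(retrieved, relevant, k=3):.2f}")        # 1.00
```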

Evaluation Methodology

Each task in BigLaw Bench is assessed using custom-designed rubrics that measure:

  • Answer Quality: Evaluates the completeness, accuracy, and appropriateness of the model's response based on specific criteria essential for effective task completion.
  • Source Reliability: Assesses the model's ability to provide verifiable and correctly cited sources for its assertions, enhancing trust and facilitating validation.

Scores are calculated by combining positive points for meeting task requirements with negative points for errors or missteps (e.g., hallucinations). The final answer score answers the question: what percentage of a lawyer-quality work product does the model complete for the user?
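
To make the scoring arithmetic concrete, the sketch below computes a net answer score from invented rubric items and penalties; the actual BigLaw Bench rubrics, point values, and penalty scheme are not reproduced here.

```python
# Minimal sketch of the scoring scheme described above, assuming each rubric item
# carries positive points and each misstep a penalty; items and values are invented.
rubric = {                                   # positive criteria the response should satisfy
    "identifies governing law": 2,
    "flags indemnification cap": 3,
    "cites the correct section": 1,
}
penalties = {"hallucinated citation": -2}    # negative points for errors

def answer_score(criteria_met: set[str], errors: list[str]) -> float:
    """Return net points earned as a fraction of the maximum available positive points."""
    earned = sum(points for item, points in rubric.items() if item in criteria_met)
    lost = sum(penalties.get(error, 0) for error in errors)
    return max(0.0, (earned + lost) / sum(rubric.values()))

# A response meeting two criteria with one hallucinated citation: (2 + 3 - 2) / 6 = 0.5
print(answer_score({"identifies governing law", "flags indemnification cap"},
                   ["hallucinated citation"]))
```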

Data Samples

Sample tasks and grading rubrics can be found at the links below.

  1. BLB-core: here
  2. BLB-workflows-spa: here
  3. BLB-retrieval: here

For access to the full dataset and additional resources, please contact Harvey directly.

Credits

Julio Pereyra, Elizabeth Lebens, Matthew Guillod, Laura Toulme, Cameron MacGregor, David Murdter, Karl de la Roche, Emilie McConnachie, Jeremy Pushkin, Rina Kim, Aaron Chan, Jenny Pan, Boling Yang, Nan Wu, Niko Grupen, Lauren Oh, Aatish Nayak, Gabriel Pereyra
