The source code for this thesis was prepared at the Technical University of Denmark, Department of Applied Mathematics and Computer Science, in fulfillment of the requirements for acquiring a Master of Science degree in Human-Centered Artificial Intelligence. The project was prepared by Gustav Selfort Hartz and has a workload of 30 ECTS.
Project Title: Exploring CUAD using RoBERTa span-selection QA models for legal contract review
Project Abstract: Deep learning models for span-selection question answering have, with the advent of transformer models such as BERT and RoBERTa, shown excellent performance on established tasks like SQuAD and GLUE. Training these models requires large annotated datasets, making them expensive to use in highly specialized domains.
The Contract Understanding Atticus Dataset (CUAD) is a new public dataset by The Atticus Project, Inc., suitable for training span-selection models for legal contract review. This thesis presents an in-depth analysis of the performance obtained from the RoBERTa models presented by the authors of CUAD. It examines the models' robustness to multiple new formulations of the original questions.
The thesis improves the top 3 Has Ans metric from 92.5% to 94.1%, but cannot conclude that this is due to the augmentations made to the dataset. Additionally, it suggests that performance gains can be found by adjusting objective functions to better suit the performance metrics.
This training setup requires some slight alterations to the SQuAD processor of the transformers library, including a smaller chunk size for SQuAD preprocessing and a different dataset structure. A fork of transformers is not provided, but you can find the altered squad.py file in the root directory.
# Update squad file on ai platform GCP
cp squad.py /opt/conda/lib/python3.7/site-packages/transformers/data/processors/squad.py
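For reference, here is a minimal sketch of how the stock SQuAD processor is typically invoked. File names and parameter values below are placeholders, not the exact training configuration; the altered squad.py changes the preprocessing internals (e.g. the multiprocessing chunk size), not this calling pattern.

from transformers import AutoTokenizer
from transformers.data.processors.squad import (
    SquadV2Processor,
    squad_convert_examples_to_features,
)

# Load CUAD (SQuAD v2 JSON format) and convert it to training features
tokenizer = AutoTokenizer.from_pretrained("roberta-base", use_fast=False)
processor = SquadV2Processor()
examples = processor.get_train_examples("data", filename="CUAD_v1.json")

features, dataset = squad_convert_examples_to_features(
    examples=examples,
    tokenizer=tokenizer,
    max_seq_length=512,
    doc_stride=256,
    max_query_length=64,
    is_training=True,
    return_dataset="pt",
    threads=4,
)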
For the creation of the custom question formulations, see notebooks/Custom Questions Creation.ipynb.
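As a rough illustration of the idea (the actual variants live in the notebook), reformulations can be generated from a CUAD category and its description. The templates below are hypothetical examples, not the ones used in the thesis:

# Illustrative only: generate alternative formulations of a CUAD question
templates = [
    'Highlight the parts (if any) of this contract related to "{cat}" that should be reviewed by a lawyer. Details: {desc}',
    'Which parts (if any) of this contract relate to "{cat}"? Details: {desc}',
    'Find the clauses in this contract concerning "{cat}". Details: {desc}',
]

def make_questions(cat: str, desc: str) -> list:
    return [t.format(cat=cat, desc=desc) for t in templates]

print(make_questions("Document Name", "The name of the contract"))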
Tested with the following Python versions: 3.7.12 and 3.8.13.
The conda_enviroment.py file is from a local development environment, and the training configuration can be extracted from the GCP VM image listed below.
Setup of the GCP environment used for training
gcloud notebooks instances create contract2 --vm-image-project="deeplearning-platform-release" --vm-image-name=pytorch-1-11-cu113-notebooks-v20220316-debian-10 --machine-type=n1-highmem-8 --location=europe-west4-a
# Adjust the data disk size
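# e.g. (disk name is a guess; notebook data disks are typically named "<instance>-data"):
# gcloud compute disks resize contract2-data --size=500GB --zone=europe-west4-a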
# Start the notebook
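# e.g.:
# gcloud notebooks instances start contract2 --location=europe-west4-a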
# Use the altered transformers package
cp squad.py /opt/conda/lib/python3.7/site-packages/transformers/data/processors/squad.py
# Install packages not already there
pip install transformers
pip install pytorch-lightning
# Run training
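# e.g. an HF run_squad.py-style invocation (script name and flags are illustrative;
# the actual training entry point is in the commit history):
# python train.py --model_type roberta --model_name_or_path roberta-base \
#     --do_train --do_eval --version_2_with_negative \
#     --train_file data/CUAD_v1.json --predict_file data/test.json \
#     --max_seq_length 512 --doc_stride 256 --output_dir ./models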
The thesis references the code used for training transformers on SQuAD. That code can be found in the commit history, but has been removed from the final repository to improve readability, since the code was not pretty.
The best model is now available through Hugging Face 🤗: gustavhartz/roberta-base-cuad-finetuned
# Requirements: pip install transformers torch PyPDF2
from transformers import pipeline
import PyPDF2
pipe = pipeline("question-answering", model="gustavhartz/roberta-base-cuad-finetuned", device=0)
# The contract to investigate
pdf_file = open('data/CUAD_v1/full_contract_pdf/Part_I/Affiliate_Agreements/LinkPlusCorp_20050802_8-K_EX-10_3240252_EX-10_Affiliate Agreement.pdf', 'rb')
pdfReader = PyPDF2.PdfFileReader(pdf_file)  # PyPDF2 1.x API; 2.x renamed this to PdfReader
num_pages = pdfReader.numPages
# Read each page and concatenate the text
texts = ""
for page_num in range(num_pages):
    page = pdfReader.getPage(page_num)
    texts += "\n" + page.extractText()
# Format data and add question
context = texts
question = "Highlight the parts (if any) of this contract related to \"Document Name\" that should be reviewed by a lawyer. Details: The name of the contract"
# Get predictions using the SQuAD V2 prediction logic
pipe(question=question, context=context, top_k=3)
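With top_k=3 the pipeline returns a list of three candidate answers; each entry is a dict with score, start, end, and answer keys (the values below are illustrative, not actual output):

# [{'score': 0.97, 'start': 1412, 'end': 1433, 'answer': 'AFFILIATE AGREEMENT'},
#  {'score': 0.02, 'start': 2101, 'end': 2130, 'answer': '...'},
#  {'score': 0.01, 'start': 310, 'end': 338, 'answer': '...'}]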
If you found the code or thesis helpful, please cite it :)
@thesis{ha2022,
author = {Hartz, Gustav Selfort},
title = {Exploring CUAD using RoBERTa span-selection QA models for legal contract review},
language = {English},
format = {thesis},
year = {2022},
publisher = {DTU Department of Applied Mathematics and Computer Science}
}
Based on the following resources:
Enable GPU monitoring, useful for maximizing GPU utilization (tuning batch size, etc.): https://cloud.google.com/compute/docs/gpus/monitor-gpus
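For a quick look at utilization while a job is running, polling nvidia-smi on the VM also works:

# Refresh GPU utilization and memory stats every second
watch -n 1 nvidia-smi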
This repo also contains functionality for syncing data between a remote server and my local machine, so some commits might appear odd or not live up to good standards.
Have something you want to add? Just open a pull request :-)