A simple and effective method for estimating the fraction of text in a large corpus that has been substantially modified or generated by AI:
- Distributional GPT Detection. In contrast with instance-level detection, this framework focuses on population-level estimates. We demonstrate how to estimate the proportion of content in a given corpus that has been generated or significantly modified by AI, without the need to perform inference on any individual instance.
- Easy Deployment and Usage. Our code can quickly estimate the distribution of both AI- and human-generated text without an expensive model training procedure. Using these estimated text distributions, we can accurately predict the fraction of text in a large corpus that has been substantially modified or generated by AI.
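The population-level idea above can be illustrated with a small mixture model. The sketch below is not the repository's implementation. It assumes each document already has a log-likelihood under a human distribution P and an AI distribution Q (the function names and toy inputs are hypothetical), and searches for the alpha that maximizes the log-likelihood of the mixture (1 - alpha) * P + alpha * Q.

```python
import math
from typing import Sequence


def mixture_log_likelihood(alpha: float,
                           log_p: Sequence[float],
                           log_q: Sequence[float]) -> float:
    """Log-likelihood of the corpus under (1 - alpha) * P + alpha * Q."""
    total = 0.0
    for lp, lq in zip(log_p, log_q):
        m = max(lp, lq)  # factor out the larger term for numerical stability
        inner = (1.0 - alpha) * math.exp(lp - m) + alpha * math.exp(lq - m)
        if inner <= 0.0:
            return float("-inf")
        total += m + math.log(inner)
    return total


def estimate_alpha(log_p: Sequence[float],
                   log_q: Sequence[float],
                   tol: float = 1e-6) -> float:
    """Ternary search over [0, 1]; valid because the objective is concave in alpha."""
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if mixture_log_likelihood(m1, log_p, log_q) < mixture_log_likelihood(m2, log_p, log_q):
            lo = m1
        else:
            hi = m2
    return (lo + hi) / 2.0


# Toy inputs (made up): 8 documents that strongly favor P, 2 that strongly favor Q.
toy_log_p = [0.0] * 8 + [-50.0] * 2
toy_log_q = [-50.0] * 8 + [0.0] * 2
print(round(estimate_alpha(toy_log_p, toy_log_q), 3))
```

With these toy inputs the estimate lands near 0.2, the share of documents that favor Q. No individual document is ever labeled; only the corpus-level fraction is estimated.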
This repository was built using Python 3.8.19 and should be compatible with any Python >= 3.8. It was developed and tested with pandas 2.0.3, numpy 1.24.4, pyarrow 15.0.2, fastparquet 2024.2.0, scipy 1.10.1, and ipykernel 6.29.4.
You can install this package locally via an editable installation or the provided yml file:
git clone https://github.com/Weixin-Liang/Mapping-the-Increasing-Use-of-LLMs-in-Scientific-Papers.git
cd Mapping-the-Increasing-Use-of-LLMs-in-Scientific-Papers
conda env create -f environment.yml
If you run into any problems during the installation process, please file a GitHub Issue.
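Before filing an issue, it may help to compare your installed package versions with the ones this repository was tested against. The snippet below is a convenience sketch, not part of the repository; the helper names are hypothetical, and a version mismatch does not necessarily mean something is broken.

```python
from importlib import metadata
from typing import Dict, Optional

# Versions this README reports as tested.
TESTED_VERSIONS = {
    "pandas": "2.0.3",
    "numpy": "1.24.4",
    "pyarrow": "15.0.2",
    "fastparquet": "2024.2.0",
    "scipy": "1.10.1",
    "ipykernel": "6.29.4",
}


def installed_version(package: str) -> Optional[str]:
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def compare_versions(expected: Dict[str, str]) -> Dict[str, str]:
    """Map each package name to 'ok', 'missing', or a note about the mismatch."""
    report = {}
    for name, wanted in expected.items():
        found = installed_version(name)
        if found is None:
            report[name] = "missing"
        elif found == wanted:
            report[name] = "ok"
        else:
            report[name] = f"differs (installed {found})"
    return report


for name, status in compare_versions(TESTED_VERSIONS).items():
    print(f"{name:<12}{status}")
```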
Once installed, estimating distributions and running inference is easy (see demo.ipynb for the full demo):
from src.estimation import estimate_text_distribution
from src.MLE import MLE

# Estimate the human and AI content distributions from the training data
estimate_text_distribution(
    "data/training_data/CS/human_data.parquet",
    "data/training_data/CS/ai_data.parquet",
    "distribution/CS.parquet",
)
# Load the word occurrence frequencies into the framework
model = MLE("distribution/CS.parquet")
# Validate the method on mixed corpora with a known ground-truth alpha
for alpha in [0, 0.025, 0.05, 0.075, 0.1, 0.125, 0.15, 0.175, 0.2, 0.225, 0.25]:
    estimated, ci = model.inference(f"data/validation_data/CS/ground_truth_alpha_{alpha}.parquet")
    error = abs(estimated - alpha)
    print(f"{'Ground Truth':>10},{'Prediction':>10},{'CI':>10},{'Error':>10}")
    print(f"{alpha:10.3f},{estimated:10.3f},{ci:10.3f},{error:10.3f}")
For a complete demonstration, check out demo.ipynb.
This repository includes the arXiv abstracts used for the analysis in our second paper. However, our framework can easily be extended to other domains of your choice. It requires two datasets--one consisting of documents written entirely by humans, and another consisting of documents written entirely by AI--which are used to estimate the distribution of human- and AI-generated text in your chosen domain. Using these estimates, you can perform inference on a target dataset with an unknown fraction of AI-generated content.
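To make "estimating the distribution" more concrete, the sketch below computes, for each word, the fraction of tokenized sentences in which it occurs. This only illustrates the idea of a word occurrence frequency. The repository's estimate_text_distribution may apply different filtering, smoothing, or output columns, and the function name and toy sentences here are made up.

```python
from collections import Counter
from typing import Dict, Iterable, List


def occurrence_frequency(sentences: Iterable[List[str]]) -> Dict[str, float]:
    """Fraction of sentences that contain each word at least once."""
    counts: Counter = Counter()
    total = 0
    for tokens in sentences:
        total += 1
        counts.update(set(tokens))  # count a word once per sentence
    if total == 0:
        return {}
    return {word: n / total for word, n in counts.items()}


# Toy corpora (made up) standing in for the human-only and AI-only datasets.
human = [["we", "study", "graphs"], ["we", "prove", "a", "bound"]]
ai = [["we", "delve", "into", "graphs"], ["we", "delve", "deeper"]]

human_freq = occurrence_frequency(human)
ai_freq = occurrence_frequency(ai)
for word in sorted(set(human_freq) | set(ai_freq)):
    print(f"{word:<8}{human_freq.get(word, 0.0):6.2f}{ai_freq.get(word, 0.0):6.2f}")
```

Words whose frequency differs sharply between the two corpora carry most of the signal when estimating the AI fraction of a mixed corpus.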
The function estimate_text_distribution in src.estimation takes two input file paths, one pointing to the human-written text and one to the AI-generated text, followed by the path where the estimated distribution is saved. Both input files must be in .parquet format. The human-written file needs a column named human_sentence containing one tokenized sentence (a list of words) per row. Similarly, the AI-generated file needs a column named ai_sentence containing one tokenized sentence (a list of words) per row.
Example of human-generated data:
human_sentence |
---|
["This", "is", "an", "example"] |
["Another", "sentence", "for", "you"] |
Example of AI-generated data:
ai_sentence |
---|
["This", "is", "an", "example"] |
["Another", "sentence", "for", "you"] |
For inference on a target dataset, the inference method of the MLE class also takes a file path as input. The input parquet file needs a column named inference_sentence containing one tokenized sentence (a list of words) per row.
Example of inference data:
inference_sentence |
---|
["This", "is", "an", "example"] |
["Another", "sentence", "for", "you"] |
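The three input formats above differ only in the column name, so one helper can prepare all of them. The sketch below is an assumption about how you might build these files with pandas (plus pyarrow or fastparquet as the parquet engine); the helper names are hypothetical and not part of the repository.

```python
from typing import Dict, List, Sequence

VALID_COLUMNS = ("human_sentence", "ai_sentence", "inference_sentence")


def sentences_to_records(sentences: Sequence[Sequence[str]],
                         column: str) -> List[Dict[str, List[str]]]:
    """Wrap each non-empty tokenized sentence in a one-column record."""
    if column not in VALID_COLUMNS:
        raise ValueError(f"column must be one of {VALID_COLUMNS}, got {column!r}")
    return [{column: list(tokens)} for tokens in sentences if len(tokens) > 0]


def write_sentence_parquet(sentences: Sequence[Sequence[str]],
                           column: str,
                           path: str) -> None:
    """Write one tokenized sentence per row to a parquet file."""
    # Imported here so sentences_to_records stays usable without pandas installed.
    import pandas as pd

    frame = pd.DataFrame(sentences_to_records(sentences, column))
    frame.to_parquet(path, index=False)


records = sentences_to_records(
    [["This", "is", "an", "example"], ["Another", "sentence", "for", "you"]],
    "human_sentence",
)
print(records)
```

For example, write_sentence_parquet(my_sentences, "inference_sentence", "my_target.parquet") would produce a file in the shape expected for inference, where my_sentences and the file name are placeholders for your own data.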
Note that we provide our tokenize function in tokenize_demo.ipynb for reference.
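If you only need data in the right shape for a quick experiment, a crude regex-based stand-in can split text into tokenized sentences. This is not the tokenize function from tokenize_demo.ipynb, and its output will differ from a proper tokenizer, so treat it as a rough sketch. The function name and the splitting rules are assumptions.

```python
import re
from typing import List

# Split after sentence-final punctuation followed by whitespace.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")
# Words are runs of letters or digits, optionally joined by ' or -.
_WORD = re.compile(r"[A-Za-z0-9]+(?:['-][A-Za-z0-9]+)*")


def tokenize_abstract(text: str) -> List[List[str]]:
    """Return one list of word tokens per sentence, dropping empty sentences."""
    sentences = _SENTENCE_END.split(text.strip())
    tokenized = [_WORD.findall(sentence) for sentence in sentences]
    return [tokens for tokens in tokenized if tokens]


print(tokenize_abstract("This is an example. Another sentence for you!"))
```

Whatever tokenizer you choose, apply the same one to the human, AI, and inference datasets so that the word frequencies are comparable.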
Below is a high-level overview of the repository/project file-tree:
data/
- Data source consisting of arXiv abstract data across five main fields (Physics, Mathematics, Computer Science, Statistics, and Electrical Engineering and Systems Science). The training_data folder contains corpora known to be entirely AI-generated or human-written, which are used for distribution estimation. The validation_data folder contains corpora with mixed AI-generated and human-written data, whose ground truth portion is known. This is used to validate the effectiveness of our framework. Details on the data can be found in our second paper.
distribution/
- Folder to save the distribution parquet generated by the estimate_text_distribution function for demo purposes.
src/
- Package source providing core utilities for distribution estimation, framework loading, data inference, etc.
LICENSE
- All code is made available under the MIT License; happy hacking!
demo.ipynb
- Demonstration of our framework on arXiv abstracts across five main fields. This includes estimating the distributions of human- and AI-generated content, followed by a validation process on manually mixed data with a known ground truth portion of AI-written text.
tokenize_demo.ipynb
- Demonstration of how we tokenize arXiv abstracts, including the tokenize function. Here we use the spaCy (https://spacy.io/) library, but other tools like nltk are also feasible. You may need to modify the function in your own cases.
environment.yml
- Full project configuration details (including dependencies), as well as tool configurations.
README.md
- You are here!
If you find our code or framework useful in your work, please cite our first paper and second paper:
@article{liang2024monitoring,
title={Monitoring AI-Modified Content at Scale: A Case Study on the Impact of ChatGPT on AI Conference Peer Reviews},
author={Liang, Weixin and Izzo, Zachary and Zhang, Yaohui and Lepp, Haley and Cao, Hancheng and Zhao, Xuandong and Chen, Lingjiao and Ye, Haotian and Liu, Sheng and Huang, Zhi and others},
journal={arXiv preprint arXiv:2403.07183},
year={2024}
}
@article{liang2024mapping,
title={Mapping the Increasing Use of LLMs in Scientific Papers},
author={Liang, Weixin and Zhang, Yaohui and Wu, Zhengxuan and Lepp, Haley and Ji, Wenlong and Zhao, Xuandong and Cao, Hancheng and Liu, Sheng and He, Siyu and Huang, Zhi and others},
journal={arXiv preprint arXiv:2404.01268},
year={2024}
}