This repository contains EMTeC, the Eye Movements on Machine-Generated Texts Corpus. EMTeC is a naturalistic eye-movements-while-reading corpus of 107 native English speakers reading machine-generated texts. The texts are generated by three large language models using five decoding strategies, and they fall into six different text type categories. EMTeC contains the eye movement data at all stages of pre-processing and further provides an original and a corrected version of the fixation sequences. Moreover, the corpus includes the language models' internals (attention scores, transition scores, hidden states). The stimuli are annotated for a variety of linguistic features at both text and word level.
This repository contains all code that has been used to generate, post-process, and annotate the stimuli; the code used for the eye-tracking data pre-processing pipeline; and the code for the psycholinguistic analysis.
- The preprint EMTeC: Eye Movements on Machine-Generated Texts Corpus is available on arXiv
- The eye-tracking data is available via OSF or can be automatically downloaded using a Python script (see below).
- The tensors are available on Harvard Dataverse or can be automatically downloaded using a Python script (see below).
Use Python version 3.10 or higher
git clone https://github.com/DiLi-Lab/EMTeC.git
cd EMTeC
The code is based on the PyTorch and Hugging Face libraries.
pip install -r requirements.txt
The eye-tracking data is stored in an OSF repository. To download and extract it directly, please run
python get_et_data.py --extract
which will automatically extract all zipped files. It will create a directory EMTeC/data that has the folder structure needed for reproducibility purposes.
The transition scores, attention scores, hidden states, and beam indices are stored in a Harvard Dataverse dataset. They can be downloaded from there or by calling
python get_tensors.py
The tensors will be saved to the directory tensor_data/tensors/. For a different directory, specify it as an argument:
python get_tensors.py --output-dir [DIRNAME]
If one does not want to download all tensors, one can subset for the model, the tensor type, and the decoding strategy with the following command line arguments:
- --model: phi2, mistral, wizardlm
- --dec: beam_search, greedy_search, sampling, topk, topp
- --tensor: attentions, beam_indices, hidden_states, sequences, scores (the transition scores)
Example:
python get_tensors.py --model mistral --dec sampling topk --tensor hidden_states scores
Attention: The tensors amount to about 340 GB in size.
- Note: The following tensors are not provided via Dataverse because they exceed the maximum file size:
  - Mistral: beam search item 34, item 43; greedy search item 43
  - WizardLM: beam search item 34, item 35, item 40, item 43; greedy search item 43; sampling item 43; top-k item 43; top-p item 43
- Note: Phi-2 does not return attention scores or hidden states.
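The subsetting options above can also be assembled programmatically. The following is a minimal sketch, not part of the repository: the helper name build_get_tensors_cmd is hypothetical, and it assumes that --model accepts several values in the same way the example shows for --dec and --tensor. It only uses the flags and values listed above.

```python
# Allowed values, copied from the argument list above.
MODELS = ("phi2", "mistral", "wizardlm")
DECODING = ("beam_search", "greedy_search", "sampling", "topk", "topp")
TENSORS = ("attentions", "beam_indices", "hidden_states", "sequences", "scores")


def build_get_tensors_cmd(models=(), dec=(), tensors=(), output_dir=None):
    """Build the argument list for a get_tensors.py call (hypothetical helper).

    Empty selections are omitted, which corresponds to calling the script
    without that flag.
    """
    cmd = ["python", "get_tensors.py"]
    selections = (
        ("--model", models, MODELS),
        ("--dec", dec, DECODING),
        ("--tensor", tensors, TENSORS),
    )
    for flag, values, allowed in selections:
        # Reject values that the script does not list as valid choices.
        unknown = [v for v in values if v not in allowed]
        if unknown:
            raise ValueError(f"unsupported value(s) for {flag}: {unknown}")
        if values:
            cmd += [flag, *values]
    if output_dir is not None:
        cmd += ["--output-dir", str(output_dir)]
    return cmd
```

The returned list can be passed to subprocess.run(cmd, check=True) from the repository root. Validating the values up front avoids starting a large download with a misspelled option.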
During the generation, the transition scores, beam indices, generated sequences, hidden states, and attention scores will be saved to disk. Please adapt the path names in generation/generation_constants.py.
In order to prompt Phi-2, Mistral and WizardLM, you can run the bash script
bash run_generation.sh
Note: In order to prompt the models, you need GPUs set up with CUDA.
Beware that the GPUs are hard-coded in the bash script; please adapt them to the GPUs available on your system. Moreover, running WizardLM with beam search as the decoding strategy might result in CUDA out-of-memory (OOM) errors for items that have a very long context. In such cases, you can run the prompting process on a subset, as indicated within the bash script.
The text generation scripts not only prompt the models but they also post-process the output. Because the experimental stimuli presented in the eye-tracking experiment should follow certain criteria and formats, the model outputs are also truncated (in order to remove unfinished sentences and trailing whitespace characters) and cleaned.
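The truncation step can be illustrated with a small sketch. It is an assumption-laden simplification rather than the repository's post-processing code: the function name truncate_to_last_sentence is hypothetical, and sentence ends are detected only by ., ! or ? (optionally followed by closing quotes or brackets), so abbreviations such as "e.g." would be treated as sentence ends.

```python
import re

# Sentence-final punctuation, optionally followed by closing quotes/brackets.
_SENTENCE_END = re.compile(r"""[.!?]+["')\]]*""")


def truncate_to_last_sentence(text):
    """Drop an unfinished trailing sentence and trailing whitespace.

    Returns an empty string if the text contains no complete sentence.
    """
    matches = list(_SENTENCE_END.finditer(text))
    if not matches:
        return ""
    # Keep everything up to and including the last sentence terminator.
    return text[: matches[-1].end()]
```

For instance, truncate_to_last_sentence("The cat sat. The dog ran and  ") keeps only "The cat sat.".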
Only a subset of the generated texts is presented in the eye-tracking experiment. In this step, the generated outputs are subset to the chosen items, which are found in stimuli_selection/util_files/selected_items_0-105.npy. Then the selected outputs of all models from all decoding strategies are merged. Each one is attributed to its list, the assignment of which can be found in stimuli_selection/util_files/list_attribution.csv, and the comprehension questions are added for each generated text. The questions can be found separately in stimuli_selection/util_files/stimuli_and_questions_balanced.csv.
bash run_stimuli_selection.sh
In order to reproduce the pre-processing of the eye-tracking data, place the subject directories into the data directory. The result should look like the following:
├── data
├── subject_level_data
│ ├── ET_01
│ │ ├── aoi
│ │ ├── ET_01.edf
│ │ ├── RESULTS_QUESTIONNAIRE.txt
│ │ ├── RESULTS_QUESTIONS.txt
│ ├── ET_02
│ │ ├── aoi
│ │ ├── ET_02.edf
│ │ ├── RESULTS_QUESTIONNAIRE.txt
│ │ ├── RESULTS_QUESTIONS.txt
│ └── ...
├── stimuli_columns_descriptions.csv
├── stimuli.csv
This step of the pre-processing pipeline can be skipped. Go directly to the conversion of .asc to .csv.
The .edf files returned by the experimental software are not published. We nevertheless publish the code that converts them to .asc files, which produces one .asc file in each subject directory. For this conversion, we employed the edf2asc tool by SR Research. Beware that this tool only works on Linux- or Windows-based systems.
bash run_edf2asc.sh
├── data
├── subject_level_data
│ ├── ET_01
│ │ ├── aoi
│ │ ├── ET_01.asc
│ │ ├── ET_01.edf
│ │ ├── RESULTS_QUESTIONNAIRE.txt
│ │ ├── RESULTS_QUESTIONS.txt
│ └── ...
├── stimuli_columns_descriptions.csv
├── stimuli.csv
The .asc files are parsed and only the relevant information is extracted and written to .csv files. While the .edf and the .asc files are not published, we nevertheless publish the code. To parse the .asc files, simply run
bash run_asc2csv.sh
The resulting data directory structure will then look like this:
├── data
├── subject_level_data
│ ├── ET_01
│ │ ├── aoi
│ │ ├── ET_01.asc
│ │ ├── ET_01.csv
│ │ ├── ET_01.edf
│ │ ├── RESULTS_QUESTIONNAIRE.txt
│ │ ├── RESULTS_QUESTIONS.txt
│ └── ...
├── stimuli_columns_descriptions.csv
├── stimuli.csv
To group coordinate samples from the raw .csv files together into fixations and map them to the correct areas of interest (words), please run
bash run_csv2events.sh
The following arguments can be passed:
- --disable-parallel: disable parallel processing
- --plot-px-time: if given, the raw x- and y-coordinates are plotted over time and the fixations extracted with the algorithm are marked
- --plot-ampl-vel: if given, the peak saccade velocities are plotted over saccade amplitudes
- --threshold: the threshold to use in the microsaccade detection algorithm. The default is trial_based, i.e., the threshold is estimated for each experimental stimulus individually
- --threshold-factor: the factor with which the threshold is multiplied to obtain the radius
- --threshold-method: the method to compute the threshold
- --min-fixation-duration-ms: the minimum fixation duration in ms
- --min-saccade-duration-ms: the minimum saccade duration in ms
- --max-saccade-velocity: the maximum saccade velocity in deg/s
- --theta: the velocity threshold in deg/s
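To give an intuition for how a velocity threshold and a minimum fixation duration interact, here is a deliberately simplified velocity-threshold detector. It is only a sketch under stated assumptions and not the algorithm implemented in the repository, which estimates its threshold from the data: the function name detect_fixations and all of its parameters are hypothetical, gaze positions are assumed to be given in degrees of visual angle, and the threshold is a fixed value.

```python
import math


def detect_fixations(x_deg, y_deg, sampling_rate_hz,
                     velocity_threshold, min_fixation_duration_ms):
    """Return (start_index, end_index) pairs of fixations, end inclusive.

    A sample counts as part of a saccade if the sample-to-sample velocity
    (deg/s) exceeds velocity_threshold; the remaining runs of samples are
    fixation candidates and are kept if they last long enough.
    """
    n = len(x_deg)
    is_saccade = [False] * n
    for i in range(1, n):
        # Distance travelled since the previous sample, converted to deg/s.
        dist = math.hypot(x_deg[i] - x_deg[i - 1], y_deg[i] - y_deg[i - 1])
        is_saccade[i] = dist * sampling_rate_hz > velocity_threshold

    fixations = []
    start = None
    for i in range(n + 1):  # one extra step to close a run at the end
        slow = i < n and not is_saccade[i]
        if slow and start is None:
            start = i
        elif not slow and start is not None:
            duration_ms = (i - start) * 1000.0 / sampling_rate_hz
            if duration_ms >= min_fixation_duration_ms:
                fixations.append((start, i - 1))
            start = None
    return fixations
```

Raising the minimum duration removes short pauses between two saccades, while lowering the velocity threshold splits fixations at smaller eye movements.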
Since we publish the raw eye-tracking data, everything from this step onwards is reproducible. For each subject, the resulting folder structure will then look like this (if both the coordinates over time and the saccade velocity over amplitude are plotted):
├── data
├── subject_level_data
│ ├── ET_01
│ │ ├── aoi
│ │ ├── fixations
│ │ │ ├── event_files
│ │ │ │ ├── ET_01-item01-fixations.csv
│ │ │ │ ├── ET_01-item02-fixations.csv
│ │ │ │ └── ...
│ │ │ ├── plots
│ │ │ │ ├── ampl_vel
│ │ │ │ │ ├── ET_01-item01-ampl_vel.png
│ │ │ │ │ ├── ET_01-item01-ampl_vel_reg.png
│ │ │ │ │ ├── ET_01-item02-ampl_vel.png
│ │ │ │ │ ├── ET_01-item02-ampl_vel_reg.png
│ │ │ │ │ └── ...
│ │ │ │ ├── px_time
│ │ │ │ │ ├── ET_01-item01-px_time.png
│ │ │ │ │ ├── ET_01-item02-px_time.png
│ │ │ │ │ └── ...
│ │ │ ├── command_log.txt
│ │ ├── ET_01.asc
│ │ ├── ET_01.csv
│ │ ├── ET_01.edf
│ │ ├── RESULTS_QUESTIONNAIRE.txt
│ │ ├── RESULTS_QUESTIONS.txt
│ └── ...
├── stimuli_columns_descriptions.csv
├── stimuli.csv
The event_files directory contains the extracted fixations, one file per screen/experimental stimulus. The directory ampl_vel contains the plots of peak saccade velocity over saccade amplitude, and px_time contains the plots of the coordinates over time. command_log.txt contains the program call given in the bash script run_csv2events.sh.
The manual fixation correction, as opposed to all other preprocessing steps, requires different dependencies. In order to manually correct the extracted fixations, primarily fixing vertical drifts during eye-tracking, please first create a new virtual environment and install the necessary requirements:
pip install -r preprocessing/fixation_correction/fixcorr_requirements.txt
Then run the fixation correction script:
bash run_fixcorr.sh
The argument --run-on-subj indicates on which subject to run the fixation correction. If this argument is not given, the script will iterate through all files of all subjects. The resulting directory structure looks like this:
├── data
├── subject_level_data
│ ├── ET_01
│ │ ├── aoi
│ │ ├── fixations
│ │ │ ├── event_files
│ │ │ │ ├── ET_01-item01-fixations.csv
│ │ │ │ ├── ET_01-item02-fixations.csv
│ │ │ │ └── ...
│ │ │ ├── plots
│ │ │ │ ├── ampl_vel
│ │ │ │ │ ├── ET_01-item01-ampl_vel.png
│ │ │ │ │ ├── ET_01-item01-ampl_vel_reg.png
│ │ │ │ │ ├── ET_01-item02-ampl_vel.png
│ │ │ │ │ ├── ET_01-item02-ampl_vel_reg.png
│ │ │ │ │ └── ...
│ │ │ │ ├── px_time
│ │ │ │ │ ├── ET_01-item01-px_time.png
│ │ │ │ │ ├── ET_01-item02-px_time.png
│ │ │ │ │ └── ...
│ │ │ ├── command_log.txt
│ │ ├── fixations_corrected
│ │ │ ├── event_files
│ │ │ │ ├── ET_01-item01-fixations_corrected.csv
│ │ │ │ ├── ET_01-item02-fixations_corrected.csv
│ │ │ │ └── ...
│ │ ├── ET_01.asc
│ │ ├── ET_01.csv
│ │ ├── ET_01.edf
│ │ ├── RESULTS_QUESTIONNAIRE.txt
│ │ ├── RESULTS_QUESTIONS.txt
│ └── ...
├── stimuli_columns_descriptions.csv
├── stimuli.csv
To compute word-based reading measures from the fixation sequences, run
bash run_events2rms.sh
The reading measures are saved in one file per screen/experimental stimulus, and the resulting folder structure is indicated below. The keyword --corrected in the bash script indicates that the reading measures should be computed only on the corrected fixations. If it is omitted, reading measures are computed on the uncorrected fixations.
├── data
├── subject_level_data
│ ├── ET_01
│ │ ├── aoi
│ │ ├── fixations
│ │ │ ├── event_files
│ │ │ │ ├── ET_01-item01-fixations.csv
│ │ │ │ ├── ET_01-item02-fixations.csv
│ │ │ │ └── ...
│ │ │ ├── plots
│ │ │ │ ├── ampl_vel
│ │ │ │ │ ├── ET_01-item01-ampl_vel.png
│ │ │ │ │ ├── ET_01-item01-ampl_vel_reg.png
│ │ │ │ │ ├── ET_01-item02-ampl_vel.png
│ │ │ │ │ ├── ET_01-item02-ampl_vel_reg.png
│ │ │ │ │ └── ...
│ │ │ │ ├── px_time
│ │ │ │ │ ├── ET_01-item01-px_time.png
│ │ │ │ │ ├── ET_01-item02-px_time.png
│ │ │ │ │ └── ...
│ │ │ ├── command_log.txt
│ │ ├── fixations_corrected
│ │ │ ├── event_files
│ │ │ │ ├── ET_01-item01-fixations_corrected.csv
│ │ │ │ ├── ET_01-item02-fixations_corrected.csv
│ │ │ │ └── ...
│ │ ├── reading_measures_corrected
│ │ │ ├── ET_01-item01-reading_measures_corrected.csv
│ │ │ ├── ET_01-item02-reading_measures_corrected.csv
│ │ │ └── ...
│ │ ├── ET_01.asc
│ │ ├── ET_01.csv
│ │ ├── ET_01.edf
│ │ ├── RESULTS_QUESTIONNAIRE.txt
│ │ ├── RESULTS_QUESTIONS.txt
│ └── ...
├── stimuli_columns_descriptions.csv
├── stimuli.csv
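As an illustration of what word-based reading measures are, the sketch below derives three common ones from a fixation sequence. It follows the usual textbook definitions and is an assumption, not a copy of the repository's implementation: the function name reading_measures and the keys FFD (first-fixation duration), FPRT (first-pass reading time), and TFT (total fixation time) are chosen for this example, and the full set of measures computed by the repository may be larger and defined differently.

```python
def reading_measures(fixations, n_words):
    """Compute FFD, FPRT and TFT per word.

    fixations: chronological list of (word_index, duration_ms) pairs.
    A word has a first pass only if it is fixated before any word to its
    right; skipped words keep FFD = FPRT = 0.
    """
    measures = [{"FFD": 0, "FPRT": 0, "TFT": 0} for _ in range(n_words)]
    furthest = -1          # right-most word fixated so far
    in_first_pass = None   # word whose first pass is currently running
    for word, duration in fixations:
        m = measures[word]
        m["TFT"] += duration
        if word > furthest:
            # The word is entered for the first time from the left.
            m["FFD"] = duration
            m["FPRT"] = duration
            in_first_pass = word
            furthest = word
        elif word == in_first_pass:
            # Consecutive refixation: the first pass continues.
            m["FPRT"] += duration
        else:
            # Regression or re-reading: no first pass is running any more.
            in_first_pass = None
    return measures
```

In the sequence [(0, 200), (0, 100), (2, 150), (1, 120)], word 1 is skipped during first-pass reading and only receives total fixation time from the later regression.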
Please insert your Hugging Face access token in line 109 of annotation/annotations.py. Then, to annotate the stimuli texts with readability scores, please run
bash run_annotation_text.sh
The package readability_local is a local copy of py-readability-metrics (see their Documentation and their GitHub), modified to adjust the minimum number of words. This annotation will directly add the readability metrics to the stimuli.csv file.
To annotate the stimuli texts on word level with frequency scores, PoS tags, dependency tags, surprisal values, etc., please first insert your Hugging Face token in the file annotation/annotations.py, then run the bash script
bash run_annotation_word.sh gpt2 gpt2-large opt-350m opt-1.3b mistral-base mistral-instruct phi2 llama2-7b llama2-13b pythia-6.9b pythia-12b
This will first create a folder unique_aois in the data directory, which contains the areas of interest on word level for each stimulus text (i.e., each condition: item id, model, decoding strategy). First, the PoS tags, dependency tags, word length information, and frequency values are added to a temporary file annotation/temp_annotated_data.csv. Then the surprisal values are extracted from the language models specified in the bash call, and the final annotations are saved in annotations/word_level_annotations.csv.
Surprisal for the texts is extracted in two different ways: once with only the text itself given as input, and once with both the prompt and the text given as input. The latter allows for analyses with the transition scores, which are also conditioned on the prompt. The surprisal column names are of the form surprisal_{model-name}, or surprisal_p_{model-name} if the model also received the prompt when the surprisal was extracted.
Note: In order to prompt the models, you need GPUs set up with CUDA.
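For readers unfamiliar with word-level surprisal: language models assign probabilities to sub-word tokens, and a common convention is to sum the surprisal of all tokens belonging to a word. The sketch below shows this aggregation. It is an assumption about the general technique and not the repository's code: the function name word_surprisal, the token-to-word mapping, and the choice of bits as the unit are all hypothetical.

```python
import math


def word_surprisal(token_logprobs, token_to_word, n_words):
    """Sum token surprisals (in bits) for each word.

    token_logprobs: natural-log probability of each token given its context.
    token_to_word:  index of the word each token belongs to.
    """
    surprisal = [0.0] * n_words
    for logprob, word_idx in zip(token_logprobs, token_to_word):
        # Surprisal is the negative log probability; divide by ln(2) for bits.
        surprisal[word_idx] += -logprob / math.log(2)
    return surprisal
```

A word split into two tokens with probabilities 0.25 and 0.125 thus receives 2 + 3 = 5 bits. Whether the prompt is part of the context only changes the token probabilities, not the aggregation.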
In order to merge
- the fixation sequences
- the corrected fixation sequences
- the reading measures
- the corrected reading measures
- the participant information (questionnaire and comprehension question information), including extracting the information on which eye was tracked for each participant
- the word-level lexical annotation with the reading measures
- the word-level lexical annotation with the corrected reading measures
please run
bash run_merge.sh
The word-level annotations are merged directly into the files data/reading_measures.csv and data/reading_measures_corrected.csv.
The resulting folder structure will then look like this:
├── data
├── participant_info
│ ├── participant_info.csv
│ ├── participant_results.csv
├── subject_level_data
│ ├── ET_01
│ │ ├── aoi
│ │ ├── fixations
│ │ │ ├── event_files
│ │ │ │ ├── ET_01-item01-fixations.csv
│ │ │ │ ├── ET_01-item02-fixations.csv
│ │ │ │ └── ...
│ │ │ ├── plots
│ │ │ │ ├── ampl_vel
│ │ │ │ │ ├── ET_01-item01-ampl_vel.png
│ │ │ │ │ ├── ET_01-item01-ampl_vel_reg.png
│ │ │ │ │ ├── ET_01-item02-ampl_vel.png
│ │ │ │ │ ├── ET_01-item02-ampl_vel_reg.png
│ │ │ │ │ └── ...
│ │ │ │ ├── px_time
│ │ │ │ │ ├── ET_01-item01-px_time.png
│ │ │ │ │ ├── ET_01-item02-px_time.png
│ │ │ │ │ └── ...
│ │ │ ├── command_log.txt
│ │ ├── fixations_corrected
│ │ │ ├── event_files
│ │ │ │ ├── ET_01-item01-fixations_corrected.csv
│ │ │ │ ├── ET_01-item02-fixations_corrected.csv
│ │ │ │ └── ...
│ │ ├── reading_measures_corrected
│ │ │ ├── ET_01-item01-reading_measures_corrected.csv
│ │ │ ├── ET_01-item02-reading_measures_corrected.csv
│ │ │ └── ...
│ │ ├── ET_01.asc
│ │ ├── ET_01.csv
│ │ ├── ET_01.edf
│ │ ├── RESULTS_QUESTIONNAIRE.txt
│ │ ├── RESULTS_QUESTIONS.txt
│ └── ...
├── unique_aois
├── fixations.csv
├── fixations_corrected.csv
├── reading_measures.csv
├── reading_measures_corrected.csv
├── stimuli_columns_descriptions.csv
├── stimuli.csv
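Conceptually, merging the word-level annotations into the reading measures is a left join on the columns that identify a word within a stimulus. The following sketch shows the idea with plain dictionaries. It is not the repository's merge code: the function name merge_annotations and the key columns used in the usage note (item_id, model, decoding_strategy, word_id) are hypothetical names.

```python
def merge_annotations(reading_measures, annotations, key_fields):
    """Left-join annotation columns onto reading-measure rows.

    Both inputs are lists of dicts (e.g. rows read with csv.DictReader).
    Rows without a matching annotation are kept unchanged.
    """
    # Index the annotations by their identifying key for constant-time lookup.
    index = {tuple(row[k] for k in key_fields): row for row in annotations}
    merged = []
    for row in reading_measures:
        key = tuple(row[k] for k in key_fields)
        extra = index.get(key, {})
        # Reading-measure values win if a column exists in both tables.
        merged.append({**extra, **row})
    return merged
```

A call could look like merge_annotations(rm_rows, annotation_rows, ("item_id", "model", "decoding_strategy", "word_id")). Because each participant reads the same stimulus texts, one annotation row is typically reused for many reading-measure rows.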
To run the psycholinguistic analyses, fit the regression models for the different response variables (first-pass reading time, first-pass regression, etc.) using the following commands:
mkdir -p model_fits logs
bash run_regression_models.sh
This will save the brms fits in the folder model_fits.
To extract and plot the posterior distributions of the parameters of interest (word length, surprisal, etc.), please run
Rscript --vanilla analyses/extract_and_plot.R