Simple command-line tools for low-barrier reproducible experiments with GPT2, XGLM (564M), and Bloom (560M). Primarily intended for use in education. Designed to be easy to use for non-technical users.
There are currently two tools:
llm_generate.py
: Given a text, generate the next N tokens. Annotate all words with their surprisal.
llm_topn.py
: Given a text, list the N most highly ranked next tokens with their surprisal.
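Surprisal is the negative base-2 logarithm of the probability that the model assigns to a token given the preceding text, which is why the outputs are labelled in bits. A minimal sketch of that relationship (the function name `surprisal_bits` is illustrative and not part of the tools):

```python
import math

def surprisal_bits(probability: float) -> float:
    """Return the surprisal, in bits, of an event with the given probability."""
    if not 0.0 < probability <= 1.0:
        raise ValueError("probability must be in (0, 1]")
    return -math.log2(probability)

print(surprisal_bits(0.5))    # 1.0
print(surprisal_bits(0.125))  # 3.0
```

A token that the model considers a coin flip costs 1 bit; each further halving of the probability adds one more bit.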
Key features:
- Assumes basic familiarity with the use of a command line, but no programming is needed.
- Runs open-source pre-trained LLMs locally without the need for an API key or internet connection.
- GPT2 (1.5B) for English
- Bloom (560M) for 46 languages
- XGLM (564M) for 30 languages
- Reasonably fast on CPU.
- Reproducible results via random seeds.
- Batch mode that processes multiple items in one go.
- Output format is ASCII art bar charts for quick experiments or .csv for easy processing using R, Python, spreadsheet editors, whatnot.
Developed and tested on Ubuntu Linux, but it may also work out of the box on macOS and Windows.
It is assumed that the system has a recent version of Python 3 and the pip package manager for Python. On Ubuntu Linux, these can be installed via sudo apt install python3-pip.
Then install PyTorch and Hugging Face’s transformers package:
pip3 install torch
pip3 install transformers
Generate four additional tokens using GPT2 (the default model) and calculate the surprisal of each token:
python3 llm_generate.py "The key to the cabinets" -n 4
Item Idx Token: Surprisal (bits)
1 1 The: nan
1 2 key: ██████████ 10.4
1 3 to: ██ 2.0
1 4 the: ████ 3.8
1 5 cabinets: █████████████████████ 21.0
1 6 is: ██ 1.5
1 7 that: ███ 3.3
1 8 the: ███ 2.5
1 9 doors: ████████ 7.6
Generation with XGLM 564M
python3 llm_generate.py "Der Polizist sagte, dass man nicht mehr ermitteln kann," -n 5 -m xglm-564M
Item Idx Token: Surprisal (bits)
1 1 </s>: nan
1 2 </s>: █████ 4.8
1 3 Der: ████████████ 11.6
1 4 Polizi: █████████████ 13.0
1 5 st: 0.2
1 6 sagte: ███████████ 10.7
1 7 ,: ██ 1.7
1 8 dass: ██ 2.0
1 9 man: █████ 5.5
1 10 nicht: █████ 4.5
1 11 mehr: ████ 4.2
1 12 er: ████████ 7.8
1 13 mitteln: ████ 4.1
1 14 kann: ███ 3.1
1 15 ,: █ 1.2
1 16 da: ████ 4.3
1 17 nicht: ███████ 7.1
1 18 alle: ██ 2.4
1 19 Daten: ██████ 5.7
1 20 gespeichert: ███ 3.3
Likewise with Bloom 560M:
python3 llm_generate.py "Der Polizist sagte, dass man nicht mehr ermitteln kann," -n 5 -m bloom-560m
Two sampling parameters are currently supported: 1. Temperature (default 1) and 2. Top-k (default 50). To use different sampling parameters:
python3 llm_generate.py "This is a" -t 1000 -k 1
Item Idx Token: Surprisal (bits)
1 1 This: nan
1 2 is: ████ 4.4
1 3 a: ███ 2.7
1 4 very: ████ 4.2
1 5 good: ████ 3.8
1 6 example: ████ 4.2
1 7 of: 0.4
1 8 how: ██ 2.2
1 9 to: ███ 2.7
1 10 use: ███ 2.9
1 11 the: ███ 3.2
1 12 ": ██████ 6.4
1 13 I: ████████ 7.8
The repetition penalty is fixed at 1.0 assuming that larger values are not desirable when studying the behaviour of the model. Nucleus sampling is currently not supported but could be added if needed.
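How the two sampling parameters interact can be sketched in plain Python. This illustrates temperature scaling combined with top-k filtering in general and is not the tools' actual implementation; the function name `sample_top_k` and the toy logits are made up for the example.

```python
import math
import random

def sample_top_k(logits, temperature=1.0, k=50, rng=random):
    """Sample an index from `logits` using temperature scaling and top-k filtering."""
    # Keep only the indices of the k highest-scoring candidates.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Temperature divides the logits: large values flatten the distribution.
    scaled = [logits[i] / temperature for i in top]
    # Softmax weights over the surviving candidates (max subtracted for stability).
    highest = max(scaled)
    weights = [math.exp(s - highest) for s in scaled]
    return rng.choices(top, weights=weights, k=1)[0]

toy_logits = [2.0, 1.0, 0.5, -1.0]  # hypothetical scores for four candidate tokens
# With k=1 only the best candidate survives, so the temperature has no effect.
print(sample_top_k(toy_logits, temperature=1000, k=1))  # 0
```

With -k 1 only the single most probable token remains, so sampling becomes deterministic whatever the temperature, which is consistent with the fluent output of the -t 1000 -k 1 example above.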
CSV format in shell output can be obtained with the -c option:
python3 llm_generate.py "The key to the cabinets" -n 4 -c
item,idx,token,surprisal
1,1,The,nan
1,2,key,10.35491943359375
1,3,to,2.019094467163086
1,4,the,3.7583045959472656
1,5,cabinets,21.04239845275879
1,6,is,1.5308449268341064
1,7,that,3.2748565673828125
1,8,the,2.5106589794158936
1,9,doors,7.590230464935303
To store results in a .csv file which can be easily loaded in R, Excel, Google Sheets, and similar:
python3 llm_generate.py "The key to the cabinets" -n 4 -o output.csv
When storing results to a file, there’s no need to specify -c. CSV will be used by default.
To obtain reproducible (i.e. non-random) results, the -s option can be used to set a random seed:
python3 llm_generate.py "The key to the cabinets" -n 4 -s 1
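The idea behind seeding can be sketched with Python's standard random module: a generator initialised with the same seed produces the same sequence of draws. How the tools apply the seed internally is not shown here, and the toy vocabulary and weights are invented for the illustration.

```python
import random

def sample_tokens(seed, n=5):
    """Draw n tokens from a toy distribution using a generator seeded with `seed`."""
    rng = random.Random(seed)
    vocabulary = ["is", "are", "was", "and"]  # invented example tokens
    weights = [0.5, 0.2, 0.2, 0.1]            # invented probabilities
    return rng.choices(vocabulary, weights=weights, k=n)

# The same seed reproduces the same draws.
print(sample_tokens(1) == sample_tokens(1))  # True
```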
To process multiple items in batch mode, create a .csv file following this example:
item,text,n
1,John saw the man who the card catalog had confused a great deal.,0
2,No head injury is too trivial to be ignored.,0
3,The key to the cabinets were on the table.,0
4,How many animals of each kind did Moses take on the ark?,0
5,The horse raced past the barn fell.,0
6,The first thing the new president will do is,10
Columns:
- Item number
- Text
- Number of additional tokens that should be generated
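For larger item sets, the input file can also be generated programmatically. A minimal sketch using Python's standard csv module, which takes care of quoting texts that contain commas. The helper name `write_items` is made up, and the example writes to an in-memory buffer to stay self-contained; for a real file, pass a handle opened with `open("input_generate.csv", "w", newline="")` instead.

```python
import csv
import io

def write_items(handle, items):
    """Write (text, n) pairs to `handle` as CSV with the columns item,text,n."""
    writer = csv.writer(handle, lineterminator="\n")
    writer.writerow(["item", "text", "n"])
    # Item numbers are assigned consecutively, starting at 1.
    for number, (text, n) in enumerate(items, start=1):
        writer.writerow([number, text, n])

buffer = io.StringIO()
write_items(buffer, [
    ("The horse raced past the barn fell.", 0),
    ("After moving into the Oval Office, one of the first things that", 10),
])
print(buffer.getvalue())
```

The second text contains a comma, so the csv module wraps it in double quotes when writing it out.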
Then run:
python3 llm_generate.py -i input_generate.csv -o output_generate.csv
Result:
item | wn | w | surprisal |
---|---|---|---|
1 | 1 | John | nan |
1 | 2 | saw | 12.686095237731934 |
1 | 3 | the | 2.5510218143463135 |
1 | 4 | man | 6.69647216796875 |
1 | 5 | who | 4.4374775886535645 |
1 | 6 | the | 9.218789100646973 |
1 | 7 | card | 12.91416072845459 |
1 | 8 | catalog | 13.132523536682129 |
1 | 9 | had | 5.045916557312012 |
1 | 10 | confused | 12.417732238769531 |
1 | 11 | a | 8.445308685302734 |
1 | 12 | great | 8.923978805541992 |
1 | 13 | deal | 0.5196788311004639 |
1 | 14 | . | 2.855055093765259 |
2 | 1 | No | nan |
2 | 2 | head | 12.043790817260742 |
2 | 3 | injury | 7.169843673706055 |
2 | 4 | is | 3.976238965988159 |
2 | 5 | too | 6.11444616317749 |
2 | 6 | trivial | 10.36826229095459 |
2 | 7 | to | 1.1925396919250488 |
2 | 8 | be | 3.6252267360687256 |
2 | 9 | ignored | 5.360403060913086 |
2 | 10 | . | 1.3230934143066406 |
3 | 1 | The | nan |
3 | 2 | key | 10.35491943359375 |
3 | 3 | to | 2.019094467163086 |
3 | 4 | the | 3.7583045959472656 |
3 | 5 | cabinets | 21.04239845275879 |
3 | 6 | were | 6.044715404510498 |
3 | 7 | on | 9.186738967895508 |
3 | 8 | the | 1.0266693830490112 |
3 | 9 | table | 6.743055820465088 |
3 | 10 | . | 2.8487112522125244 |
4 | 1 | How | nan |
4 | 2 | many | 8.747537612915039 |
4 | 3 | animals | 10.349991798400879 |
4 | 4 | of | 7.982310771942139 |
4 | 5 | each | 7.254271984100342 |
4 | 6 | kind | 3.8629841804504395 |
4 | 7 | did | 6.853036880493164 |
4 | 8 | Moses | 11.290939331054688 |
4 | 9 | take | 6.513387680053711 |
4 | 10 | on | 5.387193202972412 |
4 | 11 | the | 2.429086208343506 |
4 | 12 | ar | 8.29068660736084 |
4 | 13 | k | 0.001733059762045741 |
4 | 14 | ? | 1.3717999458312988 |
5 | 1 | The | nan |
5 | 2 | horse | 13.856287002563477 |
5 | 3 | raced | 10.928426742553711 |
5 | 4 | past | 5.529265880584717 |
5 | 5 | the | 1.912912130355835 |
5 | 6 | barn | 6.164068222045898 |
5 | 7 | fell | 18.577974319458008 |
5 | 8 | . | 6.4461774826049805 |
6 | 1 | The | nan |
6 | 2 | first | 7.707244873046875 |
6 | 3 | thing | 3.870574712753296 |
6 | 4 | the | 5.894345760345459 |
6 | 5 | new | 7.025041580200195 |
6 | 6 | president | 6.4177327156066895 |
6 | 7 | will | 4.513916492462158 |
6 | 8 | do | 0.641898512840271 |
6 | 9 | is | 0.6119055151939392 |
6 | 10 | introduce | 6.937398910522461 |
6 | 11 | some | 5.374466896057129 |
6 | 12 | sort | 5.1832194328308105 |
6 | 13 | of | 0.0006344764260575175 |
6 | 14 | " | 5.472208499908447 |
6 | 15 | Make | 6.435114860534668 |
6 | 16 | America | 0.20164340734481812 |
6 | 17 | Great | 0.06291275471448898 |
6 | 18 | Again | 0.01570785976946354 |
6 | 19 | " | 0.08896449953317642 |
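The resulting file can then be analysed with a few lines of Python. The sketch below assumes four columns in the order shown in the table above (item, position, token, surprisal) and reads them by position, so the exact header names do not matter. To stay self-contained it parses the rows for item 5, copied from the table, from a string instead of output_generate.csv. The function name `most_surprising` is illustrative.

```python
import csv
import io
import math

# Rows for item 5, copied from the result table; the header names are an assumption.
sample = """item,wn,w,surprisal
5,1,The,nan
5,2,horse,13.856287002563477
5,3,raced,10.928426742553711
5,4,past,5.529265880584717
5,5,the,1.912912130355835
5,6,barn,6.164068222045898
5,7,fell,18.577974319458008
5,8,.,6.4461774826049805
"""

def most_surprising(handle):
    """Map each item to the (token, surprisal) pair with the highest surprisal."""
    rows = csv.reader(handle)
    next(rows)  # skip the header row
    best = {}
    for item, _position, token, value in rows:
        surprisal = float(value)  # float() also parses the "nan" entries
        if math.isnan(surprisal):
            continue  # the first token of each item is reported as nan
        if item not in best or surprisal > best[item][1]:
            best[item] = (token, surprisal)
    return best

print(most_surprising(io.StringIO(sample))["5"][0])  # fell
```

For the garden-path sentence in item 5, the peak falls on "fell", the word at which the sentence has to be reanalysed.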
Top 5 next tokens:
python3 llm_topn.py "The key to the cabinets" -n 5
Item Text Token Rank: Surprisal (bits)
1 The key to the cabinets is 1: ██ 1.5
1 The key to the cabinets are 2: ████ 4.1
1 The key to the cabinets , 3: ████ 4.2
1 The key to the cabinets was 4: ████ 4.2
1 The key to the cabinets and 5: ████ 4.5
python3 llm_topn.py "Der Schlüssel zu den Schränken" -n 10 -m xglm-564M
Item Text Token Rank: Surprisal (bits)
1 Der Schlüssel zu den Schränken </s> 1: ██ 2.3
1 Der Schlüssel zu den Schränken ist 2: ███ 2.8
1 Der Schlüssel zu den Schränken , 3: ████ 4.0
1 Der Schlüssel zu den Schränken und 4: ████ 4.4
1 Der Schlüssel zu den Schränken im 5: █████ 4.5
1 Der Schlüssel zu den Schränken in 6: █████ 4.6
1 Der Schlüssel zu den Schränken des 7: █████ 4.9
1 Der Schlüssel zu den Schränken : 8: █████ 5.0
1 Der Schlüssel zu den Schränken der 9: █████ 5.4
1 Der Schlüssel zu den Schränken . 10: ██████ 6.0
python3 llm_topn.py "The key to the cabinets" -n 5 -c
python3 llm_topn.py "The key to the cabinets" -n 5 -o output.csv
To process multiple items in batch mode, create a .csv file following this example:
item,text,n
1,The key to the cabinets,10
2,The key to the cabinet,10
3,The first thing the new president will do is to introduce,10
4,"After moving into the Oval Office, one of the first things that",10
Columns:
- Item number
- Text
- Number of top tokens that should be reported
Then run:
python3 llm_topn.py -i input_topn.csv -o output_topn.csv
Result:
item | s | w | rank | surprisal |
---|---|---|---|---|
1 | The key to the cabinets | is | 1 | 1.530847191810608 |
1 | The key to the cabinets | are | 2 | 4.100262641906738 |
1 | The key to the cabinets | , | 3 | 4.1611528396606445 |
1 | The key to the cabinets | was | 4 | 4.206236839294434 |
1 | The key to the cabinets | and | 5 | 4.458767890930176 |
1 | The key to the cabinets | in | 6 | 4.966185569763184 |
1 | The key to the cabinets | of | 7 | 5.340408802032471 |
1 | The key to the cabinets | ’ | 8 | 5.369940280914307 |
1 | The key to the cabinets | being | 9 | 5.823633193969727 |
1 | The key to the cabinets | that | 10 | 6.032191753387451 |
2 | The key to the cabinet | ’s | 1 | 1.8515361547470093 |
2 | The key to the cabinet | is | 2 | 2.9451916217803955 |
2 | The key to the cabinet | , | 3 | 4.270960807800293 |
2 | The key to the cabinet | was | 4 | 4.756969928741455 |
2 | The key to the cabinet | meeting | 5 | 5.037260055541992 |
2 | The key to the cabinet | being | 6 | 5.4005866050720215 |
2 | The key to the cabinet | resh | 7 | 6.193490028381348 |
2 | The key to the cabinet | has | 8 | 6.257472991943359 |
2 | The key to the cabinet | and | 9 | 6.363502502441406 |
2 | The key to the cabinet | of | 10 | 6.371416091918945 |
3 | The first thing the new president will do is to introduce | a | 1 | 1.717236042022705 |
3 | The first thing the new president will do is to introduce | legislation | 2 | 3.0158398151397705 |
3 | The first thing the new president will do is to introduce | the | 3 | 3.788292407989502 |
3 | The first thing the new president will do is to introduce | his | 4 | 4.383864402770996 |
3 | The first thing the new president will do is to introduce | an | 5 | 4.400935649871826 |
3 | The first thing the new president will do is to introduce | new | 6 | 4.592444896697998 |
3 | The first thing the new president will do is to introduce | some | 7 | 5.393261909484863 |
3 | The first thing the new president will do is to introduce | himself | 8 | 6.188421726226807 |
3 | The first thing the new president will do is to introduce | more | 9 | 7.121828079223633 |
3 | The first thing the new president will do is to introduce | and | 10 | 7.167385578155518 |
4 | After moving into the Oval Office, one of the first things that | came | 1 | 4.16267204284668 |
4 | After moving into the Oval Office, one of the first things that | I | 2 | 4.3133015632629395 |
4 | After moving into the Oval Office, one of the first things that | Trump | 3 | 4.36268949508667 |
4 | After moving into the Oval Office, one of the first things that | President | 4 | 4.635979652404785 |
4 | After moving into the Oval Office, one of the first things that | he | 5 | 4.925130367279053 |
4 | After moving into the Oval Office, one of the first things that | the | 6 | 5.133755207061768 |
4 | After moving into the Oval Office, one of the first things that | was | 7 | 5.245244026184082 |
4 | After moving into the Oval Office, one of the first things that | happened | 8 | 5.386913299560547 |
4 | After moving into the Oval Office, one of the first things that | Obama | 9 | 6.018731117248535 |
4 | After moving into the Oval Office, one of the first things that | Mr | 10 | 6.0303544998168945 |
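Because surprisal in bits is the negative base-2 logarithm of a probability, each value can be converted back to a probability with p = 2^(-surprisal). A small sketch using the top three rows for item 1 from the table above (the function name `probability` is illustrative):

```python
def probability(surprisal_bits: float) -> float:
    """Convert a surprisal in bits back to a probability."""
    return 2.0 ** -surprisal_bits

# Top three continuations of "The key to the cabinets", copied from the table.
top3 = {"is": 1.530847191810608, "are": 4.100262641906738, ",": 4.1611528396606445}
for token, bits in top3.items():
    print(f"{token!r}: {probability(bits):.3f}")
```

The top-ranked continuation "is" comes out at a probability of roughly 0.35, several times that of the plural "are".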