tmalsburg/llm_surprisal

Simple command-line tools for low-barrier, reproducible experiments with GPT2, XGLM (564M), and Bloom (560M). They are primarily intended for use in education and are designed to be easy to use for non-technical users.

There are currently two tools:

  1. llm_generate.py: Given a text, generate the next N tokens. Annotate all words with their surprisal.
  2. llm_topn.py: Given a text, list the N most highly ranked next tokens with their surprisal.

Key features:

  • Assumes basic familiarity with the use of a command line, but no programming knowledge is needed.
  • Runs open-source pre-trained LLMs locally, without the need for an API key or internet connection.
  • Reasonably fast on CPU.
  • Reproducible results via random seeds.
  • Batch mode that processes multiple items in one go.
  • Output is either text-based bar charts for quick experiments or .csv files for easy processing in R, Python, spreadsheet editors, and the like.

Developed and tested on Ubuntu Linux, but it may also work out of the box on macOS and Windows.

Install

Prerequisites

It is assumed that the system has a recent version of Python 3 and the pip package manager for Python. On Ubuntu Linux, these can be installed via sudo apt install python3-pip.

Then install PyTorch and Huggingface’s transformers package:

pip3 install torch
pip3 install transformers
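
To check that both packages were installed correctly, the following one-liner should print the installed version numbers:

python3 -c "import torch, transformers; print(torch.__version__, transformers.__version__)"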

Use

Generation with surprisal

Simple generation of tokens

The following command generates four additional tokens using GPT2 (the default model) and calculates surprisal for each token:

python3 llm_generate.py "The key to the cabinets" -n 4
Item Idx    Token: Surprisal (bits)
   1   1      The:                         nan
   1   2      key: ██████████             10.4
   1   3       to: ██                      2.0
   1   4      the: ████                    3.8
   1   5 cabinets: █████████████████████  21.0
   1   6       is: ██                      1.5
   1   7     that: ███                     3.3
   1   8      the: ███                     2.5
   1   9    doors: ████████                7.6
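
Surprisal here is the negative log probability, in bits, of a token given its preceding context: surprisal(w) = -log2 P(w | context). The first token has no context, which is why its surprisal is reported as nan. The following is a minimal sketch of how per-token surprisal can be computed with Huggingface transformers; it is meant for illustration and is not necessarily the exact implementation used in llm_generate.py.

import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

# GPT2 is the default model of llm_generate.py; other causal LMs work analogously.
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

text = "The key to the cabinets is that the doors"
ids = tokenizer(text, return_tensors="pt").input_ids

with torch.no_grad():
    logits = model(ids).logits            # shape: (1, n_tokens, vocab_size)

log_probs = torch.log_softmax(logits, dim=-1)
# Token i is predicted from position i-1; the first token therefore has no surprisal.
token_log_probs = log_probs[0, :-1].gather(1, ids[0, 1:].unsqueeze(1)).squeeze(1)
surprisal_bits = -token_log_probs / torch.log(torch.tensor(2.0))

for tok, s in zip(tokenizer.convert_ids_to_tokens(ids[0, 1:]), surprisal_bits):
    print(f"{tok:>12} {s.item():5.1f}")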

Multilingual models

Generation with XGLM 564M

python3 llm_generate.py "Der Polizist sagte, dass man nicht mehr ermitteln kann," -n 5 -m xglm-564M
Item Idx       Token: Surprisal (bits)
   1   1        </s>:                    nan
   1   2        </s>: █████              4.8
   1   3         Der: ████████████      11.6
   1   4      Polizi: █████████████     13.0
   1   5          st:                    0.2
   1   6       sagte: ███████████       10.7
   1   7           ,: ██                 1.7
   1   8        dass: ██                 2.0
   1   9         man: █████              5.5
   1  10       nicht: █████              4.5
   1  11        mehr: ████               4.2
   1  12          er: ████████           7.8
   1  13     mitteln: ████               4.1
   1  14        kann: ███                3.1
   1  15           ,: █                  1.2
   1  16          da: ████               4.3
   1  17       nicht: ███████            7.1
   1  18        alle: ██                 2.4
   1  19       Daten: ██████             5.7
   1  20 gespeichert: ███                3.3

Likewise with Bloom 560M:

python3 llm_generate.py "Der Polizist sagte, dass man nicht mehr ermitteln kann," -n 5 -m bloom-560m

Sampling parameters

Two sampling parameters are currently supported: temperature (-t, default 1) and top-k (-k, default 50). To use different sampling parameters:

python3 llm_generate.py "This is a" -t 1000 -k 1
Item Idx   Token: Surprisal (bits)
   1   1    This:                    nan
   1   2      is: ████               4.4
   1   3       a: ███                2.7
   1   4    very: ████               4.2
   1   5    good: ████               3.8
   1   6 example: ████               4.2
   1   7      of:                    0.4
   1   8     how: ██                 2.2
   1   9      to: ███                2.7
   1  10     use: ███                2.9
   1  11     the: ███                3.2
   1  12       ": ██████             6.4
   1  13       I: ████████           7.8

The repetition penalty is fixed at 1.0, on the assumption that larger values are not desirable when studying the behaviour of the model. Nucleus sampling is currently not supported but could be added if needed.
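
For reference, temperature and top-k shape the sampling distribution as follows: the logits are divided by the temperature, everything but the k most probable tokens is discarded, and the next token is drawn from the renormalized remainder. Below is a minimal sketch of this procedure (for illustration only; not necessarily the exact implementation in llm_generate.py).

import torch

def sample_next_token(logits, temperature=1.0, top_k=50):
    # Temperature rescales the logits: values > 1 flatten the distribution,
    # values < 1 sharpen it.
    logits = logits / temperature
    # Keep only the k highest-scoring tokens and renormalize over them.
    top_values, top_indices = torch.topk(logits, top_k)
    probs = torch.softmax(top_values, dim=-1)
    choice = torch.multinomial(probs, num_samples=1)
    return top_indices[choice]

With -k 1, only the single most probable token survives the top-k filter, so generation is effectively greedy regardless of the temperature, as in the example above.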

Output in CSV format

CSV-formatted shell output can be obtained with the -c option:

python3 llm_generate.py "The key to the cabinets" -n 4 -c
item,idx,token,surprisal
1,1,The,nan
1,2,key,10.35491943359375
1,3,to,2.019094467163086
1,4,the,3.7583045959472656
1,5,cabinets,21.04239845275879
1,6,is,1.5308449268341064
1,7,that,3.2748565673828125
1,8,the,2.5106589794158936
1,9,doors,7.590230464935303

Store results in a .csv file

To store results in a .csv file which can be easily loaded in R, Excel, Google Sheets, and similar:

python3 llm_generate.py "The key to the cabinets" -n 4 -o output.csv

When storing results in a file, there is no need to specify -c; CSV is used by default.
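
The resulting file is plain CSV and can be read with any standard tool. For example, assuming pandas is installed and the results were stored as output.csv as above, the per-item mean surprisal can be computed like this (column names follow the CSV output shown above):

import pandas as pd

d = pd.read_csv("output.csv")
# Mean surprisal per item; the first token's surprisal is nan and is skipped.
print(d.groupby("item")["surprisal"].mean())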

Reproducible generation

To obtain reproducible (i.e. deterministic) results, the -s option can be used to set the random seed:

python3 llm_generate.py "The key to the cabinets" -n 4 -s 1
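
Under the hood, reproducibility amounts to seeding the random number generators before sampling; running the same command with the same seed twice produces identical tokens. A minimal sketch of the idea (the actual mechanism in llm_generate.py may differ):

from transformers import set_seed

set_seed(1)  # seeds Python's, NumPy's, and PyTorch's random number generators
# ... generation code as usual; repeated runs now produce identical samples.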

Batch mode generation

To process multiple items in batch mode, create a .csv file following this example:

item,text,n
1,John saw the man who the card catalog had confused a great deal.,0
2,No head injury is too trivial to be ignored.,0
3,The key to the cabinets were on the table.,0
4,How many animals of each kind did Moses take on the ark?,0
5,The horse raced past the barn fell.,0
6,The first thing the new president will do is,10

Columns:

  1. Item number
  2. Text
  3. Number of additional tokens that should be generated
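
For larger experiments the input file can also be written programmatically. A minimal sketch using Python's csv module, reusing two of the items above (the file name matches the command below):

import csv

items = [
    (1, "John saw the man who the card catalog had confused a great deal.", 0),
    (6, "The first thing the new president will do is", 10),
]

with open("input_generate.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["item", "text", "n"])  # header expected by llm_generate.py
    writer.writerows(items)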

Then run:

python3 llm_generate.py -i input_generate.csv -o output_generate.csv

Result (contents of output_generate.csv):

item,wn,w,surprisal
1,1,John,nan
1,2,saw,12.686095237731934
1,3,the,2.5510218143463135
1,4,man,6.69647216796875
1,5,who,4.4374775886535645
1,6,the,9.218789100646973
1,7,card,12.91416072845459
1,8,catalog,13.132523536682129
1,9,had,5.045916557312012
1,10,confused,12.417732238769531
1,11,a,8.445308685302734
1,12,great,8.923978805541992
1,13,deal,0.5196788311004639
1,14,.,2.855055093765259
2,1,No,nan
2,2,head,12.043790817260742
2,3,injury,7.169843673706055
2,4,is,3.976238965988159
2,5,too,6.11444616317749
2,6,trivial,10.36826229095459
2,7,to,1.1925396919250488
2,8,be,3.6252267360687256
2,9,ignored,5.360403060913086
2,10,.,1.3230934143066406
3,1,The,nan
3,2,key,10.35491943359375
3,3,to,2.019094467163086
3,4,the,3.7583045959472656
3,5,cabinets,21.04239845275879
3,6,were,6.044715404510498
3,7,on,9.186738967895508
3,8,the,1.0266693830490112
3,9,table,6.743055820465088
3,10,.,2.8487112522125244
4,1,How,nan
4,2,many,8.747537612915039
4,3,animals,10.349991798400879
4,4,of,7.982310771942139
4,5,each,7.254271984100342
4,6,kind,3.8629841804504395
4,7,did,6.853036880493164
4,8,Moses,11.290939331054688
4,9,take,6.513387680053711
4,10,on,5.387193202972412
4,11,the,2.429086208343506
4,12,ar,8.29068660736084
4,13,k,0.001733059762045741
4,14,?,1.3717999458312988
5,1,The,nan
5,2,horse,13.856287002563477
5,3,raced,10.928426742553711
5,4,past,5.529265880584717
5,5,the,1.912912130355835
5,6,barn,6.164068222045898
5,7,fell,18.577974319458008
5,8,.,6.4461774826049805
6,1,The,nan
6,2,first,7.707244873046875
6,3,thing,3.870574712753296
6,4,the,5.894345760345459
6,5,new,7.025041580200195
6,6,president,6.4177327156066895
6,7,will,4.513916492462158
6,8,do,0.641898512840271
6,9,is,0.6119055151939392
6,10,introduce,6.937398910522461
6,11,some,5.374466896057129
6,12,sort,5.1832194328308105
6,13,of,0.0006344764260575175
6,14,"""",5.472208499908447
6,15,Make,6.435114860534668
6,16,America,0.20164340734481812
6,17,Great,0.06291275471448898
6,18,Again,0.01570785976946354
6,19,"""",0.08896449953317642
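
Since the output is plain CSV, the surprisal values can be analyzed with standard tools. For instance, to look up the surprisal of the agreement-violating verb in item 3 ("The key to the cabinets were ...") with pandas (column names as in the table above):

import pandas as pd

d = pd.read_csv("output_generate.csv")
# Surprisal of "were" in item 3:
print(d[(d["item"] == 3) & (d["w"] == "were")])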

Top N next tokens with surprisal

Simple top N

Top 5 next tokens:

python3 llm_topn.py "The key to the cabinets" -n 5
Item                    Text Token Rank: Surprisal (bits)
   1 The key to the cabinets    is    1: ██                 1.5
   1 The key to the cabinets   are    2: ████               4.1
   1 The key to the cabinets     ,    3: ████               4.2
   1 The key to the cabinets   was    4: ████               4.2
   1 The key to the cabinets   and    5: ████               4.5
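
Conceptually, llm_topn.py inspects the model's probability distribution over the single next token, ranks the vocabulary by probability, and reports the surprisal of the N highest-ranked tokens. A minimal sketch with GPT2 (for illustration; not necessarily the exact implementation):

import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

ids = tokenizer("The key to the cabinets", return_tensors="pt").input_ids
with torch.no_grad():
    logits = model(ids).logits[0, -1]      # distribution over the next token

log_probs = torch.log_softmax(logits, dim=-1)
surprisal_bits = -log_probs / torch.log(torch.tensor(2.0))
top = torch.topk(log_probs, 5)             # the 5 most probable next tokens

for rank, idx in enumerate(top.indices, start=1):
    print(f"{rank:2d} {tokenizer.decode(int(idx))!r} {surprisal_bits[idx].item():5.1f}")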

Multilingual top N

python3 llm_topn.py "Der Schlüssel zu den Schränken" -n 10 -m xglm-564M
Item                           Text Token Rank: Surprisal (bits)
   1 Der Schlüssel zu den Schränken  </s>    1: ██                 2.3
   1 Der Schlüssel zu den Schränken   ist    2: ███                2.8
   1 Der Schlüssel zu den Schränken     ,    3: ████               4.0
   1 Der Schlüssel zu den Schränken   und    4: ████               4.4
   1 Der Schlüssel zu den Schränken    im    5: █████              4.5
   1 Der Schlüssel zu den Schränken    in    6: █████              4.6
   1 Der Schlüssel zu den Schränken   des    7: █████              4.9
   1 Der Schlüssel zu den Schränken     :    8: █████              5.0
   1 Der Schlüssel zu den Schränken   der    9: █████              5.4
   1 Der Schlüssel zu den Schränken     .   10: ██████             6.0

Force CSV format in shell output

python3 llm_topn.py "The key to the cabinets" -n 5 -c

Store results in a file (CSV format)

python3 llm_topn.py "The key to the cabinets" -n 5 -o output.csv

Batch mode top N

To process multiple items in batch mode, create a .csv file following this example:

item,text,n
1,The key to the cabinets,10
2,The key to the cabinet,10
3,The first thing the new president will do is to introduce,10
4,"After moving into the Oval Office, one of the first things that",10

Columns:

  1. Item number
  2. Text
  3. Number of top tokens that should be reported

Then run:

python3 llm_topn.py -i input_topn.csv -o output_topn.csv

Result (contents of output_topn.csv):

item,s,w,rank,surprisal
1,The key to the cabinets,is,1,1.530847191810608
1,The key to the cabinets,are,2,4.100262641906738
1,The key to the cabinets,",",3,4.1611528396606445
1,The key to the cabinets,was,4,4.206236839294434
1,The key to the cabinets,and,5,4.458767890930176
1,The key to the cabinets,in,6,4.966185569763184
1,The key to the cabinets,of,7,5.340408802032471
1,The key to the cabinets,,8,5.369940280914307
1,The key to the cabinets,being,9,5.823633193969727
1,The key to the cabinets,that,10,6.032191753387451
2,The key to the cabinet,'s,1,1.8515361547470093
2,The key to the cabinet,is,2,2.9451916217803955
2,The key to the cabinet,",",3,4.270960807800293
2,The key to the cabinet,was,4,4.756969928741455
2,The key to the cabinet,meeting,5,5.037260055541992
2,The key to the cabinet,being,6,5.4005866050720215
2,The key to the cabinet,resh,7,6.193490028381348
2,The key to the cabinet,has,8,6.257472991943359
2,The key to the cabinet,and,9,6.363502502441406
2,The key to the cabinet,of,10,6.371416091918945
3,The first thing the new president will do is to introduce,a,1,1.717236042022705
3,The first thing the new president will do is to introduce,legislation,2,3.0158398151397705
3,The first thing the new president will do is to introduce,the,3,3.788292407989502
3,The first thing the new president will do is to introduce,his,4,4.383864402770996
3,The first thing the new president will do is to introduce,an,5,4.400935649871826
3,The first thing the new president will do is to introduce,new,6,4.592444896697998
3,The first thing the new president will do is to introduce,some,7,5.393261909484863
3,The first thing the new president will do is to introduce,himself,8,6.188421726226807
3,The first thing the new president will do is to introduce,more,9,7.121828079223633
3,The first thing the new president will do is to introduce,and,10,7.167385578155518
4,"After moving into the Oval Office, one of the first things that",came,1,4.16267204284668
4,"After moving into the Oval Office, one of the first things that",I,2,4.3133015632629395
4,"After moving into the Oval Office, one of the first things that",Trump,3,4.36268949508667
4,"After moving into the Oval Office, one of the first things that",President,4,4.635979652404785
4,"After moving into the Oval Office, one of the first things that",he,5,4.925130367279053
4,"After moving into the Oval Office, one of the first things that",the,6,5.133755207061768
4,"After moving into the Oval Office, one of the first things that",was,7,5.245244026184082
4,"After moving into the Oval Office, one of the first things that",happened,8,5.386913299560547
4,"After moving into the Oval Office, one of the first things that",Obama,9,6.018731117248535
4,"After moving into the Oval Office, one of the first things that",Mr,10,6.0303544998168945
