***************************
LATENT DIRICHLET ALLOCATION
***************************
David M. Blei
blei[at]cs.princeton.edu
(C) Copyright 2006, David M. Blei (blei [at] cs [dot] princeton [dot] edu)
This file is part of LDA-C.
LDA-C is free software; you can redistribute it and/or modify it under
the terms of the GNU General Public License as published by the Free
Software Foundation; either version 2 of the License, or (at your
option) any later version.
LDA-C is distributed in the hope that it will be useful, but WITHOUT
ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License
for more details.
You should have received a copy of the GNU General Public License
along with this program; if not, write to the Free Software
Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307
USA
------------------------------------------------------------------------
This is a C implementation of latent Dirichlet allocation (LDA), a
model of discrete data which is fully described in Blei et al. (2003)
(http://www.cs.berkeley.edu/~blei/papers/blei03a.pdf).
LDA is a hierarchical probabilistic model of documents. Let \alpha be
a scalar and \beta_{1:K} be K distributions of words (called "topics").
As implemented here, a K topic LDA model assumes the following
generative process of an N word document:
1. \theta | \alpha ~ Dirichlet(\alpha, ..., \alpha)
2. for each word n in {1, ..., N}:
   a. Z_n | \theta ~ Mult(\theta)
   b. W_n | z_n, \beta ~ Mult(\beta_{z_n})
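The generative process above can be sketched in a few lines of Python. This is an illustration only and is not part of LDA-C: the function name sample_document, its arguments, and the use of normalized Gamma draws to sample the Dirichlet are all choices made for this sketch.

```python
import random


def sample_document(alpha, topics, n_words, rng):
    """Draw one document from the generative process described above.

    alpha   -- scalar Dirichlet parameter
    topics  -- list of K lists; topics[k][v] is p(w = v | z = k)
    n_words -- N, the number of words to generate
    rng     -- a random.Random instance (hypothetical helper argument)
    Returns (theta, z, w).
    """
    k = len(topics)
    # Step 1: theta ~ Dirichlet(alpha, ..., alpha), sampled by
    # normalizing K independent Gamma(alpha, 1) draws.
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(k)]
    total = sum(draws)
    theta = [d / total for d in draws]

    z, w = [], []
    for _ in range(n_words):
        # Step 2a: pick a topic index from Mult(theta).
        z_n = rng.choices(range(k), weights=theta)[0]
        # Step 2b: pick a word index from Mult(beta_{z_n}).
        w_n = rng.choices(range(len(topics[z_n])), weights=topics[z_n])[0]
        z.append(z_n)
        w.append(w_n)
    return theta, z, w
```

For example, sample_document(0.5, [[0.9, 0.1, 0.0], [0.0, 0.1, 0.9]], 20, random.Random(0)) returns topic proportions for one document together with 20 topic assignments and 20 word indices.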
This code implements variational inference of \theta and z_{1:N} for a
document, and estimation of the topics \beta_{1:K} and Dirichlet
parameter \alpha.
------------------------------------------------------------------------
TABLE OF CONTENTS
A. COMPILING
B. TOPIC ESTIMATION
1. SETTINGS FILE
2. DATA FILE FORMAT
C. INFERENCE
D. PRINTING TOPICS
E. QUESTIONS, COMMENTS, PROBLEMS, AND UPDATE ANNOUNCEMENTS
------------------------------------------------------------------------
A. COMPILING
Type "make" in a shell.
------------------------------------------------------------------------
B. TOPIC ESTIMATION
Estimate the model by executing:
lda est [alpha] [k] [settings] [data] [random/seeded/manual=filename/*] [directory]
The term [random/seeded/manual=filename/*] describes how the topics
will be initialized. "random" initializes each topic randomly;
"seeded" initializes each topic to a distribution smoothed from a
randomly chosen document; "manual=filename" loads the document numbers
to use as seeds from the specified file (one per line). Alternatively,
specify a model name to load a pre-existing model as the initial model
(this is useful to continue EM from where it left off). To change the
number of initial documents used, edit lda-estimate.c.
The model (i.e., \alpha and \beta_{1:K}) and variational posterior
Dirichlet parameters will be saved in the specified directory every
ten iterations. Additionally, there will be a log file for the
likelihood bound and convergence score at each iteration. The
algorithm runs until that score is less than "em_convergence" (from
the settings file) or "em_max_iter" iterations are reached. (To
change the lag between saved models, edit lda-estimate.c.)
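The stopping rule described above can be sketched as follows. This is an illustration, not the code in lda-estimate.c: the function names are hypothetical, the score is taken to be the relative change in the likelihood bound between successive iterations (as the settings section defines it), and using the magnitude of that change is an assumption of this sketch.

```python
def relative_change(score_old, score):
    """Magnitude of the relative change between two successive bounds.

    Assumption: the absolute value of the numerator is taken here so the
    result is non-negative whether the bound rose or fell.
    """
    return abs(score_old - score) / abs(score_old)


def should_stop(score_old, score, iteration, em_convergence, em_max_iter):
    """Return True when EM would stop under the rule described above."""
    # Stop once the iteration budget is used up.
    if iteration >= em_max_iter:
        return True
    # Otherwise stop when the bound has effectively stopped moving.
    return relative_change(score_old, score) < em_convergence
```

With em_convergence set to 1e-5, a step from a bound of -1000.0 to -999.999 would stop the loop, while a step from -1000.0 to -990.0 would not.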
The saved models are in two files:
<iteration>.other contains alpha.
<iteration>.beta contains the log of the topic distributions.
Each line is a topic; in line k, each entry is log p(w | z=k)
The variational posterior Dirichlets are in:
<iteration>.gamma
The settings file and data format are described below.
1. Settings file
See settings.txt for a sample. See inf-settings.txt for an example of
a settings file for inference. These are placeholder values; they
should be experimented with.
The settings file has the following form:
var max iter [integer e.g., 10 or -1]
var convergence [float e.g., 1e-8]
em max iter [integer e.g., 100]
em convergence [float e.g., 1e-5]
alpha [fixed/estimate]
where the settings are
[var max iter]
The maximum number of iterations of coordinate ascent variational
inference for a single document. A value of -1 indicates "full"
variational inference, until the variational convergence
criterion is met.
[var convergence]
The convergence criterion for variational inference. Stop if
(score_old - score) / abs(score_old) is less than this value (or
after the maximum number of iterations). Note that the score is
the lower bound on the likelihood for a particular document.
[em max iter]
The maximum number of iterations of variational EM.
[em convergence]
The convergence criterion for variational EM. Stop if (score_old -
score) / abs(score_old) is less than this value (or after the
maximum number of iterations). Note that "score" is the lower
bound on the likelihood for the whole corpus.
[alpha]
If set to [fixed] then alpha does not change from iteration to
iteration. If set to [estimate], then alpha is estimated along
with the topic distributions.
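To make the file layout concrete, here is a minimal sketch of a reader for the five settings lines. It is not the parser used by LDA-C; the function name parse_settings and the returned dictionary are hypothetical, and the sample values are the example values quoted above.

```python
def parse_settings(text):
    """Parse the five-line settings format into a dict keyed by setting name."""
    converters = {
        "var max iter": int,       # -1 means "run until converged"
        "var convergence": float,
        "em max iter": int,
        "em convergence": float,
        "alpha": str,              # kept as written in the file
    }
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        for key, convert in converters.items():
            if line.startswith(key + " "):
                settings[key] = convert(line[len(key):].strip())
                break
        else:
            raise ValueError("unrecognized settings line: " + line)
    return settings


sample = """var max iter -1
var convergence 1e-8
em max iter 100
em convergence 1e-5
alpha estimate
"""
settings = parse_settings(sample)
```

Here settings["var max iter"] is the integer -1 and settings["alpha"] is the string "estimate".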
2. Data format
Under LDA, the words of each document are assumed exchangeable. Thus,
each document is succinctly represented as a sparse vector of word
counts. The data is a file where each line is of the form:
[M] [term_1]:[count] [term_2]:[count] ... [term_M]:[count]
where [M] is the number of unique terms in the document, and the
[count] associated with each term is how many times that term appeared
in the document. Note that each [term_i] is an integer which indexes
the term; it is not a string.
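As an illustration of the format, the following sketch parses one line of a data file into a mapping from term index to count. The function name parse_document is hypothetical, and the sample line is made up for this example.

```python
def parse_document(line):
    """Parse one data-file line into {term_index: count}."""
    fields = line.split()
    n_unique = int(fields[0])          # the leading [M]
    pairs = fields[1:]
    if len(pairs) != n_unique:
        raise ValueError("expected %d term:count pairs, found %d"
                         % (n_unique, len(pairs)))
    counts = {}
    for pair in pairs:
        term, count = pair.split(":")
        counts[int(term)] = int(count)  # both sides are integers
    return counts


doc = parse_document("3 0:2 5:1 9:4")
total_words = sum(doc.values())
```

The made-up line "3 0:2 5:1 9:4" describes a document with three unique terms and seven words in total.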
------------------------------------------------------------------------
C. INFERENCE
To perform inference on a different set of data (in the same format as
for estimation), execute:
lda inf [settings] [model] [data] [name]
Variational inference is performed on the data using the model in
[model].* (see above). Two files will be created: [name].gamma
contains the variational Dirichlet parameters for each document;
[name].likelihood contains the bound on the likelihood for each
document.
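A common use of the variational Dirichlet parameters is to turn them into per-document topic proportions by normalizing, since the mean of a Dirichlet with parameters gamma_1, ..., gamma_K is gamma_k divided by their sum. The sketch below assumes that the .gamma file holds one document per line with K whitespace-separated values; that layout and the function name topic_proportions are assumptions of this example.

```python
def topic_proportions(gamma_line):
    """Return the posterior mean of theta for one document.

    gamma_line -- one line of text holding the K Dirichlet parameters
                  (assumed whitespace-separated).
    """
    gamma = [float(x) for x in gamma_line.split()]
    total = sum(gamma)
    return [g / total for g in gamma]


proportions = topic_proportions("1.0 3.0")
```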
------------------------------------------------------------------------
D. PRINTING TOPICS
The Python script topics.py lets you print out the top N
words from each topic in a .beta file. Usage is:
python topics.py <beta file> <vocab file> <n words>
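The idea behind printing topics can be sketched as follows. This is not the topics.py script itself: the function name top_words is hypothetical, and the sketch assumes that entries on a .beta line are whitespace-separated and that the vocabulary is a list in which position i holds the word for term index i.

```python
def top_words(beta_line, vocab, n_words):
    """Return the n_words most probable words for one topic.

    beta_line -- one line of a .beta file: log p(w | z=k) for each term
    vocab     -- list of words, indexed by term (assumed layout)
    """
    log_probs = [float(x) for x in beta_line.split()]
    # Rank term indices from highest to lowest log probability.
    ranked = sorted(range(len(log_probs)),
                    key=lambda i: log_probs[i], reverse=True)
    return [vocab[i] for i in ranked[:n_words]]


words = top_words("-2.3 -0.1 -5.0 -1.2",
                  ["apple", "banana", "cherry", "date"], 2)
```

Sorting on log probabilities gives the same ranking as sorting on probabilities, so there is no need to exponentiate the entries first.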
------------------------------------------------------------------------
E. QUESTIONS, COMMENTS, PROBLEMS, AND UPDATE ANNOUNCEMENTS
Please join the topic-models mailing list. To join, go to
http://lists.cs.princeton.edu and click on "topic-models."