%% Chapter 6: Document Filtering (Page 117)
% "Programming Collective Intelligence - Building Smart Web 2.0 Applications"
% by Toby Segaran (O'Reilly Media, ISBN-10: 0-596-52932-5)
%
% This chapter shows how to classify documents based on their contents as
% an application of machine learning. A familiar application of such
% techniques is spam filtering. This problem began with e-mails, but it has
% now spread to blogs, message boards, wikis, and social networking
% sites like MySpace and Facebook in the form of spam posts, comments, and
% messages. This is the dark side of Web 2.0.
%
% However, these techniques are not limited to spam filtering. The
% algorithms are broad enough to accommodate other applications, because
% the core task is learning to recognize whether a document belongs in one
% category or another based on its content.
%
% In the world of marketing where I live, we also get a lot of junk leads
% as we move our lead-generation activities online. We use online forms to
% gather data, but prospects may try to bypass the forms by providing fake
% data just to get the fulfillment. It would be very useful to sort out the
% junk automatically.
%
% In this chapter I will also take advantage of the new object-oriented
% programming (OOP) features (the 'classdef' syntax) introduced in MATLAB
% R2008a.
%% Documents and Words (Page 118)
%
% The task of classifying documents requires some kind of features that
% distinguish one document from another. A feature is anything that you can
% determine as being either present or absent in a document, and in our
% case the words in a document are the logical choice for such features.
% The assumption is that the frequency with which words appear differs by
% document type. For example, certain words are more likely to appear in
% spam than in non-spam.
%
% Of course, features could also be word pairs, phrases, or any other
% larger or smaller units of document structure that can be either present
% or absent. In this case, we will focus on individual words.
%
% The 'getwords' function extracts words longer than 2 characters but
% shorter than 20 into a cell array using a regular expression. The text is
% split on non-alphabetic characters, so this doesn't work with Japanese
% text. Words are converted to lowercase.
%
% This is not a perfect solution, because spam often uses capital letters
% and non-alphabetic characters for emphasis, and we are not capturing
% these features in 'getwords'. So we need to leave room for improvement
% when we write the classification routine by making it easy to substitute
% a different feature-extraction function.
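%
% The core of such a function can be a few lines of regular-expression
% work. The following is just a minimal sketch of the idea; the actual
% 'getwords' in this folder may differ in detail.
%
% tokens = regexp(lower(text), '[a-z]+', 'match');  % split on non-alphabetic characters
% len = cellfun(@length, tokens);                   % length of each candidate word
% words = tokens(len > 2 & len < 20);               % keep words of 3 to 19 letters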
test='This is just a test %*$![]';
disp(['test sample=''' test ''';'])
disp('Extract features using ''getwords'' function')
disp(getwords(test))
clear all;
%% Training the Classifier (Pages 119-121)
%
% The classifiers discussed in this chapter learn how to classify a
% document by being trained. The more examples of documents and their
% correct classifications a classifier sees, the better it gets at making
% predictions, with increasing certainty.
%
% Here, the classifier is implemented as a user-defined MATLAB class,
% taking advantage of the new feature in R2008a. Classes act as templates.
% We may want to classify different documents by different criteria, and we
% could write separate programs for them, but with the template approach we
% can use one template to create individual cases of the classifier for
% each situation. These individual cases are called 'instances' or
% 'objects' in OOP. We can create different instances of the classifier
% class and train them differently so that they serve different purposes.
%
% Make sure you declare the class as a subclass of 'handle' in the class
% definition - this is the key feature that enables self-referencing
% behavior (methods that modify the object in place) for your user-defined
% classes. This feature is not well covered in the MATLAB documentation, so
% it took me a while to figure out.
%
% The 'classifier' class has three properties. 'fc' stands for 'feature
% count' and keeps track of the counts of different features in different
% categories of classification. 'cc' stands for 'category count' and keeps
% track of how many times each category has been used. The last property,
% 'getfeatures', stores a function handle to the feature-extraction
% function; the default is '@getwords'. The class also defines the methods
% you can use to construct an object (an instance) or manipulate it in
% various ways; 'train' and 'fcount' are examples of such methods.
%
% The 'train' method provides the initial training for a classifier object
% given a sample text and its category. 'fcount' returns the count of a
% given feature in a given category acquired through training.
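%
% A minimal skeleton of such a class might look like the following. This is
% only a sketch to show the structure; see 'classifier.m' for the actual
% implementation.
%
% classdef classifier < handle
%     properties
%         fc = {};                   % feature counts: {feature, category, count}
%         cc = {};                   % category counts: {category, count}
%         getfeatures = @getwords;   % feature-extraction function handle
%     end
%     methods
%         function train(obj, item, cat)
%             % extract features from 'item' and increment 'fc' and 'cc'
%         end
%         function c = fcount(obj, f, cat)
%             % look up how often feature 'f' was seen in category 'cat'
%         end
%     end
% end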
cl=classifier(); % create an instance or object from 'classifier' class
text1='the quick brown fox jumps over the lazy dog';
text2='make quick money in the online casino';
disp(['training sample1=''' text1 ''';'])
disp(['training sample2=''' text2 ''';'])
cl.train(text1,'good'); % train the classifier object 'cl' with sample text.
cl.train(text2,'bad'); % more training, with a bad example.
% the word 'quick' appeared once in both, so it should have a count of 1
% in each category
disp(' ')
disp(sprintf('Count of ''quick'' classified as ''good''=%d',cl.fcount('quick', 'good')))
disp(sprintf('Count of ''quick'' classified as ''bad''=%d',cl.fcount('quick', 'bad')))
%% Calculating Probabilities (Pages 121-122)
%
% Now that we know how many times each feature appears in each category and
% how many documents each category contains, we can calculate the
% probability of a word appearing in a particular category...
% Pr(word|category). This is implemented as the class method 'fprob'.
%
% We also have to re-train the object repeatedly as we develop the
% algorithm, so we should automate the training process - that's what
% 'sampletrain' does.
%
% The sample texts it feeds are:
%
% * 'Nobody owns the water.' -> good
% * 'the quick rabbit jumps fences' -> good
% * 'buy pharmaceuticals now' -> bad
% * 'make quick money at the online casino' -> bad
% * 'the quick brown fox jumps' -> good
%
% 'quick' appears in two of the three 'good' documents, so the probability
% of 'quick' appearing in 'good' documents should be 2/3 = 0.666666...
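%
% In other words, 'fprob' presumably computes nothing more complicated than
% the following (see 'classifier.m' for the actual code):
%
% * Pr(word|category) = fcount(word, category) / (number of documents
%   trained in that category)
% * e.g. Pr('quick'|'good') = 2/3 after one run of 'sampletrain'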
disp(' ')
disp('Reset the ''classifier'' object')
cl=classifier(); % reset the object by re-creating it.
sampletrain(cl); % feed the standard training data
format='Probability of ''quick'' appearing in ''good'' documents=%f';
disp(sprintf(format,cl.fprob('quick','good')))
%% Starting with a Reasonable Guess (Pages 122-123)
%
% 'fprob' gives an exact answer based on the training data, but that is
% also its limitation - it is extremely sensitive to the training data it
% has received so far. For example, the word 'money' appears only once, in
% a bad example, so its probability for 'good' documents is 0%. This is too
% extreme, because 'money' could perfectly well be used in good documents.
%
% To address this issue, we introduce the concept of an 'assumed
% probability' ('ap') that new features take on initially and that is then
% adjusted gradually through training. '0.5' is a good neutral probability
% to start with. The next thing to decide is how much weight to give 'ap'.
% A weight of 1 means it is worth one word of actual training data. The new
% metric is the weighted average of the probability returned by 'fprob' and
% 'ap'.
%
% For the word 'money', its probability for 'good' starts at the assumed
% value of 0.5. Then the classifier object sees one example of 'money' in
% the training data, and it is in a 'bad' document, so fprob('money','good')
% is 0 and the total count of 'money' is 1. Its weighted probability for
% 'good' therefore becomes:
%
% * (weight x ap + count x fprob)/(count + weight)
% * = (1 x 0.5 + 1 x 0)/(1 + 1)
% * = 0.25
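%
% A sketch of the 'weightedprob' calculation (the actual method may differ
% in detail):
%
% weight = 1;  ap = 0.5;                      % defaults described above
% count = 1;                                  % occurrences of 'money' across all categories
% basicprob = 0;                              % fprob('money','good')
% bp = (weight*ap + count*basicprob)/(weight + count)   % = 0.25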
disp(' ')
disp('Reset the ''classifier'' object')
cl=classifier();
sampletrain(cl);
format='Probability of ''money'' appearing in ''good'' documents=%f';
disp('Run the weighted probability calculation... ')
disp(' ')
disp(sprintf(format,cl.weightedprob('money','good',@cl.fprob)))
disp(' ')
disp('Repeat the training - this should increase certainty')
disp(' ')
sampletrain(cl);
disp(sprintf(format,cl.weightedprob('money','good',@cl.fprob)))
%% A Naive Classifier - Probability of a Whole Document (Pages 123-124)
%
% We now have a way to calculate Pr(word|category), the probability that a
% document in a given category contains a particular word, but we need a
% way to combine the individual word probabilities to get the probability
% that an entire document belongs in a given category.
%
% Here we will implement one approach, called a naive Bayesian classifier.
% It is called 'naive' because it assumes that the probabilities being
% combined are independent of each other. In the case of spam, we know that
% 'money' and 'casino' are much more highly correlated than most other
% combinations, so this is not a good assumption. While we cannot use the
% probabilities given by a naive Bayesian classifier as true probabilities,
% we can still compare its results across different categories. Therefore
% it is still an effective approach in real life despite the flaw in its
% assumption.
%
% If all features are independent, we can combine their probabilities by
% multiplying them together. Let's take the words 'MATLAB' and 'Casino' as
% an example, and suppose their probabilities of appearing in bad documents
% are as follows:
%
% * Pr(matlab|bad)=20%
% * Pr(casino|bad)=80%
%
% The probability that both 'MATLAB' and 'Casino' appear in a bad document
% is then 16%:
%
% * Pr(matlab&casino|bad)= 20%x80%=16%
%
% So we can get the probability for a whole document by multiplying
% together the probabilities of all the features it contains.
%
% We will create a subclass of 'classifier' to add this capability. The
% subclass is called 'naivebayes' and it implements the 'docprob' method.
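%
% A minimal sketch of what 'docprob' does (the actual method may differ in
% structure):
%
% function p = docprob(obj, item, cat)
%     features = obj.getfeatures(item);     % e.g. {'matlab','casino'}
%     p = 1;
%     for k = 1:length(features)
%         % multiply the weighted probability of each word for this category
%         p = p * obj.weightedprob(features{k}, cat, @obj.fprob);
%     end
% end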
%
% Let's test it with some sample data. Keep in mind that the result won't
% be exactly 16%, because we are using the weighted average here.
disp(' ')
disp('Reset the object with ''naivebayes''')
cl=naivebayes();
disp(' ')
disp('Training ''naivebayes'' object with 80% good ''MATLAB'' & 80% bad ''Casino''...')
cl.fc={'matlab','good',80;'matlab','bad',20; 'casino','good',20;'casino','bad',80};
cl.cc={'good',100;'bad',100};
format='Probability of ''matlab casino'' as a ''good'' document=%f';
disp(sprintf(format,cl.docprob('matlab casino','good')))
%% A Quick Introduction to Bayes' Theorem (Pages 125-126)
%
% Now that we know how to calculate Pr(document|category), we need to flip
% it around to get Pr(category|document) in order to classify new
% documents. That's what Bayes' Theorem is about.
%
% * Pr(A|B)=Pr(B|A)xPr(A)/Pr(B)
%
% Or, in our case,
%
% * Pr(category|document)=Pr(document|category)xPr(category)/Pr(document)
%
% Pr(document|category) is given by 'docprob'. Pr(category) is just the
% number of documents in the category divided by the total number of
% documents. Pr(document) is irrelevant in our case: because of the basic
% flaw in our naive Bayesian approach, we are not going to use the
% resulting probabilities directly. Instead we will calculate a score for
% each category separately and then compare the results. Pr(document) is
% the same in each of those separate calculations, so we can safely ignore
% the term.
%
% The 'prob' method calculates this score for each category by returning
% the product Pr(document|category) x Pr(category).
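%
% A sketch of that calculation ('catcount' and 'totalcount' are assumed
% helper names; the actual method may differ):
%
% catprob = obj.catcount(cat) / obj.totalcount();   % Pr(category)
% p = obj.docprob(item, cat) * catprob;             % Pr(document|category) x Pr(category)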
disp(' ')
disp('Reset the ''naivebayes'' object')
cl=naivebayes();
sampletrain(cl);
format='Probability of ''quick rabbit'' as a ''good'' document=%f';
disp(sprintf(format,cl.prob('quick rabbit','good')))
format='Probability of ''quick rabbit'' as a ''bad'' document=%f';
disp(sprintf(format,cl.prob('quick rabbit','bad')))
%% Choosing a Category (Pages 126-127)
%
% The final step in classification is to determine which category a new
% document belongs in. The simple approach is to assign it to the category
% with the highest probability. However, in reality, this may not be
% desirable behavior: you don't want to misclassify non-spam as spam just
% because its probability is marginally higher. So we assign a minimum
% threshold to each category, and a new item is placed in a category only
% when its probability is higher than that of the next best category by the
% specified factor.
%
% For example, the threshold for 'bad' could be set to '3', while it could
% be '1' for 'good'. This means any item whose 'good' probability is the
% highest, however small the margin, will be classified as 'good', while
% only items whose 'bad' probability is more than 3 times higher than any
% other will be classified as 'bad'; anything in between is 'unknown'.
%
% We need to modify the 'classifier' class to add a 'thresholds' property
% and a 'classify' method to implement this approach. The changes are also
% inherited by the 'naivebayes' subclass automatically, so we don't have to
% do anything to that subclass. This is the power of OOP.
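%
% The threshold logic in 'classify' presumably boils down to something like
% this sketch ('getthreshold' is an assumed counterpart of 'setthreshold'
% used below; 'cats' holds the known category names):
%
% probs = cellfun(@(c) cl.prob(item, c), cats);   % score for each category
% [pmax, imax] = max(probs);
% others = probs([1:imax-1, imax+1:end]);
% if all(pmax > others * cl.getthreshold(cats{imax}))
%     category = cats{imax};                      % clear winner
% else
%     category = default;                         % e.g. 'unknown'
% end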
disp(' ')
disp('Reset the ''naivebayes'' object')
cl=naivebayes();
sampletrain(cl);
format='Bayes classify ''quick rabbit''... Category=%s';
disp(sprintf(format,cl.classify('quick rabbit','unknown')))
format='Bayes classify ''quick money''... Category=%s';
disp(sprintf(format,cl.classify('quick money','unknown')))
disp(' ')
disp('Set the threshold for ''bad'' to ''3.0''')
cl.setthreshold('bad',3.0);
disp(sprintf(format,cl.classify('quick money','unknown')))
disp(' ')
disp('Run the ''sampletrain'' 10 times')
for i=1:10
    sampletrain(cl);
end
disp(sprintf(format,cl.classify('quick money','unknown')))
clear all;
%% The Fisher Method - Category Probabilities for Features (Pages 127-129)
%
% Here we will examine the Fisher Method, named after R. A. Fisher, as an
% alternative approach to document filtering. This method is used in
% SpamBayes, an Outlook plug-in written in Python, and is known to be
% particularly effective for spam filtering.
%
% In a naive Bayesian approach, we used the feature probabilities to
% determine the whole document probability. The Fisher Method is more
% elaborate, but it is worth learning because it offers much greater
% flexibility.
%
% In the Fisher Method, you start by calculating the probability of a
% document being in a category given the presence of a particular feature -
% Pr(category|feature). If 'casino' appears in 500 documents and 499 of
% those are in the 'bad' category, then 'casino' will score very close to 1
% for 'bad'.
%
% * Pr(category|feature)= (number of documents in this category with the
% feature)/(total number of documents with the feature)
%
% However, this formula doesn't account for an imbalance in the amount of
% data among different categories. If most of your documents are good
% except for a few bad ones, the features associated with the bad ones will
% score too high for 'bad', even though they may be perfectly legitimate
% words. So we need to normalize the probability calculation. The required
% quantities are:
%
% * clf = Pr(feature|category) for this category
% * freqsum = sum of Pr(feature|category) over all the categories
% * cprob = clf/freqsum
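%
% For our two categories, 'cprob' presumably reduces to this sketch:
%
% clf = cl.fprob(f, cat);                              % e.g. fprob('money','bad') = 0.5
% freqsum = cl.fprob(f, 'good') + cl.fprob(f, 'bad');  % sum over all categories
% p = clf / freqsum;                                   % = 1.0 for 'money' in 'bad'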
%
% If you run the following code, you will see that 'money' gets a
% probability of 1.0 for 'bad'. Again, it is too sensitive to the training
% data, so we should use the weighted-average approach again with the
% Fisher Method.
disp(' ')
disp('Reset the object with ''fisherclassifier''')
cl=fisherclassifier();
sampletrain(cl);
format='Probability of ''quick'' as a ''good'' document=%f';
disp(sprintf(format,cl.cprob('quick','good')))
format='Probability of ''money'' as a ''bad'' document=%f';
disp(sprintf(format,cl.cprob('money','bad')))
disp(' ')
disp('Use weighted average rather than cprob directly')
disp(sprintf(format,cl.weightedprob('money','bad',@cl.cprob)))
%% Combining the Probabilities (Pages 129-130)
%
% In our naive Bayesian approach, we combined the feature probabilities by
% multiplying them together to derive an overall document probability. The
% Fisher Method adds rigor to this process by applying additional steps to
% test the result of the multiplication.
%
% Fisher's research shows that if you take the natural log of the result
% and multiply it by -2, the resulting score should fit a chi-squared
% distribution when the probabilities are independent and random. We can
% use this knowledge as follows: if a document doesn't belong in a
% particular category, its feature probabilities for that category vary
% randomly; if it does belong in the category, its feature probabilities
% should be generally high - no longer random.
%
% By feeding the result of the Fisher calculation to the inverse chi-square
% function, you get the probability that a random set of probabilities
% would return such a high number. For further reading, check out:
% http://www.gigamonkeys.com/book/practical-a-spam-filter.html
%
% By the way, I made a small change to the 'invchi2' function:
%
% * Python: for i in range(1, df//2)
% * MATLAB: for i=1:floor(df/2)-1
%
% '//' is Python's integer-division operator, equivalent to 'floor' of the
% quotient in MATLAB. However, 'range(1,2)' returns '[1]', not [1 2] - the
% second, end-of-range argument is not included in the resulting vector. So
% I had to add '-1' to make the MATLAB version behave the same as the
% Python version. I am not sure whether this is strictly kosher from a
% statistics point of view.
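%
% Putting that together, the MATLAB version of 'invchi2' looks roughly like
% this (a sketch; the function used here may differ slightly):
%
% function p = invchi2(chi, df)
%     m = chi/2;
%     s = exp(-m);  term = exp(-m);
%     for i = 1:floor(df/2)-1
%         term = term * m/i;
%         s = s + term;
%     end
%     p = min(s, 1);
% end
%
% 'fisherprob' then presumably feeds it the combined probabilities:
% p = invchi2(-2*log(prod(probs)), 2*numel(probs));   % probs = weighted 'cprob' values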
%
% Anyhow, we can see that the Fisher Method is more sophisticated and
% robust than our naive Bayesian alternative, and therefore it is expected
% to perform better.
disp(' ')
disp('Reset the ''fisherclassifier'' object')
cl=fisherclassifier();
sampletrain(cl);
format='''cprob'' of ''quick'' as a ''good'' document=%f';
disp(sprintf(format,cl.cprob('quick','good')))
format='''fisherprob'' of ''quick rabbit'' as a ''good'' document=%f';
disp(sprintf(format,cl.fisherprob('quick rabbit','good')))
format='''fisherprob'' of ''quick rabbit'' as a ''bad'' document=%f';
disp(sprintf(format,cl.fisherprob('quick rabbit','bad')))
%% Classifying Items (Pages 130-131)
%
% Because 'fisherprob' returns a probability value pre-qualified by the
% inverse chi-square function, we can use it directly in classification,
% rather than using multiplicative thresholds as in our naive Bayesian
% classifier. Instead, we specify a minimum for each category. For a spam
% filter, you may want to set this minimum high for 'bad', such as '0.6',
% while the minimum for 'good' can be as low as '0.2', in order to avoid
% accidentally misplacing good e-mails in the 'bad' category. Anything that
% doesn't clear any category's minimum is classified as 'unknown'.
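%
% A sketch of that selection logic ('getminimum' is an assumed counterpart
% of 'setminimum' used below; 'cats' holds the known category names):
%
% best = default;  pmax = 0;                  % default is e.g. 'unknown'
% for k = 1:length(cats)
%     p = cl.fisherprob(item, cats{k});
%     if p > cl.getminimum(cats{k}) && p > pmax
%         best = cats{k};  pmax = p;
%     end
% end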
%
% The results we will see may not be much different from the output of our
% naive Bayesian classifier because of the small training dataset, but the
% Fisher Method is believed to perform better in real-world situations.
disp(' ')
disp('Reset the ''fisherclassifier'' object')
cl=fisherclassifier();
sampletrain(cl);
format='Fisher classify ''quick rabbit''... Category=%s';
disp(sprintf(format,cl.classify('quick rabbit','unknown')))
format='Fisher classify ''quick money''... Category=%s';
disp(sprintf(format,cl.classify('quick money','unknown')))
disp(' ')
disp('Set the minimum for ''bad'' to ''0.8''')
cl.setminimum('bad',0.8);
disp(sprintf(format,cl.classify('quick money','unknown')))
disp(' ')
disp('Set the minimum for ''good'' to ''0.4''')
cl.setminimum('good',0.4);
disp(sprintf(format,cl.classify('quick money','unknown')))
%% Persisting the Trained Classifiers (Pages 132-133)
%
% The book covers the process of adapting the 'classifier' class to work
% with SQLite, a lightweight, serverless database. Python 2.5 or later
% comes with SQLite built in. It is used widely in desktop applications as
% well as in small embedded applications. See http://www.sqlite.org/ for
% more information.
%
% So far, the classifiers we created need to be trained before you can use
% them, and you need to repeat this process each time you start a MATLAB
% session, because we haven't provided a way to persist the data or the
% object. In MATLAB, you can actually save a whole classifier object to a
% .mat file. Python doesn't provide such functionality built in, so it
% makes sense for the book to add SQLite support precisely for this reason.
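%
% For example, a trained classifier object can be written to disk with
% 'save' and brought back with 'load' (the file name here is just an
% example):
%
% save('trainedclassifier.mat', 'cl');   % persist the trained object
% clear cl;                              % e.g. a new MATLAB session
% load('trainedclassifier.mat');         % restores the variable 'cl', still trained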
%
% There is a nice library called
% <http://mksqlite.berlios.de/mksqlite_eng.html mksqlite> that connects
% MATLAB to SQLite.
% "classifier2.m" shows the modified classifier that uses mksqlite to
% persist the trained object.
%
% Here is a simple script to test the modified classifier.
% text1='the quick brown fox jumps over the lazy dog';
% text2='make quick money in the online casino';
%
% cl=classifier2();
% cl.setdb('test1.db');
% cl.train(text1, 'good');
% cl.train(text2, 'bad');
%
% disp(sprintf('Count of ''quick'' classified as ''good''=%d', cl.fcount('quick', 'good')))
% disp(sprintf('Count of ''quick'' classified as ''bad''=%d', cl.fcount('quick', 'bad')))
% mksqlite(cl.con, 'close');
%
% clear all
%% Filtering Blog Feeds - interactive training (Pages 134-135)
%
% As a non-spam application of the classifiers we created, the book
% introduces a program that filters RSS feeds from various blogs. RSS was
% one of the important drivers of Web 2.0, but it now floods cyberspace,
% overwhelming people with more information than they can process.
%
% It takes advantage of an RSS parsing library called 'Universal Feed
% Parser' to build the documents to feed to the classifier, presents the
% text to the user along with a guess, and then asks the user for the
% correct answer. You can pick from multiple categories beyond 'good' or
% 'bad'. This user input is then used for training.
%
% Since there is no 'Universal Feed Parser' for MATLAB, I wrote a very
% simple parser inside the 'feedfilter' function. It only works with the
% sample XML file provided here, 'python_search.xml', or any RSS feed that
% follows its format exactly. It doesn't work with other RSS feed formats.
%
% I improved the usability a bit - you can simply accept a guess by hitting
% return without entering text. You can also end the session by entering
% 'end' at the prompt at any time. As you go through the training, you will
% notice that the guesses get better and better.
% Test 'feedfilter' function
% cl=fisherclassifier();
% feedfilter(cl);
%% Filtering Blog Feeds - the training results (Page 136)
%
% If you would rather skip the tedious interactive training, you can load
% the pre-made dataset. The probabilities of the various categories given
% the word 'python' are evenly divided, because this word appears in all
% entries. The word 'eric' occurs in 25 percent of the entries related to
% Monty Python and doesn't occur at all in the other categories. Therefore
% the probability of 'monty' given 'eric' is 100%, while the probability of
% 'eric' given 'monty' is 25% (well, in the way I classified the entries,
% it comes out closer to 28%).
load feedfilter.mat;
disp(' ')
format='Probability (w->c) of ''python'' in ''prog'' = %f';
disp(sprintf(format,cl.cprob('python','prog')))
format='Probability (w->c) of ''python'' in ''snake'' = %f';
disp(sprintf(format,cl.cprob('python','snake')))
format='Probability (w->c) of ''python'' in ''monty'' = %f';
disp(sprintf(format,cl.cprob('python','monty')))
format='Probability (w->c) of ''eric'' in ''monty'' = %f';
disp(sprintf(format,cl.cprob('eric','monty')))
format='Probability (c->w) of ''eric'' in ''monty'' = %f';
disp(sprintf(format,cl.fprob('eric','monty')))
clear all;
%% Improving Feature Detection (Pages 136-138)
%
% Our 'getwords' function is very simple. It ignores all non-alphabetic
% characters and converts everything to lowercase. There are many ways to
% improve it:
%
% * Use the presence of many uppercase words as a feature, without
% distinguishing uppercase and lowercase words from each other (a sketch of
% this idea appears below).
% * Use sets of words in addition to individual words.
% * Capture more meta information, such as who sent an e-mail message or
% what category a blog entry was posted under, and annotate it as
% meta information.
% * Keep URLs and numbers intact.
%
% 'entryfeatures' works on an RSS entry and adds those capabilities.
% 'feedfilter' is also modified to pass an RSS entry to the
% feature-detection function as a cell array rather than as a text string
% when the function name is 'entryfeatures'.
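%
% As an illustration of the first idea above (a sketch only; the actual
% 'entryfeatures' may implement it differently), a "virtual" feature can be
% added when a large fraction of the words in an entry are uppercase:
%
% ucwords = regexp(text, '\<[A-Z][A-Z]+\>', 'match');   % all-caps words
% words = getwords(text);
% if numel(ucwords) > 0.3*numel(words)
%     features{end+1} = 'VIRTUAL_UPPERCASE';   % hypothetical virtual feature name
% end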
% cl=fisherclassifier(@entryfeatures);
% feedfilter(cl);
%% Using Akismet (Pages 138-139)
%
% Akismet http://akismet.com/ is another web service that provides a spam
% filtering function online via its XML-based API. This means that we can
% take advantage of the collective power of online users to fight spam, so
% that you don't have to train a spam filter on your own.
%
% It is primarily designed for filtering out spam comments posted on blogs,
% so it doesn't work very well for other uses. Therefore I am not going to
% cover it here.
%
% But this is a good example of how Web 2.0 takes advantage of collective
% intelligence.
%% Alternative Methods (Pages 139-140)
%
% The book discusses other approaches such as neural networks and support
% vector machines. However, the Bayesian approaches we covered here are
% easier to implement and demand less computing power and fewer resources.