Nerchuko is a library of Machine Learning algorithms written in Clojure. Nerchuko presently focuses on Machine Learning for textual data.
Apart from the core Machine Learning algorithms, Nerchuko includes several helper functions that are useful when working with them, for example functions for preparing datasets, feature selection, and cross-validation.
Please note that Nerchuko is under active development. There may be bugs and the API may change without notice.
The API documentation can be found here: http://sids.github.com/nerchuko.
Nerchuko is hosted on Clojars. You can find the instructions for adding it as a dependency to your projects here: http://clojars.org/nerchuko.
Simply add the Nerchuko jar along with the jars of all the dependencies to your classpath and you are good to go. See below for instructions on building the Nerchuko jar. Nerchuko's dependencies are:
If you have git installed on your system, use the following command to get the Nerchuko source code:
git clone git://github.com/sids/nerchuko.git
Otherwise, you can download the source code from here: http://github.com/sids/nerchuko/tarball/master.
You will need Leiningen (lein) installed to build Nerchuko from source. Build the Nerchuko jar using the following commands:
cd nerchuko
lein jar
Nerchuko's classification capabilities can be accessed through nerchuko.classification. Documentation for the namespace provides a simple example of how to use it. For a more elaborate example, look at the 20 Newsgroups example.
The nerchuko.classification namespace also includes other functions that are useful for classification tasks: n-fold cross-validation, and producing, manipulating, and printing confusion matrices. More helper functions can be found in the nerchuko.helpers namespace.
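To make the ideas of n-fold cross-validation and confusion matrices concrete, here is a minimal sketch in plain Clojure. It does not use Nerchuko's API: the names n-folds, confusion-matrix and accuracy are hypothetical and chosen only for illustration, and the functions in nerchuko.classification may well have different names, arguments and return values. Consult the API documentation for the real ones.

```clojure
;; Illustrative sketch only: these functions are NOT part of Nerchuko.

(defn n-folds
  "Splits coll into n folds by dealing items out round-robin.
  Returns a seq of [training test] pairs, one pair per fold."
  [n coll]
  (let [folds (vec (for [i (range n)]
                     (take-nth n (drop i coll))))]
    (for [i (range n)]
      ;; Training set: every fold except fold i. Test set: fold i.
      [(apply concat (concat (subvec folds 0 i) (subvec folds (inc i))))
       (nth folds i)])))

(defn confusion-matrix
  "Counts [actual predicted] pairs into a nested map of the form
  {actual {predicted count}}."
  [pairs]
  (reduce (fn [m [actual predicted]]
            (update-in m [actual predicted] (fnil inc 0)))
          {}
          pairs))

(defn accuracy
  "Fraction of pairs whose predicted class equals the actual class."
  [matrix]
  (let [total   (reduce + (mapcat vals (vals matrix)))
        correct (reduce + (for [[actual row] matrix]
                            (get row actual 0)))]
    (if (zero? total) 0 (/ correct total))))
```

A typical evaluation loop would train a classifier on each training set, classify the matching test set, collect the [actual predicted] pairs, and pass them to confusion-matrix.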
When working on text classification, functions in the nerchuko.text.helpers namespace might be useful.
Nerchuko includes implementations for the following classifiers:
Nerchuko's feature selection capabilities can be accessed through nerchuko.feature-selection. Documentation for the namespace provides a simple example of how to use it. For a more elaborate example, look at the 20 Newsgroups example.
Nerchuko includes implementations for the following feature selection techniques:
Look in the examples/ directory for some examples demonstrating the usage of Nerchuko. These examples use Nerchuko to work with some standard machine learning datasets. This is currently the best way to learn to use Nerchuko.
You can run the examples using the command:
lein run-example
This will print out a short help with instructions on running specific examples.
The 20 Newsgroups data set is a collection of approximately 20,000 newsgroup documents, partitioned (nearly) evenly across 20 different newsgroups.
This is a simple example that demonstrates using Nerchuko for text classification/categorization.
Download the data set from the above link and then run this using the command:
lein run-example newsgroups
The "spam" concept is diverse: advertisements for products/web sites, make money fast schemes, chain letters, pornography...
Our collection of spam e-mails came from our postmaster and individuals who had filed spam. Our collection of non-spam e-mails came from filed work and personal e-mails, and hence the word 'george' and the area code '650' are indicators of non-spam. These are useful when constructing a personalized spam filter. One would either have to blind such non-spam indicators or get a very wide collection of non-spam to generate a general purpose spam filter.
Although this might seem like another text classification example, the text has already been preprocessed, so we are presented with a numeric data set.
Download the data set from the above link and then run this using the command:
lein run-example spambase
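As a rough illustration of what "a numeric data set" means here, the sketch below parses one line of Spambase data in plain Clojure. It assumes the layout described in the data set's own documentation (comma-separated numeric attributes, with the last column being the class: 1 for spam, 0 for non-spam). The function name parse-spambase-line and the :features/:label keys are hypothetical; the spambase example bundled with Nerchuko may load the data differently.

```clojure
;; Illustrative sketch only: not taken from Nerchuko's spambase example.
(require '[clojure.string :as str])

(defn parse-spambase-line
  "Parses one comma-separated line into a map holding a vector of
  numeric features and a class label. Assumes the last column is the
  class (1 = spam, 0 = non-spam)."
  [line]
  (let [values (mapv #(Double/parseDouble %)
                     (str/split (str/trim line) #","))]
    {:features (pop values)                               ; all but the last column
     :label    (if (== 1.0 (peek values)) :spam :non-spam)}))
```

For instance, calling (parse-spambase-line "0.5,0,1.25,1") yields a three-element feature vector and the label :spam.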
Copyright (C) 2010 Siddhartha Reddy.
Distributed under the Apache License Version 2.0. See the file LICENSE.