Open source library of Machine Learning algorithms.
Mahout is Hindi for elephant driver, i.e. it drives Hadoop, whose mascot is an elephant.
- Hadoop is not strictly necessary (can run in local mode without Hadoop)
- version of Hadoop varies by Mahout version
- $JAVA_HOME environment variable must be set
- hadoop executable must be in $PATH if using Hadoop
- ratings data must be in CSV format
- users and items must be integers
- malformed ratings can skew predictions (XXX: strongly validate input data)
- a pre-existing output or temp directory causes the job to fail late rather than fail fast (existence is not checked up front. XXX: write checks for this)
- only input data files should be in the input directory, stray files can result in ArrayIndexOutOfBoundsException
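The input requirements above can be checked before a job is submitted. The following is a minimal sketch rather than anything shipped with Mahout: `validate_ratings` is a hypothetical helper name, and it assumes each line is either `user,item` or `user,item,preference` with integer IDs and a plain decimal preference. Blank lines and lines with trailing whitespace or carriage returns are reported as malformed.

```shell
# validate_ratings FILE
# Hypothetical helper (not part of Mahout).
# Prints every malformed line to stderr and returns non-zero if any were found.
validate_ratings() {
    awk -F, '
        # must have 2 or 3 fields, and user and item must be integers
        NF < 2 || NF > 3 || $1 !~ /^-?[0-9]+$/ || $2 !~ /^-?[0-9]+$/ {
            bad = 1
        }
        # the preference, when present, must be a plain decimal number
        NF == 3 && $3 !~ /^-?[0-9]+(\.[0-9]+)?$/ {
            bad = 1
        }
        bad {
            printf "%s:%d: malformed rating: %s\n", FILENAME, FNR, $0 > "/dev/stderr"
            errors++
            bad = 0
        }
        END { exit (errors > 0) }
    ' "$1"
}
```

Running `validate_ratings ratings.csv` on each file before uploading it would catch bad rows early. The file name here is only an example.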
yum install mahout
To run in local mode without Hadoop, set the MAHOUT_LOCAL environment variable to any value:
export MAHOUT_LOCAL=true
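The prerequisites noted earlier ($JAVA_HOME, hadoop in $PATH, no pre-existing output or temp directory) could be checked up front so the job fails fast. This is a sketch under stated assumptions, not an official Mahout feature: `preflight` is a hypothetical function name, and the default temp directory name `temp` is an assumption that should be replaced with whatever temp directory the job actually uses. In Hadoop mode it relies on `hadoop fs -test -e` to probe HDFS.

```shell
# preflight OUTPUT_DIR [TEMP_DIR]
# Hypothetical helper (not part of Mahout).
# Returns non-zero with a message on stderr if a prerequisite is not met.
preflight() {
    local output_dir="$1"
    local temp_dir="${2:-temp}"   # assumption: default temp directory name
    local dir

    if [ -z "${JAVA_HOME:-}" ]; then
        echo "JAVA_HOME is not set" >&2
        return 1
    fi

    # hadoop is only required when not running in local mode
    if [ -z "${MAHOUT_LOCAL:-}" ] && ! command -v hadoop >/dev/null 2>&1; then
        echo "hadoop executable not found in PATH" >&2
        return 1
    fi

    for dir in "$output_dir" "$temp_dir"; do
        if [ -n "${MAHOUT_LOCAL:-}" ]; then
            [ -e "$dir" ] || continue              # local filesystem
        else
            hadoop fs -test -e "$dir" || continue  # HDFS
        fi
        echo "'$dir' already exists, remove it before running the job" >&2
        return 1
    done
    return 0
}
```

A typical use would be to chain it in front of the job, for example `preflight recs && mahout recommenditembased ...`.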
cat > users.txt <<EOF
6037
6038
6039
6040
EOF
hadoop fs -put users.txt
Always use --booleanData for binary preferences with Tanimoto / LogLikelihood similarity.
The schema is always the same, so Mahout assumes it and just works.
For binary preferences:
userid1,trueitem1
userid1,trueitem2
For numeric preferences:
user,item,preference
--usersFile, --itemsFile - only recommend for this list of users / items
--filterFile - exclude user,item pairs from recommendations
mahout recommenditembased --input movierating --output recs --usersFile users.txt --similarityClassname SIMILARITY_LOGLIKELIHOOD --booleanData
Other values for the --similarityClassname option:
SIMILARITY_TANIMOTO_COEFFICIENT --booleanData
SIMILARITY_EUCLIDEAN_DISTANCE
SIMILARITY_COSINE
SIMILARITY_PEARSON_CORRELATION
Output format, one line per user:
user_id [item1:score1, ... itemN:scoreN]
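Recommendation lines in the `user_id [item1:score1, ... itemN:scoreN]` shape are awkward to load into other tools, so it can help to flatten them into one `user,item,score` row per recommendation. The sketch below assumes the user ID is the first whitespace-separated field and that the pairs inside the brackets are comma-separated. `flatten_recs` is a hypothetical helper name, not a Mahout command.

```shell
# flatten_recs [FILE...]
# Hypothetical helper (not part of Mahout).
# Reads recommendation lines from FILEs or stdin and prints user,item,score rows.
flatten_recs() {
    awk '
        /\[.*\]/ {                      # only lines containing a bracketed list
            user = $1
            list = $0
            sub(/^[^[]*\[/, "", list)   # drop everything up to and including "["
            sub(/\].*$/, "", list)      # drop "]" and anything after it
            n = split(list, pairs, /,[ \t]*/)
            for (i = 1; i <= n; i++) {
                split(pairs[i], kv, ":")
                print user "," kv[1] "," kv[2]
            }
        }
    ' "$@"
}
```

Users with an empty list produce no rows. In Hadoop mode the output could be piped in, for example `hadoop fs -cat 'recs/part-*' | flatten_recs`, where the part file naming is an assumption about how the job writes its output.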
Ported from private Knowledge Base pages 2013+