Mahout

Open-source library of machine learning algorithms.

Mahout is Hindi for "elephant driver", i.e. it drives Hadoop, whose mascot is an elephant.

Key Points

  • Hadoop is not strictly necessary (Mahout can run in local mode without it)
  • the supported Hadoop version varies by Mahout version
  • the $JAVA_HOME environment variable must be set
  • the hadoop executable must be in $PATH if using Hadoop
  • ratings data must be in CSV format
  • user and item IDs must be integers
  • malformed ratings can skew predictions (XXX: strongly validate input data)
  • an existing output or temp directory makes the job fail late rather than fast, because its existence is not checked up front (XXX: write checks for this)
  • the input directory should contain only input data files; stray files can result in ArrayIndexOutOfBoundsException

yum install mahout

To run in local mode without Hadoop, set the MAHOUT_LOCAL environment variable to any value:

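export MAHOUT_LOCAL=true

# create the list of user IDs to generate recommendations for
cat > users.txt <<EOF
6037
6038
6039
6040
EOF
# copy the users file to HDFS when running on Hadoop
hadoop fs -put users.txt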

Always use --booleanData for binary preferences with the Tanimoto / LogLikelihood similarity measures.

Input

The schema is always the same, so Mahout assumes it and just works.

For binary preferences:

userid1,trueitem1
userid1,trueitem2

For numeric preferences:

user,item,preference
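The Key Points call for strong validation of the ratings data. A minimal sketch of such a check follows. It assumes strictly comma-separated lines, non-negative integer IDs and, for numeric preferences, a plain decimal third field. validate_ratings is a hypothetical helper, not part of Mahout, and its rules may be stricter than what Mahout itself accepts.

```shell
#!/usr/bin/env bash
# validate_ratings FILE [boolean]
# Hypothetical helper (not part of Mahout): prints "FILE:LINE: reason" for
# every malformed line and returns non-zero if any were found.
# Assumes 2 fields for boolean data, 3 fields for numeric preferences.
validate_ratings() {
    local file="$1" mode="${2:-numeric}"
    awk -F',' -v mode="$mode" -v file="$file" '
        BEGIN { want = (mode == "boolean") ? 2 : 3 }
        {
            reason = ""
            if (NF != want)
                reason = "expected " want " fields, got " NF
            else if ($1 !~ /^[0-9]+$/)
                reason = "user ID is not an integer"
            else if ($2 !~ /^[0-9]+$/)
                reason = "item ID is not an integer"
            else if (want == 3 && $3 !~ /^-?[0-9]+(\.[0-9]+)?$/)
                reason = "preference is not numeric"
            if (reason != "") {
                printf "%s:%d: %s\n", file, NR, reason
                bad++
            }
        }
        END { exit (bad > 0) }
    ' "$file"
}
```

Usage: validate_ratings ratings.csv for numeric preferences, or validate_ratings ratings.csv boolean for user,item pairs. Blank lines and lines with Windows line endings are reported as malformed.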
  • --usersFile, --itemsFile - only recommend for this list of users / items
  • --filterFile - exclude user,item pairs from recommendations

mahout recommenditembased --input movierating --output recs --usersFile users.txt --similarityClassname SIMILARITY_LOGLIKELIHOOD --booleanData

Other values for the --similarityClassname option:

SIMILARITY_TANIMOTO_COEFFICIENT (use with --booleanData)
SIMILARITY_EUCLIDEAN_DISTANCE
SIMILARITY_COSINE
SIMILARITY_PEARSON_CORRELATION
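The Key Points note that an existing output or temp directory, or a stray file in the input directory, is only discovered after the job has started. A pre-flight check along these lines could catch both earlier. This is a sketch for local mode (MAHOUT_LOCAL set) that inspects the local filesystem only. preflight is a hypothetical helper, and the default temp directory name (temp) and data file pattern (*.csv) are assumptions to adjust.

```shell
#!/usr/bin/env bash
# preflight INPUT_DIR OUTPUT_DIR [TEMP_DIR] [PATTERN]
# Hypothetical helper (not part of Mahout). Returns non-zero if the job
# is likely to fail late. TEMP_DIR and PATTERN defaults are assumptions.
preflight() {
    local input="$1" output="$2" temp="${3:-temp}" pattern="${4:-*.csv}"
    local rc=0 path name

    # Output and temp directories must not exist before the job starts.
    for path in "$output" "$temp"; do
        if [ -e "$path" ]; then
            echo "already exists, remove it first: $path" >&2
            rc=1
        fi
    done

    if [ ! -d "$input" ]; then
        echo "input directory not found: $input" >&2
        return 1
    fi

    # Flag entries in the input directory that do not look like data files.
    for path in "$input"/*; do
        [ -e "$path" ] || continue   # empty directory: the glob did not match
        name="${path##*/}"
        case "$name" in
            $pattern)
                [ -f "$path" ] || { echo "not a regular file: $path" >&2; rc=1; }
                ;;
            *)
                echo "unexpected file in input directory: $path" >&2
                rc=1
                ;;
        esac
    done
    return "$rc"
}
```

Usage: preflight movierating recs && mahout recommenditembased ... When running on Hadoop rather than in local mode, the existence tests would need to go through HDFS instead, for example with hadoop fs -test -e PATH.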

Output

user_id [item1:score1, ... itemN:scoreN]
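For loading recommendations into other tools, the bracketed list above can be flattened into one user,item,score row per recommendation. A minimal sketch, assuming each line holds a user ID, whitespace (space or tab), then a comma-separated list of item:score pairs in square brackets; flatten_recs is a hypothetical helper, not part of Mahout:

```shell
#!/usr/bin/env bash
# flatten_recs: reads recommendation lines on stdin and writes
# "user,item,score" rows to stdout. Hypothetical helper (not part of Mahout).
flatten_recs() {
    awk '
        {
            user  = $1
            open  = index($0, "[")
            close_pos = index($0, "]")
            # skip lines without a well-formed [ ... ] list
            if (open == 0 || close_pos < open) next
            list = substr($0, open + 1, close_pos - open - 1)
            n = split(list, entries, ",")
            for (i = 1; i <= n; i++) {
                gsub(/^ +| +$/, "", entries[i])   # trim spaces around each pair
                if (split(entries[i], kv, ":") == 2)
                    printf "%s,%s,%s\n", user, kv[1], kv[2]
            }
        }
    '
}
```

Usage: flatten_recs < part-file > recs.csv, where part-file stands for one of the job's output files. A line such as 6037 [1:4.5,2:3.9] becomes the two rows 6037,1,4.5 and 6037,2,3.9.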

Ported from private Knowledge Base pages 2013+