home | copyright ©2019, [email protected]

Homework 9: Big Data

Let's see if we can learn from large data sets without holding all that data.

Same as homework 8 but now:

  • Read in the first 5000 rows, randomize the order
  • Do unsupervised clustering on the first 500 rows
  • Then, one at a time, dribble in the remaining 4500 rows and only update kids if the new data is anomalous
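
The read, shuffle, and split steps above can be sketched as follows. This is a minimal illustration, not the course's reference code: the function name `split_stream`, its parameters, and the toy rows are all assumptions made for the example.

```python
import random
from itertools import islice

def split_stream(rows, n_total=5000, n_seed=500, seed=1):
    """Take the first n_total rows, shuffle them, and return
    (seed_rows, stream): the rows to cluster up front and the
    rows to dribble in afterwards. All names here are hypothetical."""
    head = list(islice(rows, n_total))   # works for lists and file iterators
    random.Random(seed).shuffle(head)    # seeded so runs are repeatable
    return head[:n_seed], head[n_seed:]

# Toy data standing in for a real data set.
rows = [[i, i * 2] for i in range(6000)]
seed_rows, stream = split_stream(rows)
print(len(seed_rows), len(stream))  # 500 4500
```

The first 500 rows would go to the unsupervised clusterer; the remaining 4500 would then be handed to the tree one at a time.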

Engineering tips:

  • Make the nodes of your tree "smart".
    • They get new rows, one at a time
      • and only if they are anomalous might they get pushed to subtrees
    • When a node has collected enough rows, it knows to make its own sub-tree.
  • Define "anomaly" using the pivots
    • Have a magic constant α=0.5
    • If the cosine distance from east to west is c,
    • then if a new row is at distances a and b from east and west, it falls at distance x along c
      • x = (a^2 + c^2 - b^2) / (2c)
    • And if the sub-trees are being split at s
      • if s < 0.5
        • then far = s*α and anomalous is x < far
      • else
        • then far = s + (1-s)*α and anomalous is x > far
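
The anomaly rule above might be coded like this. It is a sketch that follows the bullets as written; the function names `project` and `is_anomalous` are hypothetical, and it assumes that x and s are scaled to the range 0..1 (here, by dividing x by c), since the `1-s` term only makes sense on that scale.

```python
def project(a, b, c):
    """Cosine rule: distance x along the east-west line (length c)
    for a row at distance a from east and b from west."""
    return (a**2 + c**2 - b**2) / (2 * c)

def is_anomalous(x, s, alpha=0.5):
    """Apply the rule from the list above.
    Assumes x and s are both scaled to 0..1 (an assumption of this sketch)."""
    if s < 0.5:
        far = s * alpha
        return x < far
    far = s + (1 - s) * alpha
    return x > far

# A row 3 units from east and 4 units from west, with pivots 5 apart.
a, b, c = 3.0, 4.0, 5.0
x = project(a, b, c) / c          # scale to 0..1
print(x, is_anomalous(x, s=0.4))
```

Larger values of α move `far` closer to the split point, so more rows count as anomalous; smaller values move it toward the pivots, so fewer rows do.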

To assess the results:

For two large datasets (xomo10000 and pom310000)

  • Build a tree using all data (as in prior homeworks).
  • 100 times, select a row at random from a leaf cluster; these rows are the probes
    • Tag each of these probes with the BEFORE values
      • size, mean and standard deviation of the performance scores in their leaf cluster
  • 20 times, rebuild the trees using all the data
    • Find the probes
    • Tag the probes with the AFTER values:
      • size, mean and standard deviation of the performance scores in their leaf cluster
    • Using the code at https://gist.github.com/timm/33578871be53e604da83679dc7ccbcc5, report how often these probes land on the same distributions in AFTER as in BEFORE, i.e. how often the Num.same test passes.
  • Let baseline be the mean same score found in the above 20 repeats.
  • Now 20 times repeat:
    • build the trees incrementally
    • Compute the same score (using the same 100 probes as used above)
  • Write a table showing the same scores seen when building with all the data and when building incrementally
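
One way to organize the scoring above is sketched below. Everything named here is an assumption for illustration: probes are summarized as `(size, mean, sd)` tuples, `same_score` is a hypothetical helper, and `close_enough` is only a crude placeholder, not the Num.same test from the gist, which should be plugged in instead.

```python
from statistics import mean

def close_enough(before, after, tol=0.35):
    """Placeholder for Num.same (NOT the gist's test): means differ
    by no more than tol times the larger standard deviation."""
    _, mu1, sd1 = before
    _, mu2, sd2 = after
    return abs(mu1 - mu2) <= tol * max(sd1, sd2)

def same_score(befores, afters, same=close_enough):
    """Fraction of probes whose AFTER summary matches their BEFORE summary."""
    hits = sum(1 for b, a in zip(befores, afters) if same(b, a))
    return hits / len(befores)

# Two toy probes, each tagged (size, mean, sd), and two rebuilds.
befores = [(10, 5.0, 1.0), (12, 8.0, 2.0)]
rebuilds = [
    [(11, 5.1, 1.1), (9, 12.0, 2.0)],
    [(10, 5.0, 1.0), (12, 8.2, 2.0)],
]
scores = [same_score(befores, afters) for afters in rebuilds]
baseline = mean(scores)
print(f"{'method':<12}{'same':>6}")
print(f"{'all':<12}{baseline:>6.2f}")
```

The incremental runs would add a second row to the table, computed the same way over the same probes.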

Write a file report.txt commenting on how much α affects the same score.

What could go wrong

The baseline score is very low (in which case the random projections are finding wildly different clusters).

  • If that happens, spend more time finding the pivots (i.e. lessen the "random" in the random projections)
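
One possible way to spend more time on the pivots is to repeat the usual "far point" search several times and keep the widest pair. The sketch below is an assumption, not a prescribed method: `pick_pivots`, `farthest`, the `tries` parameter, and the Euclidean `dist` are all hypothetical choices made for illustration.

```python
import math
import random

def dist(r1, r2):
    """Euclidean distance between two numeric rows (an assumed metric)."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(r1, r2)))

def farthest(row, rows):
    """The row in rows that is most distant from row."""
    return max(rows, key=lambda r: dist(row, r))

def pick_pivots(rows, tries=10, rng=None):
    """Repeat the far-point heuristic `tries` times; return the
    (c, east, west) triple with the largest separation c."""
    rng = rng or random.Random(1)
    best = None
    for _ in range(tries):
        east = farthest(rng.choice(rows), rows)  # far from a random row
        west = farthest(east, rows)              # far from east
        c = dist(east, west)
        if best is None or c > best[0]:
            best = (c, east, west)
    return best

rows = [[0, 0], [1, 1], [2, 2], [10, 10]]
c, east, west = pick_pivots(rows)
print(round(c, 3), sorted([east, west]))
```

More tries cost more distance calculations, but they make it less likely that two rebuilds pick very different east and west rows.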