copyright ©2019, [email protected]
Let's see if we can learn from large data sets without holding all that data.
Same as homework 8 but now:
- Read in the first 5000 rows, randomize the order
- Do unsupervised clustering on the first 500 rows
- Then, one at a time, dribble in the remaining 4500 rows and only update kids (sub-trees) if the new data is anomalous
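The read, shuffle, and split steps above might be sketched as follows. This is a minimal sketch only: `split_stream`, its parameters, and the commented-out `cluster` and `tree.add` calls are hypothetical names, standing in for whatever your homework 8 code provides.

```python
import random
from itertools import islice

def split_stream(rows, warmup=500, limit=5000, seed=1):
    """Take the first `limit` rows, shuffle them, and return
    (warm-up rows for the initial clustering, iterator over the rest)."""
    head = list(islice(rows, limit))      # only the first `limit` rows are kept
    random.Random(seed).shuffle(head)     # randomize the order, reproducibly
    return head[:warmup], iter(head[warmup:])

# Stand-in data; real rows would come from the data file.
rows = [[i, i * 2] for i in range(6000)]
first, rest = split_stream(rows)

# tree = cluster(first)      # hypothetical: the homework 8 clusterer
# for row in rest:           # dribble in the remaining rows one at a time
#     tree.add(row)          # hypothetical: update only if row is anomalous
```

Fixing the seed makes the shuffle repeatable, which helps when comparing runs.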
Engineering tips:
- Make the nodes of your tree "smart".
- They get new rows, one at a time
- and only if they are anomalous might they get pushed to subtrees
- When a node has collected enough rows, it knows to make its own sub-tree.
- Define "anomaly" using the pivots
- Have a magic constant α=0.5
- Let the cosine distance from east to west be c
- Then if a new row is at distances a and b from east and west, it falls at distance x along c
- x = (a^2 + c^2 - b^2) / (2c)
- And if the sub-trees are being split at s
- if s < 0.5, then far = s*α and the row is anomalous if x < far
- else far = s + (1-s)*α and the row is anomalous if x > far
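The anomaly rule above can be written as a small function. This is a sketch under one assumption the text does not state: that a, b, c, x and the split point s are all on the same 0..1 scale, so that x can be compared to far directly. The names `project` and `is_anomalous` are made up for illustration.

```python
def project(a, b, c):
    """Distance x along the east-west line of length c, for a row at
    distance a from east and b from west (the cosine rule)."""
    return (a**2 + c**2 - b**2) / (2 * c)

def is_anomalous(a, b, c, s, alpha=0.5):
    """True if the row falls beyond `far`, on the outer side of the split s.
    Assumes x and s share the same 0..1 scale."""
    x = project(a, b, c)
    if s < 0.5:
        far = s * alpha                 # threshold between east and the split
        return x < far
    far = s + (1 - s) * alpha           # threshold between the split and west
    return x > far

# A row very close to east, with the split at s=0.4: far is 0.2
print(is_anomalous(0.1, 0.95, 1.0, s=0.4))   # x is about 0.054, so anomalous
# A row close to the split: not anomalous
print(is_anomalous(0.3, 0.8, 1.0, s=0.4))    # x is 0.225, inside far
```

Smaller α moves far toward the pivot, so fewer rows count as anomalous and the sub-trees are updated less often.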
To assess the results:
For two large datasets (xomo10000 and pom310000)
- Build a tree using all data (as in prior homeworks).
- 100 times, select a row at random from a leaf cluster (these rows are the probes)
- Tag each of these probes with the BEFORE values
- size, mean and standard deviation of the performance scores in their leaf cluster
- 20 times, rebuild the trees using all the data
- Find the probes
- Tag the probes with the AFTER values:
- size, mean and standard deviation of the performance scores in their leaf cluster
- Using the code at https://gist.github.com/timm/33578871be53e604da83679dc7ccbcc5, report how often these probes land on the same distributions in AFTER as in BEFORE
- i.e. when that code's test passes.
- Let baseline be the mean same score found in the above 20 repeats.
- Now 20 times repeat:
- build the trees incrementally
- Compute the same score (using the same 100 probes as used above)
- Write a table comparing the same score seen when building with all the data against the same score seen when building incrementally
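The bookkeeping for the steps above could look like the sketch below. It does not reproduce the test from the gist: the `same` argument is a slot for that test, and the `placeholder` comparator used here is only a stand-in so the example runs. The names `tag` and `same_score` are hypothetical.

```python
from statistics import mean, stdev

def tag(scores):
    """Summarize the performance scores of a probe's leaf cluster
    as (size, mean, standard deviation)."""
    sd = stdev(scores) if len(scores) > 1 else 0.0
    return len(scores), mean(scores), sd

def same_score(before, after, same):
    """Fraction of probes whose BEFORE and AFTER leaf scores pass `same`.
    `before[i]` and `after[i]` are the leaf scores for probe i."""
    hits = sum(1 for b, a in zip(before, after) if same(b, a))
    return hits / len(before)

# Placeholder comparator (NOT the gist's test): means within 1.0 of each other.
placeholder = lambda b, a: abs(mean(b) - mean(a)) < 1.0

before = [[1, 2, 3], [10, 11, 12]]       # leaf scores for two probes, BEFORE
after  = [[1, 2, 3, 4], [20, 21, 22]]    # leaf scores for the same probes, AFTER

print(tag(before[0]))                             # (3, 2, 1.0)
print(same_score(before, after, placeholder))     # 0.5

# The baseline would then be the mean of same_score over the 20 full rebuilds,
# compared against the mean over the 20 incremental builds.
```

Passing the comparator in as an argument means the same loop serves both the baseline and the incremental runs.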
Write a file report.txt commenting on how much α affects the same score.
The baseline score may be very low (in which case the random projections are finding wildly different clusters).
- If that happens, spend more time finding the pivots (i.e. lessen the "random" in the random projections)