Cross-device prediction for the ICDM Kaggle competition
See here
- Domain knowledge: We want to understand the cookie and IP data.
- Data pre-processing: Note that awk and sed are highly recommended because they are fast.
- Machine Learning: See below.
- Submission: We have to follow the submission protocol from Kaggle.
Essentially, this is an identification problem. Of the problems I have tackled before, the closest one is face verification. It is strongly recommended to go through some of the literature first.
Of course, we can frame it as a supervised learning problem, in which case all the traditional methods can be applied (sorted by priority):
- Gradient Boosting Machines (Kaggle-favored)
- SVM with different kernels
- Random Forest
- ...
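As a rough illustration of the supervised framing, here is a minimal sketch that treats each candidate (device, cookie) pair as one row and trains a gradient boosting classifier on it. The features and labels are synthetic placeholders, not the competition data, and the feature count and model settings are arbitrary assumptions.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in for (device, cookie) pair features.
# In practice these would be derived from the real data (assumption).
n_pairs, n_features = 400, 5
X = rng.normal(size=(n_pairs, n_features))

# Hypothetical label: 1 if the device and the cookie belong to the same user.
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

clf = GradientBoostingClassifier(n_estimators=50, random_state=0)
clf.fit(X_train, y_train)

# Probability that each held-out pair is a true match.
proba = clf.predict_proba(X_test)[:, 1]
accuracy = clf.score(X_test, y_test)
```

The match probabilities can then be ranked per device to pick the most likely cookies. Swapping in an SVM or a random forest only changes the `clf` line.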
Framing it as a supervised learning problem doesn't actually make much sense to me. Inspired by the face verification community, the following approaches are really worth trying (sorted by priority):
- Siamese convolutional net
- DrLIM
- Metric learning
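The approaches above share one idea: learn an embedding in which matching pairs are close and non-matching pairs are far apart. A minimal NumPy sketch of the contrastive loss used by DrLIM and typically by Siamese networks follows. The function name, the `same` label convention (1 for a matching pair), and the margin value are assumptions made for illustration.

```python
import numpy as np

def contrastive_loss(emb_a, emb_b, same, margin=1.0):
    """Mean contrastive loss over a batch of embedding pairs.

    same[i] is 1 if pair i should match and 0 otherwise.
    """
    d = np.linalg.norm(emb_a - emb_b, axis=1)
    # Matching pairs are pulled together.
    pull = same * 0.5 * d ** 2
    # Non-matching pairs are pushed apart until they exceed the margin.
    push = (1 - same) * 0.5 * np.maximum(0.0, margin - d) ** 2
    return float(np.mean(pull + push))

emb_a = np.array([[0.0, 0.0], [0.0, 0.0]])
emb_b = np.array([[0.0, 0.0], [3.0, 4.0]])
same = np.array([1, 0])

# Matching pair coincides, non-matching pair is beyond the margin.
loss_good = contrastive_loss(emb_a, emb_b, same)
# Same embeddings with the labels flipped are penalised.
loss_bad = contrastive_loss(emb_a, emb_b, 1 - same)
```

In a Siamese network, `emb_a` and `emb_b` would be the outputs of the same network applied to the two inputs, and this loss would be minimised with respect to the network weights.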
We always start from the easiest thing and add complexity gradually. This bottom-up path keeps us clear about what we are doing. It may cost some time at the beginning, but I firmly believe it will pay off down the road.
For the supervised route, run scikit-learn models on the raw data first.
For the metric-learning route, start with LDA (linear discriminant analysis, the simplest form of metric learning) in Euclidean space.
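A minimal sketch of the LDA step, assuming each user is treated as a class and distances are measured in the projected space. The data is synthetic, and the number of users, samples, and features are placeholders.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)

# Synthetic data: each hypothetical user is a class with its own center.
n_users, per_user, n_features = 3, 30, 6
centers = rng.normal(scale=5.0, size=(n_users, n_features))
X = np.vstack([c + rng.normal(size=(per_user, n_features)) for c in centers])
y = np.repeat(np.arange(n_users), per_user)

# LDA learns a linear projection that separates the classes;
# at most n_users - 1 components are available.
lda = LinearDiscriminantAnalysis(n_components=2)
Z = lda.fit_transform(X, y)

def distance(i, j):
    """Euclidean distance between samples i and j in the LDA space."""
    return float(np.linalg.norm(Z[i] - Z[j]))

same_user = distance(0, 1)              # two samples from user 0
different_user = distance(0, per_user)  # user 0 versus user 1
```

Thresholding this distance gives a simple verification rule: two samples are declared to belong to the same user when they are close enough in the projected space.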
Note that pandas only works for dense data.
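One way around this, sketched under the assumption that the raw data is mostly categorical, is to skip pandas and build a SciPy sparse matrix directly with scikit-learn's `DictVectorizer`. The field names and records below are made up for illustration.

```python
from scipy import sparse
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical raw records; the field names are placeholders.
records = [
    {"ip": "ip_1", "device_type": "phone"},
    {"ip": "ip_1", "device_type": "tablet"},
    {"ip": "ip_2", "device_type": "phone"},
    {"ip": "ip_3", "device_type": "desktop"},
]
labels = [1, 1, 0, 0]

# One-hot encode the string fields into a sparse matrix.
vectorizer = DictVectorizer(sparse=True)
X = vectorizer.fit_transform(records)

# Many scikit-learn estimators accept sparse input directly.
clf = LogisticRegression()
clf.fit(X, labels)
predictions = clf.predict(X)
```

Only the non-zero entries are stored, so memory use grows with the number of observed values, not with the number of distinct IPs.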