Machine learning routines for Apache Spot (incubating).
At present, spot-ml contains routines for performing suspicious connections analyses on netflow, DNS, or proxy data gathered from a network. These analyses consume a (possibly very large) collection of network events and produce a list of the events that are considered to be the least probable (most suspicious).
spot-ml is designed to be run as a component of Spot. It relies on the ingest component of Spot to collect and load netflow and DNS records, and spot-ml will try to load data to the operational analytics component of Spot. It is suggested that when experimenting with spot-ml, you do so as part of the unified Spot system; please see [the Spot wiki].
The remaining instructions in this README file treat spot-ml in a stand-alone fashion that might be helpful for customizing and troubleshooting the component.
Load data for consumption by spot-ml by running [spot-ingest].
The data format and storage location differ for netflow, DNS, and proxy analyses.
Netflow Data
Netflow data for the year YEAR, month MONTH, and day DAY is stored in a Parquet table at HUSER/flow/csv/y=YEAR/m=MONTH/d=DAY/* according to the following schema:
- time: String
- year: Double
- month: Double
- day: Double
- hour: Double
- minute: Double
- second: Double
- time of duration: Double
- source IP: String
- destination IP: String
- source port: Double
- dport: Double
- proto: String
- flag: String
- fwd: Double
- stos: Double
- ipkt: Double
- ibyt: Double
- opkt: Double
- obyt: Double
- input: Double
- output: Double
- sas: String
- das: String
- dtos: String
- dir: String
- rip: String
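The day-partitioned layout described above can be sketched as a small path builder. This is a minimal Python illustration and not part of spot-ml: the function name `partition_path`, the base directory `/user/spot`, and the zero-padded month and day are assumptions, since this README does not say how HUSER is set or whether the date parts are padded.

```python
def partition_path(base_dir: str, source_dir: str,
                   year: str, month: str, day: str) -> str:
    """Build a path of the form BASE/SOURCE/y=YEAR/m=MONTH/d=DAY.

    The date parts are inserted exactly as given; whether the real
    layout zero-pads them is an assumption left to the caller.
    """
    base = base_dir.rstrip("/")  # avoid a doubled slash
    return f"{base}/{source_dir}/y={year}/m={month}/d={day}"


# Hypothetical base directory standing in for HUSER.
flow_path = partition_path("/user/spot", "flow/csv", "2015", "01", "01")
print(flow_path)
```

Passing the source directory as a parameter keeps the helper usable for any data source that follows the same y=/m=/d= convention.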
DNS Data
DNS data for the year YEAR, month MONTH and day DAY is stored in Parquet at HUSER/dns/hive/y=YEAR/m=MONTH/d=DAY/
using the following schema:
- frame_time: STRING
- unix_tstamp: BIGINT
- frame_len: INT
- ip_dst: STRING
- ip_src: STRING
- dns_qry_name: STRING
- dns_qry_class: STRING
- dns_qry_type: INT
- dns_qry_rcode: INT
- dns_a: STRING
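When troubleshooting ingest output, it can help to check a record against the DNS schema before running an analysis. The sketch below is a hedged Python illustration: the column names come from the list above, but the mapping of STRING, BIGINT, and INT to Python types, the helper name `schema_problems`, and the sample record are all assumptions made for this example.

```python
from typing import Any, Dict, List

# Column names from the DNS schema; the Python types are an assumed
# mapping (STRING -> str, BIGINT and INT -> int).
DNS_SCHEMA: Dict[str, type] = {
    "frame_time": str,
    "unix_tstamp": int,
    "frame_len": int,
    "ip_dst": str,
    "ip_src": str,
    "dns_qry_name": str,
    "dns_qry_class": str,
    "dns_qry_type": int,
    "dns_qry_rcode": int,
    "dns_a": str,
}


def schema_problems(record: Dict[str, Any]) -> List[str]:
    """Return a list of mismatches; an empty list means the record fits."""
    problems = []
    for column, expected in DNS_SCHEMA.items():
        if column not in record:
            problems.append(f"missing column: {column}")
        elif not isinstance(record[column], expected):
            actual = type(record[column]).__name__
            problems.append(f"{column}: expected {expected.__name__}, got {actual}")
    return problems


# Made-up record, for illustration only.
sample = {
    "frame_time": "Jan  1, 2015 00:00:00", "unix_tstamp": 1420070400,
    "frame_len": 80, "ip_dst": "10.0.0.1", "ip_src": "10.0.0.2",
    "dns_qry_name": "example.com", "dns_qry_class": "0x0001",
    "dns_qry_type": 1, "dns_qry_rcode": 0, "dns_a": "93.184.216.34",
}
print(schema_problems(sample))
```

The check ignores null values, which a real Parquet table may contain.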
Proxy Data
Proxy data for the year YEAR, month MONTH and day DAY is stored in Parquet at HUSER/proxy/hive/y=YEAR/m=MONTH/d=DAY/
using the following schema:
- p_date: STRING
- p_time: STRING
- clientip: STRING
- host: STRING
- reqmethod: STRING
- useragent: STRING
- resconttype: STRING
- duration: INT
- username: STRING
- authgroup: STRING
- exceptionid: STRING
- filterresult: STRING
- webcat: STRING
- referer: STRING
- respcode: STRING
- action: STRING
- urischeme: STRING
- uriport: STRING
- uripath: STRING
- uriquery: STRING
- uriextension: STRING
- serverip: STRING
- scbytes: INT
- csbytes: INT
- virusid: STRING
- bcappname: STRING
- bcappoper: STRING
- fulluri: STRING
To run a suspicious connects analysis, execute the ml_ops.sh script in the ml directory of the MLNODE:
./ml_ops.sh YYYYMMDD <type> <suspicion threshold> <max results returned>
For example:
./ml_ops.sh 19731231 flow 1e-20 200
If the max results returned argument is not provided, all results with scores below the threshold are returned. For example:
./ml_ops.sh 20150101 dns 1e-4
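When the script is launched from another program, the arguments can be checked first. The sketch below is a minimal Python illustration and does not describe how ml_ops.sh itself parses its input: the helper name `build_ml_ops_command`, the set of accepted types (taken from the examples in this README), and the rule that max results must be positive are assumptions.

```python
from datetime import datetime
from typing import List, Optional

# Types that appear in this README's examples.
VALID_TYPES = ("flow", "dns", "proxy")


def build_ml_ops_command(date: str, data_type: str, threshold: float,
                         max_results: Optional[int] = None) -> List[str]:
    """Validate the arguments and return an argv list for ml_ops.sh."""
    # strptime alone accepts short dates, so require 8 digits first.
    if len(date) != 8 or not date.isdigit():
        raise ValueError(f"date must be YYYYMMDD, got {date!r}")
    datetime.strptime(date, "%Y%m%d")  # rejects impossible dates
    if data_type not in VALID_TYPES:
        raise ValueError(f"type must be one of {VALID_TYPES}")
    command = ["./ml_ops.sh", date, data_type, str(threshold)]
    if max_results is not None:
        if max_results <= 0:  # assumed rule, not stated by the README
            raise ValueError("max results must be positive")
        command.append(str(max_results))
    return command


print(build_ml_ops_command("19731231", "flow", 1e-20, 200))
```

The returned list can be passed to `subprocess.run` with the ml directory as the working directory.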
As the maximum probability of an event is 1, a threshold of 1 can be used to select a fixed number of most suspicious items regardless of their exact scores:
./ml_ops.sh 20150101 proxy 1 2000
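The interaction between the threshold and the max results argument can be illustrated with a short Python sketch. It is not the spot-ml implementation: the function name `select_suspicious` and the sample events are made up, and treating the threshold as inclusive is an assumption, because this README only says "below the threshold".

```python
from typing import List, Optional, Tuple

ScoredEvent = Tuple[str, float]  # (event description, estimated probability)


def select_suspicious(scored: List[ScoredEvent], threshold: float,
                      max_results: Optional[int] = None) -> List[ScoredEvent]:
    """Keep events scoring at or under the threshold, least probable first."""
    kept = sorted((event for event in scored if event[1] <= threshold),
                  key=lambda event: event[1])
    # Without max_results, everything under the threshold is returned.
    return kept if max_results is None else kept[:max_results]


# Made-up events and scores, for illustration only.
events = [("a", 0.5), ("b", 1e-6), ("c", 0.02), ("d", 1.0)]

print(select_suspicious(events, 1e-4))   # threshold only
print(select_suspicious(events, 1, 2))   # threshold 1: a fixed-size top list
```

With a threshold of 1 every event passes the filter, so only the max results argument limits the output.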
Final results are stored in a file on HDFS. Depending on which data source is analyzed, spot-ml output is found under HPATH at one of:
$HPATH/dns/scored_results/YYYYMMDD/scores/dns_results.csv
$HPATH/proxy/scored_results/YYYYMMDD/scores/results.csv
$HPATH/flow/scored_results/YYYYMMDD/scores/flow_results.csv
Each results file is a CSV file in which network events are annotated with estimated probabilities and sorted in ascending order.
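The three output locations differ only in the data source directory and the file name, which a small lookup can capture. This is a hedged Python sketch: the file names are copied from the paths above, while the function name `results_path` and the example value used for HPATH are assumptions.

```python
# File names as listed in this README for each data source.
RESULT_FILES = {
    "dns": "dns_results.csv",
    "proxy": "results.csv",
    "flow": "flow_results.csv",
}


def results_path(hpath: str, data_type: str, date: str) -> str:
    """Return HPATH/<type>/scored_results/<YYYYMMDD>/scores/<file>."""
    if data_type not in RESULT_FILES:
        raise ValueError(f"unknown data source: {data_type!r}")
    base = hpath.rstrip("/")  # avoid a doubled slash
    return (f"{base}/{data_type}/scored_results/{date}/scores/"
            f"{RESULT_FILES[data_type]}")


# "/user/spot" is a hypothetical stand-in for HPATH.
print(results_path("/user/spot", "flow", "20150101"))
```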
spot-ml is licensed under the Apache License, Version 2.0.
Create a pull request and contact the maintainers.
Report issues at the [Spot issues page].