Cuckoo Sandbox ML GSoC Proposal

                                        Malware Analysis Framework Schematic

Review of Data Sample

There are 199 total malware reports given to us. The reports were studied and from the analysis more relevant details for the project were uncovered

Report Analysis

info - This is the Cuckoo metadata
signatures - Representative of malicious behavior or indicator (Severity levels etc)
target - Metadata of the given file
virustotal - Pretty important and Gets antivirus signatures from VirusTotal.com for various results
static - Static analysis
dropped - Dropped files analysis
behavior - Different classes that analyze behavior dynamically
debug - Debug information
strings - Extract strings from analyzed file
network - Network analysis

Referred to https://github.com/cuckoosandbox/cuckoo/tree/master/modules/processing & http://docs.cuckoosandbox.org/en/latest/customization/processing/ to gain understanding of the sample reports

Features

JSON data - parse stage
In this step I decided to parse through the 199 files and selected signatures, virustotal, behavior and target and analyzed them.

signatures count - 191 files
virustotal count - 171 files
behavior count - 198 files
target count - 199 files

JSON data - run relevant queries on the dataset to aid in clustering

Clustering, Classifying and Anomaly Detection

Studying of two learning concepts seem fruitful for malware analysis:

Clustering of behavior - which enables identifying novel classes of malware with similar behavior
Classification of behavior - which allows to assign malware to known classes of behavior

A set of malware binaries often contains similar variants of the same family which exhibit almost identical behavioral patterns. As a consequence, the embedded reports form dense clouds in the vector space. We can exploit this dense representation by subsuming groups of similar behavior using prototypes — reports being typical for a group of homogeneous behavior. By restricting the computation of learning methods to prototypes and later propagating results to all embedded data, we can accelerate clustering as well as classification techniques. The extracted prototypes correspond to regular reports and thus can be easily inspected by a human analyst.

Behavior analysis using prototypes: (a) prototypes of data, (b) clustering using pro- totypes, and (c) classification using prototypes. Black lines in Figure (b) indicate prototypes joined by linkage clustering. Black lines in Figure (c) represent the class decision boundary.

Algorithm 1 Prototype extraction

1: prototypes ← ∅
2: distance [ x ] ← ∞ for all x ∈ reports
3: while max ( distance ) > dp do
4: choose z such that distance [ z ]= max ( distance )
5: for x ∈ reports and x ≠ z do
6: if distance [ x ] > || ˆ φ ( x ) − ˆ φ ( z ) || then
7: distance [ x ] ←|| ˆ φ ( x ) − ˆ φ ( z ) ||
8: add z to prototypes

Algorithm 2 Clustering using prototypes

1: for z , z 0 ∈ prototypes do
2: distance [ z , z0 ] ←|| ˆ φ ( z ) − ˆ φ ( z0 ) ||
3: while min ( distance ) < dc do
4: merge clusters z , z0 with minimum distance [ z , z0 ]
5: update distance using complete linkage
6: for x ∈ reports do
7: z ← nearest prototype to x
8: assign x to cluster of z
9: reject clusters with less than m members

Algorithm 3 Classification using prototypes

1: for x ∈ reports do
2: z ← nearest prototype to x
3: if || ˆ φ ( z ) − ˆ φ ( x ) || > d r then
4: reject x as unknown class
5: else
6: assign x to cluster of z

Enter Incremental Analysis - the whole point

Based on a joint formulation of classification and clustering, we can have an incremental approach to analysis of malware behavior. To realize an incremental analysis, we need to keep track of intermediate results, such as clusters determined during previous runs of the algorithm. Fortunately, the concept of prototypes enables us to store discovered clusters in a concise representation and, moreover, provides a significant speed-up if used for classification.

Algorithm 4 Incremental Analysis

1: rejected ← ∅ , prototypes ← ∅
2: for reports ← data source ∪ rejected do
3: classify reports to known clusters using prototypes . see Algorithm 3
4: extract prototypes from remaining reports . see Algorithm 1
5: cluster remaining reports using prototypes . see Algorithm 2
6: prototypes ← prototypes ∪ prototypes of new clusters
7: rejected ← rejected reports from clustering

The reports to be analyzed are received from a data source in regular intervals. In the first processing phase, the in- coming reports are classified using prototypes of known clusters (line 3 of algorithm). Thereby, variants of known malware are efficiently identified and filtered from further analysis. In the following phase, prototypes are extracted from the remaining reports and subsequently used for clustering of behavior (line 4–5). The prototypes of the new clusters are stored along with the original set of prototypes, such that they can be applied in a next run for classification. This procedure—alternating between classification and clustering—is repeated incrementally, where the amount of unknown malware is continuously reduced and the prevalent classes of malware are automatically discovered. The number of reports available during one incremental run, however, may be insufficient for determining all clusters of malware behavior. For example, infrequent malware variants may only be represented by few samples in the embedding space. To compensate for this lack of information, we reject clusters with fewer than m members and feed the corresponding reports back to the data source. Consequently, infrequent malware is ag- glomerated until sufficient data is available for clustering.

Name		Name	Last commit message	Last commit date
Latest commit History 20 Commits
data-extracted		data-extracted
data		data
img		img
src		src
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
requirements.txt		requirements.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Cuckoo Sandbox ML GSoC Proposal

Review of Data Sample

Report Analysis

Features

Clustering, Classifying and Anomaly Detection

Enter Incremental Analysis - the whole point

About

Releases

Packages

Languages

License

athend/cuckooml

Folders and files

Latest commit

History

Repository files navigation

Cuckoo Sandbox ML GSoC Proposal

Review of Data Sample

Report Analysis

Features

Clustering, Classifying and Anomaly Detection

Enter Incremental Analysis - the whole point

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages