
Dataset preparation

michal-kapala edited this page Aug 12, 2023 · 8 revisions


Steps needed to prepare training and/or testing datasets for dubRE.

Compilation

Select and compile the target binaries. The compiler needs to emit PDB files (e.g. MSVC or rustc).

See Symbol demangling for details on support for decorated names.

Decompilation

Analyse all binaries with IDA Pro (full autoanalysis is required). Apply PDB files where feasible; this is optional, as PDB information is stored separately in the database. Note that the choice slightly affects decompilation results, including the number of recovered subroutines.

Data extraction

PDB

Extract PDB information to a JSON file with PDB parser (use the ground-truth branch). Import the data into the SQLite database using the pdb.py script.

IDB/i64

Use the extraction plugins for IDA to export the analysis data (e.g. strings and cross-reference paths) into the database.

Data labelling

Use data transformation scripts to generate additional tables and label the samples manually.

Tokens

  1. Create tokens table from strings with tokenize.py
  2. Fill in is_name column:
    • 0 for negatives (not a function name)
    • 1 for positives (a function name)
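A minimal sketch of this workflow, assuming the `tokens` and `strings` schemas implied by the rest of this page (the tokenization rule itself is a hypothetical stand-in for what tokenize.py does):

```python
import re
import sqlite3

# Split extracted string literals into candidate tokens and stage them
# with a NULL is_name label for manual review. The naive alphanumeric
# splitter below is an assumption, not tokenize.py's actual logic.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE strings (address INTEGER PRIMARY KEY, literal TEXT)")
con.execute("CREATE TABLE tokens (literal TEXT PRIMARY KEY, is_name INTEGER)")
con.execute("INSERT INTO strings VALUES (4096, 'error in ParseHeader()')")

for (s,) in con.execute("SELECT literal FROM strings"):
    # Naive splitter: identifier-like runs become candidate tokens.
    for tok in set(re.findall(r"[A-Za-z_][A-Za-z0-9_]*", s)):
        con.execute(
            "INSERT OR IGNORE INTO tokens (literal, is_name) VALUES (?, NULL)",
            (tok,),
        )

# Manual labelling step: 1 = function name, 0 = not a name.
con.execute("UPDATE tokens SET is_name = 1 WHERE literal = 'ParseHeader'")
con.execute("UPDATE tokens SET is_name = 0 WHERE literal IN ('error', 'in')")
positives = con.execute(
    "SELECT COUNT(*) FROM tokens WHERE is_name = 1").fetchone()[0]
print(positives)
```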

Xref paths

  1. Delete duplicates caused by recursive calls:
```sql
DELETE FROM paths WHERE func_addr = path_func1 OR (path_func1 = path_func2 AND path_func1 != -1)
```
  2. Label function-string paths in paths with autolabel_paths.py
  3. Manually review the labelled samples and correct false positives.
  4. Create the function-token token_paths table with tpaths.py; it populates the table by decomposing every function-string path of functions that have at least one positive path into function-token paths.
  5. Create the token_paths_positive table for manual labelling with tpaths_pos.py; it populates the table by decomposing the positive function-string paths into function-token paths.
  6. Fill in the names_func column in token_paths_positive:
    • 0 for negatives (token is not the function's name)
    • 1 for positives (token is the function's name)
  7. Copy the assigned labels back into token_paths and delete token_paths_positive with tpaths_merge.py
  8. Balance the dataset with function-token paths of functions that have no positive paths. The positive/negative ratio of root functions in token_paths should be close to 1:1.
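The final balancing check can be sketched as follows, assuming only the `token_paths` columns referenced elsewhere on this page (`func_addr`, `names_func`):

```python
import sqlite3

# Compare the number of root functions with at least one positive
# function-token path against those with only negative paths; the two
# counts should end up close to 1:1 after balancing.
con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE token_paths "
    "(path_id INTEGER, func_addr INTEGER, token_literal TEXT, names_func INTEGER)"
)
rows = [
    (1, 0x401000, "init",  1),  # function with a positive path
    (2, 0x401000, "error", 0),
    (3, 0x402000, "free",  0),  # function with only negative paths
]
con.executemany("INSERT INTO token_paths VALUES (?,?,?,?)", rows)

pos = con.execute(
    "SELECT COUNT(DISTINCT func_addr) FROM token_paths WHERE names_func = 1"
).fetchone()[0]
neg = con.execute(
    "SELECT COUNT(DISTINCT func_addr) FROM token_paths WHERE func_addr NOT IN "
    "(SELECT func_addr FROM token_paths WHERE names_func = 1)"
).fetchone()[0]
print(pos, neg)  # ideally close to 1:1
```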

See Path labelling for manual labelling process steps.

Dataset integrity checks

To ensure coherent and complete data, the relation between paths and tpaths is checked - every positive function-string path must have at least 1 function-token path:

```sql
SELECT *
FROM (SELECT * FROM paths WHERE to_name = 1) AS p
LEFT JOIN (SELECT * FROM token_paths WHERE names_func = 1) AS tp
  ON p.func_addr = tp.func_addr
WHERE names_func IS NULL
```

The above query returns all positive function-string paths which are missing their function-token path counterpart. This could happen for a few reasons:

  • The related token is mislabelled as false negative
  • The related token is missing a label
  • The path is mislabelled as false positive
  • The token path is mislabelled as false positive
  • The related string record is missing from strings
  • The related token is a duplicate and thus missing from tokens
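The check can be reproduced on a toy database (schemas trimmed to the columns the query actually uses):

```python
import sqlite3

# One positive function-string path has a matching positive token path;
# the other does not, so the LEFT JOIN flags it for review.
con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE paths (id INTEGER PRIMARY KEY, func_addr INTEGER, to_name INTEGER)")
con.execute(
    "CREATE TABLE token_paths (path_id INTEGER, func_addr INTEGER, names_func INTEGER)")
con.execute("INSERT INTO paths VALUES (1, 100, 1), (2, 200, 1)")
con.execute("INSERT INTO token_paths VALUES (1, 100, 1)")  # func 200 has no counterpart

missing = con.execute(
    "SELECT p.id FROM (SELECT * FROM paths WHERE to_name = 1) AS p "
    "LEFT JOIN (SELECT * FROM token_paths WHERE names_func = 1) AS tp "
    "ON p.func_addr = tp.func_addr WHERE tp.names_func IS NULL"
).fetchall()
print(missing)  # path 2 needs review
```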

Below you can find solutions to each of the problems.

False negative/unlabelled token

A token was initially mislabelled as a negative, but cross reference path review and labelling proved otherwise.

With partially-labelled datasets, it is possible to have a positive function-string path with a NULL-labelled token. The correction process is the same in both cases.

  1. Correct the token label:
```sql
-- Provide the token value (it's unique)
UPDATE tokens SET is_name = 1 WHERE literal = '<token literal>'
```
  2. Add a labelled token path manually:
```sql
-- Provide all needed parameters from the related `paths` record and the token literal
INSERT INTO token_paths (path_id,func_addr,string_addr,token_literal,names_func) VALUES (123,456,789,'<token literal>',1)
```

False positive path

A string does not contain a token which is the referred function's name, but it was manually labelled as such.

Correct the path label:

```sql
-- Provide the correct path id
UPDATE paths SET to_name = 0 WHERE id = 123456789
```

False positive token path

A function-token path does not name the referred function, but it was manually labelled as such.

Correct the token path label:

```sql
-- Provide the correct path id and token literal
UPDATE token_paths SET names_func = 0 WHERE path_id = 123456789 AND token_literal = '<token literal>'
```

Missing string

UTF-16 strings absent from IDA's string list

The strings_export.py plugin skips some UTF-16-encoded string literals that do not appear in the IDB's string list.

Add the string record to the database:

  1. Open the IDB and execute the code below from IDA's terminal:
```python
# Provide the string's offset
import idc
print(idc.get_strlit_contents(<string offset>, -1, 1))
```
  2. Copy the string out of the terminal.
  3. Add a new strings record:
```sql
-- Provide respectively the offset and value of the string
INSERT INTO strings VALUES (123456789, '<string literal>')
```
  4. Run the tpaths_add_missing.py script, which will add the missing tokens and token_paths records for the new string:
```
python tpaths_add_missing.py --dbpath="<full SQLite path>" --pathid=<related path id>
```
  5. Manually label the added tokens and token paths.

Non-literal artifacts

Another reason for a missing string can be a type misassignment by IDA, which results in exceptional entities such as line variables being identified as strings. Such IDB objects are not string literals; all records pointing to them should be deleted from paths:

```sql
-- Provide path id
DELETE FROM paths WHERE id = 123456789
```

Duplicate token

The database model does not allow duplicate tokens from a single binary, to prevent unwanted weighting of tokens. If you are sure the token is a fitting positive, there likely exists a different function-token path containing it, or the name is too ambiguous to be a function name and is thus out of scope for the training dataset. In the case of single-token strings, such paths should be labelled as negatives.

Label the path as a negative:

```sql
-- Provide the path id
UPDATE paths SET to_name = 0 WHERE id = 123456789
```

Dataset balancing

Raw binary data is typically highly imbalanced. Equalizing the numbers of positive and negative samples should improve the quality of both the training and test datasets.

Tokens

The majority of tokens obtained from binaries tends to be negative. The pre-made pdb table can be used to balance the function-name token set with the undecorated names of the original functions.
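A hypothetical sketch of that balancing step; the pdb table's column name (`undecorated_name`) is an assumption here and should be adjusted to the actual schema produced by pdb.py:

```python
import sqlite3

# Add every known-good undecorated function name from the pdb table as a
# positive token, skipping any literal already present.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE tokens (literal TEXT PRIMARY KEY, is_name INTEGER)")
con.execute("CREATE TABLE pdb (undecorated_name TEXT)")  # assumed column name
con.execute("INSERT INTO tokens VALUES ('error', 0)")
con.execute("INSERT INTO pdb VALUES ('ParseHeader'), ('FreeBuffer')")

con.execute(
    "INSERT OR IGNORE INTO tokens (literal, is_name) "
    "SELECT undecorated_name, 1 FROM pdb"
)
pos = con.execute("SELECT COUNT(*) FROM tokens WHERE is_name = 1").fetchone()[0]
neg = con.execute("SELECT COUNT(*) FROM tokens WHERE is_name = 0").fetchone()[0]
print(pos, neg)
```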

Xref paths

As with function names, the majority of function-token cross-reference paths tends to be negative. Because paths tables are large, it is practical to initially select only the paths of functions that have at least one positive path, label them, and then add the same number of functions that have no positive paths. This narrows the dataset, trading negative samples for balance and ease of manual labelling.
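The narrowing strategy can be sketched like this (column names follow the wiki; the random sampling is an assumption about how the negative functions are picked):

```python
import random
import sqlite3

# Keep all functions that have a positive path, then sample an equal
# number of purely-negative functions from the full paths table.
con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE paths (id INTEGER PRIMARY KEY, func_addr INTEGER, to_name INTEGER)")
con.executemany("INSERT INTO paths (func_addr, to_name) VALUES (?, ?)", [
    (100, 1), (100, 0),           # function with a positive path
    (200, 0), (300, 0), (400, 0)  # negative-only functions
])

pos_funcs = [r[0] for r in con.execute(
    "SELECT DISTINCT func_addr FROM paths WHERE to_name = 1")]
neg_funcs = [r[0] for r in con.execute(
    "SELECT DISTINCT func_addr FROM paths WHERE func_addr NOT IN "
    "(SELECT func_addr FROM paths WHERE to_name = 1)")]

random.seed(0)
sampled = random.sample(neg_funcs, k=len(pos_funcs))
selected = pos_funcs + sampled  # 1:1 ratio of root functions
print(len(pos_funcs), len(sampled))
```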

Data merging

Use the mergedb.py script to merge the single-binary databases into an all-in-one database.
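A minimal sketch of what a mergedb.py-style merge can look like, using SQLite's ATTACH DATABASE; only the tokens table is shown, and a real merge must cover all tables and handle key clashes:

```python
import os
import sqlite3
import tempfile

# Autocommit mode so DETACH is not blocked by an open transaction.
merged = sqlite3.connect(":memory:", isolation_level=None)
merged.execute("CREATE TABLE tokens (literal TEXT, is_name INTEGER)")

# Stand-ins for single-binary databases (file names are hypothetical).
tmp = tempfile.mkdtemp()
for i, name in enumerate(["a.db", "b.db"]):
    path = os.path.join(tmp, name)
    db = sqlite3.connect(path)
    db.execute("CREATE TABLE tokens (literal TEXT, is_name INTEGER)")
    db.execute("INSERT INTO tokens VALUES (?, 1)", (f"func_{i}",))
    db.commit()
    db.close()
    # Attach each per-binary database and copy its rows into the merged one.
    merged.execute("ATTACH DATABASE ? AS src", (path,))
    merged.execute("INSERT INTO tokens SELECT * FROM src.tokens")
    merged.execute("DETACH DATABASE src")

total = merged.execute("SELECT COUNT(*) FROM tokens").fetchone()[0]
print(total)
```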

Data splitting

Split the merged data into training and test datasets.
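One way to sketch such a split over the merged database; the 80/20 ratio and the path-level granularity are assumptions, not project policy:

```python
import random
import sqlite3

# Shuffle distinct path ids and cut them into train/test partitions.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE token_paths (path_id INTEGER, names_func INTEGER)")
con.executemany("INSERT INTO token_paths VALUES (?, ?)",
                [(i, i % 2) for i in range(10)])

ids = [r[0] for r in con.execute("SELECT DISTINCT path_id FROM token_paths")]
random.seed(42)  # fixed seed keeps the split reproducible
random.shuffle(ids)
cut = int(len(ids) * 0.8)  # assumed 80/20 ratio
train_ids, test_ids = ids[:cut], ids[cut:]
print(len(train_ids), len(test_ids))
```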