# Dataset preparation

Steps needed to prepare training and/or testing datasets for dubRE.
Select and compile the wanted binaries. The compiler needs to emit PDB files (e.g. MSVC or rustc). See Symbol demangling for details on support for decorated names.

Analyse all binaries with IDA Pro (full autoanalysis is required). Apply PDB files where feasible - it is optional, as PDB information is stored separately in the database. Note that this choice slightly affects decompilation results, including the number of restored subroutines.

Extract PDB information to a JSON file with the PDB parser (use the `ground-truth` branch). Import the data into the SQLite database using the `pdb.py` script.
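The import step can be sketched roughly as follows. This is a minimal illustration, not the actual `pdb.py`: the JSON field names (`address`, `name`) and the `pdb` table layout are assumptions about the parser's output.

```python
# Hypothetical sketch of the pdb.py import step: load the parser's JSON
# output and insert (address, undecorated name) rows into a `pdb` table.
# Field and table names here are assumptions, not the real schema.
import json
import sqlite3

def import_pdb(json_text: str, db: sqlite3.Connection) -> int:
    records = json.loads(json_text)
    db.execute("CREATE TABLE IF NOT EXISTS pdb (address INTEGER PRIMARY KEY, name TEXT)")
    db.executemany(
        "INSERT OR IGNORE INTO pdb VALUES (?, ?)",
        ((r["address"], r["name"]) for r in records),
    )
    db.commit()
    return db.execute("SELECT COUNT(*) FROM pdb").fetchone()[0]

sample = '[{"address": 4198400, "name": "main"}, {"address": 4198656, "name": "helper"}]'
conn = sqlite3.connect(":memory:")
print(import_pdb(sample, conn))  # prints 2 - the number of imported symbols
```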
Use the extraction plugins for IDA to export:

- strings (`Shift+S`)
- functions (`Shift+F` followed by `Shift+D`)
- function-string cross-reference paths (`Shift+X`)
Use the data transformation scripts to generate additional tables and label the samples manually:

- Create the `tokens` table from `strings` with `tokenize.py`.
- Fill in the `is_name` column:
  - `0` for negatives (not a function name)
  - `1` for positives (a function name)
- Delete duplicates made due to recursive calls:

  ```sql
  DELETE FROM paths WHERE func_addr = path_func1 OR (path_func1 = path_func2 AND path_func1 != -1)
  ```

- Label function-string paths in `paths` with `autolabel_paths.py`.
- Manually review the labelled samples and correct false positives.
- Create the function-token `token_paths` table and populate it with all function-string paths decomposed into function-token paths, for functions that have any positive paths, using `tpaths.py`.
- Create the `token_paths_positive` table for manual labelling and populate it with positive function-string paths decomposed into function-token paths using `tpaths_pos.py`.
- Fill in the `names_func` column in `token_paths_positive`:
  - `0` for negatives (the token is not the function's name)
  - `1` for positives (the token is the function's name)
- Copy the assigned labels back into `token_paths` and delete `token_paths_positive` with `tpaths_merge.py`.
- Balance the dataset with function-token paths of functions that have no positive paths. The positive/negative ratio of root functions in `token_paths` should be close to 1:1.
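The target 1:1 ratio can be verified with a short query. Below is a minimal sketch, assuming the `token_paths` schema used in the steps above (the real database may hold more columns):

```python
# A minimal sketch of checking the positive/negative balance of root
# functions in `token_paths`. A root function counts as positive if any
# of its token paths is labelled positive.
import sqlite3

def root_function_ratio(db: sqlite3.Connection) -> float:
    pos, neg = db.execute(
        """SELECT SUM(has_pos), SUM(1 - has_pos)
           FROM (SELECT MAX(names_func = 1) AS has_pos
                 FROM token_paths GROUP BY func_addr)"""
    ).fetchone()
    return pos / neg if neg else float("inf")

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE token_paths (path_id, func_addr, string_addr, token_literal, names_func)")
conn.executemany(
    "INSERT INTO token_paths VALUES (?, ?, ?, ?, ?)",
    [(1, 100, 1, "a", 1), (2, 100, 2, "b", 0),   # function 100: has a positive path
     (3, 200, 3, "c", 0), (4, 200, 4, "d", 0)],  # function 200: negatives only
)
print(root_function_ratio(conn))  # prints 1.0 -> balanced
```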
See Path labelling for manual labelling process steps.
To ensure coherent and complete data, the relation between `paths` and `token_paths` is checked - every positive function-string path must have at least one function-token path:

```sql
SELECT * FROM (SELECT * FROM paths WHERE to_name = 1) AS p
LEFT JOIN (SELECT * FROM token_paths WHERE names_func = 1) AS tp
ON p.func_addr = tp.func_addr
WHERE names_func IS NULL
```
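The check can be exercised on a toy database like so. The column subset below is illustrative, not the full schema:

```python
# A small harness for the consistency check: each returned row is a
# positive function-string path with no positive function-token path.
import sqlite3

CHECK = """
SELECT p.id FROM (SELECT * FROM paths WHERE to_name = 1) AS p
LEFT JOIN (SELECT * FROM token_paths WHERE names_func = 1) AS tp
ON p.func_addr = tp.func_addr
WHERE names_func IS NULL
"""

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE paths (id, func_addr, to_name)")
conn.execute("CREATE TABLE token_paths (path_id, func_addr, names_func)")
conn.executemany("INSERT INTO paths VALUES (?, ?, ?)",
                 [(1, 100, 1), (2, 200, 1)])
# Function 100 has a positive token path; function 200 does not.
conn.execute("INSERT INTO token_paths VALUES (1, 100, 1)")
print([r[0] for r in conn.execute(CHECK)])  # prints [2] -> path 2 needs review
```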
The above query returns all positive function-string paths which are missing their function-token path counterpart. This can happen for a few reasons:

- The related token is mislabelled as a false negative
- The related token is missing a label
- The path is mislabelled as a false positive
- The token path is mislabelled as a false positive
- The related string record is missing from `strings`
- The related token is a duplicate and thus missing from `tokens`
Below you can find solutions to each of the problems.
A token was initially mislabelled as a negative, but cross-reference path review and labelling proved otherwise. With partially-labelled datasets, it is also possible to have a positive function-string path with a `NULL`-labelled token. The correction process is the same in both cases.

- Correct the token label:

  ```sql
  -- Provide the token value (it's unique)
  UPDATE tokens SET is_name = 1 WHERE literal = '<token literal>'
  ```

- Add a labelled token path manually:

  ```sql
  -- Provide all needed parameters from the related `paths` record and the token literal
  INSERT INTO token_paths (path_id,func_addr,string_addr,token_literal,names_func) VALUES (123,456,789,'<token literal>',1)
  ```
A string does not contain a token which is the referred function's name, but it was manually labelled as such. Correct the path label:

```sql
-- Provide the correct path id
UPDATE paths SET to_name = 0 WHERE id = 123456789
```
A function-token path does not name the referred function, but it was manually labelled as such. Correct the token path label:

```sql
-- Provide the correct path id and token literal
UPDATE token_paths SET names_func = 0 WHERE path_id = 123456789 AND token_literal = '<token literal>'
```
The `strings_export.py` plugin skips over some UTF-16-encoded string literals which do not appear in the IDB's string list.
Add the string record to the database:

- Open the IDB and execute the code below from IDA's terminal:

  ```python
  # Provide the string's offset
  import idc
  print(idc.get_strlit_contents(<string offset>, -1, 1))
  ```

- Copy the string out of the terminal
- Add a new `strings` record:

  ```sql
  -- Provide respectively the offset and value of the string
  INSERT INTO strings VALUES (123456789, '<string literal>')
  ```
- Run the `tpaths_add_missing.py` script, which will add the missing `tokens` and `token_paths` records for the new string:

  ```shell
  python tpaths_add_missing.py --dbpath="<full SQLite path>" --pathid=<related path id>
  ```

- Manually label the added tokens and token paths.
Another reason for a missing string could be a type misassignment by IDA, which results in identifying exceptional entities such as line variables as strings. Such IDB objects are not string literals; all records pointing to them should be deleted from `paths`:

```sql
-- Provide the path id
DELETE FROM paths WHERE id = 123456789
```
The database model does not allow duplicate tokens from a single binary, to prevent unwanted weighting of tokens. If you are sure the token is a fitting positive, there likely exists a different function-token path with it, or the name is too ambiguous for a function name and as such out of scope for the training dataset. In the case of single-token strings, such paths should be labelled as negatives.

Label the path as a negative:

```sql
-- Provide the path id
UPDATE paths SET to_name = 0 WHERE id = 123456789
```
Raw binary data tends to be highly unbalanced, and equalizing the numbers of positive and negative samples should improve the quality of both datasets.

The majority of the tokens obtained from binaries tends to be negative. The pre-made `pdb` table can be used to equalize the function name token set with the undecorated names of the original functions.

Similarly to function names, the majority of the function-token cross-reference paths tends to be negative too. Because of the large size of `paths`, it is feasible to initially choose only the paths of functions with any positive paths, label them, then add the same number of functions which do not have any positive paths. This approach narrows the dataset, trading negative samples for dataset balance and ease of manual labelling.
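The narrowing approach can be sketched as below. This is a rough illustration with the table and column names used earlier, not the actual project script:

```python
# A rough sketch of the narrowing approach: keep every function that has a
# positive function-string path, then draw an equal number of randomly
# chosen functions that have none.
import random
import sqlite3

def pick_balancing_functions(db: sqlite3.Connection, seed: int = 0):
    # Functions that already have at least one positive function-string path.
    pos = {r[0] for r in db.execute(
        "SELECT DISTINCT func_addr FROM paths WHERE to_name = 1")}
    # Candidate functions with no positive path at all.
    neg = [r[0] for r in db.execute(
        "SELECT DISTINCT func_addr FROM paths") if r[0] not in pos]
    random.seed(seed)
    # Draw as many all-negative functions as there are positive ones.
    return sorted(random.sample(neg, min(len(pos), len(neg))))

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE paths (id, func_addr, to_name)")
conn.executemany("INSERT INTO paths VALUES (?, ?, ?)",
                 [(1, 100, 1), (2, 200, 0), (3, 300, 0), (4, 400, 0)])
picked = pick_balancing_functions(conn)
print(len(picked))  # prints 1 - one negative function to match the one positive
```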
Use the `mergedb.py` script to merge the single-binary databases into the all-in-one database.
Use to split the merged data into training and test datasets.
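The split step can be sketched with a seeded shuffle over root functions, so that all paths of one function land in the same split. The function-level granularity and the 80/20 ratio here are assumptions for illustration, not the behaviour of the project's split script:

```python
# A minimal sketch of a train/test split keyed on function addresses,
# assuming the split is done per root function (illustrative only).
import random

def split_funcs(func_addrs, test_fraction=0.2, seed=42):
    addrs = sorted(func_addrs)
    random.seed(seed)
    random.shuffle(addrs)
    cut = int(len(addrs) * (1 - test_fraction))
    return addrs[:cut], addrs[cut:]

train, test = split_funcs(range(10))
print(len(train), len(test))  # prints 8 2
```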