Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Entropy calculation of sections and overlays (PE, ELF, COFF, MACHO) #507

Merged
merged 42 commits into from
Mar 19, 2019

Conversation

JakubPruzinec
Copy link
Contributor

@JakubPruzinec JakubPruzinec commented Feb 23, 2019

#502
I've implemented extraction of section and overlay entropies. Algorithm for entropy calculation uses Shannons formula normalized to <0,1> <0,8> interval.

@metthal metthal self-requested a review February 23, 2019 21:14
@metthal metthal self-assigned this Feb 23, 2019
Copy link
Member

@metthal metthal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the PR. I have added some comments to your code.

One thing that I am thinking about as I am looking at the code is that we don't usually show floating point numbers from fileinfo. I don't think there is any other case where we do but basically, what about the way precision and display format? std::cout uses by default scientific notation for them (for example 1e-5). I don't think we care that much about this kind of precision. I think we would be more than fine with just few digits of precision and not use scientific notation at all.

I am not entirely sure about this, I'll probably discuss it with someone else. I just want you to not forget about it and maybe try to come up with some ideas.

return;
}

auto data = reinterpret_cast<const uint8_t *>(bytes.data());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is reinterpret_cast really necessary here? Wouldn't static_cast be enough or is the cast necessary at all?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cast is necessary, since bytes.data() returns const char * and computeDataEntropy takes const unsigned char* as parameter. As stated earlier, static_cast doesn't suffice. invalid static_cast from type ‘const char*’ to type ‘const uint8_t*’ {aka ‘const unsigned char*’}, please see https://stackoverflow.com/questions/658913/c-style-cast-from-unsigned-char-to-const-char

Copy link
Contributor Author

@JakubPruzinec JakubPruzinec Feb 26, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, I wouldn't be concerned about wrong interpretation of signedness as computeDataEntropy doesn't take it into consideration.

section.setCrc32(auxSec->getCrc32());
section.setMd5(auxSec->getMd5());
section.setSha256(auxSec->getSha256());

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove this unnecessary whitespace please.

@metthal
Copy link
Member

metthal commented Feb 26, 2019

Could you please also modify it so that results are in range from 0 to 8? We should stick with what VirusTotal and YARA are using.

if (frequency)
{
double probability = static_cast<double>(frequency) / dataLen;
entropy -= probability * (log(probability) / log(2));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can be calculated using only single logarithm because log_a(x) / log_a(b) = log_b(x) so you can actually use logarithm of base-2 here. Also please use std:: namespace instead of the functions in the global namespace.

#include <map>
#include <unordered_map>
#include <cmath>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please keep headers in alphabetical order.

@@ -47,10 +47,12 @@ class SecSeg
unsigned long long address; ///< start address in memory
unsigned long long memorySize; ///< size in memory
unsigned long long entrySize; ///< size of one entry in file
double entropy; ///< entropy in <0,1>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't comment here wrong? Shouldn't it be <0,8>?

@metthal metthal merged commit ad499fe into avast:master Mar 19, 2019
s3rvac added a commit that referenced this pull request Mar 21, 2019
We usually add both the issue and the PR into CHANGELOG (see other entries).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants