-
Notifications
You must be signed in to change notification settings - Fork 958
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Entropy calculation of sections and overlays (PE, ELF, COFF, MACHO) #507
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for the PR. I have added some comments to your code.
One thing that I am thinking about as I am looking at the code is that we don't usually show floating point numbers from fileinfo
. I don't think there is any other case where we do but basically, what about the way precision and display format? std::cout
uses by default scientific notation for them (for example 1e-5
). I don't think we care that much about this kind of precision. I think we would be more than fine with just few digits of precision and not use scientific notation at all.
I am not entirely sure about this, I'll probably discuss it with someone else. I just want you to not forget about it and maybe try to come up with some ideas.
return; | ||
} | ||
|
||
auto data = reinterpret_cast<const uint8_t *>(bytes.data()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is reinterpret_cast
really necessary here? Wouldn't static_cast
be enough or is the cast necessary at all?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cast is necessary, since bytes.data()
returns const char *
and computeDataEntropy
takes const unsigned char*
as parameter. As stated earlier, static_cast
doesn't suffice. invalid static_cast from type ‘const char*’ to type ‘const uint8_t*’ {aka ‘const unsigned char*’}
, please see https://stackoverflow.com/questions/658913/c-style-cast-from-unsigned-char-to-const-char
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, I wouldn't be concerned about wrong interpretation of signedness as computeDataEntropy
doesn't take it into consideration.
section.setCrc32(auxSec->getCrc32()); | ||
section.setMd5(auxSec->getMd5()); | ||
section.setSha256(auxSec->getSha256()); | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remove this unnecessary whitespace please.
Could you please also modify it so that results are in range from 0 to 8? We should stick with what VirusTotal and YARA are using. |
src/fileformat/utils/other.cpp
Outdated
if (frequency) | ||
{ | ||
double probability = static_cast<double>(frequency) / dataLen; | ||
entropy -= probability * (log(probability) / log(2)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This can be calculated using only single logarithm because log_a(x) / log_a(b) = log_b(x)
so you can actually use logarithm of base-2 here. Also please use std::
namespace instead of the functions in the global namespace.
src/fileformat/utils/other.cpp
Outdated
#include <map> | ||
#include <unordered_map> | ||
#include <cmath> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please keep headers in alphabetical order.
@@ -47,10 +47,12 @@ class SecSeg | |||
unsigned long long address; ///< start address in memory | |||
unsigned long long memorySize; ///< size in memory | |||
unsigned long long entrySize; ///< size of one entry in file | |||
double entropy; ///< entropy in <0,1> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Isn't comment here wrong? Shouldn't it be <0,8>
?
We usually add both the issue and the PR into CHANGELOG (see other entries).
#502
I've implemented extraction of section and overlay entropies. Algorithm for entropy calculation uses Shannons formula normalized to
<0,1><0,8> interval.