
[ML] Decrease the memory used by distribution models #146

Merged

Conversation

Contributor
@tveasey tveasey commented Jul 5, 2018

This makes some type changes to reduce the precision of quantities where we don't need full precision. For example, this saves around 15% of the memory typically used by the count residual distribution model (dropping from 1900 to 1650 bytes).

This may have a small effect on results, simply due to slightly different rounding of floating point quantities.
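As a rough illustration of the kind of change (the struct and member names below are hypothetical, not the actual ml-cpp types), narrowing doubles to floats where full precision isn't needed shrinks the per-object size:

```cpp
// Minimal sketch of the type narrowing described above; names are hypothetical.
#include <cstdint>
#include <iostream>

struct SampleStatsWide {
    double weight;       // 8 bytes
    double logWeight;    // 8 bytes
    std::int64_t count;  // 8 bytes
};

struct SampleStatsNarrow {
    float weight;        // 4 bytes: full double precision isn't needed here
    float logWeight;     // 4 bytes
    std::int64_t count;  // 8 bytes
};

int main() {
    std::cout << sizeof(SampleStatsWide) << " -> " << sizeof(SampleStatsNarrow) << '\n';
    // Typically prints "24 -> 16" on 64-bit platforms.
}
```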

@@ -42,7 +42,12 @@ using TCalendarComponentVec = std::vector<maths::CCalendarComponent>;
 //! -# ContinuousData: which indicates the data takes real values.
 //! -# MixedData: which indicates the data can be decomposed into
 //!    some combination of the other three data types.
-enum EDataType { E_DiscreteData, E_IntegerData, E_ContinuousData, E_MixedData };
+enum EDataType : std::int32_t {
Contributor

Is this actually making a difference on any platform? On all the platforms we support, int is 32 bits. And I know gcc uses int for enums unless it's forced to do something different.

The C++ standard says that compilers can choose any integral type that can accommodate all the enum values to represent the enum type. But in practice they'll probably favour int as the C standard says the enum values must be int, and it would be foolish for a compiler vendor to introduce an incompatibility in the binary output for the same header file depending on whether it was compiled as C or C++.

So does clang or Visual C++ use a different enum size by default? If not, I think this is just introducing a distraction into the code without changing the end result.
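For reference, a small check of the sizes being discussed (this only illustrates the behaviour described above; on mainstream compilers a plain enum whose values fit in int is int-sized):

```cpp
#include <cstdint>
#include <iostream>
#include <type_traits>

enum EPlain { E_A, E_B, E_C };
enum EFixed : std::int32_t { E_X, E_Y, E_Z };

// Pinning the underlying type guarantees 4 bytes; a plain enum is usually the same.
static_assert(sizeof(EFixed) == 4, "explicit underlying type pins the size to 4 bytes");
static_assert(std::is_same<std::underlying_type<EFixed>::type, std::int32_t>::value,
              "underlying type is exactly std::int32_t");

int main() {
    std::cout << sizeof(EPlain) << ' ' << sizeof(EFixed) << '\n'; // typically "4 4"
}
```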

Contributor Author

OK, good point. I'd actually switched them to int8_t originally, but it caused knock-on problems for printing them out, which created more code churn. I'll double check and revert if it doesn't make a difference.


The other problem: even if you force-shrink the data type like this (I agree int should be 32 bits, also on 64-bit platforms), you might not achieve the desired effect due to padding in the class instances where it is used.

(Do we correctly take padding into account in the memory usage calculations?)
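A minimal sketch of the padding concern (hypothetical members): shrinking one member doesn't shrink the object when an 8-byte member fixes the alignment:

```cpp
#include <cstdint>
#include <iostream>

struct Wide {
    double mean;        // 8 bytes, aligns the struct to 8
    std::int32_t type;  // 4 bytes + 4 bytes tail padding
};

struct Narrow {
    double mean;        // 8 bytes
    std::int8_t type;   // 1 byte + 7 bytes tail padding
};

int main() {
    std::cout << sizeof(Wide) << ' ' << sizeof(Narrow) << '\n'; // typically "16 16"
}
```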

Contributor Author

I also shuffled some variable orderings around, so we would have benefited from shrinking these. I'm doing a run now to verify whether we actually get anything from this; I suspect Dave is right, though, and we don't.

Contributor Author

Also, we're using sizeof to get the size of these classes in the memory calculation (which includes padding).

Contributor

@droberts195 droberts195 Jul 5, 2018

(Do we correctly take padding into account in the memory usage calculations?)

We take padding between members of a single object into account, because we use sizeof(obj) rather than sizeof(member1) + sizeof(member2) + ....

What we don't take into account is allocator overhead. For example, if we ask for a 27-byte chunk of memory and the OS gives us a 32-byte chunk instead then we'll count 27 if sizeof(obj) is 27. Nor do we take into account any allocator data structures, like pointers to the next block in a free list.

But model memory is only meant to be a rough estimate - getting it absolutely perfect would be incredibly hard and would involve having a very deep knowledge of what malloc() is doing on each platform.
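A small sketch of the accounting point (hypothetical type): sizeof(obj) counts inter-member padding, whereas summing member sizes doesn't, and neither sees allocator overhead:

```cpp
#include <cstddef>
#include <cstdint>
#include <iostream>

struct Model {
    std::int8_t dataType;  // 1 byte, followed by 7 bytes of padding
    double decayRate;      // 8 bytes
    std::int32_t count;    // 4 bytes + 4 bytes tail padding
};

int main() {
    std::size_t summed = sizeof(std::int8_t) + sizeof(double) + sizeof(std::int32_t);
    std::cout << "summed members: " << summed
              << ", sizeof(Model): " << sizeof(Model) << '\n';
    // Typically prints "summed members: 13, sizeof(Model): 24".
}
```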


👍

Maybe there are data structures which would benefit from disabling padding? If so, we could try it and see what the memory savings are and what the added runtime cost is (hopefully not that bad).
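A hedged sketch of what disabling padding looks like (hypothetical type, using #pragma pack): the padding bytes are saved, but members may end up misaligned, which can cost extra work on access:

```cpp
#include <cstdint>
#include <iostream>

struct Padded {
    std::int8_t type;  // 1 byte + 7 bytes padding
    double value;      // 8 bytes, naturally aligned
};

#pragma pack(push, 1)
struct Packed {
    std::int8_t type;  // 1 byte, no padding
    double value;      // 8 bytes, now only 1-byte aligned
};
#pragma pack(pop)

int main() {
    std::cout << sizeof(Padded) << ' ' << sizeof(Packed) << '\n'; // typically "16 9"
}
```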

Contributor Author

@tveasey tveasey Jul 5, 2018

That's an interesting suggestion. I reordered variables in this change, so it should be able to pack the types properly, but there may be other cases. Maybe something to look into down the line. The only slight caveat is that it has been worthwhile shrinking these because the upcoming change to model multi-bucket features will duplicate them per time series model. I suspect we may have smaller wins elsewhere.
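For illustration, the reordering effect mentioned here (hypothetical members): putting the widest members first lets the narrow ones occupy what would otherwise be padding:

```cpp
#include <cstdint>
#include <iostream>

struct Unordered {
    std::int8_t a;  // 1 byte + 7 bytes padding before the double
    double b;       // 8 bytes
    std::int8_t c;  // 1 byte + 7 bytes tail padding
};

struct Ordered {
    double b;       // 8 bytes first
    std::int8_t a;  // 1 byte
    std::int8_t c;  // 1 byte + 6 bytes tail padding
};

int main() {
    std::cout << sizeof(Unordered) << ' ' << sizeof(Ordered) << '\n'; // typically "24 16"
}
```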

@tveasey tveasey force-pushed the enhancement/reduce-distribution-model-memory branch from d0050a4 to de3f364 Compare July 5, 2018 14:41
Contributor

@edsavage edsavage left a comment

A bit late to the party, but LGTM.

@tveasey tveasey merged commit 9f1a9cf into elastic:master Jul 6, 2018
tveasey added a commit to tveasey/ml-cpp-1 that referenced this pull request Jul 23, 2018
tveasey added a commit that referenced this pull request Jul 23, 2018
@tveasey tveasey deleted the enhancement/reduce-distribution-model-memory branch April 10, 2019 10:48