[ML] Anomaly detection for multiple bucket features #175

Merged: 40 commits, Aug 17, 2018

Changes from 33 commits

Commits (40)

e9c6cd7
Initial work on features spanning multiple buckets
tveasey Jun 18, 2018
7de5096
Remove partially implemented multi-bucket data gathering support: i…
tveasey Jun 18, 2018
f7ef031
Given we now have explicit multi-bucket features, it's better to meas…
tveasey Jun 18, 2018
6412977
Bug fixes and compiler warnings
tveasey Jun 18, 2018
874fb1d
Support disabling multibucket feature modelling
tveasey Jun 18, 2018
2dd2357
More work
tveasey Jun 19, 2018
8545b01
Multivariate bulk features
tveasey Jun 20, 2018
73cac7f
Unit test bulk features. Improve weight calculation for contrast. For…
tveasey Jun 21, 2018
a8512fe
Towards fixing model tests
tveasey Jun 22, 2018
6c2d266
Finish up fixing tests + bug fixes
tveasey Jun 26, 2018
88a453b
Merge master
tveasey Jul 5, 2018
1cae145
Merge master
tveasey Jul 5, 2018
56f019c
We can't upgrade the anomaly model because the features have changed
tveasey Jul 5, 2018
37a04b6
Update test thresholds
tveasey Jul 5, 2018
b441515
Merge branch 'master' into feature/multiple-bucket-detection
tveasey Jul 6, 2018
c6210c3
Merge branch 'master' into feature/multiple-bucket-detection
tveasey Jul 13, 2018
9f798d9
The contrast feature wasn't helping enough in the average case. Also …
tveasey Jul 17, 2018
5dcd1ab
Bug fix
tveasey Jul 17, 2018
3b602ce
It is a good idea to compute weighted means since outliers otherwise …
tveasey Jul 17, 2018
8f611a6
Improve function for combining feature probabilities
tveasey Jul 18, 2018
092e803
Towards fixing unit tests
tveasey Jul 18, 2018
d880e4d
Update test expected result
tveasey Jul 19, 2018
93f41b1
Fix linux compilation
tveasey Jul 19, 2018
91e49cc
Another linux fix
tveasey Jul 19, 2018
db8ba71
Another linux fix
tveasey Jul 20, 2018
85a2db9
Formatting fixes
tveasey Jul 20, 2018
a379e11
Merge branch 'master' into feature/multiple-bucket-detection
tveasey Jul 24, 2018
a2a9532
Fix unit tests and some bug fixes to correlation models
tveasey Jul 24, 2018
af25b54
Fix unit test
tveasey Jul 24, 2018
fda494a
Tweak to feature probability aggregation
tveasey Jul 25, 2018
61e85dc
Formatting fix
tveasey Jul 25, 2018
ff74d76
Tidy up
tveasey Jul 25, 2018
70d819a
Merge branch 'master' into feature/multiple-bucket-detection
tveasey Jul 26, 2018
67b7205
Review comments and documentation
tveasey Aug 1, 2018
f2b01c1
Merge branch 'master' into feature/multiple-bucket-detection
tveasey Aug 8, 2018
3937f40
Support correlation between multi-bucket and bucket feature when aggr…
tveasey Aug 8, 2018
f4962df
Formatting fixes
tveasey Aug 8, 2018
c32772a
Review comments
tveasey Aug 11, 2018
20df95c
Rework multi-bucket features to better encapsulate functionality and …
tveasey Aug 16, 2018
878b038
Merge branch 'master' into feature/multiple-bucket-detection
tveasey Aug 17, 2018
6 changes: 1 addition & 5 deletions bin/autodetect/CCmdLineParser.cc

@@ -52,7 +52,6 @@ bool CCmdLineParser::parse(int argc,
bool& memoryUsage,
std::size_t& bucketResultsDelay,
bool& multivariateByFields,
std::string& multipleBucketspans,
bool& perPartitionNormalization,
TStrVec& clauseTokens) {
try {
@@ -118,7 +117,7 @@ bool CCmdLineParser::parse(int argc,
("multivariateByFields",
"Optional flag to enable multi-variate analysis of correlated by fields")
("multipleBucketspans", boost::program_options::value<std::string>(),
"Optional comma-separated list of additional bucketspans - must be direct multiples of the main bucketspan")
"Deprecated - ignored")

Review thread:

What's the reason to keep it? Are there clients using this?

Contributor:
I'm happy to completely remove this, although it does then commit us to removing the corresponding options on the Java side. @dimitris-athanasiou said he would do this. The functionality to silently drop any settings related to this feature (which was never documented and not fully tested) will then live in the Java code.

Contributor:
I think it's fine to drop everything from the C++ side. I'll prepare the Java side and we will merge them both in 6.5.

Contributor:
I raised elastic/elasticsearch#32496. @tveasey, this means you can completely remove the multipleBucketspans param in this PR.

Contributor Author:
OK, great. I'll tidy.

("perPartitionNormalization",
"Optional flag to enable per partition normalization")
;
@@ -234,9 +233,6 @@ bool CCmdLineParser::parse(int argc,
if (vm.count("multivariateByFields") > 0) {
multivariateByFields = true;
}
if (vm.count("multipleBucketspans") > 0) {
multipleBucketspans = vm["multipleBucketspans"].as<std::string>();
}
if (vm.count("perPartitionNormalization") > 0) {
perPartitionNormalization = true;
}
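To make the deprecation above concrete: the option stays registered so that existing command lines still parse, but its value is never read and the setting is silently dropped. Below is a minimal, self-contained sketch of that accept-and-ignore pattern with boost::program_options; it is not the real autodetect parser, and the option set is trimmed to two flags for illustration.

#include <boost/program_options.hpp>

#include <iostream>
#include <string>

int main(int argc, char** argv) {
    namespace po = boost::program_options;

    po::options_description desc("Options");
    desc.add_options()
        // Still registered so old command lines that pass it don't fail to parse...
        ("multipleBucketspans", po::value<std::string>(), "Deprecated - ignored")
        ("perPartitionNormalization",
         "Optional flag to enable per partition normalization");

    po::variables_map vm;
    po::store(po::parse_command_line(argc, argv, desc), vm);
    po::notify(vm);

    // ...but, unlike perPartitionNormalization, its value is never read back.
    bool perPartitionNormalization = vm.count("perPartitionNormalization") > 0;
    std::cout << "perPartitionNormalization = " << perPartitionNormalization << '\n';
    return 0;
}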
1 change: 0 additions & 1 deletion bin/autodetect/CCmdLineParser.h

@@ -64,7 +64,6 @@ class CCmdLineParser {
bool& memoryUsage,
std::size_t& bucketResultsDelay,
bool& multivariateByFields,
std::string& multipleBucketspans,
bool& perPartitionNormalization,
TStrVec& clauseTokens);

10 changes: 4 additions & 6 deletions bin/autodetect/Main.cc

@@ -88,7 +88,6 @@ int main(int argc, char** argv) {
bool memoryUsage(false);
std::size_t bucketResultsDelay(0);
bool multivariateByFields(false);
std::string multipleBucketspans;
bool perPartitionNormalization(false);
TStrVec clauseTokens;
if (ml::autodetect::CCmdLineParser::parse(
@@ -97,10 +96,9 @@
summaryCountFieldName, delimiter, lengthEncodedInput, timeField,
timeFormat, quantilesStateFile, deleteStateFiles, persistInterval,
maxQuantileInterval, inputFileName, isInputFileNamedPipe, outputFileName,
isOutputFileNamedPipe, restoreFileName, isRestoreFileNamedPipe,
persistFileName, isPersistFileNamedPipe, maxAnomalyRecords, memoryUsage,
bucketResultsDelay, multivariateByFields, multipleBucketspans,
perPartitionNormalization, clauseTokens) == false) {
isOutputFileNamedPipe, restoreFileName, isRestoreFileNamedPipe, persistFileName,
isPersistFileNamedPipe, maxAnomalyRecords, memoryUsage, bucketResultsDelay,
multivariateByFields, perPartitionNormalization, clauseTokens) == false) {
return EXIT_FAILURE;
}

@@ -147,7 +145,7 @@ int main(int argc, char** argv) {
ml::model::CAnomalyDetectorModelConfig modelConfig =
ml::model::CAnomalyDetectorModelConfig::defaultConfig(
bucketSpan, summaryMode, summaryCountFieldName, latency,
bucketResultsDelay, multivariateByFields, multipleBucketspans);
bucketResultsDelay, multivariateByFields);
modelConfig.perPartitionNormalization(perPartitionNormalization);
modelConfig.detectionRules(ml::model::CAnomalyDetectorModelConfig::TIntDetectionRuleVecUMapCRef(
fieldConfig.detectionRules()));
29 changes: 18 additions & 11 deletions include/maths/CBasicStatistics.h

@@ -149,6 +149,7 @@ class MATHS_EXPORT CBasicStatistics {
template<typename T, unsigned int ORDER>
struct SSampleCentralMoments : public std::unary_function<T, void> {
using TCoordinate = typename SCoordinate<T>::Type;
using TValue = T;

//! See core::CMemory.
static bool dynamicSizeAlwaysZero() {
@@ -1480,17 +1481,6 @@ class MATHS_EXPORT CBasicStatistics {
//! The set maximum.
COrderStatisticsStack<T, 1, GREATER> m_Max;
};

// Friends
template<typename T>
friend std::ostream&
operator<<(std::ostream& o, const CBasicStatistics::SSampleCentralMoments<T, 1u>&);
template<typename T>
friend std::ostream&
operator<<(std::ostream& o, const CBasicStatistics::SSampleCentralMoments<T, 2u>&);
template<typename T>
friend std::ostream&
operator<<(std::ostream& o, const CBasicStatistics::SSampleCentralMoments<T, 3u>&);
};

template<typename T>
@@ -1596,6 +1586,23 @@ template<typename U>
void CBasicStatistics::SSampleCentralMoments<T, ORDER>::add(const U& x, const TCoordinate& n) {
basic_statistics_detail::SCentralMomentsCustomAdd<U>::add(x, n, *this);
}

//! \brief Defines a promoted type for a SSampleCentralMoments.
//!
//! \see CTypeConversions.h for details.
template<typename T, unsigned int N>
struct SPromoted<CBasicStatistics::SSampleCentralMoments<T, N>> {
using Type = CBasicStatistics::SSampleCentralMoments<typename SPromoted<T>::Type, N>;
};

//! \brief Defines SSampleCentralMoments on a suitable floating point type.
//!
//! \see CTypeConversions.h for details.
template<typename T, unsigned int N, typename U>
struct SFloatingPoint<CBasicStatistics::SSampleCentralMoments<T, N>, U> {
using Type =
CBasicStatistics::SSampleCentralMoments<typename SFloatingPoint<T, U>::Type, N>;
};
}
}

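The SPromoted and SFloatingPoint specializations added above let generic code rebuild a moments accumulator on a promoted or floating point value type (for example, re-expressing a float-backed mean accumulator on double before precision-sensitive arithmetic). The sketch below reproduces the trait pattern with simplified stand-in types; it is not the library's CTypeConversions.h, and the names are only meant to mirror the idea.

#include <iostream>
#include <type_traits>

// Simplified stand-in for an accumulator templated on its value type.
template<typename T, unsigned int ORDER>
struct SSampleCentralMoments {
    using TValue = T;
    T s_Moments[ORDER] = {};
};

// By default a type "promotes" to itself...
template<typename T>
struct SPromoted { using Type = T; };
// ...but float promotes to double.
template<>
struct SPromoted<float> { using Type = double; };
// Rebuild the accumulator on the promoted value type.
template<typename T, unsigned int N>
struct SPromoted<SSampleCentralMoments<T, N>> {
    using Type = SSampleCentralMoments<typename SPromoted<T>::Type, N>;
};

// Map a type onto a chosen floating point type U.
template<typename T, typename U>
struct SFloatingPoint { using Type = U; };
// Rebuild the accumulator on that floating point value type.
template<typename T, unsigned int N, typename U>
struct SFloatingPoint<SSampleCentralMoments<T, N>, U> {
    using Type = SSampleCentralMoments<typename SFloatingPoint<T, U>::Type, N>;
};

int main() {
    using TFloatMean = SSampleCentralMoments<float, 1u>;
    using TPromoted = SPromoted<TFloatMean>::Type;              // value type is double
    using TAsDouble = SFloatingPoint<TFloatMean, double>::Type; // likewise
    std::cout << std::is_same<TPromoted, TAsDouble>::value << '\n'; // prints 1
    return 0;
}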
20 changes: 20 additions & 0 deletions include/maths/CBasicStatisticsPersist.h

@@ -48,6 +48,16 @@ template<typename T, std::size_t N>
bool stringToType(const std::string& str, CSymmetricMatrixNxN<T, N>& value) {
return value.fromDelimited(str);
}
//! Function to do conversion from string to a vector.
template<typename T>
bool stringToType(const std::string& str, CVector<T>& value) {
return value.fromDelimited(str);
}
//! Function to do conversion from string to a symmetric matrix.
template<typename T>
bool stringToType(const std::string& str, CSymmetricMatrix<T>& value) {
return value.fromDelimited(str);
}

//! Function to do conversion to a string.
template<typename T>
@@ -72,6 +82,16 @@
inline std::string typeToString(const CSymmetricMatrixNxN<T, N>& value) {
return value.toDelimited();
}
//! Function to do conversion to a string from a vector.
template<typename T>
inline std::string typeToString(const CVector<T>& value) {
return value.toDelimited();
}
//! Function to do conversion to a string from a symmetric matrix.
template<typename T>
inline std::string typeToString(const CSymmetricMatrix<T>& value) {
return value.toDelimited();
}
}

template<typename T, unsigned int ORDER>
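The new stringToType/typeToString overloads simply forward to each type's fromDelimited/toDelimited pair, so dynamically sized vectors and matrices persist through the same entry points as the fixed-size types earlier in the file. The following self-contained sketch shows that delimited round-trip pattern with a toy vector type; it is illustrative only and does not use the library's CVector.

#include <cstddef>
#include <iostream>
#include <sstream>
#include <string>
#include <vector>

// Toy stand-in for a vector type that knows how to (de)serialise itself.
struct ToyVector {
    std::vector<double> values;

    std::string toDelimited() const {
        std::ostringstream out;
        for (std::size_t i = 0; i < values.size(); ++i) {
            out << (i == 0 ? "" : ":") << values[i];
        }
        return out.str();
    }

    bool fromDelimited(const std::string& str) {
        values.clear();
        std::istringstream in(str);
        std::string token;
        while (std::getline(in, token, ':')) {
            values.push_back(std::stod(token));
        }
        return true;
    }
};

// Generic persistence entry points delegate to the type's own methods.
inline std::string typeToString(const ToyVector& value) {
    return value.toDelimited();
}
inline bool stringToType(const std::string& str, ToyVector& value) {
    return value.fromDelimited(str);
}

int main() {
    ToyVector original{{1.5, 2.0, -3.25}};
    std::string persisted = typeToString(original); // "1.5:2:-3.25"
    ToyVector restored;
    stringToType(persisted, restored);              // round trip back to the values
    std::cout << persisted << " -> " << restored.values.size() << " values\n";
    return 0;
}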
18 changes: 12 additions & 6 deletions include/maths/CLinearAlgebra.h

@@ -928,7 +928,6 @@ class CVectorNx1 : private boost::equality_comparable< CVectorNx1<T, N>,

public:
using TArray = T[N];
using TVec = std::vector<T>;
using TBoostArray = boost::array<T, N>;
using TConstIterator = typename TBoostArray::const_iterator;

}

//! Construct from a boost array.
explicit CVectorNx1(const boost::array<T, N>& a) {
template<typename U>
explicit CVectorNx1(const boost::array<U, N>& a) {
for (std::size_t i = 0u; i < N; ++i) {
TBase::m_X[i] = a[i];
}
}

//! Construct from a vector.
explicit CVectorNx1(const TVec& v) {
template<typename U>
explicit CVectorNx1(const std::vector<U>& v) {
for (std::size_t i = 0u; i < N; ++i) {
TBase::m_X[i] = v[i];
}
}

//! Construct from a vector.
explicit CVectorNx1(const core::CSmallVectorBase<T>& v) {
template<typename U>
explicit CVectorNx1(const core::CSmallVectorBase<U>& v) {
for (std::size_t i = 0u; i < N; ++i) {
TBase::m_X[i] = v[i];
}
@@ -1244,10 +1246,14 @@ class CVector : private boost::equality_comparable< CVector<T>,
}

//! Construct from a vector.
explicit CVector(const TArray& v) { TBase::m_X = v; }
template<typename U>
explicit CVector(const std::vector<U>& v) {
TBase::m_X = v;
}

//! Construct from a vector.
explicit CVector(const core::CSmallVectorBase<T>& v) {
template<typename U>
explicit CVector(const core::CSmallVectorBase<U>& v) {
TBase::m_X.assign(v.begin(), v.end());
}

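Templating the constructors on the source element type U is what allows a double-valued CVectorNx1 or CVector to be built directly from, say, a std::vector<float> or a float-backed small vector without copying into a matching container first. A minimal sketch of that idea with a simplified fixed-size vector (not the real CVectorNx1) follows.

#include <cstddef>
#include <iostream>
#include <vector>

// Simplified fixed-size vector, just to illustrate the constructor change.
template<typename T, std::size_t N>
class SimpleVectorNx1 {
public:
    // Accept any element type U convertible to T, e.g. float -> double.
    template<typename U>
    explicit SimpleVectorNx1(const std::vector<U>& v) {
        for (std::size_t i = 0; i < N; ++i) {
            m_X[i] = static_cast<T>(v[i]);
        }
    }

    const T& operator()(std::size_t i) const { return m_X[i]; }

private:
    T m_X[N];
};

int main() {
    std::vector<float> storage{1.0f, 2.5f, 4.0f};
    // With the non-template constructor this would have required std::vector<double>.
    SimpleVectorNx1<double, 3> x(storage);
    std::cout << x(0) + x(1) + x(2) << '\n'; // 7.5
    return 0;
}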
78 changes: 55 additions & 23 deletions include/maths/CModel.h

@@ -104,8 +104,6 @@ class MATHS_EXPORT CModelAddSamplesParams {
using TDouble2VecWeightsAryVec = std::vector<maths_t::TDouble2VecWeightsAry>;

public:
CModelAddSamplesParams();

//! Set whether or not the data are integer valued.
CModelAddSamplesParams& integer(bool integer);
//! Get the data type.
@@ -133,15 +131,15 @@

private:
//! The data type.
maths_t::EDataType m_Type;
maths_t::EDataType m_Type = maths_t::E_MixedData;
//! True if the data are non-negative false otherwise.
bool m_IsNonNegative;
bool m_IsNonNegative = false;
//! The propagation interval.
double m_PropagationInterval;
double m_PropagationInterval = 1.0;
//! The trend sample weights.
const TDouble2VecWeightsAryVec* m_TrendWeights;
const TDouble2VecWeightsAryVec* m_TrendWeights = nullptr;
//! The prior sample weights.
const TDouble2VecWeightsAryVec* m_PriorWeights;
const TDouble2VecWeightsAryVec* m_PriorWeights = nullptr;
};

//! \brief The extra parameters needed by CModel::probability.
@@ -178,6 +176,8 @@ class MATHS_EXPORT CModelProbabilityParams {

//! Add whether a value's bucket is empty.
CModelProbabilityParams& addBucketEmpty(const TBool2Vec& empty);
//! Set whether or not the values' bucket is empty.
CModelProbabilityParams& bucketEmpty(const TBool2Vec1Vec& empty);
//! Get whether the values' bucket is empty.
const TBool2Vec1Vec& bucketEmpty() const;

@@ -200,14 +200,19 @@
//! Get the most anomalous correlate if there is one.
TOptionalSize mostAnomalousCorrelate() const;

//! Set whether or not to update the anomaly model.
CModelProbabilityParams& updateAnomalyModel(bool update);
//! Get whether or not to update the anomaly model.
bool updateAnomalyModel() const;
//! Set whether or not to use bulk features.
CModelProbabilityParams& useBulkFeatures(bool use);
//! Get whether or not to use bulk features.
bool useBulkFeatures() const;

//! Set whether or not to use the anomaly model.
CModelProbabilityParams& useAnomalyModel(bool use);
//! Get whether or not to use the anomaly model.
bool useAnomalyModel() const;

private:
//! The entity tag (if relevant otherwise 0).
std::size_t m_Tag;
std::size_t m_Tag = 0;
//! The coordinates' probability calculations.
TProbabilityCalculation2Vec m_Calculations;
//! The confidence interval to use when detrending.
@@ -220,8 +225,41 @@
TSize2Vec m_Coordinates;
//! The most anomalous coordinate (if there is one).
TOptionalSize m_MostAnomalousCorrelate;
//! Whether or not to update the anomaly model.
bool m_UpdateAnomalyModel;
//! Whether or not to use bulk features.
bool m_UseBulkFeatures = true;
//! Whether or not to use the anomaly model.
bool m_UseAnomalyModel = true;
};

//! \brief Describes the result of the model probability calculation.
struct MATHS_EXPORT SModelProbabilityResult {
using TDouble4Vec = core::CSmallVector<double, 4>;
using TSize1Vec = core::CSmallVector<std::size_t, 1>;
using TTail2Vec = core::CSmallVector<maths_t::ETail, 2>;

//! \brief Wraps up a feature label and probability.
struct MATHS_EXPORT SFeatureProbability {
using TStrCRef = boost::reference_wrapper<const std::string>;
SFeatureProbability();
SFeatureProbability(const std::string& label, double probability);
TStrCRef s_Label;
double s_Probability = 1.0;
};
using TFeatureProbability4Vec = core::CSmallVector<SFeatureProbability, 4>;

//! The overall result probability.
double s_Probability = 1.0;
//! True if the probability depends on the correlation between two
//! time series and false otherwise.
bool s_Conditional = false;
//! The probabilities for each individual feature.
TFeatureProbability4Vec s_FeatureProbabilities;
//! The tail of the current bucket probability.
TTail2Vec s_Tail;
//! The identifier of the time series correlated with this one which
//! has the smallest probability in the current bucket (if and only
//! if the result depends on the correlation structure).
TSize1Vec s_MostAnomalousCorrelate;
};

//! \brief The model interface.
@@ -355,10 +393,7 @@ class MATHS_EXPORT CModel {
virtual bool probability(const CModelProbabilityParams& params,
const TTime2Vec1Vec& time,
const TDouble2Vec1Vec& value,
double& probability,
TTail2Vec& tail,
bool& conditional,
TSize1Vec& mostAnomalousCorrelate) const = 0;
SModelProbabilityResult& result) const = 0;

//! Get the Winsorisation weight to apply to \p value,
//! if appropriate.
@@ -499,14 +534,11 @@ class MATHS_EXPORT CModelStub : public CModel {
const TForecastPushDatapointFunc& forecastPushDataPointFunc,
std::string& messageOut);

//! Returns 1.0.
//! Returns true.
virtual bool probability(const CModelProbabilityParams& params,
const TTime2Vec1Vec& time,
const TDouble2Vec1Vec& value,
double& probability,
TTail2Vec& tail,
bool& conditional,
TSize1Vec& mostAnomalousCorrelate) const;
SModelProbabilityResult& result) const;

//! Returns empty.
virtual TDouble2Vec
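With the interface change above, CModel::probability reports through a single SModelProbabilityResult rather than four separate output parameters, which also gives the per-feature (bucket and multi-bucket) probabilities somewhere to live. The fragment below is a hedged sketch of how a call site adapts; the include path, the ml::maths namespace qualification, and the surrounding helper are assumptions for illustration, not code from this PR.

#include <maths/CModel.h> // assumed include path for SModelProbabilityResult

#include <iostream>

// Hypothetical helper: TModel stands in for any concrete CModel implementation,
// and params/time/value are assumed to be prepared exactly as before the change.
template<typename TModel, typename TParams, typename TTime, typename TValue>
void reportProbability(const TModel& model, const TParams& params,
                       const TTime& time, const TValue& value) {
    ml::maths::SModelProbabilityResult result;
    if (model.probability(params, time, value, result) == false) {
        return;
    }
    // One overall probability plus a breakdown by feature replaces the old
    // (probability, tail, conditional, mostAnomalousCorrelate) out-parameters.
    std::cout << "overall p = " << result.s_Probability
              << ", conditional = " << result.s_Conditional << '\n';
    for (const auto& feature : result.s_FeatureProbabilities) {
        std::cout << "  " << feature.s_Label.get()
                  << " p = " << feature.s_Probability << '\n';
    }
}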