-
Notifications
You must be signed in to change notification settings - Fork 76
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Feature/18 implement hyfd #139
Conversation
Closes #18, leaving a todo list:
|
ba449a2
to
8caffe3
Compare
85c4b26
to
2c6fdc7
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
clang-tidy made some suggestions
There were too many comments to post at once. Showing the first 25 out of 38. Check the log or trigger a new build to see more.
|
||
namespace algos::hyfd { | ||
|
||
class Sampler::Efficiency { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
warning: use of undeclared identifier 'Sampler' [clang-diagnostic-error]
class Sampler::Efficiency {
^
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Интересно, а как оно собирается? Перед включением этого хедера всегда (буквально выше) включается хедер, где объявлен Sampler
? Не очень какая-то зависимость получается
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Это я задумывал как приватный класс у Sampler
, в Sampler
лежит forward declaration of Efficiency
. Таким образом единственный файл, где он должен использоваться, это sampler.cpp
, а остальные места его не надо включать.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
То же с FDValidations
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Возможно, тогда стоит этот код просто целиком перенести в sampler.cpp
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Можно, но хочется по-максимуму вынести выносимое, потому что validator.cpp
, например, довольно тяжёлый, не хочется добавлять туда что-либо.
Хотя и много хедеров, которые используются лишь в одном файле, тоже наверное не очень. Хз, по-моему, что так, что так будут свои плюсы и минусы, и бы оставил как есть.
Ну, могу сделать директории validator
, sampler
и переместить туда всё, им принадлежащее, чтобы как-то виузально хотя бы обозначить, что куда относится.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
хм, мне наоборот кажется, что лучше держать локально, если оно нужно только в validator.cpp
, чем выносить в отдельный файл. Но можно в целом и в директории объеденить
2c6fdc7
to
18bcf12
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
clang-tidy made some suggestions
41f75ca
to
83bc1e4
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
clang-tidy made some suggestions
83bc1e4
to
b26452e
Compare
clang-tidy review says "All clean, LGTM! 👍" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
src/model/fd.h
Outdated
// unsigned int Fletcher16() const; | ||
[[nodiscard]] RawFD ToRawFD() const { | ||
return {lhs_.GetColumnIndices(), rhs_.GetIndex()}; | ||
} | ||
}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Перенос строки (:
src/algorithms/hyfd/fd_validations.h
Outdated
unsigned count_validations_ = 0; | ||
unsigned count_intersections_ = 0; | ||
|
||
void add(FDValidations&& other) noexcept { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Может выкинуть bad_alloc
src/algorithms/hyfd/sampler.cpp
Outdated
assert(first_record_id < compressed_records_->size() && | ||
second_record_id < compressed_records_->size()); | ||
|
||
for (size_t i = 0; i < (*compressed_records_)[0].size(); ++i) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Кажется, все ещё не исправлено
NonFDs essentially stores agree sets, i.e. such a column combination that for some two records their projections are equal, while complements differ. It is filled by the Sampler and read by the Inductor.
Sampler uses a deterministic focused sampling approach to efficiently get most promising agree sets for passing them to the Inductor later.
FDTree is a prefix tree, a core structure of HyFD which represents the FD search space. It is used by the Inductor and the Validator.
Inductor uses the agree set collection received from the Sampler to enrich the FD prefix tree by specializing invalidated candidates.
Validator performs breadth-first bottom-up traverse of the FD prefix tree, validating the FDs and specializing incorrect candidates.
b26452e
to
a47d8ed
Compare
- disable identifier length and recursion checks - allow classes with all variables being public - include libraries as system directories for compiler to run less checks on them
a47d8ed
to
a5d5dff
Compare
clang-tidy review says "All clean, LGTM! 👍" |
1 similar comment
clang-tidy review says "All clean, LGTM! 👍" |
HyFD is a hybrid functional dependency discovery algorithm, employing both row-efficient
sample-based agree set generation and column-efficient lattice traversal.
The algorithm has three major components: Sampler, Inductor and Validator --- the former two
representing row-efficient step. Sampler acquires a collection of agree sets using a special
deterministic sampling approach, then Inductor uses this data to enrich an FD prefix tree. This
concludes the first step and the execution moves to Validator, which performs breadth-first
bottom-up traverse of the FDTree, validating the FDs and specializing incorrect candidates.
It is difficult to come up with a criteria or an algorithm for choosing the optimal point to move
from the first step to the second, therefore HyFD uses a notion of efficiency to track how
efficient current step is performing and switches to the other step in case it falls beneath a
configured threshold. Each step's results are useful for the other step: Sampler-Inductor
specialize the prefix tree, which reduces the number or costly PLI checks Validator has to
perform, while Validator yields a collection of row pairs, which invalidated a candidate and can
be used by Sampler to obtain a new agree set.
Thorsten Papenbrock and Felix Naumann. 2016. A Hybrid Approach to Functional Dependency
Discovery. In Proceedings of the 2016 International Conference on Management of Data (SIGMOD
'16). Association for Computing Machinery, New York, NY, USA, 821–833.
https://doi.org/10.1145/2882903.2915203