Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feature/18 implement hyfd #139

Merged
merged 11 commits into from
Nov 26, 2022
Merged

Feature/18 implement hyfd #139

merged 11 commits into from
Nov 26, 2022

Conversation

Mstrutov
Copy link
Collaborator

@Mstrutov Mstrutov commented Nov 3, 2022

HyFD is a hybrid functional dependency discovery algorithm, employing both row-efficient
sample-based agree set generation and column-efficient lattice traversal.

The algorithm has three major components: Sampler, Inductor and Validator --- the former two
representing row-efficient step. Sampler acquires a collection of agree sets using a special
deterministic sampling approach, then Inductor uses this data to enrich an FD prefix tree. This
concludes the first step and the execution moves to Validator, which performs breadth-first
bottom-up traverse of the FDTree, validating the FDs and specializing incorrect candidates.

It is difficult to come up with a criteria or an algorithm for choosing the optimal point to move
from the first step to the second, therefore HyFD uses a notion of efficiency to track how
efficient current step is performing and switches to the other step in case it falls beneath a
configured threshold. Each step's results are useful for the other step: Sampler-Inductor
specialize the prefix tree, which reduces the number or costly PLI checks Validator has to
perform, while Validator yields a collection of row pairs, which invalidated a candidate and can
be used by Sampler to obtain a new agree set.

Thorsten Papenbrock and Felix Naumann. 2016. A Hybrid Approach to Functional Dependency
Discovery. In Proceedings of the 2016 International Conference on Management of Data (SIGMOD
'16). Association for Computing Machinery, New York, NY, USA, 821–833.
https://doi.org/10.1145/2882903.2915203

@Mstrutov
Copy link
Collaborator Author

Mstrutov commented Nov 3, 2022

Closes #18, leaving a todo list:

  • parallelize validator using ParallelFor
  • implement the Guardian component
  • investigate performance issues, optimize to beat Metanome
  • consider implementing a bitset custom iterator to simplify bitset loops, which are used very frequently

@Mstrutov Mstrutov force-pushed the feature/18-implement-hyfd branch from ba449a2 to 8caffe3 Compare November 3, 2022 07:28
@Mstrutov Mstrutov marked this pull request as ready for review November 3, 2022 10:49
@Mstrutov Mstrutov requested a review from polyntsov November 3, 2022 10:49
@polyntsov polyntsov added the feature Provides new functionality label Nov 7, 2022
@Mstrutov Mstrutov force-pushed the feature/18-implement-hyfd branch 2 times, most recently from 85c4b26 to 2c6fdc7 Compare November 14, 2022 06:42
Copy link

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

clang-tidy made some suggestions

There were too many comments to post at once. Showing the first 25 out of 38. Check the log or trigger a new build to see more.


namespace algos::hyfd {

class Sampler::Efficiency {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

warning: use of undeclared identifier 'Sampler' [clang-diagnostic-error]

class Sampler::Efficiency {
      ^

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Интересно, а как оно собирается? Перед включением этого хедера всегда (буквально выше) включается хедер, где объявлен Sampler? Не очень какая-то зависимость получается

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Это я задумывал как приватный класс у Sampler, в Sampler лежит forward declaration of Efficiency. Таким образом единственный файл, где он должен использоваться, это sampler.cpp, а остальные места его не надо включать.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

То же с FDValidations

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Возможно, тогда стоит этот код просто целиком перенести в sampler.cpp?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Можно, но хочется по-максимуму вынести выносимое, потому что validator.cpp, например, довольно тяжёлый, не хочется добавлять туда что-либо.

Хотя и много хедеров, которые используются лишь в одном файле, тоже наверное не очень. Хз, по-моему, что так, что так будут свои плюсы и минусы, и бы оставил как есть.

Ну, могу сделать директории validator, sampler и переместить туда всё, им принадлежащее, чтобы как-то виузально хотя бы обозначить, что куда относится.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

хм, мне наоборот кажется, что лучше держать локально, если оно нужно только в validator.cpp, чем выносить в отдельный файл. Но можно в целом и в директории объеденить

@Mstrutov Mstrutov force-pushed the feature/18-implement-hyfd branch from 2c6fdc7 to 18bcf12 Compare November 21, 2022 05:39
Copy link

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

clang-tidy made some suggestions

@Mstrutov Mstrutov force-pushed the feature/18-implement-hyfd branch 3 times, most recently from 41f75ca to 83bc1e4 Compare November 22, 2022 05:07
Copy link

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

clang-tidy made some suggestions

@Mstrutov Mstrutov force-pushed the feature/18-implement-hyfd branch from 83bc1e4 to b26452e Compare November 22, 2022 05:23
@github-actions
Copy link

clang-tidy review says "All clean, LGTM! 👍"

@Mstrutov Mstrutov requested a review from polyntsov November 22, 2022 05:36
Copy link
Collaborator

@polyntsov polyntsov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

src/model/fd.h Outdated
// unsigned int Fletcher16() const;
[[nodiscard]] RawFD ToRawFD() const {
return {lhs_.GetColumnIndices(), rhs_.GetIndex()};
}
};
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Перенос строки (:

unsigned count_validations_ = 0;
unsigned count_intersections_ = 0;

void add(FDValidations&& other) noexcept {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Может выкинуть bad_alloc

assert(first_record_id < compressed_records_->size() &&
second_record_id < compressed_records_->size());

for (size_t i = 0; i < (*compressed_records_)[0].size(); ++i) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Кажется, все ещё не исправлено

NonFDs essentially stores agree sets, i.e. such a column combination that
for some two records their projections are equal, while complements differ.
It is filled by the Sampler and read by the Inductor.
Sampler uses a deterministic focused sampling approach to efficiently
get most promising agree sets for passing them to the Inductor later.
FDTree is a prefix tree, a core structure of HyFD which represents the
FD search space. It is used by the Inductor and the Validator.
Inductor uses the agree set collection received from the Sampler to enrich
the FD prefix tree by specializing invalidated candidates.
Validator performs breadth-first bottom-up traverse of the FD prefix tree,
validating the FDs and specializing incorrect candidates.
@Mstrutov Mstrutov force-pushed the feature/18-implement-hyfd branch from b26452e to a47d8ed Compare November 26, 2022 14:54
- disable identifier length and recursion checks
- allow classes with all variables being public
- include libraries as system directories for compiler to run less checks
 on them
@Mstrutov Mstrutov force-pushed the feature/18-implement-hyfd branch from a47d8ed to a5d5dff Compare November 26, 2022 14:57
@github-actions
Copy link

clang-tidy review says "All clean, LGTM! 👍"

1 similar comment
@github-actions
Copy link

clang-tidy review says "All clean, LGTM! 👍"

@polyntsov polyntsov merged commit d5f10f3 into main Nov 26, 2022
@polyntsov polyntsov deleted the feature/18-implement-hyfd branch November 26, 2022 15:38
This was referenced Aug 9, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature Provides new functionality
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants