-
Notifications
You must be signed in to change notification settings - Fork 72
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add GFD mining #465
base: main
Are you sure you want to change the base?
Add GFD mining #465
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
clang-tidy made some suggestions
There were too many comments to post at once. Showing the first 25 out of 29. Check the log or trigger a new build to see more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
clang-tidy made some suggestions
c3eeecf
to
a59e3cc
Compare
: query_(query_), iso_(iso_), res_(res_) {} | ||
|
||
template <typename CorrespondenceMap1To2, typename CorrespondenceMap2To1> | ||
bool operator()(CorrespondenceMap1To2 f, CorrespondenceMap2To1) const { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Second parameter is not used
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Boost Graph Library requires such syntax from callback function. Unfortunately, if I change the function signature, the code will not compile
Add PR description |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
clang-tidy made some suggestions
You are doing great! Please don't forget to notify when you are done with changes requested. Also please mark conversations if they are resolved |
60aeadc
to
9ce90a3
Compare
Token name_token = Token(i, name.first); | ||
for (auto& value : name.second) { | ||
Token value_token = Token(-1, value); | ||
result.push_back(Literal(name_token, value_token)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
result.reserve(name.second.size() * attrs_info.at(label).size())
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unfortunately, name.second
can have different number of values, depending on name.first
. The reserve may not work correctly.
11d6a2e
to
13b76ae
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
clang-tidy made some suggestions
44b4271
to
b6e4f89
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I looked through the pull request a bit and left comments on the things that caught my eye.
It would be nice if you try to improve the code yourself some more. It might be helpful to read google's coding style, which we try to follow.
Also you can use some tools. For example, cppcheck
can point out errors and possible improvements
class CompareCallback { | ||
private: | ||
graph_t const& query_; | ||
Embedding& iso_; | ||
bool& res_; | ||
|
||
public: | ||
CompareCallback(graph_t const& query, Embedding& iso, bool& res) | ||
: query_(query), iso_(iso), res_(res) {} | ||
|
||
template <typename CorrespondenceMap1To2, typename CorrespondenceMap2To1> | ||
bool operator()(CorrespondenceMap1To2 f, CorrespondenceMap2To1) const { | ||
BGL_FORALL_VERTICES_T(u, query_, graph_t) { | ||
iso_.emplace(u, get(f, u)); | ||
} | ||
res_ = true; | ||
return false; | ||
} | ||
}; | ||
|
||
struct VCompare { | ||
graph_t const& query; | ||
graph_t const& graph; | ||
|
||
bool operator()(vertex_t const& fr, vertex_t const& to) const { | ||
return query[fr].attributes.at("label") == graph[to].attributes.at("label"); | ||
} | ||
}; | ||
|
||
struct ECompare { | ||
graph_t const& query; | ||
graph_t const& graph; | ||
|
||
bool operator()(edge_t const& fr, edge_t const& to) const { | ||
return query[fr].label == graph[to].label; | ||
} | ||
}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we use lambdas instead of these structs?
Define them directly where you need them.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unfortunately, the boost graph library requires exactly this structure for the code to work correctly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unfortunately, the boost graph library requires exactly this structure for the code to work correctly.
Are you sure about that? Cause this works fine with lambdas: https://godbolt.org/z/63ooq15nG
src/core/algorithms/gfd/gfd.h
Outdated
std::string ToString(); | ||
}; | ||
}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add blank line at the end. Also applies to other files
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the CI check complains when I do this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
clang-tidy made some suggestions
69c2d3f
to
10ff8ca
Compare
Add class for GFD miner
Add minimal GFD and GFD with multiple conclusion tests
Added two examples with searching for dependencies in small graphs.
|
Change commit name |
// Defines a specific attribute of the pattern. | ||
// The first element is the index of the vertex, | ||
// the second is the name of the attribute. | ||
// An alias for user convenience. | ||
using Token = std::pair<int, std::string>; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are indexes supposed to have negative values?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I encode the constant token with the value -1
class CmpCallback { | ||
private: | ||
bool& res_; | ||
|
||
public: | ||
CmpCallback(bool& res) : res_(res) {} | ||
|
||
template <typename CorrespondenceMap1To2, typename CorrespondenceMap2To1> | ||
bool operator()(CorrespondenceMap1To2, CorrespondenceMap2To1) const { | ||
res_ = true; | ||
return false; | ||
} | ||
}; | ||
|
||
struct VCmp { | ||
graph_t const& lhs; | ||
graph_t const& rhs; | ||
|
||
bool operator()(vertex_t const& fr, vertex_t const& to) const { | ||
return lhs[fr].attributes.at("label") == rhs[to].attributes.at("label"); | ||
} | ||
}; | ||
|
||
struct ECmp { | ||
graph_t const& lhs; | ||
graph_t const& rhs; | ||
|
||
bool operator()(edge_t const& fr, edge_t const& to) const { | ||
return lhs[fr].label == rhs[to].label; | ||
} | ||
}; | ||
|
||
bool IsSub(graph_t const& query, graph_t const& graph) { | ||
bool result = false; | ||
VCmp vcmp = {query, graph}; | ||
ECmp ecmp = {query, graph}; | ||
CmpCallback callback(result); | ||
boost::property_map<graph_t, boost::vertex_index_t>::type query_index_map = | ||
get(boost::vertex_index, query); | ||
boost::property_map<graph_t, boost::vertex_index_t>::type graph_index_map = | ||
get(boost::vertex_index, graph); | ||
std::vector<vertex_t> query_vertex_order = vertex_order_by_mult(query); | ||
boost::vf2_subgraph_iso(query, graph, callback, query_index_map, graph_index_map, | ||
query_vertex_order, ecmp, vcmp); | ||
return result; | ||
} | ||
|
||
} // namespace |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Have not done
std::vector<Literal> premises = {}; | ||
Gfd gfd = {pattern, premises, conclusion}; | ||
if (Validate(graph, gfd, embeddings, sigma)) { | ||
rules.push_back({{}, l}); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To avoid unnecessary temporary object creation
rules.push_back({{}, l}); | |
rules.emplace_back({}, l); |
Consider everywhere else
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the suggested line gives a compiler error:
no matching function for call to ‘std::vector<std::pair<std::vector<std::pair<std::pair<int, std::__cxx11::basic_string >, std::pair<int, std::__cxx11::basic_string > > >, std::pair<std::pair<int, std::__cxx11::basic_string >, std::pair<int, std::__cxx11::basic_string > > > >::emplace_back(, gfd::Literal&)’
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will review gfd_miner.cpp
tomorrow (:
|
||
bool ContainsLiteral(std::vector<Literal> const& literals, Literal const& l) { | ||
auto check = [&l](auto const& cur_lit) { return CompareLiterals(cur_lit, l); }; | ||
return std::any_of(literals.begin(), literals.end(), check); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
return std::any_of(literals.begin(), literals.end(), check); | |
return std::ranges::any_of(literals, check); |
More cleaner this way, gets rid of .begin()
/.end()
boilerplate
return false; | ||
} | ||
auto check = [&rhs](auto const& cur_lit) { return ContainsLiteral(rhs, cur_lit); }; | ||
return std::all_of(lhs.begin(), lhs.end(), check); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
return std::all_of(lhs.begin(), lhs.end(), check); | |
return std::ranges::all_of(lhs, check); |
|
||
#include <algorithm> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
#include <algorithm>
is not used in this file
class CmpCallback { | ||
private: | ||
bool& res_; | ||
|
||
public: | ||
CmpCallback(bool& res) : res_(res) {} | ||
|
||
template <typename CorrespondenceMap1To2, typename CorrespondenceMap2To1> | ||
bool operator()(CorrespondenceMap1To2, CorrespondenceMap2To1) const { | ||
res_ = true; | ||
return false; | ||
} | ||
}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using references as fields is not a good sign in any code.
Here we pass this FunctionObject to boost::vf2_subgraph_iso
as a callback that will be called with operator()
and set local variable bool result = false;
to true.
Can't we just pass simple lambda to it then?
Like
bool result = false;
auto callback = [&result](auto, auto) { result = true; return false; }
//...
boost::vf2_subgraph_iso(query, graph, callback, //...
//...
return result;
struct VCmp { | ||
graph_t const& lhs; | ||
graph_t const& rhs; | ||
|
||
bool operator()(vertex_t const& fr, vertex_t const& to) const { | ||
return lhs[fr].attributes.at("label") == rhs[to].attributes.at("label"); | ||
} | ||
}; | ||
|
||
struct ECmp { | ||
graph_t const& lhs; | ||
graph_t const& rhs; | ||
|
||
bool operator()(edge_t const& fr, edge_t const& to) const { | ||
return lhs[fr].label == rhs[to].label; | ||
} | ||
}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That is slighty better because we use const references as fields, but still, why don't just use lambdas?
auto vertex_cmp = [&query, &graph](vertex_t const& fr, vertex_t const& to) {
return query[fr].attributes.at("label") == graph[to].attributes.at("label");
};;
auto edge_cmp = [&query, &graph](edge_t const& fr, edge_t const& to) {
return query[fr].label == graph[to].label;
};
//pass vertex_cmp and edge_cmp to boost::vf2_subgraph_iso
bool Gfd::operator!=(Gfd const& gfd) const { | ||
return !(*this == gfd); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since you've defined operator==
, this code can be easily generated by compiler. Put that in .h file:
bool operator==(const Gfd& gfd) const;
bool operator!=(const Gfd& gfd) const = default;
and remove manual implementation
using Info = std::map<std::string, std::set<std::string>>; | ||
|
||
void NextSubset(std::vector<std::size_t>& indices, std::size_t const border) { | ||
if (indices.at(0) == border - indices.size()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if (indices.at(0) == border - indices.size()) { | |
if (indices.empty() || indices[0] == border - indices.size()) { |
|
||
void NextSubset(std::vector<std::size_t>& indices, std::size_t const border) { | ||
if (indices.at(0) == border - indices.size()) { | ||
indices = {}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
indices = {}; | |
indices.clear(); |
std::size_t index = 0; | ||
for (int i = static_cast<int>(indices.size()) - 1; i >= 0; --i) { | ||
if (indices.at(i) != border - indices.size() + static_cast<std::size_t>(i)) { | ||
index = static_cast<std::size_t>(i); | ||
break; | ||
} | ||
} | ||
indices.at(index)++; | ||
for (std::size_t i = index + 1; i < indices.size(); ++i) { | ||
indices.at(i) = indices.at(index) + i - index; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we can refactor that as one loop without static_cast
's:
std::size_t i = indices.size();
while (i > 0) {
--i;
if (indices[i] < border - indices.size() + i) {
++indices[i];
for (std::size_t j = i + 1; j < indices.size(); ++j) {
indices[j] = indices[j - 1] + 1;
}
return;
}
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
probably, I'm not sure whether we need to execute second loop if we failed to find index
in the first one
} | ||
|
||
template <typename T> | ||
std::vector<std::vector<T>> GetSubsets(std::vector<T> const& elements, std::size_t const n) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
std::vector<std::vector<T>> GetSubsets(std::vector<T> const& elements, std::size_t const n) { | |
std::vector<std::vector<T>> GetSubsets(std::vector<T> const& elements, std::size_t n) { |
Const here does nothing
if (elements.size() < n) { | ||
return {}; | ||
} | ||
std::vector<std::vector<T>> result = {}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just
std::vector<std::vector<T>> result = {}; | |
std::vector<std::vector<T>> result; |
is enough
std::vector<graph_t> new_patterns = {}; | ||
std::vector<Embeddings> new_embeddings_set = {}; | ||
std::vector<Rules> new_forbidden_rules_set = {}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
std::vector<graph_t> new_patterns = {}; | |
std::vector<Embeddings> new_embeddings_set = {}; | |
std::vector<Rules> new_forbidden_rules_set = {}; | |
std::vector<graph_t> new_patterns; | |
std::vector<Embeddings> new_embeddings_set; | |
std::vector<Rules> new_forbidden_rules_set; |
patterns = new_patterns; | ||
embeddings_set = new_embeddings_set; | ||
forbidden_rules_set = new_forbidden_rules_set; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
patterns = new_patterns; | |
embeddings_set = new_embeddings_set; | |
forbidden_rules_set = new_forbidden_rules_set; | |
patterns = std:move(new_patterns); | |
embeddings_set = std:move(new_embeddings_set); | |
forbidden_rules_set = std::move(new_forbidden_rules_set); |
static std::unique_ptr<algos::GfdMiner> CreateGfdMiningInstance( | ||
std::filesystem::path const& graph_path, TestConfig const& gfdConfig) { | ||
StdParamsMap option_map = {{config::names::kGraphData, graph_path}, | ||
{config::names::kGfdK, gfdConfig.k}, | ||
{config::names::kGfdSigma, gfdConfig.sigma}}; | ||
return algos::CreateAndLoadAlgorithm<GfdMiner>(option_map); | ||
} | ||
}; | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Avoid code duplication, here you can add an additional helper to execute and compare:
static void ExecuteAndCompare(const std::unique_ptr<algos::GfdMiner>& algorithm,
const std::vector<Gfd>& expected_gfds) {
algorithm->Execute();
const auto gfd_list = algorithm->GfdList();
ASSERT_EQ(expected_gfds.size(), gfd_list.size());
ASSERT_THAT(gfd_list, ::testing::ElementsAreArray(expected_gfds));
}
Also we can use matcher ::testing::ElementsAreArray
instead of manually comparing them with ASSERT_TRUE(x == y)
in a loop
std::unique_ptr<algos::GfdMiner> algorithm = | ||
CreateGfdMiningInstance(graph_path, {.k = 2, .sigma = 3}); | ||
algorithm->Execute(); | ||
std::vector<Gfd> gfd_list = algorithm->GfdList(); | ||
algorithm = CreateGfdMiningInstance(graph_path, {.k = 3, .sigma = 3}); | ||
algorithm->Execute(); | ||
std::size_t expected_size = gfd_list.size(); | ||
gfd_list = algorithm->GfdList(); | ||
ASSERT_EQ(expected_size, gfd_list.size()); | ||
for (std::size_t index = 0; index < gfd_list.size(); ++index) { | ||
ASSERT_TRUE(gfds.at(index) == gfd_list.at(index)); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we call CreateGfdMiningInstance
and Execute
two times in a row?
Should't this be just
std::unique_ptr<algos::GfdMiner> algorithm = | |
CreateGfdMiningInstance(graph_path, {.k = 2, .sigma = 3}); | |
algorithm->Execute(); | |
std::vector<Gfd> gfd_list = algorithm->GfdList(); | |
algorithm = CreateGfdMiningInstance(graph_path, {.k = 3, .sigma = 3}); | |
algorithm->Execute(); | |
std::size_t expected_size = gfd_list.size(); | |
gfd_list = algorithm->GfdList(); | |
ASSERT_EQ(expected_size, gfd_list.size()); | |
for (std::size_t index = 0; index < gfd_list.size(); ++index) { | |
ASSERT_TRUE(gfds.at(index) == gfd_list.at(index)); | |
} | |
std::unique_ptr<algos::GfdMiner> algorithm = CreateGfdMiningInstance(graph_path, {.k = 2, .sigma = 3}); | |
ExecuteAndCompare(algorithm, expected_gfds); |
?
std::unique_ptr<algos::GfdMiner> algorithm = | ||
CreateGfdMiningInstance(graph_path, {.k = 2, .sigma = 3}); | ||
algorithm->Execute(); | ||
std::vector<Gfd> gfd_list = algorithm->GfdList(); | ||
std::size_t expected_size = 1; | ||
ASSERT_EQ(expected_size, gfd_list.size()); | ||
for (std::size_t index = 0; index < gfd_list.size(); ++index) { | ||
ASSERT_TRUE(gfds.at(index) == gfd_list.at(index)); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same thing with all other tests:
std::unique_ptr<algos::GfdMiner> algorithm = | |
CreateGfdMiningInstance(graph_path, {.k = 2, .sigma = 3}); | |
algorithm->Execute(); | |
std::vector<Gfd> gfd_list = algorithm->GfdList(); | |
std::size_t expected_size = 1; | |
ASSERT_EQ(expected_size, gfd_list.size()); | |
for (std::size_t index = 0; index < gfd_list.size(); ++index) { | |
ASSERT_TRUE(gfds.at(index) == gfd_list.at(index)); | |
} | |
std::unique_ptr<algos::GfdMiner> algorithm = CreateGfdMiningInstance(graph_path, {.k = 2, .sigma = 3}); | |
ExecuteAndCompare(algorithm, gfds); |
This PR implements an algorithm for mining graph functional dependencies based on article "Discovering Graph Functional Dependencies" by Fan Wenfei, Hu Chunming, Liu Xueli, and Lu Ping. Algorithm, given an input graph, returns a set of dependencies satisfied on this graph. The algorithm also has two configurable parameters:
k
is the maximum number of vertices in the pattern of the mined dependency andsigma
is its minimum frequency.In addition, the PR implements the ability to run the algorithm in Python, and also contains examples of its use.