Implement separator validation #326
src/tests/test_util.cpp
Outdated
struct TestSeparatorValidationParam {
    std::filesystem::path csv_path;
    char separator;
    char expected;

    TestSeparatorValidationParam(std::filesystem::path const& path, char sep, char exp)
        : csv_path(path), separator(sep), expected(exp) {}
};
Let's take the path from the already existing CSV configs.
We also don't need the constructor.
Suggested change:
- struct TestSeparatorValidationParam {
-     std::filesystem::path csv_path;
-     char separator;
-     char expected;
-     TestSeparatorValidationParam(std::filesystem::path const& path, char sep, char exp)
-         : csv_path(path), separator(sep), expected(exp) {}
- };
+ struct TestSeparatorValidationParam {
+     CSVConfig csv_config;
+     char test_separator;
+     std::optional<char> expected_separator;
+ };
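To illustrate why the constructor can go: without a user-declared constructor the struct is an aggregate, so brace initialization fills the members in declaration order. This is only a sketch. The `CSVConfig` below is a hypothetical stand-in (the project's real type may have different members), and the file path and the `MakeExampleParam` helper are made up for the example.

```cpp
#include <cassert>
#include <filesystem>
#include <optional>

// Hypothetical stand-in for the project's CSVConfig; the real type may differ.
struct CSVConfig {
    std::filesystem::path path;
    char separator;
    bool has_header;
};

// No user-declared constructor, so this is an aggregate.
struct TestSeparatorValidationParam {
    CSVConfig csv_config;
    char test_separator;
    std::optional<char> expected_separator;
};

// Brace initialization fills the members in declaration order.
// The path is an invented example, not a file from the repository.
inline TestSeparatorValidationParam MakeExampleParam() {
    return {CSVConfig{"input_data/example.csv", ',', true}, ';', ','};
}
```

The same braces work inside `::testing::Values(...)`, so each test case stays a one-liner.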
src/tests/test_util.cpp
Outdated
    EXPECT_EQ(actual.value_or('\0'), p.expected);
}

INSTANTIATE_TEST_SUITE_P(
I think we should also add some more complex cases for deriving the separator. Something like:
;,;,;
;,;,;
In this case the valid separators are ';' and ','.
Also check validation for CSVs with quotes:
'a..b',c,d
e,d,'f..g'
In this case, only ',' is a valid separator.
src/tests/test_util.cpp
Outdated
TEST_P(TestSeparatorValidation, Default) {
    TestSeparatorValidationParam const& p = GetParam();
    std::optional<char> actual = util::ValidateSeparator(p.csv_path, p.separator);
    EXPECT_EQ(actual.value_or('\0'), p.expected);
With expected_separator stored as std::optional<char>, we can write here:
- EXPECT_EQ(actual.value_or('\0'), p.expected);
+ EXPECT_EQ(actual, p.expected_separator);
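A small sketch of what the direct comparison buys: `value_or('\0')` folds "no separator found" into the character `'\0'`, while comparing the optionals keeps the two cases apart. The helper names below are hypothetical and exist only for this illustration.

```cpp
#include <cassert>
#include <optional>

// Old check: an empty optional becomes '\0' before the comparison.
inline bool MatchesViaValueOr(std::optional<char> actual, char expected) {
    return actual.value_or('\0') == expected;
}

// Suggested check: the optionals are compared as they are,
// so std::nullopt only equals std::nullopt.
inline bool MatchesDirectly(std::optional<char> actual, std::optional<char> expected) {
    return actual == expected;
}
```

With the first helper an empty result and a literal `'\0'` are indistinguishable; with the second they are not.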
if (has_next_) {
    for (char c : next_line_) {
        letter_count[c]++;
    }
}

std::unordered_map<char, unsigned> next_letter_count;
while (has_next_) {
    GetNextIfHas();
    next_letter_count.clear();
    for (char c : next_line_) {
        next_letter_count[c]++;
    }
    for (auto letter : letter_count) {
        if (letter.second != next_letter_count[letter.first]) {
            letter_count[letter.first] = 0;
        }
    }
}
This approach also counts characters that we do not want to count: those inside quotes. See the second example in the last comment on src/tests/test_util.cpp.
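One possible way to skip quoted characters, as a rough sketch rather than a drop-in replacement for the parser code. It works on in-memory lines instead of `CSVParser` state, assumes quotes are never escaped or doubled inside a field, and additionally assumes letters and digits are never separators (otherwise `d` would also pass in the quoted example, since it occurs once outside quotes in both rows). The function names are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Counts the characters of one line, ignoring everything between quote characters.
// Assumption: quotes are not escaped or doubled inside a quoted field.
inline std::unordered_map<char, unsigned> CountOutsideQuotes(std::string const& line,
                                                             char quote = '\'') {
    std::unordered_map<char, unsigned> counts;
    bool in_quotes = false;
    for (char c : line) {
        if (c == quote) {
            in_quotes = !in_quotes;
        } else if (!in_quotes) {
            ++counts[c];
        }
    }
    return counts;
}

// Returns, in sorted order, the non-alphanumeric characters that occur the same
// nonzero number of times outside quotes in every line.
inline std::vector<char> CandidateSeparators(std::vector<std::string> const& lines,
                                             char quote = '\'') {
    if (lines.empty()) return {};
    auto candidates = CountOutsideQuotes(lines.front(), quote);
    for (std::size_t i = 1; i < lines.size(); ++i) {
        auto current = CountOutsideQuotes(lines[i], quote);
        for (auto it = candidates.begin(); it != candidates.end();) {
            auto found = current.find(it->first);
            if (found == current.end() || found->second != it->second) {
                it = candidates.erase(it);  // count differs, so drop the candidate
            } else {
                ++it;
            }
        }
    }
    std::vector<char> result;
    for (auto const& entry : candidates) {
        // Assumption: letters and digits are never used as separators.
        if (!std::isalnum(static_cast<unsigned char>(entry.first))) {
            result.push_back(entry.first);
        }
    }
    std::sort(result.begin(), result.end());
    return result;
}
```

For the two examples from the test comment this yields `,` and `;` for the first table and only `,` for the quoted one, because the dots inside quotes are never counted.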
src/core/util/separator_validator.h
Outdated
namespace util {

std::optional<char> ValidateSeparator(std::filesystem::path const& path, char separator);
We need to change the function signature. Instead of just writing a warning to the console, the function should explicitly return a message with the error information and the actual separator, if one is found.
We could return something like std::pair<std::optional<char>, std::string>.
This is necessary because, first of all, this function will be needed for the Python server. Say we have downloaded a user's dataset and want to find out whether they sent the correct separator. The main goal is to provide readable information about the error.
Accordingly, you also need to figure out how to expose this function through the Python bindings.
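A minimal sketch of what such a return value could look like. Everything here is an assumption made for illustration: the function takes in-memory lines instead of a path, ignores quoting for brevity, only tries a fixed list of common separators, and the message wording is invented.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <optional>
#include <string>
#include <utility>
#include <vector>

// First: the separator that works (if any). Second: error text, empty on success.
using ValidationResult = std::pair<std::optional<char>, std::string>;

// Number of occurrences of `c` in `line` (quotes are ignored in this sketch).
inline std::size_t Count(std::string const& line, char c) {
    return static_cast<std::size_t>(std::count(line.begin(), line.end(), c));
}

// True if `c` occurs the same nonzero number of times in every line.
inline bool SplitsConsistently(std::vector<std::string> const& lines, char c) {
    if (lines.empty()) return false;
    std::size_t const expected = Count(lines.front(), c);
    if (expected == 0) return false;
    for (auto const& line : lines) {
        if (Count(line, c) != expected) return false;
    }
    return true;
}

// Hypothetical variant of ValidateSeparator that reports instead of logging.
inline ValidationResult ValidateSeparator(std::vector<std::string> const& lines,
                                          char separator) {
    if (SplitsConsistently(lines, separator)) return {separator, ""};
    // Assumption: only a few common separators are tried as alternatives.
    for (char candidate : {',', ';', '\t', '|'}) {
        if (candidate != separator && SplitsConsistently(lines, candidate)) {
            return {candidate, std::string("separator '") + separator +
                                   "' looks wrong; '" + candidate +
                                   "' splits every row consistently"};
        }
    }
    return {std::nullopt, std::string("separator '") + separator +
                              "' looks wrong and no alternative was found"};
}
```

If the bindings are built with pybind11, including `pybind11/stl.h` is what normally enables the `std::optional` conversion, so the result would presumably arrive in Python as a tuple of an optional one-character string and a message. That should be checked against the project's actual binding setup.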
src/core/algorithms/algo_factory.cpp
Outdated
auto csv_parser = std::make_shared<CSVParser>(csv_config);
csv_parser->ValidateSeparator();
return csv_parser;
Also, I'm not sure that we always want to validate the separator when creating a table. The main problem is that we have to read the whole table before running the algorithm itself, which fills the file system and disk caches. If someone runs experiments without knowing that validation always happens first, their results will be incorrect, since all caches must be reset before each test run.
Add methods for separator validation in .csv tables