parser: sanitize timestamps to RFC3339 #1201

mdibaiee · 2023-09-26T16:48:36Z

Description:

Automatically parse and sanitize timestamps to RFC3339.
Tested by running source-http-file against files whose timestamps are formatted as 2023-07-09 18:45:14.
Given the input:

999,Free,Gillmor,[email protected],Male,2023-06-24 18:43:51
1000,Dacie,Thomkins,[email protected],Non-binary,2023-03-06 20:18:34

The output with default config (UTC as timezone):

acmeCo/file-data {"_meta":{"file":"bad-timestamps.csv","offset":998},"email":"[email protected]","first_name":"Free","gender":"Male","id":"999","ip_address":"2023-06-24T18:43:51Z","last_name":"Gillmor"}
acmeCo/file-data {"_meta":{"file":"bad-timestamps.csv","offset":999},"email":"[email protected]","first_name":"Dacie","gender":"Non-binary","id":"1000","ip_address":"2023-03-06T20:18:34Z","last_name":"Thomkins"}

The output with default_timezone: "America/New_York":

acmeCo/file-data {"_meta":{"file":"bad-timestamps.csv","offset":998},"email":"[email protected]","first_name":"Free","gender":"Male","id":"999","ip_address":"2023-06-24T18:43:51-04:00","last_name":"Gillmor"}
acmeCo/file-data {"_meta":{"file":"bad-timestamps.csv","offset":999},"email":"[email protected]","first_name":"Dacie","gender":"Non-binary","id":"1000","ip_address":"2023-03-06T20:18:34-05:00","last_name":"Thomkins"}

Also tested with timestamps that already have timezone:

Input:

999,Free,Gillmor,[email protected],Male,2023-06-24 18:43:51+01:00
1000,Dacie,Thomkins,[email protected],Non-binary,2023-03-06 20:18:34+01:00

Output:

acmeCo/file-data {"_meta":{"file":"different-timezone.csv","offset":998},"email":"[email protected]","first_name":"Free","gender":"Male","id":"999","ip_address":"2023-06-24T18:43:51+01:00","last_name":"Gillmor"}
acmeCo/file-data {"_meta":{"file":"different-timezone.csv","offset":999},"email":"[email protected]","first_name":"Dacie","gender":"Non-binary","id":"1000","ip_address":"2023-03-06T20:18:34+01:00","last_name":"Thomkins"}

Workflow steps:

(How does one use this feature, and how has it changed)

Documentation links affected:

(list any documentation links that you created, or existing ones that you've identified as needing updates, along with a brief description)

Notes for reviewers:

(anything that might help someone review this PR)

psFried

I think my main feedback is that this needs more exhaustive test cases, particularly for different numbers of sub-second digits and different representations of timezone offsets.

I left some other comments and questions. Let me know if it'd help to talk through any of those.

crates/parser/tests/sanitize_test.rs

psFried · 2023-09-28T16:47:46Z

crates/parser/src/format/sanitize/datetime.rs

+];
+
+const FORMATS: [&'static str; 2] = [
+    "%Y-%m-%d %H:%M:%S%.3f%:z",


Lots of thoughts and questions on these:
I don't think %.3f is right here, because it only allows for 3 sub-second digits, if I'm understanding it correctly. We should at least support 0-9 digits for timestamps with up to nanosecond precision. Technically, I don't think rfc3339 specifies a limit to the number of digits, so maybe we shouldn't either? But IDK if anyone actually uses more than 9 in practice. And it's maybe even arguable whether we should normalize timestamps with >9 sub-second digits to only have 9. Perhaps worth a little research on that.

Also, the chrono docs aren't super clear about the behavior of %:z, but I guess it accepts a literal z or Z in addition to an explicit +/-hours:minutes offset? Might be good to comment.

Another thing is that I don't see the T here. I see that some test cases have it, so I'm wondering if those cases are passing because we're just passing them through as is. Is that intentional and important? If so, it's deserving of a comment.

I also wonder if chrono's built-in rfc3339 format would handle all of those cases without needing to try both of these formats. Is there a reason not to use that?

> Another thing is that I don't see the T here. I see that some test cases have it, so I'm wondering if those cases are passing because we're just passing them through as is. Is that intentional and important? If so, it's deserving of a comment.

Yes, we just pass RFC3339 through as is; it doesn't need us to parse it and format it again.

> I also wonder if chrono's built-in rfc3339 format would handle all of those cases without needing to try both of these formats. Is there a reason not to use that?

RFC3339 requires the T. Here we are trying to handle the case where the T is missing, which is not valid RFC3339.
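
To make the missing-T case concrete, here is a minimal std-only sketch. It is not the code in this PR (which relies on chrono, and later the time crate); `insert_t_separator` and its helpers are hypothetical names, and the check only looks at the byte shape of the string, not at whether the month, day, or hour values are in range.

```rust
/// Returns true if `s` has the byte shape `DDDD-DD-DD` (D = ASCII digit).
fn is_date_shape(s: &[u8]) -> bool {
    s.len() == 10
        && s.iter().enumerate().all(|(i, b)| match i {
            4 | 7 => *b == b'-',
            _ => b.is_ascii_digit(),
        })
}

/// Returns true if `s` starts with the byte shape `DD:DD:DD`.
fn starts_with_time_shape(s: &[u8]) -> bool {
    s.len() >= 8
        && s[..8].iter().enumerate().all(|(i, b)| match i {
            2 | 5 => *b == b':',
            _ => b.is_ascii_digit(),
        })
}

/// Hypothetical helper: rewrites `YYYY-MM-DD HH:MM:SS...` into
/// `YYYY-MM-DDTHH:MM:SS...`. Returns None for anything else, including
/// strings that already use `T`, so the caller can pass those through.
fn insert_t_separator(s: &str) -> Option<String> {
    let bytes = s.as_bytes();
    if bytes.len() < 19 || bytes[10] != b' ' {
        return None;
    }
    if !is_date_shape(&bytes[..10]) || !starts_with_time_shape(&bytes[11..]) {
        return None;
    }
    // Bytes 0..=10 are ASCII at this point, so byte offsets 10 and 11
    // are valid char boundaries for slicing.
    Some(format!("{}T{}", &s[..10], &s[11..]))
}
```

A caller would treat `None` as "leave the value untouched", which covers both ordinary strings and timestamps that are already valid RFC3339.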

psFried · 2023-09-28T17:48:06Z

crates/parser/src/format/sanitize/datetime.rs

+
+fn datetime_to_rfc3339(val: &mut Value, default_timezone: Tz) {
+    match val {
+        Value::String(s) => {


I'm not sure exactly what we should do about this, but would like to point out that in the common case where a string isn't a timestamp, we're trying to parse it 6 times. If there's a way to cut down on that, it might be worth it.

One possibility might be to switch from chrono to the time crate, which has the ability to specify optional elements in the format specifier. The time crate is generally preferred over chrono anyway. We currently use both (chrono being used a bit more, actually), but I'd like us to gradually standardize on just using time if we can. So it might be worthwhile to switch to time now, if it seems like it could significantly cut down on the amount of work we have to do here.

All this is of course speculative without any sort of benchmarks. I just brought up the current lack of benchmarks after standup, and Johnny's suggestion was to just try a basic before-and-after test against a big CSV, so we can at least ensure that this isn't regressing performance egregiously. I agree that seems like a good compromise to avoid blowing up the scope of this PR. And I think we can let that determine whether it's worth switching to the time crate. As long as performance hasn't gotten significantly worse, it's fine the way it is for now.
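
A basic before-and-after comparison of the kind suggested here could be sketched with only the standard library. The helper names `mean_time` and `percent_change` are made up for illustration, and a statistical benchmark harness would give more trustworthy numbers than this single wall-clock average.

```rust
use std::time::{Duration, Instant};

/// Runs `f` `iters` times and returns the mean wall-clock time per run.
/// `iters` must be greater than zero.
fn mean_time<F: FnMut()>(iters: u32, mut f: F) -> Duration {
    let start = Instant::now();
    for _ in 0..iters {
        f();
    }
    start.elapsed() / iters
}

/// Relative change of `after` versus `before`, in percent.
/// Positive values mean `after` is slower.
fn percent_change(before: Duration, after: Duration) -> f64 {
    (after.as_secs_f64() - before.as_secs_f64()) / before.as_secs_f64() * 100.0
}
```

One would call `mean_time` once with sanitization disabled and once with it enabled on the same in-memory CSV, then feed both results to `percent_change`.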

mdibaiee · 2023-10-04T21:16:52Z

Okay, so I did a few things here:

First, I added a test case with fractional seconds up to nanosecond precision (9 digits)
I added a simple benchmark for us to use to compare the sanitization with no sanitization
I ran the benchmark with sanitization enabled and disabled, and the result was that with sanitization, it is approximately 11-13% slower on this dataset:

peoples_500             time:   [5.0192 ms 5.0818 ms 5.1640 ms]
                        change: [+11.122% +12.920% +14.831%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 15 outliers among 100 measurements (15.00%)
  2 (2.00%) high mild
  13 (13.00%) high severe

I switched from chrono to the time crate. This makes the code simpler, but it didn't really help performance: it is still ~12% slower.

I'll look into this more tomorrow to see if I can improve performance somehow.

mdibaiee · 2023-10-05T16:23:22Z

@psFried I made a change now. With this change:

we first attempt to parse into a more generic format that is a superset of naive and offset datetimes
if this parsing succeeds, we then try to parse into an offset datetime
if we succeed in parsing an offset datetime, we emit this datetime and we are done
if we don't succeed in parsing an offset datetime, then this is a naive datetime, so we assume the default offset for it and then emit

This makes the performance impact of sanitization a ~3-5% regression, which I think is acceptable.
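
The control flow described above can be sketched roughly as follows. This is a std-only illustration, not the time-crate code in the PR: `Parsed`, `parse_generic`, and `sanitize` are hypothetical names, the default is a fixed offset string rather than a timezone (so it cannot reproduce the DST-aware -04:00 / -05:00 results shown in the description), and fractional seconds are copied through verbatim rather than normalized.

```rust
/// Result of the first, generic parse: date, time (including any
/// fractional seconds), and an optional explicit offset.
#[derive(Debug, PartialEq)]
struct Parsed<'a> {
    date: &'a str,
    time: &'a str,
    offset: Option<&'a str>,
}

/// Stage 1: one pass that accepts both naive and offset datetimes.
/// Most non-timestamp strings are rejected by the cheap shape check.
fn parse_generic(s: &str) -> Option<Parsed<'_>> {
    let b = s.as_bytes();
    if b.len() < 19 {
        return None;
    }
    let shape_ok = b[..19].iter().enumerate().all(|(i, c)| match i {
        4 | 7 => *c == b'-',
        10 => *c == b' ' || *c == b'T',
        13 | 16 => *c == b':',
        _ => c.is_ascii_digit(),
    });
    if !shape_ok {
        return None;
    }
    // Optional fractional seconds: '.' followed by one or more digits.
    let mut end = 19;
    if b.get(end) == Some(&b'.') {
        let digits = b[end + 1..].iter().take_while(|c| c.is_ascii_digit()).count();
        if digits == 0 {
            return None;
        }
        end += 1 + digits;
    }
    // Whatever remains must be empty, `Z`, or `+HH:MM` / `-HH:MM`.
    let rest = &s[end..];
    let offset = match rest.as_bytes() {
        [] => None,
        [b'Z'] => Some(rest),
        [sign, h1, h2, b':', m1, m2]
            if (*sign == b'+' || *sign == b'-')
                && [h1, h2, m1, m2].iter().all(|c| c.is_ascii_digit()) =>
        {
            Some(rest)
        }
        _ => return None,
    };
    Some(Parsed { date: &s[..10], time: &s[11..end], offset })
}

/// Stages 2 and 3: keep an explicit offset, otherwise assume the default.
fn sanitize(s: &str, default_offset: &str) -> Option<String> {
    let p = parse_generic(s)?;
    let offset = match p.offset {
        Some(explicit) => explicit, // offset datetime: emit as parsed
        None => default_offset,     // naive datetime: assume the default
    };
    Some(format!("{}T{}{}", p.date, p.time, offset))
}
```

The point of the structure is that a string which is not a timestamp fails once, in `parse_generic`, instead of being tried against several formats in turn.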

psFried

Some other edge cases that I think need test coverage are inputs that have explicit TZ offsets that aren't the same as the default: for example, 2023-09-26 12:34:56-04:00 should get normalized to 2023-09-26T12:34:56-04:00, and 2023-09-26T12:34:56-04:00 should be passed through unaltered.

Also, the existing test cases for fractional seconds are all for zeros, which get truncated by the normalization process. I think we should have some tests for non-zero fractional seconds that assert that the fractional portions don't lose significant digits.
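
As an illustration of the kind of table-driven cases meant here, the sketch below pairs fractional-second inputs with the output one would expect if normalization only drops trailing zeros. `trim_fraction` is a hypothetical stand-in written for this example; the thread does not show how the real normalization is implemented, so the expected values are an assumption about the desired behavior.

```rust
/// Hypothetical stand-in for the normalization step: drops trailing zeros
/// from a fractional-seconds digit string and returns the fraction with its
/// leading '.', or an empty string when no significant digits remain.
fn trim_fraction(digits: &str) -> String {
    let trimmed = digits.trim_end_matches('0');
    if trimmed.is_empty() {
        String::new()
    } else {
        format!(".{}", trimmed)
    }
}

/// (input digits, expected normalized fraction)
const CASES: &[(&str, &str)] = &[
    ("000", ""),                     // all zeros: fraction disappears
    ("500", ".5"),                   // trailing zeros dropped
    ("120000000", ".12"),            // nanosecond width, two significant digits
    ("123456789", ".123456789"),     // full nanosecond precision preserved
    ("000000001", ".000000001"),     // leading zeros are significant
];
```

Each case asserts that no significant digit is lost, which the all-zero inputs in the existing tests cannot demonstrate.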

crates/parser/src/format/sanitize/datetime.rs

psFried

Those new test cases are way easier to read and understand. Thanks! I think this is looking in nice shape.

mdibaiee force-pushed the mahdi/parser-timestamp branch from d3c226c to 476e678 Compare September 26, 2023 19:48

mdibaiee marked this pull request as ready for review September 26, 2023 19:48

mdibaiee force-pushed the mahdi/parser-timestamp branch 4 times, most recently from fbe8857 to 27148c3 Compare September 26, 2023 20:11

parser: sanitize timestamps to RFC3339

6a8b599

mdibaiee force-pushed the mahdi/parser-timestamp branch from 27148c3 to 6a8b599 Compare September 26, 2023 20:16

mdibaiee requested review from psFried and williamhbaker September 26, 2023 20:17

psFried requested changes Sep 28, 2023

mdibaiee added 2 commits October 4, 2023 20:50

parser: support timestamps with arbitrary number of fractional seconds

b4fed06

parser: add benchmark

edb1750

mdibaiee force-pushed the mahdi/parser-timestamp branch from fc3a3ce to edb1750 Compare October 4, 2023 20:32

parser: use time crate to parse, update benchmark to remove I/O

0f27b4c

mdibaiee force-pushed the mahdi/parser-timestamp branch from 99fa14e to 0f27b4c Compare October 5, 2023 10:56

mdibaiee force-pushed the mahdi/parser-timestamp branch 3 times, most recently from 8573cd9 to d42b21e Compare October 6, 2023 17:19

parser: improve performance by first parsing a naive date

15e5b90

mdibaiee force-pushed the mahdi/parser-timestamp branch from d42b21e to 15e5b90 Compare October 6, 2023 17:25

psFried requested changes Oct 10, 2023

parser: refactor tests so it is easier to test different cases

28f9f4d

mdibaiee force-pushed the mahdi/parser-timestamp branch from 3fff5f3 to 28f9f4d Compare October 11, 2023 14:59

psFried approved these changes Oct 12, 2023

mdibaiee merged commit 9514bfb into master Oct 12, 2023
4 checks passed

mdibaiee deleted the mahdi/parser-timestamp branch October 12, 2023 16:21

mdibaiee mentioned this pull request Oct 12, 2023

ci: pin flow version until parser changes get greenlight estuary/connectors#1008

Closed

mdibaiee mentioned this pull request Oct 12, 2023

Revert "parser: sanitize timestamps to RFC3339" #1237

Merged

mdibaiee restored the mahdi/parser-timestamp branch October 17, 2023 15:05
