Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

parser: sanitize timestamps to RFC3339 #1201

Merged
merged 6 commits into from
Oct 12, 2023
Merged

Conversation

mdibaiee
Copy link
Member

@mdibaiee mdibaiee commented Sep 26, 2023

Description:

  • Automatically parse and sanitize timestamps to RFC3339
  • Tested by running a source-http-file serving files with timestamps formatted as 2023-07-09 18:45:14
  • The result is:
    Given the input:
999,Free,Gillmor,[email protected],Male,2023-06-24 18:43:51
1000,Dacie,Thomkins,[email protected],Non-binary,2023-03-06 20:18:34

The output with default config (UTC as timezone):

acmeCo/file-data {"_meta":{"file":"bad-timestamps.csv","offset":998},"email":"[email protected]","first_name":"Free","gender":"Male","id":"999","ip_address":"2023-06-24T18:43:51Z","last_name":"Gillmor"}
acmeCo/file-data {"_meta":{"file":"bad-timestamps.csv","offset":999},"email":"[email protected]","first_name":"Dacie","gender":"Non-binary","id":"1000","ip_address":"2023-03-06T20:18:34Z","last_name":"Thomkins"}

The output with default_timezone: "America/New_York":

acmeCo/file-data {"_meta":{"file":"bad-timestamps.csv","offset":998},"email":"[email protected]","first_name":"Free","gender":"Male","id":"999","ip_address":"2023-06-24T18:43:51-04:00","last_name":"Gillmor"}
acmeCo/file-data {"_meta":{"file":"bad-timestamps.csv","offset":999},"email":"[email protected]","first_name":"Dacie","gender":"Non-binary","id":"1000","ip_address":"2023-03-06T20:18:34-05:00","last_name":"Thomkins"}

Also tested with timestamps that already have timezone:

Input:

999,Free,Gillmor,[email protected],Male,2023-06-24 18:43:51+01:00
1000,Dacie,Thomkins,[email protected],Non-binary,2023-03-06 20:18:34+01:00

Output:

acmeCo/file-data {"_meta":{"file":"different-timezone.csv","offset":998},"email":"[email protected]","first_name":"Free","gender":"Male","id":"999","ip_address":"2023-06-24T18:43:51+01:00","last_name":"Gillmor"}
acmeCo/file-data {"_meta":{"file":"different-timezone.csv","offset":999},"email":"[email protected]","first_name":"Dacie","gender":"Non-binary","id":"1000","ip_address":"2023-03-06T20:18:34+01:00","last_name":"Thomkins"}

Workflow steps:

(How does one use this feature, and how has it changed)

Documentation links affected:

(list any documentation links that you created, or existing ones that you've identified as needing updates, along with a brief description)

Notes for reviewers:

(anything that might help someone review this PR)


This change is Reviewable

@mdibaiee mdibaiee force-pushed the mahdi/parser-timestamp branch from d3c226c to 476e678 Compare September 26, 2023 19:48
@mdibaiee mdibaiee marked this pull request as ready for review September 26, 2023 19:48
@mdibaiee mdibaiee force-pushed the mahdi/parser-timestamp branch 4 times, most recently from fbe8857 to 27148c3 Compare September 26, 2023 20:11
Copy link
Member

@psFried psFried left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think my main feedback is that this needs more exhaustive test cases, particularly for different numbers of sub-second digits and different representations of timezone offsets.

I left some other comments and questions. Let me know if it'd help to talk through any of those.

crates/parser/tests/sanitize_test.rs Outdated Show resolved Hide resolved
];

const FORMATS: [&'static str; 2] = [
"%Y-%m-%d %H:%M:%S%.3f%:z",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lots of thoughts and questions on these:
I don't think %.3f is right here, because it only allows for 3 sub-second digits, if I'm understanding it correctly. We should at least support 0-9 digits for timestamps with up to nanosecond precision. Technically, I don't think rfc3339 specifies a limit to the number of digits, so maybe we shouldn't either? But IDK if anyone actually uses more than 9 in practice. And it's maybe even arguable whether we should normalize timestamps with >9 sub-second digits to only have 9. Perhaps worth a little research on that.

Also, the chrono docs aren't super clear about the behavior of %:z, but I guess it accepts a literal z or Z in addition to an explicit +/-hours:minutes offset? Might be good to comment.

Another thing is that I don't see the T here. I see that some test cases have it, so I'm wondering if those cases are passing because we're just passing them through as is. Is that intentional and important? If so, it's deserving of a comment.

I also wonder if chrono's built-in rfc3339 format would handle all of those cases without needing to try both of these formats. Is there a reason not to use that?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another thing is that I don't see the T here. I see that some test cases have it, so I'm wondering if those cases are passing because we're just passing them through as is. Is that intentional and important? If so, it's deserving of a comment.

Yes we just pass through RFC3339 as is, it doesn;t need us to parse it and format it again

Copy link
Member Author

@mdibaiee mdibaiee Oct 4, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also wonder if chrono's built-in rfc3339 format would handle all of those cases without needing to try both of these formats. Is there a reason not to use that?

RFC3339 requires the T, here we are trying to handle the case where the T is missing, which is not valid RFC3339


fn datetime_to_rfc3339(val: &mut Value, default_timezone: Tz) {
match val {
Value::String(s) => {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure exactly what we should do about this, but would like to point out that in the common case where a string isn't a timestamp, we're trying to parse it 6 times. If there's a way to cut down on that, it might be worth it.

One possibility might be to switch from chrono to the time crate, which has the ability to specify optional elements in the format specifier. The time crate is generally preferred over chrono anyway. We currently use both (chrono being used a bit more, actually), but I'd like us to gradually standardize on just using time if we can. So it might be worthwhile to switch to time now, if it seems like it could significantly cut down on the amount of work we have to do here.

All this is of course speculative without any sort of benchmarks. I just brought up the current lack of benchmarks after standup, and Johnny's suggestion was to just try a basic before and after tests against a big CSV, so we can at least ensure that this isn't regressing performance egregiously. I agree that seems like a good compromise to avoid blowing up the scope of this PR. And I think we can let that determine whether it's worth switching to the time crate. As long as performance hasn't gotten significantly worse, it's fine the way it is for now.

@mdibaiee mdibaiee force-pushed the mahdi/parser-timestamp branch from fc3a3ce to edb1750 Compare October 4, 2023 20:32
@mdibaiee
Copy link
Member Author

mdibaiee commented Oct 4, 2023

Okay, so I did a few things here:

  1. First, I added a test case with fractional seconds up to nanosecond precision (9 digits)
  2. I added a simple benchmark for us to use to compare the sanitization with no sanitization
  3. I ran the benchmark with sanitization enabled and disabled, and the result was that with sanitization, it is approximately %11-13 slower on this dataset:
peoples_500             time:   [5.0192 ms 5.0818 ms 5.1640 ms]
                        change: [+11.122% +12.920% +14.831%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 15 outliers among 100 measurements (15.00%)
  2 (2.00%) high mild
  13 (13.00%) high severe
  1. I switched from chrono to time crate, this makes the code simpler, but in terms of performance it didn't really help. It is still ~12% slower.

I'll look into this more tomorrow to see if I can improve performance somehow.

@mdibaiee mdibaiee force-pushed the mahdi/parser-timestamp branch from 99fa14e to 0f27b4c Compare October 5, 2023 10:56
@mdibaiee
Copy link
Member Author

mdibaiee commented Oct 5, 2023

@psFried made a change now, with this change:

  1. we first attempt to parse into a more generic format that is a superset of naive and offset datetimes
  2. if this parsing succeeds, when then try to parse into an offset date
  3. if we succeed in parsing an offset date, we emit this datetime and we are done
  4. if we don't succeed in parsing an offset date, then this is a naive date and we assume the default offset for it and then emit

This makes the performance impact of sanitization a ~3-5% regression, which I think is acceptable

@mdibaiee mdibaiee force-pushed the mahdi/parser-timestamp branch 3 times, most recently from 8573cd9 to d42b21e Compare October 6, 2023 17:19
@mdibaiee mdibaiee force-pushed the mahdi/parser-timestamp branch from d42b21e to 15e5b90 Compare October 6, 2023 17:25
Copy link
Member

@psFried psFried left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some other edge cases that I think need test coverage are inputs that have explicit TZ offsets that aren't the same as the default: for example, 2023-09-26 12:34:56-04:00 should get normalized to 2023-09-26T12:34:56-04:00, and 2023-09-26T12:34:56-04:00 should be passed through unaltered.

Also, the existing test cases for fractional seconds are all for zeros, which get truncated by the normalization process. I think we should have some tests for non-zero fractional seconds that assert that the fractional portions don't lose significant digits.

crates/parser/src/format/sanitize/datetime.rs Outdated Show resolved Hide resolved
crates/parser/src/format/sanitize/datetime.rs Show resolved Hide resolved
@mdibaiee mdibaiee force-pushed the mahdi/parser-timestamp branch from 3fff5f3 to 28f9f4d Compare October 11, 2023 14:59
Copy link
Member

@psFried psFried left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Those new test cases are way easier to read and understand. Thanks! I think this is looking in nice shape.

@mdibaiee mdibaiee merged commit 9514bfb into master Oct 12, 2023
4 checks passed
@mdibaiee mdibaiee deleted the mahdi/parser-timestamp branch October 12, 2023 16:21
@mdibaiee mdibaiee restored the mahdi/parser-timestamp branch October 17, 2023 15:05
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants