-
Notifications
You must be signed in to change notification settings - Fork 60
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
parser: sanitize timestamps to RFC3339 #1201
Conversation
d3c226c
to
476e678
Compare
fbe8857
to
27148c3
Compare
27148c3
to
6a8b599
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think my main feedback is that this needs more exhaustive test cases, particularly for different numbers of sub-second digits and different representations of timezone offsets.
I left some other comments and questions. Let me know if it'd help to talk through any of those.
]; | ||
|
||
const FORMATS: [&'static str; 2] = [ | ||
"%Y-%m-%d %H:%M:%S%.3f%:z", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Lots of thoughts and questions on these:
I don't think %.3f
is right here, because it only allows for 3 sub-second digits, if I'm understanding it correctly. We should at least support 0-9 digits for timestamps with up to nanosecond precision. Technically, I don't think rfc3339 specifies a limit to the number of digits, so maybe we shouldn't either? But IDK if anyone actually uses more than 9 in practice. And it's maybe even arguable whether we should normalize timestamps with >9 sub-second digits to only have 9. Perhaps worth a little research on that.
Also, the chrono docs aren't super clear about the behavior of %:z
, but I guess it accepts a literal z
or Z
in addition to an explicit +/-hours:minutes
offset? Might be good to comment.
Another thing is that I don't see the T
here. I see that some test cases have it, so I'm wondering if those cases are passing because we're just passing them through as is. Is that intentional and important? If so, it's deserving of a comment.
I also wonder if chrono
's built-in rfc3339 format would handle all of those cases without needing to try both of these formats. Is there a reason not to use that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another thing is that I don't see the T here. I see that some test cases have it, so I'm wondering if those cases are passing because we're just passing them through as is. Is that intentional and important? If so, it's deserving of a comment.
Yes we just pass through RFC3339 as is, it doesn;t need us to parse it and format it again
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I also wonder if chrono's built-in rfc3339 format would handle all of those cases without needing to try both of these formats. Is there a reason not to use that?
RFC3339 requires the T
, here we are trying to handle the case where the T
is missing, which is not valid RFC3339
|
||
fn datetime_to_rfc3339(val: &mut Value, default_timezone: Tz) { | ||
match val { | ||
Value::String(s) => { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure exactly what we should do about this, but would like to point out that in the common case where a string isn't a timestamp, we're trying to parse it 6 times. If there's a way to cut down on that, it might be worth it.
One possibility might be to switch from chrono
to the time
crate, which has the ability to specify optional elements in the format specifier. The time
crate is generally preferred over chrono
anyway. We currently use both (chrono being used a bit more, actually), but I'd like us to gradually standardize on just using time
if we can. So it might be worthwhile to switch to time now, if it seems like it could significantly cut down on the amount of work we have to do here.
All this is of course speculative without any sort of benchmarks. I just brought up the current lack of benchmarks after standup, and Johnny's suggestion was to just try a basic before and after tests against a big CSV, so we can at least ensure that this isn't regressing performance egregiously. I agree that seems like a good compromise to avoid blowing up the scope of this PR. And I think we can let that determine whether it's worth switching to the time
crate. As long as performance hasn't gotten significantly worse, it's fine the way it is for now.
fc3a3ce
to
edb1750
Compare
Okay, so I did a few things here:
I'll look into this more tomorrow to see if I can improve performance somehow. |
99fa14e
to
0f27b4c
Compare
@psFried made a change now, with this change:
This makes the performance impact of sanitization a ~3-5% regression, which I think is acceptable |
8573cd9
to
d42b21e
Compare
d42b21e
to
15e5b90
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some other edge cases that I think need test coverage are inputs that have explicit TZ offsets that aren't the same as the default: for example, 2023-09-26 12:34:56-04:00
should get normalized to 2023-09-26T12:34:56-04:00
, and 2023-09-26T12:34:56-04:00
should be passed through unaltered.
Also, the existing test cases for fractional seconds are all for zeros, which get truncated by the normalization process. I think we should have some tests for non-zero fractional seconds that assert that the fractional portions don't lose significant digits.
3fff5f3
to
28f9f4d
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Those new test cases are way easier to read and understand. Thanks! I think this is looking in nice shape.
Description:
source-http-file
serving files with timestamps formatted as2023-07-09 18:45:14
Given the input:
The output with default config (UTC as timezone):
The output with
default_timezone: "America/New_York"
:Also tested with timestamps that already have timezone:
Input:
Output:
Workflow steps:
(How does one use this feature, and how has it changed)
Documentation links affected:
(list any documentation links that you created, or existing ones that you've identified as needing updates, along with a brief description)
Notes for reviewers:
(anything that might help someone review this PR)
This change is