Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix date parsing regression introduced in 2018 #6009

Open
tfmorris opened this issue Aug 10, 2023 · 12 comments · May be fixed by OpenRefine/openrefine.org#360
Open

Fix date parsing regression introduced in 2018 #6009

tfmorris opened this issue Aug 10, 2023 · 12 comments · May be fixed by OpenRefine/openrefine.org#360
Labels
grel The default expression language, GREL, could be improved in many ways! Priority: Medium Represents important issues that need to be addressed but are not urgent Type: Bug Issues related to software defects or unexpected behavior, which require resolution.
Milestone

Comments

@tfmorris
Copy link
Member

The current behavior of interpreting date strings without a timezone as being in UTC was a breaking change introduced in May 2018 without any announcement (as far as I can tell). The fact the tests were changed should have been a red flag that something bad was going on.

I first discovered this in 2020 and added a ToDo about it to the relevant tests and mentioned it in the comments to #3027, but apparently didn't create an issue, so remedying that now.

Current Results

All dates and datetime strings are interpreted as being at UTC when parsed.

Expected Behavior

I believe we should go back to the historical and, to me, more logical behavior of interpreting dates as being in the local timezone rather than as being at UTC.

Versions

OpenRefine <=2.7 - correct
OpenRefine 3.0+ - broken

Additional context

This is a second breaking change to undo the first breaking change, which is less than ideal, but I believe it's also more logical to the user to have things interpreted in the local timezone.

@tfmorris tfmorris added Type: Bug Issues related to software defects or unexpected behavior, which require resolution. Status: Pending Review Indicates that the issue or pull request is awaiting review by project maintainers or collaborators grel The default expression language, GREL, could be improved in many ways! labels Aug 10, 2023
@wetneb
Copy link
Member

wetneb commented Aug 10, 2023

I agree this change was not very helpful, I feel like it was mostly motivated by an interest in upgrading to newer Java APIs without much consideration of the user's requirements.

As a user, I see more options than "interpreting in local time" or "interpreting in UTC". There is a third option, which is "interpreting as a date(time) without clear timezone info". I think that third option should be the default behaviour for a data cleaning tool, especially for dates (where timezones are even more rarely specified).

This phrasing of "interpreting" might also be a bit abstract and hard to relate to concrete user workflows, so it's worth trying to pin it down in concrete terms. For instance, when loading a CSV file with datatype auto-detection:

person,birthdate
Maribel,1985-07-04
Fernando,1976-12-22
Jose,1988-03-08

I want that the year, month and day components of the dates to be exactly what they were in the source file, and I want them to stay the same afterwards, even if I created the project while being in the UTC+2 timezone and re-opened it later while being in the UTC-5 timezone. I want those dates to be preserved in OpenRefine and also if I export the project to any format.

I would expect the same behaviour if parsing datetime objects with no timezone attached (such as 1989-09-06 08:34:45).

In my opinion, this is different from interpreting date(time) objects in the local timezone, because if we do so, then the rendering of those objects should change depending on that local timezone (to preserve the interpretation).

@ostephens
Copy link
Member

I made this comment on the forum as well but repeating here:

The advantages I can see to the current behaviour is that it offers reproducibility and consistency between users - which is useful in training, support and in cases where users are working collaboratively but are geographically across time zones (or a single user working on the same data inside/outside daylight saving time? I'm not sure how the previous behaviour handled this situation?).

Outside this I don't have strong feelings on the issue - the most important thing is that we are clear about the behaviour

I do like the suggestion that we have more/better tools for converting date times between time zones - possibly on import, on export (as suggested by @Tim_Chaffe) and in transformations.

@tfmorris tfmorris changed the title Timezoneless date strings should be interpreted as local, not UTC Fix date parsing regression introduced in 2018 Aug 15, 2023
@tfmorris tfmorris added Priority: Medium Represents important issues that need to be addressed but are not urgent and removed Status: Pending Review Indicates that the issue or pull request is awaiting review by project maintainers or collaborators labels Aug 15, 2023
@tfmorris
Copy link
Member Author

tfmorris commented Aug 15, 2023

OK, I clearly should have used a better title. Hopefully the updated title is clearer. This is only about fixing the regression, not adding new functionality. The only question in my mind is whether fixing the regression is worse than leaving it as is.

As a user, I see more options than "interpreting in local time" or "interpreting in UTC". There is a third option, which is "interpreting as a date(time) without clear timezone info". I

Sure, but that's a significant chunk of new functionality to implement, and making it the default would not be backwards compatible. We have #533, which I reopened last year, to cover fixing cases where extra information is being unnecessarily added/derived.

I agree this change was not very helpful, I feel like it was mostly motivated by an interest in upgrading to newer Java APIs without much consideration of the user's requirements.

But changing to the new APIs added no technical benefit OR user benefit and, additionally, was done incorrectly making the new code more confusing because it misuses the APIs. It's a sad waste of everyone's time.

@wetneb
Copy link
Member

wetneb commented Aug 16, 2023

Tom, I understand your frustration but I think there is no point in escalating the blame on this contributor who has long left the project by now. A lot of his contributions were made in a personal capacity. Let's stay focused on the possible solutions to the problem and keep the discussion constructive.

@tfmorris tfmorris self-assigned this Aug 22, 2023
@tfmorris
Copy link
Member Author

Any opinions on my question?

The only question in my mind is whether fixing the regression is worse than leaving it as is.

@wetneb
Copy link
Member

wetneb commented Nov 15, 2023

I don't know exactly what you mean by "fixing the regression". In your first message you propose to "interpret in local time" but I am not sure I understand what you mean exactly by that, which is why I tried to phrase the problem differently in my first reply above. I think there are multiple approaches, they could each be better or worse than the status quo:

  • attempting a plain revert, i.e. something as close as possible to a git revert modulo the more recent changes, which would amount going back to the Calendar class internally
  • migrating to something else from the Java 8+ API, such as LocalDateTime, which embeds a timezone I think, so it's perhaps not a great fit
  • some home-grown date(time) class made to fit our needs
  • something else?

In any case the move should be motivated by concrete use cases where we'd be able to validate the improvement from a user perspective.

@thadguidry
Copy link
Member

thadguidry commented Nov 15, 2023

Users are currently not aware of what is happening most of the time with their datetimes. We should make them aware then give them the opportunity at import to make a choice. Maybe a dialog? The smart default should be local time and we should inform the user of this at import. I have never supported keeping hidden any transformations OpenRefine does, and that's the current situation that should be fixed. So can OpenRefine inform the user at import?

@tfmorris
Copy link
Member Author

the move should be motivated by concrete use cases where we'd be able to validate the improvement from a user perspective.

Historically, backward compatibility was a core value for the project. That is the user-related motivation for this. This regression broke that backward compatibility, but now we have 5 years of product in the field with the broken behavior (vs 8 years prior), so it's a damned if you do, damned if you don't situation.

migrating to something else from the Java 8+ API, such as LocalDateTime

This doesn't have anything to do with Java implementation classes, but rather user visible behavior. The simplest way to understand it is to look at the old and new test behavior:

// FIXME: Date/times without timezone info were interpreted as local up until May 2018 at which point they
// switch to UTC
// assertEquals(invoke("toDate", "2013-06-01T13:12:11"), CalendarParser.parseAsOffsetDateTime("2013-06-01
// 13:12:11"));
// These match current behavior, but would fail with the historic (pre-2018) behavior
assertEquals(invoke("toDate", "2013-06-01T13:12:11Z"), CalendarParser.parseAsOffsetDateTime("2013-06-01 13:12:11"));
assertEquals(invoke("toDate", "2013-06-01Z"), CalendarParser.parseAsOffsetDateTime("2013-06-01"));

Fixing the regression would mean uncommenting the first test and adjusting the implementation code so that the test passes.

when loading a CSV file with datatype auto-detection

If you mean the "Attempt to parse cell text into numbers" import option, it doesn't have anything to do with dates. The affected code is the toDate() function.

I'm not proposing (at least here) to:

  • change how dates without times are treated (they'll still be interpreted as being datetimes with a time of midnight)
  • change how date parsing is done (it'll still use the wonky maybe it's a month, maybe it's day algorithm by default)
  • fix the fact that OffsetDateTime with a business logic enforced offset of 0 is being (mis)used instead of Instant
  • add support for local dates/times (ie those which include timezone info)
  • or do anything else with date/time handling except fix this one specific bug

I guess maybe I should just put up a PR for people to review to make things more concrete.

@ostephens
Copy link
Member

The only question in my mind is whether fixing the regression is worse than leaving it as is.

As you say @tfmorris "damned if you do, damned if you don't" . Given this my vote is:

  • leave the existing functionality as it is
  • make sure this is clearly documented (suggest this includes a note of the former behaviour and which version this changed in)
  • put effort into improving our date/time handling including timezone options

@tfmorris tfmorris added this to the 3.8 milestone Dec 18, 2023
@thadguidry
Copy link
Member

I agree with @ostephens above on next steps for this issue.

@tfmorris
Copy link
Member Author

@ostephens @thadguidry I'm counting on you to follow through with issues/documentation that implement your proposed solution. I've dropped my work on a code based solution.

@tfmorris tfmorris modified the milestones: 3.8, 3.9 May 13, 2024
thadguidry added a commit to thadguidry/openrefine.github.com that referenced this issue Aug 5, 2024
@thadguidry
Copy link
Member

@tfmorris @ostephens Please review my new PR which adds an admonition to our docs in both GREL Reference for Dates and the Dates user guide. Hopefully I've captured the provenance well enough, and noted one benefit for users.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
grel The default expression language, GREL, could be improved in many ways! Priority: Medium Represents important issues that need to be addressed but are not urgent Type: Bug Issues related to software defects or unexpected behavior, which require resolution.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants