Average of Dates #54542

Open
alaindebecker opened this issue May 22, 2024 · 22 comments
Labels
dates Dates, times, and the Dates stdlib module

Comments

@alaindebecker

That you cannot add Dates and cannot divide a Date by an integer seems perfectly normal.
However, computing the mean of Dates is well founded and sometimes badly needed.

Example:

using Statistics, Dates

mean([Date("2024-05-22"), Date("2024-05-20")])
### ERROR: MethodError: no method matching /(::Date, ::Int64)

sum([Date("2024-05-22"), Date("2024-05-20")])  ÷ 2
### ERROR: MethodError: no method matching +(::Date, ::Date)

mean([DateTime("2024-05-22"), DateTime("2024-05-20")])
### ERROR: MethodError: no method matching /(::DateTime, ::Int64)

versioninfo()
#==
Julia Version 1.10.2
Commit bd47eca2c8 (2024-03-01 10:14 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: 12 × Intel(R) Core(TM) i7-10750H CPU @ 2.60GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-15.0.7 (ORCJIT, skylake)
Threads: 1 default, 0 interactive, 1 GC (on 12 virtual cores)
==#
@dkarrasch dkarrasch added the dates Dates, times, and the Dates stdlib module label May 22, 2024
@oscardssmith
Member

What should this return for [Date("2024-05-22"), Date("2024-05-21")]? In general, there isn't a correct answer to this.

@alaindebecker
Author

Workaround:
mean_dates(dates) = convert(Date, Day(round(mean(Dates.value.(dates)))))

Rounding is mandatory to avoid ERROR: InexactError: Int64(whatever) but the best strategy would be to return a DateTime and leave the question of rounding explicit to the user.

@alaindebecker
Author

To answer your question about the average of [Date("2024-05-22"), Date("2024-05-21")]: because the two dates are only accurate to 24 hours anyway, their mean cannot be more accurate than 24 hours, so 2024-05-22 and 2024-05-21 would answer the question equally well.

Joking aside, following the principle of this discussion about the average of integers [return a Float and let the user explicitly decide about spurious accuracy], I'd say that the mean of 2024-05-22 and 2024-05-21 is

t1 = Date("2024-05-22")
t2 = Date("2024-05-21")
arg = [t1,t2]
convert(DateTime, Millisecond(mean(Dates.value.(DateTime.(arg)))))
2024-05-21T12:00:00

which (outside Julia) I'd round to 2024-05-21, because t2 is in fact somewhere between 2024-05-21T00:00:00 and 2024-05-21T23:59:59 and t1 somewhere between 2024-05-22T00:00:00 and 2024-05-22T23:59:59.

@martinholters
Member

Another line of thought: For two values, their mean is their "middle". For 2024-05-21 and 2024-05-22, their middle seems to be midnight, i.e. 2024-05-22T00:00:00, so their mean should be 2024-05-22.

No, it shouldn't. IMHO, we shouldn't define it, as it is unclear what it should be. Let's not define mean for Dates.

@martinholters
Member

> mean_dates(dates) = convert(Date, Day(round(mean(Dates.value.(dates)))))

julia> convert(Date, Day(round(mean(Dates.value.([Date("2000-01-01"), Date("2004-01-01")])))))
2001-12-31

I'm certain someone would consider this a bug and expect 2002-01-01. Did I mention I'd prefer not to define mean for Dates?

@StefanKarpinski
Member

I'm a little unclear what the definition of average/mean is when you can neither add nor divide by a count. Median and extrema seem well-defined to me, but mean feels a lot iffier...

@alaindebecker
Author

Generations of astronomers did it however. After all, for them time is just a number, the Julian day number.

Personally, I need it for a regression y=f(t) with t the time. And from time to time, I also need it when I have a bunch of events that are supposed to occur at about the same time but are known to be normally distributed.

It is just like temperature: adding temperatures or dividing them by a count has no meaning, but you find average temperatures in any newspaper.

@vtjnash
Member

vtjnash commented May 24, 2024

Computationally, it is also easy to define in a rigorous way, because while Date cannot be added, delta days can be. And we can conveniently pick day 0 for the arithmetic, which makes it seem like our Date kind and Days kind are almost alike in units (although in strict mathematics, they are not):

julia> x = [Date("2024-05-22"), Date("2024-05-20")];

julia> d0 = Date("0000-01-01")
0000-01-01

julia> mean(x .- d0) .+ d0
2024-05-21
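
The same offset idea can be sketched so that it also handles means that do not fall on a whole day. This is only an illustration under stated assumptions: `mean_via_offsets` and its `origin` keyword are hypothetical names, not part of Dates or Statistics, and returning a DateTime is one possible choice among those discussed in this thread.

```julia
using Dates, Statistics

# Hypothetical helper (not part of Dates or Statistics): average the offsets
# from a reference date, then add the averaged offset back to the reference.
function mean_via_offsets(dates::AbstractVector{Date}; origin::Date = first(dates))
    offsets = Dates.value.(dates .- origin)            # whole days as Int64
    avg_ms = round(Int64, mean(offsets) * 86_400_000)  # mean offset in milliseconds
    return DateTime(origin) + Millisecond(avg_ms)
end

mean_via_offsets([Date("2024-05-22"), Date("2024-05-20")])  # 2024-05-21T00:00:00
mean_via_offsets([Date("2024-05-22"), Date("2024-05-21")])  # 2024-05-21T12:00:00
```

The result should not depend on the chosen origin, which is what makes day 0 a convenient but arbitrary pick.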

@jariji
Contributor

jariji commented May 24, 2024

I think the decision depends on the rules for dividing Day(n) by an integer in the definition of mean. Some options:

  1. It fails always.

  2. It works if it's divisible and errors otherwise like the current behavior of /(::Day, ::Int):

julia> Day(2)/2
1 day

julia> Day(1)/2
ERROR: InexactError: Int64(0.5)
  3. It rounds to a full Day using some rounding rule. That's consistent with a "fixed point" interpretation of date and datetime types.

  4. It promotes to DateTime. That's consistent with the behavior of 1/2 promoting to float.

Imho it would be best if we could define precise semantics for the date type and for mean so that the answer to this issue would follow unambiguously. I don't like that /(::Day, ::Int) and /(::Int, ::Int) seem to follow different principles.
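
For illustration, options 3 and 4 can be sketched on a bare day count. This is a rough sketch, not a proposal for how `/(::Day, ::Int)` should be implemented; the variable names are made up for the example, and the arithmetic is done on `Dates.value` to avoid relying on any particular Period division rule.

```julia
using Dates

total = Day(3)  # e.g. the summed offsets of two dates from a common origin
n = 2

# Option 3: stay in whole days, with the rounding rule chosen explicitly.
down    = Day(fld(Dates.value(total), n))          # 1 day
nearest = Day(round(Int, Dates.value(total) / n))  # 2 days (1.5 ties to even)

# Option 4: promote to a finer unit first, so that the quotient is exact.
exact = Millisecond(Dates.value(total) * 86_400_000 ÷ n)  # 129600000 milliseconds (36 hours)
```

Option 3 gives a different answer for each rounding rule, while option 4 keeps the exact value and leaves any rounding to the caller.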

@alaindebecker
Author

Totally agree with @vtjnash: a Date is a point in Time, and Time is continuous.

@alaindebecker
Author

@jariji: I suggest solution 4, although I usually use solution 3, rounding down.

The reason is that a Date like 2024-05-25 means any point in time between 2024-05-25 midnight and 2024-05-26 midnight. So a Date refers to a point in time which is (on average) 12 hours after its literal value. And the mean of a bunch of Dates will be on average 12 hours after Integer(DateTime.value).

However, solution 4 is in accordance with the Julia philosophy: let the rounding rule be explicitly stated by the user.

Solutions 1 and 2 are just painful.

@alaindebecker
Author

> mean_dates(dates) = convert(Date, Day(round(mean(Dates.value.(dates)))))
>
> julia> convert(Date, Day(round(mean(Dates.value.([Date("2000-01-01"), Date("2004-01-01")])))))
> 2001-12-31
>
> I'm certain someone would consider this a bug and expect 2002-01-01. Did I mention I'd prefer not to define mean for Dates?

Oh yes, there is a leap year at one end and not at the other, so 2001-12-31 is in fact correct, just as

convert(DateTime, Millisecond(mean(Dates.value.([DateTime("2000-01-01"), DateTime("2004-01-01")]))))
2001-12-31T12:00:00

Maybe I should rename the issue to "Average of Time" (time not being a Julia type).

@alaindebecker
Author

Workaround: mean_dates(dates) = convert(Date, Day(round(mean(Dates.value.(dates)))))

Rounding is mandatory to avoid ERROR: InexactError: Int64(whatever) but the best strategy would be to return a DateTime and leave the question of rounding explicit to the user:

Workaround: mean_dates(dates) = convert(DateTime, Millisecond(mean(Dates.value.(DateTime.(dates)))))

@martinholters
Member

> a Date is a point in Time, and Time is continuous

> a Date like 2024-05-25 means any point in time between 2024-05-25 midnight and 2024-05-26 midnight

So... a Date is a specific but unknown (within the day) point in time?

To me, a date is rather an interval (usually of 24 hours length).

> Oh yes, there is a leap year at one end and not at the other, so 2001-12-31 is in fact correct

Ok, but why then do I get 2006-01-01 for the mean of 2004-01-01 and 2008-01-01? Same situation wrt leap year, no?

If I look at this:

julia> for y in 2000:2020
           d1 = Date(y)
           d2 = Date(y+4)
           m = convert(Date, Day(round(mean(Dates.value.([d1, d2])))))
           println("Mean of $(d1) and $(d2) is $(m)")
       end
Mean of 2000-01-01 and 2004-01-01 is 2001-12-31
Mean of 2001-01-01 and 2005-01-01 is 2003-01-01
Mean of 2002-01-01 and 2006-01-01 is 2004-01-02
Mean of 2003-01-01 and 2007-01-01 is 2004-12-31
Mean of 2004-01-01 and 2008-01-01 is 2006-01-01
Mean of 2005-01-01 and 2009-01-01 is 2007-01-02
Mean of 2006-01-01 and 2010-01-01 is 2008-01-01
Mean of 2007-01-01 and 2011-01-01 is 2009-01-01
Mean of 2008-01-01 and 2012-01-01 is 2009-12-31
Mean of 2009-01-01 and 2013-01-01 is 2011-01-01
Mean of 2010-01-01 and 2014-01-01 is 2012-01-02
Mean of 2011-01-01 and 2015-01-01 is 2012-12-31
Mean of 2012-01-01 and 2016-01-01 is 2014-01-01
Mean of 2013-01-01 and 2017-01-01 is 2015-01-02
Mean of 2014-01-01 and 2018-01-01 is 2016-01-01
Mean of 2015-01-01 and 2019-01-01 is 2017-01-01
Mean of 2016-01-01 and 2020-01-01 is 2017-12-31
Mean of 2017-01-01 and 2021-01-01 is 2019-01-01
Mean of 2018-01-01 and 2022-01-01 is 2020-01-02
Mean of 2019-01-01 and 2023-01-01 is 2020-12-31
Mean of 2020-01-01 and 2024-01-01 is 2022-01-01

I do believe this makes perfect sense in some contexts - but also that it may be rather confusing in others. (And I certainly couldn't predict these results.)
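
One way to see where the scatter in the table comes from is sketched below. A four-year span containing one leap day is 1461 days long, an odd count, so the exact midpoint falls at noon, and Julia's default `round` then breaks the tie toward the even neighbour. The variable names are only illustrative.

```julia
using Dates

d1, d2 = Date(2000), Date(2004)
span = Dates.value(d2 - d1)                # 1461 days, an odd count
midpoint = DateTime(d1) + Hour(12 * span)  # 2001-12-31T12:00:00, i.e. noon

# Julia's default round breaks ties toward the even neighbour, so a midpoint
# at noon goes to the earlier or the later day depending on the parity of
# the underlying day count.
round(0.5)  # 0.0
round(1.5)  # 2.0
```

Which calendar date the rounded result lands on therefore depends both on the parity of the internal day number and on where the leap day sits within the span.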

@adienes
Contributor

adienes commented May 27, 2024

I think it would be a very bad choice to make the mean of Date to round to a Date like is proposed. To me it feels pretty hacky and confusing.

the mean of integers is non-integral...

@alaindebecker
Author

> I think it would be a very bad choice to make the mean of Date to round to a Date like is proposed. To me it feels pretty hacky and confusing.
>
> the mean of integers is non-integral...

That's why I said (but nobody seems to have read it): the best strategy would be to return a DateTime and leave the question of rounding explicit to the user.

@martinholters
Member

I'd be ok with defining mean for DateTime. However, I'm not sure the default convert(DateTime, ::Date) should be automatically invoked to also define mean for Date then.

@alaindebecker
Author

> a Date is a point in Time, and Time is continuous
>
> a Date like 2024-05-25 means any point in time between 2024-05-25 midnight and 2024-05-26 midnight
>
> So... a Date is a specific but unknown (within the day) point in time?
>
> To me, a date is rather an interval (usually of 24 hours length).
>
> Oh yes, there is a leap year at one end and not at the other, so 2001-12-31 is in fact correct
>
> Ok, but why then do I get 2006-01-01 for the mean of 2004-01-01 and 2008-01-01? Same situation wrt leap year, no?
>
> If I look at this:
>
> julia> for y in 2000:2020
>            d1 = Date(y)
>            d2 = Date(y+4)
>            m = convert(Date, Day(round(mean(Dates.value.([d1, d2])))))
>            println("Mean of $(d1) and $(d2) is $(m)")
>        end
> Mean of 2000-01-01 and 2004-01-01 is 2001-12-31
> Mean of 2001-01-01 and 2005-01-01 is 2003-01-01
> Mean of 2002-01-01 and 2006-01-01 is 2004-01-02
> Mean of 2003-01-01 and 2007-01-01 is 2004-12-31
> Mean of 2004-01-01 and 2008-01-01 is 2006-01-01
> Mean of 2005-01-01 and 2009-01-01 is 2007-01-02
> Mean of 2006-01-01 and 2010-01-01 is 2008-01-01
> Mean of 2007-01-01 and 2011-01-01 is 2009-01-01
> Mean of 2008-01-01 and 2012-01-01 is 2009-12-31
> Mean of 2009-01-01 and 2013-01-01 is 2011-01-01
> Mean of 2010-01-01 and 2014-01-01 is 2012-01-02
> Mean of 2011-01-01 and 2015-01-01 is 2012-12-31
> Mean of 2012-01-01 and 2016-01-01 is 2014-01-01
> Mean of 2013-01-01 and 2017-01-01 is 2015-01-02
> Mean of 2014-01-01 and 2018-01-01 is 2016-01-01
> Mean of 2015-01-01 and 2019-01-01 is 2017-01-01
> Mean of 2016-01-01 and 2020-01-01 is 2017-12-31
> Mean of 2017-01-01 and 2021-01-01 is 2019-01-01
> Mean of 2018-01-01 and 2022-01-01 is 2020-01-02
> Mean of 2019-01-01 and 2023-01-01 is 2020-12-31
> Mean of 2020-01-01 and 2024-01-01 is 2022-01-01
>
> I do believe this makes perfect sense in some contexts - but also that it may be rather confusing in others. (And I certainly couldn't predict these results.)

You hit what Julia calls the calendrical vs temporal nature of time (see doc). Since the Babylonian astronomers, you record time on a calendar and compute time on $\mathbb{R}$. Too bad the year is not an exact number of days.

@alaindebecker
Author

> I'd be ok with defining mean for DateTime. However, I'm not sure the default convert(DateTime, ::Date) should be automatically invoked to also define mean for Date then.

Something like this?

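using Dates, Statistics

# Extend Statistics.mean rather than defining a new, unrelated `mean` in Main.
function Statistics.mean(itr::AbstractArray{DateTime})
    # Average the millisecond counts, then convert back to a DateTime.
    return convert(DateTime, Millisecond(mean(Dates.value.(itr))))
end

Example:

mean([DateTime("2024-01-01"),DateTime("2025-01-01")])
2024-07-02T00:00:00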
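
One caveat with a definition of this shape: as far as I can tell, `Millisecond` applied to a Float64 mean throws an InexactError whenever the mean is not a whole number of milliseconds, for instance with three values. A minimal sketch that rounds to the nearest millisecond instead is below; `mean_datetime` is a hypothetical name, and rounding at the millisecond level is an assumption of this sketch, not something agreed on in the thread.

```julia
using Dates, Statistics

# Hypothetical helper: mean of DateTimes (or Dates, via DateTime.), rounded
# to the nearest millisecond so that the conversion back cannot be inexact.
function mean_datetime(xs)
    ms = mean(Dates.value.(DateTime.(xs)))  # Float64 count of milliseconds
    return convert(DateTime, Millisecond(round(Int64, ms)))
end

t0 = DateTime(2024, 1, 1)
mean_datetime([t0, t0, t0 + Millisecond(1)])        # 2024-01-01T00:00:00
mean_datetime([Date("2024-05-22"), Date("2024-05-21")])  # 2024-05-21T12:00:00
```

Any coarser rounding, for example to a Date, is then left to the caller.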

@cstjean
Contributor

cstjean commented Jun 13, 2024

Unitful has the same problem with regard to °C. It's interesting to consider that a generic fallback definition along the lines of Statistics.mean(vec) = zero(eltype(vec)) + sum(vec .- zero(eltype(vec))) / length(vec) would solve both problems. I guess they're both affine spaces...?

In any case, it's probably more pragmatic to define specialized methods for mean of Date and Quantity vectors, than change the generic mean fallback.
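
The affine-space idea can be sketched as a standalone helper rather than a change to the generic fallback. `affine_mean` and its `origin` keyword are hypothetical names; the sketch uses the first element as the reference point instead of `zero(eltype(vec))`, since it does not assume that `zero` of a date type returns a date.

```julia
using Dates, Statistics

# Hypothetical helper: mean of points in an affine space, computed from the
# differences to a reference point (only differences are ever averaged).
affine_mean(v; origin = first(v)) = origin + mean(v .- origin)

affine_mean([20.0, 22.0])                              # 21.0
affine_mean([Date("2024-05-22"), Date("2024-05-20")])  # 2024-05-21
```

For Date inputs this still relies on dividing a Day by a count, so on Julia 1.10 it would presumably raise the InexactError shown earlier in the thread whenever the mean is not a whole number of days.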

@adienes
Contributor

adienes commented Jun 13, 2024

in case the reference is useful, dropping the link to the polars meta-issue for aggregations of datetime-like types
pola-rs/polars#13599

@alaindebecker
Author

Thanks @adienes, exactly what was expected.

9 participants