The `evalcast-killcards` branch specifically moved to an unnested format because it seemed cleaner / more convenient. However, nesting has some benefits, like using less memory, since we don't have to store duplicate values in the `nest_by` fields. It could also make some of the scoring logic simpler.
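As a rough way to check the memory claim, the two forms could be built from the same synthetic data and their sizes compared. This is only a sketch: the column names, `n_groups`, and `n_per_group` are made up for illustration and are not the evalcast schema, and `object.size()` gives only an approximate footprint.

```r
library(dplyr)
library(tidyr)

set.seed(1)
n_groups <- 1000L    # hypothetical number of forecasts
n_per_group <- 23L   # hypothetical number of quantile levels per forecast

# Unnested form: the key columns are repeated on every row.
unnested <- tibble(
  forecaster = "example-forecaster",
  geo_value = rep(sprintf("geo%04d", seq_len(n_groups)), each = n_per_group),
  quantile = rep(seq(0.01, 0.99, length.out = n_per_group), times = n_groups),
  value = rnorm(n_groups * n_per_group)
)

# Nested form: one row per key combination, with the remaining columns
# packed into a list-column of small tibbles.
nested <- unnested %>% nest(data = c(quantile, value))

# Approximate in-memory sizes. Which form is smaller depends on the number
# and type of key columns and on the per-tibble overhead of the list-column.
print(object.size(unnested), units = "MB")
print(object.size(nested), units = "MB")
```

No expected sizes are given here on purpose; the comparison would need to be rerun on data shaped like the real prediction cards.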
Using nesting just for the forecast scoring might be as inefficient / RAM-exploding as a `group_split`. Sharing the nesting work across enough subprocesses could make it worth it. For example, if the nesting/`nest_by`ing is done before the join with the actuals, or more generally if the nested form is shared across all error measure calculations, it might offer some offsetting performance gains.
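A minimal sketch of what "nest once, share across error measures" could look like, assuming toy data and two hypothetical error measures (`abs_error_of_median` and `covered_80` are illustrative helpers, not evalcast functions):

```r
library(dplyr)
library(tidyr)
library(purrr)

set.seed(1)
# Toy predictions: 3 locations, 5 quantile levels each.
predictions <- tibble(
  geo_value = rep(c("a", "b", "c"), each = 5),
  quantile = rep(c(0.1, 0.25, 0.5, 0.75, 0.9), times = 3),
  value = sort(rnorm(15))
)
actuals <- tibble(geo_value = c("a", "b", "c"), actual = c(0, 0.5, 1))

# Hypothetical error measures: each takes one forecast's quantile/value
# tibble plus the matching actual and returns a single number.
abs_error_of_median <- function(df, actual) {
  abs(df$value[df$quantile == 0.5] - actual)
}
covered_80 <- function(df, actual) {
  lower <- df$value[df$quantile == 0.1]
  upper <- df$value[df$quantile == 0.9]
  as.numeric(lower <= actual & actual <= upper)
}

# Nest once, before the join, so the actuals are joined to one row per
# forecast rather than to every quantile row.
nested <- predictions %>% nest(data = c(quantile, value))

# Reuse the same nested form for every error measure.
scores <- nested %>%
  inner_join(actuals, by = "geo_value") %>%
  mutate(
    ae = map2_dbl(data, actual, abs_error_of_median),
    cov_80 = map2_dbl(data, actual, covered_80)
  ) %>%
  select(-data)
```

Whether this offsets the cost of nesting in practice would depend on how many error measures share the nested form.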
Comparing speed of nested/unnested approaches for scoring-like calculations (credit: @brookslogan):
Timing some simple operations to see if we'd expect speed gains [using an unnest approach in error measures]:
Expect comparisons to vary based on N, M, and complexity of the calculation. But the above makes me pessimistic about unnest unless the unnesting can be shared across eval metrics or the nested form can be avoided altogether, and even then, it's not that much faster. I'm not sure the latter is possible, because nesting might save RAM by avoiding repeating values in all the other columns (which is part of why I was suggesting unnesting and evaluating chunks of predictions at a time, not all). Although if there are faster versions of `unnest` or the `summarize`, then maybe some better speed gains could be realized.
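For anyone wanting to rerun this kind of comparison, a sketch along the following lines might be a starting point. It is not the original benchmark: the data are synthetic, `N` and `M` are hypothetical sizes, and `mean(value)` stands in for a real error measure.

```r
library(dplyr)
library(tidyr)
library(purrr)

set.seed(1)
N <- 2000L  # hypothetical number of forecasts
M <- 23L    # hypothetical number of quantile levels per forecast

unnested <- tibble(
  id = rep(seq_len(N), each = M),
  quantile = rep(seq(0.01, 0.99, length.out = M), times = N),
  value = rnorm(N * M)
)
nested <- unnested %>% nest(data = c(quantile, value))

# (a) Grouped summarize on data that is already unnested.
time_flat <- system.time({
  res_flat <- unnested %>%
    group_by(id) %>%
    summarize(score = mean(value), .groups = "drop")
})

# (b) Start from the nested form, unnest, then run the same summarize.
time_unnest <- system.time({
  res_unnest <- nested %>%
    unnest(data) %>%
    group_by(id) %>%
    summarize(score = mean(value), .groups = "drop")
})

# (c) Stay nested and map over the list-column.
time_map <- system.time({
  res_map <- nested %>%
    mutate(score = map_dbl(data, ~ mean(.x$value))) %>%
    select(id, score)
})

# Elapsed seconds for each approach (will vary by machine, N, and M).
print(c(
  flat = time_flat[["elapsed"]],
  unnest = time_unnest[["elapsed"]],
  map = time_map[["elapsed"]]
))
```

A single `system.time()` call is noisy; repeating each timing several times, or using a dedicated benchmarking package, would give more reliable numbers.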
From our prior discussion, using a nested form isn't a clear winner, partly because the unnesting step is slow. We didn't look at differences in memory usage of different approaches, though.
Is this worth pursuing/investigating more? I (@nmdefries) am not familiar with the prior (pre-`evalcast-killcards`) nested format, so I don't know what's been tried before.