
Allow multiple prefixes in aggregation jobs #67

Open
wojtek-rybak opened this issue Jul 19, 2024 · 7 comments

Comments

@wojtek-rybak

According to the API specification, when creating a job, we must provide the bucket name and the prefix of the files with aggregatable reports to be included in the aggregation. Since I want to perform the aggregation every hour, it seems necessary to have a separate prefix for each hour. For example:

  • /data/2024-07-19/00/...
  • /data/2024-07-19/01/...
  • /data/2024-07-19/02/...
  • /data/2024-07-19/03/...
  • etc.

However, if I need to perform an aggregation over a 6-hour interval (using a different filtering ID), I encounter a problem. The API only allows one prefix, which means I would need to copy the data to a new location. This approach seems impractical and inefficient.

It would be highly beneficial if the aggregation service could accept a list of prefixes. This change would allow more flexibility in specifying the data intervals for aggregation without needing to duplicate data.
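To make the constraint concrete, here is a minimal sketch of the layout described above (the bucket paths and file names are illustrative, not the actual layout):

```python
# Reports land under one prefix per hour (paths are illustrative).
hourly_keys = [f"data/2024-07-19/{h:02d}/report.avro" for h in range(24)]

# An hourly aggregation job needs just one prefix:
hour_00 = [k for k in hourly_keys if k.startswith("data/2024-07-19/00/")]

# A 6-hour job, however, would span six distinct prefixes. Since the API
# accepts only one prefix per job, the reports would have to be copied
# under a shared prefix instead -- hence the request for a list of prefixes.
six_hour_prefixes = [f"data/2024-07-19/{h:02d}/" for h in range(6)]
print(six_hour_prefixes)
```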

@nlrussell
Collaborator

Hi @wojtek-rybak, thanks for providing this feedback. Can you say more about how costly it is to do this and just how much of a blocker this is, so we can consider that information in determining the priority of this request?

@wojtek-rybak
Author

Hi @nlrussell

At RTB House, we are currently in the testing phase, working with a small amount of data from a subset of users, which amounts to tens of gigabytes per day. At this stage, the issue described is only a minor inconvenience.

However, we plan to start working on the final, production-ready solution around early October. For that phase, we estimate processing tens of terabytes of data per day. It would be highly beneficial if the feature allowing for multiple prefixes in the aggregation service could be added by then. This addition would enable us to avoid the need for data copying in our final design.

@nlrussell
Collaborator

Hi @wojtek-rybak, we've taken note of your suggestion as a feature request. In the meantime, you could structure your paths with an additional layer, similar to this:

/09-10-2024/interval-01/hour-01
/09-10-2024/interval-01/hour-02
/09-10-2024/interval-01/hour-03
/09-10-2024/interval-01/hour-04
/09-10-2024/interval-01/hour-05
/09-10-2024/interval-01/hour-06
/09-10-2024/interval-02/hour-07
/09-10-2024/interval-02/hour-08
/09-10-2024/interval-02/hour-09
/09-10-2024/interval-02/hour-10
...

Then to query hours 1-6, you can use the prefix "09-10-2024/interval-01/", without the need to copy reports over to another path.
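The suggested layout can be sanity-checked with a quick sketch (key names follow the illustrative paths above):

```python
# Object keys laid out with an extra "interval" layer, as suggested above.
keys = [f"09-10-2024/interval-01/hour-{h:02d}" for h in range(1, 7)] + \
       [f"09-10-2024/interval-02/hour-{h:02d}" for h in range(7, 11)]

# A single-prefix job for hours 1-6 selects exactly the interval-01 keys,
# with no copying required.
prefix = "09-10-2024/interval-01/"
selected = [k for k in keys if k.startswith(prefix)]
print(len(selected))  # 6
```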

@wojtek-rybak
Author

Hi @nlrussell, thank you for considering my suggestion.
I just wanted to highlight that the solution you provided only works when the length of one interval is divisible by the length of the other. In cases where the intervals have different lengths (e.g., one interval being 5 hours and the other 3 hours), we would still need to copy the reports.
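The incompatibility can be illustrated with a small sketch: an interval layout in the path only serves a second job length if every longer interval is a union of whole shorter ones (the 5-hour/3-hour lengths below are the illustrative case from the comment):

```python
# Group 24 hours into fixed-length intervals and check whether every
# interval of length `a` is exactly a union of whole intervals of length `b`.
def intervals(length, hours=24):
    return [set(range(s, min(s + length, hours))) for s in range(0, hours, length)]

def compatible(a, b):
    return all(
        big == set().union(*(small for small in intervals(b) if small <= big))
        for big in intervals(a)
    )

print(compatible(6, 1))  # True: hourly prefixes nest cleanly into 6-hour blocks
print(compatible(5, 3))  # False: a 5-hour block cuts a 3-hour block in half
```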

@nlrussell
Collaborator

Hi @wojtek-rybak, thanks for the feedback. We are working on a fix for this so we can provide a better alternative in a future release.

In the meantime, can you say more about your overall workflow? I'm not sure I understand why the intervals need to be divisible by each other. As long as you know your interval pattern in advance, you should be able to arrange the reports such that a given prefix contains all (and only) the reports needed for that specific aggregation job. Of course, this is not ideal, and you might have to move reports around to fit such a structure if that is not how you store the reports on initial receipt.

@ghanekaromkar
Contributor

Hi @wojtek-rybak,
Thanks for raising this issue. We are considering a solution for this and have opened a new issue #76 to request customer feedback on our proposal.

@wojtek-rybak
Author

@ghanekaromkar This is great news; I will send feedback right away.

@nlrussell We expect this mechanism to be used by multiple teams with varying needs and use cases. Our goal is to build a robust and flexible system that stores the reports once and allows each team to schedule their aggregation jobs independently. We want to provide teams the flexibility to choose how their reports are aggregated — whether it's every hour for lower latency, once a week for better signal-to-noise ratio, or a more sophisticated pattern, such as a daily sum over odd hours.
