
Allow multiple prefixes in aggregation jobs #67

Open
wojtek-rybak opened this issue Jul 19, 2024 · 7 comments

Comments

@wojtek-rybak

According to the API specification, when creating a job, we must provide the bucket name and the prefix of the files with aggregatable reports to be included in the aggregation. Since I want to perform the aggregation every hour, it seems necessary to have a separate prefix for each hour. For example:

  • /data/2024-07-19/00/...
  • /data/2024-07-19/01/...
  • /data/2024-07-19/02/...
  • /data/2024-07-19/03/...
  • etc.

However, if I need to perform an aggregation over a 6-hour interval (using a different filtering ID), I encounter a problem. The API only allows one prefix, which means I would need to copy the data to a new location. This approach seems impractical and inefficient.

It would be highly beneficial if the aggregation service could accept a list of prefixes. This change would allow more flexibility in specifying the data intervals for aggregation without needing to duplicate data.
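To make the constraint concrete, here is a minimal sketch of the layout described above (the bucket paths and file names are illustrative, not the actual layout):

```python
# Reports land under one prefix per hour (paths are illustrative).
hourly_keys = [f"data/2024-07-19/{h:02d}/report.avro" for h in range(24)]

# An hourly aggregation job needs just one prefix:
hour_00 = [k for k in hourly_keys if k.startswith("data/2024-07-19/00/")]

# A 6-hour job, however, would span six distinct prefixes. Since the API
# accepts only one prefix per job, the reports would have to be copied
# under a shared prefix instead -- hence the request for a list of prefixes.
six_hour_prefixes = [f"data/2024-07-19/{h:02d}/" for h in range(6)]
print(six_hour_prefixes)
```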

@nlrussell
Collaborator

Hi @wojtek-rybak, thanks for providing this feedback. Can you say more about how costly it is to do this and just how much of a blocker this is, so we can consider that information in determining the priority of this request?

@wojtek-rybak
Author

Hi @nlrussell

At RTB House, we are currently in the testing phase, working with a small amount of data from a subset of users, which amounts to tens of gigabytes per day. At this stage, the issue described is only a minor inconvenience.

However, we plan to start working on the final, production-ready solution around early October. For that phase, we estimate processing tens of terabytes of data per day. It would be highly beneficial if the feature allowing for multiple prefixes in the aggregation service could be added by then. This addition would enable us to avoid the need for data copying in our final design.

@nlrussell
Collaborator

Hi @wojtek-rybak, we've taken note of your suggestion as a feature request. In the meantime, you could structure your paths with an additional layer, similar to this:

/09-10-2024/interval-01/hour-01
/09-10-2024/interval-01/hour-02
/09-10-2024/interval-01/hour-03
/09-10-2024/interval-01/hour-04
/09-10-2024/interval-01/hour-05
/09-10-2024/interval-01/hour-06
/09-10-2024/interval-02/hour-07
/09-10-2024/interval-02/hour-08
/09-10-2024/interval-02/hour-09
/09-10-2024/interval-02/hour-10
...

Then to query hours 1-6, you can use the prefix "09-10-2024/interval-01/", without the need to copy reports over to another path.
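The suggested layout can be sanity-checked with a quick sketch (key names follow the illustrative paths above):

```python
# Object keys laid out with an extra "interval" layer, as suggested above.
keys = [f"09-10-2024/interval-01/hour-{h:02d}" for h in range(1, 7)] + \
       [f"09-10-2024/interval-02/hour-{h:02d}" for h in range(7, 11)]

# A single-prefix job for hours 1-6 selects exactly the interval-01 keys,
# with no copying required.
prefix = "09-10-2024/interval-01/"
selected = [k for k in keys if k.startswith(prefix)]
print(len(selected))  # 6
```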

@wojtek-rybak
Author

Hi @nlrussell, thank you for considering my suggestion.
I just wanted to highlight that the solution you provided only works when the length of one interval is divisible by the length of the other. In cases where the intervals have different lengths (e.g., one interval being 5 hours and the other 3 hours), we would still need to copy the reports.
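The incompatibility can be illustrated with a small sketch: an interval layout in the path only serves a second job length if every longer interval is a union of whole shorter ones (the 5-hour/3-hour lengths below are the illustrative case from the comment):

```python
# Group 24 hours into fixed-length intervals and check whether every
# interval of length `a` is exactly a union of whole intervals of length `b`.
def intervals(length, hours=24):
    return [set(range(s, min(s + length, hours))) for s in range(0, hours, length)]

def compatible(a, b):
    return all(
        big == set().union(*(small for small in intervals(b) if small <= big))
        for big in intervals(a)
    )

print(compatible(6, 1))  # True: hourly prefixes nest cleanly into 6-hour blocks
print(compatible(5, 3))  # False: a 5-hour block cuts a 3-hour block in half
```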

@nlrussell
Collaborator

Hi @wojtek-rybak, thanks for the feedback. We are working on a fix for this so we can provide a better alternative in a future release.

In the meantime, can you say more about your overall workflow? I'm not sure I understand why the intervals need to be divisible by each other. As long as you know your interval pattern in advance, you should be able to arrange the reports such that a given prefix contains all (and only) the reports needed for that specific aggregation job. Of course, this is not ideal, and you might have to move reports around to fit such a structure if that is not how you store the reports on initial receipt.

@ghanekaromkar
Contributor

Hi @wojtek-rybak,
Thanks for raising this issue. We are considering a solution for this and have opened a new issue #76 to request customer feedback on our proposal.

@wojtek-rybak
Author

@ghanekaromkar This is great news; I will send feedback right away.

@nlrussell We expect this mechanism to be used by multiple teams with varying needs and use cases. Our goal is to build a robust and flexible system that stores the reports once and allows each team to schedule their aggregation jobs independently. We want to provide teams the flexibility to choose how their reports are aggregated — whether it's every hour for lower latency, once a week for better signal-to-noise ratio, or a more sophisticated pattern, such as a daily sum over odd hours.
