Allow multiple prefixes in aggregation jobs #67
Comments
Hi @wojtek-rybak, thanks for providing this feedback. Can you say more about how costly this is and how much of a blocker it is, so we can take that into account when prioritizing this request?
Hi @nlrussell, at RTB House we are currently in the testing phase, working with a small amount of data from a subset of users, which amounts to tens of gigabytes per day. At this stage, the issue described is only a minor inconvenience. However, we plan to start working on the final, production-ready solution around early October. For that phase, we estimate processing tens of terabytes of data per day. It would be highly beneficial if support for multiple prefixes in the aggregation service could be added by then, as it would let us avoid copying data in our final design.
Hi @wojtek-rybak, we've taken note of your suggestion as a feature request. In the meantime, you could structure your paths with an additional layer, similar to this:
Then to query hours 1-6, you can use the prefix "09-10-2024/interval-01/", without the need to copy reports over to another path.
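The layered layout suggested above can be sketched in a few lines of Python. This is only an illustration under assumptions: the `hour-HH/` path segment, the 6-hour interval length, and the function names are hypothetical and do not come from the aggregation service; only the `09-10-2024/interval-01/` prefix appears in the thread.

```python
def interval_prefix(date: str, hour: int, hours_per_interval: int = 6) -> str:
    """Return the interval-level prefix that contains the given hour (1-24).

    The date string is used verbatim; its format is whatever the caller stores.
    """
    if not 1 <= hour <= 24:
        raise ValueError(f"hour must be in 1..24, got {hour}")
    # Hours 1-6 fall in interval 1, hours 7-12 in interval 2, and so on.
    interval = (hour - 1) // hours_per_interval + 1
    return f"{date}/interval-{interval:02d}/"


def hour_prefix(date: str, hour: int, hours_per_interval: int = 6) -> str:
    """Return the hypothetical hour-level prefix nested under its interval."""
    return f"{interval_prefix(date, hour, hours_per_interval)}hour-{hour:02d}/"


if __name__ == "__main__":
    # Hourly job: use the hour-level prefix.
    print(hour_prefix("09-10-2024", 3))
    # 6-hour job over hours 1-6: use the interval-level prefix.
    print(interval_prefix("09-10-2024", 3))
```

With this layout, an hourly job and a 6-hour job read the same stored reports, but only because the longer interval is a whole multiple of the shorter one.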
Hi @nlrussell, thank you for considering my suggestion.
Hi @wojtek-rybak, thanks for the feedback. We are working on a fix for this so we can provide a better alternative in a future release. In the meantime, can you say more about your overall workflow? I'm not sure I understand why the intervals need to be divisible by each other. As long as you know your interval pattern in advance, you should be able to arrange the reports such that a given prefix contains all (and only) the reports needed for that specific aggregation job. Of course, this is not ideal, and you might have to move reports around to fit such a structure if that is not how you store them on initial receipt.
Hi @wojtek-rybak,
@ghanekaromkar This is great news, I will send feedback immediately. @nlrussell We expect this mechanism to be used by multiple teams with varying needs and use cases. Our goal is to build a robust and flexible system that stores the reports once and allows each team to schedule their aggregation jobs independently. We want to give teams the flexibility to choose how their reports are aggregated: every hour for lower latency, once a week for a better signal-to-noise ratio, or a more sophisticated pattern, such as a daily sum over odd hours.
According to the API specification, when creating a job, we must provide the bucket name and the prefix of the files with aggregatable reports to be included in the aggregation. Since I want to perform the aggregation every hour, it seems necessary to have a separate prefix for each hour. For example:
However, if I need to perform an aggregation over a 6-hour interval (using a different filtering ID), I encounter a problem. The API only allows one prefix, which means I would need to copy the data to a new location. This approach seems impractical and inefficient.
It would be highly beneficial if the aggregation service could accept a list of prefixes. This change would allow more flexibility in specifying the data intervals for aggregation without needing to duplicate data.
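To make the limitation concrete, here is a minimal sketch that checks whether any single prefix selects exactly a wanted set of hourly folders. The `09-10-2024/hour-HH/` layout and the function name are assumptions for illustration only; nothing here calls the aggregation service.

```python
import os


def single_prefix_for(wanted, all_prefixes):
    """Return one prefix matching exactly `wanted` among `all_prefixes`, else None.

    The longest common prefix of `wanted` is the most selective candidate:
    any shorter prefix matches at least as many folders, so if it fails,
    no single prefix can work.
    """
    wanted = set(wanted)
    if not wanted:
        return None
    common = os.path.commonprefix(sorted(wanted))
    matched = {p for p in all_prefixes if p.startswith(common)}
    return common if matched == wanted else None


if __name__ == "__main__":
    # Hypothetical flat layout: one folder per hour of a single day.
    hours = [f"09-10-2024/hour-{h:02d}/" for h in range(1, 25)]
    # One hour is expressible as a single prefix.
    print(single_prefix_for(hours[2:3], hours))
    # Hours 1-6 are not: the common prefix also matches hours 7-9.
    print(single_prefix_for(hours[0:6], hours))
    # Odd hours are not expressible either.
    print(single_prefix_for(hours[0::2], hours))
```

Under this assumed layout, both the 6-hour window and the odd-hours pattern come back as `None`, which is the case where a list of prefixes would avoid copying data.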