Extensions of dump_clean.sql for selective and daily dumps #699
Conversation
Per #427 (comment) we should probably tweak a few things:
- Remove the blacklisting (this is for historical stuff and shouldn't be required for a new daily export)
- Rename it from "mclean" to avoid confusion

@Frangible should be able to advise here if there are any other questions beyond what was on the original PR.
@matschaffer did you mean removing these filters: https://github.com/Safecast/safecastapi/pull/699/files#diff-2a92cbfc2514ad867e2c7fa579a3d0c9R21-R38

I'm pretty sure that's what @Frangible meant. Would be nice if he could confirm.

Also wondering if we should leave … Not sure how a second bulk export got into the request. @absalomshu can you provide more info on why you need another selective bulk export (instead of just the daily)? Those full-data-set exports are really expensive in terms of DB resources, so I'd rather not have two. I'm okay with the one we have today plus a daily incremental export.
Yes, if you’re not querying historical data, you shouldn’t need any of the exclusions/blacklisting.
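For concreteness, here is a minimal sketch of what a daily export without those exclusions could look like. This is a sketch only: the measurements table and column names are inferred from the diff below, while the one-day window, connection settings, and output path are assumptions, not part of this PR.

    #!/usr/bin/env bash
    # Hypothetical sketch: daily incremental export with the historical
    # blacklist removed; only basic sanity filters remain. Connection
    # settings are assumed to come from the environment (as in cron_env.sh).
    set -euo pipefail

    psql --no-psqlrc --quiet -c "
      COPY (
        SELECT id, user_id, captured_at, value, unit,
               ST_Y(location::geometry) AS latitude,
               ST_X(location::geometry) AS longitude
        FROM measurements
        WHERE captured_at >= now() - interval '1 day'  -- assumed window
          AND value IS NOT NULL
          AND location IS NOT NULL
      ) TO STDOUT WITH CSV HEADER
    " > dump_clean_daily-out.csv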
cron/dump_clean_daily.sql
    AND captured_at IS NOT NULL
    AND value IS NOT NULL
    AND location IS NOT NULL
    AND ((unit = 'cpm'
          AND ((device_id IS NULL AND value BETWEEN 10.00 AND 350000.0)
               OR ((device_id <= 24 OR device_id IN (69, 89))
                   AND value BETWEEN 10.00 AND 30000.0)))
         OR (unit = 'celsius' AND value BETWEEN -80 AND 80))
    AND (ST_X(location::geometry) != 0.0 OR ST_Y(location::geometry) != 0.0)
    AND ST_Y(location::geometry) BETWEEN -85.05 AND 85.05
    AND ST_X(location::geometry) BETWEEN -180.00 AND 180.00
    AND (user_id NOT IN (9, 442)
         OR value < 35.0
         OR ST_Y(location::geometry) NOT BETWEEN 35.4489 AND 35.7278
         OR ST_X(location::geometry) NOT BETWEEN 139.5706 AND 139.9186)
    AND (user_id != 366
         OR value < 35.0
         OR ST_Y(location::geometry) NOT BETWEEN -45.5201 AND -7.6228
         OR ST_X(location::geometry) NOT BETWEEN 111.3241 AND 153.8620
         OR (value < 105.0 AND ST_X(location::geometry) < 111.3241));
@Frangible @matschaffer Let me know if we need these filters.
Hi,
I don’t think they’re needed; other filtering can be done locally. By the way, where would the daily exports be located, and how many will be stored?
Cheers,
Julian Villegas, Ph.D.
Hi,
I think keeping, say, the last week of daily backups could be useful. I imagine someone feeding a third-party analysis who is missing a couple of days of data after a system migration or failure; having those extra days around could make the recovery faster.
Anyway, I’m really grateful that you guys are doing this.
Have a nice weekend,
Julian Villegas, Ph.D.
@auspicacious commented on cron/dump_clean_daily:

    # This differs from dump_measurements, which does not filter data at all.
    # The schema returned is also different; redundant/unused columns have been eliminated, and id/user_id are finally included.
    # The ORDER BY clause was removed for performance.

    set -euo pipefail

    source cron_env.sh

    cd ../public/system

    psql -f "${CRON_DIR}/dump_clean_daily.sql"

    tar -czf dump_clean_daily.tar.gz dump_clean_daily-out.csv
    mv dump_clean_daily-out.csv dump_clean_daily.csv

    # Now this is available as https://api.safecast.org/system/dump_clean_daily.tar.gz

Is this sufficient for our users' needs? Or should we name these files differently for each day and retain a few days of dumps? For instance, dump_clean_daily_2020-07-01.tar.gz.
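A hedged sketch of that dated-filename variant, with the week of local retention @julovi suggests above. The date format, working directory, and the seven-day cutoff are assumptions, not decisions made in this thread.

    # Hypothetical: date-stamped archives plus ~7 days of local retention.
    DATE="$(date +%Y-%m-%d)"   # e.g. 2020-07-01, matching the example name above
    tar -czf "dump_clean_daily_${DATE}.tar.gz" dump_clean_daily-out.csv
    # Prune archives older than a week (moot once these land in S3).
    find . -maxdepth 1 -name 'dump_clean_daily_*.tar.gz' -mtime +7 -delete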
@julovi They are stored on the server itself and can be found here: https://api.safecast.org/system/dump_clean_daily.tar.gz
I’d go with dated exports to S3, and just keep them indefinitely. I suspect even a few years of daily data will be well within budget.

> On Sat, Jun 20, 2020, @sasharevzin commented on cron/dump_clean_daily: Thank you very much, but it might take me some time to understand what I’m looking at, haha. I’m trying to understand what the task is. Do we want to keep a list of dumps going back three months? What is our goal? :)
Certainly less than the $75/mo we’d have to pay to run Kubernetes ;)
Btw, we might want to get these moved to S3 first. Then we can implement retention via S3 lifecycle rules.
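A minimal sketch of that move, assuming the AWS CLI is available on the server; the bucket name and key prefix here are hypothetical, not anything agreed in this thread.

    # Hypothetical: push the dated archive to S3 (bucket and prefix are made up).
    DATE="$(date +%Y-%m-%d)"
    aws s3 cp "dump_clean_daily_${DATE}.tar.gz" \
      "s3://safecast-exports/daily/dump_clean_daily_${DATE}.tar.gz"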
@sasharevzin and @matschaffer I'm thinking that we:
…
What do you think?
Yep. I'd say #571 is basically that "move it to S3" issue; we just need to make a note about the retention rules.
@matschaffer @auspicacious I think we are good to go with the first item; for the second, I created a ticket: #710. Thanks!
Sorry for the back and forth here, but two things:
- Just make it a straight measurements export.
Then yeah, let's do #710 after we set up the S3 storage. Then we don't have to worry about local cleanup; we can just let S3 lifecycle rules handle it.
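If retention ever is wanted, a sketch of the lifecycle-rule side; the bucket, prefix, and the 90-day window are all assumptions (the thread leans toward keeping dumps indefinitely).

    # Hypothetical: expire objects under daily/ after 90 days.
    aws s3api put-bucket-lifecycle-configuration \
      --bucket safecast-exports \
      --lifecycle-configuration '{
        "Rules": [{
          "ID": "expire-daily-dumps",
          "Filter": {"Prefix": "daily/"},
          "Status": "Enabled",
          "Expiration": {"Days": 90}
        }]
      }'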
@matschaffer two new crons to set up on the server:
# https://api.safecast.org/system/dump_clean_daily.tar.gz
cron/dump_clean_daily
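For reference, a sketch of what the crontab entry for that first script might look like; the schedule and repository path are guesses, not from this thread.

    # Hypothetical crontab entry (time and path are assumptions):
    0 3 * * * cd /srv/safecastapi/cron && ./dump_clean_daily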