Account for job start/end time not exactly matching forecast data points #54
Conversation
Looks like this is only working with Python 3.10+, because I'm relying on being able to pass a sorting key to
Changes all look good to me. One question - what happens if somebody submits a task that lasts less than 30 mins? For now, we maybe just want a test case (and I hope it 'just works'). In principle I guess we could start fitting some smooth function to the CI data, but I don't think we want to go there.
This allows handling of short jobs that fit between two consecutive data points. It also makes the implementation simpler, in my opinion.
Well that's a very good question, and I don't think this case was handled correctly. In response I changed the implementation slightly so it 'just works' for short jobs; in other words, it shouldn't be a corner case anymore. I'm adding a new test with a 6-minute job to check that this is correct. Details: instead of computing the midpoints over the (over-estimated) window and then fixing them, it now builds the window, interpolating both ends, and computes the midpoints as if nothing happened. Besides handling short jobs naturally, I think this actually makes the code a lot easier to understand. Example: You run
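The reworked approach can be sketched as follows. This is an illustrative, standalone version written for this comment, not the actual cats code: it uses plain `(datetime, value)` tuples instead of `CarbonIntensityPointEstimate`, and the function and variable names are made up.

```python
from datetime import datetime, timedelta

def average_intensity(forecast, start, duration):
    """Sketch of the described approach: interpolate the intensity at the
    job start and end, keep the interior forecast points, then integrate
    with the trapezoidal rule and divide by the job duration."""
    end = start + duration

    def interp(t):
        # Find the pair of forecast points bracketing t and interpolate.
        for (t0, v0), (t1, v1) in zip(forecast, forecast[1:]):
            if t0 <= t <= t1:
                return v0 + (v1 - v0) * (t - t0) / (t1 - t0)
        raise ValueError("t outside forecast range")

    # Interpolated endpoints plus any forecast points strictly inside the
    # job window. A short job that falls between two consecutive data
    # points simply yields a two-point window -- no special casing.
    window = ([(start, interp(start))]
              + [(t, v) for t, v in forecast if start < t < end]
              + [(end, interp(end))])
    area = sum((v0 + v1) / 2 * (t1 - t0).total_seconds()
               for (t0, v0), (t1, v1) in zip(window, window[1:]))
    return area / duration.total_seconds()

forecast = [(datetime(2023, 1, 1, 8, 30), 26),
            (datetime(2023, 1, 1, 9, 0), 40),
            (datetime(2023, 1, 1, 9, 30), 50),
            (datetime(2023, 1, 1, 10, 0), 60),
            (datetime(2023, 1, 1, 10, 30), 25)]
avg = average_intensity(forecast, datetime(2023, 1, 1, 9, 15),
                        timedelta(minutes=70))
# avg ≈ 49.97, matching the worked example later in this thread
```

A 6-minute job starting at 09:10 sits entirely between the 09:00 and 09:30 points, and the same code path handles it.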
I think linear interpolation is enough. If you look at the carbon intensity timeseries, it's already very smooth. In other words the signal doesn't exhibit large variations within 30 min. I guess forecast providers provide data points at an interval ensuring the timeseries is well resolved.
I'd agree that linear interpolation is good enough here. Getting things within the right 30-minute period will achieve what we need in terms of emissions minimisation. Part of me is tempted to say the best strategy for sub-30-minute jobs is to randomly determine when they run within the lowest 30-minute window available. This way, if many people were launching small jobs, they wouldn't all try to run at the same time (even when distributed across multiple unrelated systems).
Smart. I don't really want to add more features to this PR but that's probably a good starting point for a new one. Or an issue.
Yes, it's definitely something for a different PR/issue and not critical for getting a minimum viable product ready. |
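The random-placement idea could look something like this sketch. Everything here (function name, arguments) is hypothetical, just to record the idea for the future PR/issue:

```python
import random
from datetime import datetime, timedelta

def jittered_start(window_start, window_length, job_length, rng=random):
    """Pick a uniformly random start time within the lowest-carbon window
    such that the job still finishes inside the window. Illustrative
    only; names and signature are invented for this sketch."""
    slack = (window_length - job_length).total_seconds()
    return window_start + timedelta(seconds=rng.uniform(0, slack))

window = datetime(2023, 1, 1, 9, 30)
start = jittered_start(window, timedelta(minutes=30), timedelta(minutes=6))
# start is somewhere in [09:30, 09:54], so a 6-minute job ends by 10:00
assert window <= start <= window + timedelta(minutes=24)
```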
Thanks for this. I've just run a small test that I had implemented in #43, and for some reason I don't get the expected result. It looks like lbound computes the wrong value here, but it may also well be that I ran the test incorrectly against the current structure. Do you know why it fails?
```python
import math
from datetime import datetime
# CarbonIntensityPointEstimate and WindowedForecast imported from cats

CI_forecast = [
    CarbonIntensityPointEstimate(datetime=datetime(2023, 1, 1, 8, 30), value=26),
    CarbonIntensityPointEstimate(datetime=datetime(2023, 1, 1, 9, 0), value=40),
    CarbonIntensityPointEstimate(datetime=datetime(2023, 1, 1, 9, 30), value=50),
    CarbonIntensityPointEstimate(datetime=datetime(2023, 1, 1, 10, 0), value=60),
    CarbonIntensityPointEstimate(datetime=datetime(2023, 1, 1, 10, 30), value=25),
]
wf = WindowedForecast(data=CI_forecast, duration=70, start=datetime(2023, 1, 1, 9, 15))
print(wf[0])

start_new = 45
end_new = 60 + (25 - 60) / 30 * 25
expected_result_trapezoidal = (
    (start_new + 50) * 15 / 2 + (50 + 60) * 30 / 2 + (60 + end_new) * 25 / 2
) / 70
print(expected_result_trapezoidal)
assert math.isclose(wf[0].value, expected_result_trapezoidal)
```
(I believe the integral between 9:15 and 10:25 should be 49.9.., with the left bound 45 and right bound 30.83..)
```python
# second data point (index + 1) in the window. The ending
# intensity value is interpolated between the last and
# penultimate data points in the window.
window_start = self.start + index * self.data_stepsize
```
One potential downside of doing all the calculations in the `__getitem__` method is that it requires redoing calculations every time (I think?): as in, `min` already calculates all windows, but then `w[0]` does it again, etc. An alternative would be to calculate all windows once, store them in `self`, and only access them after that. Or are there performance benefits to doing everything here?
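The precompute-and-store alternative could look like this toy sketch. The class name and the window metric (a plain mean) are stand-ins for illustration, not the real `WindowedForecast` logic:

```python
class PrecomputedWindows:
    """Toy sketch: all windows are computed once in __init__ and cached,
    so __getitem__ (and hence min()) only reads the stored list."""
    def __init__(self, values, width):
        self._windows = [
            sum(values[i:i + width]) / width
            for i in range(len(values) - width + 1)
        ]

    def __getitem__(self, index):
        return self._windows[index]  # cached, nothing recomputed

    def __len__(self):
        return len(self._windows)

w = PrecomputedWindows([26, 40, 50, 60, 25], width=3)
best = min(w)  # min iterates the cached windows; a later w[0] is free
```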
Overall I think performance should not be a concern here. Even then, only the first window is computed twice. But yes, if you were to run `min` twice, you'd compute all windows twice.
Thanks for testing @Llannelongue because I think you've exposed a mistake! In `__getitem__` dividing by
Although the inner midpoints are separated by the same timestep, the first (interpolated) point and the second are not. The same goes for the penultimate and last (interpolated) points.
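Concretely, with the numbers from the example earlier in the thread (job 09:15–10:25 over a 30-minute forecast step), a plain mean of the midpoint values differs from the duration-weighted mean. This sketch only illustrates the weighting; it is not the cats code:

```python
# Segment durations (minutes) and midpoint intensity values for the
# 09:15-10:25 job. The first and last segments are shorter than the
# 30-minute step because their endpoints are interpolated at the job
# start and end.
durations = [15, 30, 25]
end_value = 60 + (25 - 60) / 30 * 25           # interpolated right bound
midpoints = [(45 + 50) / 2, (50 + 60) / 2, (60 + end_value) / 2]

uniform = sum(midpoints) / len(midpoints)       # ignores segment lengths
weighted = sum(m * d for m, d in zip(midpoints, durations)) / sum(durations)
# weighted matches the trapezoidal result (~49.97); uniform (~49.31)
# does not, because it gives the short end segments the same weight as
# a full 30-minute segment.
```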
Co-authored-by: Loïc Lannelongue <[email protected]>
I changed the average calculation to account for the difference in weights. @Llannelongue I added a test case based on your test above.
Agreed. I think it is now ;)
Thank you @andreww @colinsauze and @Llannelongue for reviewing this. It really helped make this better!
Currently, the average carbon intensity for a job is estimated over a time interval spanning $N + 1$ data points, where $N$ is the `ceil` of the ratio between the job duration and the timestep between data points. Moreover, cats approximates the job start time as the time of the first available data point. For carbonintensity.org.uk, this is the previous top of the hour, or half hour.

Example: the current time is 12:48 and my job is 194 minutes long. The job start time will be approximated as 12:30, and the job duration will be approximated as $210 min = 7 \times 30 min$, where $30 min$ is the interval between two data points.
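The approximation described above can be reproduced in a few lines using only the standard library (variable names here are mine, not cats identifiers):

```python
import math
from datetime import datetime, timedelta

step = timedelta(minutes=30)          # interval between forecast data points
duration = timedelta(minutes=194)     # actual job duration

n = math.ceil(duration / step)        # N = 7 steps
approx_duration = n * step            # 210 minutes, spanning N + 1 = 8 points

now = datetime(2023, 1, 1, 12, 48)
# Round the start down to the previous top of the hour or half hour.
approx_start = now.replace(minute=(now.minute // 30) * 30,
                           second=0, microsecond=0)
# approx_start == 12:30
```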
The approach followed in this PR is similar to the changes included in #43, but the implementation fits within the current structure by only enhancing the `WindowedForecast` class. These changes are covered by a new test, `test_average_intensity_with_offset`.

Implementation:
The average carbon intensity over a given time window is estimated by summing the intensity values at the midpoints between each pair of consecutive data points in the window. The first midpoint (`midpt[0]`) is overridden to account for the fact that the first point (used to compute `midpt[0]`) should be located at the job start time. The corresponding intensity value is interpolated between the first and second data points. Similarly, the last midpoint (`midpt[-1]`) is overridden to account for the fact that the last point should be located at the job end time. The corresponding intensity value is interpolated between the penultimate and last data points in the window. For a given candidate window, the first (last) data point is interpolated using the directly preceding (following) available data point before (after) the start (end) of the job.
edit 2023-07-24 13:20: Changed the implementation to handle short jobs correctly.