
Account for job start/end time not exactly matching forecast data points #54

Merged (23 commits) Aug 1, 2023

Conversation

@tlestang (Collaborator) commented Jul 23, 2023

Currently, the average carbon intensity for a job is estimated over a time interval spanning $N + 1$ data points, where $N$ is the ceiling of the ratio between the job duration and the timestep between data points. Moreover, cats approximates the job start time as the time of the first available data point. For carbonintensity.org.uk, this is the previous top of the hour or half hour.

example:
The current time is 12:48 and my job is 194 minutes long. The job start time will be approximated as 12:30, and the job duration will be approximated as $210 min = 7 \times 30 min$, where $30 min$ is the interval between two data points.
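The rounding described above can be sketched as follows (an illustrative snippet, not the actual cats code; `approximate_window` is a hypothetical helper name):

```python
import math
from datetime import datetime

def approximate_window(now, duration_mins, step_mins=30):
    # snap the start back to the previous data point (top of the hour or half hour)
    snapped = now.replace(minute=(now.minute // step_mins) * step_mins,
                          second=0, microsecond=0)
    # round the duration up to a whole number of 30-minute steps
    n_steps = math.ceil(duration_mins / step_mins)
    return snapped, n_steps * step_mins

start, duration = approximate_window(datetime(2023, 7, 23, 12, 48), 194)
# start is 12:30, duration is 210 minutes (7 x 30 min)
```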

The approach followed in the PR is similar to changes included in #43 , but the implementation fits within the current structure by only enhancing the WindowedForecast class. These changes are covered by a new test test_average_intensity_with_offset.

implementation:
The average carbon intensity over a given time window is estimated by summing the intensity values at the midpoints between each pair of consecutive data points in the window. The first midpoint (midpt[0]) is overridden to account for the fact that the first point (used to compute midpt[0]) should be located at the job start time. The corresponding intensity value is interpolated between the first and second data points. Similarly, the last midpoint (midpt[-1]) is overridden to account for the fact that the last point should be located at the job end time. The corresponding intensity value is interpolated between the penultimate and last data points in the window.

For a given candidate window, the first (last) data point is interpolated using the available data point directly preceding (following) the start (end) of the job.
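The endpoint interpolation works like ordinary linear interpolation between the two bracketing data points. A minimal sketch (`interp` is a hypothetical helper, not a function from cats/forecast.py):

```python
from datetime import datetime

def interp(before, after, t):
    # before and after are (datetime, value) pairs with before[0] <= t <= after[0];
    # the value at t is read off the straight line joining the two points
    frac = (t - before[0]) / (after[0] - before[0])
    return before[1] + frac * (after[1] - before[1])

value = interp((datetime(2023, 1, 1, 13, 0), 40.0),
               (datetime(2023, 1, 1, 13, 30), 50.0),
               datetime(2023, 1, 1, 13, 18))
# 40 + (18/30) * 10 = 46.0
```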

edit 2023-07-24 13:20: Changed the implementation to handle short jobs correctly.

@tlestang (Collaborator, Author)

Looks like this only works with Python 3.10+, because I'm relying on being able to pass a sorting key to bisect. Probably worth working around this to support 3.9 as well.
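For context, the `key` argument to the bisect functions was only added in Python 3.10; a common workaround on 3.9 is to bisect a precomputed list of keys instead. A sketch with illustrative data (not from the cats codebase):

```python
import bisect

data = [(1, "a"), (3, "b"), (5, "c")]  # sorted by first element

# Python 3.10+ allows: bisect.bisect_left(data, 4, key=lambda d: d[0])
keys = [d[0] for d in data]            # works on 3.9 and earlier
i = bisect.bisect_left(keys, 4)
# i is the index where a record with key 4 would be inserted
```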

@andreww (Collaborator) left a comment

Changes all look good to me. One question - what happens if somebody submits a task that lasts less than 30 mins? For now, we maybe just want a test case (and I hope it 'just works'). In principle I guess we could start fitting some smooth function to the CI data, but I don't think we want to go there.

This allows handling of short jobs that fit between two consecutive
data points.  It also makes the implementation simpler, in my opinion.
@tlestang (Collaborator, Author) commented Jul 24, 2023

what happens if somebody submits a task that lasts less than 30 mins?

Well that's a very good question, and I don't think this case was handled correctly.

In response to this I changed the implementation slightly so that it 'just works' for short jobs; in other words, they shouldn't be a corner case anymore. I'm adding a new test with a 6-minute job to check that this is correct.

Details: instead of computing the mid-points over the (over-estimated) window and then fixing them, it now builds the window, interpolating both ends, and computes the midpoints as if nothing happened. Besides handling short jobs naturally, I think this actually makes the code a lot easier to understand.

Example: You run cats at 12:48 for a 6-minute job. The second candidate window is 13:18 to 13:24, located between the data points at 13:00 (data[1]) and 13:30 (data[2]). Both carbon intensity values, at 13:18 and 13:24, are interpolated by drawing a straight line between the data points at 13:00 and 13:30.

I guess we could start fitting some smooth function to the CI data

I think linear interpolation is enough. If you look at the carbon intensity timeseries, it's already very smooth. In other words the signal doesn't exhibit large variations within 30 min. I guess forecast providers provide data points at an interval ensuring the timeseries is well resolved.

(Review thread on cats/forecast.py — outdated, resolved)
@colinsauze (Member)

Example: You run cats at 12:48 for a 6-minute job. The second candidate window is 13:18 to 13:24, located between the data points at 13:00 (data[1]) and 13:30 (data[2]). Both carbon intensity values, at 13:18 and 13:24, are interpolated by drawing a straight line between the data points at 13:00 and 13:30.

I guess we could start fitting some smooth function to the CI data

I think linear interpolation is enough. If you look at the carbon intensity timeseries, it's already very smooth. In other words the signal doesn't exhibit large variations within 30 min. I guess forecast providers provide data points at an interval ensuring the timeseries is well resolved.

I'd agree that linear interpolation is good enough here. Getting things within the right 30 minute period will achieve what we need in terms of emissions minimisation.

Part of me is tempted to say the best strategy for sub 30 minute jobs is to randomly determine when they run within the lowest 30 minute window available. This way if many people were launching small jobs they wouldn't all try to run at the same time (even when distributed across multiple unrelated systems).
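The randomised-start idea could look something like this (purely illustrative, not part of cats; `random_start` is a hypothetical helper):

```python
import random
from datetime import datetime, timedelta

def random_start(window_start, window_mins, job_mins):
    # pick a uniformly random start inside the lowest-intensity window,
    # leaving enough slack for the job to finish within the window
    slack = max(window_mins - job_mins, 0)
    return window_start + timedelta(minutes=random.uniform(0, slack))

s = random_start(datetime(2023, 1, 1, 13, 0), 30, 6)
# s falls somewhere between 13:00 and 13:24
```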

@tlestang (Collaborator, Author)

Part of me is tempted to say the best strategy for sub 30 minute jobs is to randomly determine when they run within the lowest 30 minute window available. This way if many people were launching small jobs they wouldn't all try to run at the same time (even when distributed across multiple unrelated systems).

Smart. I don't really want to add more features to this PR but that's probably a good starting point for a new one. Or an issue.

@colinsauze (Member)

Yes, it's definitely something for a different PR/issue and not critical for getting a minimum viable product ready.

@Llannelongue (Contributor) left a comment

Thanks for this. I've just run a small test that I had implemented in #43, and for some reason I don't get the expected result. It looks like lbound computes the wrong value here, but it may also well be that I ran the test incorrectly for the current structure. Do you know why it fails?

import math
from datetime import datetime

from cats.forecast import CarbonIntensityPointEstimate, WindowedForecast

CI_forecast = [
    CarbonIntensityPointEstimate(datetime=datetime(2023,1,1,8,30), value=26),
    CarbonIntensityPointEstimate(datetime=datetime(2023,1,1,9,0), value=40),
    CarbonIntensityPointEstimate(datetime=datetime(2023,1,1,9,30), value=50),
    CarbonIntensityPointEstimate(datetime=datetime(2023,1,1,10,0), value=60),
    CarbonIntensityPointEstimate(datetime=datetime(2023,1,1,10,30), value=25),
]
wf = WindowedForecast(data=CI_forecast, duration=70, start=datetime(2023,1,1,9,15))
print(wf[0])
start_new = 45
end_new = 60 + (25-60)/30*25
expected_result_trapezoidal = ((start_new+50)*15/2 + (50+60)*30/2 + (60+end_new)*25/2)/70
print(expected_result_trapezoidal)
assert math.isclose(wf[0].value, expected_result_trapezoidal)

(I believe the integral between 9:15 and 10:25 should be 49.9.., with the left bound 45 and right bound 30.83..)

# second data point (index + 1) in the window. The ending
# intensity value is interpolated between the last and
# penultimate data points in the window.
window_start = self.start + index * self.data_stepsize
A Contributor commented:

One potential downside of doing all the calculations in the __getitem__ method is that it requires redoing calculations every time (I think?): as in, min already calculates all windows, but then w[0] does it again, etc. An alternative would be to calculate all windows once, store them in self, and only access them after that. Or are there performance benefits to doing everything here?

Reply from the Collaborator (Author):

Overall I think performance should not be a concern here. Even then, only the first window is computed twice. But yes, if you were to run min twice, you'd compute all windows twice.
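The caching alternative discussed above can be sketched like this (illustrative names, not the cats implementation): do the per-window work once in __init__ and let __getitem__ simply index the stored results.

```python
class Windows:
    def __init__(self, values):
        # the expensive per-window calculation is done once, up front
        self._cache = [self._compute(v) for v in values]

    @staticmethod
    def _compute(value):
        return value * 2  # stand-in for the real window averaging

    def __getitem__(self, index):
        # min(...) and w[0] now reuse the same stored results
        return self._cache[index]

    def __len__(self):
        return len(self._cache)

w = Windows([3, 1, 2])
# w[0] == 6; min(w._cache) == 2 without recomputation
```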

@tlestang (Collaborator, Author)

Thanks for testing @Llannelongue, because I think you've exposed a mistake! In __getitem__,

        return CarbonIntensityAverageEstimate(
            start=window_start,
            end=window_end,
            value=sum(midpt) / (self.ndata - 1),
        )

dividing by self.ndata - 1 assumes all midpoints are regularly spaced: true for the inner points, but not for the first and last ones, which are based on interpolated data points. I've also made this assumption in the tests, I believe... 🤕

tlestang and others added 4 commits July 27, 2023 11:34
Inner midpoints are separated by the same timestep, but the first (interpolated) and second points are not. The same goes for the penultimate and last (interpolated) points.
@tlestang (Collaborator, Author)

I changed the average calculation to account for the difference in weights. @Llannelongue I added a test case based on your test above.

I believe the integral between 9:15 and 10:25 should be 49.9.., with the left bound 45 and right bound 30.83..

Agreed. I think it is now ;)
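The corrected weighting amounts to a plain trapezoidal rule over irregularly spaced points. A sketch reproducing the review's 9:15–10:25 example (assumes a piecewise-linear signal; not the actual cats code):

```python
def trapezoidal_average(points):
    """Time-weighted mean of (minutes, value) pairs via the trapezoidal rule."""
    area = sum((v1 + v2) / 2 * (t2 - t1)
               for (t1, v1), (t2, v2) in zip(points, points[1:]))
    return area / (points[-1][0] - points[0][0])

# window 9:15-10:25 as minutes from the start, with interpolated endpoints:
# left bound 45, data points 50 and 60, right bound 60 + (25-60)/30*25 = 30.83..
pts = [(0, 45.0), (15, 50.0), (45, 60.0), (70, 60 + (25 - 60) / 30 * 25)]
avg = trapezoidal_average(pts)
# avg is approximately 49.97, matching the 49.9.. quoted above
```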

@sadielbartholomew sadielbartholomew removed their request for review August 1, 2023 14:01
@tlestang (Collaborator, Author) commented Aug 1, 2023

Thank you @andreww @colinsauze and @Llannelongue for reviewing this. It really helped make this better!

@tlestang tlestang merged commit f625fa2 into main Aug 1, 2023
3 checks passed
@tlestang tlestang deleted the adjust_integration_window branch August 1, 2023 14:18