time series arithmetic between the different elements #994

Open
3 tasks
zanete opened this issue Aug 27, 2024 · 6 comments
Labels: core-only (This issue is reserved for the IF core team only)

Comments

@zanete

zanete commented Aug 27, 2024

Why: Sub of #949. In order to create a realistic manifest file for the GSF website.
What: We need the ability to carry out simple arithmetic between the different elements
Context

Let's say we have a component containing a time series for number of page views per hour, which may have been populated using an importer plugin for e.g. google analytics.

We also have a separate component, e.g. web-server that has impacts for energy and carbon in the same time intervals as the page visits.

Now we want to calculate our SCI score by dividing carbon in each observation in the web-server component by the page views in the page-views component - we can't because all the information we need to process an observation and create a new output value has to exist within the same component as that observation.

This is problematic because it suggests we have to either know the page views in advance and manually add them everywhere we need them across our manifest, or we have to run some importer plugin for every component in the tree that wants to access that data, leading to a lot of repetition, points of failure and unnecessary carbon expenditure.

What this amounts to is that today, unless we want to make manual interventions to the manifest, we cannot use time series data for our functional unit in SCI calculations.

Here's what we want to be able to do:

  • we have a component, component A, in a tree whose input data is a time series populated using an importer plugin. This component tracks page visits for a website
  • we have a bunch more components, B -> Z that also have input data and a pipeline of plugins that eventually yield carbon per timestep
  • we run time sync so everything is snapped to a common grid
  • we then want to calculate SCI for all the components B -> Z by dividing carbon in each timestep in their time series element-wise by page visits in component A's time series data.
  • we aggregate the sci values, skipping component A because it doesn't have carbon values
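The steps above can be sketched roughly as follows. This is a minimal illustration in Python, not IF's implementation: the `sci_per_timestep` helper, the dictionary layout, and all sample values are assumptions made for the example. It assumes time sync has already snapped every series to the same timestamps.

```python
from typing import Dict, List

Observation = Dict[str, object]


def sci_per_timestep(
    carbon_series: List[Observation],
    visits_series: List[Observation],
) -> List[Observation]:
    """Divide carbon by page visits element-wise, matching on timestamp."""
    visits_by_time = {obs["timestamp"]: obs["page-visits"] for obs in visits_series}
    outputs = []
    for obs in carbon_series:
        if obs["timestamp"] not in visits_by_time:
            raise KeyError(f"no page-visits for {obs['timestamp']}")
        visits = visits_by_time[obs["timestamp"]]
        if visits == 0:
            raise ValueError(f"zero page-visits at {obs['timestamp']}")
        outputs.append({**obs, "sci": obs["carbon"] / visits})
    return outputs


# Component A: page visits only (hypothetical sample data).
component_a = [
    {"timestamp": "2023-07-06T00:00", "duration": 3600, "page-visits": 10},
    {"timestamp": "2023-07-06T01:00", "duration": 3600, "page-visits": 20},
]

# Components B..Z: carbon per timestep (hypothetical sample data).
carbon_components = {
    "component-b": [
        {"timestamp": "2023-07-06T00:00", "duration": 3600, "carbon": 5.0},
        {"timestamp": "2023-07-06T01:00", "duration": 3600, "carbon": 8.0},
    ],
    "component-c": [
        {"timestamp": "2023-07-06T00:00", "duration": 3600, "carbon": 2.0},
        {"timestamp": "2023-07-06T01:00", "duration": 3600, "carbon": 4.0},
    ],
}

# Component A is not in carbon_components, so it is skipped when aggregating.
results = {
    name: sci_per_timestep(series, component_a)
    for name, series in carbon_components.items()
}
total_sci = sum(obs["sci"] for series in results.values() for obs in series)
```

Matching on timestamp rather than list position means a missing observation fails loudly instead of silently misaligning the two series.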

We might have to assert that --observe plugins across the whole tree are executed before any --compute plugins are executed, otherwise we have ordering requirements for certain compute plugins (e.g. we could try to execute a sci that relies on some functional unit in another component where those values haven't been imported yet).
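One way to picture this ordering requirement is a whole-tree pass per phase. The sketch below is an assumption-laden illustration, not IF's code: the `run_phase` and `run` helpers, the `shared` store, and the two stand-in plugins are all hypothetical.

```python
from typing import Callable, Dict, List

Observations = List[dict]
# A plugin takes a component's observations plus a shared store.
Plugin = Callable[[Observations, dict], Observations]


def import_page_visits(observations: Observations, shared: dict) -> Observations:
    """Stand-in for an importer: returns fixed data and publishes it to the shared store."""
    data = [{"timestamp": "2023-07-06T00:00", "duration": 3600, "page-visits": 10}]
    shared["page-visits"] = {obs["timestamp"]: obs["page-visits"] for obs in data}
    return data


def sci(observations: Observations, shared: dict) -> Observations:
    """Divide carbon by the page visits published by another component."""
    visits = shared["page-visits"]  # a KeyError here means observe has not run yet
    return [{**obs, "sci": obs["carbon"] / visits[obs["timestamp"]]} for obs in observations]


def run_phase(node: dict, phase: str, plugins: Dict[str, Plugin], shared: dict) -> None:
    """Run one phase's plugins on this node, then recurse into its children."""
    for name in node.get("pipeline", {}).get(phase, []):
        data = node.get("outputs") or node.get("inputs") or []
        node["outputs"] = plugins[name](data, shared)
    for child in node.get("children", {}).values():
        run_phase(child, phase, plugins, shared)


def run(tree: dict, plugins: Dict[str, Plugin]) -> dict:
    shared: dict = {}
    # Finish observe for the whole tree before any compute plugin runs.
    for phase in ("observe", "compute"):
        run_phase(tree, phase, plugins, shared)
    return tree


# "server" is listed before the component that imports page visits.
tree = {
    "children": {
        "server": {
            "pipeline": {"compute": ["sci"]},
            "inputs": [{"timestamp": "2023-07-06T00:00", "duration": 3600, "carbon": 5.02}],
        },
        "page-views": {"pipeline": {"observe": ["import-page-visits"]}, "inputs": []},
    }
}
plugins = {"import-page-visits": import_page_visits, "sci": sci}
result = run(tree, plugins)
```

Because the observe pass covers the whole tree first, the order in which components appear no longer matters to `sci`. Running each component's full pipeline in turn would instead raise a `KeyError` for `server` in this example.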

Note: Why not just use the importer inside each component and add the page-visits to each observation?
A few reasons. First, it's a wasteful way to get the data: it would require an external API call per component for data we already have, which is time, energy and carbon inefficient. Also, it's plausible the response could change from one component to another. It also requires that the data arriving from the importer is already synced with the existing set of timestamps, which it may or may not be; this would be tricky to handle internally. These are the reasons I think separate components plus cross-component operations are the way to go.

Narek's implementation notes

To let the framework know that we want to reuse the observed value in other child components, we pass a store-result: true flag in the plugin's config in the initialize section, like this:

azure-importer:
  store-result: true
  ...

In the pipeline, the user can then give the component name and the plugin name to reuse its data:

pipeline:
  compute:
    - child-1:azure-importer
  regroup:
    - some-field
...

Note from @jmcook1186: I prefer something like global: true over store-result: true. Then we can invoke it using global: page-views rather than the original component name.

Meanwhile, the framework checks whether the name in the compute section is present in the plugin storage. If it is, the plugin is executed from scratch; otherwise, the framework checks the results storage to see whether a previous child component saved any data under that name.
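The lookup order described here could look something like the sketch below. It is only an illustration of the idea: the `Runner` class, its method names, and the choice to return the stored observations unchanged are assumptions, and the sample `cpu/utilization` value is made up.

```python
from typing import Callable, Dict, List, Set

Observations = List[dict]
Plugin = Callable[[Observations], Observations]


class Runner:
    def __init__(self, plugins: Dict[str, Plugin], store_result: Set[str]) -> None:
        self.plugins = plugins            # plugin storage: name -> callable
        self.store_result = store_result  # plugins initialised with store-result: true
        self.results: Dict[str, Observations] = {}  # "component:plugin" -> saved data

    def run_step(self, component: str, step: str, observations: Observations) -> Observations:
        # 1. A known plugin name is executed from scratch.
        if step in self.plugins:
            output = self.plugins[step](observations)
            if step in self.store_result:
                self.results[f"{component}:{step}"] = output
            return output
        # 2. Otherwise, look for data saved by a previous child component.
        if step in self.results:
            return self.results[step]
        raise KeyError(f"'{step}' is neither a plugin nor a stored result")


calls: List[str] = []


def azure_importer(observations: Observations) -> Observations:
    """Stand-in for the real importer; records each call instead of hitting an API."""
    calls.append("azure-importer")
    return [{"timestamp": "2023-07-06T00:00", "duration": 3600, "cpu/utilization": 20}]


runner = Runner({"azure-importer": azure_importer}, {"azure-importer"})
first = runner.run_step("child-1", "azure-importer", [])
reused = runner.run_step("child-2", "child-1:azure-importer", [])
```

In this sketch the second component gets the stored data without a second call to the importer, which is the saving the proposal is after.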

Scope of work:

  • IF behaviour updated to enable cross-component operations
  • documentation updated
  • test cases added

Acceptance Criteria

Scenario 1

GIVEN the cross-component operations are working
WHEN I run the following manifest:

name: sci demo
description: successful path
tags:
initialize:
  plugins:
    page-visits:
      kind: plugin
      global: true
      method: AnalyticsImporter
      path: "some-path"
      config:
        functional-unit: requests
        output-parameter: 'page-visits'
    sci:
      kind: plugin
      method: Sci
      path: "builtin"
      config:
        functional-unit: global/page-visits
tree:
  children:
    component-1:
      pipeline:
        compute:
          - analytics-importer
      inputs:
    server:
      pipeline:
        compute:
          - sci
      inputs:
        - timestamp: 2023-07-06T00:00
          duration: 3600
          energy: 5
          carbon-operational: 5
          carbon-embodied: 0.02
          carbon: 5.02

I get the following output:

name: sci
description: successful path
tags:
initialize:
  plugins:
    page-visits:
      kind: plugin
      method: AnalyticsImporter
      path: "some-path"
      config:
        output-parameter: 'page-visits'
    sci:
      kind: plugin
      method: Sci
      path: "builtin"
      config:
        functional-unit: global/page-visits
tree:
  children:
    component-1:
      pipeline:
        compute:
          - analytics-importer
      inputs:
        - timestamp: 2023-07-06T00:00
          duration: 3600
          page-visits: 10
    server:
      pipeline:
        compute:
          - sci
      inputs:
        - timestamp: 2023-07-06T00:00
          duration: 3600
          energy: 5
          carbon-operational: 5
          carbon-embodied: 0.02
          carbon: 5.02
      outputs:
        - timestamp: 2023-07-06T00:00
          duration: 3600
          energy: 5
          carbon-operational: 5
          carbon-embodied: 0.02
          carbon: 5.02
          sci: 0.502
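
The expected sci value follows from the single observation in each component. A quick check, with the values copied from the manifest above (the variable names are illustrative only):

```python
import math

carbon_operational = 5
carbon_embodied = 0.02
carbon = 5.02        # server: carbon for the 2023-07-06T00:00 observation
page_visits = 10     # component-1: page-visits for the same timestamp

# carbon is the sum of its operational and embodied parts
assert math.isclose(carbon_operational + carbon_embodied, carbon)

# sci is carbon divided by the functional unit
sci = carbon / page_visits
assert math.isclose(sci, 0.502)
```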

@zanete zanete added this to IF Aug 27, 2024
@zanete zanete converted this from a draft issue Aug 27, 2024
@zanete zanete added draft The issue is still being written, no need to respond or action on anything. core-only This issue is reserved for the IF core team only labels Aug 27, 2024
@zanete zanete moved this from In Design to In Refinement in IF Aug 27, 2024
@jmcook1186 jmcook1186 removed the draft The issue is still being written, no need to respond or action on anything. label Aug 27, 2024
@zanete zanete moved this from In Refinement to Ready in IF Aug 29, 2024
@zanete zanete moved this from In Progress to Parked in IF Sep 16, 2024
@zanete zanete mentioned this issue Sep 16, 2024
8 tasks
@zanete
Author

zanete commented Sep 30, 2024

@jawache please review this solution

@zanete zanete added this to the IF 1.0 milestone Sep 30, 2024
@jawache
Contributor

jawache commented Oct 1, 2024

@narekhovhannisyan and @jmcook1186, so for now we can live with multiple API calls and lots of human massaging of data.

Historically, we discussed this in much earlier versions of IF, and the solution we landed on was to implement an internal caching feature: end users configure the plugins as they want, and we optimize by caching plugin results and returning the same results "for the same query". That kind of approach still needs a lot of refinement, since it's not actually that straightforward.
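As a rough picture of what such a cache might involve, here is a minimal sketch. It is not an existing IF feature: the `PluginCache` class, its key scheme, and the stand-in importer are all assumptions. "The same query" is taken to mean the same plugin name, config, and inputs, which is one of the decisions that would need refining.

```python
import copy
import json
from typing import Callable, Dict, List

Observations = List[dict]
Plugin = Callable[[dict, Observations], Observations]


class PluginCache:
    """Memoise plugin results by (plugin name, config, inputs)."""

    def __init__(self) -> None:
        self._store: Dict[str, Observations] = {}
        self.hits = 0

    @staticmethod
    def _key(name: str, config: dict, inputs: Observations) -> str:
        # sort_keys makes the key independent of dict key ordering
        return json.dumps([name, config, inputs], sort_keys=True)

    def call(self, name: str, plugin: Plugin, config: dict, inputs: Observations) -> Observations:
        key = self._key(name, config, inputs)
        if key in self._store:
            self.hits += 1
        else:
            self._store[key] = plugin(config, inputs)
        # Return a copy so callers cannot mutate the cached result.
        return copy.deepcopy(self._store[key])


api_calls = 0


def analytics_importer(config: dict, inputs: Observations) -> Observations:
    """Stand-in for an external API call; counts how often it is invoked."""
    global api_calls
    api_calls += 1
    return [{"timestamp": "2023-07-06T00:00", "duration": 3600, "page-visits": 10}]


cache = PluginCache()
config = {"output-parameter": "page-visits"}
a = cache.call("page-visits", analytics_importer, config, [])
b = cache.call("page-visits", analytics_importer, config, [])            # served from cache
c = cache.call("page-visits", analytics_importer, {"other": True}, [])   # different query
```

A key built only from config and inputs cannot tell whether the external source has changed between calls, which is part of why this is less simple than it looks.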

However, I think it's a bit premature to optimize in this way. The impact is low (a few repeated API calls and/or copy/paste of data), and I think the proposed solution can have a lot of unintended consequences (global data, automatic copying of sub-trees) which we'll be stuck with for a long time.

@jawache
Contributor

jawache commented Oct 1, 2024

@narekhovhannisyan I saw this in the text above "We might have to assert that --observe plugins across the whole tree are executed before any --compute plugins are executed, otherwise we have ordering requirements for certain compute plugins (e.g. we could try to execute a sci that relies on some functional unit in another component where those values haven't been imported yet)."

Can you confirm whether we are running observe as a stage independent of group and compute or whether we have merged it all into compute?

I had assumed that, with our new pure functional architecture, these would have been written as their own pure functions?

@narekhovhannisyan
Member

@jawache Yeah, observe is executed first, then group and compute at the end.
You can see it here https://github.com/Green-Software-Foundation/if/blob/main/src/if-run/lib/compute.ts#L107

@jmcook1186
Contributor

jmcook1186 commented Oct 1, 2024

@jawache @narekhovhannisyan

Ok, accept the feedback on cross-component interactions for now. Let me just quickly add some colour to the response re observe though.

The observe phase in IF executes before the regroup and compute phases. However, it can get more complicated, because you also have to decide which plugins to define as observe plugins (i.e. they are included in the observe pipeline and operate on the inputs array only, as enforced in their source code). This isn't always as obvious a decision as it might seem: some plugins that we might want to be observe plugins, because they make an external API call or read from a file, might rely on values generated by compute plugins. In that case, they have to be bumped out of the observe pipeline and into compute.

For example, it's not uncommon to need to chain input plugins together. Maybe we want to look up a processor name and then look up its TDP using two external API calls. Unless the two APIs have identical naming systems, we need to insert the regex plugin between the two to make them interoperate. As regex is a compute plugin, the two observe plugins have to become compute plugins or the chain breaks.
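
A small sketch of that chain may make the dependency easier to see. Everything here is hypothetical: the two lookup functions stand in for external API calls, the `regex` function stands in for the regex plugin, and the processor names and TDP value are invented for the example.

```python
import re

# Invented data standing in for the second API, which uses short model names.
TDP_BY_MODEL = {"X100": 150}


def lookup_processor(obs: dict) -> dict:
    """Stand-in for the first API: returns a long, vendor-style processor name."""
    return {**obs, "processor": "ACME(R) Turbo(R) X100 CPU @ 2.60GHz"}


def regex(obs: dict, parameter: str, pattern: str, output: str) -> dict:
    """Stand-in for the regex plugin: extract the first match into a new field."""
    match = re.search(pattern, obs[parameter])
    if match is None:
        raise ValueError(f"no match for {pattern!r} in {obs[parameter]!r}")
    return {**obs, output: match.group(0)}


def lookup_tdp(obs: dict) -> dict:
    """Stand-in for the second API: needs the normalised name produced by regex."""
    return {**obs, "tdp": TDP_BY_MODEL[obs["processor-model"]]}


observation = {"timestamp": "2023-07-06T00:00", "duration": 3600}
step_1 = lookup_processor(observation)
step_2 = regex(step_1, "processor", r"X\d+", "processor-model")
step_3 = lookup_tdp(step_2)
```

`lookup_tdp` reads a field that only exists after `regex` has run, so it cannot sit in a phase that finishes before `regex` starts.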

This is why it can be harder than it seems to separate out the observe phase and generate a static file to pass to compute or regroup.

@jawache
Contributor

jawache commented Oct 4, 2024

@jmcook1186 I agree, it's a fluffy decision whether something is an observe plugin or compute.

The original intention of breaking up the pipeline was to make the decision about where to put the TimeSync and Grouping plugins more obvious, they were supposed to both be baked into the regroup step.

Another reason was for verification, plugins that require non-public data, API keys, logins etc... can go in observe so someone "verifying" can just rerun compute and not require all the same permissions, keys, etc...

But it's making less and less sense. For instance, the WattTime plugin would still only work after the grouping, but given the above we would want it to fit in observe.

It's getting a code smell to me, perhaps we need to revisit. My gut is telling me there is a super elegant and simple solution to do with an alternative view on time (global time window, fixed duration, dunno?) and atomic observations (treating each observation independently of the time series it's part of negates the need to group to get a unique time series). It's on the tip of my tongue but just out of reach.

Projects
Status: Parked
Development

No branches or pull requests

4 participants