Types of Tasks #43

Open
matthewhanson opened this issue Sep 19, 2023 · 4 comments

Comments

@matthewhanson
Member

In @gadomski's PR #42 several types of tasks are defined.

class Task(BaseModel, ABC, Generic[Input, Output]):
    """A generic task."""

class PassthroughTask(Task[Anything, Anything]):
    """A simple task that doesn't modify the items at all."""

class StacOutputTask(Task[Input, Item], ABC):
    """Anything in, STAC out task."""

class ItemTask(StacOutputTask[Item], ABC):
    """STAC In, STAC Out task."""

class HrefTask(StacOutputTask[Href], ABC):
    """Href in, STAC Out task."""

I really like this way to define the input and output for different types of tasks, especially if it gives us JSON Schema!
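One way this could work (a sketch, not anything in the PR): assuming the Input and Output types are pydantic models, as in the later examples, pydantic can already emit JSON Schema for them. The Assets model below is hypothetical, and the exact method depends on the pydantic version (model_json_schema() on v2, schema() on v1).

import json

from pydantic import BaseModel

class Assets(BaseModel):
    """Hypothetical Input model for a task that expects two asset hrefs."""

    data: str
    metadata: str

# Print a JSON Schema document describing the expected task input (pydantic v2).
print(json.dumps(Assets.model_json_schema(), indent=2))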

Want to review these two Tasks:
StacOutputTask - Anything in, STAC out task.
HrefTask - Href in, STAC Out task

These tasks capture the need to create STAC Items from scratch. In the current payload structure you pass parameters to the task in the process definition; you don't hand them in as part of the Task Input (which would normally be a FeatureCollection). So the href (or multiple hrefs), along with other parameters, would be provided in process.tasks.taskname.parameterfield. I think that should be the preferred model, and Input/Output is always going to be STAC Items, or nothing.
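For illustration only (the task name and parameter fields are made up), a payload following that model might look like this, with the hrefs carried in the task's parameters and no input Items:

{
  "features": [],
  "process": {
    "tasks": {
      "create-items": {
        "hrefs": ["https://example.com/scene-1.tif", "https://example.com/scene-2.tif"],
        "collection": "example-collection"
      }
    }
  }
}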

Next is the ItemTask, which defines a single Item, but stac-tasks currently operate on ItemCollections. A STAC task can take in 1 or more STAC Items as input and return 1 or more STAC Items. Note that this is not 1:1; a task doesn't process each item independently to create an array of output items (although you could write a task to do that). A task might take in one Item and create two derived Items from it, or it might take in an Item of data and a few other Items of auxiliary data used in the processing to create a single output Item.

Each task would have requirements on the number of input Items.

So I'd propose
StacOutputTask - Nothing in, STAC out task
ItemCollectionTask - ItemCollection in, ItemCollection out

I suppose we could also have an ItemTask for single Item input and output (probably the most common scenario), but I'm not sure I see the advantage over using an ItemCollection with 1 Item.
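A rough sketch of what those two shapes could look like, just to make the proposal concrete; the Item/ItemCollection stand-ins, method names, and the min_input_items idea are illustrative, not the PR's API:

from abc import ABC, abstractmethod
from typing import List

from pydantic import BaseModel

class Item(BaseModel):
    """Stand-in for a STAC Item model."""
    id: str
    properties: dict = {}

class ItemCollection(BaseModel):
    """Stand-in for a STAC ItemCollection model."""
    features: List[Item] = []

class StacOutputTask(ABC):
    """Nothing in, STAC out: hrefs and parameters come from the process definition."""

    @abstractmethod
    def process(self) -> ItemCollection:
        ...

class ItemCollectionTask(ABC):
    """ItemCollection in, ItemCollection out (1 or more Items to 1 or more Items)."""

    # Each concrete task could declare how many input Items it requires.
    min_input_items: int = 1

    @abstractmethod
    def process(self, items: ItemCollection) -> ItemCollection:
        ...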

@gadomski
Member

it takes in an Item of data and a few other Items of auxiliary data used in the processing to create a single output Item.

I think this is an anti-pattern ... I'd prefer that all input assets be contained in a single object, and any "global" configuration passed in as a task parameter. E.g.

{
  "features": [
    {
       "data": "http://example.com/data.tif",
       "metadata": "http://example.com/metadata.xml"
    }
  ],
  "process": {
    "tasks": {
      "my-task": {
        "global-config": "http://example.com/dataset-config.json"
      }
    }
  }
}

Note that this is not 1:1, a task doesn't process each item independently to create an array of output items (although you could write a task to do that).

This is supported in the currently proposed model, e.g. you could define:

class OneToManyTask(Task[Item, Item]):
    def process(self, item: Item) -> List[Item]:
        """Explode a netCDF into a bunch of COGs, or whatever."""

One thing that isn't demonstrated in the current PR (yet) is the ability to put constraints on your input model with the Input generic. E.g. you can define what assets you expect:

from typing import List

from pydantic import BaseModel
from stac_task import Task
from stac_task.models import Asset, Item

class Assets(BaseModel):
    data: Asset
    metadata: Asset

class AssetsTask(Task[Assets, Item]):
    def process(self, input: Assets) -> List[Item]:
        """Creates an item from input assets."""
        item = do_the_thing(input)
        return [item]

output_dict = AssetsTask().process_dict({
    "data": { "href": "an/href.tif" },
    "metadata": { "href": "the/metadata.xml" },
})

@matthewhanson
Member Author

The idea of a STAC-based workflow is that you are working with existing STAC Items... you are not creating a new STAC Item as input to a process. This is why it's 1 or more STAC Items, and it's important to preserve that for data provenance.

For example, I want to take in a single Landsat scene and a DEM (or more to cover the area) for doing orthorectification. Those input STAC Items have self hrefs that point back to the source, so they can be added as derived_from links on the orthorectified output.

I'm not sure if that's what you meant above, or if you only meant cases where the input is strings and not STAC.

I'm not sure I like the ability to define any arbitrary input or output here. Maybe this is better, but the original goal was to have a strict convention that supports STAC workflows, not a generalized process-anything task. I'll have to think on that some.

@gadomski
Member

gadomski commented Sep 19, 2023

For example, I want to take in a single Landsat scene and a DEM (or more to cover the area) for doing orthorectification. Those input STAC Items have self hrefs that point back to the source, so they can be added as derived_from links on the orthorectified output.

As laid out in #42 (comment), I still believe many-items-in, many-items-out is an anti-pattern that makes it harder to build scope-limited, easily testable tasks. In your example, if you want to maintain derived_from links, you can include them in the "to-process" item:

{
  "id": "item-to-process",
  "links": [
    { "href": "http:://landsat.stac/item-0.json", "rel": "derived-from" },
    { "href": "http:://sentinel2.stac/item-0.json", "rel": "derived-from" }
  ],
  "assets": {
    "landsat_B01": { "href": "http://landsat.stac/B01.tif" },
    "sentinel2_B01": { "href": "http://sentinel2.stac/B01.jp2" }
  }
}

This way, you can define a schema of what the inputs should look like (e.g. w/ pydantic):

class Assets(BaseModel):
    landsat_B01: Asset
    sentinel2_B01: Asset

class Input(BaseModel):
    id: str
    links: List[Link]
    assets: Assets

There are a couple of benefits (that I see) to my proposal:

  • I see a lot of "STAC cruft" ... fields that end up in output STAC items that are just copied again and again through a pipeline, and might be incorrect. My proposed model forces folks to be more intentional about the fields they are carrying through a pipeline, since they have to build the output item itself (rather than just copying the input items).
  • Each task can strictly define its input, rather than having to search through a list of items to find which item is sentinel2, which item is landsat, etc.
  • A one-in-many-out model is more easily parallelizable in the case where (e.g.) you're doing heavy processing on big metal and want to fan out.

@gadomski
Member

gadomski commented Sep 20, 2023

Okay, after some rework, here are the core generic tasks in the library (Item is a pystac STAC Item):

class definition     key method
Task                 process(self, input: List[Any]) -> List[Any]
StacOutTask          process_to_items(self, input: List[Any]) -> List[Item]
StacInStacOutTask    process_items(self, input: List[Item]) -> List[Item]
OneToManyTask        process_one_to_many(self, input: Any) -> List[Any]
OneToOneTask         process_one_to_one(self, input: Any) -> Any
ToItemTask           process_to_item(self, input: Any) -> Item
ItemTask             process_item(self, item: Item) -> Item
HrefTask             process_href(self, href: str) -> Item

To make a task, you pick the one that best fits what you're trying to do and implement it. Cirrus would want a StacOutTask or StacInStacOutTask, but in simple cases (e.g. item modification/asset creation) you could get away with an ItemTask.
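As a hypothetical example of that simple case (the import path and the property being set are assumptions, not part of the library):

from pystac import Item

from stac_task import ItemTask  # assuming ItemTask is importable like Task

class AddSoftwareVersionTask(ItemTask):
    """Hypothetical task: stamp each Item with the software that touched it."""

    def process_item(self, item: Item) -> Item:
        # The property name and value are illustrative only.
        item.properties["processing:software"] = {"my-pipeline": "0.1.0"}
        return item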
