Types of Tasks #43

Open
matthewhanson opened this issue Sep 19, 2023 · 4 comments

Comments

@matthewhanson
Member

In @gadomski's PR #42 several types of tasks are defined.

class Task(BaseModel, ABC, Generic[Input, Output]):
    """A generic task."""

class PassthroughTask(Task[Anything, Anything]):
    """A simple task that doesn't modify the items at all."""

class StacOutputTask(Task[Input, Item], ABC):
    """Anything in, STAC out task."""

class ItemTask(StacOutputTask[Item], ABC):
    """STAC In, STAC Out task."""

class HrefTask(StacOutputTask[Href], ABC):
    """Href in, STAC Out task."""

I really like this way to define the input and output for different types of tasks, especially if it gives us JSON Schema!
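One way this could work (a sketch, not anything in the PR): assuming the Input and Output types are pydantic models, as in the later examples, pydantic can already emit JSON Schema for them. The Assets model below is hypothetical, and the exact method depends on the pydantic version (model_json_schema() on v2, schema() on v1).

import json

from pydantic import BaseModel

class Assets(BaseModel):
    """Hypothetical Input model for a task that expects two asset hrefs."""

    data: str
    metadata: str

# Print a JSON Schema document describing the expected task input (pydantic v2).
print(json.dumps(Assets.model_json_schema(), indent=2))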

Want to review these two Tasks:
StacOutputTask - Anything in, STAC out task.
HrefTask - Href in, STAC Out task

These tasks capture the need to create STAC Items from scratch. In the current payload structure you pass parameters to the task in the process definition; you don't hand them in as part of the Task Input (which would normally be a FeatureCollection). So the href (or multiple hrefs), along with other parameters, would be provided in process.tasks.taskname.parameterfield. I think that should be the preferred model, and Input/Output is always going to be STAC Items, or nothing.
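For illustration only (the task name and parameter fields are made up), a payload following that model might look like this, with the hrefs carried in the task's parameters and no input Items:

{
  "features": [],
  "process": {
    "tasks": {
      "create-items": {
        "hrefs": ["https://example.com/scene-1.tif", "https://example.com/scene-2.tif"],
        "collection": "example-collection"
      }
    }
  }
}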

Next is the ItemTask, which defines a single Item, but stac-tasks currently operate on ItemCollections. A STAC task can take in 1 or more STAC Items as input and return 1 or more STAC Items. Note that this is not 1:1; a task doesn't process each item independently to create an array of output items (although you could write a task to do that). A task might take in one Item and create two derived Items from it, or it might take in an Item of data and a few other Items of auxiliary data used in the processing to create a single output Item.

Each task would have requirements on the number of input Items.

So I'd propose
StacOutputTask - Nothing in, STAC out task
ItemCollectionTask - ItemCollection in, ItemCollection out

I suppose we could also have an ItemTask for single Item input and output (probably the most common scenario), but I'm not sure I see the advantage over using an ItemCollection with 1 Item.
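A rough sketch of what those two shapes could look like, just to make the proposal concrete; the Item/ItemCollection stand-ins, method names, and the min_input_items idea are illustrative, not the PR's API:

from abc import ABC, abstractmethod
from typing import List

from pydantic import BaseModel

class Item(BaseModel):
    """Stand-in for a STAC Item model."""
    id: str
    properties: dict = {}

class ItemCollection(BaseModel):
    """Stand-in for a STAC ItemCollection model."""
    features: List[Item] = []

class StacOutputTask(ABC):
    """Nothing in, STAC out: hrefs and parameters come from the process definition."""

    @abstractmethod
    def process(self) -> ItemCollection:
        ...

class ItemCollectionTask(ABC):
    """ItemCollection in, ItemCollection out (1 or more Items to 1 or more Items)."""

    # Each concrete task could declare how many input Items it requires.
    min_input_items: int = 1

    @abstractmethod
    def process(self, items: ItemCollection) -> ItemCollection:
        ...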

@gadomski
Member

it takes in an Item of data and a few other Items of auxiliary data used in the processing to create a single output Item.

I think this is an anti-pattern ... I'd prefer that all input assets be contained in a single object, and any "global" configuration passed in as a task parameter. E.g.

{
  "features": [
    {
       "data": "http://example.com/data.tif",
       "metadata": "http://example.com/metadata.xml"
    }
  ],
  "process": {
    "tasks": {
      "my-task": {
        "global-config": "http://example.com/dataset-config.json"
      }
    }
  }
}

Note that this is not 1:1, a task doesn't process each item independently to create an array of output items (although you could write a task to do that).

This is supported in the currently proposed model, e.g. you could define:

class OneToManyTask(Task[Item, Item]):
    def process(self, item: Item) -> List[Item]:
        """Explode a netCDF into a bunch of COGs, or whatever."""

One thing that isn't demonstrated in the current PR (yet) is the ability to put constraints on your input model with the Input generic. E.g. you can define what assets you expect:

from typing import List

from pydantic import BaseModel
from stac_task import Task
from stac_task.models import Asset, Item

class Assets(BaseModel):
    data: Asset
    metadata: Asset

class AssetsTask(Task[Assets, Item]):
    def process(self, input: Assets) -> List[Item]:
        """Creates an item from input assets."""
        item = do_the_thing(input)
        return [item]

output_dict = AssetsTask().process_dict({
    "data": { "href": "an/href.tif" },
    "metadata": { "href": "the/metadata.xml" },
})

@matthewhanson
Member Author

The idea of a STAC-based workflow is that you are working with existing STAC Items... you are not creating a new STAC Item as input to a process. This is why it's 1 or more STAC Items, and it's important to preserve that for data provenance.

For example, I want to take in a single Landsat scene and a DEM (or more to cover the area) for doing orthorectification. Those input STAC Items have self hrefs that point back to the source, so they can be added as derived_from links on the orthorectified output.

I'm not sure if that's what you meant above, or if you only meant cases where the input is strings and not STAC.

I'm not sure I like the ability to define any arbitrary input or output here. Maybe this is better, but the original goal was to have a strict convention that supports STAC workflows, not a generalized process-anything task. I'll have to think on that some.

@gadomski
Member

gadomski commented Sep 19, 2023

For example, I want to take in a single Landsat scene and a DEM (or more to cover the area) for doing orthorectification. Those input STAC Items have self hrefs that point back to the source, so they can be added as derived_from links on the orthorectified output.

As laid out in #42 (comment), I still believe many-items-in, many-items-out is an anti-pattern that makes it harder to build scope-limited, easily testable tasks. In your example, if you want to maintain derived_from links, you can include them in the "to-process" item:

{
  "id": "item-to-process",
  "links": [
    { "href": "http:://landsat.stac/item-0.json", "rel": "derived-from" },
    { "href": "http:://sentinel2.stac/item-0.json", "rel": "derived-from" }
  ],
  "assets": {
    "landsat_B01": { "href": "http://landsat.stac/B01.tif" },
    "sentinel2_B01": { "href": "http://sentinel2.stac/B01.jp2" }
  }
}

This way, you can define a schema of what the inputs should look like (e.g. w/ pydantic):

class Assets(BaseModel):
    landsat_B01: Asset
    sentinel2_B01: Asset

class Input(BaseModel):
    id: str
    links: List[Link]
    assets: Assets

There are a couple of benefits (that I see) to my proposal:

  • I see a lot of "STAC cruft" ... fields that end up in output STAC items that are just copied again and again through a pipeline, and might be incorrect. My proposed model forces folks to be more intentional about the fields they are carrying through a pipeline, since they have to build the output item itself (rather than just copying the input items).
  • Each task can strictly define its input, rather than having to search through a list of items to find which item is sentinel2, which item is landsat, etc.
  • A one-in-many-out model is more easily parallelizable in the case where (e.g.) you're doing heavy processing on big metal and want to fan out.

@gadomski
Member

gadomski commented Sep 20, 2023

Okay, after some rework, here are the core generic tasks in the library (Item is a pystac STAC Item):

class definition     key method
Task                 process(self, input: List[Any]) -> List[Any]
StacOutTask          process_to_items(self, input: List[Any]) -> List[Item]
StacInStacOutTask    process_items(self, input: List[Item]) -> List[Item]
OneToManyTask        process_one_to_many(self, input: Any) -> List[Any]
OneToOneTask         process_one_to_one(self, input: Any) -> Any
ToItemTask           process_to_item(self, input: Any) -> Item
ItemTask             process_item(self, item: Item) -> Item
HrefTask             process_href(self, href: str) -> Item

To make a task, you pick the one that best fits what you're trying to do and implement it. Cirrus would want a StacOutTask or StacInStacOutTask, but in simple cases (e.g. item modification/asset creation) you could get away with an ItemTask.
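As a hypothetical example of that simple case (the import path and the property being set are assumptions, not part of the library):

from pystac import Item

from stac_task import ItemTask  # assuming ItemTask is importable like Task

class AddSoftwareVersionTask(ItemTask):
    """Hypothetical task: stamp each Item with the software that touched it."""

    def process_item(self, item: Item) -> Item:
        # The property name and value are illustrative only.
        item.properties["processing:software"] = {"my-pipeline": "0.1.0"}
        return item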
