CartographySchema with Node shared by multiple intel #1207

jychp · 2023-02-02T10:57:49Z

jychp
Feb 2, 2023
Collaborator

Example:

GoogleSuite uses Human to link a GoogleAccount to a "Person" with the analysis job (h)-[r:IDENTITY_GSUITE]->(user)
GitHub uses Human to link a GitHubAccount to a "Person" with the analysis job (h)-[r:IDENTITY_GITHUB]->(user)
I wanted to do the same thing with Lastpass (Add Lastpass source #1083)

I'm stuck with the CartographySchema implementation:

Problem 1 - multiple definition

Even with the schema reorganization (#1096), the Human node must be redefined several times, which could lead to inconsistent fields
Adding a cross_intel/common/wellknown list of nodes to this schema would lead to a PropertyRef issue (fields can change between intel sources)
Suggestion: add a notion of "constraint/schema" verified by the CI, which checks the consistency of the schema (the nodes are defined several times, but consistency is guaranteed)
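
To make the suggestion concrete, here is a minimal sketch of what such a CI check could look like. It is only one possible reading of "consistency" (every definition of a label declares the same property names), and the `Definition` tuple, the `find_conflicts` helper, and the sample property sets are all hypothetical, not part of cartography.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple

# Hypothetical input shape: (intel module name, node label, declared property names).
Definition = Tuple[str, str, Set[str]]


def find_conflicts(definitions: Iterable[Definition]) -> Dict[str, List[str]]:
    """Return {label: [messages]} for labels whose definitions disagree."""
    by_label: Dict[str, List[Tuple[str, Set[str]]]] = defaultdict(list)
    for module, label, props in definitions:
        by_label[label].append((module, props))

    conflicts: Dict[str, List[str]] = {}
    for label, defs in by_label.items():
        # Compare every later definition against the first one seen.
        reference_module, reference_props = defs[0]
        messages = []
        for module, props in defs[1:]:
            diff = props ^ reference_props  # properties present on only one side
            if diff:
                messages.append(f"{module} vs {reference_module}: {sorted(diff)}")
        if messages:
            conflicts[label] = messages
    return conflicts


# Illustrative data only: the third definition uses 'mail' instead of 'email'.
definitions = [
    ("gsuite", "Human", {"id", "email", "lastupdated"}),
    ("github", "Human", {"id", "email", "lastupdated"}),
    ("lastpass", "Human", {"id", "mail", "lastupdated"}),
]
conflicts = find_conflicts(definitions)
```

A CI job could fail the build whenever `conflicts` is non-empty.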

Problem 2 - auto clean

In my usage, the script is not executed once but several times (triggered by Terraform CIs, crons, etc.)
With a CartographySchemaNode, autoclean will erase the Humans created by another intel module
e.g. GitHub will update a small subset of Humans, but GoogleSuite may have updated more Human nodes earlier, and those will be purged by autoclean during the "GitHub execution"
Suggestion: add a CLI parameter autoclean_max_age (0 by default) and purge nodes with lastupdated < NOW - autoclean_max_age instead of lastupdated != NOW
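
As a rough illustration of the proposed rule, the sketch below selects stale nodes in plain Python. The `autoclean_max_age` parameter does not exist in cartography; `stale_node_ids` is a hypothetical helper, and the sketch assumes `lastupdated` and the update tag are Unix timestamps in seconds.

```python
from typing import Dict, List


def stale_node_ids(
    nodes: List[Dict[str, int]],
    update_tag: int,
    autoclean_max_age: int = 0,
) -> List[int]:
    """Pick the node ids a cleanup run would purge under the proposed rule."""
    if autoclean_max_age == 0:
        # Current behaviour: anything not touched by this run is stale.
        return [n["id"] for n in nodes if n["lastupdated"] != update_tag]
    # Proposed behaviour: only purge nodes older than the allowed age.
    cutoff = update_tag - autoclean_max_age
    return [n["id"] for n in nodes if n["lastupdated"] < cutoff]


# Illustrative data: node 2 was refreshed by another intel module 1000 s earlier.
nodes = [
    {"id": 1, "lastupdated": 1000},
    {"id": 2, "lastupdated": 4000},
    {"id": 3, "lastupdated": 5000},
]
purged_default = stale_node_ids(nodes, update_tag=5000)
purged_with_age = stale_node_ids(nodes, update_tag=5000, autoclean_max_age=3600)
```

With the default, nodes 1 and 2 are purged; with a one-hour max age, only node 1 is, so the Humans recently written by another run survive.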

achantavy · 2023-02-07T07:49:09Z

achantavy
Feb 7, 2023
Maintainer

Hey, thank you for the thoroughly written issue.

Regarding problem 1

I don't think the Human node will need to be defined in multiple places. The GSuiteUser-to-Human link makes that connection if GSuiteUser.email == Human.email. Let me see if I can roughly sketch out what the model will look like.

# The relationship is defined first because GSuiteUser instantiates it in a default value.
@dataclass(frozen=True)
class GSuiteUserToHumanRel(CartographyRelSchema):
    target_node_label: str = 'Human'
    # Match Human.email against the 'HumanEmail' key of each data item.
    target_node_matcher: TargetNodeMatcher = make_target_node_matcher({'email': PropertyRef('HumanEmail')})
    direction: LinkDirection = LinkDirection.INWARD
    rel_label: str = "IDENTITY_GSUITE"
    properties: ... omitted ...

@dataclass(frozen=True)
class GSuiteUser(CartographyNodeSchema):
    label: str = 'GSuiteUser'
    properties: ... omitted ...
    sub_resource_relationship: ... omitted ...
    # OtherRelationships wraps a list of relationship schemas.
    other_relationships: Optional[OtherRelationships] = OtherRelationships(
        [GSuiteUserToHumanRel()],
    )

and then loading it would be straightforward with

from cartography.client.core.tx import load

load(
    neo4j_session,
    GSuiteUser(),
    data,
    lastupdated=update_tag,
    ... other params ...
)

as long as the data looks like

data = [
    {
        # ... all the fields that a GSuiteUser has.
        'HumanEmail': '[email protected]',
    },
    {
        # ... all the fields that a GSuiteUser has.
        'HumanEmail': '[email protected]',
    },
    {
        # ... all the fields that a GSuiteUser has.
        # This still works even if some items in the data don't
        # have a `HumanEmail` defined because Neo4j subqueries let us do an
        # optional match, and this lets us use a single generated query
        # to ingest data in the same list with varying relationships.
    }
]

For a concrete example, see this integration test: https://github.com/lyft/cartography/blob/81902b23fa80e4ba5332ba00b4477e3a556d5eb7/tests/integration/cartography/graph/test_querybuilder_rel_subsets.py#L11-L18
And the code here: https://github.com/lyft/cartography/blob/81902b23fa80e4ba5332ba00b4477e3a556d5eb7/cartography/graph/querybuilder.py#L348-L356,
here: https://github.com/lyft/cartography/blob/81902b23fa80e4ba5332ba00b4477e3a556d5eb7/cartography/graph/querybuilder.py#L231
and here: https://github.com/lyft/cartography/blob/81902b23fa80e4ba5332ba00b4477e3a556d5eb7/cartography/graph/querybuilder.py#L195

The Human to GitHubUser link isn't present in OSS cartography, but modeling it would look similar to the above.
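
To illustrate the optional-match behaviour described above without a Neo4j instance, here is a small pure-Python emulation. It is not how cartography implements it (that is done in the generated Cypher); `plan_rels`, the `id` key, and the example.com addresses are hypothetical stand-ins.

```python
from typing import Dict, List, Set, Tuple


def plan_rels(data: List[Dict[str, str]], human_emails: Set[str]) -> List[Tuple[str, str]]:
    """Return (human_email, user_id) pairs for the rels that would be attached."""
    rels = []
    for item in data:
        email = item.get("HumanEmail")  # may be absent: the match is optional
        if email is not None and email in human_emails:
            # Models (h:Human)-[:IDENTITY_GSUITE]->(user:GSuiteUser)
            rels.append((email, item["id"]))
    return rels


existing_humans = {"a@example.com"}
data = [
    {"id": "u1", "HumanEmail": "a@example.com"},       # Human exists: rel attached
    {"id": "u2", "HumanEmail": "nobody@example.com"},  # no such Human: node only
    {"id": "u3"},                                      # no HumanEmail: node only
]
loaded_user_ids = [item["id"] for item in data]  # every item still becomes a node
rels = plan_rels(data, existing_humans)
```

All three users are loaded, but only `u1` gets a relationship.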

Regarding this point,

Adding a cross_intel/common/wellknown list of nodes to this schema would lead to a PropertyRef issue (fields can change between intel sources)

If field values can change between intel sources, it might make sense to prefix the field name on a given node with ${source_name}_${field_name} so that we don't run into conflicts. The cartography data model also correctly handles the case where we try to ingest data that has only a subset of the fields defined on its schema. For example, if I use this SimpleNode and run

SIMPLE_NODE_MISSING_PROPS = [
    {
        'Id': 'SimpleNode1',
        'property1': 'The',
    },
    {
        'Id': 'SimpleNode2',
        'property2': 'Fox',
    },
]

load(neo4j_session, SimpleNodeSchema(), SIMPLE_NODE_MISSING_PROPS, lastupdated=1)

then the result I get with

match(n:SimpleNode) return n.id, n.property1, n.property2;

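is one row per node, with null for whichever property was absent from the input: SimpleNode1 with property1 'The' and a null property2, and SimpleNode2 with a null property1 and property2 'Fox'.

All the nulls get treated as properties that don't exist on the node, which is exactly what we want. I'll add this as a formal test to the code though.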
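
The ${source_name}_${field_name} prefixing idea mentioned earlier could be applied to the data before loading. A minimal sketch, assuming a hypothetical `prefix_fields` helper and a hypothetical set of shared keys (`id`, `email`) that every source agrees on and that therefore stay unprefixed:

```python
from typing import Any, Dict, Iterable


def prefix_fields(
    source_name: str,
    record: Dict[str, Any],
    shared_keys: Iterable[str] = ("id", "email"),
) -> Dict[str, Any]:
    """Prefix source-specific keys so two intel modules never write the same field."""
    shared = set(shared_keys)
    return {
        (key if key in shared else f"{source_name}_{key}"): value
        for key, value in record.items()
    }


# Illustrative record only.
lastpass_human = prefix_fields(
    "lastpass",
    {"email": "a@example.com", "name": "Ada", "mfa": True},
)
```

The matching key (`email`) is left alone so the Human can still be found, while `name` and `mfa` become `lastpass_name` and `lastpass_mfa`.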

Regarding problem 2

With a CartographySchemaNode, autoclean will erase the Humans created by another intel module
e.g. GitHub will update a small subset of Humans, but GoogleSuite may have updated more Human nodes earlier, and those will be purged by autoclean during the "GitHub execution"

Automatically deleting objects created by another intel module is absolutely something we intend to avoid. We should try to make it so that the autocleanup can smartly delete only the objects we want.

It's getting a bit late where I am so I'll give more of a hand-wavy explanation of my last few thoughts:

This is still very early stages and we will learn more as we go. The idea is we first match on paths that are as long (i.e. restrictive) as possible, delete stale nodes of our target type in the path, and then delete stale rels (only up to the current node schema).
At this early stage of rolling out the schema, we do not support automatic cleanup of schema objects that do not define a sub resource.
In the situation you described, an automatic cleanup of the GSuiteUser would generate queries that look something like these tests. I think I do see a potential problem with our current implementation of the cleanup job: specifically, I think these last 2 queries are actually the least restrictive in the list, and this will end up deleting more than we want in this case.
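
The "as restrictive as possible" idea can be sketched in plain Python over an in-memory list of nodes. This is only a model of the intent, not the generated cleanup queries; the `Node` class, `stale_ids_in_scope`, and the tenant ids are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Node:
    id: str
    label: str
    sub_resource_id: Optional[str]  # None models a node with no sub resource
    lastupdated: int


def stale_ids_in_scope(
    nodes: List[Node],
    target_label: str,
    sub_resource_id: str,
    update_tag: int,
) -> List[str]:
    """Select stale nodes of one label that hang off the sub resource being synced."""
    return [
        n.id
        for n in nodes
        if n.label == target_label
        and n.sub_resource_id == sub_resource_id
        and n.lastupdated != update_tag
    ]


graph = [
    Node("user-old", "GSuiteUser", "tenant-1", 100),    # stale, in scope
    Node("user-new", "GSuiteUser", "tenant-1", 200),    # fresh
    Node("user-other", "GSuiteUser", "tenant-2", 100),  # stale, other sub resource
    Node("human-1", "Human", None, 100),                # stale, other label
]
to_delete = stale_ids_in_scope(graph, "GSuiteUser", "tenant-1", update_tag=200)
```

Only `user-old` is selected; the stale Human is left alone because it is neither of the target label nor attached to the sub resource being cleaned.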

Thanks again for writing this up and informing the design of the data model. To summarize, I think the new data model addresses the concerns of problem 1, but I agree with problem 2 and see a potential problem there. Again, this is early stages; we are figuring this out as quickly as possible, and I think things will make more sense as we continue to put them together. Will fix and make things smooth for your Lastpass change.

achantavy · 2023-02-09T07:50:09Z

achantavy
Feb 9, 2023
Maintainer

I decided to do a longer write up to explain more of the background and the "why" behind the data model: https://docs.google.com/document/d/1HI_EUgXd55affTNznEj80aY3vNVWlnWSLjwyxJ8nDpQ/edit#. It's long but I figured it's complicated enough that it needs to be documented in some way.

jychp · 2023-02-09T19:09:25Z

jychp
Feb 9, 2023
Collaborator Author

Thank you for this amazing reply and documentation.

I will try to dig deeper into this new schema and will open new issues if needed.

achantavy · 2023-07-14T03:53:16Z

achantavy
Jul 14, 2023
Maintainer

Going to transfer this to a discussion since there may be additional schema topics to discuss.

achantavy · 2023-07-14T04:50:06Z

achantavy
Jul 14, 2023
Maintainer

Issue #1210 is relevant - I'll draft a fix PR shortly. Hopefully this will unblock the refactors for the rest of the project. Then we can create issues to perform the refactors and hopefully everything will be smooth.

