✨ NEW: Add `orm.Entity.fields` interface for `QueryBuilder` #5088

chrisjsewell · 2021-08-19T04:50:06Z

The QueryBuilder currently has a conceptual problem, in that it essentially exposes the backend database to the user. A classic example of this is that, on the front end, we use Node.pk, but when using the QueryBuilder, one has to use the "backend" name id (always confusing for new users) (#2577, #2196).
Additionally, there is currently no "programmatic" way to infer what can be retrieved from the database for each ORM entity, which is something very desirable for the REST API (aiidateam/aiida-restapi#1).

In this PR, I propose a new API (mainly in aiida/orm/fields.py), for:

~~"decorating" ORM properties as database fields,~~ Adding a __qb_fields__ class attribute containing QbField instances (and the QbAttrField subclass)
"gathering" these instances into a new Entity.fields class attribute, then
allowing these fields to be used when constructing QueryBuilder instances.

The API is backward compatible and looks like this (also see tests/orm/test_fields.py):

Every ORM entity inherits the fields of its parents; to add extra fields, define a __qb_fields__ class attribute:

from aiida.orm import Data, QbAttrField

class NewData(Data):

    __qb_fields__ = (
        QbAttrField("new_key", dtype=str, doc="A new key"),
    )

This class will now have a fields class attribute, which gives you the mapping of ORM properties to database fields:

In [1]: NewData.fields
Out[1]: 
{'ctime': 'QbField(ctime) -> datetime',
 'description': 'QbField(description) -> str',
 'extras': 'QbField(extras.*) -> Dict[str, Any]',
 'label': 'QbField(label) -> str',
 'mtime': 'QbField(mtime) -> datetime',
 'new_key': 'QbAttrField(attributes.new_key) -> str',
 'node_type': 'QbField(node_type) -> str',
 'pk': 'QbField(id) -> int',
 'repository_metadata': 'QbField(repository_metadata) -> Dict[str, Any]',
 'source': 'QbAttrField(attributes.source.*) -> Union[dict, NoneType]',
 'user_pk': 'QbField(user_id) -> int',
 'uuid': 'QbField(uuid) -> str'}
In [2]: NewData.fields.new_key
Out[2]: QbAttrField('new_key', dtype=str)

(you can see the fields for all aiida nodes in tests/orm/test_fields)
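To make the "gathering" step concrete, here is a minimal standalone sketch of how inherited fields could be merged into a fields mapping. It is not the code in aiida/orm/fields.py: the Field class, the gather_fields helper and the use of __init_subclass__ are assumptions made purely for illustration.

```python
from typing import Any, Dict, Optional, Tuple


class Field:
    """Hypothetical stand-in for QbField: a frontend key mapped to a backend name."""

    def __init__(self, key: str, qb_field: Optional[str] = None, dtype: Any = None):
        self.key = key
        self.qb_field = qb_field or key  # backend name defaults to the frontend key
        self.dtype = dtype


def gather_fields(cls: type) -> Dict[str, Field]:
    """Merge __qb_fields__ from every class in the MRO; subclasses override parents."""
    gathered: Dict[str, Field] = {}
    for klass in reversed(cls.__mro__):
        for field in vars(klass).get("__qb_fields__", ()):
            gathered[field.key] = field
    return gathered


class Entity:
    __qb_fields__: Tuple[Field, ...] = (Field("pk", qb_field="id", dtype=int),)

    def __init_subclass__(cls, **kwargs: Any) -> None:
        super().__init_subclass__(**kwargs)
        cls.fields = gather_fields(cls)


Entity.fields = gather_fields(Entity)  # the base class itself is not a subclass


class Data(Entity):
    __qb_fields__ = (Field("label", dtype=str),)


class NewData(Data):
    __qb_fields__ = (Field("new_key", qb_field="attributes.new_key", dtype=str),)


print(sorted(NewData.fields))         # ['label', 'new_key', 'pk']
print(NewData.fields["pk"].qb_field)  # id
```

Walking the MRO from the most generic class down means a subclass can redefine a parent's field simply by reusing its key.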

Within the QueryBuilder you can then directly use these fields for the project, filters and order_by values e.g.

In [4]: qb = QueryBuilder().append(Node, project=[Node.fields.pk, Node.fields.mtime], filters={Node.fields.pk: 1})
In [5]: qb.order_by({'node_1': Node.fields.mtime})

Out[5]: QueryBuilder(path=[{'entity_type': '', 'orm_base': 'node', 'tag': 'node_1', 'joining_keyword': None, 'joining_value': None, 'edge_tag': None, 'outerjoin': False}], filters={'node_1': {'node_type': {'like': '%'}, 'id': 1}}, project={'node_1': [{'id': {}}, {'mtime': {}}]}, project_map={'node_1': {'id': 'pk'}}, order_by=[{'node_1': [{'mtime': {'order': 'asc'}}]}], limit=None, offset=None, distinct=False)

# to get all fields
In [6]: QueryBuilder().append(Node, project=Node.fields)
Out[6]: QueryBuilder(path=[{'entity_type': '', 'orm_base': 'node', 'tag': 'node_1', 'joining_keyword': None, 'joining_value': None, 'edge_tag': None, 'outerjoin': False}], filters={'node_1': {'node_type': {'like': '%'}}}, project={'node_1': [{'ctime': {}, 'description': {}, 'extras': {}, 'label': {}, 'mtime': {}, 'node_type': {}, 'id': {}, 'repository_metadata': {}, 'user_id': {}, 'uuid': {}}]}, project_map={'node_1': {'id': 'pk', 'user_id': 'user_pk'}}, order_by=[], limit=None, offset=None, distinct=False)

# you can also use field comparators to construct filters
In [7]: QueryBuilder().append(Node, project=Node.fields.pk, filters=(Node.fields.pk <= 2) & (Node.fields.pk.in_([2, 3])))
Out[7]: QueryBuilder(path=[{'entity_type': '', 'orm_base': 'node', 'tag': 'node_1', 'joining_keyword': None, 'joining_value': None, 'edge_tag': None, 'outerjoin': False}], filters={'node_1': {'node_type': {'like': '%'}, 'id': {'and': [{'<=': 2}, {'in': {2, 3}}]}}}, project={'node_1': [{'id': {}}]}, project_map={'node_1': {'id': 'pk'}}, order_by=[], limit=None, offset=None, distinct=False)

# fields can be designated as subscriptable to index into them
In [8]: QueryBuilder().append(Dict, project=Dict.fields.dict["subkey"])
Out[8]: QueryBuilder(path=[{'entity_type': 'data.core.dict.Dict.', 'orm_base': 'node', 'tag': 'Dict_1', 'joining_keyword': None, 'joining_value': None, 'edge_tag': None, 'outerjoin': False}], filters={'Dict_1': {'node_type': {'like': 'data.core.dict.%'}}}, project={'Dict_1': [{'attributes.subkey': {}}]}, project_map={'Dict_1': {'attributes.subkey': 'dict.subkey'}}, order_by=[], limit=None, offset=None, distinct=False)

then run the query as normal.
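For readers wondering how the comparators and subscripting can produce plain filter dictionaries, the following is a rough sketch under my own assumptions. The Field and FilterExpr classes are hypothetical stand-ins, not the PR's implementation, and the sketch only combines conditions on a single field.

```python
from typing import Any, Dict, Iterable


class FilterExpr:
    """Hypothetical filter expression: one backend field plus its conditions."""

    def __init__(self, field: str, conditions: Dict[str, Any]):
        self.field = field
        self.conditions = conditions

    def __and__(self, other: "FilterExpr") -> "FilterExpr":
        if other.field != self.field:
            raise ValueError("this sketch only combines filters on the same field")
        return FilterExpr(self.field, {"and": [self.conditions, other.conditions]})

    def as_dict(self) -> Dict[str, Any]:
        return {self.field: self.conditions}


class Field:
    """Hypothetical field that overloads operators to build filter expressions."""

    def __init__(self, qb_field: str, subscriptable: bool = False):
        self.qb_field = qb_field
        self.subscriptable = subscriptable

    def __le__(self, value: Any) -> FilterExpr:
        return FilterExpr(self.qb_field, {"<=": value})

    def in_(self, values: Iterable[Any]) -> FilterExpr:
        return FilterExpr(self.qb_field, {"in": set(values)})

    def __getitem__(self, key: str) -> "Field":
        if not self.subscriptable:
            raise TypeError(f"{self.qb_field} is not subscriptable")
        return Field(f"{self.qb_field}.{key}")


pk = Field("id")
expr = (pk <= 2) & pk.in_([2, 3])
print(expr.as_dict())  # {'id': {'and': [{'<=': 2}, {'in': {2, 3}}]}}

sub = Field("attributes", subscriptable=True)["subkey"]
print(sub.qb_field)    # attributes.subkey
```

The resulting dictionary has the same shape as the filters entry shown in Out[7], which is why such fields can coexist with string-based filters.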

If you run qb.dict()/qb.iterdict(), it will correctly give you the front end projection keys, e.g.

In [1]: QueryBuilder().append(Dict, project=[Dict.fields.pk, Dict.fields.dict["a"]]).dict()
Out[1]: 
[{'Dict_1': {'pk': 15, 'dict.a': 1}},
 {'Dict_1': {'pk': 7, 'dict.a': 1}},
 {'Dict_1': {'pk': 3, 'dict.a': None}}]

~~Probably the main limitation at present, is that the output by qb.dict()/qb.iterdict() will still return the backend keys, e.g. id not pk~~ (a4ee061)

Some other things that come to mind:

~~For e.g. Dict.fields.dict we would ideally like to be able to do e.g. Dict.fields.dict.key1~~ (use subscriptable=True when defining field)
~~Some fields are missing for joined ids e.g. Node.user_id~~ (added foreign_key field kwarg)
~~Composite fields, e.g. in StructureData you have pbc -> [attributes.pbc1, attributes.pbc2, attributes.pbc3] in the database~~ (can now use __qb_fields__)
~~Field post-conversions (after having been returned from the database ORM), e.g. list to numpy array~~ (possible, but not adding to this PR)
~~You have to use some type: ignore[misc] on the decorators currently, due to Decorated property not supported python/mypy#1362~~
~~Obviously, also need to add tests~~ (added tests/orm/test_fields.py)

Definitely love to have some feedback on this cc @sphuber, @giovannipizzi, @ltalirz, @mbercx, @ramirezfranciscof

codecov · 2021-08-19T05:08:11Z

Codecov Report

Merging #5088 (c830af3) into develop (ea0f447) will decrease coverage by 0.64%.
The diff coverage is 87.65%.

❗ Current head c830af3 differs from pull request most recent head 9631a91. Consider uploading reports for the commit 9631a91 to get more accurate results

@@             Coverage Diff             @@
##           develop    #5088      +/-   ##
===========================================
- Coverage    82.12%   81.48%   -0.63%     
===========================================
  Files          533      531       -2     
  Lines        38510    37348    -1162     
===========================================
- Hits         31624    30431    -1193     
- Misses        6886     6917      +31

Flag	Coverage Δ
django	`76.98% <87.65%> (-0.22%)`	⬇️
sqlalchemy	`75.98% <87.65%> (-0.53%)`	⬇️

Flags with carried forward coverage won't be shown.

Impacted Files	Coverage Δ
aiida/orm/utils/node.py	`91.18% <60.00%> (ø)`
aiida/orm/querybuilder.py	`84.63% <80.00%> (-0.37%)`	⬇️
aiida/orm/fields.py	`83.22% <83.22%> (ø)`
aiida/orm/__init__.py	`100.00% <100.00%> (ø)`
aiida/orm/authinfos.py	`83.83% <100.00%> (+0.50%)`	⬆️
aiida/orm/comments.py	`91.38% <100.00%> (+0.31%)`	⬆️
aiida/orm/computers.py	`81.57% <100.00%> (-0.60%)`	⬇️
aiida/orm/entities.py	`96.72% <100.00%> (+0.98%)`	⬆️
aiida/orm/groups.py	`94.17% <100.00%> (+0.05%)`	⬆️
aiida/orm/implementation/querybuilder.py	`94.24% <100.00%> (+4.04%)`	⬆️
... and 247 more

Legend:
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update da5f9b0...9631a91.

chrisjsewell · 2021-08-19T12:00:20Z

another thought would be in some way to tie the fields to the backend, e.g. Node.fields.name would return a dynamic field name to use for querying based on the backend. This is certainly possible, and could be added at a later date without changing the user API

ltalirz · 2021-08-19T12:20:30Z

thanks @chrisjsewell!
I definitely agree that a route for accessing database properties of any given node programmatically through the frontend API is badly needed.

One question that comes to mind is how much tangible benefit the separation between frontend and backend will really still bring once we've dropped one of the two backends. But I guess that's still a bit further down the line and more controversial, so by adding this mapping explicitly and making it available for programmatic use I think we are making a step in the right direction.

Thinking about how to reuse this in e.g. the REST API, and your open question:

For e.g. Dict.fields.dict we would ideally like to be able to do e.g. Dict.fields.dict.key1

If I see correctly, you are using python's type hints to write the "schema" here, which I guess might be limiting when you want to represent more elaborate constraints than data types, such as specific field names in a dictionary that are optional; allowed value ranges (which could also be very interesting to reuse in the verdi CLI) as well as documentation of the fields.

In the aiida-restapi package we use pydantic for this - do you think something like this would make sense here as well?
Or maybe type hints can also do this - you're the expert - in that case: even better.

chrisjsewell · 2021-08-19T12:20:50Z

Another, another thought would be, perhaps you could also extend this to specifying repository attributes, e.g.

@has_fields
class NewData(orm.Data):
   @property
   @field_method("path/to/file", in_repo=True)
   def y(self) -> str:
      return self.get_object_content("path/to/file")

chrisjsewell · 2021-08-19T12:28:22Z

thanks @ltalirz

you are using python's type hints to write the "schema" here, which I guess might be limiting when you want to represent more elaborate constraints than data types, such as ... allowed value ranges

Well I guess here you could just add additional key-words to field_method, e.g.

   @field_method("attributes.x", validate=dict(min=0, max=100))
   def x(self) -> Optional[int]:
      return self.get_attribute("x", None)
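As a sketch of what such a validate keyword could do, assuming it simply carries inclusive bounds that are checked before a value is set: the ValidatedField class and its check method below are hypothetical and not part of this PR.

```python
from typing import Any, Dict, Optional


class ValidatedField:
    """Hypothetical field carrying optional range constraints."""

    def __init__(self, qb_field: str, dtype: type, validate: Optional[Dict[str, Any]] = None):
        self.qb_field = qb_field
        self.dtype = dtype
        self.validate = validate or {}

    def check(self, value: Any) -> None:
        """Raise if value violates the declared type or range; None means 'unset'."""
        if value is None:
            return
        if not isinstance(value, self.dtype):
            raise TypeError(f"{self.qb_field}: expected {self.dtype.__name__}")
        low, high = self.validate.get("min"), self.validate.get("max")
        if low is not None and value < low:
            raise ValueError(f"{self.qb_field}: {value} is below the minimum {low}")
        if high is not None and value > high:
            raise ValueError(f"{self.qb_field}: {value} is above the maximum {high}")


x_field = ValidatedField("attributes.x", dtype=int, validate=dict(min=0, max=100))
x_field.check(42)    # passes silently
x_field.check(None)  # also passes: the value is optional
```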

In the aiida-restapi package we use pydantic for this - do you think something like this would make sense here as well?

I feel the drawback of pydantic is that maybe it does not work well with sub-classing; if we do not use it directly (e.g. for EntityFields) we can certainly take inspiration from it
Note my API takes some inspiration from https://github.com/python-attrs/attrs and https://docs.python.org/3/library/dataclasses.html

ltalirz · 2021-08-19T12:54:20Z

thanks, I also recently came across attrs in a different context - looks pretty cool.

I guess what I'm saying is that when making this change it would be great if you could include with it one small code example of how e.g. the REST API (or the verdi cli) could take advantage of this interface, removing the duplicated schema information that currently resides there (explicitly in the case of the new aiida-restapi; implicitly in old REST API & the verdi cli).

Just to make sure that we're achieving this goal - or if we are not, to be aware of what's still missing and how we can get there.
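One way a consumer such as a REST API might reuse the field declarations is to derive a JSON-Schema-like description from them. The sketch below is only an illustration of that idea: FieldInfo, JSON_TYPES and fields_to_schema are hypothetical names, and neither aiida-core nor aiida-restapi is assumed to work this way.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Any, Dict

# Assumed mapping from Python types to JSON Schema type names.
JSON_TYPES = {int: "integer", float: "number", str: "string", bool: "boolean", dict: "object"}


@dataclass(frozen=True)
class FieldInfo:
    """Hypothetical, simplified view of one entry in Entity.fields."""

    key: str
    dtype: type
    doc: str = ""


def fields_to_schema(title: str, fields: Dict[str, FieldInfo]) -> Dict[str, Any]:
    """Build a JSON-Schema-like dict; unknown dtypes get no type constraint."""
    properties: Dict[str, Any] = {}
    for key, field in fields.items():
        if field.dtype is datetime:
            entry: Dict[str, Any] = {"type": "string", "format": "date-time"}
        else:
            json_type = JSON_TYPES.get(field.dtype)
            entry = {"type": json_type} if json_type else {}
        if field.doc:
            entry["description"] = field.doc
        properties[key] = entry
    return {"title": title, "type": "object", "properties": properties}


node_fields = {
    "pk": FieldInfo("pk", int, "The primary key"),
    "mtime": FieldInfo("mtime", datetime),
    "label": FieldInfo("label", str),
}
schema = fields_to_schema("Node", node_fields)
print(schema["properties"]["pk"])  # {'type': 'integer', 'description': 'The primary key'}
```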

chrisjsewell · 2021-08-19T12:58:31Z

I also recently came across attrs in a different context - looks pretty cool.

Yeh I use it all over the place, it's great, e.g. https://github.com/executablebooks/MyST-Parser/blob/master/myst_parser/main.py#L31

Just to make sure that we're achieving this goal - or if we are not, to be aware of what's still missing and how we can get there.

Yep fair. I don't really want to start rewriting the old REST API though lol, do you have any CLI commands in mind where this could be useful?

chrisjsewell · 2021-08-20T00:27:21Z

(in 74f39f2 I have removed the need for using has_fields, and introduced __fields__, will update initial comment)

chrisjsewell · 2021-08-20T00:47:27Z

aiida/orm/nodes/process/process.py

@@ -134,6 +144,23 @@ def process_class(self) -> Type['Process']:

        return process_class

+    @property  # type: ignore
+    @field_method()
+    def process_type(self) -> Optional[str]:


I think process_type should be removed from the base Node class, since it is a process-specific field; that it is also contained in the database table for all nodes is an implementation detail.

(I added it here so that it only shows up in fields of ProcessNode subclasses, not all nodes)

(same for computer, that is only relevant to code and process nodes)

No, computer should stay for all nodes I think, some other nodes use it (specifically: RemoteData). Default = None.

computer should definitely stay on the Node level.

reverted this change as no longer necessary

chrisjsewell · 2021-08-20T22:43:21Z

Ok added just about all the fields for all the node types.
It does make it so much easier, e.g. with verdi shell tab-completion, to find what fields you can use in queries:

giovannipizzi · 2021-08-25T07:38:41Z

Thanks @chrisjsewell, in general I think this was a very important missing feature.

Just a few quick questions:

to double check: the QueryBuilder interface is backward compatible, meaning you can still use strings to project etc., or now one has to use the new syntax only?
Once this is merged, would the plan be to deprecate the old syntax at some point, or not? Or at least to prioritise one over the other in 1) the documentation, and 2) the usage in the code itself (e.g. all queries in the cmdline)
Is the limitation you mention ("qb.dict()/qb.iterdict() will still return the backend keys, e.g. id not pk") hard to overcome? I imagine that, in order to go to a fully machine-usable querybuilder, knowing in a reliable way which keys will be returned when one uses .dict() is important?

chrisjsewell · 2021-08-25T07:53:08Z

to double check: the QueryBuilder interface is backward compatible, meaning you can still use strings to project etc., or now one has to use the new syntax only?

Yes, fully backward compatible

Once this is merged, would the plan be to deprecate the old syntax at some point, or not? Or at least to prioritise one over the other in 1) the documentation, and 2) the usage in the code itself (e.g. all queries in the cmdline)

certainly to prioritise this syntax, then, once people have had a good chance to play around with it, I would indeed consider deprecating

Is the limitation you mention ("qb.dict()/qb.iterdict() will still return the backend keys, e.g. id not pk") hard to overcome? I imagine that, in order to go to a fully machine-usable querybuilder, knowing in a reliable way which keys will be returned when one uses .dict() is important?

I think with #5093 there is definitely a way forward with this

chrisjsewell · 2021-08-25T19:46:23Z

Probably the main limitation at present, is that the output by qb.dict()/qb.iterdict() will still return the backend keys, e.g. id not pk

Alright now, after #5093, I know what the heck is going on with the QueryBuilder lol, it was pretty easy to add this (a4ee061).

So now you can do e.g.:

In [1]: QueryBuilder().append(Dict, project=[Dict.fields.pk, Dict.fields.dict["a"]]).dict()
Out[1]: 
[{'Dict_1': {'pk': 15, 'dict.a': 1}},
 {'Dict_1': {'pk': 7, 'dict.a': 1}},
 {'Dict_1': {'pk': 5, 'dict.a': 1}},
 {'Dict_1': {'pk': 4, 'dict.a': None}},
 {'Dict_1': {'pk': 3, 'dict.a': None}}]

🎉

(again, this is fully back compatible, i.e. if you projected just the "id" string, that is what you will get back)
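The renaming can be pictured as a post-processing step over the rows returned by the backend, driven by a tag -> backend key -> frontend key mapping like the project_map shown in the QueryBuilder reprs above. The function below is a simplified sketch of that idea, not the code in a4ee061; rename_projections is a hypothetical name.

```python
from typing import Any, Dict, List

ProjectMap = Dict[str, Dict[str, str]]  # tag -> backend key -> frontend key
Row = Dict[str, Dict[str, Any]]         # tag -> projected key -> value


def rename_projections(rows: List[Row], project_map: ProjectMap) -> List[Row]:
    """Return copies of the rows with backend keys replaced by frontend keys."""
    renamed = []
    for row in rows:
        new_row = {}
        for tag, values in row.items():
            mapping = project_map.get(tag, {})
            # Keys without an entry in the mapping are returned unchanged.
            new_row[tag] = {mapping.get(key, key): value for key, value in values.items()}
        renamed.append(new_row)
    return renamed


backend_rows = [{"Dict_1": {"id": 15, "attributes.a": 1}}]
project_map = {"Dict_1": {"id": "pk", "attributes.a": "dict.a"}}
print(rename_projections(backend_rows, project_map))
# [{'Dict_1': {'pk': 15, 'dict.a': 1}}]
```

Because unmapped keys pass through untouched, a projection given as the plain string "id" has no map entry and comes back as "id", which is the backward-compatible behaviour described above.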

chrisjsewell · 2021-08-25T19:48:56Z

ok, presuming the tests pass, this is now good to go IMO

chrisjsewell · 2021-08-25T19:50:25Z

Obviously it should also be documented but, so as not to make this PR any more complex, I will do that in a separate PR once this API is agreed on and merged

giovannipizzi · 2021-08-26T08:01:15Z

Thanks @chrisjsewell !
One question, what is Dict.fields.dict? Is this just an alias for Dict.fields.attributes? Do other data types have similar aliases?

While I'm OK to keep the API flexible, I'm wondering if it's more confusing or more clear to make this mapping automatically - at some point, users need to know that properties are stored either as files or as attributes (or extras), and use those (I don't see this as a schema detail, that indeed should be visible, but as an "ontological" aspect of how we represent node content in AiiDA). And in practice only attributes (and extras) are efficiently queryable, not content in the files.
Or maybe you just meant Dict.fields.attributes?

Also (this is for my clarity), Dict_1 is the automatic tag since none was specified, and if you specify a tag explicitly, this is what will appear in the .dict() output, right?

chrisjsewell · 2021-08-26T11:00:48Z

Also (this is for my clarity), Dict_1 is the automatic tag since none was specified, and if you specify a tag explicitly, this is what will appear in the .dict() output, right?

yes

One question, what is Dict.fields.dict? Is this just an alias for Dict.fields.attributes? Do other data types have similar aliases?

yes; it is an alias and also set to subscriptable=True, i.e. it allows Dict.fields.dict["key"].
Note, in the string representation, the .* indicates it is subscriptable:

In [1]: str(Dict.fields.dict)
Out[1]: 'OrmField(attributes.*) -> dict'
In [2]: str(Dict.fields.extras)
Out[2]: 'OrmField(extras.*) -> dict'

While I'm OK to keep the API flexible, I'm wondering if it's more confusing or more clear to make this mapping automatically - at some point, users need to know that properties are stored either as files or as attributes (or extras), and use those (I don't see this as a schema detail, that indeed should be visible, but as an "ontological" aspect of how we represent node content in AiiDA). And in practice only attributes (and extras) are efficiently queryable, not content in the files.
Or maybe you just meant Dict.fields.attributes?

So this is actually a broader point that I was planning to make in response to #4976 (and I plan to open a PR for programmatically/dynamically setting __dir__ on Node subclasses):

Firstly, I would say there are broadly three kinds of users of the code (that can overlap):

a developer of aiida-core
a developer of a plugin
a user of a plugin

Here I am talking about (3).

IMO they should never need to know about the concept of attributes on a Node subclass; ontologically, they should know about three things:

keys that relate to "queryable" fields, that are read-only once the node is stored
keys that relate to "queryable" fields, that are read/write always (i.e. extras)
keys that relate to "non-queryable" fields

Take Int for example:

When you tab complete on it, you get 92 options!

In [1]: print(sorted([a for a in Int().__dir__() if not a.startswith('_')]))
['Collection', 'add_comment', 'add_incoming', 'attributes', 'attributes_items', 'attributes_keys', 'backend', 'backend_entity', 'check_mutability', 'class_node_type', 'clear_attributes', 'clear_extras', 'clear_hash', 'clone', 'computer', 'convert', 'creator', 'ctime', 'delete_attribute', 'delete_attribute_many', 'delete_extra', 'delete_extra_many', 'delete_object', 'description', 'erase', 'export', 'extras', 'extras_items', 'extras_keys', 'from_backend_entity', 'get', 'get_all_same_nodes', 'get_attribute', 'get_attribute_many', 'get_cache_source', 'get_comment', 'get_comments', 'get_description', 'get_export_formats', 'get_extra', 'get_extra_many', 'get_hash', 'get_incoming', 'get_object_content', 'get_outgoing', 'get_stored_link_triples', 'has_cached_links', 'id', 'importfile', 'importstring', 'init_from_backend', 'initialize', 'is_created_from_cache', 'is_stored', 'is_valid_cache', 'label', 'list_object_names', 'list_objects', 'logger', 'mtime', 'new', 'node_type', 'objects', 'open', 'pk', 'process_type', 'put_object_from_file', 'put_object_from_filelike', 'put_object_from_tree', 'rehash', 'remove_comment', 'repository_metadata', 'repository_serialize', 'reset_attributes', 'reset_extras', 'set_attribute', 'set_attribute_many', 'set_extra', 'set_extra_many', 'set_source', 'source', 'store', 'store_all', 'update_comment', 'user', 'uuid', 'validate_incoming', 'validate_outgoing', 'validate_storability', 'value', 'verify_are_parents_stored', 'walk']

Firstly, this is not user-friendly but, more conceptually, a user should never be directly setting/getting attributes or objects (i.e. files) on an Int; these are implementation details

(w.r.t. tab completion, the naming of methods is also not ideal, e.g. if I want to deal with extras, then really I just want to do Int().extra<TAB> to see all the available methods)

You could reduce this to roughly 53:

In [1]: print(sorted([a for a in Int().__dir__() if not a.startswith('_')]))
['add_comment', 'add_incoming', 'class_node_type', 'clear_extras', 'clear_hash', 'clone', 'convert', 'ctime', 'delete_extra', 'delete_extra_many', 'description', 'erase', 'export', 'extras', 'extras_items', 'extras_keys', 'get_all_same_nodes', 'get_cache_source', 'get_comment', 'get_comments', 'get_description', 'get_export_formats', 'get_extra', 'get_extra_many', 'get_hash', 'get_incoming', 'get_outgoing', 'get_stored_link_triples', 'has_cached_links', 'is_created_from_cache', 'is_stored', 'label', 'mtime', 'new', 'node_type', 'pk', 'rehash', 'remove_comment', 'reset_extras', 'set_extra', 'set_extra_many', 'set_source', 'source', 'store', 'store_all', 'update_comment', 'user', 'uuid', 'validate_incoming', 'validate_outgoing', 'validate_storability', 'value', 'verify_are_parents_stored']

(and this could be dynamically reduced even more when the node is stored)
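A minimal sketch of the __dir__ idea, assuming a subclass simply lists the inherited names it wants to hide from completion. The Node and Int classes here are toy stand-ins, and _hidden_from_dir is a hypothetical attribute name.

```python
class Node:
    """Toy stand-in for a node exposing implementation-level accessors."""

    def __init__(self):
        self._attributes = {"value": 1}

    def get_attribute(self, key):
        return self._attributes.get(key)

    def set_attribute(self, key, value):
        self._attributes[key] = value

    def get_extra(self, key):
        return None


class Int(Node):
    # Hypothetical: inherited names that users of Int should not need to see.
    _hidden_from_dir = frozenset({"get_attribute", "set_attribute"})

    @property
    def value(self):
        return self.get_attribute("value")

    def __dir__(self):
        return [name for name in super().__dir__() if name not in self._hidden_from_dir]


public = [name for name in dir(Int()) if not name.startswith("_")]
print(public)  # ['get_extra', 'value']
```

The hidden methods remain callable; they are only dropped from what dir() reports, which is what dir()-based tab completion relies on.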

Analogously, for querying, a user shouldn't need to "know"/care the value is stored as an attribute, they just want to get the value, i.e. they should not be exposed to:

In[1]: QueryBuilder().append(Int, project="attributes.value").dict()
Out[1]:
[{'Int_1': {'attributes.value': 1}}]

they just want to do:

In[1]: QueryBuilder().append(Int, project=Int.fields.value).dict()
Out[1]:
[{'Int_1': {'value': 1}}]

and then this is "magnified" on other types, that have a lot more things stored as keys of attributes

In [1]: Code.fields
Out[1]: 
{'append_text': 'OrmField(attributes.append_text) -> Union[str, NoneType]',
 'ctime': 'OrmField(ctime) -> datetime',
 'description': 'OrmField(description) -> str',
 'extras': 'OrmField(extras.*) -> dict',
 'input_plugin': 'OrmField(attributes.input_plugin) -> Union[str, NoneType]',
 'is_local': 'OrmField(attributes.is_local) -> Union[bool, NoneType]',
 'label': 'OrmField(label) -> str',
 'local_executable': 'OrmField(attributes.local_executable) -> Union[str, '
                     'NoneType]',
 'mtime': 'OrmField(mtime) -> datetime',
 'node_type': 'OrmField(node_type) -> str',
 'pk': 'OrmField(id) -> int',
 'prepend_text': 'OrmField(attributes.prepend_text) -> Union[str, NoneType]',
 'remote_exec_path': 'OrmField(attributes.remote_exec_path) -> Union[str, '
                     'NoneType]',
 'repository_metadata': 'OrmField(repository_metadata) -> Dict',
 'source': 'OrmField(attributes.source) -> Union[dict, NoneType]',
 'user_id': 'OrmField(user_id) -> int -> user',
 'uuid': 'OrmField(uuid) -> str'}

In [2]: CalcJobNode.fields
Out[2]: 
{'ctime': 'OrmField(ctime) -> datetime',
 'description': 'OrmField(description) -> str',
 'exception': 'OrmField(attributes.exception) -> Union[str, NoneType]',
 'exit_message': 'OrmField(attributes.exit_message) -> Union[str, NoneType]',
 'exit_status': 'OrmField(attributes.exit_status) -> Union[int, NoneType]',
 'extras': 'OrmField(extras.*) -> dict',
 'job_state': 'OrmField(attributes.state) -> Union[str, NoneType]',
 'label': 'OrmField(label) -> str',
 'mtime': 'OrmField(mtime) -> datetime',
 'node_type': 'OrmField(node_type) -> str',
 'paused': 'OrmField(attributes.paused) -> bool',
 'pk': 'OrmField(id) -> int',
 'process_label': 'OrmField(attributes.process_label) -> Union[str, NoneType]',
 'process_state': 'OrmField(attributes.process_state) -> Union[str, NoneType]',
 'process_status': 'OrmField(attributes.process_status) -> Union[str, '
                   'NoneType]',
 'process_type': 'OrmField(process_type) -> Union[str, NoneType]',
 'repository_metadata': 'OrmField(repository_metadata) -> Dict',
 'scheduler_state': 'OrmField(attributes.scheduler_state) -> str',
 'sealed': 'OrmField(attributes.sealed) -> bool',
 'user_id': 'OrmField(user_id) -> int -> user',
 'uuid': 'OrmField(uuid) -> str'}

A final note here; the fact that we use an attributes JSONB field to store most of the data for Node subclasses is a compromise we have made, to allow re-use of the same DB table for all Node subclasses. A likely unavoidable compromise, but a compromise nonetheless.
Obviously, if you wanted to make the most efficient implementation for storing e.g. a ProcessNode and its subclasses, you would make most of these attribute fields into actual standalone fields (and set indexes on some).

Or maybe you just meant Dict.fields.attributes?

So coming back to your original question: no, I don't mean attributes; that is an implementation detail, I mean dict. That being said, dict is a bit of a weird one, because for all other "base types" (Int, Float, List, etc) we use value to represent the field of the value.
I could change this field to value, but then there would be a disconnect between the class attribute (Dict().dict) and the query field (Dict.fields.value)

sphuber

Thanks @chrisjsewell . I think the concept proposed here is very useful and badly needed indeed. I went through the changes and most of it looks fine, with some minor comments here and there, but I also still have some more generic questions about the design and the implications.

With the current design, we are effectively coupling the front-end directly to implementation details in the backend. By having to specify the database column in the front-end, the separation that we had will be gone. I am not saying that this would necessarily be a deal-breaker, since anyway we might be getting rid of the need of a separation if we're getting rid of Django, but I wonder if it is good and necessary in principle. What we really need in the front-end for the user is what you summarized very succinctly elsewhere:

keys that relate to "queryable" fields, that are read-only once the node is stored
keys that relate to "queryable" fields, that are read/write always (i.e. extras)
keys that relate to "non-queryable" fields

I would maybe add "queryable" but "private" fields that AiiDA stores in the database (hash etc.). This is an effective implementation-free description of what is stored in the repository and what in the database. Could we not move the concept of OrmFields to the backend layer and have the front-end orm.Entity automatically detect the defined fields on the BackendEntity that it contains? This would allow us to later extend the concept of fields in the front-end Entity with non-queryable fields, which could be data stored in the repository.

A second question is about consistency. There are two methods of defining a field on an ORM class: explicitly declaring it in __fields__, or decorating a class method or property. This is mostly a matter of history and having to keep backwards compatibility, but if we were to design the ORM from scratch, wouldn't we ideally just define the fields explicitly through __fields__ and have those generate properties dynamically? This would save a lot of typing and would guarantee a uniform interface. My question is whether you agree with this and, if so, whether we should already try to take this approach and drop the decorator. We could then also deprecate the methods that happen not to follow the naming convention of the automatically generated getters. There can of course still be custom methods if they need to perform logic beyond just retrieving a database value and returning it.

I think this approach would also prevent inconsistencies in which fields are marked as fields and which aren't. Reading through the code, I think there are quite a few entities that do not have certain fields defined, even though they probably should. Computer.user is one of them, for example. Of course we could always thoroughly go through the code before merging to make sure everything is there, but I was wondering if the other approach may prevent the potential for missing fields.
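To illustrate the alternative raised here, generating getters from the declared fields, the following sketch uses __init_subclass__ to add a read-only property for every field that does not already have a hand-written one. All names (AttrField, Entity, _make_getter) are hypothetical and the storage is reduced to a plain dict.

```python
from typing import Any, Optional


class AttrField:
    """Hypothetical declaration of a value stored under a key in the attributes."""

    def __init__(self, key: str, dtype: Optional[type] = None):
        self.key = key
        self.dtype = dtype


def _make_getter(key: str) -> property:
    def getter(self: "Entity") -> Any:
        return self._attributes.get(key)

    getter.__name__ = key
    return property(getter)


class Entity:
    __fields__ = ()

    def __init__(self, **attributes: Any):
        self._attributes = dict(attributes)

    def __init_subclass__(cls, **kwargs: Any) -> None:
        super().__init_subclass__(**kwargs)
        for field in vars(cls).get("__fields__", ()):
            if field.key not in vars(cls):  # keep hand-written methods/properties
                setattr(cls, field.key, _make_getter(field.key))


class Code(Entity):
    __fields__ = (AttrField("append_text", str), AttrField("is_local", bool))

    @property
    def is_local(self) -> bool:
        # Custom logic beyond returning the raw stored value.
        return bool(self._attributes.get("is_local", False))


code = Code(append_text="echo done")
print(code.append_text)  # echo done
print(code.is_local)     # False
```

Properties generated this way only exist at runtime, so static analysers and editors that do not execute the class body will not see them.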

sphuber · 2021-09-19T15:48:10Z

aiida/orm/implementation/querybuilder.py

@@ -77,6 +77,8 @@ class QueryDictType(TypedDict):
    # mapping: tag -> [] -> field -> 'func' -> 'max' | 'min' | 'count'
    #                                'cast' -> 'b' | 'd' | 'f' | 'i' | 'j' | 't'
    project: Dict[str, List[Dict[str, Dict[str, Any]]]]
+    # mapping: tag -> field -> return key for iterdict method
+    project_map: Dict[str, Dict[str, str]]


You define this here but don't actually use it I believe. Yet in other files (e.g., aiida.orm.querybuilder), you could use it but then just literally write Dict[str, Dict[str, str]] again.

@chrisjsewell what's the plan here? I see a thumbs up but not an implementation.

aiida/orm/nodes/data/structure.py

sphuber · 2021-09-19T17:24:24Z

aiida/orm/nodes/process/calculation/calcjob.py

+    @field_method(f'attributes.{SCHEDULER_STATE_KEY}', key='scheduler_state', dtype=str)
    def get_scheduler_state(self) -> Optional['JobState']:


I notice here that the dtype does not necessarily have to correspond to the return type of the method that it decorates. That makes sense, since the method can operate on the raw database value before returning it, as is the case here. I was just wondering if there could be a potential problem here. I guess in the end it boils down to the choice of defining an entity's fields through its methods. I guess this is mostly just a convenience thing, because you also have the possibility to define the OrmFields directly and manually in the __fields__ attribute. I am wondering if there is a downside to having both, especially since the decoration of methods can have this discrepancy in types.

Yeh, I like the idea of having the field “close” to the method, as it feels like it will make things like refactoring easier, but indeed the special case discrepancies are not ideal. I guess you will still have this problem if, as you mention, we want to look to auto generate the methods from the fields

See #5088 (comment)

sphuber · 2021-09-19T17:25:08Z

aiida/orm/nodes/process/calculation/calcjob.py

+SCHEDULER_STATE_KEY = 'scheduler_state'
+CALC_JOB_STATE_KEY = 'state'


Should we make all class constants of CalcJobNode module constants, as you did for ProcessNode, just for consistency?

Actually, I realised this change was not necessary, so have reverted it (for ProcessNode and CalcJobNode)

aiida/orm/fields.py

chrisjsewell · 2021-09-19T18:54:29Z

Thanks for the feedback @sphuber
some quick mobile points:
(now in a McDonald’s waiting for a train from Zagreb to Munich lol)

With the current design, we are effectively coupling the front-end directly to implementation details in the backend.
Could we not move the concept of OrmFields to the backend layer and have the front-end orm.Entity

I certainly agree we want to push good abstractions 👍. Obviously the query builder was never actually abstracted in the first place, in the sense that you request/receive backend columns directly. I’ll have a think about whether there is a better abstraction, but a key thing here is that I would like this API to be available to plugin developers, to define their own fields on added data nodes, which I think would not be possible if it were moved to the backend

but if we were to design the ORM from scratch, wouldn't we ideally just define the fields explicitly through __fields__

yeh interesting, possibly. With the current implementation, I was sort of thinking along the lines of how sqlalchemy works; having field attributes, which auto-generate the table attribute. What you suggest would be, I guess an inversion of this. The only issue I foresee is with static analysis and instance attribute auto completion, since then all these would be dynamically generated

chrisjsewell · 2021-12-15T07:31:07Z

Ok, in be43aec I have removed the field_method decorator method, instead leaving only the single __qb_fields__ class attribute mechanism for specifying fields.
As @sphuber said, it's over-complex to have two mechanisms for specifying these fields.

I have also renamed __fields__ -> __qb_fields__ and OrmField -> QbField, to make it clear that these are fields that can be returned from the QueryBuilder.
Lastly, I have added the QbAttrField subclass, as a shorthand for specifying keys in the attributes field. This is what data subclasses use to denote data they store in the attributes.
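As a rough illustration of the single-mechanism design described above, the sketch below gathers `__qb_fields__` declarations along a class hierarchy. The classes are toy stand-ins written for this example and are not the implementation in aiida/orm/fields.py; the constructor signatures and the `gather_fields` helper are assumptions.

```python
from typing import Any, Dict, Optional


class QbField:
    """Toy stand-in: maps a front-end key to a backend field name."""

    def __init__(self, key: str, qb_field: Optional[str] = None, *, dtype: Any = None) -> None:
        self.key = key
        self.qb_field = qb_field or key  # backend name defaults to the front-end key
        self.dtype = dtype

    def __repr__(self) -> str:
        type_name = getattr(self.dtype, '__name__', str(self.dtype))
        return f'{type(self).__name__}({self.qb_field}) -> {type_name}'


class QbAttrField(QbField):
    """Toy stand-in: shorthand for a key stored inside the ``attributes`` field."""

    def __init__(self, key: str, *, dtype: Any = None) -> None:
        super().__init__(key, f'attributes.{key}', dtype=dtype)


def gather_fields(cls: type) -> Dict[str, QbField]:
    """Hypothetical helper: collect fields from base classes first, then subclasses."""
    fields: Dict[str, QbField] = {}
    for klass in reversed(cls.__mro__):
        # Look only at what each class declares itself, not what it inherits.
        for field in klass.__dict__.get('__qb_fields__', ()):
            fields[field.key] = field
    return fields


class Entity:
    __qb_fields__ = (QbField('pk', 'id', dtype=int),)


class NewData(Entity):
    __qb_fields__ = (QbAttrField('new_key', dtype=str),)


fields = gather_fields(NewData)
print(fields)
```

Walking the MRO in reverse means a subclass can redeclare a key and override the inherited field, which is one plausible way to get the "inherit parent fields, then add extras" behaviour from the PR description.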

chrisjsewell · 2022-02-17T22:05:11Z

@sphuber this is all rebased and "simplified" (see #5088 (comment)), if you have time to take another look

Outstanding:

whether to uncomment "JSONB only" methods on QbField (or have some way to activate only relevant methods)
A sphinx directive for displaying all the available QueryBuilder fields, per node type, in the documentation

edan-bainglass · 2023-12-24T08:57:21Z

Reviewing this PR. Tests passing where it is currently. I'm now incrementally moving (rebasing) it towards v2.5.

I've made it as far as #6141. A few minor adjustments along the way. Will discuss these at a later time.

I'm now running into an issue with #6134. Specifically, the new test_iterall_persistence test in test_querybuilder.py passes but raises a logging error. See my comments at #6134.

edan-bainglass · 2023-12-30T08:19:53Z

Okay. Rebased on main (e1255ce3). All tests pass other than the #6134 issue that needs review. @sphuber @chrisjsewell ready for your review.

A good deal of development has been merged since Chris last gave this attention (I presume). Though my rebasing covered some of it, I suspect I may have missed a thing or two (or more). One thing that comes to mind is whether any core data types have been added that require __qb_fields__ attention.

A few other things that I don't quite understand:

__qb_fields__ in ProcessNode and CalcJobNode, but not WorkChainNode
QbField and QbAttrField, but not QbExtraField

sphuber · 2024-01-03T08:35:52Z

Thanks @edan-bainglass. You are not yet up to date with main. Note that a few weeks ago (after the release) the aiida directory was moved to src/aiida. Maybe it is just that you haven't moved the changes in yet. If you need help fixing it, let me know.

edan-bainglass · 2024-01-03T08:40:09Z

Huh 🤔 I was sure I included the move to src (between v2.5.0 and main). Maybe I didn't push? I'll check in a bit. Out replenishing stock 🛒

Back from 🛒

Oh right. I'm doing all of this on my own fork (of Chris's fork). There I am caught up to main. And there the issue occurs. However, per @sphuber (see comments on #6134), since the issue occurs during post-testing teardown, it can safely be ignored for now.

So, with that out of the way, what is the policy? Should I PR to Chris's fork, or just push my rebasing directly to his repo? Also, there are still comments on the PR that I am awaiting response. @chrisjsewell, if you can please address these, that'd be super 🙏

sphuber · 2024-01-03T13:15:42Z

> So, with that out of the way, what is the policy? Should I PR to Chris's fork, or just push my rebasing directly to his repo? Also, there are still comments on the PR that I am awaiting response. @chrisjsewell, if you can please address these, that'd be super 🙏

There is no real existing policy. Depends on what @chrisjsewell prefers. But if he is no longer interested or doesn't have the time, which I can understand, maybe we can continue with your fork. We close this PR and you open a new one and continue there. That might be easiest

edan-bainglass · 2024-01-03T13:18:17Z

Sure. If @chrisjsewell has no objections, we can proceed as you suggest.

chrisjsewell marked this pull request as ready for review August 19, 2021 05:00

chrisjsewell marked this pull request as draft August 19, 2021 05:01

chrisjsewell force-pushed the entity-fields branch from b403194 to bf90802 Compare August 19, 2021 22:49

chrisjsewell commented Aug 20, 2021

chrisjsewell marked this pull request as ready for review August 20, 2021 12:20

chrisjsewell force-pushed the entity-fields branch 2 times, most recently from b326479 to e20114e Compare August 25, 2021 10:20

This was linked to issues Aug 25, 2021

Allow usage of pk in QueryBuilder project and filter statements #2577

Closed

Extend QueryBuilder to support mappings of entity attribute name to a differently named column #2196

Open

chrisjsewell force-pushed the entity-fields branch from 3ed1875 to d768a62 Compare September 16, 2021 13:17

sphuber reviewed Sep 19, 2021

chrisjsewell mentioned this pull request Sep 24, 2021

♻️ REFACTOR: New archive format #5145

Merged

1 task

sphuber mentioned this pull request Nov 17, 2021

orm.Computer define class constants for all properties #2135

Open

chrisjsewell force-pushed the entity-fields branch 2 times, most recently from 48f157d to 9631a91 Compare February 17, 2022 21:12

✨ NEW: Add orm.Entity.fields interface for QueryBuilder

02bed54

chrisjsewell force-pushed the entity-fields branch from 9631a91 to 02bed54 Compare February 17, 2022 21:47

chrisjsewell requested a review from sphuber February 17, 2022 22:02

This was referenced Mar 14, 2022

👌 IMPROVE: Hide dev/non-public API methods #4976

Closed

♻️ REFACTOR: NodeRepositoryMixin -> NodeRepository #5439

Closed

chrisjsewell mentioned this pull request Apr 12, 2022

DOC: add list of valid keys in query builder of project and filter #5489

Closed

janssenhenning mentioned this pull request Aug 6, 2022

Alternative implementation for complex number support #5614

Closed

2 tasks

sphuber mentioned this pull request Nov 10, 2022

DevOps: Add AiiDA deprecation warnings aiidateam/aiida-restapi#45

Merged

edan-bainglass mentioned this pull request Jan 5, 2024

✨ NEW: Add orm.Entity.fields interface for QueryBuilder (cont.) #6245

Merged

sphuber mentioned this pull request Jan 19, 2024

Add AEP: Add a schema to ORM classes aiidateam/AEP#40

Open

sphuber closed this Mar 13, 2024

sphuber mentioned this pull request Jun 1, 2024

Allow usage of pk in QueryBuilder project and filter statements #2577

Closed

