
Partial context updates #93

Open · wants to merge 310 commits into base: dev
Conversation


@pseusys pseusys commented Mar 13, 2023

Description

Context storages are now updated partially, instead of reading and writing all data at once.

Checklist

  • I have covered the code with tests
  • I have added comments to my code to help others understand it
  • I have updated the documentation to reflect the changes
  • I have performed a self-review of the changes
  • Consider extending UpdateScheme from BaseModel
  • Decide how we want to use the clear method

@pseusys pseusys self-assigned this Mar 13, 2023
@pseusys pseusys requested review from kudep and RLKRo April 7, 2023 01:43
@pseusys pseusys added the enhancement New feature or request label Apr 7, 2023
@pseusys pseusys marked this pull request as ready for review April 7, 2023 01:43
@kudep kudep marked this pull request as draft April 24, 2023 16:41
(Resolved review threads on dff/context_storages/database.py, dff/context_storages/update_scheme.py, and dff/context_storages/json.py.)

RLKRo commented Oct 24, 2024

@pseusys
I reverted get_context_ids and filters to process them in a separate PR (#399).


RLKRo commented Oct 24, 2024

Suggestion for performance analysis:

  1. Add logging: debug-level logs inside the db methods that record the statements used. Emit the logs both before and after statement execution, so that execution time can be measured from the difference between the two log timestamps. Also add logs inside the context dict.
  2. Then try the following: enable the logs, clear the db, add a context with 10000 turns to the db (message and misc dimensions both (10, 10)), run the pipeline on that context, and chat with it via CLI. This will collect the logs and confirm whether updating the context at that point actually takes 3 seconds (per the benchmark results).
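The paired before/after logging suggested above could be packaged as a small context manager. This is only a sketch; the `log_statement` helper and the logger name are hypothetical, not part of the chatsky codebase:

```python
import logging
import time
from contextlib import contextmanager

logger = logging.getLogger("chatsky.context_storages")


@contextmanager
def log_statement(statement: str):
    """Log a statement before and after execution, so the difference
    between the two log timestamps gives its duration (also recorded
    explicitly here for convenience)."""
    logger.debug("executing: %s", statement)
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        logger.debug("finished: %s (%.3f s)", statement, elapsed)


# Hypothetical usage inside a db method:
with log_statement("UPDATE contexts SET ..."):
    pass  # execute the statement here
```

Wrapping each statement this way keeps the two log events adjacent in the output, which makes the per-statement timing easy to extract with a simple log parser.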


RLKRo commented Oct 25, 2024

OSError: [Errno 24] Too many open files: '/root/.cache/pypoetry/virtualenvs/chatsky-KJesWijk-py3.10/lib/python3.10/site-packages/sqlalchemy/dialects/mysql/__init__.py'

Traceback:

  File "/home/git_clones/chatsky/chatsky/utils/db_benchmark/benchmark.py", line 290, in _run
    self.db_factory.db(),
  File "/home/git_clones/chatsky/chatsky/utils/db_benchmark/benchmark.py", line 146, in db
    return getattr(module, self.factory)(self.uri)
  File "/home/git_clones/chatsky/chatsky/context_storages/database.py", line 195, in context_storage_factory
    return target_class(path, **kwargs)
  File "/home/git_clones/chatsky/chatsky/context_storages/sql.py", line 156, in __init__
  File "/root/.cache/pypoetry/virtualenvs/chatsky-KJesWijk-py3.10/lib/python3.10/site-packages/sqlalchemy/ext/asyncio/engine.py", line 120, in create_async_engine
  File "<string>", line 2, in create_engine
  File "/root/.cache/pypoetry/virtualenvs/chatsky-KJesWijk-py3.10/lib/python3.10/site-packages/sqlalchemy/util/deprecations.py", line 281, in warned
  File "/root/.cache/pypoetry/virtualenvs/chatsky-KJesWijk-py3.10/lib/python3.10/site-packages/sqlalchemy/engine/create.py", line 550, in create_engine
  File "/root/.cache/pypoetry/virtualenvs/chatsky-KJesWijk-py3.10/lib/python3.10/site-packages/sqlalchemy/engine/url.py", line 758, in _get_entrypoint
  File "/root/.cache/pypoetry/virtualenvs/chatsky-KJesWijk-py3.10/lib/python3.10/site-packages/sqlalchemy/util/langhelpers.py", line 365, in load
  File "/root/.cache/pypoetry/virtualenvs/chatsky-KJesWijk-py3.10/lib/python3.10/site-packages/sqlalchemy/dialects/__init__.py", line 47, in _auto_fn
  File "<frozen importlib._bootstrap>", line 1027, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1006, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 688, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 879, in exec_module
  File "<frozen importlib._bootstrap_external>", line 1016, in get_code
  File "<frozen importlib._bootstrap_external>", line 1073, in get_data

Member:

Move this file to chatsky/core.


:param serializer: Serializer that will be used for serializing contexts.
"""

is_asynchronous = False
Member:

What is the point of this flag?
Even if you limit concurrent execution within the processing of a single context, you still have potentially multiple contexts being processed at the same time.
IMO it's better to use asyncio.lock inside the methods that require to read and write data within a single method call.


Comment on lines 25 to 29
_items: Dict[K, V] = PrivateAttr(default_factory=dict)
_hashes: Dict[K, int] = PrivateAttr(default_factory=dict)
_keys: Set[K] = PrivateAttr(default_factory=set)
_added: Set[K] = PrivateAttr(default_factory=set)
_removed: Set[K] = PrivateAttr(default_factory=set)
Member:

I don't like having this many private attributes.
Maybe split this into two classes (context dict and context dict db connector)?
The first would implement all dict methods and have a single private attribute (context dict db connector);
and the second would hold all the fields and implement some methods (load, get, set).
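The proposed split could look roughly like the sketch below. All names here (`ContextDictConnector`, its methods) are hypothetical, and the bookkeeping is simplified to the add/remove sets visible in the diff:

```python
from typing import Dict, Generic, Set, TypeVar

K = TypeVar("K")
V = TypeVar("V")


class ContextDictConnector(Generic[K, V]):
    """Holds all the bookkeeping state and (eventually) talks to the db."""

    def __init__(self) -> None:
        self.items: Dict[K, V] = {}
        self.added: Set[K] = set()
        self.removed: Set[K] = set()

    def set(self, key: K, value: V) -> None:
        self.items[key] = value
        self.added.add(key)
        self.removed.discard(key)

    def get(self, key: K) -> V:
        return self.items[key]

    def delete(self, key: K) -> None:
        del self.items[key]
        self.removed.add(key)
        self.added.discard(key)


class ContextDict(Generic[K, V]):
    """Implements the dict interface; its only private state is the connector."""

    def __init__(self) -> None:
        self._connector: ContextDictConnector[K, V] = ContextDictConnector()

    def __setitem__(self, key: K, value: V) -> None:
        self._connector.set(key, value)

    def __getitem__(self, key: K) -> V:
        return self._connector.get(key)

    def __delitem__(self, key: K) -> None:
        self._connector.delete(key)
```

This keeps the dict facade thin while the connector owns every field, so only one private attribute remains on the pydantic side.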

@pseusys left a comment:

Honestly, I dislike the logging. For now it just looks inconsistent: why do we have it for only a fraction of the code? I think we should either add it literally everywhere, or only keep it where it makes sense (e.g. when calling some external interfaces, like databases; but in that case we should also consider adding logging to message interfaces).

Collaborator Author:

Could anyone please confirm this file works fine, considering all the pydantic models added?

Member:

If there's no error, it works fine.
But yes, some of these rebuilds are no longer required:

  • SerializableStorage no longer references Context, so there's no need to rebuild it;
  • ContextDict does need to be rebuilt with DBContextStorage, because it is imported in a TYPE_CHECKING block during the ContextDict definition.

I would suggest trying to remove all the changes but

from chatsky.context_storages.database import DBContextStorage
from chatsky.core.ctx_dict import ContextDict

ContextDict.model_rebuild()

and see if that works.

Comment on lines +42 to +47
"""
class Turn(BaseModel):
label: Optional[NodeLabel2Type] = Field(default=None)
request: Optional[Message] = Field(default=None)
response: Optional[Message] = Field(default=None)
"""
Collaborator Author:

Do we need Turns after all? Should we add methods like last_turn or turn_id(int)?

Member:

Not as a class.
We could add a method that zips all the turn fields for convenience:

async def turns(self, slice) -> Iterable[Tuple[Label, Message, Message]]:
    return zip(*await asyncio.gather(
        self.labels.__getitem__(slice),
        self.requests.__getitem__(slice),
        self.responses.__getitem__(slice)
    ))

init_kwargs = {
"labels": {0: AbsoluteNodeLabel.model_validate(start_label)},
}
async def connected(
Collaborator Author:

I remember we wanted to get rid of the connected method. Do we still want to do that? If so, what should the preferred alternatives be?

Member:

I think I like the way it is now, but I need a bit more time to work with the new context to see if there are any issues.

_value_type: Optional[TypeAdapter[Type[V]]] = PrivateAttr(None)

@classmethod
async def new(cls, storage: DBContextStorage, id: str, field: str, value_type: Type[V]) -> "ContextDict":
Collaborator Author:

Once again, just like with the Context class: what's our final decision about ContextDict creation? Should we keep the two classmethods or switch to one?

Member:

My issues with the new creation methods stem from the fact that simply initializing a context with Context() will no longer produce a functional context; you would also need to set _value_type for all the context dicts:

ctx = Context()
ctx.labels._value_type = TypeAdapter(AbsoluteNodeLabel)
ctx.requests._value_type = TypeAdapter(Message)
ctx.responses._value_type = TypeAdapter(Message)

This is not convenient for testing and debugging.

I think the best solution is to add validators to Context that would prep its context dicts (by setting their value types) so that Context() is a functional context without db connection.
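The proposed validator approach could look roughly like this sketch. The models here are simplified stand-ins (value types are `int`/`str` instead of AbsoluteNodeLabel and Message, and `validate_value` is hypothetical), so this only illustrates the mechanism:

```python
from typing import Any, Optional

from pydantic import BaseModel, PrivateAttr, TypeAdapter, model_validator


class ContextDict(BaseModel):
    # Simplified stand-in for the real ContextDict.
    _value_type: Optional[TypeAdapter] = PrivateAttr(default=None)

    def validate_value(self, value: Any) -> Any:
        assert self._value_type is not None, "value type was never set"
        return self._value_type.validate_python(value)


class Context(BaseModel):
    labels: ContextDict = ContextDict()
    requests: ContextDict = ContextDict()
    responses: ContextDict = ContextDict()

    @model_validator(mode="after")
    def _set_value_types(self) -> "Context":
        # In the real code these would be AbsoluteNodeLabel and Message.
        self.labels._value_type = TypeAdapter(int)
        self.requests._value_type = TypeAdapter(str)
        self.responses._value_type = TypeAdapter(str)
        return self


# Context() is now functional without any manual _value_type assignments.
ctx = Context()
```

Because the validator runs on every construction path (including deserialization), no caller would ever see a context dict with an unset value type.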

await conn.run_sync(storage.table.drop, storage.engine)
async with storage.engine.begin() as conn:
for table in [storage.main_table, storage.turns_table]:
await conn.run_sync(table.drop, storage.engine)


async def delete_ydb(storage: YDBContextStorage):
Collaborator Author:

There is an issue with YDB: some coroutines remain running after the cleanup. Still, everything works fine (except for some disturbing log entries). Are we OK with that?

Member:

Could you give more info?
Does await storage.pool.retry_operation(callee) request the db cleanup but not wait until it completes?
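If the stray coroutines turn out to be driver-internal tasks that nobody awaits, one generic way to silence the "Task was destroyed but it is pending!" entries is to cancel and await everything still pending before the loop closes. This is a sketch of that idea, not YDB-specific code:

```python
import asyncio


async def shutdown_cleanup() -> None:
    """Cancel and await any stray tasks (e.g. driver-internal coroutines)
    before the event loop closes, so they don't produce
    'Task was destroyed but it is pending!' log entries."""
    pending = [t for t in asyncio.all_tasks() if t is not asyncio.current_task()]
    for task in pending:
        task.cancel()
    # return_exceptions swallows the CancelledError raised by each task.
    await asyncio.gather(*pending, return_exceptions=True)
```

This is blunt (it cancels everything), so in the test suite it would belong at the very end of teardown, after the storage itself has been cleaned up.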

Comment on lines -157 to -167
Private methods
^^^^^^^^^^^^^^^

These methods should not be used outside of the internal workings.

* **set_last_response**
* **set_last_request**
* **add_request**
* **add_response**
* **add_label**

Collaborator Author:

There is some ambiguity about context field access. Most of the time, fields are accessed like this: ctx.labels[ctx.current_turn_id]. However, inside the context we call them like this: self.labels._items[self.labels.keys()[-1]]. We could also have added a property setter for that. I think we should define a single correct way and use it everywhere.

Member:

The difference between the two is that ctx.labels[ctx.current_turn_id] is a coroutine and I didn't want last_... kind of properties to become async.

That reminds me that for this workaround to work, we need to make the subscript value at least 1 (so that it always loads the last turn).

@RLKRo left a comment:

The first two comments are from my unfinished review.

self.engine = create_async_engine(self.full_path, pool_pre_ping=True)
self.dialect: str = self.engine.dialect.name
self._insert_limit = _get_write_limit(self.dialect)
Member:

This is not used.

self.main_table = Table(
f"{table_name_prefix}_{self._main_table_name}",
metadata,
Column(self._id_column_name, String(self._UUID_LENGTH), index=True, unique=True, nullable=False),
Member:

The context id doesn't have to be a UUID.
It could also be, for example, a Telegram username, so it is not limited to UUID length.
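One portable way to lift the UUID-length cap while keeping the column indexable would be a generous VARCHAR bound. The table and column names below are illustrative, not the actual chatsky schema:

```python
from sqlalchemy import Column, MetaData, String, Table

metadata = MetaData()

# A length-free String/Text is not portable for indexed columns
# (MySQL requires an explicit length there), so use a generous bound
# instead of capping the id at UUID length.
main_table = Table(
    "contexts_main",
    metadata,
    Column("id", String(255), index=True, unique=True, nullable=False),
)
```

255 comfortably covers UUIDs (36 characters) as well as usernames, while still fitting within MySQL's index key limits for common charsets.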


Labels: enhancement (New feature or request)
4 participants