-
Notifications
You must be signed in to change notification settings - Fork 9
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Partial context updates #93
base: dev
Are you sure you want to change the base?
Conversation
…ppavlov/dialog_flow_framework into feat/partial_context_updates
…d classes; rework validation
This comment was marked as outdated.
This comment was marked as outdated.
…ppavlov/dialog_flow_framework into feat/partial_context_updates
This reverts commit 5340256. This feature should be implemented in a separate PR.
Suggestion for performance analysis:
|
OSError: [Errno 24] Too many open files: '/root/.cache/pypoetry/virtualenvs/chatsky-KJesWijk-py3.10/lib/python3.10/site-packages/sqlalchemy/dialects/mysql/init.py' Traceback:
|
…ppavlov/dialog_flow_framework into feat/partial_context_updates
Remove create_logger (logger configuration should not happen inside the library).
# Conflicts: # chatsky/utils/db_benchmark/benchmark.py # poetry.lock
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Move this file to chatsky/core
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
chatsky/context_storages/file.py
Outdated
:param serializer: Serializer that will be used for serializing contexts. | ||
""" | ||
|
||
is_asynchronous = False |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the point of this flag?
Even if you limit concurrent execution within the processing of a single context, you still have potentially multiple contexts being processed at the same time.
IMO it's better to use asyncio.lock
inside the methods that require to read and write data within a single method call.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
_items: Dict[K, V] = PrivateAttr(default_factory=dict) | ||
_hashes: Dict[K, int] = PrivateAttr(default_factory=dict) | ||
_keys: Set[K] = PrivateAttr(default_factory=set) | ||
_added: Set[K] = PrivateAttr(default_factory=set) | ||
_removed: Set[K] = PrivateAttr(default_factory=set) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't like that many private attributes.
Maybe split this into two classes (context dict and context dict db connector)?
The first would implement all dict methods and have a single private attribute (context dict db connector);
and the second would hold all the fields and implement some methods (load, get, set).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Honestly, I dislike logging. For now, it just looks inconsistent, why do we have it only for a fraction of code. I think we should either add it literally everywhere or only preserve it somewhere where it would make sence (e.g. when calling some external interfaces, like databases - but in that case we should also consider adding logging to message interfaces).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could anyone please confirm this file works fine? Considering all the pydantic models added?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If there's no error, it works fine.
But yes. Some of these rebuilds are no longer required:
SerializableStorage
no longer referencesContext
, so there's no need to rebuild it;ContextDict
does indeed need to be rebuild withDBContextStorage
because it is imported in aTYPE_CHECKING
block duringContextDict
definition.
I would suggest trying to remove all the changes but
from chatsky.context_storages.database import DBContextStorage
from chatsky.core.ctx_dict import ContextDict
ContextDict.model_rebuild()
and see if that works.
""" | ||
class Turn(BaseModel): | ||
label: Optional[NodeLabel2Type] = Field(default=None) | ||
request: Optional[Message] = Field(default=None) | ||
response: Optional[Message] = Field(default=None) | ||
""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need Turn
s after all? Do we add methods like last_turn
, or turn_id(int)
?..
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not as a class.
We could add a method that zips all the turn fields for convenience:
async def turns(self, slice) -> Iterable[Label, Message, Message]:
return zip(*asyncio.gather(
self.labels.__getitem__(slice),
self.requests.__getitem__(slice),
self.responses.__getitem__(slice)
))
init_kwargs = { | ||
"labels": {0: AbsoluteNodeLabel.model_validate(start_label)}, | ||
} | ||
async def connected( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I remember we wamted to get rid of connected
method. Do we still want to do that? If so, what should be the preferred alternatives?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I like the way it is now but I need a bit more time to work with the new context to see if there are any issues.
_value_type: Optional[TypeAdapter[Type[V]]] = PrivateAttr(None) | ||
|
||
@classmethod | ||
async def new(cls, storage: DBContextStorage, id: str, field: str, value_type: Type[V]) -> "ContextDict": |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Once again, just like with Context
class, what's our final decision about ContextDict
creation? Should we preserve two classmethod
s or switch to one?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My issues with the new creation methods stem from the fact that simply initializing context with Context()
will no longer produce a functional context but you would also need to set _value_type
for all context dicts:
Lines 89 to 92 in 46e0112
ctx = Context() | |
ctx.labels._value_type = TypeAdapter(AbsoluteNodeLabel) | |
ctx.requests._value_type = TypeAdapter(Message) | |
ctx.responses._value_type = TypeAdapter(Message) |
Which is not convenient for testing and debugging.
I think the best solution is to add validators to Context
that would prep its context dicts (by setting their value types) so that Context()
is a functional context without db connection.
await conn.run_sync(storage.table.drop, storage.engine) | ||
async with storage.engine.begin() as conn: | ||
for table in [storage.main_table, storage.turns_table]: | ||
await conn.run_sync(table.drop, storage.engine) | ||
|
||
|
||
async def delete_ydb(storage: YDBContextStorage): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is an issue with YDB that some coroutines remain running after the cleanup. Still, everything works fine (Except for disturbing log entries). Are we OK with that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you give more info?
Does await storage.pool.retry_operation(callee)
request the db to cleanup but doesn't wait until it does so?
Private methods | ||
^^^^^^^^^^^^^^^ | ||
|
||
These methods should not be used outside of the internal workings. | ||
|
||
* **set_last_response** | ||
* **set_last_request** | ||
* **add_request** | ||
* **add_response** | ||
* **add_label** | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is some ambiguity about context field access. Most of the times, they are accessed like this: ctx.labels[ctx.current_turn_id]
. However inside of the context we call them like this: self.labels._items[self.labels.keys()[-1]]
. We could've also added a property setter for that. I think we should define a universally-correct way and use it everywhere.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The difference between the two is that ctx.labels[ctx.current_turn_id]
is a coroutine and I didn't want last_...
kind of properties to become async.
That reminds me that for this workaround to work we need to make subscript value at least 1
(so that it always loads the last turn).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The first two comments are from my unfinished review.
self.engine = create_async_engine(self.full_path, pool_pre_ping=True) | ||
self.dialect: str = self.engine.dialect.name | ||
self._insert_limit = _get_write_limit(self.dialect) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is not used.
self.main_table = Table( | ||
f"{table_name_prefix}_{self._main_table_name}", | ||
metadata, | ||
Column(self._id_column_name, String(self._UUID_LENGTH), index=True, unique=True, nullable=False), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Context id doesn't have to be a UUID.
It could also be a telegram username, so it is not limited to uuid length.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If there's no error, it works fine.
But yes. Some of these rebuilds are no longer required:
SerializableStorage
no longer referencesContext
, so there's no need to rebuild it;ContextDict
does indeed need to be rebuild withDBContextStorage
because it is imported in aTYPE_CHECKING
block duringContextDict
definition.
I would suggest trying to remove all the changes but
from chatsky.context_storages.database import DBContextStorage
from chatsky.core.ctx_dict import ContextDict
ContextDict.model_rebuild()
and see if that works.
""" | ||
class Turn(BaseModel): | ||
label: Optional[NodeLabel2Type] = Field(default=None) | ||
request: Optional[Message] = Field(default=None) | ||
response: Optional[Message] = Field(default=None) | ||
""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not as a class.
We could add a method that zips all the turn fields for convenience:
async def turns(self, slice) -> Iterable[Label, Message, Message]:
return zip(*asyncio.gather(
self.labels.__getitem__(slice),
self.requests.__getitem__(slice),
self.responses.__getitem__(slice)
))
init_kwargs = { | ||
"labels": {0: AbsoluteNodeLabel.model_validate(start_label)}, | ||
} | ||
async def connected( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I like the way it is now but I need a bit more time to work with the new context to see if there are any issues.
_value_type: Optional[TypeAdapter[Type[V]]] = PrivateAttr(None) | ||
|
||
@classmethod | ||
async def new(cls, storage: DBContextStorage, id: str, field: str, value_type: Type[V]) -> "ContextDict": |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My issues with the new creation methods stem from the fact that simply initializing context with Context()
will no longer produce a functional context but you would also need to set _value_type
for all context dicts:
Lines 89 to 92 in 46e0112
ctx = Context() | |
ctx.labels._value_type = TypeAdapter(AbsoluteNodeLabel) | |
ctx.requests._value_type = TypeAdapter(Message) | |
ctx.responses._value_type = TypeAdapter(Message) |
Which is not convenient for testing and debugging.
I think the best solution is to add validators to Context
that would prep its context dicts (by setting their value types) so that Context()
is a functional context without db connection.
await conn.run_sync(storage.table.drop, storage.engine) | ||
async with storage.engine.begin() as conn: | ||
for table in [storage.main_table, storage.turns_table]: | ||
await conn.run_sync(table.drop, storage.engine) | ||
|
||
|
||
async def delete_ydb(storage: YDBContextStorage): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you give more info?
Does await storage.pool.retry_operation(callee)
request the db to cleanup but doesn't wait until it does so?
Private methods | ||
^^^^^^^^^^^^^^^ | ||
|
||
These methods should not be used outside of the internal workings. | ||
|
||
* **set_last_response** | ||
* **set_last_request** | ||
* **add_request** | ||
* **add_response** | ||
* **add_label** | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The difference between the two is that ctx.labels[ctx.current_turn_id]
is a coroutine and I didn't want last_...
kind of properties to become async.
That reminds me that for this workaround to work we need to make subscript value at least 1
(so that it always loads the last turn).
Description
Context storages are updated partially now instead of reading and writing whole data at once.
Checklist
UpdateScheme
fromBaseModel
clear
method.