
Feat/pipeline simpler fitting #36

Merged · merged 30 commits into dev from feat/pipeline-simpler-fitting on Nov 12, 2024
Conversation

@voorhs (Collaborator) commented Nov 5, 2024

A concise example of working with the new API for pipeline optimization:

# load data
from autointent.context.data_handler import Dataset
from autointent.context.utils import load_data

train_dataset = load_data("./data/train_data.json")
val_dataset = load_data("./data/test_data.json")

# define search space
from autointent.pipeline.optimization import PipelineOptimizer

config = {
    "nodes": [
        {
            "node_type": "scoring",
            "metric": "scoring_roc_auc",
            "search_space": [
                {"module_type": "knn", "k": [5, 10], "weights": ["uniform", "distance", "closest"], "model_name": ["avsolatorio/GIST-small-Embedding-v0"]},
                {"module_type": "linear", "model_name": ["avsolatorio/GIST-small-Embedding-v0"]},
            ],
        },
        {
            "node_type": "prediction",
            "metric": "prediction_accuracy",
            "search_space": [
                {"module_type": "threshold", "thresh": [0.5]},
                {"module_type": "tunable"},
            ],
        },
    ]
}

pipeline_optimizer = PipelineOptimizer.from_dict_config(config)

# optionally, configure your run
from autointent.configs.optimization_cli import LoggingConfig, VectorIndexConfig, EmbedderConfig
from pathlib import Path

pipeline_optimizer.set_config(LoggingConfig(run_name="sweet_cucumber", dirpath=Path(".").resolve(), dump_modules=False))
pipeline_optimizer.set_config(VectorIndexConfig(db_dir=Path("./my_vector_db").resolve(), device="cuda"))
pipeline_optimizer.set_config(EmbedderConfig(batch_size=16, max_length=32))

# run optimization
context = pipeline_optimizer.optimize_from_dataset(train_dataset, val_dataset)

# dump logs
context.dump()

Other new features:

  • Context initialization is no longer as cumbersome
  • modules can be skipped from dumping by setting logs.dump_modules=False in the config

TODO:

  • an option controlling whether modules are cleared from RAM (i.e., skip gc.collect() and the like at the user's request)
  • clean up db_dir at the user's request
  • fix unintended runs directory creation

@voorhs voorhs requested a review from Samoed November 5, 2024 15:51
    def get_max_length(self) -> int | None:
        return self.vector_index_client.embedder_max_length

    def get_dump_dir(self) -> Path | None:
Collaborator:

Just make these get... methods plain properties.
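
A minimal sketch of what the property version could look like (the property names here are only illustrative, and the second body is elided because it is not shown in the diff):

    @property
    def max_length(self) -> int | None:
        return self.vector_index_client.embedder_max_length

    @property
    def dump_dir(self) -> Path | None:
        ...  # same body as the current get_dump_dir()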

Collaborator:

+1

        context.config_logs(self.logging_config)
        context.config_vector_index(self.vector_index_config, self.embedder_config)

        self.optimize(context)
Collaborator:

Maybe it would be better to make this part of the optimizer's init, so that it can then run the optimization on its own?

Collaborator (Author):

Are you suggesting we create yet another class with its own init that would either build a context or accept an existing one?

Collaborator:

I suggest keeping a method like this only on the context itself.
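
Roughly, the suggested flow might look like this (a hypothetical sketch; only the config_logs / config_vector_index / optimize calls are taken from this PR, the rest is assumed):

    # The user configures the Context directly; the optimizer only consumes it.
    context = Context(seed)
    context.config_logs(logging_config)
    context.config_vector_index(vector_index_config, embedder_config)

    pipeline_optimizer.optimize(context)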

@voorhs voorhs marked this pull request as draft November 6, 2024 10:19
voorhs and others added 6 commits November 6, 2024 13:34
# Conflicts:
#	autointent/context/optimization_info/data_models.py
#	autointent/context/optimization_info/optimization_info.py
#	autointent/pipeline/inference/inference_pipeline.py
#	autointent/pipeline/optimization/pipeline_optimizer.py
@voorhs voorhs marked this pull request as ready for review November 8, 2024 20:13
@voorhs voorhs requested a review from Samoed November 8, 2024 20:18
Darinochka and others added 4 commits November 9, 2024 10:47
* test: added inference_test

* test: added inference pipeline cli

* test: fixed device

* test: added optimization tests

* fix `inference_config.yaml` not found error

---------

Co-authored-by: voorhs <[email protected]>
@voorhs (Collaborator, Author) commented Nov 9, 2024

The PR is ready to merge; I'm just waiting for a review from someone.

from .data_handler import Dataset


class NumpyEncoder(json.JSONEncoder):
Collaborator:

It looks like this class is no longer needed.

Collaborator (Author):

It is still used in Context.dump() and inference.cli_endpoint.main().
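
For reference, the typical shape of such an encoder (a minimal sketch; the exact body in the repo may differ):

    import json

    import numpy as np


    class NumpyEncoder(json.JSONEncoder):
        """JSON encoder that converts numpy scalars and arrays into built-in types."""

        def default(self, obj):
            if isinstance(obj, np.integer):
                return int(obj)
            if isinstance(obj, np.floating):
                return float(obj)
            if isinstance(obj, np.ndarray):
                return obj.tolist()
            return super().default(obj)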



@dataclass
class ModulesList:
Collaborator:

Maybe we should use Pydantic everywhere? Are there any downsides?

Collaborator (Author):

In this particular place I couldn't use pydantic because of a complicated typing scheme and a circular import problem: there was an error saying the Module object was not yet defined. As soon as I switched to a dataclass, the error went away.
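
For context, the dataclass pattern that sidesteps such a circular import usually looks something like this (field names and import paths here are illustrative, not the actual ones):

    from __future__ import annotations

    from dataclasses import dataclass, field
    from typing import TYPE_CHECKING

    if TYPE_CHECKING:
        # Imported only for static type checking, so there is no import cycle at runtime.
        from autointent.modules import Module  # hypothetical import path


    @dataclass
    class ModulesList:
        # The postponed annotation is resolved lazily, so Module does not have to be
        # defined when the class is created (a pydantic model would try to resolve
        # its field types eagerly and fail here).
        modules: list[Module] = field(default_factory=list)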

    def get_max_length(self) -> int | None:
        return self.vector_index_client.embedder_max_length

    def get_dump_dir(self) -> Path | None:
Collaborator:

+1

@@ -52,3 +52,6 @@ def predict(self, *args: list[str] | npt.NDArray[Any], **kwargs: dict[str, Any])
    @abstractmethod
    def from_context(cls, context: Context, **kwargs: dict[str, Any]) -> Self:
        pass

    def get_embedder_name(self) -> str | None:
Collaborator:

Then all overrides of this method can be removed.

    def get_embedder_name(self) -> str | None:
        if hasattr(self, "embedder_name"):
            return getattr(self, "embedder_name", None)
        return None

Collaborator (Author):

For now I'd like to live with this version. I'm just worried that one of the subclasses might need to change the embedder_name attribute somehow. If it turns out not to be needed in the next releases, we'll remove it.

I'll mark this function in the base Module as experimental.

        context.vector_index_client.delete_db()

    def optimize_from_dataset(
        self, train_data: Dataset, val_data: Dataset | None = None, force_multilabel: bool = False
Collaborator:

train_dataset and test_dataset would be better names, in my view.

Collaborator (Author):

We'll probably fix this closer to the release. There are plenty of naming issues to sort out.

        self.vector_index_config = VectorIndexConfig()
        self.embedder_config = EmbedderConfig()

    def set_config(self, config: LoggingConfig | VectorIndexConfig | EmbedderConfig) -> None:
Collaborator:

You could write it like this for elegance. Just move the error message into a separate variable, otherwise ruff complains.

    def set_config(self, config: LoggingConfig | VectorIndexConfig | EmbedderConfig) -> None:
        match config:
            case LoggingConfig():
                self.logging_config = config
            case VectorIndexConfig():
                self.vector_index_config = config
            case EmbedderConfig():
                self.embedder_config = config
            case _:
                raise TypeError("unknown config type")

Collaborator (Author):

Whoa, I didn't know Python had its own switch-case...

            augmenter=augmenter,
        )

    def set_datasets(
Collaborator:

The method is called set_datasets, but the arguments are named ..._data.

Collaborator (Author):

I didn't quite get that.

        self.seed = seed
        self._logger = logging.getLogger(__name__)

    def config_logs(self, config: LoggingConfig) -> None:
Collaborator:

I would name methods like this configure_logging or setup_logging.

        cfg.embedder.max_length,
    )
    context = Context(cfg.seed)
    context.config_logs(cfg.logs)
Collaborator:

logs -> logging_config and so on, especially since that is how it is already done in the PipelineOptimizer class.


    def predict(self, utterances: list[str]) -> list[LabelType]:
        scores = self.nodes[NodeType.scoring].module.predict(utterances)
        return self.nodes[NodeType.prediction].module.predict(scores)  # type: ignore[return-value]

    def fit(self, utterances: list[str], labels: list[LabelType]) -> None:
        pass

    @classmethod
    def from_context(cls, context: Context) -> "InferencePipeline":
Collaborator:

Roma writes -> Self for these. We should agree on one convention.
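
For comparison, the Self-based spelling would look roughly like this (a sketch; the Context import path is an assumption):

    from typing import Self  # Python 3.11+; use `from typing_extensions import Self` on older versions

    from autointent.context import Context  # assumed import path


    class InferencePipeline:
        @classmethod
        def from_context(cls, context: Context) -> Self:
            # `Self` resolves to whichever class the method is called on, so subclasses
            # get the right return type and the quoted "InferencePipeline" forward
            # reference is no longer needed.
            ...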

@voorhs voorhs mentioned this pull request Nov 11, 2024
@voorhs voorhs merged commit ad097e8 into dev Nov 12, 2024
20 checks passed
@voorhs voorhs deleted the feat/pipeline-simpler-fitting branch November 12, 2024 09:11
Darinochka added a commit that referenced this pull request Nov 15, 2024
Co-authored-by: Roman Solomatin <[email protected]>
Co-authored-by: Darinka <[email protected]>