Feat/pipeline simpler fitting #36
Merged
Changes from all commits (30 commits):
- f3d7777 (voorhs): stage result
- b192dc8 (voorhs): decompose `Context.__init__()` and implement `get_` methods
- 0f6f568 (voorhs): fix tests
- a118e27 (voorhs): fix typing
- 0e3fe2a (voorhs): add `Context.set_datasets` and allow not dumping modules
- c4e15d9 (voorhs): implement `PipelineOptimizer.fit_from_dataset`
- 109d47e (voorhs): enable configuration for python api
- 2b0d371 (voorhs): fix typing
- d7c4066 (voorhs): fix tests
- d305bb5 (voorhs): add `clear_ram` option
- d648849 (voorhs): inferring modules from RAM after optimization
- e7d0fbd (voorhs): minor change
- 975c8df (voorhs): fix unintended `runs` directory creation
- 378e582 (voorhs): add `save_db` option
- 2b80888 (Samoed): Merge branch 'refs/heads/dev' into feat/pipeline-simpler-fitting
- 8c2eaff (Samoed): fix circular imports
- 322340b (voorhs): fix tests
- c487363 (Darinochka): Test/pipeline simpler fitting (#39)
- a2e4dea (voorhs): refactor github actions
- c349f18 (voorhs): rename actions
- b26a878 (voorhs): fix `model_name` issue
- 7ccbca2 (voorhs): response to review
- db7f207 (voorhs): attempt to fix `winerror access denied` problem
- 11bb883 (voorhs): try to fix `unexpected argument` error
- 616ba32 (voorhs): minor bug fix
- 11bdb6f (voorhs): another attempt to fix permission error
- 9545bbf (voorhs): stupid bug fix
- 1096e04 (voorhs): refactor cache cleaning
- d6884d4 (voorhs): another attempt (workaround: ignore permission error)
- 9e55a65 (voorhs): change return type of classmethods-constructors to `Self`
New workflow file: test inference
@@ -0,0 +1,40 @@
name: test inference

on:
  push:
    branches:
      - dev
  pull_request:
    branches:
      - dev

jobs:
  test:
    runs-on: ${{ matrix.os }}
    strategy:
      fail-fast: false
      matrix:
        os: [ ubuntu-latest ]
        python-version: [ "3.10", "3.11", "3.12" ]
        include:
          - os: windows-latest
            python-version: "3.10"

    steps:
      - name: Checkout code
        uses: actions/checkout@v4

      - name: Setup Python ${{ matrix.python-version }}
        uses: actions/setup-python@v5
        with:
          python-version: ${{ matrix.python-version }}
          cache: "pip"

      - name: Install dependencies
        run: |
          pip install .
          pip install pytest pytest-asyncio
      - name: Run tests
        run: |
          pytest tests/pipeline/test_inference.py
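For reference, the matrix above (reused verbatim by the two workflows that follow) expands to four jobs: three Python versions on ubuntu-latest plus one `include` entry for windows-latest. A small sketch, with hypothetical variable names, mirroring that expansion:

```python
from itertools import product

# Cross product of the declared axes: 1 OS x 3 Python versions = 3 jobs.
base_jobs = [
    {"os": runner, "python-version": py}
    for runner, py in product(["ubuntu-latest"], ["3.10", "3.11", "3.12"])
]
# `include` appends one extra combination instead of widening the cross product.
jobs = [*base_jobs, {"os": "windows-latest", "python-version": "3.10"}]
assert len(jobs) == 4
```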
New workflow file: test optimization
@@ -0,0 +1,40 @@
name: test optimization

on:
  push:
    branches:
      - dev
  pull_request:
    branches:
      - dev

jobs:
  test:
    runs-on: ${{ matrix.os }}
    strategy:
      fail-fast: false
      matrix:
        os: [ ubuntu-latest ]
        python-version: [ "3.10", "3.11", "3.12" ]
        include:
          - os: windows-latest
            python-version: "3.10"

    steps:
      - name: Checkout code
        uses: actions/checkout@v4

      - name: Setup Python ${{ matrix.python-version }}
        uses: actions/setup-python@v5
        with:
          python-version: ${{ matrix.python-version }}
          cache: "pip"

      - name: Install dependencies
        run: |
          pip install .
          pip install pytest pytest-asyncio
      - name: Run tests
        run: |
          pytest tests/pipeline/test_optimization.py
New workflow file: unit tests
@@ -0,0 +1,40 @@
name: unit tests

on:
  push:
    branches:
      - dev
  pull_request:
    branches:
      - dev

jobs:
  test:
    runs-on: ${{ matrix.os }}
    strategy:
      fail-fast: false
      matrix:
        os: [ ubuntu-latest ]
        python-version: [ "3.10", "3.11", "3.12" ]
        include:
          - os: windows-latest
            python-version: "3.10"

    steps:
      - name: Checkout code
        uses: actions/checkout@v4

      - name: Setup Python ${{ matrix.python-version }}
        uses: actions/setup-python@v5
        with:
          python-version: ${{ matrix.python-version }}
          cache: "pip"

      - name: Install dependencies
        run: |
          pip install .
          pip install pytest pytest-asyncio
      - name: Run tests
        run: |
          pytest --ignore=tests/nodes --ignore=tests/pipeline
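Taken together, the three workflows split the suite: pipeline inference tests, pipeline optimization tests, and everything outside `tests/nodes` and `tests/pipeline`. A sketch of reproducing the same split locally through pytest's Python entry point; the `targets` mapping is illustrative, only the argument lists come from the workflows:

```python
import sys

import pytest

# Mirror the CI split locally: pick one of the three invocations used in the workflows above.
targets = {
    "inference": ["tests/pipeline/test_inference.py"],
    "optimization": ["tests/pipeline/test_optimization.py"],
    "unit": ["--ignore=tests/nodes", "--ignore=tests/pipeline"],
}
sys.exit(pytest.main(targets["unit"]))
```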
Changed file: the `Context` module (the diff below interleaves removed and added lines; +/- markers were not preserved in this view)
@@ -1,42 +1,79 @@
import json
import logging
from dataclasses import asdict
from pathlib import Path
from typing import Any

import yaml

from autointent.configs.optimization_cli import (
    AugmentationConfig,
    DataConfig,
    EmbedderConfig,
    LoggingConfig,
    VectorIndexConfig,
)

from .data_handler import DataAugmenter, DataHandler, Dataset
from .optimization_info import OptimizationInfo
from .utils import NumpyEncoder, load_data
from .vector_index_client import VectorIndex, VectorIndexClient
class Context:
    def __init__(  # noqa: PLR0913
    data_handler: DataHandler
    vector_index_client: VectorIndexClient
    optimization_info: OptimizationInfo

    def __init__(
        self,
        dataset: Dataset,
        test_dataset: Dataset | None = None,
        device: str = "cpu",
        multilabel_generation_config: str | None = None,
        regex_sampling: int = 0,
        seed: int = 42,
        db_dir: str | Path | None = None,
        dump_dir: str | Path | None = None,
        force_multilabel: bool = False,
        embedder_batch_size: int = 32,
        embedder_max_length: int | None = None,
    ) -> None:
        augmenter = DataAugmenter(multilabel_generation_config, regex_sampling, seed)
        self.seed = seed
        self._logger = logging.getLogger(__name__)

    def configure_logging(self, config: LoggingConfig) -> None:
        self.logging_config = config
        self.optimization_info = OptimizationInfo()

    def configure_vector_index(self, config: VectorIndexConfig, embedder_config: EmbedderConfig | None = None) -> None:
        self.vector_index_config = config
        if embedder_config is None:
            embedder_config = EmbedderConfig()
        self.embedder_config = embedder_config

        self.vector_index_client = VectorIndexClient(
            self.vector_index_config.device,
            self.vector_index_config.db_dir,
            self.embedder_config.batch_size,
            self.embedder_config.max_length,
        )

    def configure_data(self, config: DataConfig, augmentation_config: AugmentationConfig | None = None) -> None:
        if augmentation_config is not None:
            self.augmentation_config = AugmentationConfig()
            augmenter = DataAugmenter(
                self.augmentation_config.multilabel_generation_config,
                self.augmentation_config.regex_sampling,
                self.seed,
            )
        else:
            augmenter = None

        self.data_handler = DataHandler(
            dataset, test_dataset, random_seed=seed, force_multilabel=force_multilabel, augmenter=augmenter
            dataset=load_data(config.train_path),
            test_dataset=None if config.test_path is None else load_data(config.test_path),
            random_seed=self.seed,
            force_multilabel=config.force_multilabel,
            augmenter=augmenter,
        )
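For orientation, a minimal sketch of the decomposed configuration flow introduced here, assuming an already constructed `ctx: Context` and default-constructible config objects; the concrete values are hypothetical, only the method names and the `train_path` field come from the diff:

```python
from autointent.configs.optimization_cli import DataConfig, LoggingConfig, VectorIndexConfig

# ctx: Context is assumed to exist already; this sketch only shows the new configure_* calls.
ctx.configure_logging(LoggingConfig())           # stores the logging config and resets OptimizationInfo
ctx.configure_vector_index(VectorIndexConfig())  # builds the VectorIndexClient; EmbedderConfig() is used by default
ctx.configure_data(DataConfig(train_path="data/train.json"))  # hypothetical path; loaded via load_data()
```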
    def set_datasets(
        self, train_data: Dataset, val_data: Dataset | None = None, force_multilabel: bool = False
    ) -> None:
        self.data_handler = DataHandler(
            dataset=train_data, test_dataset=val_data, random_seed=self.seed, force_multilabel=force_multilabel
        )
        self.optimization_info = OptimizationInfo()
        self.vector_index_client = VectorIndexClient(device, db_dir, embedder_batch_size, embedder_max_length)

        self.db_dir = self.vector_index_client.db_dir
        self.embedder_max_length = embedder_max_length
        self.embedder_batch_size = embedder_batch_size
        self.device = device
        self.multilabel = self.data_handler.multilabel
        self.n_classes = self.data_handler.n_classes
        self.seed = seed
        self.dump_dir = Path.cwd() / "modules_dumps" if dump_dir is None else Path(dump_dir)

    def get_best_index(self) -> VectorIndex:
        model_name = self.optimization_info.get_best_embedder()
@@ -48,10 +85,79 @@ def get_inference_config(self) -> dict[str, Any]:
        cfg.pop("_target_")
        return {
            "metadata": {
                "device": self.device,
                "multilabel": self.multilabel,
                "n_classes": self.n_classes,
                "device": self.get_device(),
                "multilabel": self.is_multilabel(),
                "n_classes": self.get_n_classes(),
                "seed": self.seed,
            },
            "nodes_configs": nodes_configs,
        }
    def dump(self) -> None:
        self._logger.debug("dumping logs...")
        optimization_results = self.optimization_info.dump_evaluation_results()

        logs_dir = self.logging_config.dirpath
        if logs_dir is None:
            msg = "something's wrong with LoggingConfig"
            raise ValueError(msg)

        # create appropriate directory
        logs_dir.mkdir(parents=True, exist_ok=True)

        # dump search space and evaluation results
        logs_path = logs_dir / "logs.json"
        with logs_path.open("w") as file:
            json.dump(optimization_results, file, indent=4, ensure_ascii=False, cls=NumpyEncoder)
        # config_path = logs_dir / "config.yaml"
        # with config_path.open("w") as file:
        #     yaml.dump(self.config, file)

        # self._logger.info(make_report(optimization_results, nodes=nodes))

        # dump train and test data splits
        train_data, test_data = self.data_handler.dump()
        train_path = logs_dir / "train_data.json"
        test_path = logs_dir / "test_data.json"
        with train_path.open("w") as file:
            json.dump(train_data, file, indent=4, ensure_ascii=False)
        with test_path.open("w") as file:
            json.dump(test_data, file, indent=4, ensure_ascii=False)

        self._logger.info("logs and other assets are saved to %s", logs_dir)

        # dump optimization results (config for inference)
        inference_config = self.get_inference_config()
        inference_config_path = logs_dir / "inference_config.yaml"
        with inference_config_path.open("w") as file:
            yaml.dump(inference_config, file)
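`dump()` writes `logs.json`, `train_data.json`, `test_data.json`, and `inference_config.yaml` into the logging directory. A short sketch of reading those artifacts back; the directory path is hypothetical:

```python
import json
from pathlib import Path

import yaml

logs_dir = Path("runs/example_run")  # hypothetical: whatever LoggingConfig.dirpath points to
results = json.loads((logs_dir / "logs.json").read_text())
inference_config = yaml.safe_load((logs_dir / "inference_config.yaml").read_text())
```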
    def get_db_dir(self) -> Path:
        return self.vector_index_client.db_dir

    def get_device(self) -> str:
        return self.vector_index_client.device

    def get_batch_size(self) -> int:
        return self.vector_index_client.embedder_batch_size

    def get_max_length(self) -> int | None:
        return self.vector_index_client.embedder_max_length

    def get_dump_dir(self) -> Path | None:
        if self.logging_config.dump_modules:
            return self.logging_config.dump_dir
        return None

Review comment (on `get_dump_dir`): "Make the `get_...` methods plain properties." Reply: "+1"

    def is_multilabel(self) -> bool:
        return self.data_handler.multilabel

    def get_n_classes(self) -> int:
        return self.data_handler.n_classes

    def is_ram_to_clear(self) -> bool:
        return self.logging_config.clear_ram

    def has_saved_modules(self) -> bool:
        node_types = ["regexp", "retrieval", "scoring", "prediction"]
        return any(len(self.optimization_info.modules.get(nt)) > 0 for nt in node_types)
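A sketch of what the reviewer's property suggestion could look like (a hypothetical refactor, not part of this PR): the `get_*` bodies stay the same but are exposed as read-only attributes.

```python
class Context:
    ...

    @property
    def device(self) -> str:
        # Same body as get_device(), accessed as ctx.device instead of ctx.get_device().
        return self.vector_index_client.device

    @property
    def n_classes(self) -> int:
        return self.data_handler.n_classes
```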
Review comment (on `set_datasets`): "The method is `set_datasets`, but the inputs are `..._data`."
Reply: "I don't quite follow."