Fix: QlibDataLoader drops the cols added by inst_processor #1430

Open
wants to merge 10 commits into
base: main
5 changes: 4 additions & 1 deletion qlib/data/data.py
@@ -587,7 +587,10 @@ def dataset_processor(instruments_d, column_names, start_time, end_time, freq, i

 if len(new_data) > 0:
     data = pd.concat(new_data, names=["instrument"], sort=False)
-    data = DiskDatasetCache.cache_to_origin_data(data, column_names)
+
+    # NOTE: InstProcessors may add new columns and using cache_to_origin_data will remove those added columns.
+    if not len(inst_processors):
+        data = DiskDatasetCache.cache_to_origin_data(data, column_names)
Review comment (Contributor):

Perhaps it would be better to modify cache_to_origin_data so that it keeps the extra columns, rather than checking len(inst_processors) at the call site?

    def cache_to_origin_data(data, fields):
        """cache data to origin data

        :param data: pd.DataFrame, cache data.
        :param fields: feature fields.
        :return: pd.DataFrame.
        """
        not_space_fields = remove_fields_space(fields)
        data_selected = data.loc[:, not_space_fields]
        # restore the original field names (which may contain spaces)
        data_selected.columns = [str(i) for i in fields]

        # keep any columns that are not requested fields (e.g. added by inst_processors)
        _fields = [col for col in data.columns if col not in not_space_fields]
        _data_selected = data.loc[:, _fields]
        data = pd.concat([data_selected, _data_selected], axis=1)
        return data
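
The suggestion above can be tried outside qlib with a small, self-contained sketch. It is only an illustration: `strip_spaces` is a hypothetical stand-in for `remove_fields_space` (assumed here to simply drop spaces from each field string), `cache_to_origin_keep_extra` is an illustrative name, and the sample frame and the `extra_col` column are made up.

```python
import pandas as pd


def strip_spaces(fields):
    # Hypothetical stand-in for qlib's remove_fields_space:
    # assumed to remove spaces from every field expression.
    return [str(f).replace(" ", "") for f in fields]


def cache_to_origin_keep_extra(data, fields):
    # Select the requested fields by their space-free (cache) names.
    no_space = strip_spaces(fields)
    selected = data.loc[:, no_space].copy()
    # Restore the original field names, spaces included.
    selected.columns = [str(f) for f in fields]
    # Keep every other column, e.g. one added by an instrument processor.
    extra = [col for col in data.columns if col not in no_space]
    return pd.concat([selected, data.loc[:, extra]], axis=1)


index = pd.MultiIndex.from_tuples(
    [
        ("SH600000", pd.Timestamp("2020-01-02")),
        ("SH600000", pd.Timestamp("2020-01-03")),
    ],
    names=("instrument", "datetime"),
)
# "extra_col" plays the role of a column added by an instrument processor.
cached = pd.DataFrame({"$close/$open": [1.01, 0.99], "extra_col": [1, 0]}, index=index)

result = cache_to_origin_keep_extra(cached, ["$close / $open"])
print(list(result.columns))
```

With this shape the call site would not need to know whether any processors ran, since columns outside the requested fields pass through unchanged.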

 else:
     data = pd.DataFrame(
         index=pd.MultiIndex.from_arrays([[], []], names=("instrument", "datetime")),
15 changes: 13 additions & 2 deletions qlib/data/dataset/loader.py
@@ -10,7 +10,13 @@
 from typing import Tuple, Union, List

 from qlib.data import D
-from qlib.utils import load_dataset, init_instance_by_config, time_to_slc_point
+from qlib.utils import (
+    load_dataset,
+    init_instance_by_config,
+    time_to_slc_point,
+    remove_fields_space,
+    remove_repeat_field,
+)
 from qlib.log import get_module_logger
 from qlib.utils.serial import Serializable

@@ -215,7 +221,12 @@ def load_group_df(
     self.inst_processors if isinstance(self.inst_processors, list) else self.inst_processors.get(gp_name, [])
 )
 df = D.features(instruments, exprs, start_time, end_time, freq=freq, inst_processors=inst_processors)
-df.columns = names
+# NOTE: InstProcessors may add new columns
+if len(inst_processors):
+    df.rename(columns=dict(zip(remove_repeat_field(remove_fields_space(exprs)), names)), inplace=True)
+else:
+    df.columns = names
+
 if self.swap_level:
     df = df.swaplevel().sort_index()  # NOTE: if swaplevel, return <datetime, instrument>
 return df
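
Why the loader change switches from positional assignment to `rename` can be seen in a plain pandas sketch. This is an illustration under assumptions, not the loader itself: the expressions here already contain no spaces or duplicates (so the `remove_fields_space` and `remove_repeat_field` steps are skipped), and the `added` column is a made-up example of something an instrument processor might append.

```python
import pandas as pd

exprs = ["$close", "$open"]  # expressions requested from the data layer
names = ["CLOSE", "OPEN"]    # user-facing column names, same order as exprs

# Two requested columns plus one extra column (hypothetical processor output).
df = pd.DataFrame({"$close": [1.0], "$open": [2.0], "added": [3.0]})

# Positional assignment needs exactly one name per column, so it fails here.
tmp = df.copy()
try:
    tmp.columns = names
    positional_ok = True
except ValueError:
    positional_ok = False

# Mapping-based rename touches only the requested columns and keeps the rest.
renamed = df.rename(columns=dict(zip(exprs, names)))

print(positional_ok, list(renamed.columns))
```

Because `rename` matches by label, the extra column keeps its own name; keys in the mapping that do not match any column are ignored by default.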