OrdinalEncoder handle encoded_missing_value and unknown_value #1132
In the current version of `sklearn-onnx`, the `encoded_missing_value` and `unknown_value` parameters of scikit-learn's `OrdinalEncoder` are not properly handled: both are ignored during ONNX model conversion.

For example, if we create an `OrdinalEncoder` with `encoded_missing_value` set to `42` and fit it on the data `np.array([["a"], ["b"], ["c"], ["d"], [np.nan]], dtype=np.object_)`, scikit-learn produces the expected output `[0, 1, 2, 3, 42]`. The converted ONNX model, however, does not respect `encoded_missing_value` and returns an unexpected result: `[0, 1, 2, 3, 4]`.

Similarly, the `unknown_value` parameter is ignored during conversion, which changes the expected output. To address this, the `default_int64` attribute of the ONNX `LabelEncoder` needs to be set when an `unknown_value` is specified.

I have included tests that demonstrate this behavior and implemented a simple fix to resolve the issue.
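
The intended mapping can be sketched in plain Python, without scikit-learn or ONNX. This is only an illustration of the semantics described above, not the converter code from this PR: the helper names `is_missing` and `encode` are hypothetical, and `-1` is an arbitrary value chosen here for `unknown_value`. The dictionary lookup with a fallback plays the role that `default_int64` plays in the ONNX `LabelEncoder`.

```python
import math


def is_missing(value):
    """Return True for a float NaN, the missing-value marker in the example."""
    return isinstance(value, float) and math.isnan(value)


def encode(values, categories, encoded_missing_value, unknown_value):
    """Hypothetical helper: ordinal-encode `values` against known `categories`.

    - known categories map to their position in `categories`
    - NaN maps to `encoded_missing_value`
    - anything else maps to `unknown_value` (the fallback, like default_int64)
    """
    table = {category: index for index, category in enumerate(categories)}
    encoded = []
    for value in values:
        if is_missing(value):
            encoded.append(encoded_missing_value)
        else:
            encoded.append(table.get(value, unknown_value))
    return encoded


categories = ["a", "b", "c", "d"]

# Missing value is encoded as 42 rather than as the next ordinal index (4).
print(encode(["a", "b", "c", "d", float("nan")], categories, 42, -1))
# [0, 1, 2, 3, 42]

# An unseen category falls back to the assumed unknown_value of -1.
print(encode(["a", "z"], categories, 42, -1))
# [0, -1]
```

The first call reproduces the `[0, 1, 2, 3, 42]` result that scikit-learn gives in the example, which is the output the converted model should match.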