Describe the issue
I have an FP16 quantized model whose results contain errors at runtime. To test the inference calculation, I built a model containing a single QuantizeLinear operator and ran it on the CPU. When the QuantizeLinear operator runs with the FP32 data type, I get the expected results; however, when it runs with the FP16 data type, I get unexpected results.
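For context, my understanding of QuantizeLinear from the ONNX spec is y = saturate(round(x / scale) + zero_point), with ties rounded to the nearest even integer. The snippet below is only a NumPy sketch of the result I expect for scale = 1.0 and zero_point = 0; it is not ONNX Runtime's kernel, and computing the division in float32 is my own assumption:

import numpy as np

def reference_quantize_linear(x, scale=1.0, zero_point=0):
    # round half to even (np.rint), add the zero point, then saturate to the int8 range
    q = np.rint(x.astype(np.float32) / scale) + zero_point
    return np.clip(q, -128, 127).astype(np.int8)

# example: reference_quantize_linear(np.array([1.056, 1.641], dtype=np.float16)) -> [1 2]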
To reproduce
The following is the test code:
import onnx
from onnx import helper, TensorProto, numpy_helper
import numpy as np
import onnxruntime as ort


def create_quant_fp16_model():
    # create input and output info
    input_info = helper.make_tensor_value_info(
        "input", TensorProto.FLOAT16, [32]
    )
    output_info = helper.make_tensor_value_info(
        "output", TensorProto.INT8, [32]
    )
    # create scale and zero_point
    scale = helper.make_tensor(
        name="scale",
        data_type=TensorProto.FLOAT16,
        dims=[],
        vals=[1.0]  # set scale value to 1.0
    )
    zero_point = helper.make_tensor(
        name="zero_point",
        data_type=TensorProto.INT8,
        dims=[],
        vals=[0]  # set zero_point value to 0
    )
    # create QuantizeLinear node
    quant_node = helper.make_node(
        "QuantizeLinear",
        inputs=["input", "scale", "zero_point"],
        outputs=["output"],
        name="quant_node"
    )
    # create graph
    graph = helper.make_graph(
        nodes=[quant_node],
        name="quant_test_model",
        inputs=[input_info],
        outputs=[output_info],
        initializer=[scale, zero_point]
    )
    # create model
    model = helper.make_model(
        graph,
        producer_name="quant_test",
        opset_imports=[helper.make_opsetid("", 21)]
    )
    # check model
    onnx.checker.check_model(model)
    return model


def create_quant_fp32_model():
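    # Same structure as create_quant_fp16_model above; the only difference is that the
    # input tensor and the scale initializer use TensorProto.FLOAT instead of FLOAT16.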
    # create input and output info
    input_info = helper.make_tensor_value_info(
        "input", TensorProto.FLOAT, [32]
    )
    output_info = helper.make_tensor_value_info(
        "output", TensorProto.INT8, [32]
    )
    # create scale and zero_point
    scale = helper.make_tensor(
        name="scale",
        data_type=TensorProto.FLOAT,
        dims=[],
        vals=[1.0]  # set scale value to 1.0
    )
    zero_point = helper.make_tensor(
        name="zero_point",
        data_type=TensorProto.INT8,
        dims=[],
        vals=[0]  # set zero_point value to 0
    )
    # create QuantizeLinear node
    quant_node = helper.make_node(
        "QuantizeLinear",
        inputs=["input", "scale", "zero_point"],
        outputs=["output"],
        name="quant_node"
    )
    # create graph
    graph = helper.make_graph(
        nodes=[quant_node],
        name="quant_test_model",
        inputs=[input_info],
        outputs=[output_info],
        initializer=[scale, zero_point]
    )
    # create model
    model = helper.make_model(
        graph,
        producer_name="quant_test",
        opset_imports=[helper.make_opsetid("", 21)]
    )
    # check model
    onnx.checker.check_model(model)
    return model


def test_quant_model():
    # create input data
    input_data = np.array([1.056, 1.439, 1.036, 1.228, 1.641, 1.019, 1.284,
                           1.819, 1.816, 1.923, 1.054, 1.284, 1.897, 1.748,
                           1.619, 1.071, 1.298, 1.727, 1.018, 1.094, 1.791,
                           1.963, 1.293, 1.987, 1.771, 1.663, 1.791, 1.435,
                           1.532, 1.441, 1.38, 1.306], dtype=np.float16)
    print("\ninput_data:")
    print(input_data)
    # create FP16 model
    model_fp16 = create_quant_fp16_model()
    # run inference with onnxruntime on cpu
    session = ort.InferenceSession(model_fp16.SerializeToString(), providers=['CPUExecutionProvider'])
    input_name = session.get_inputs()[0].name
    output_name = session.get_outputs()[0].name
    output = session.run(
        [output_name],
        {input_name: input_data}
    )[0]
    print("-------FP16 model-------")
    print("output_data:")
    print(output, output.dtype)
    # create FP32 model
    input_data = input_data.astype(np.float32)
    model_fp32 = create_quant_fp32_model()
    # run inference with onnxruntime
    session = ort.InferenceSession(model_fp32.SerializeToString(), providers=['CPUExecutionProvider'])
    input_name = session.get_inputs()[0].name
    output_name = session.get_outputs()[0].name
    output = session.run(
        [output_name],
        {input_name: input_data}
    )[0]
    print("-------FP32 model-------")
    print("output_data:")
    print(output, output.dtype)


if __name__ == "__main__":
    # test
    test_quant_model()
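If it helps, here is a small helper that could be appended to the script to make the mismatch explicit; the names output_fp16 and output_fp32 are illustrative (the script above reuses output for both runs):

import numpy as np

def compare_outputs(output_fp16, output_fp32):
    # print the element-wise differences between the two int8 results
    mismatch = np.flatnonzero(output_fp16 != output_fp32)
    for i in mismatch:
        print(f"index {i}: fp16 model -> {output_fp16[i]}, fp32 model -> {output_fp32[i]}")
    print(f"{mismatch.size} of {output_fp16.size} elements differ")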
Urgency
No response
Platform
Windows
OS Version
Win10
ONNX Runtime Installation
Released Package
ONNX Runtime Version or Commit ID
1.19.2
ONNX Runtime API
Python
Architecture
X64
Execution Provider
Default CPU
Execution Provider Library Version
No response
Model File
No response
Is this a quantized model?
Yes