Describe the issue
I have an FP16 quantized model whose results contain errors at runtime. To test the inference calculation, I built a model containing a single QuantizeLinear operator and ran it on the CPU. When the QuantizeLinear operator runs with the FP32 data type, I get the expected results; however, when it runs with the FP16 data type, I get unexpected results.
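For context, my understanding of QuantizeLinear from the ONNX spec is y = saturate(round(x / scale) + zero_point), with ties rounded to the nearest even integer. The snippet below is only a NumPy sketch of the result I expect for scale = 1.0 and zero_point = 0; it is not ONNX Runtime's kernel, and computing the division in float32 is my own assumption:

import numpy as np

def reference_quantize_linear(x, scale=1.0, zero_point=0):
    # round half to even (np.rint), add the zero point, then saturate to the int8 range
    q = np.rint(x.astype(np.float32) / scale) + zero_point
    return np.clip(q, -128, 127).astype(np.int8)

# example: reference_quantize_linear(np.array([1.056, 1.641], dtype=np.float16)) -> [1 2]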
To reproduce
The following is the test code:
import onnx
from onnx import helper, TensorProto, numpy_helper
import numpy as np
import onnxruntime as ort


def create_quant_fp16_model():
    # create input and output info
    input_info = helper.make_tensor_value_info(
        "input", TensorProto.FLOAT16, [32]
    )
    output_info = helper.make_tensor_value_info(
        "output", TensorProto.INT8, [32]
    )
    # create scale and zero_point
    scale = helper.make_tensor(
        name="scale",
        data_type=TensorProto.FLOAT16,
        dims=[],
        vals=[1.0]  # set scale value to 1.0
    )
    zero_point = helper.make_tensor(
        name="zero_point",
        data_type=TensorProto.INT8,
        dims=[],
        vals=[0]  # set zero_point value to 0
    )
    # create QuantizeLinear node
    quant_node = helper.make_node(
        "QuantizeLinear",
        inputs=["input", "scale", "zero_point"],
        outputs=["output"],
        name="quant_node"
    )
    # create graph
    graph = helper.make_graph(
        nodes=[quant_node],
        name="quant_test_model",
        inputs=[input_info],
        outputs=[output_info],
        initializer=[scale, zero_point]
    )
    # create model
    model = helper.make_model(
        graph,
        producer_name="quant_test",
        opset_imports=[helper.make_opsetid("", 21)]
    )
    # check model
    onnx.checker.check_model(model)
    return model


def create_quant_fp32_model():
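    # Same structure as create_quant_fp16_model above; the only difference is that the
    # input tensor and the scale initializer use TensorProto.FLOAT instead of FLOAT16.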
    # create input and output info
    input_info = helper.make_tensor_value_info(
        "input", TensorProto.FLOAT, [32]
    )
    output_info = helper.make_tensor_value_info(
        "output", TensorProto.INT8, [32]
    )
    # create scale and zero_point
    scale = helper.make_tensor(
        name="scale",
        data_type=TensorProto.FLOAT,
        dims=[],
        vals=[1.0]  # set scale value to 1.0
    )
    zero_point = helper.make_tensor(
        name="zero_point",
        data_type=TensorProto.INT8,
        dims=[],
        vals=[0]  # set zero_point value to 0
    )
    # create QuantizeLinear node
    quant_node = helper.make_node(
        "QuantizeLinear",
        inputs=["input", "scale", "zero_point"],
        outputs=["output"],
        name="quant_node"
    )
    # create graph
    graph = helper.make_graph(
        nodes=[quant_node],
        name="quant_test_model",
        inputs=[input_info],
        outputs=[output_info],
        initializer=[scale, zero_point]
    )
    # create model
    model = helper.make_model(
        graph,
        producer_name="quant_test",
        opset_imports=[helper.make_opsetid("", 21)]
    )
    # check model
    onnx.checker.check_model(model)
    return model


def test_quant_model():
    # create input data
    input_data = np.array([1.056, 1.439, 1.036, 1.228, 1.641, 1.019, 1.284,
                           1.819, 1.816, 1.923, 1.054, 1.284, 1.897, 1.748,
                           1.619, 1.071, 1.298, 1.727, 1.018, 1.094, 1.791,
                           1.963, 1.293, 1.987, 1.771, 1.663, 1.791, 1.435,
                           1.532, 1.441, 1.38, 1.306], dtype=np.float16)
    print("\ninput_data:")
    print(input_data)
    # create FP16 model
    model_fp16 = create_quant_fp16_model()
    # run inference with onnxruntime on cpu
    session = ort.InferenceSession(model_fp16.SerializeToString(), providers=['CPUExecutionProvider'])
    input_name = session.get_inputs()[0].name
    output_name = session.get_outputs()[0].name
    output = session.run(
        [output_name],
        {input_name: input_data}
    )[0]
    print("-------FP16 model-------")
    print("output_data:")
    print(output, output.dtype)
    # create FP32 model
    input_data = input_data.astype(np.float32)
    model_fp32 = create_quant_fp32_model()
    # run inference with onnxruntime
    session = ort.InferenceSession(model_fp32.SerializeToString(), providers=['CPUExecutionProvider'])
    input_name = session.get_inputs()[0].name
    output_name = session.get_outputs()[0].name
    output = session.run(
        [output_name],
        {input_name: input_data}
    )[0]
    print("-------FP32 model-------")
    print("output_data:")
    print(output, output.dtype)


if __name__ == "__main__":
    # test
    test_quant_model()
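If it helps, here is a small helper that could be appended to the script to make the mismatch explicit; the names output_fp16 and output_fp32 are illustrative (the script above reuses output for both runs):

import numpy as np

def compare_outputs(output_fp16, output_fp32):
    # print the element-wise differences between the two int8 results
    mismatch = np.flatnonzero(output_fp16 != output_fp32)
    for i in mismatch:
        print(f"index {i}: fp16 model -> {output_fp16[i]}, fp32 model -> {output_fp32[i]}")
    print(f"{mismatch.size} of {output_fp16.size} elements differ")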
Urgency
No response
Platform
Windows
OS Version
Win10
ONNX Runtime Installation
Released Package
ONNX Runtime Version or Commit ID
1.19.2
ONNX Runtime API
Python
Architecture
X64
Execution Provider
Default CPU
Execution Provider Library Version
No response
Model File
No response
Is this a quantized model?
Yes