Running the FP16 QuantizeLinear operator produces incorrect output #22741

Open
MercuryHC opened this issue Nov 6, 2024 · 0 comments
Labels: performance (issues related to performance regressions), quantization (issues related to quantization)

Describe the issue

I have an FP16-quantized model whose results contain errors at runtime. To isolate the inference calculation, I created a model containing a single QuantizeLinear operator running on the CPU. When the operator runs with the FP32 data type I get the expected results, but when it runs with the FP16 data type the results are incorrect.
[Image: screenshot of the observed outputs]
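
Per the ONNX specification, QuantizeLinear computes y = saturate(round(x / y_scale) + y_zero_point), rounding halfway cases to the nearest even integer. A minimal NumPy sketch of that formula (quantize_reference is a hypothetical helper, not part of the original report) gives the expected output for any input:

import numpy as np

def quantize_reference(x, scale=1.0, zero_point=0):
    # np.rint rounds halfway cases to the nearest even integer,
    # matching the rounding behavior the QuantizeLinear spec requires
    y = np.rint(x.astype(np.float32) / scale) + zero_point
    return np.clip(y, -128, 127).astype(np.int8)  # saturate to the int8 range

With scale = 1.0 and zero_point = 0, each input should simply round to the nearest integer, e.g. 1.056 -> 1 and 1.641 -> 2.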

To reproduce

The following test code reproduces the issue.

import onnx
from onnx import helper, TensorProto
import numpy as np
import onnxruntime as ort

def create_quant_model(float_type):
    # build a single-node QuantizeLinear model; float_type is the element
    # type of the input and scale (TensorProto.FLOAT16 or TensorProto.FLOAT)

    # create input and output info
    input_info = helper.make_tensor_value_info(
        "input", float_type, [32]
    )
    output_info = helper.make_tensor_value_info(
        "output", TensorProto.INT8, [32]
    )

    # create scale and zero_point initializers
    scale = helper.make_tensor(
        name="scale",
        data_type=float_type,
        dims=[],
        vals=[1.0]  # set scale value to 1.0
    )

    zero_point = helper.make_tensor(
        name="zero_point",
        data_type=TensorProto.INT8,
        dims=[],
        vals=[0]  # set zero_point value to 0
    )

    # create QuantizeLinear node
    quant_node = helper.make_node(
        "QuantizeLinear",
        inputs=["input", "scale", "zero_point"],
        outputs=["output"],
        name="quant_node"
    )

    # create graph
    graph = helper.make_graph(
        nodes=[quant_node],
        name="quant_test_model",
        inputs=[input_info],
        outputs=[output_info],
        initializer=[scale, zero_point]
    )

    # create model
    model = helper.make_model(
        graph,
        producer_name="quant_test",
        opset_imports=[helper.make_opsetid("", 21)]
    )

    # check model
    onnx.checker.check_model(model)

    return model
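
To confirm the graph is assembled as intended, it can be printed before running it (a quick sanity check, not part of the repro itself):

print(onnx.helper.printable_graph(create_quant_model(TensorProto.FLOAT16).graph))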


def test_quant_model():
    # create input data
    input_data = np.array([1.056, 1.439, 1.036, 1.228, 1.641, 1.019, 1.284,
                           1.819, 1.816, 1.923, 1.054, 1.284, 1.897, 1.748,
                           1.619, 1.071, 1.298, 1.727, 1.018, 1.094, 1.791,
                           1.963, 1.293, 1.987, 1.771, 1.663, 1.791, 1.435,
                           1.532, 1.441, 1.38,  1.306], dtype=np.float16)
    print("\ninput_data:")
    print(input_data)
    
    # create FP16 model
    model_fp16 = create_quant_model(TensorProto.FLOAT16)
    # run inference with onnxruntime on cpu  
    session = ort.InferenceSession(model_fp16.SerializeToString(), providers=['CPUExecutionProvider'])  
    input_name = session.get_inputs()[0].name  
    output_name = session.get_outputs()[0].name  
    
    output = session.run(  
        [output_name],   
        {input_name: input_data}  
    )[0]  
    
    print("-------FP16 model-------")
    print("output_data:")  
    print(output, output.dtype)
    
    # create FP32 model
    input_data = input_data.astype(np.float32)
    model_fp32 = create_quant_model(TensorProto.FLOAT)
    # run inference with onnxruntime  
    session = ort.InferenceSession(model_fp32.SerializeToString(), providers=['CPUExecutionProvider'])  
    input_name = session.get_inputs()[0].name  
    output_name = session.get_outputs()[0].name  
    
    output = session.run(  
        [output_name],   
        {input_name: input_data}  
    )[0]  
    
    print("-------FP32 model-------")
    print("output_data:")  
    print(output, output.dtype)  

if __name__ == "__main__":  
    # test
    test_quant_model()
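
Since none of the 32 inputs above falls near a .5 rounding boundary (even after FP16 rounding error, which is below 0.001 for values between 1 and 2), the FP16 and FP32 models should produce identical outputs: 1 for every value below 1.5 and 2 for every value above it. Assuming the quantize_reference sketch from above, the discrepancy could be asserted at the end of test_quant_model like so:

    expected = quantize_reference(input_data)
    np.testing.assert_array_equal(output, expected)  # expected to fail on the FP16 path if the bug reproduces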

Urgency

No response

Platform

Windows

OS Version

Win10

ONNX Runtime Installation

Released Package

ONNX Runtime Version or Commit ID

1.19.2

ONNX Runtime API

Python

Architecture

X64

Execution Provider

Default CPU

Execution Provider Library Version

No response

Model File

No response

Is this a quantized model?

Yes

@MercuryHC added the performance label Nov 6, 2024
@github-actions bot added the quantization label Nov 6, 2024