Running VIT on android #19093

arseniymerkulov · 2024-01-11T14:55:32Z

arseniymerkulov
Jan 11, 2024

Hello! I'm trying to run vision transformer with onnxruntime on an android device. In Python, the model was finetuned to classify 14 classes and converted to onnx format. After that, preprocessing steps were added to the model (resize, conversion to float, normalization). Next, the model was tested on 10k images with an accuracy of 98%. Model file in onnx format attached:
vit_tiny_14_with_preprocessing.zip

On current step all external preprocessing you need in python to correctly run the model is like this:

from torchvision import transforms
from PIL import Image
import onnxruntime as ort

image = Image.open(image_path)
image = transforms.PILToTensor()(image)

ort_sess = ort.InferenceSession(model_path)
outputs = session.run(None, {'image_tensor': image.numpy()})
label = np.argmax(outputs[0][0])

Then i tried to convert this code to kotlin. My kotlin code is below:

val inputStream: InputStream = ... // get inputSteam for image
val bitmap = BitmapFactory.decodeStream(inputStream)
val width = bitmap.width
val height = bitmap.height

val buffer = ByteBuffer.allocate(width * height * 3)
for (y in 0 until height) {
    for (x in 0 until width) {
        val pixel = bitmap.getPixel(x, y)
        
        buffer.put(((pixel shr 16) and 0xff).toByte())
        buffer.put(((pixel shr 8) and 0xff).toByte())
        buffer.put((pixel and 0xff).toByte()) 
    }
}
buffer.flip()

val inputTensor = OnnxTensor.createTensor(
    ortEnv,
    buffer,
    longArrayOf(3, height.toLong(), width.toLong()),
    OnnxJavaType.UINT8
)

inputTensor.use {
    val output = ortSession.run(Collections.singletonMap("image_tensor", inputTensor), setOf("logits"))
    val scores = (output?.get(0)?.value as Array<FloatArray>)[0]
    val label = scores.withIndex().maxByOrNull { it.value }?.index ?: -1
}

In kotlin, model classify every input image as single "trash" class (1 of 14). Similar result can be achieved by feeding trash images to the model. So, probably i preprocess images wrong. Can you guide me towards what step is missing/incorrect in kotlin preprocessing?

Answered by Craigacp

Jan 11, 2024

Isn't the kotlin code emitting channels last, when the model expects channels first?

View full answer

Craigacp · 2024-01-11T18:01:26Z

Craigacp
Jan 11, 2024

Isn't the kotlin code emitting channels last, when the model expects channels first?

1 reply

arseniymerkulov Jan 12, 2024
Author

It does, thank you very much

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Running VIT on android #19093

{{title}}

Replies: 1 comment 1 reply

{{title}}

{{title}}

Select a reply

Running VIT on android #19093

arseniymerkulov Jan 11, 2024

Replies: 1 comment · 1 reply

Craigacp Jan 11, 2024

arseniymerkulov Jan 12, 2024 Author

arseniymerkulov
Jan 11, 2024

Replies: 1 comment 1 reply

Craigacp
Jan 11, 2024

arseniymerkulov Jan 12, 2024
Author