Output statistics about word confidence values

Philipp Zumstein edited this page Feb 24, 2017 · 1 revision

The hOCR output of Tesseract contains a confidence value for each word in the x_wconf property (values range from 0 to 99). For each file (page), it can be useful to see how many words were recognized with high confidence and how many with low confidence.

This bash script goes over all hOCR files in the current directory and prints ten values per file, each being the number of words in one confidence range. The first value is the number of words with a confidence in the range 0 to 9, the second the number in the range 10 to 19, and so on in the same manner. The last value is the number of words with a confidence in the range 90 to 99.

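#!/bin/bash

# For each hOCR file, print the file name followed by ten counts:
# the number of words with x_wconf in 0-9, 10-19, ..., 90-99.
# The trailing ([^0-9]|$) ensures that the whole number is matched, so that
# a single-digit value such as "x_wconf 5" is counted in the first range
# and not in the range 50-59.
for f in *.hocr; do
    [ -e "$f" ] || continue   # no .hocr files in this directory
    conf0=$(grep -oE "x_wconf 0?[0-9]([^0-9]|$)" "$f" | wc -l)
    conf1=$(grep -oE "x_wconf 1[0-9]([^0-9]|$)" "$f" | wc -l)
    conf2=$(grep -oE "x_wconf 2[0-9]([^0-9]|$)" "$f" | wc -l)
    conf3=$(grep -oE "x_wconf 3[0-9]([^0-9]|$)" "$f" | wc -l)
    conf4=$(grep -oE "x_wconf 4[0-9]([^0-9]|$)" "$f" | wc -l)
    conf5=$(grep -oE "x_wconf 5[0-9]([^0-9]|$)" "$f" | wc -l)
    conf6=$(grep -oE "x_wconf 6[0-9]([^0-9]|$)" "$f" | wc -l)
    conf7=$(grep -oE "x_wconf 7[0-9]([^0-9]|$)" "$f" | wc -l)
    conf8=$(grep -oE "x_wconf 8[0-9]([^0-9]|$)" "$f" | wc -l)
    conf9=$(grep -oE "x_wconf 9[0-9]([^0-9]|$)" "$f" | wc -l)
    echo "$f" "$conf0" "$conf1" "$conf2" "$conf3" "$conf4" "$conf5" "$conf6" "$conf7" "$conf8" "$conf9"
done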
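
The script above reads every file ten times. As an alternative, the following is a minimal sketch that reads each file only once and computes the range from the numeric value. The function name conf_histogram is made up for this example, and the sketch assumes that the confidence is written as a plain integer directly after x_wconf.

```bash
#!/bin/bash

# Hypothetical helper: prints ten counts (ranges 0-9, 10-19, ..., 90-99)
# for the hOCR file given as first argument, reading the file only once.
conf_histogram() {
    grep -oE 'x_wconf [0-9]+' "$1" | awk '
        # $2 is the confidence value; values above 99 are ignored.
        { b = int($2 / 10); if (b <= 9) c[b]++ }
        END {
            for (i = 0; i <= 9; i++) printf "%d%s", c[i], (i < 9 ? " " : "\n")
        }'
}

for f in *.hocr; do
    [ -e "$f" ] || continue   # no .hocr files in this directory
    echo "$f" "$(conf_histogram "$f")"
done
```

The output has the same layout as that of the script above: the file name followed by ten counts.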

The output looks like this and can be further analyzed with Excel or some other tool:

481659978_08_0458.hocr 0 0 0 0 0 0 1 3 25 296
481659978_08_0459.hocr 0 0 0 0 0 0 0 1 22 311
481659978_08_0460.hocr 0 0 0 0 1 0 1 1 24 318
481659978_08_0461.hocr 0 0 0 0 0 0 3 5 35 301
481659978_08_0462.hocr 0 0 0 0 0 0 0 2 23 308
481659978_08_0463.hocr 0 0 0 0 1 1 2 7 27 271
481659978_08_0464.hocr 0 0 0 2 0 1 2 2 16 305
481659978_08_0465.hocr 0 0 0 0 0 3 2 2 11 322
481659978_08_0466.hocr 1 2 2 1 2 11 4 12 29 169
481659978_08_0467.hocr 0 0 2 0 0 2 1 2 19 276
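
As one example of such further analysis, here is a minimal sketch that reads output lines like the ones above from standard input and prints, for each file, the total number of words and the percentage of words in the range 90 to 99. The function name summarize_conf is made up for this example, and the sketch assumes that the file names contain no spaces.

```bash
#!/bin/bash

# Hypothetical helper: expects lines of the form
#   <file name> <count 0-9> <count 10-19> ... <count 90-99>
# and prints: <file name> <total words> <percentage of words in 90-99>
summarize_conf() {
    awk '{
        total = 0
        for (i = 2; i <= 11; i++) total += $i
        if (total > 0)
            printf "%s %d %.1f\n", $1, total, 100 * $11 / total
        else
            printf "%s 0 n/a\n", $1
    }'
}

# Example with two rows taken from the output above.
printf '%s\n' \
    "481659978_08_0458.hocr 0 0 0 0 0 0 1 3 25 296" \
    "481659978_08_0466.hocr 1 2 2 1 2 11 4 12 29 169" | summarize_conf
# prints:
# 481659978_08_0458.hocr 325 91.1
# 481659978_08_0466.hocr 233 72.5
```

A page such as 481659978_08_0466.hocr, where a clearly smaller share of the words falls into the highest range, stands out immediately in this form.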