OCR #65

yuliiabuchko · 2020-05-05T16:49:41Z

No description provided.

build.gradle

gradle/wrapper/gradle-wrapper.properties

src/main/kotlin/pl/edu/uj/ii/ksi/mordor/configuration/TesseractConfiguration.kt

src/main/kotlin/pl/edu/uj/ii/ksi/mordor/services/PDFTextExtractor.kt

src/main/kotlin/pl/edu/uj/ii/ksi/mordor/services/ImageTextExtractor.kt

apardyl

Looks almost OK. Please manually check that it works on PDF files from mordor.ksi.ii.uj.edu.pl.

src/main/kotlin/pl/edu/uj/ii/ksi/mordor/services/text/extractor/ImageTextExtractor.kt

src/main/kotlin/pl/edu/uj/ii/ksi/mordor/services/text/extractor/FileContentValidator.kt

apardyl

Looks OK, just some nits.

apardyl · 2020-05-13T22:09:54Z

src/main/kotlin/pl/edu/uj/ii/ksi/mordor/services/text/extractor/AutoDetectTextExtractor.kt

+        if (content == null) {
+            return null
+        }
+        val modified = content.replace('\n', ' ').replace('\t', ' ')


Suggested change

-        val modified = content.replace('\n', ' ').replace('\t', ' ')
+        val modified = content.replace("\\s".toRegex(), " ")
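To make the difference between the two variants concrete, here is a minimal sketch. The function names are hypothetical and not part of the PR; only the two `replace` expressions are taken from the thread. The point is that the character-based version leaves other whitespace (for example `\r`) in place, while the `\s` regex covers it.

```kotlin
// Hypothetical helper names; only the replace expressions come from the review thread.

// Original version: handles only '\n' and '\t'.
fun replaceNewlinesAndTabs(content: String): String =
    content.replace('\n', ' ').replace('\t', ' ')

// Suggested version: every whitespace character matched by \s becomes one space.
fun replaceAllWhitespace(content: String): String =
    content.replace("\\s".toRegex(), " ")

fun main() {
    val sample = "line one\r\nline\ttwo"
    // The carriage return survives the character-based replacement.
    println(replaceNewlinesAndTabs(sample).contains('\r')) // prints "true"
    // The regex-based replacement removes it as well.
    println(replaceAllWhitespace(sample).contains('\r'))   // prints "false"
}
```

Note that `\s` replaces each whitespace character individually, so `\r\n` becomes two spaces. Collapsing runs into a single space would need `\\s+`, which is not what the suggestion proposes.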

apardyl · 2020-05-13T22:11:05Z

src/main/kotlin/pl/edu/uj/ii/ksi/mordor/services/text/extractor/AutoDetectTextExtractor.kt

+                return ImageTextExtractor(tessBaseAPI).extract(file, maxLength)
+            }
+        } catch (e: IOException) {
+            logger.error("File can not be read", e)


Suggested change

-            logger.error("File can not be read", e)
+            logger.warn("File can not be read: " + file.absolutePath, e)
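A rough sketch of the pattern behind this suggestion: log an unreadable file at warning level, include its path, and carry on. The project's logger type is not shown in this thread, so `java.util.logging` is used here purely as a stand-in, and `readTextOrNull` is a hypothetical function, not code from the PR.

```kotlin
import java.io.File
import java.io.IOException
import java.util.logging.Level
import java.util.logging.Logger

// Stand-in logger; the project's actual logging library is an assumption.
private val logger: Logger = Logger.getLogger("TextExtractorSketch")

// Hypothetical helper: returns the file's text, or null when it cannot be read.
fun readTextOrNull(file: File): String? =
    try {
        file.readText()
    } catch (e: IOException) {
        // Warning level, with the path, so the log entry identifies the failing file.
        logger.log(Level.WARNING, "File can not be read: ${file.absolutePath}", e)
        null
    }

fun main() {
    // A missing file produces a warning and a null result instead of an exception.
    println(readTextOrNull(File("/nonexistent/path/sketch.txt")))
}
```

Including the path matters when many files are processed in one run, since the exception message alone may not say which input failed.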

yuliiabuchko and others added 2 commits May 5, 2020 17:45

Add AutoDetectTextExtractor

515e2a3

Reformat TextExtractor based on detekt errors

e992e67

yuliiabuchko requested a review from apardyl May 5, 2020 16:52

apardyl requested changes May 5, 2020

Fix review issues

3595d69

yuliiabuchko requested a review from apardyl May 6, 2020 11:09

yuliiabuchko and others added 10 commits May 6, 2020 16:40

Add bytedeco ImageTextExtractor

20e4064

Move text extractors to new package

0231468

Add PDF text extractor

5ab69e6

Fix temp file creation

7cc1040

Add max length validation

412d63d

Add white chars filter

38ed7b8

Add white chars filter

6b1d48f

Merge remote-tracking branch 'origin/ocr_new' into ocr_new

93661b5

# Conflicts: # src/main/kotlin/pl/edu/uj/ii/ksi/mordor/services/text/extractor/AutoDetectTextExtractor.kt # src/main/kotlin/pl/edu/uj/ii/ksi/mordor/services/text/extractor/FileContentValidator.kt

Merge branch 'ms-1' into ocr_new

3d29716

Rename extractors, update validator

2772416

yuliiabuchko linked an issue May 12, 2020 that may be closed by this pull request

Execute OCR on the scanned documents #55

Open

apardyl requested changes May 13, 2020

src/main/kotlin/pl/edu/uj/ii/ksi/mordor/services/text/extractor/ImageTextExtractor.kt (outdated, resolved)

src/main/kotlin/pl/edu/uj/ii/ksi/mordor/services/text/extractor/FileContentValidator.kt (outdated, resolved)

Add minimum word length validator

d99b486

apardyl requested changes May 13, 2020

yuliiabuchko and others added 2 commits May 14, 2020 00:40

Add split using regex, review fixes

b5f2d58

Merge branch 'ms-1' into ocr_new

2c83681

apardyl approved these changes May 13, 2020

yuliiabuchko merged commit 4355720 into ms-1 May 14, 2020

yuliiabuchko deleted the ocr_new branch May 14, 2020 09:43
