OCR #2

Open
wants to merge 37 commits into
base: main
37 commits
b03ed22
Create dummy.txt
beladiyadarshan Dec 15, 2020
d74df7b
Add files via upload
beladiyadarshan Dec 15, 2020
019a850
Create dummy.txt
beladiyadarshan Dec 15, 2020
59c74ac
Add files via upload
beladiyadarshan Dec 15, 2020
5df3a5f
"finalized ocr"#
Dec 21, 2020
618d778
deleted unnecessary files
Dec 21, 2020
86ad2be
update format1 and Main.vue
Dec 22, 2020
6f960ee
Update .dockerignore
groverkds Dec 22, 2020
7cf9873
Added: indentation
Jan 25, 2021
d9c969e
Merge branch 'main' of https://github.com/beladiyadarshan/python-simp…
Jan 25, 2021
33d711a
Modified: variable names modified
Feb 12, 2021
2cef0a8
Modified: gitignore changed
Feb 12, 2021
e2c642f
Feb 12, 2021
24678df
--allow-empty-message
Feb 12, 2021
96e10c4
--allow-empty-message
Feb 12, 2021
a0aae9d
Merge branch 'main' of https://github.com/beladiyadarshan/python-simp…
Feb 12, 2021
12056fe
Revert "reverted"
Feb 12, 2021
bba5d48
Finalised
Feb 12, 2021
a7266dc
Added : Readme File
Feb 15, 2021
e160829
RFC: Cleaned code
Feb 16, 2021
dbb0bfa
added vscode config to gitignore
Feb 27, 2021
283ad1a
moved util funcitons to util library
Feb 27, 2021
43cec92
converted util folder to pakcage
Feb 27, 2021
f90c4d9
moved image_to_text to utils
Feb 27, 2021
26d3529
added format as a drop and refined code
Feb 27, 2021
4764a82
refactored code
Feb 27, 2021
0cf5278
use custom endpoint using env
Feb 27, 2021
0dce42a
added installation instructions
Feb 27, 2021
6334d75
added gitkeep
Feb 27, 2021
c1b46fa
changed the default url
Feb 27, 2021
9259639
generalized the name from patient to data
Feb 27, 2021
410b475
refactored code
Feb 27, 2021
053cc6b
removed console errors
Feb 27, 2021
aee5149
RFC: use init for just loading constants
groverkds Mar 1, 2021
2a031b3
BUG: bug fixes
Mar 1, 2021
a10ad87
Merge branch 'main' of https://github.com/beladiyadarshan/python-simp…
Mar 1, 2021
56f6c1a
RFC: refactored
Mar 1, 2021
5 changes: 4 additions & 1 deletion .gitignore
@@ -8,7 +8,6 @@ __pycache__/

# Distribution / packaging
.Python
build/
develop-eggs/
dist/
downloads/
@@ -127,3 +126,7 @@ dmypy.json

# Pyre type checker
.pyre/

.vscode

.env
22 changes: 22 additions & 0 deletions README.md
@@ -1,2 +1,24 @@
# python-simple-ocr-project
A simple OCR Project using python with frontend in VueJS

# Overview
Repository for Medical-OCR.

# Folder Structure
1. **api** : backend python API

2. **frontend**: frontend vue client

# Setup Instructions For Local Environment
1. Clone the repository
```bash
git clone git@github.com:beladiyadarshan/python-simple-ocr-project.git
```

2. [Set up the API](https://github.com/beladiyadarshan/python-simple-ocr-project/blob/main/api/README.md)

3. [Set up the Frontend](https://github.com/beladiyadarshan/python-simple-ocr-project/blob/main/frontend/README.md)


Note: You will need Bash and Git to install and get started with this project.
1. Install Git, or Git Bash on Windows ([Setup instructions](https://git-scm.com/book/en/v2/Getting-Started-Installing-Git)).
5 changes: 5 additions & 0 deletions api/.dockerignore
@@ -0,0 +1,5 @@
development.yml
Dockerfile
production.yml
docker-compose.yml
docker-compose.yml.example
19 changes: 19 additions & 0 deletions api/.gitignore
@@ -0,0 +1,19 @@
is_running
docker-compose.yml
.idea
*.xlsm
*.xlsx
__pycache__
logs
*.log
.vscode
.vscode/*
*.xls
.cache/
docker_data
Backup_Reports
docker_datagit
Backup_Reports
XDG_CACHE_HOME
*.log.*
*.pdf
29 changes: 29 additions & 0 deletions api/Dockerfile
@@ -0,0 +1,29 @@
FROM ubuntu:18.04


RUN apt-get update --fix-missing
RUN apt-get upgrade -y

RUN apt-get install -y libsm6 libxext6 libxrender-dev libleptonica-dev liblept5
RUN apt-get -y install nginx \
&& apt-get -y install python3-dev \
&& apt-get -y install build-essential \
&& apt-get -y install python3-pip \
&& apt-get -y install software-properties-common \
&& add-apt-repository -y ppa:alex-p/tesseract-ocr \
&& apt-get -y update \
&& apt-get -y install tesseract-ocr \
&& apt-get -y install curl \
&& apt-get -y install poppler-utils

WORKDIR /project

COPY ./requirements.txt /project/requirements.txt

RUN pip3 install -r requirements.txt

COPY ./ /project/

COPY ./default /etc/nginx/sites-available/

CMD ["gunicorn", "-b", "0.0.0.0:5001", "--workers=3", "--threads=3", "-t", "90", "--error-logfile", "/project/err.log", "--log-level=debug", "wsgi:app"]
37 changes: 37 additions & 0 deletions api/README.md
@@ -0,0 +1,37 @@
# Setting up the API

### With Docker
1. Install Docker CE. ([Setup instructions](https://docs.docker.com/install/linux/docker-ce/ubuntu/))

2. Install docker-compose. ([Setup instructions](https://docs.docker.com/compose/install/))

3. Clone the project. (`https://github.com/beladiyadarshan/python-simple-ocr-project.git`)

4. Copy docker-compose.yml.example and save it as docker-compose.yml.

5. Build docker image
```
docker build .
```

### Without Docker
1. Install Python ([Setup instructions](https://wiki.python.org/moin/BeginnersGuide))

2. Install tesseract ([Setup instructions](https://github.com/tesseract-ocr/tesseract#installing-tesseract))

3. Install Python packages
```
pip3 install -r requirements.txt
```

# Running the API

### With Docker
```
docker-compose up
```

### Without Docker
```
python3 app.py
```
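
Once the API is running, a request to the `/ocr` route defined in `api/app.py` could be built roughly as follows. This is a minimal standard-library sketch, not part of the project: `build_ocr_request` is a hypothetical helper, the form fields `format` and `file` are the ones `app.py` reads, `patient_details` is the only key registered in `api/extractor/__init__.py`, and `http://localhost:5001` assumes the port used by `app.py` and the compose example.

```python
import urllib.request
import uuid


def build_ocr_request(base_url, pdf_bytes, file_format, filename="document.pdf"):
    """Build (but do not send) a multipart/form-data POST request for /ocr."""
    boundary = uuid.uuid4().hex
    # First part: the plain "format" field. Second part: the PDF upload.
    head = (
        f"--{boundary}\r\n"
        'Content-Disposition: form-data; name="format"\r\n'
        "\r\n"
        f"{file_format}\r\n"
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
        "Content-Type: application/pdf\r\n"
        "\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return urllib.request.Request(
        base_url.rstrip("/") + "/ocr",
        data=head + pdf_bytes + tail,
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
        method="POST",
    )
```

Passing the returned object to `urllib.request.urlopen` sends the upload. According to `app.py`, a successful response is a JSON object with `text` and `data` keys.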
61 changes: 61 additions & 0 deletions api/app.py
@@ -0,0 +1,61 @@
import os
import sys
import logging
from flask import Flask, request, json
from utils.generic_utils import get_random_string, allowed_file
from parser.parser import parse
from flask_cors import CORS
ROOT_DIR = os.path.dirname(__file__)
PARENT_DIR = os.path.dirname(__file__) + '/' + str(os.pardir)
sys.path.append(PARENT_DIR)

logging.basicConfig(level=logging.DEBUG)
app = Flask(__name__)
app.config['UPLOAD_FOLDER'] = "uploads"
app.config['MAX_CONTENT_LENGTH'] = 16 * 1024 * 1024
cors = CORS(app)


@app.route('/ocr', methods=['POST'])
def ocr():
    file_path = ''
    try:
        format = request.form['format']
        file = request.files['file']

        file_path = app.config['UPLOAD_FOLDER'] + "/"\
            + get_random_string(32) + ".pdf"
        file.save(file_path)
        text, data, error = parse(file_path, format)  # noqa

        app.logger.info("----------------------------------")
        app.logger.info(f"Data: {data}")
        app.logger.info("----------------------------------")
        response = app.response_class(
            response=json.dumps({
                "text": text,
                "data": data
            }),
            status=200,
            mimetype='application/json'
        )
        os.remove(file_path)
        return response

    except Exception as e:
        response = app.response_class(
            response=json.dumps({
                "status": 0,
                "message": "Some error occurred",
                "error": str(e)
            }),
            status=500,
            mimetype='application/json'
        )
        # The upload may have failed before the file was written,
        # so only remove it if it actually exists.
        if file_path and os.path.exists(file_path):
            os.remove(file_path)
        return response


if __name__ == "__main__":
    app.run(host='0.0.0.0', port=5001, debug=True)
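
`app.py` imports `get_random_string` and `allowed_file` from `utils.generic_utils`, which is not visible in this part of the diff, and `allowed_file` is not called anywhere in `ocr()` as shown. The sketch below is only a guess at what such helpers might look like, assuming that uploads are restricted to PDFs (the endpoint always saves with a `.pdf` suffix). The bodies and the `ALLOWED_EXTENSIONS` constant are hypothetical.

```python
import secrets
import string

# Assumption: only PDF uploads are accepted.
ALLOWED_EXTENSIONS = {"pdf"}


def get_random_string(length):
    """Return a random alphanumeric string of the given length."""
    alphabet = string.ascii_letters + string.digits
    return "".join(secrets.choice(alphabet) for _ in range(length))


def allowed_file(filename):
    """Return True if the filename has an extension in ALLOWED_EXTENSIONS."""
    if "." not in filename:
        return False
    extension = filename.rsplit(".", 1)[1].lower()
    return extension in ALLOWED_EXTENSIONS
```

With helpers like these, `ocr()` could reject a request early by checking `allowed_file(file.filename)` before calling `file.save`.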
41 changes: 41 additions & 0 deletions api/default
@@ -0,0 +1,41 @@
##
# You should look at the following URL's in order to grasp a solid understanding
# of Nginx configuration files in order to fully unleash the power of Nginx.
# http://wiki.nginx.org/Pitfalls
# http://wiki.nginx.org/QuickStart
# http://wiki.nginx.org/Configuration
#
# Generally, you will want to move this file somewhere, and start with a clean
# file but keep this around for reference. Or just disable in sites-enabled.
#
# Please see /usr/share/doc/nginx-doc/examples/ for more detailed examples.
##

# Default server configuration
#
server {
    listen 80 default_server;
    listen [::]:80 default_server;

    # SSL configuration
    #
    # listen 443 ssl default_server;
    # listen [::]:443 ssl default_server;
    #
    # Self signed certs generated by the ssl-cert package
    # Don't use them in a production server!
    #
    # include snippets/snakeoil.conf;

    root /var/www/html;

    # Add index.php to the list if you are using PHP
    index index.html index.htm index.nginx-debian.html;

    server_name _;

    location / {
        include proxy_params;
        proxy_pass http://ocr-api:5001;
    }
}
18 changes: 18 additions & 0 deletions api/docker-compose.yml.example
@@ -0,0 +1,18 @@
version: '3'

services:
  app:
    container_name: ocr-api
    build: .
    volumes:
      - './:/project/'
    restart: always
  web:
    image: nginx
    volumes:
      - './nginx.conf:/etc/nginx/conf.d/default.conf'
    ports:
      - '5001:80'
    links:
      - app
    restart: always
6 changes: 6 additions & 0 deletions api/extractor/__init__.py
@@ -0,0 +1,6 @@
from extractor import patient_details


FUNCTIONS = {
    'patient_details': patient_details
}
29 changes: 29 additions & 0 deletions api/extractor/extract_details.py
@@ -0,0 +1,29 @@
from flask import current_app as app
from utils.image_utils import get_text_from_image_list
from . import FUNCTIONS


def extract_details(page_list, file_format):
    """Extract details from a page list

    Extract details from a page list depending upon the document type.

    Parameters
    ----------
    page_list : list(np.ndarray)
        list of pages extracted from the PDF
    file_format : str
        format that the file follows

    Returns
    -------
    text: str
        text as extracted from the set of images
    data: list(tuple)
        data stored as a list of tuples
    """

    text = get_text_from_image_list(page_list)
    app.logger.info(text)
    data = FUNCTIONS[file_format].extract_details(text)
    return text, data
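
The lookup `FUNCTIONS[file_format].extract_details(text)` is a registry-style dispatch: the format name selects a module, and that module turns the OCR text into a list of tuples. The `patient_details` module itself is not shown in this part of the diff, so the following is only an illustrative sketch. The stand-in extractor (splitting `Label: value` lines), the `extract_from_text` wrapper, and the explicit `ValueError` for unknown formats are all assumptions rather than the project's actual behaviour.

```python
from types import SimpleNamespace


def _extract_patient_details(text):
    """Hypothetical extractor: collect 'Label: value' lines as tuples."""
    data = []
    for line in text.splitlines():
        label, sep, value = line.partition(":")
        # Keep only lines that have a colon with text on both sides.
        if sep and label.strip() and value.strip():
            data.append((label.strip(), value.strip()))
    return data


# Stand-in for the extractor.patient_details module (not shown in the diff).
patient_details = SimpleNamespace(extract_details=_extract_patient_details)

# Registry mapping a format name to the object that handles it.
FUNCTIONS = {"patient_details": patient_details}


def extract_from_text(text, file_format):
    """Look up the extractor registered for file_format and run it."""
    try:
        extractor = FUNCTIONS[file_format]
    except KeyError:
        raise ValueError(f"Unsupported format: {file_format!r}") from None
    return extractor.extract_details(text)
```

In the code as submitted, an unknown `file_format` would surface as a `KeyError` and be reported by the generic `except` block in `app.py`. Raising a more specific error, as sketched here, is one possible way to make that failure easier to diagnose.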