362 speed up data generation #384

Open
wants to merge 5 commits into
base: master
22 changes: 20 additions & 2 deletions docs/machine_learning.md
@@ -59,8 +59,26 @@ best features.
very big (compared to the bill width). I will also create a boolean variable
for this, but to calibrate the threshold we also need more data.



## Generating Data for ML via Docker
__Ubuntu:__
Adapt the paths in the following commands to your setup.

Download and preprocess the bills:
~~~shell
sudo docker-compose run ml rake machine_learning:import_bill_data &&
sudo docker-compose run ml rake machine_learning:add_dimensions &&
sudo docker-compose run ml rake machine_learning:add_prices &&
sudo chown -cR chillbill:chillbill ~/Dokumente/chillbill-recognizer/data/ &&
sudo docker-compose run ml rake machine_learning:list_bills
~~~

Correct the yml files by hand (see `lib/machine_learning/datageneration_workflow.md`).

Generate the CSV files:
~~~shell
sudo docker-compose run ml rake machine_learning:generate_csvs &&
sudo chown -cR chillbill:chillbill ~/Dokumente/chillbill-recognizer/data/
~~~
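
The preprocessing chain above could also be driven from a small Ruby script. This is only a sketch: `PREPROCESS_TASKS`, `compose_command`, and `run_all` are hypothetical names that do not exist in the repository, and the `chown` steps are left out because their user and path differ per machine.

```ruby
# Hypothetical helper, not part of the repository: runs the rake tasks from
# the commands above one after another and stops at the first failure,
# mirroring the `&&` chaining.
PREPROCESS_TASKS = %w[import_bill_data add_dimensions add_prices list_bills].freeze

# Builds the docker-compose invocation for one machine_learning rake task.
def compose_command(task)
  ['sudo', 'docker-compose', 'run', 'ml', 'rake', "machine_learning:#{task}"]
end

# Runs each task in order. `runner` is injectable so the logic can be tried
# without Docker. Returns the list of tasks that completed successfully.
def run_all(tasks, runner: ->(cmd) { system(*cmd) })
  completed = []
  tasks.each do |task|
    break unless runner.call(compose_command(task))

    completed << task
  end
  completed
end
```

Calling `run_all(PREPROCESS_TASKS)` would execute the four tasks with the default runner; comparing the return value with `PREPROCESS_TASKS` shows whether the chain ran to the end.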

## Description of the procedure
In the long run there are several possibilities to optimize the result:
2 changes: 1 addition & 1 deletion lib/image_processor.rb
@@ -61,7 +61,7 @@ def apply_background(color)

   def deskew
     process_image do |image|
-      image.deskew(0.4, @image_width)
+      image.deskew(0.4)
     end
     update_width_and_height
     self
89 changes: 89 additions & 0 deletions lib/machine_learning/datageneration_workflow.md
@@ -0,0 +1,89 @@
# Data generation for machine learning - Workflow
## 0. Get access
You need access to S3 (for the files) and GitHub (for error reports).

The S3 web interface cannot download multiple files at once, so one option is to use a program such as "Cyberduck" to download, move, and delete files on S3. Another option is to zip several files (e.g. 20) and upload the archive to S3. Everyone who processes bills then downloads one zip file, deletes it from S3, processes the bills it contains, and uploads the result under a different name (e.g. "done_OriginalName").


## 1. Opening the next bill

Open one bill from the "InProcess" folder. Copy its id (e.g. ZwkzBBdB3SH45PbRi) and open the bill in the web app, replacing the example id in the following URL with the actual bill id:
~~~shell
https://my.chillbill.co/bills/ZwkzBBdB3SH45PbRi
~~~

## 2. Check dimensions

__All coordinates are relative (between 0 and 1) to the scan dimension. The origin is the top left corner of the scan.__
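
To compare a relative coordinate with the scanned image, it can help to convert it to pixels. The following is a minimal sketch; `to_pixels`, `scan_width`, and `scan_height` are hypothetical names, and the scan size in the usage note is only an example value.

```ruby
# Converts a relative coordinate (0..1, origin at the top left corner of the
# scan) into a pixel position. `scan_width` and `scan_height` are assumed to
# be the pixel dimensions of the scanned image.
def to_pixels(rel_x, rel_y, scan_width, scan_height)
  [(rel_x * scan_width).round, (rel_y * scan_height).round]
end
```

For a scan of 2480 x 3508 pixels, `to_pixels(0.5, 0.25, 2480, 3508)` returns `[1240, 877]`.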

Check whether the coordinates of the text box make sense by comparing them with the scanned image. The relevant values are `text_box_top:`, `text_box_bottom:`, `text_box_left:` and `text_box_right:`.

If the coordinates look wrong (e.g. a value greater than 1), stop and report an error in the column "Wrong text_box dimensions".
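
Part of this check can be automated before looking at the scan. The sketch below assumes the bill file has been loaded into a Hash with the four `text_box_*` values as flat string keys; `text_box_problems` is a hypothetical helper, not part of the repository. It does not replace the visual comparison with the image.

```ruby
TEXT_BOX_KEYS = %w[text_box_top text_box_bottom text_box_left text_box_right].freeze

# Returns a list of problems found in the text box coordinates of `bill`
# (a Hash as loaded from the bill's yml file; flat string keys are an
# assumption). An empty list means the values are at least plausible.
def text_box_problems(bill)
  problems = []
  TEXT_BOX_KEYS.each do |key|
    value = bill[key]
    unless value.is_a?(Numeric) && value.between?(0, 1)
      problems << "#{key} is not between 0 and 1: #{value.inspect}"
    end
  end
  return problems unless problems.empty?

  # The origin is the top left corner, so top < bottom and left < right.
  unless bill['text_box_top'] < bill['text_box_bottom']
    problems << 'text_box_top is not above text_box_bottom'
  end
  unless bill['text_box_left'] < bill['text_box_right']
    problems << 'text_box_left is not left of text_box_right'
  end
  problems
end
```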



## 3. Check bill_format

Is the bill of type `A4`, `sales_check` or `email`? Write the correct one to `bill_format`. (The default is `A4`).
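
A typo in `bill_format` is easy to miss, so the value can be checked against the three allowed formats. This is a small sketch; `valid_bill_format?` is a hypothetical helper, and treating a missing key as the default `A4` is an assumption.

```ruby
BILL_FORMATS = %w[A4 sales_check email].freeze

# True if `bill_format` is one of the three formats named above.
# A missing key is treated as the default `A4` (assumption).
def valid_bill_format?(bill)
  BILL_FORMATS.include?(bill.fetch('bill_format', 'A4'))
end
```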



## 4. Check total_prices_candidates

1. If there is no candidate, stop and report an error in the column "Prices missing".

2. If there is only one candidate, check that it is correct by looking at the amount.

3. If there is more than one candidate, find the correct one by looking at the coordinates. Move the others into `remaining_prices`.



## 5. Check vat_prices_candidates

1. If there is no candidate, stop and report an error in the column "Prices missing".

2. If there is only one candidate, check that it is correct by looking at the amount.

3. If there is more than one candidate, find the correct one by looking at the coordinates. Move the others into `remaining_prices`.
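
The "move the others" step in sections 4 and 5 can be sketched on the loaded data as follows. The layout (a Hash of price keys, each holding a list of candidate Hashes, plus a `remaining_prices` list) follows what `add_prices.rake` writes; `keep_candidate!` is a hypothetical helper, and the price key `total_20` and the `text` field used in the usage note are made-up examples.

```ruby
# Keeps the candidate at `keep_index` under `section` / `price_key` and moves
# all other candidates of that key to the end of `remaining_prices`.
# Modifies and returns `bill`.
def keep_candidate!(bill, section, price_key, keep_index)
  candidates = bill.fetch(section).fetch(price_key)
  kept = candidates.fetch(keep_index)
  rejected = candidates.reject.with_index { |_candidate, index| index == keep_index }
  bill[section][price_key] = [kept]
  bill['remaining_prices'] = bill.fetch('remaining_prices', []) + rejected
  bill
end
```

For example, `keep_candidate!(bill, 'total_prices_candidates', 'total_20', 1)` keeps the second candidate and appends the others to `remaining_prices`. Choosing the right index still requires looking at the coordinates on the scan.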


## 6. Check remaining_prices
1. Check whether every entry in `remaining_prices` is an actual price. If something else (e.g. a phone number, address, weight, or percentage) was recognized as a price, delete that entry and its coordinates and report an error in the column "Something else recognized as price". The bill can still be used in this case.

2. While checking the remaining prices, also check for missing prices. Are there any prices on the bill (scan) that do not appear in the list? If so, stop and report an error in the column "Prices missing".


## 7. Correct formatting
i. In the section "total_prices_candidates", rename "total_prices_candidates" to "total_prices". Also replace the "-" with a space.

Example:
![exampleimage](images/total_prices_candidates.png "Before") becomes ![exampleimage](images/total_prices.png "After")


ii. In the section "vat_prices_candidates", rename "vat_prices_candidates" to "vat_prices". Also replace the "-" with a space.

Example:
![exampleimage](images/vat_prices_candidates.png "Before") becomes ![exampleimage](images/vat_prices.png "After")


iii. If you moved prices from "total_prices_candidates" or "vat_prices_candidates" to "remaining_prices", the indentation is most likely incorrect. Shift the parts you moved to the left. In the editor "Atom", you can select the parts you want to shift and press `shift` and `tab` at the same time.

Example:
![exampleimage](images/remaining_prices_before.png "Before") becomes ![exampleimage](images/remaining_prices_after.png "After")
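
Expressed on the loaded data instead of the text file, steps i and ii might look like the sketch below. It assumes that the "-" is the YAML list marker, so removing it turns a one-element list into a plain mapping; `finalize_prices` is a hypothetical helper, and mapping an empty candidate list to `nil` is also an assumption.

```ruby
RENAMES = {
  'total_prices_candidates' => 'total_prices',
  'vat_prices_candidates' => 'vat_prices'
}.freeze

# Returns a copy of `bill` in which the two candidate sections are renamed
# and every one-element candidate list is replaced by its single entry.
# An empty list becomes nil; more than one remaining candidate is an error.
def finalize_prices(bill)
  result = bill.reject { |key, _value| RENAMES.key?(key) }
  RENAMES.each do |old_key, new_key|
    next unless bill.key?(old_key)

    result[new_key] = bill[old_key].transform_values do |candidates|
      if candidates.size > 1
        raise ArgumentError, "#{old_key} still has #{candidates.size} candidates"
      end

      candidates.first
    end
  end
  result
end
```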


## Action "stop"
If you have to stop working on a bill because of any of the problems mentioned above, report the error, delete the file (bill) with the error, and start with a new bill.

## Action "report an error"
To solve problems with the recognizer we need "bad" examples. You can report any error in the following project:
~~~
https://github.com/clemenshelm/chillbill-recognizer/projects/6
~~~
To add a new note, press the "+" sign in the matching column. Enter the bill id __with the file type__ (e.g. .pdf, .jpg, .tiff). You can find the file type at the end of the second line (image_url). There should be a column for every possible error. If you find an error that has no column, please tell the person in charge.
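
The note text (bill id plus file type) can be derived from the bill id and `image_url`. This is a minimal sketch; `error_note` is a hypothetical helper, and the URL in the usage note is a made-up example, not a real storage location.

```ruby
require 'uri'

# Builds the text for a new note: the bill id followed by the file type
# taken from the end of `image_url`.
def error_note(bill_id, image_url)
  extension = File.extname(URI.parse(image_url).path)
  raise ArgumentError, "no file type in #{image_url}" if extension.empty?

  "#{bill_id}#{extension}"
end
```

For example, `error_note('ZwkzBBdB3SH45PbRi', 'https://example.com/bills/ZwkzBBdB3SH45PbRi.pdf')` returns `"ZwkzBBdB3SH45PbRi.pdf"`.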


## 8. Finish
If you did not have to stop because of an error, save the file and move it to the folder "done". Then start with a new bill :)
Binary file added lib/machine_learning/images/total_prices.png
Binary file added lib/machine_learning/images/vat_prices.png
5 changes: 3 additions & 2 deletions lib/tasks/machine_learning/add_prices.rake
@@ -27,7 +27,8 @@ namespace :machine_learning do
       recognizer.recognize_words(png_file)
       recognizer.filter_words

-      %w(total_prices_candidates total_prices vat_prices_candidates vat_prices)
+      #%w(total_prices_candidates total_prices vat_prices_candidates vat_prices)
+      %w(total_prices_candidates vat_prices_candidates)
         .each { |attr| store[attr] = {} }

       extractor = PriceExtractor.new
@@ -40,7 +41,7 @@ namespace :machine_learning do
           candidates = prices.send(attr)
           price_key = "#{attr}_#{vat_rate}"
           store["#{attr}_prices_candidates"][price_key] = candidates.map(&:to_h)
-          store["#{attr}_prices"][price_key] = nil
+          #store["#{attr}_prices"][price_key] = nil
         end
       end
       store['remaining_prices'] = extractor.remaining_prices.map(&:to_h)
4 changes: 2 additions & 2 deletions lib/tasks/machine_learning/import_bill_data.rake
@@ -4,7 +4,7 @@ namespace :machine_learning do
   task :import_bill_data do
     require 'mongo'
     require 'yaml/store'
-    limit = 10
+    limit = 50
     existing_ids = Dir['data/bills/*.yml']
                    .map { |f| f.match(%r{([^\/]+)\.yml})[1] }
     client = Mongo::Client.new(ENV['MONGO_READ_URL'], ssl: true, ssl_verify: false)
@@ -20,7 +20,7 @@ namespace :machine_learning do
             'accountingRecord.amounts.0.vatRate': { '$ne': 0 }
           }
         },
-        { '$sample' => { size: 10 } }
+        { '$sample' => { size: limit} }
       ]
     )
     .each do |bill|