diff --git a/docs/machine_learning.md b/docs/machine_learning.md
index ccfbe44..8acb1ab 100644
--- a/docs/machine_learning.md
+++ b/docs/machine_learning.md
@@ -59,8 +59,26 @@ best features.
 very big (compared to the bill width). I will also create a boolean variable for
 this, but to calibrate the threshold we also need more data.
 
-
-
+## Generating Data for ML via Docker
+__Ubuntu:__
+In the following commands the paths must be adapted!
+
+Download and preprocess the bills:
+~~~shell
+sudo docker-compose run ml rake machine_learning:import_bill_data &&
+ sudo docker-compose run ml rake machine_learning:add_dimensions &&
+ sudo docker-compose run ml rake machine_learning:add_prices &&
+ sudo chown -cR chillbill:chillbill ~/Dokumente/chillbill-recognizer/data/ &&
+ sudo docker-compose run ml rake machine_learning:list_bills
+~~~
+
+Correct the yml files.
+
+Generate the CSVs:
+~~~shell
+sudo docker-compose run ml rake machine_learning:generate_csvs &&
+ sudo chown -cR chillbill:chillbill ~/Dokumente/chillbill-recognizer/data/
+~~~
 ## Description of the procedure
 
 In the long run there are several possibilities to optimize the result:
diff --git a/lib/image_processor.rb b/lib/image_processor.rb
index 86abb70..4225c56 100644
--- a/lib/image_processor.rb
+++ b/lib/image_processor.rb
@@ -61,7 +61,7 @@ def apply_background(color)
 
   def deskew
     process_image do |image|
-      image.deskew(0.4, @image_width)
+      image.deskew(0.4)
     end
     update_width_and_height
     self
diff --git a/lib/machine_learning/datageneration_workflow.md b/lib/machine_learning/datageneration_workflow.md
new file mode 100644
index 0000000..241877f
--- /dev/null
+++ b/lib/machine_learning/datageneration_workflow.md
@@ -0,0 +1,89 @@
+# Data generation for machine learning - Workflow
+## 0. Get access
+You need access to S3 (files) and GitHub (error reports).
+
+The S3 web interface cannot download multiple files at once, so one way is to use a program like "Cyberduck" to download / move / delete files on S3. Another possibility is to zip several files (e.g. 20) and upload them to S3. Everyone who processes the bills then downloads one zip file, deletes it from S3, processes the included bills and uploads the result under a different name (e.g. "done_OriginalName").
+
+
+## 1. Opening the next bill
+
+Open one bill from the "InProcess" folder. Copy the id (e.g. ZwkzBBdB3SH45PbRi) and open the bill in the web app with
+~~~shell
+https://my.chillbill.co/bills/ZwkzBBdB3SH45PbRi
+~~~
+replacing the example bill id with the actual bill id.
+## 2. Check dimensions
+
+__All coordinates are relative (between 0 and 1) to the scan dimension. The origin is the top left corner of the scan.__
+
+Check whether the coordinates of the text box make sense by comparing them with the scanned image. The relevant values are `text_box_top:`, `text_box_bottom:`,
+`text_box_left:` and `text_box_right:`.
+
+If there are weird coordinates (e.g. greater than 1), stop and report an error in the column "Wrong text_box dimensions".
+
+
+
+## 3. Check bill_format
+
+Is the bill of type `A4`, `sales_check` or `email`? Write the correct one to `bill_format`. (The default is `A4`.)
+
+
+
+## 4. Check total_prices_candidates
+
+1. If there is no candidate, stop and report an error in the column "Prices missing".
+
+2. If there is only one candidate, check whether it is correct by looking at the amount.
+
+3. If there is more than one candidate, find the correct one by looking at the coordinates. Move the others into `remaining_prices`.
+
+
+
+## 5. Check vat_prices_candidates
+
+1. If there is no candidate, stop and report an error in the column "Prices missing".
+
+
+2. If there is only one candidate, check whether it is correct by looking at the amount.
+
+3. If there is more than one candidate, find the correct one by looking at the coordinates. Move the others into `remaining_prices`.
+
+
+## 6. Check remaining_prices
+1. Check whether all entries in `remaining_prices` are actual prices. If something else (e.g. phone number, address, weight, percentage, ...) was recognized as a price, delete this price and its coordinates and report an error in the column "Something else recognized as price". In this case we can still use the bill.
+
+2. While checking the remaining prices you also have to check for missing prices. Are there any prices on the bill (scan) that do not appear in the list? If so, stop and report an error in the column "Prices missing".
+
+
+## 7. Correct formatting
+i. In the section "total_prices_candidates" change the name "total_prices_candidates" to "total_prices". Also replace the "-" with a space.
+
+Example:
+![exampleimage](images/total_prices_candidates.png "Before") becomes ![exampleimage](images/total_prices.png "After")
+
+
+ii. In the section "vat_prices_candidates" change the name "vat_prices_candidates" to "vat_prices". Also replace the "-" with a space.
+
+Example:
+![exampleimage](images/vat_prices_candidates.png "Before") becomes ![exampleimage](images/vat_prices.png "After")
+
+
+iii. If you moved prices from "total_prices_candidates" or "vat_prices_candidates" to "remaining_prices", the indentation is most likely incorrect. You have to shift the parts you moved to the left. In the editor "Atom" you can select the parts you want to shift and press `shift` and `tab` at the same time.
+
+Example:
+![exampleimage](images/remaining_prices_before.png "Before") becomes ![exampleimage](images/remaining_prices_after.png "After")
+
+
+## Action "stop"
+If you have to stop working on a bill because of any problem mentioned above, report an error, delete the file (bill) with the error and start with a new bill.
+
+## Action "report an error"
+To solve problems with the recognizer we need "bad" examples. You can report any error in the following project:
+~~~
+https://github.com/clemenshelm/chillbill-recognizer/projects/6
+~~~
+Press the "+" sign in the correct column to add a new note. Insert the bill id __with the file type__ (e.g. .pdf, .jpg, .tiff, ...). You can find the file type at the end of the second line (image_url). There should be a column for every possible error. If you find an error and there is no column for it, please tell the person in charge.
+
+
+## 8. Finish
+If you did not have to stop because of an error, save the file and move it to the folder "done". Then start with a new bill :)
diff --git a/lib/machine_learning/images/remaining_prices_after.png b/lib/machine_learning/images/remaining_prices_after.png
new file mode 100644
index 0000000..f9c14fa
Binary files /dev/null and b/lib/machine_learning/images/remaining_prices_after.png differ
diff --git a/lib/machine_learning/images/remaining_prices_before.png b/lib/machine_learning/images/remaining_prices_before.png
new file mode 100644
index 0000000..420cdcd
Binary files /dev/null and b/lib/machine_learning/images/remaining_prices_before.png differ
diff --git a/lib/machine_learning/images/total_prices.png b/lib/machine_learning/images/total_prices.png
new file mode 100644
index 0000000..0f9b98b
Binary files /dev/null and b/lib/machine_learning/images/total_prices.png differ
diff --git a/lib/machine_learning/images/total_prices_candidates.png b/lib/machine_learning/images/total_prices_candidates.png
new file mode 100644
index 0000000..fa28364
Binary files /dev/null and b/lib/machine_learning/images/total_prices_candidates.png differ
diff --git a/lib/machine_learning/images/vat_prices.png b/lib/machine_learning/images/vat_prices.png
new file mode 100644
index 0000000..2d72e1a
Binary files /dev/null and b/lib/machine_learning/images/vat_prices.png differ
diff --git a/lib/machine_learning/images/vat_prices_candidates.png b/lib/machine_learning/images/vat_prices_candidates.png
new file mode 100644
index 0000000..2f2eee1
Binary files /dev/null and b/lib/machine_learning/images/vat_prices_candidates.png differ
diff --git a/lib/tasks/machine_learning/add_prices.rake b/lib/tasks/machine_learning/add_prices.rake
index ebb381d..a6bca31 100644
--- a/lib/tasks/machine_learning/add_prices.rake
+++ b/lib/tasks/machine_learning/add_prices.rake
@@ -27,7 +27,8 @@ namespace :machine_learning do
       recognizer.recognize_words(png_file)
       recognizer.filter_words
 
-      %w(total_prices_candidates total_prices vat_prices_candidates vat_prices)
+      #%w(total_prices_candidates total_prices vat_prices_candidates vat_prices)
+      %w(total_prices_candidates vat_prices_candidates)
         .each { |attr| store[attr] = {} }
 
       extractor = PriceExtractor.new
@@ -40,7 +41,7 @@ namespace :machine_learning do
           candidates = prices.send(attr)
           price_key = "#{attr}_#{vat_rate}"
           store["#{attr}_prices_candidates"][price_key] = candidates.map(&:to_h)
-          store["#{attr}_prices"][price_key] = nil
+          #store["#{attr}_prices"][price_key] = nil
         end
       end
       store['remaining_prices'] = extractor.remaining_prices.map(&:to_h)
diff --git a/lib/tasks/machine_learning/import_bill_data.rake b/lib/tasks/machine_learning/import_bill_data.rake
index fde236c..863467f 100644
--- a/lib/tasks/machine_learning/import_bill_data.rake
+++ b/lib/tasks/machine_learning/import_bill_data.rake
@@ -4,7 +4,7 @@ namespace :machine_learning do
   task :import_bill_data do
     require 'mongo'
     require 'yaml/store'
-    limit = 10
+    limit = 50
     existing_ids = Dir['data/bills/*.yml']
                    .map { |f| f.match(%r{([^\/]+)\.yml})[1] }
     client = Mongo::Client.new(ENV['MONGO_READ_URL'], ssl: true, ssl_verify: false)
@@ -20,7 +20,7 @@ namespace :machine_learning do
             'accountingRecord.amounts.0.vatRate': { '$ne': 0 }
           }
         },
-        { '$sample' => { size: 10 } }
+        { '$sample' => { size: limit} }
       ]
     )
       .each do |bill|
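
Step 2 of `datageneration_workflow.md` asks the reviewer to check by hand that the text box coordinates are relative values between 0 and 1, with the origin at the top left corner. The range part of that check could be pre-screened automatically; the following is only a minimal sketch and not part of this change. The helper name `text_box_problems` and the flat key layout of the bill hash are assumptions made for illustration, and the ordering checks (top above bottom, left before right) are inferred from the stated origin rather than taken from the repository.

```ruby
require 'yaml'

# Keys named in step 2 of the workflow document.
TEXT_BOX_KEYS = %w[text_box_top text_box_bottom text_box_left text_box_right].freeze

# Hypothetical helper: returns a list of human-readable problems.
# An empty list means the text box looks plausible.
# `bill` is a Hash as loaded from a bill YAML file (flat layout assumed).
def text_box_problems(bill)
  problems = []
  TEXT_BOX_KEYS.each do |key|
    value = bill[key]
    next if value.is_a?(Numeric) && value.between?(0, 1)

    problems << "#{key} is #{value.inspect}, expected a number between 0 and 1"
  end
  # Skip the ordering checks if a value is missing or out of range.
  return problems unless problems.empty?

  # With the origin in the top left corner, top < bottom and left < right.
  if bill['text_box_top'] >= bill['text_box_bottom']
    problems << 'text_box_top is not above text_box_bottom'
  end
  if bill['text_box_left'] >= bill['text_box_right']
    problems << 'text_box_left is not left of text_box_right'
  end
  problems
end

# Made-up example values, not taken from a real bill.
example = YAML.safe_load(<<~YML)
  text_box_top: 0.12
  text_box_bottom: 0.95
  text_box_left: 0.08
  text_box_right: 1.3
YML

puts text_box_problems(example)
```

A bill for which the list is non-empty would be a candidate for the "Wrong text_box dimensions" column; the visual comparison with the scan still has to be done by a person.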