
Benchmarking workflows #65

Open
markusweigelt opened this issue Aug 4, 2023 · 2 comments

Comments

@markusweigelt
Collaborator

  • measure workflow runtime
  • possibly integrate a measurement tool
  • provide benchmark data
@bertsky
Member

bertsky commented Aug 14, 2023

There are two types of data here:

  • metadata about the workflow (how often it ran, how many minutes per page on average, what quality score on average)
  • metadata about the processes (number of pages, CPU time, peak memory, estimated quality score of result)

The former depends on the latter, for which we rely on the Controller's (i.e. OCR-D's) internal mechanisms to collect the primary data. By default (currently), the ocrd.log file in the workspace contains the runtime data (CPU time and peak memory), but one would still need to aggregate it across the individual processing steps. Alternatively, we could install a custom ocrd_logging.conf in the Controller that sends the profiling messages to an external syslogd (on the Manager). Either way, we must then parse the log messages generated by the ocrd.process.profile logger into our database; see the sketch below.
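A minimal sketch of that aggregation step, assuming we read ocrd.log directly: it collects the profiling messages per processor and sums CPU/wall time while tracking peak memory. Note that the message pattern used here is only an illustrative assumption and would have to be adapted to what OCR-D core actually logs; likewise, the results would be written to our database rather than printed.

```python
import re
from collections import defaultdict

# NOTE: the exact wording of ocrd.process.profile messages is an assumption here;
# adapt the pattern to the actual log format produced by OCR-D core.
PROFILE_RE = re.compile(
    r"Executing processor '(?P<processor>[^']+)' took "
    r"(?P<wall>[\d.]+)s \(wall\).*?(?P<cpu>[\d.]+)s \(CPU\).*?"
    r"(?P<mem>[\d.]+) MiB"
)

def aggregate_profile(logfile="ocrd.log"):
    """Aggregate CPU time, wall time and peak memory per processor from an ocrd.log file."""
    totals = defaultdict(lambda: {"runs": 0, "cpu_s": 0.0, "wall_s": 0.0, "peak_mib": 0.0})
    with open(logfile, encoding="utf-8") as log:
        for line in log:
            # only the profiling messages are of interest
            if "ocrd.process.profile" not in line:
                continue
            match = PROFILE_RE.search(line)
            if not match:
                continue
            stats = totals[match["processor"]]
            stats["runs"] += 1
            stats["cpu_s"] += float(match["cpu"])
            stats["wall_s"] += float(match["wall"])
            stats["peak_mib"] = max(stats["peak_mib"], float(match["mem"]))
    return dict(totals)

if __name__ == "__main__":
    for processor, stats in aggregate_profile().items():
        print(processor, stats)
```

The workflow-level metadata (runs, average minutes per page, average quality score) could then be derived by joining these per-step records with the page counts and evaluation results already stored in the database.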

@markusweigelt
Collaborator Author

The motivation for this issue originated from an OCR-D call, where it was mentioned that "we welcome any benchmark data", presumably in order to optimize the quality of the processors. So I assume there is already a way to evaluate such data and make it comparable. We could provide this data as well. Or is this simply the wrong approach, and does the Controller already come with monitoring?

Some ideas:

  • Identify which data is needed or would assist in the optimization.
  • Perhaps it would make sense to install the benchmarking tool directly on the Controller to generate this data. Does that sound reasonable?
