Too restrictive file mode for mets.xml #403

stweil · 2020-01-14T09:46:42Z

When mets.xml is created or modified by ocrd, the new file mode only allows access by the owner which is typically not sufficient for practical use:

-rw------- 1 stweil stweil 822382 Jan 13 22:43 mets.xml

So it is always necessary to add a chmod command after processing with OCR-D.

The code should use the default file mode which respects umask.

The text was updated successfully, but these errors were encountered:

stweil · 2020-01-14T10:35:22Z

The problem is caused by atomic_write.

stweil · 2020-01-14T10:38:54Z

https://www.oreilly.com/library/view/python-cookbook/0596001673/ch04s25.html shows a portable alternative solution.

Do we need atomic writes here at all? Just using open would fix the problem, too, and we could remove a dependency. This would require documentation that certain commands (for example ocrd workspace add) must not run in parallel for the same METS file.

I doubt that atomic_write does what is really needed, namely allowing any number of processes modifying and reading the METS file without loosing written data or reading an inconsistent file. So maybe the problem is not the restrictive file mode, but incomplete file locking:

When reading a METS file, a lock is needed which only allows read access for any process.
When modifying a METS file, it must be opened with a lock for exclusive access before reading and writing.
If a lock fails, the program must wait until it succeeds to get the required lock or fail after a timeout.

kba · 2020-01-14T17:53:06Z

I doubt that atomic_write does what is really needed

You can check the code here, it's really short https://github.com/untitaker/python-atomicwrites/blob/master/atomicwrites/__init__.py

I don't want to move away from atomicwrites unless there is a real reason not to use it, it seems to do what it says on the package.

The code should use the default file mode which respects umask.

There's some entry points in atomicwrites to influence this, I'll have a look. See also untitaker/python-atomicwrites#42

stweil · 2020-01-14T18:40:24Z

I don't want to move away from atomicwrites unless there is a real reason not to use it, it seems to do what it says on the package.

What is required? Example: Should it be possible to run several ocrd workspace add commands which add new entries to the same METS file in parallel?

Without trying it I would say it cannot work: Task 1 reads the data, task 2 reads / modifies / writes the data, task 1 modifies / writes the data and overwrites the data which was written by task 2. I did not see locks to prevent that scenario, and atomic_write would not help here.

stweil · 2020-01-14T18:44:27Z

You can check the code here.

I checked it before writing that I have doubts.

bertsky · 2020-01-15T00:29:01Z

What is required? Example: Should it be possible to run several ocrd workspace add commands which add new entries to the same METS file in parallel?

One of the reasons for this was safe SIGINT (ctrl+c) IIRC.

stweil · 2020-01-15T13:48:41Z

SIGINT handling could be done with signal.signal. I think it would be sufficient to ignore SIGINT while writing the METS file. Or do we expect that this file can be so large that it should be possible to abort the writing?

Maybe it would even make sense to block SIGINT while processing a page and handle it only after a page was processed.

kba · 2020-01-15T13:59:48Z

I think it would be sufficient to ignore SIGINT while writing the METS file

That was my initial approach but I was happy to delegate this to a third-party library that handles the nitty-gritty details.

stweil · 2020-01-15T14:29:55Z

I see. So the have several requirements to address:

Safe handling of SIGINT
Support for parallel r/w access
Correct file mode

bertsky · 2020-01-15T14:42:14Z

Could also be SIGABRT, SIGTERM, SIGSEGV, SIGIO, SIGPIPE etc.

Maybe it would even make sense to block SIGINT while processing a page and handle it only after a page was processed.

No, ctrl+c should have immediate effect! Think of a badly written processor...

I don't think parallel access is really the issue here – as you said yourself, with the current API, which allows reading METS separate/independent from the modifications to be made, this is impossible, and we could make use of that kind of parallelization only in non-linear workflows anyway.

stweil · 2020-01-15T20:40:36Z

Could also be SIGABRT, SIGTERM, SIGSEGV, SIGIO, SIGPIPE etc.

Of course only few of those can occur during a file write.

I don't think parallel access is really the issue here.

If we agree that parallel access is not an issue but simply unsupported for the moment, that simplifies things a lot and we only have to document it clearly.

But some day people will ask whether on ocrd process can be made faster. This is less important as long as I have to process many books, so its possible to parallelize on the book level. It gets important if I have a single book and are waiting for the result. Then the single processes could run in parallel with a slight offset, so a page which was processed by the first process could be passed to the second process while the first process starts the next page and so on.

bertsky · 2020-01-15T21:35:35Z

Yes, page pipelining is another desideratum for a better API.

But for certain kinds of GPU-dependent processors, document/book-parallel processing (with a data generator that queues entire books into batches) might be the only efficient option.

Right now all that processors can do is parallelize across the line or page (which is not possible for stateful/contiguous processors).

I would also like to see this recurring theme documented and accessible (perhaps as a permament issue in spec)...

stweil · 2020-09-06T19:58:04Z

This is still an issue:

21:47:30.044 INFO ocrd.resolver - Writing METS to /tmp/xyz/mets.xml
open("/tmp/xyz/mets.xml", O_WRONLY|O_CREAT|O_TRUNC|O_CLOEXEC, 0666) = 3
stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=2335, ...}) = 0
21:47:30.046 INFO ocrd.workspace - Saving mets '/tmp/xyz/mets.xml'
open("/tmp/xyz/tmpzz0d09k3", O_RDWR|O_CREAT|O_EXCL|O_NOFOLLOW|O_CLOEXEC, 0600) = 3
open("/tmp/xyz/tmpzz0d09k3", O_WRONLY|O_CREAT|O_TRUNC|O_CLOEXEC, 0666) = 3
rename("/tmp/xyz/tmpzz0d09k3", "/tmp/xyz/mets.xml") = 0

The strace protocol shows that first mets.xml is created with mask 0666, but later another file which was created with mask 0600 is renamed to mets.xml. So the final file can only be read by the user who created it. Would a chmod in the Python code after the rename be an acceptable workaround until there is a better solution?

stweil · 2020-09-19T21:16:28Z

See pull request #608 for a possible fix.

bertsky · 2020-09-21T11:19:46Z

Would a chmod in the Python code after the rename be an acceptable workaround until there is a better solution?

As the discussion on untitaker/python-atomicwrites#42 shows, that might create another race IIUC.

stweil · 2020-09-21T11:29:22Z

As far as I see that race condition is only relevant for massiv parallel writes to the METS file. I currently don't have such use cases, and I am afraid that atomic_write would not correctly support them, too, because changes of the METS file might get lost.

bertsky · 2020-09-21T11:35:26Z

As far as I see that race condition is only relevant for massiv parallel writes to the METS file. I currently don't have such use cases, and I am afraid that atomic_write would not correctly support them, too, because changes of the METS file might get lost.

Ah, right, I remember now. We only use this for atomic signal handling, not against potential races.

bertsky · 2020-11-04T00:02:48Z

Fixed by #625

stweil changed the title ~~Wrong file mode for mets.xml~~ Too restrictive file mode for mets.xml Jan 14, 2020

stweil mentioned this issue Sep 8, 2020

Processor tesserocr-segment-line terminates with exception (TopologyException: Input geom 1 is invalid) OCR-D/ocrd_tesserocr#149

Closed

bertsky linked a pull request Sep 21, 2020 that will close this issue

Fix mode of METS file after atomic write (fix issue #403) #608

Closed

bertsky mentioned this issue Oct 23, 2020

Add fontshape processor and all-in-one segmentation OCR-D/ocrd_tesserocr#158

Merged

bertsky closed this as completed Nov 4, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Too restrictive file mode for mets.xml #403

Too restrictive file mode for mets.xml #403

stweil commented Jan 14, 2020 •

edited

Loading

stweil commented Jan 14, 2020

stweil commented Jan 14, 2020 •

edited

Loading

kba commented Jan 14, 2020

stweil commented Jan 14, 2020 •

edited

Loading

stweil commented Jan 14, 2020 •

edited

Loading

bertsky commented Jan 15, 2020

stweil commented Jan 15, 2020

kba commented Jan 15, 2020

stweil commented Jan 15, 2020

bertsky commented Jan 15, 2020

stweil commented Jan 15, 2020

bertsky commented Jan 15, 2020

stweil commented Sep 6, 2020

stweil commented Sep 19, 2020

bertsky commented Sep 21, 2020

stweil commented Sep 21, 2020

bertsky commented Sep 21, 2020

bertsky commented Nov 4, 2020

Too restrictive file mode for mets.xml #403

Too restrictive file mode for mets.xml #403

Comments

stweil commented Jan 14, 2020 • edited Loading

stweil commented Jan 14, 2020

stweil commented Jan 14, 2020 • edited Loading

kba commented Jan 14, 2020

stweil commented Jan 14, 2020 • edited Loading

stweil commented Jan 14, 2020 • edited Loading

bertsky commented Jan 15, 2020

stweil commented Jan 15, 2020

kba commented Jan 15, 2020

stweil commented Jan 15, 2020

bertsky commented Jan 15, 2020

stweil commented Jan 15, 2020

bertsky commented Jan 15, 2020

stweil commented Sep 6, 2020

stweil commented Sep 19, 2020

bertsky commented Sep 21, 2020

stweil commented Sep 21, 2020

bertsky commented Sep 21, 2020

bertsky commented Nov 4, 2020

stweil commented Jan 14, 2020 •

edited

Loading

stweil commented Jan 14, 2020 •

edited

Loading

stweil commented Jan 14, 2020 •

edited

Loading

stweil commented Jan 14, 2020 •

edited

Loading