Concurrent updates to file data field (e.g. derivatives) #665

ibrahima · 2023-12-18T20:09:18Z

Hi! I have been looking at Shrine for a while and recently joined a project that's using it for image processing. I noticed that our codebase uses database locks when processing derivatives, presumably to guard against concurrent processes (e.g. different background jobs) generating derivatives at the same time and producing incorrect data: if the second writer reads stale data, it might, say, drop one derivative when updating another. It seems like the derivatives plugin tries to handle concurrent updates, but it only does so with a Mutex, which, if I understand correctly, protects against multiple threads updating the same record but does not help when different worker processes try to update the same record. However, I feel like the database locking solution is a potential source of future pain, because in a high-concurrency situation it could amplify the load on the system as a bunch of writers get stuck waiting for locks.
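To make the lost-update scenario concrete, here is a minimal sketch in plain Ruby. It does not use Shrine at all: the `column` string is a hypothetical stand-in for the single JSON column, and the two "workers" are just two copies read before either one writes.

```ruby
require "json"

# Hypothetical stand-in for a single JSON column (e.g. image_data) on one row.
column = JSON.generate({ "id" => "original.jpg", "derivatives" => {} })

# Two workers each read the column before either has written anything.
worker_a = JSON.parse(column)
worker_b = JSON.parse(column)

# Each adds its own derivative to its private, now stale, copy...
worker_a["derivatives"]["small"] = { "id" => "small.jpg" }
worker_b["derivatives"]["large"] = { "id" => "large.jpg" }

# ...and writes the whole document back. The last writer wins.
column = JSON.generate(worker_a)
column = JSON.generate(worker_b)

final_derivatives = JSON.parse(column)["derivatives"].keys
puts final_derivatives.inspect # the "small" derivative is gone
```

An in-process Mutex cannot prevent this when the two workers live in different processes, since each process would have its own Mutex.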

It seems like, at least for Postgres, there is a jsonb_set function which could handle fine-grained updates to the relevant derivative's field without the potential for dropping updates to other derivatives or other JSON fields. But I don't know if that's something that's available across ActiveRecord backends...
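As a rough illustration of what such a fine-grained update could look like, the sketch below only builds the SQL text; it does not connect to a database and is not something Shrine provides. The helper name `derivative_update_sql` and the `photos` / `image_data` names are assumptions made up for the example.

```ruby
# Builds a parameterized UPDATE that touches only one derivative key.
# Hypothetical helper: table and column names are illustrative only.
def derivative_update_sql(table:, column:, derivative:)
  # Identifiers are interpolated into the SQL, so only allow simple names.
  unless [table, column, derivative].all? { |name| name.match?(/\A[a-z_][a-z0-9_]*\z/) }
    raise ArgumentError, "identifiers must be simple lowercase names"
  end

  # $1 is the derivative's JSON, $2 is the record id (bound by the driver).
  <<~SQL
    UPDATE #{table}
    SET #{column} = jsonb_set(#{column}, '{derivatives,#{derivative}}', $1::jsonb, true)
    WHERE id = $2
  SQL
end

sql = derivative_update_sql(table: "photos", column: "image_data", derivative: "small")
puts sql
```

Because the merge happens inside the UPDATE statement itself, it does not depend on a value the application read earlier. One caveat: jsonb_set only creates the last element of the path, so the `derivatives` key would already have to exist in the column.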

I think in general storing a lot of data on a JSONB field feels kinda risky to me due to possibilities like this, so I'm also wondering if there's a way to store this data in say, a separate table instead. I didn't find anything obvious right now but I haven't really dived into Shrine much yet (besides researching it several years ago for a different project).

I want to stress that I'm just thinking out loud here, and only came across this potential issue about half an hour ago. I don't know if we're just doing something wrong to need to lock these DB records, so it may be a non-issue for most people. (Maybe we should just avoid concurrent derivative processing or something? IDK.) Thanks for humoring me with this discussion!

Edit: After reading some docs, I did find this: https://shrinerb.com/docs/plugins/persistence#atomic-persistence. So maybe we need to be using that instead? Right now we are just calling record.image_derivatives! and it seems like we needed to throw a DB lock around that, so it seems like to use atomic_persist we'd need to go a little lower level?

jrochkind · 2023-12-18T23:27:22Z

Collaborator

Hey @ibrahima, I've been thinking about these issues for a while too. To be clear, I am not a maintainer of shrine, just another user like you, but a long-time user.

I don't really have An Answer, but I've spent a while thinking and working on what you're talking about too, so wanted to respond to share what I know, and agree that you are on the right track and I think basically correct in your understanding.

Yes, that mutex in derivatives doesn't help with cross-process stuff. I'm not sure what it's doing there... I guess just to catch the simple case when you are using multiple threads in a process to do concurrent derivative processing? But you are right that it doesn't handle multi-process derivative processing in bg workers.

Shrine does have another mechanism in place to do a kind of "optimistic locking" on changes to underlying files. It wasn't initially documented very well (the fact that it's defined kind of abstractly depending on DB adapter doesn't help), so some time ago after trying to figure out how it worked, I actually contributed some documentation here: https://shrinerb.com/docs/plugins/atomic_helpers

I wrote most of that! Take a look and see what you think?

I agree with you that database-level pessimistic locking is a performance risk. Some kind of "optimistic locking" is better... those "atomic helpers" are meant to give you some tools for that... but it does get really confusing quickly.

I think the basic answer at this point is that shrine itself does not have a built-in concurrency-safe solution for concurrent derivative generation, you are on your own, although it does have those helpers that could be a start...

In my own code (not part of shrine), I tried to build out some additional logic tooling on top for your specific use case of adding derivatives... and I tried to do it as a shrine plugin... I think it works, but it got pretty convoluted and confusing, check out my code here: https://github.com/sciencehistory/kithe/blob/master/lib/shrine/plugins/kithe_persisted_derivatives.rb

I do think I solve the problem you are talking about, but I guess I'm not 100% sure of that, and it's pretty wacky.

I am not sure how/if shrine could use the postgres jsonb_set function... I've spent some time thinking through that too. Shrine is already trying to handle multiple different ORMs, and then to try to add support for multiple underlying databases too (that function is pg-specific)... I'm not sure.

Another project where I am the maintainer, attr_json, is all about using jsonb with ActiveRecord, and I've tried to figure out some way to hack the postgres jsonb set functions into ActiveRecord in a generalized/automatic way for partial updates of attributes serialized into json... but have not been able to figure it out! It might be more possible to get shrine to do it with very specific use-cases, but it's hard to get it to play well with ActiveRecord.

If you are using ActiveRecord, one solution would be using ActiveRecord's own built-in "optimistic locking" feature (https://api.rubyonrails.org/classes/ActiveRecord/Locking/Optimistic.html) instead of the db-level pessimistic locking you are doing. It shouldn't be too hard to wire up derivative updating that catches AR optimistic locking failures, then reloads and reapplies changes... but the way the AR feature works, you'd have to turn on Optimistic Locking for the entire model shrine is in, and deal with it generally for that model, which I haven't wanted to do yet, but keep considering it.
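The catch, reload, and reapply loop could look roughly like the sketch below. It is plain Ruby with no ActiveRecord or Shrine involved: `VersionedRow`, `StaleWrite`, and `add_derivative` are hypothetical names invented for the example, standing in for a model with a `lock_version` column and for `ActiveRecord::StaleObjectError`.

```ruby
# Hypothetical in-memory stand-in for a row with a version column.
class StaleWrite < StandardError; end

class VersionedRow
  attr_reader :data, :version

  def initialize(data)
    @data = data
    @version = 0
  end

  # Returns a deep-copied snapshot plus the version it was read at.
  def read
    [Marshal.load(Marshal.dump(@data)), @version]
  end

  # The write only succeeds if nobody has written since `expected_version`.
  def write(new_data, expected_version)
    raise StaleWrite unless expected_version == @version

    @data = new_data
    @version += 1
  end
end

def add_derivative(row, name, file, max_attempts: 3)
  attempts = 0
  begin
    attempts += 1
    data, version = row.read
    yield attempts if block_given? # hook so the example can interleave another writer
    data["derivatives"][name] = file
    row.write(data, version)
  rescue StaleWrite
    retry if attempts < max_attempts # reload and reapply the change
    raise
  end
end

row = VersionedRow.new({ "derivatives" => {} })

add_derivative(row, "small", "small.jpg") do |attempt|
  # On the first attempt only, another writer sneaks in after our read.
  add_derivative(row, "large", "large.jpg") if attempt == 1
end

puts row.data["derivatives"].keys.sort.inspect
```

No writer ever waits on a lock here; a conflicting write just costs one extra read and retry, which is the trade-off compared with pessimistic locking.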

The idea of shrine using multiple columns or a separate table instead of a json column... the single json column approach is pretty baked into shrine's architecture at this point. I don't think it would be easy to make flexible, or to change, and changing it would involve some pretty serious trade-offs too -- whole new problems with keeping the multiple columns consistent with each other!

Although just for derivatives... you certainly could ignore shrine's derivative feature entirely, and just implement your own based on a separate associated table where each row has its own shrine original file that serves as a derivative, just using shrine single files that you arrange yourself into columns/tables/associations how you want. I've thought of that, but not done it either. It would also have its own tricks.

I think basically it's a problem that shrine doesn't give you a solution for, but it is flexible enough to let you try to solve it in various different ways yourself, depending on your specific needs and the nature of your use... but it's not that easy!

3 replies

ibrahima Dec 18, 2023
Author

Hey, thanks for the detailed response! That is very helpful, it's nice to see that someone else has thought through these problems. I will need to look into your plugin code later, perhaps it could be useful for us. I think the optimistic locking approaches could also be useful.

Yeah, it doesn't surprise me that the single JSON column is pretty core to the architecture of Shrine. I was mostly thinking out loud there. Though like you said, maybe it would be feasible to write an alternative derivatives plugin that tackles that. I think (again, based on my very limited recent experience with Shrine) that for derivatives specifically, a multi-record approach could be implemented relatively easily without having to worry too much about keeping columns consistent (e.g. if we only allow one of each type of derivative per file, upserts should be able to take care of that I think?).
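A toy model of that idea, in plain Ruby with no database: `DerivativeTable` is a hypothetical stand-in for a separate table with a unique index on (file_id, name), and its Mutex only plays the role of the database applying each upsert atomically.

```ruby
# Hypothetical stand-in for a derivatives table with a unique index on [file_id, name].
class DerivativeTable
  def initialize
    @rows = {}
    @lock = Mutex.new # stands in for the database applying each upsert atomically
  end

  # Insert the row, or overwrite the existing row with the same key.
  def upsert(file_id:, name:, data:)
    @lock.synchronize { @rows[[file_id, name]] = data }
  end

  # All derivatives for one file, keyed by derivative name.
  def for_file(file_id)
    @lock.synchronize do
      @rows.each_with_object({}) do |((id, name), data), result|
        result[name] = data if id == file_id
      end
    end
  end
end

table = DerivativeTable.new

# Three concurrent writers, each touching only its own row.
threads = %w[small medium large].map do |name|
  Thread.new { table.upsert(file_id: 1, name: name, data: { "id" => "#{name}.jpg" }) }
end
threads.each(&:join)

# Regenerating one derivative replaces just that row.
table.upsert(file_id: 1, name: "small", data: { "id" => "small-v2.jpg" })

derivatives = table.for_file(1)
puts derivatives.keys.sort.inspect
```

Since each writer only touches the row for its own derivative, there is no shared document to read, modify, and write back, so there is nothing for a stale read to clobber.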

This is all very helpful, and I'll be sure to keep it in mind if we need to address this in the future. Thank you very much!

jrochkind Dec 19, 2023
Collaborator

For the multi-record derivatives approach, I think I wouldn't bother writing a shrine plugin, I'd just write application-level code to DO it. Like do derivatives without shrine's derivatives feature at all, as just your app keeping track of multiple shrine single files and how they relate to each other (some of them are derivatives of others), and it's your app's responsibility to keep them in sync.

A lot of this gets harder when we try to make it generalizable abstract magic API, and gets easier if you just... write the code how you want it.

ibrahima Dec 19, 2023
Author

That's a good point, makes a lot of sense to do it that way!

janko · 2023-12-20T20:34:53Z

Maintainer

There is a guide section showing how to process derivatives concurrently. It relies on database locks, which Shrine::Attacher#atomic_persist (which you already found) uses internally.

A Postgres-specific function such as jsonb_set could probably be used, but it would have to be a special case. I feel like it would be hard to add support for it now, especially since it would only cover the Postgres + JSONB scenario.

Honestly, storing files as separate records would probably be a great idea, as it would eliminate the need for DB locking. It's something I've admired in Active Storage design and wished I did in Shrine. Unfortunately, at this point it would be too difficult to shoehorn this into Shrine, as a lot of the existing functionality relies on a JSON column. So, if you would like this, I would recommend using Active Storage instead.

2 replies

jrochkind Dec 21, 2023
Collaborator

Nice, I hadn't seen that guide.

I didn't realize or forgot that Shrine::Attacher#atomic_persist used database locks... but/and I always have trouble actually finding the (database-specific?) actual source code logic that winds up being executed for atomic_persist... if it's convenient for you, could you link to the line that calls a database lock in, say, the activerecord implementation? I have not managed to find it!

Ah, wait I think I found it. Here?

shrine/lib/shrine/plugins/activerecord.rb, line 108 in bf43c6d:

record.transaction { yield record.clone.reload(lock: true) }
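The general shape of that line (take a lock, reload a fresh copy, apply the change to the fresh copy) can be modeled in plain Ruby as below. This is only a sketch of the pattern, not Shrine's implementation: `Row` and `with_locked_reload` are hypothetical names, and the Mutex stands in for the row lock that `reload(lock: true)` would take in the database.

```ruby
require "json"

# Hypothetical stand-in for one database row holding a JSON column.
class Row
  def initialize(json)
    @json = json
    @row_lock = Mutex.new # plays the role of the database row lock
  end

  # Yields a freshly parsed copy while holding the lock, then stores the block's result.
  def with_locked_reload
    @row_lock.synchronize do
      fresh = JSON.parse(@json)
      @json = JSON.generate(yield(fresh))
    end
  end

  def data
    JSON.parse(@json)
  end
end

row = Row.new(JSON.generate({ "derivatives" => {} }))

workers = %w[small medium large].map do |name|
  Thread.new do
    # The slow processing step happens outside the lock.
    processed = { name => { "id" => "#{name}.jpg" } }

    # Only the short merge-and-save step runs while the lock is held.
    row.with_locked_reload do |fresh|
      fresh["derivatives"].merge!(processed)
      fresh
    end
  end
end
workers.each(&:join)

names = row.data["derivatives"].keys.sort
puts names.inspect
```

Each writer merges into the state it reloaded under the lock rather than into the copy it started with, so no derivative is dropped, at the cost of writers queueing briefly for the lock.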

jrochkind Dec 21, 2023
Collaborator

At the time we were discussing shrine 3, I think you were really taken with the idea that the derivatives implementation allowed multiple-levels of nesting of derivatives, like a derivative key could itself point to a hash or array of file values... I doubt ActiveStorage implementation can do that!

Before we get to derivatives, I really like that Shrine (like paperclip, carrierwave, etc., before it) stores the file reference in place, not in a separate table that other models need to associate to, like activestorage does. But it does come with implementation tradeoffs for sure.

If you didn't like the tradeoffs and preferred the activestorage layout, it would not be too hard to implement that as a layer on top of shrine: a Files table where each row has one json file reference, with a to-many association to a FileDerivatives table where each row ALSO has one json file reference... you could easily use shrine to build something which had an rdbms layout more like activestorage (but still with json... just not with derivatives nested in the json). Shrine is actually more general/flexible in that way; the activestorage table-relation layout is a specialization of it.
