Files marked .bad after "Too many open files" read error #1

rdingwall · 2017-04-18T12:04:04Z

The current diskqueue implementation does not distinguish between the different types of error returned by os.OpenFile(). We hit a problem in production where an nsqd host ran out of file descriptors and os.OpenFile() returned a "Too many open files" error. As a result, a number of NSQ diskqueue files were incorrectly marked as .bad and had to be repaired manually (by renaming them and updating the cursor position in the metadata files). It would be good if NSQ could distinguish transient "try again" read errors from other, more serious errors.

Unfortunately I don't have a PR to fix this; I just thought I would flag it up.


mreiferson · 2017-04-22T22:37:55Z

Hi, thanks for the feedback.

I'm not sure what you would expect the behavior to be in that case. The only possible behavior I can think of that would avoid needing to manually repair files would be if nsqd refused to proceed until it was restarted — is that desirable vs. current behavior?

rdingwall · 2017-04-30T09:24:34Z

If nsqd encountered EMFILE or ENFILE on a diskqueue file, I would expect the behaviour to be something along the lines of:

- Client publishes a message: return an error to the client
- New client subscribing: return an error to the client
- Existing client subscription: either graceful recovery (subscription goes quiet until the file is readable again) or terminate the subscription and let the client reconnect

I think?

mreiferson · 2017-05-03T00:08:37Z

Unfortunately, given nsqd's implementation (memory and disk) this is a little more complicated:

> Client publishes a message: return an error to the client

Not all messages reach disk and so it would only make sense to error when a message "overflowed" to disk — which is basically how nsqd currently behaves.
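As a rough illustration of the "overflow" behaviour described here, the sketch below tries an in-memory buffer first and only touches disk when that buffer is full. It is a simplified model, not nsqd's source: the topic, backend, and failingDisk types are all hypothetical.

```go
package main

import (
	"errors"
	"fmt"
)

// backend is a hypothetical stand-in for the disk queue.
type backend interface {
	Put(msg []byte) error
}

// failingDisk simulates a disk queue that cannot open its files.
type failingDisk struct{}

func (failingDisk) Put([]byte) error {
	return errors.New("too many open files")
}

// topic is a simplified model: a bounded in-memory queue plus a disk backend.
type topic struct {
	memory chan []byte
	disk   backend
}

// put tries the in-memory queue first and falls back to disk only when
// memory is full, so a disk error surfaces only for overflowing messages.
func (t *topic) put(msg []byte) error {
	select {
	case t.memory <- msg:
		return nil
	default:
		return t.disk.Put(msg)
	}
}

func main() {
	t := &topic{memory: make(chan []byte, 1), disk: failingDisk{}}
	fmt.Println(t.put([]byte("a"))) // fits in memory, no disk access
	fmt.Println(t.put([]byte("b"))) // memory full, overflows to the failing disk
}
```

In this model the first publish succeeds even though the disk is unusable, which is why a disk error cannot simply be mapped to "every publish fails".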

> New client subscribing: return an error to the client

> Existing client subscription: either graceful recovery (subscription goes quiet until the file is readable again) or terminate the subscription and let the client reconnect

As per above, since there are commonly cases where most messages don't reach disk, it doesn't make sense to penalize all clients of the topic.

If all messages went to disk, e.g. if nsqd moved to a "log" style architecture, it would be far easier to make decisions about what appropriate behavior is for situations like this.

mreiferson added the question label May 3, 2017
