Moped is utilizing 'bad' Connections from the connection Pool #346
Comments
I also opened #348 which seems possibly related. |
I'm not sure how much this can help, but most of my errors are related and look like this:
I've changed my pool timeout, but it still seems to be using 0.5.
Not sure if it is because authenticate_user_from_token is the most used one, but the error seems to come mainly from there. It looks up a user in the db and signs in with Devise. Two more things may be related to the error. The way I look up the user:
def find_by_username username
  Rails.cache.fetch(username, :expires_in => 30.seconds) do
    self.where(username: /^#{username}$/i).first
  end
end |
I was able to fix it. In my case the error turned out to be two things. First, the pool_size and pool_timeout weren't large enough; second, for some reason my postfix queue was sending old messages as new ones, so from my point of view none of my modifications had any effect. Anyway, with a small pool_size and pool_timeout you should get those errors; I think that is the expected behaviour. |
@argami What are the values you use? |
The "not master and slaveOK=false" after a stepdown or restart of the primary was fixed for us by #351 . |
@dblock these are my values:
options:
  pool_size: 200
  pool_timeout: 15
  retry_interval: 1
  timeout: 60
  max_retries: 20
  refresh_interval: 30
But because I had a lot of noise from the mail stuff, the last value I used (100) was probably good too. I'm using Puma with 3 workers and threads 0,8. |
This is fixed in #351 |
Thanks @mateusdelbianco! Looking forward to a moped release soon with these. |
You should thank @wandenberg for his patch! :) |
So are BOTH issues fixed? The master/slave issue AND the one where the ConnectionPool gem throws the ConnectionPool::PoolShuttingDownError when you restart a VM where a mongo primary is running? If so when will a new release of Moped be cut? Thanks! |
Please see my PR #348 |
Still seeing this error when a MongoDB node is killed. Moped should evict this bad connection. I am assuming, since this issue is still open, that this was NOT resolved with 2.0.4. I am using Mongoid 4.0.2, Moped 2.0.4 and Connection_Pool 2.1.1. Any idea when this will be addressed? Thanks!
2015-02-23T15:06:11-07:00 INFO : [ebdb5491-645e-49b7-9535-07ec2412573f] [ 10.0.32.86] [TBD] [visitor_id: 140555603] Completed 500 Internal Server Error in 21ms |
Still seeing it with Moped 2.0.4 as well |
To provide more information on this issue: I have a 3-node replica set (node1, node2 and node3). Node1 is the primary. When node2 (a secondary node) is shut down following the instructions on MongoDB's website for upgrading a replica set, Mongoid/Moped throws a ConnectionPool::PoolShuttingDownError when it attempts to grab a connection. WHY would it throw this error if the error is generated from a SECONDARY node that should NOT be in the pool anyway? The mongoid.yml default read option is primary. It should never be creating pool connections to SECONDARY nodes. It appears to me that Mongoid/Moped is throwing SECONDARY node exceptions to the client when the read preference is PRIMARY. And if this error is thrown from logic that is monitoring the replica set, it should NEVER be returned to the client. Mongoid/Moped should always return a connection to the PRIMARY node if the primary node is UP and AVAILABLE. Seems like a fairly MAJOR bug. |
Can you try it using my branch and let me know the results? I've added more logging to moped and integrated a few pull requests that have not been integrated into Moped master/gem. We are using this in production without issue, although there are still some wonky things with IP changes etc:
|
Will do. Thx. In reply to @niedfelj's suggested branch: gem 'moped', :git=>'git://github.com/niedfelj/moped.git', :branch=>'all_patches' |
@niedfelj I tried your |
@niedfelj I tried this branch and it appears the ConnectionPool::PoolShuttingDownError has gone away, BUT when stepping down the primary, I now see these errors: Moped::Errors::ConnectionFailure (Could not connect to a primary node for replica set #<Moped::Cluster:1553548 @seeds=[<Moped::Node resolved_address="ip1:33478">, <Moped::Node resolved_address="ip2:33478">, <Moped::Node resolved_address="ip3:33478">]> where ip1, ip2 and ip3 are our internal IP addresses for each MongoDB node in the replica set. To test, I had our DBAs shut down each secondary node independently and restart it. I saw no connection pool shutting down errors. I then had all nodes online and had them do a stepdown of the primary node. Then I started getting the "Could not connect to a primary node" error. Version 2.0.4 of Moped appears to have solved this issue, but I believe your branch/patches contain the could-not-connect-to-primary issue. Please advise. Thanks for your help! I think this is getting closer to working. If you merge your changes with the 2.0.4 changes, we may have something that is close to working... |
@steve-rodriguez moped 2.0.4 without @niedfelj's patch also throws |
The 2.0.4 branch does throw those, BUT I believe it actually does reconnect when retrying. I will do some more testing to verify. |
OK, here's the scoop on reproducing this issue. You have a 3-node replica set (n1-primary, n2-secondary and n3-secondary).
TEST1 - Good Behavior: If you have just restarted your app and have ALL good connections in your connection pool, you can do an rs.stepDown() to elect a new primary. The app will work properly and you get no errors.
BUT, if you shut down one of the secondaries (n2), restart it, AND then do an rs.stepDown() so that n2 is now the primary (n2-primary, n1-secondary, n3-secondary), the app will attempt to talk to the primary and will throw the Moped::Errors::ConnectionFailure: Could not connect to a primary node error. I have my number of retries set to 20, and I can see it count down, retrying every second, and then it fails after 20 tries. So I suspect that Moped is somehow mistakenly keeping around connections that still think n1 is primary when it is actually now a secondary. |
Hi @steve-rodriguez, I'm trying to reproduce the problem and solve it, but I haven't been able to. |
@wandenberg Here is our environment:
Here is my Mongoid.yml settings:
If you need anything else, let me know.. Will be doing more testing today to see if I can further isolate the issue.. Thanks! -Steve |
Here are some screen prints of the steps to reproduce the issue. Read preference is primary. Hope my description and screen prints help. Each screen print starts with a number corresponding to one of the 7 steps, in sequential order, to reproduce the issue. I think that Moped is somehow creating pool connections to the secondary node, but ONLY gives back connections to the primary node. I believe that Moped keeps track of the 3 nodes (1 primary and 2 secondaries). When a secondary is brought down, all is good, as Moped is reading from the primary. But somehow Moped keeps state saying that the secondary node is unavailable. When the secondary node is brought back up, all is good, as it is still a secondary node. Once you do a stepDown and the secondary node that was JUST down becomes primary, you start seeing occasional issues where it cannot connect to the primary node. I think this occurs when Moped returns a connection that was ORIGINALLY a connection to the secondary: that node is now primary, but since it was brought down, I believe its status remains down. |
Can you make sure you have your Rails logger set to debug and make sure Moped/Mongoid are setup like this in an initializer:
You should be able to see MOPED asking for primaries, which I don't see in your output:
I'm also running a separate version of Mongoid, because I wanted to be able to see the exceptions during operations, which both mongoid/moped often just swallow: https://github.com/mongoid/mongoid/pull/3884/files I would set your pool_timeout to 0.5; 15 is too high, and it's only the setting for "checking out" a connection (not connecting or making a connection). refresh_interval should be higher also; I would set it to at least 120 (the default is 300). And I would lower timeout to 5. I have this PR out for showing more configs (you may be interested in tuning down_interval) |
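To illustrate the point that pool_timeout only bounds the wait for a free connection at checkout, here is a minimal sketch. TinyPool and CheckoutTimeout are hypothetical stand-ins written for this example; they are not Moped's or the connection_pool gem's real classes, and the real pools may behave differently in detail.

```ruby
# Hypothetical stand-in for a connection pool (not Moped's or connection_pool's API).
class TinyPool
  class CheckoutTimeout < StandardError; end

  def initialize(size:, checkout_timeout:)
    @checkout_timeout = checkout_timeout
    @available = Queue.new
    size.times { |i| @available << "conn-#{i}" }
  end

  # Checks a connection out, yields it, and always returns it to the pool.
  def with
    conn = checkout
    begin
      yield conn
    ensure
      @available << conn
    end
  end

  private

  # Waits at most checkout_timeout seconds for a free connection.
  # Nothing here opens a socket, so this timeout says nothing about connect time.
  def checkout
    deadline = now + @checkout_timeout
    begin
      @available.pop(true) # non-blocking pop; raises ThreadError when empty
    rescue ThreadError
      raise CheckoutTimeout, "no free connection after #{@checkout_timeout}s" if now >= deadline
      sleep 0.01
      retry
    end
  end

  def now
    Process.clock_gettime(Process::CLOCK_MONOTONIC)
  end
end
```

With a pool of size 1, a second checkout made while the first is still held fails after roughly checkout_timeout seconds, which is why a large value mostly delays the error rather than preventing it.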
@niedfelj Will do this today and let you know the results and will post some log files.. |
Can you test the code at this branch on your server? It will produce some log messages like the ones below.
|
@wandenberg Yep give me a bit and I will post the results. Didn't have a chance to post results yesterday... |
@niedfelj Here is a RECAP of my testing with your branch with the following mongoid.yml setting: production:
I changed my read preference to primary_preferred. Ideally a read preference of primary should work, but I believe something in Moped caused it NOT to work. I had a 3-node replica set that looked like this to start (n1-secondary, n2-primary and n3-secondary). I brought down n1 and tested for a bit: no errors visible to the app, but Moped errors in the log, which is fine. I brought n1 back up and took down n3. Same behavior. I brought n3 back up and stepped down n2. Then I took down n2 and brought it back up. All of this was done to simulate doing a MongoDB upgrade. No errors were visible to the app, but the attached Moped errors occurred in the log. I started the test at 1:34 p.m. and ended it around 1:38 p.m. Please advise on this TEST and on getting your niedfelj branch changes into the main Moped code base and a NEW release cut. It would be nice if the read preference could be primary instead of primary_preferred. Thanks! -Steve |
At some point in your application the logger was set to nil, and Moped.logger does not create the default logger again.
I changed Moped.logger to always return an instance. Sorry for any problem this may have caused. |
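As an illustration of a logger accessor that always returns an instance, here is a minimal sketch. LoggingSketch and its default_logger are hypothetical names made up for this example; this is not the actual change made in the branch, only one common way to get that behaviour.

```ruby
require "logger"

# Hypothetical module for illustration; not Moped's real logging code.
module LoggingSketch
  class << self
    # Allows callers to install their own logger (or nil).
    attr_writer :logger

    # Falls back to a fresh default logger whenever none is set,
    # so callers never receive nil.
    def logger
      @logger ||= default_logger
    end

    def default_logger
      Logger.new($stdout).tap { |l| l.level = Logger::INFO }
    end
  end
end
```

Assigning nil and then reading the accessor again yields a new default logger instead of nil, so later log calls do not fail with NoMethodError.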
Did you have an opportunity to test the code? |
@wandenberg Sorry for the delay. I had to get some QA resources to help test and they were unavailable at the end of last week. Will be doing some testing today and respond by the end of the day. thx! |
@wandenberg Here is a snippet of my mongoid.yml and the log file from the app that shows the ConnectionPool::PoolShuttingDownError. All I basically did to generate this error was shut down a secondary node in the replica set and then clicked around the app a few times and this error would show up occasionally. Mongoid.yml snippet: production:
|
@niedfelj Any feedback on what I mentioned to you 8 days ago? thanks! |
@steve-rodriguez Those log messages aren't necessarily "errors". I just expanded the logging facilities in my branch because I wanted to have more visibility into the processes occurring in moped/mongoid (IMHO, the current implementations take the stance of showing almost no logging, and that's not helpful). I'm glad my branch worked for you; unfortunately I don't have any ability to get the changes into the main codebase. My PRs are sitting there waiting... |
Is there any progress on this issue? It's really a plague for production applications running with it. Thx @steve-rodriguez @niedfelj @wandenberg for the work, and please @durran @arthurnn, could we incorporate the fixes into a new release ASAP? Thx. |
A little bird tells me we're going to get relatively little support for this because Mongoid 5 + the official MongoDB Ruby driver 2.0 are in fairly advanced shape. It does seem like @arthurnn is willing to merge commits and release things though. Let's all try to give him some PRs he can vote yes/no on? |
We're seeing this issue as well when one of the secondaries is not reachable at all (this is with To reproduce (in a 3 node replicaset)
When running this in Sidekiq, it won't recognise this as an error and drop the job completely, causing lost data from the queue. |
So I did a little digging today and this is the relevant Stack:
The issue lies in the retry being attempted when a node is down. (I use the @wandenberg debug branch to add even more logging.)
The retry fails (from what I have gathered) because it tries to disconnect the node from a connection that is already being shut down. When I change the failover strategy (lib/moped/failover.rb) from:
  Errors::ConnectionFailure => Retry,
to:
  Errors::ConnectionFailure => Disconnect,
the issue disappears. I tried rescuing the connection pool error, but I had little luck (lib/moped/failover/retry.rb):
def execute(exception, node)
  begin
    node.disconnect unless exception.is_a?(Errors::PoolTimeout)
  rescue => e
    node.down!
    raise(e)
  end
  begin
    node.connection do |conn|
      yield(conn) if block_given?
    end
  rescue Errors::PoolTimeout => e
    raise Errors::ConnectionFailure.new e
  rescue Exception => e
    node.down!
    raise(e)
  end
end
It would still raise the ConnectionPool error. So my question is, why would we want to retry the |
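For readers following the Retry versus Disconnect discussion, here is a minimal, self-contained sketch of a disconnect-style handler that tolerates a pool that is already shutting down. Everything in FailoverSketch (FakeNode, PoolShuttingDown, Disconnect) is a hypothetical stand-in written for this example; it is not Moped's code and not the patch from any PR in this thread.

```ruby
# Hypothetical stand-ins for illustration only; none of this is Moped's real code.
module FailoverSketch
  class ConnectionFailure < StandardError; end
  class PoolShuttingDown < StandardError; end

  # A fake node whose pool may already be shutting down.
  class FakeNode
    def initialize(pool_shutting_down: false)
      @pool_shutting_down = pool_shutting_down
      @down = false
    end

    def disconnect
      raise PoolShuttingDown, "pool is shutting down" if @pool_shutting_down
    end

    def down!
      @down = true
    end

    def down?
      @down
    end
  end

  # Disconnect-style handling: drop the node's connections, mark it down,
  # and surface the ORIGINAL error instead of retrying on the same node.
  module Disconnect
    def self.execute(exception, node)
      begin
        node.disconnect
      rescue PoolShuttingDown
        # The pool is already being torn down, so there is nothing left to disconnect.
      end
      node.down!
      raise exception
    end
  end
end
```

In this sketch the caller sees the original ConnectionFailure rather than the pool's shutdown error, and the node is marked down so that a later refresh can decide whether to use it again. Whether marking the node down at this point is appropriate for Moped itself is an assumption of the sketch.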
+1, @arthurnn any update on this one? |
+1. Right now we have some issues in production across dozens of servers, websites and an API that is used by many vendors, movie studios and our apps. Right now, if anything happens to a node in the replica set, we are getting |
@steve-rodriguez, @matsimitsu's solution seems fair. What is the final say? It would be great if we could merge it and bump the version. |
@arthurnn @durran @sahin @matsimitsu As you can see, I have been pushing to get this fixed for months and for some reason there seems to be no activity on this project with respect to this issue and getting various pull requests committed. Quite a few developers have provided potential solutions, but there seems to be nothing getting pulled into the code base. Can one of the committers address this issue and why there seems to be no pull requests or ideas being committed/implemented? Thanks! |
@arthurnn @durran, Moped is in production and powering some super famous movie sites. I can't tell you which ones, but I can give hints: skynet :), so you can go and check the site and see the movielala.com logo in the videos. So I can say there are many people all over the world affected by this issue. It is also affecting our production API and website. I know that you guys are busy and Moped is moving to official Mongo support :), but before that happens, we are having crazy issues with this error, and a fix would be super great. You could consider @steve-rodriguez's solution and the other solutions. |
@arthurnn @durran @matsimitsu @steve-rodriguez , I just had around 5 K errors during the primary election. |
@arthurnn @durran @matsimitsu @steve-rodriguez any updates? |
+1 |
+1 |
Can everyone comment on what they are running in production right now (and with what effect)? Maybe you can push a version of the moped gem to Rubygems under a different name (eg. moped-pool-shutting-down-fixes)? |
#351 has been merged, can people on this thread please try that solution? |
@dblock Didn't get a chance to get to this yesterday. Will test it this afternoon and provide feedback. Thanks! |
Using the config @argami referred to, the issue no longer seems to appear. |
I have a 3-node replica set with 1 primary and 2 secondaries. We read from the primary. We are using the 4.0.1 Mongoid gem, the 2.0.3 Moped gem and the 2.1.1 connection_pool gem. When we directly kill a mongo node and a new primary is selected, 'broken' connections remain in the pool, and when the app grabs one it throws the ConnectionPool::PoolShuttingDownError. When we gracefully step down the primary and a new primary is selected, 'old' connections to the 'old' primary still exist in the pool and the app throws a 13435 error of "not master and slaveOK=false". Any time a connection is 'broken' or no longer valid or usable, it should NOT be utilized by Mongoid. Attaching screen prints of the error and the MongoDB REST interface status (for the 'killed' scenario). Please advise ASAP. Thx!