Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Collection Backup does not finish #4

Closed
ko-christ opened this issue Jun 30, 2016 · 5 comments
Closed

Collection Backup does not finish #4

ko-christ opened this issue Jun 30, 2016 · 5 comments

Comments

@ko-christ
Copy link

Hi,

A collection Backup does not seem to finish on some nodes unless running backupindex with --parallel argument.

  --parallel        Don't wait to confirm each replica backup succeeds. Default: false

Sample Output

java -jar ~/solrcloud_manager-assembly-latest.jar backupindex -c collection_201606220638 --dir /opt/backups/dev_20160630-12345 -z deb21:2178,deb22:2178,deb23:2178/dev

...

Waiting for backup request of col_201606220638_shard1_replica1 to complete
Waiting for backup request of col_201606220638_shard1_replica1 to complete
Waiting for backup request of col_201606220638_shard1_replica1 to complete
Waiting for backup request of col_201606220638_shard1_replica1 to complete


#debug
REQUEST: /collection_201606220638_shard1_replica1/replication command=backup&location=/opt/backups/dev_dev_20160630-12345/collection_201606220638/shard1&numberToKeep=2
RESPONSE: GenericSolrResponse({responseHeader={status=0,QTime=8},status=OK})
REQUEST: /collection_201606220638_shard1_replica1/replication command=details
RESPONSE: GenericSolrResponse({responseHeader={status=0,QTime=3},details={indexSize=9.26 KB,indexPath=/opt/solr/index/dev/collection_201606220638/restore.20160630064126164,commits=[{indexVersion=1467210833288,generation=2,filelist=[_0.fdt, _0.fdx, _0.fnm, _0.nvd, _0.nvm, _0.si, _0_Lucene50_0.doc, _0_Lucene50_0.pos, _0_Lucene50_0.tim, _0_Lucene50_0.tip, _0_Lucene54_0.dvd, _0_Lucene54_0.dvm, segments_2]}],isMaster=true,isSlave=false,indexVersion=1467210833288,generation=2,master={replicateAfter=[commit],replicationEnabled=true,replicableVersion=1467210833288,replicableGeneration=2},slave={masterDetails={indexSize=9.26 KB,indexPath=/opt/solr/index/dev/collection_201606220638/restore.20160630064133869,commits=[{indexVersion=1467210833288,generation=2,filelist=[_0.fdt, _0.fdx, _0.fnm, _0.nvd, _0.nvm, _0.si, _0_Lucene50_0.doc, _0_Lucene50_0.pos, _0_Lucene50_0.tim, _0_Lucene50_0.tip, _0_Lucene54_0.dvd, _0_Lucene54_0.dvm, segments_2]}],isMaster=true,isSlave=false,indexVersion=1467210833288,generation=2,master={replicateAfter=[commit],replicationEnabled=true},slave={masterUrl=http://deb21:7637/solr-search/collection_201606220638_shard1_replica1/,indexReplicatedAt=Thu Jun 30 06:02:59 PDT 2016,indexReplicatedAtList=[Thu Jun 30 06:02:59 PDT 2016],timesIndexReplicated=1,lastCycleBytesDownloaded=9648,previousCycleTimeInSeconds=0,currentDate=Thu Jun 30 10:44:56 PDT 2016,isPollingDisabled=false,isReplicating=false}},masterUrl=http://deb23:7637/solr-search/collection_201606220638_shard1_replica3/,indexReplicatedAt=Thu Jun 30 05:54:11 PDT 2016,indexReplicatedAtList=[Thu Jun 30 05:54:11 PDT 2016],timesIndexReplicated=1,lastCycleBytesDownloaded=9648,previousCycleTimeInSeconds=0,currentDate=Thu Jun 30 10:44:54 PDT 2016,isPollingDisabled=false,isReplicating=false},backup={startTime=Thu Jun 30 10:44:54 PDT 2016,fileCount=13,status=success,snapshotCompletedAt=Thu Jun 30 10:44:54 PDT 2016,snapshotName=null}}})
Waiting for backup request of collection_201606220638_shard1_replica1 to complete
REQUEST: /collection_201606220638_shard1_replica1/replication command=details
RESPONSE: GenericSolrResponse({responseHeader={status=0,QTime=3},details={indexSize=9.26 KB,indexPath=/opt/solr/index/dev/collection_201606220638/restore.20160630064126164,commits=[{indexVersion=1467210833288,generation=2,filelist=[_0.fdt, _0.fdx, _0.fnm, _0.nvd, _0.nvm, _0.si, _0_Lucene50_0.doc, _0_Lucene50_0.pos, _0_Lucene50_0.tim, _0_Lucene50_0.tip, _0_Lucene54_0.dvd, _0_Lucene54_0.dvm, segments_2]}],isMaster=true,isSlave=false,indexVersion=1467210833288,generation=2,master={replicateAfter=[commit],replicationEnabled=true,replicableVersion=1467210833288,replicableGeneration=2},slave={masterDetails={indexSize=9.26 KB,indexPath=/opt/solr/index/dev/collection_201606220638/restore.20160630064133869,commits=[{indexVersion=1467210833288,generation=2,filelist=[_0.fdt, _0.fdx, _0.fnm, _0.nvd, _0.nvm, _0.si, _0_Lucene50_0.doc, _0_Lucene50_0.pos, _0_Lucene50_0.tim, _0_Lucene50_0.tip, _0_Lucene54_0.dvd, _0_Lucene54_0.dvm, segments_2]}],isMaster=true,isSlave=false,indexVersion=1467210833288,generation=2,master={replicateAfter=[commit],replicationEnabled=true},slave={masterUrl=http://deb21:7637/solr-search/collection_201606220638_shard1_replica1/,indexReplicatedAt=Thu Jun 30 06:02:59 PDT 2016,indexReplicatedAtList=[Thu Jun 30 06:02:59 PDT 2016],timesIndexReplicated=1,lastCycleBytesDownloaded=9648,previousCycleTimeInSeconds=0,currentDate=Thu Jun 30 10:44:59 PDT 2016,isPollingDisabled=false,isReplicating=false}},masterUrl=http://deb23:7637/solr-search/collection_201606220638_shard1_replica3/,indexReplicatedAt=Thu Jun 30 05:54:11 PDT 2016,indexReplicatedAtList=[Thu Jun 30 05:54:11 PDT 2016],timesIndexReplicated=1,lastCycleBytesDownloaded=9648,previousCycleTimeInSeconds=0,currentDate=Thu Jun 30 10:44:57 PDT 2016,isPollingDisabled=false,isReplicating=false},backup={startTime=Thu Jun 30 10:44:54 PDT 2016,fileCount=13,status=success,snapshotCompletedAt=Thu Jun 30 10:44:54 PDT 2016,snapshotName=null}}})
Waiting for backup request of collection_201606220638_shard1_replica1 to complete
...



### solr log
4314707 INFO  (qtp2109957412-159) [c:collection_201606220638 s:shard1 r:deb23:7637_solr-search_collection_201606220638_shard1_replica3 x:collection_201606220638_shard1_replica3] o.a.s.c.S.Request [collection_201606220638_shard1_replica3]  webapp=/solr-search path=/replication params={slave=false&qt=/replication&wt=javabin&version=2&command=details} status=0 QTime=0

I can see that the index data are written on disk.
What exactly prevents the backup from stopping successfully. Is it expecting something specific from the replication?command=details page?
And last but not least, is using --parallel argument and ignoring the backup status confirmation somehow risky and not suggested?

Any advice is welcome. Thanks!

@randomstatistic
Copy link
Contributor

Hm, I successfully used this against my 5.4 cluster just last week without the --parallel. What Solr version is your cluster? It's possible something in the replication handler's api changed, especially with the backup/restore Solr patches that've been in flight lately.

The --parallel switch in this context (and in others, like fill) generally means "make sure the command got a success response, but don't wait to confirm that the command actually did anything". I encountered a lot of Collections API bugs when I was starting work on this tool back around Solr v4.8, so I liked having an extra confirmation. Also, the confirmation-wait effectively forces multiple requests to happen one at a time, which is useful if you're concerned about the performance impact of what you're doing on your cluster.

In this case, the confirmation is polling the replication handler, which has a report about the last backup time and whether it succeeded. The gate is seeing a success there. (Which I do see, glancing at your debug output. I'll have to look closer.)
With --parallel, this means a success from solrcloud_manager means it successfully submitted the backup request to Solr, but if Solr fails to perform that operation at some point down the line (runs out of disk space, perhaps), you won't know unless you have some other mechanism to check with. Watching error logs, perhaps.

@ko-christ
Copy link
Author

Thanks for your input. Version is 5.5.0.

I restarted Solr replica1 and I cannot reproduce it anymore.

I will not use --parallel for now, I find it very important to get API's feedback regarding the backup operation. If it happens again I will get some more verbose Solr logs reopen/update if needed.

@randomstatistic
Copy link
Contributor

Just looked at the code. To concretely answer your question, the confirmation polls the command=details every three seconds looking for both of two things in the response:

  1. A details.backup.status=success
  2. A details.backup.snapshotCompletedAt with a datetime after a datetime that solrcloud_manager recorded just before submitting the backup request.

Both of those paths look like they're present and expected in your debug output, so it seems likely it's the second condition that's blocking things. I'm not sure why that'd be though.
One thought is that the time or timezone where you're running solrcloud_manager is different than the one that the Solr servers have set? If so, that might make datetime comparison difficult.

@ko-christ
Copy link
Author

I see. Yes [2] and different timezones must be the root cause.

@randomstatistic
Copy link
Contributor

Filed Issue #5, since that seems like a bug.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

No branches or pull requests

2 participants