Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Regression] security.json is not uploaded during the first initialization of SolrCloud #720

Open
erwanval opened this issue Aug 23, 2024 · 12 comments
Labels
bug Something isn't working

Comments

@erwanval
Copy link

erwanval commented Aug 23, 2024

This is a regression caused by #660.

During the first initialization of SolrCloud, the security.json doesn't exists in Zookeeper, which cause an exception during the setup-zk initContainer zkcli.sh command
With the change introduced in above PR, when there is an error, it now completely skip the upload. Then the SolR just starts with no security.json at all.
It seems an empty one is created during the main container initialization, so upon manual restart (or if another Solr pod takes longer to start), setup-zk run as expected and the real security.json is uploaded.

I'm not sure how to solve this while retaining the reason why this change has been introduced in the first place. Maybe the zkcli.sh could return a different error code depending on the kind of exception it encounters (missing security.json, or everything else)?

setup-zk logs during first init:

ERROR: KeeperErrorCode = NoNode for /solr/solr

Creating ZooKeeper path /solr/solr on ZooKeeper at zookeeper.solr.svc:2181
INFO - 2024-08-23 08:25:26.748; org.apache.solr.common.cloud.ConnectionManager; Waiting for client to connect to ZooKeeper
INFO - 2024-08-23 08:25:26.786; org.apache.solr.common.cloud.ConnectionManager; zkClient has connected
INFO - 2024-08-23 08:25:26.786; org.apache.solr.common.cloud.ConnectionManager; Client is connected to ZooKeeper
Exception in thread "main" org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /security.json
at org.apache.zookeeper.KeeperException.create(KeeperException.java:118)
at org.apache.zookeeper.KeeperException.create(KeeperException.java:54)
at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:2358)
at org.apache.solr.common.cloud.SolrZkClient.lambda$getData$6(SolrZkClient.java:349)
at org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:79)
at org.apache.solr.common.cloud.SolrZkClient.getData(SolrZkClient.java:349)
at org.apache.solr.cloud.ZkCLI.main(ZkCLI.java:328)

@MitchIonascu
Copy link

MitchIonascu commented Sep 23, 2024

Hi there!

We get the same issue when initializing a new solr cloud cluster with version 9.7.0.

image

Solr is then exposed to the internet but with no basic authentication(or any authentication), consistent with security.json not being loaded at all.

We cannot bypass this by restarting solr, as the file is never uploaded.

A workaround that I've found to work, is to initialize the cluster with Solr 9.6, then upgrade in place to 9.7. However I am not sure about this approach and its implications.

Thank you!

@kbarnesMCC
Copy link

Just chiming in; we're also unable to leverage 9.7 until this can get resolved.

@mcarroll1
Copy link

This is also blocking an upgrade for our team. Per @erwanval , it does seem like the zkcli returns error code 1 regardless of the issue. Would it be acceptable to rely on the string error messages (e.g. NoNode for /security.json)? I'm interested in working on this next week if that solution is acceptable, or if we can find a reasonable alternative.

@janhoy janhoy added the bug Something isn't working label Dec 6, 2024
@janhoy
Copy link
Contributor

janhoy commented Dec 6, 2024

I tagged it as a bug, feel free to work on a Pull Request trying to fix the issue.

I did a quick test of the first command line on a 9.7 docker:

ZK_SECURITY_JSON=$(/opt/solr/server/scripts/cloud-scripts/zkcli.sh -zkhost localhost:9983 -cmd get /security.json || echo 'failed-to-get-security.json');

I have an empty security.json file, so I'd expect the variable to be {}, but instead it is

WARNING: The zkcli.sh script has been deprecated in favour of the bin/solr equivalent commands. {}

When we deprecated the script in v9.7 we added an echo at line 3 printing the deprecation warning. But it prints to stdout, meaning that it will be part of the response always, breaking all scripts that rely on -cmd get. So in any case that script should be fixed for 9.8 @epugh by prefixing the echo line with >&2, i.e. >&2 echo "WARNING:...".

This will fix the case where we have a security.json file which is empty {} from next solr version. But it won't fix the case where the file does not exist in ZK. In that case the same script would still return the string "failed-to-get-security.json" which is longer than 2 chars, causing the same effect, i.e. not uploading security.json since we believe the existing file exists and is non-empty. The fix for this will be in operator to return empty JSON object if file not found, triggering the upload: || echo '{}'.

@janhoy
Copy link
Contributor

janhoy commented Dec 6, 2024

I guess it would be possible to patch solr 9.7 docker container by mounting a fixed script as a volume on the POD at /opt/solr/server/scripts/cloud-scripts/zkcli.sh and thus work around the issue until 9.8 is out.

@janhoy
Copy link
Contributor

janhoy commented Dec 7, 2024

Committed the fix to solr repo. It will (likely) make things work again with Solr 9.8, just like it worked with 9.6. Unfortunately there was also a need for a java level fix, so the proposed workaround mounting a patched script won't work since there was more deprecation noise ending up in the output, see SOLR-17586 for details. If you want to test this right now, you'd need to build a custom docker image for branch_9x and use that as your solr image.

Easiest way is to wait for Jenkins to build the next nightly version ofapache/solr-nightly:9.8.0-SNAPSHOT. I triggered a build, and once it completes (see https://ci-builds.apache.org/job/Solr/job/Solr-Docker-Nightly-9.x/) you will be able to pull it for testing.

@janhoy
Copy link
Contributor

janhoy commented Dec 7, 2024

Another way to fix the issue is to change the operator code so that it uses bin/solr zk cp instead of the deprecated zkcli.sh, which is actually removed in 10.0. So we need to make this move anyway, and it would also work for 9.7 and 10.x. Here is a script that does about the same:

solr zk cp zk:/security.json /tmp/current_security.json >/dev/null 2>&1
if [ $? -eq 0 ]; then
    if [ ! -s /tmp/current_security.json ] || grep -q '^{}$' /tmp/current_security.json; then
        echo $SECURITY_JSON > /tmp/security.json
        solr zk cp /tmp/security.json zk:/security.json >/dev/null 2>&1
        echo "put security.json in ZK"
    fi
fi

I think perhaps that if /security.json does NOT exist in ZK, we should perform the bootstrap instead of silently continue?

@janhoy
Copy link
Contributor

janhoy commented Dec 9, 2024

I created #731 to track the need for a change to support Solr 10.0.

@mcarroll1
Copy link

Thanks @janhoy. I'll move forward with the bin/solr zk cp solution then. Makes sense to get off the deprecated zkcli entirely.

@gerlowskija
Copy link
Contributor

+1 for the bin/solr zk cp approach - seems like that solves two problems for us at once!

@iskandar
Copy link

If you replace the usage of zkcli.sh (which makes sense), please also mount TLS volumes (e.g. /var/solr/tls) on the setup-zk initContainer so connections to TLS-enabled Zookeeper will be able to use custom keystores/truststores.

We are currently overwriting zkcli.sh with a wrapper that does a no op and then pushing security.json to ZK in a custom initContainer (with TLS keystore/truststores mounted).

@HoustonPutman
Copy link
Contributor

If you replace the usage of zkcli.sh (which makes sense), please also mount TLS volumes (e.g. /var/solr/tls) on the setup-zk initContainer so connections to TLS-enabled Zookeeper will be able to use custom keystores/truststores.

Maybe for this, we have flags in our extra volume specification that says neededForSolr and neededForZookeeper, which will mount them to any initContainers or sidecarContainers that use ZK or Solr connection.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

8 participants