[Bug]: After Milvus recovered from many components failure chaos, search failed with an error message failed to search: segment lacks #37703

Open
zhuwenxing opened this issue Nov 15, 2024 · 2 comments
Labels
kind/bug Issues or changes related a bug priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. severity/critical Critical, lead to crash, data missing, wrong result, function totally doesn't work. triage/accepted Indicates an issue or PR is ready to be actively worked on.

@zhuwenxing
Contributor

Is there an existing issue for this?

  • I have searched the existing issues

Environment

- Milvus version:master-20241114-cd181e4c-amd64
- Deployment mode(standalone or cluster):
- MQ type(rocksmq, pulsar or kafka):    
- SDK version(e.g. pymilvus v2.0.0rc2):
- OS(Ubuntu or CentOS): 
- CPU/Memory: 
- GPU: 
- Others:

Current Behavior


[2024-11-14T09:49:18.081Z] [2024-11-14 09:48:12 - ERROR - pymilvus.decorators]: RPC error: [search], <MilvusException: (code=503, message=failed to search: segment lacks[segment=453922023426637922]; segment lacks[segment=453922023426636382]: channel not available[channel=by-dev-rootcoord-dml_10_453922023426834071v1])>, <Time:{'RPC start': '2024-11-14 09:48:03.697027', 'RPC error': '2024-11-14 09:48:12.722128'}> (decorators.py:140)

[2024-11-14T09:49:18.081Z] [2024-11-14 09:48:12 - DEBUG - ci_test]: [test][start time:2024-11-14 09:48:03.525741][time cost:9.16970466s][operation_name:query][collection name:Checker__w6D93jMU] -> True (checker.py:274)

[2024-11-14T09:49:18.081Z] [2024-11-14 09:48:12 - ERROR - pymilvus.decorators]: RPC error: [query], <MilvusException: (code=503, message=failed to query: segment lacks[segment=453922023426637310]; segment lacks[segment=453922023426636382]: channel not available[channel=by-dev-rootcoord-dml_10_453922023426834071v1])>, <Time:{'RPC start': '2024-11-14 09:48:03.774335', 'RPC error': '2024-11-14 09:48:12.829537'}> (decorators.py:140)

[2024-11-14T09:49:18.081Z] [2024-11-14 09:48:12 - ERROR - ci_test]: Traceback (most recent call last):

[2024-11-14T09:49:18.081Z]   File "/home/jenkins/agent/workspace/tests/python_client/utils/api_request.py", line 32, in inner_wrapper

[2024-11-14T09:49:18.081Z]     res = func(*args, **_kwargs)

[2024-11-14T09:49:18.081Z]   File "/home/jenkins/agent/workspace/tests/python_client/utils/api_request.py", line 63, in api_request

[2024-11-14T09:49:18.081Z]     return func(*arg, **kwargs)

[2024-11-14T09:49:18.081Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/orm/collection.py", line 801, in search

[2024-11-14T09:49:18.081Z]     resp = conn.search(

[2024-11-14T09:49:18.081Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 141, in handler

[2024-11-14T09:49:18.081Z]     raise e from e

[2024-11-14T09:49:18.081Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 137, in handler

[2024-11-14T09:49:18.081Z]     return func(*args, **kwargs)

[2024-11-14T09:49:18.081Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 176, in handler

[2024-11-14T09:49:18.081Z]     return func(self, *args, **kwargs)

[2024-11-14T09:49:18.081Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 116, in handler

[2024-11-14T09:49:18.081Z]     raise e from e

[2024-11-14T09:49:18.081Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 86, in handler

[2024-11-14T09:49:18.081Z]     return func(*args, **kwargs)

[2024-11-14T09:49:18.081Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/client/grpc_handler.py", line 806, in search

[2024-11-14T09:49:18.081Z]     return self._execute_search(request, timeout, round_decimal=round_decimal, **kwargs)

[2024-11-14T09:49:18.081Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/client/grpc_handler.py", line 747, in _execute_search

[2024-11-14T09:49:18.081Z]     raise e from e

[2024-11-14T09:49:18.081Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/client/grpc_handler.py", line 736, in _execute_search

[2024-11-14T09:49:18.081Z]     check_status(response.status)

[2024-11-14T09:49:18.081Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/client/utils.py", line 63, in check_status

[2024-11-14T09:49:18.081Z]     raise MilvusException(status.code, status.reason, status.error_code)

[2024-11-14T09:49:18.081Z] pymilvus.exceptions.MilvusException: <MilvusException: (code=503, message=failed to search: segment lacks[segment=453922023426637922]; segment lacks[segment=453922023426636382]: channel not available[channel=by-dev-rootcoord-dml_10_453922023426834071v1])>

[2024-11-14T09:49:18.081Z]  (api_request.py:45)

[2024-11-14T09:49:18.082Z] [2024-11-14 09:48:12 - ERROR - ci_test]: (api_response) : <MilvusException: (code=503, message=failed to search: segment lacks[segment=453922023426637922]; segment lacks[segment=453922023426636382]: channel not available[channel=by-dev-rootcoord-dml_10_453922023426834071v1])> (api_request.py:46)

[2024-11-14T09:49:18.082Z] [2024-11-14 09:48:12 - ERROR - ci_test]: Traceback (most recent call last):

[2024-11-14T09:49:18.082Z]   File "/home/jenkins/agent/workspace/tests/python_client/utils/api_request.py", line 32, in inner_wrapper

[2024-11-14T09:49:18.082Z]     res = func(*args, **_kwargs)

[2024-11-14T09:49:18.082Z]   File "/home/jenkins/agent/workspace/tests/python_client/utils/api_request.py", line 63, in api_request

[2024-11-14T09:49:18.082Z]     return func(*arg, **kwargs)

[2024-11-14T09:49:18.082Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/orm/collection.py", line 1076, in query

[2024-11-14T09:49:18.082Z]     return conn.query(

[2024-11-14T09:49:18.082Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 141, in handler

[2024-11-14T09:49:18.082Z]     raise e from e

[2024-11-14T09:49:18.082Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 137, in handler

[2024-11-14T09:49:18.082Z]     return func(*args, **kwargs)

[2024-11-14T09:49:18.082Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 176, in handler

[2024-11-14T09:49:18.082Z]     return func(self, *args, **kwargs)

[2024-11-14T09:49:18.082Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 116, in handler

[2024-11-14T09:49:18.082Z]     raise e from e

[2024-11-14T09:49:18.082Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 86, in handler

[2024-11-14T09:49:18.082Z]     return func(*args, **kwargs)

[2024-11-14T09:49:18.082Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/client/grpc_handler.py", line 1542, in query

[2024-11-14T09:49:18.082Z]     check_status(response.status)

[2024-11-14T09:49:18.082Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/client/utils.py", line 63, in check_status

[2024-11-14T09:49:18.082Z]     raise MilvusException(status.code, status.reason, status.error_code)

[2024-11-14T09:49:18.082Z] pymilvus.exceptions.MilvusException: <MilvusException: (code=503, message=failed to query: segment lacks[segment=453922023426637310]; segment lacks[segment=453922023426636382]: channel not available[channel=by-dev-rootcoord-dml_10_453922023426834071v1])>
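When triaging these failures it can help to pull the segment IDs and channel names out of the error text and compare them across operations. The following is a minimal sketch that assumes the message layout seen in the two exceptions above (`segment lacks[segment=...]` and `channel not available[channel=...]`). The helper name `parse_segment_lacks` is hypothetical and not part of pymilvus, and the regular expressions are only inferred from these two log lines.

```python
import re

# Messages copied from the search and query failures in the log above.
SEARCH_MESSAGE = (
    "failed to search: segment lacks[segment=453922023426637922]; "
    "segment lacks[segment=453922023426636382]: "
    "channel not available[channel=by-dev-rootcoord-dml_10_453922023426834071v1]"
)
QUERY_MESSAGE = (
    "failed to query: segment lacks[segment=453922023426637310]; "
    "segment lacks[segment=453922023426636382]: "
    "channel not available[channel=by-dev-rootcoord-dml_10_453922023426834071v1]"
)

# Assumed patterns, inferred from the two messages above only.
SEGMENT_RE = re.compile(r"segment lacks\[segment=(\d+)\]")
CHANNEL_RE = re.compile(r"channel not available\[channel=([^\]]+)\]")


def parse_segment_lacks(message):
    """Return (segment_ids, channels) named in a 'segment lacks' error message."""
    segments = [int(s) for s in SEGMENT_RE.findall(message)]
    channels = CHANNEL_RE.findall(message)
    return segments, channels


search_segments, search_channels = parse_segment_lacks(SEARCH_MESSAGE)
query_segments, query_channels = parse_segment_lacks(QUERY_MESSAGE)

# Segments reported as missing by both operations.
common_segments = set(search_segments) & set(query_segments)
print(sorted(common_segments))
print(search_channels == query_channels)
```

For the two messages above, both operations name the same channel and share one missing segment (453922023426636382), which may be a useful starting point when reading the server logs.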

collection name: Checker__w6D93jMU

[2024-11-14T09:49:18.082Z] FAILED testcases/test_concurrent_operation.py::TestOperations::test_operations[Checker__w6D93jMU] - pytest_assume.plugin.FailedAssumption: 

[2024-11-14T09:49:18.082Z] 3 Failed Assumptions:

[2024-11-14T09:49:18.082Z] 

[2024-11-14T09:49:18.082Z] chaos_commons.py:124: AssumptionFailure

[2024-11-14T09:49:18.082Z] >>	pytest.assume(

[2024-11-14T09:49:18.082Z] AssertionError: Expect Succ: Op.search succ rate 0.6933333333333334, total: 75, average time: 0.4280

[2024-11-14T09:49:18.082Z] assert False

[2024-11-14T09:49:18.082Z] 

[2024-11-14T09:49:18.082Z] chaos_commons.py:124: AssumptionFailure

[2024-11-14T09:49:18.082Z] >>	pytest.assume(

[2024-11-14T09:49:18.082Z] AssertionError: Expect Succ: Op.full_text_search succ rate 0.5833333333333334, total: 48, average time: 1.8086

[2024-11-14T09:49:18.082Z] assert False

[2024-11-14T09:49:18.082Z] 

[2024-11-14T09:49:18.082Z] chaos_commons.py:124: AssumptionFailure

[2024-11-14T09:49:18.082Z] >>	pytest.assume(

[2024-11-14T09:49:18.082Z] AssertionError: Expect Succ: Op.hybrid_search succ rate 0.6933333333333334, total: 75, average time: 0.4264

[2024-11-14T09:49:18.082Z] assert False
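The success rates in the failed assumptions can be turned back into request counts to see how many requests actually failed. This is a rough sketch that assumes the reported rate is simply successes divided by total, which the log output suggests but does not state. The function name `failed_requests` is made up for illustration.

```python
# Figures copied from the three failed assumptions in the log above:
# operation -> (reported success rate, total requests).
REPORTED = {
    "search": (0.6933333333333334, 75),
    "full_text_search": (0.5833333333333334, 48),
    "hybrid_search": (0.6933333333333334, 75),
}


def failed_requests(succ_rate, total):
    """Recover the number of failed requests, assuming succ_rate = succeeded / total."""
    succeeded = round(succ_rate * total)
    return total - succeeded


failures = {op: failed_requests(rate, total) for op, (rate, total) in REPORTED.items()}
for op, count in failures.items():
    print(f"{op}: {count} of {REPORTED[op][1]} requests failed")
```

Under that assumption, search and hybrid_search each lost 23 of 75 requests and full_text_search lost 20 of 48.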

Expected Behavior

No response

Steps To Reproduce

No response

Milvus Log

failed job: https://qa-jenkins.milvus.io/blue/organizations/jenkins/chaos-test-kafka-cron/detail/chaos-test-kafka-cron/18551/pipeline

log:
artifacts-etcd-followers-pod-failure-18551-server-logs.tar.gz

pod info:

[2024-11-14T09:39:14.962Z] + kubectl get pods -o wide

[2024-11-14T09:39:14.964Z] + grep etcd-followers-pod-failure-18551

[2024-11-14T09:39:15.254Z] etcd-followers-pod-failure-18551-0                                1/1     Running            2 (13m ago)      34m     10.104.27.207   4am-node31   <none>           <none>

[2024-11-14T09:39:15.254Z] etcd-followers-pod-failure-18551-1                                1/1     Running            0                34m     10.104.26.12    4am-node32   <none>           <none>

[2024-11-14T09:39:15.254Z] etcd-followers-pod-failure-18551-2                                1/1     Running            0                34m     10.104.21.185   4am-node24   <none>           <none>

[2024-11-14T09:39:15.254Z] etcd-followers-pod-failure-18551-kafka-0                          2/2     Running            0                34m     10.104.27.208   4am-node31   <none>           <none>

[2024-11-14T09:39:15.254Z] etcd-followers-pod-failure-18551-kafka-1                          2/2     Running            0                34m     10.104.26.13    4am-node32   <none>           <none>

[2024-11-14T09:39:15.254Z] etcd-followers-pod-failure-18551-kafka-2                          2/2     Running            0                34m     10.104.15.188   4am-node20   <none>           <none>

[2024-11-14T09:39:15.254Z] etcd-followers-pod-failure-18551-kafka-exporter-5b566998f6cb55d   1/1     Running            4 (34m ago)      34m     10.104.27.197   4am-node31   <none>           <none>

[2024-11-14T09:39:15.254Z] etcd-followers-pod-failure-18551-milvus-datanode-c84b575df6jg57   1/1     Running            2 (34m ago)      34m     10.104.27.196   4am-node31   <none>           <none>

[2024-11-14T09:39:15.254Z] etcd-followers-pod-failure-18551-milvus-datanode-c84b575df9dfdh   1/1     Running            2 (34m ago)      34m     10.104.16.58    4am-node21   <none>           <none>

[2024-11-14T09:39:15.254Z] etcd-followers-pod-failure-18551-milvus-indexnode-7778b987bltx2   1/1     Running            2 (34m ago)      34m     10.104.32.172   4am-node39   <none>           <none>

[2024-11-14T09:39:15.254Z] etcd-followers-pod-failure-18551-milvus-indexnode-7778b987djc2q   1/1     Running            2 (34m ago)      34m     10.104.25.242   4am-node30   <none>           <none>

[2024-11-14T09:39:15.254Z] etcd-followers-pod-failure-18551-milvus-indexnode-7778b987ff85f   1/1     Running            2 (34m ago)      34m     10.104.4.182    4am-node11   <none>           <none>

[2024-11-14T09:39:15.254Z] etcd-followers-pod-failure-18551-milvus-mixcoord-7d9647cf6bbbjl   1/1     Running            2 (34m ago)      34m     10.104.27.198   4am-node31   <none>           <none>

[2024-11-14T09:39:15.254Z] etcd-followers-pod-failure-18551-milvus-proxy-5c8ffb5b68-wb5rp    1/1     Running            1 (34m ago)      34m     10.104.1.226    4am-node10   <none>           <none>

[2024-11-14T09:39:15.254Z] etcd-followers-pod-failure-18551-milvus-querynode-784c5c44fxpxt   1/1     Running            2 (34m ago)      34m     10.104.25.243   4am-node30   <none>           <none>

[2024-11-14T09:39:15.254Z] etcd-followers-pod-failure-18551-milvus-querynode-784c5c44mnnv5   1/1     Running            2 (34m ago)      34m     10.104.27.200   4am-node31   <none>           <none>

[2024-11-14T09:39:15.254Z] etcd-followers-pod-failure-18551-milvus-querynode-784c5c44rgfqc   1/1     Running            1 (34m ago)      34m     10.104.1.225    4am-node10   <none>           <none>

[2024-11-14T09:39:15.254Z] etcd-followers-pod-failure-18551-minio-0                          1/1     Running            0                34m     10.104.26.11    4am-node32   <none>           <none>

[2024-11-14T09:39:15.255Z] etcd-followers-pod-failure-18551-minio-1                          1/1     Running            0                34m     10.104.27.209   4am-node31   <none>           <none>

[2024-11-14T09:39:15.255Z] etcd-followers-pod-failure-18551-minio-2                          1/1     Running            0                34m     10.104.15.187   4am-node20   <none>           <none>

[2024-11-14T09:39:15.255Z] etcd-followers-pod-failure-18551-minio-3                          1/1     Running            0                34m     10.104.21.187   4am-node24   <none>           <none>

[2024-11-14T09:39:15.255Z] etcd-followers-pod-failure-18551-zookeeper-0                      1/1     Running            0                34m     10.104.27.206   4am-node31   <none>           <none>

[2024-11-14T09:39:15.255Z] etcd-followers-pod-failure-18551-zookeeper-1                      1/1     Running            0                34m     10.104.15.184   4am-node20   <none>           <none>

[2024-11-14T09:39:15.255Z] etcd-followers-pod-failure-18551-zookeeper-2                      1/1     Running            0                34m     10.104.26.14    4am-node32   <none>           <none>

failed job: https://qa-jenkins.milvus.io/blue/organizations/jenkins/chaos-test-kafka-cron/detail/chaos-test-kafka-cron/18547/pipeline

log:
artifacts-querynode-pod-failure-18547-server-logs.tar.gz

pod info:

[2024-11-14T08:56:04.239Z] + kubectl get pods -o wide

[2024-11-14T08:56:04.240Z] + grep querynode-pod-failure-18547

[2024-11-14T08:56:04.495Z] querynode-pod-failure-18547-etcd-0                                1/1     Running            0                  32m     10.104.26.194   4am-node32   <none>           <none>

[2024-11-14T08:56:04.495Z] querynode-pod-failure-18547-etcd-1                                1/1     Running            0                  32m     10.104.21.136   4am-node24   <none>           <none>

[2024-11-14T08:56:04.495Z] querynode-pod-failure-18547-etcd-2                                1/1     Running            0                  32m     10.104.24.105   4am-node29   <none>           <none>

[2024-11-14T08:56:04.495Z] querynode-pod-failure-18547-kafka-0                               2/2     Running            2 (31m ago)        32m     10.104.21.138   4am-node24   <none>           <none>

[2024-11-14T08:56:04.495Z] querynode-pod-failure-18547-kafka-1                               2/2     Running            1 (31m ago)        32m     10.104.26.198   4am-node32   <none>           <none>

[2024-11-14T08:56:04.495Z] querynode-pod-failure-18547-kafka-2                               2/2     Running            1 (31m ago)        32m     10.104.20.236   4am-node22   <none>           <none>

[2024-11-14T08:56:04.495Z] querynode-pod-failure-18547-kafka-exporter-876d94678-l4mgv        1/1     Running            4 (31m ago)        32m     10.104.24.97    4am-node29   <none>           <none>

[2024-11-14T08:56:04.495Z] querynode-pod-failure-18547-milvus-datanode-574d85ddc8-fhzq6      1/1     Running            2 (31m ago)        32m     10.104.5.219    4am-node12   <none>           <none>

[2024-11-14T08:56:04.495Z] querynode-pod-failure-18547-milvus-datanode-574d85ddc8-pdmxv      1/1     Running            2 (31m ago)        32m     10.104.6.224    4am-node13   <none>           <none>

[2024-11-14T08:56:04.495Z] querynode-pod-failure-18547-milvus-indexnode-5cb6b86bfc-5c9xr     1/1     Running            2 (31m ago)        32m     10.104.17.232   4am-node23   <none>           <none>

[2024-11-14T08:56:04.495Z] querynode-pod-failure-18547-milvus-indexnode-5cb6b86bfc-njfjp     1/1     Running            2 (31m ago)        32m     10.104.24.99    4am-node29   <none>           <none>

[2024-11-14T08:56:04.495Z] querynode-pod-failure-18547-milvus-indexnode-5cb6b86bfc-q9zdk     1/1     Running            2 (31m ago)        32m     10.104.14.121   4am-node18   <none>           <none>

[2024-11-14T08:56:04.495Z] querynode-pod-failure-18547-milvus-mixcoord-6698c57c65-gpznz      1/1     Running            2 (31m ago)        32m     10.104.24.98    4am-node29   <none>           <none>

[2024-11-14T08:56:04.495Z] querynode-pod-failure-18547-milvus-proxy-8dbccb459-xczg2          1/1     Running            2 (31m ago)        32m     10.104.24.100   4am-node29   <none>           <none>

[2024-11-14T08:56:04.495Z] querynode-pod-failure-18547-milvus-querynode-85456b679f-67pm6     1/1     Running            5 (8m29s ago)      32m     10.104.25.238   4am-node30   <none>           <none>

[2024-11-14T08:56:04.496Z] querynode-pod-failure-18547-milvus-querynode-85456b679f-7d4dz     1/1     Running            6 (6m5s ago)       32m     10.104.23.232   4am-node27   <none>           <none>

[2024-11-14T08:56:04.496Z] querynode-pod-failure-18547-milvus-querynode-85456b679f-7vtsv     1/1     Running            5 (8m29s ago)      32m     10.104.24.101   4am-node29   <none>           <none>

[2024-11-14T08:56:04.496Z] querynode-pod-failure-18547-minio-0                               1/1     Running            0                  32m     10.104.26.195   4am-node32   <none>           <none>

[2024-11-14T08:56:04.496Z] querynode-pod-failure-18547-minio-1                               1/1     Running            0                  32m     10.104.15.166   4am-node20   <none>           <none>

[2024-11-14T08:56:04.496Z] querynode-pod-failure-18547-minio-2                               1/1     Running            0                  32m     10.104.18.51    4am-node25   <none>           <none>

[2024-11-14T08:56:04.496Z] querynode-pod-failure-18547-minio-3                               1/1     Running            0                  32m     10.104.21.143   4am-node24   <none>           <none>

[2024-11-14T08:56:04.496Z] querynode-pod-failure-18547-zookeeper-0                           1/1     Running            0                  32m     10.104.18.49    4am-node25   <none>           <none>

[2024-11-14T08:56:04.496Z] querynode-pod-failure-18547-zookeeper-1                           1/1     Running            0                  32m     10.104.26.197   4am-node32   <none>           <none>

[2024-11-14T08:56:04.496Z] querynode-pod-failure-18547-zookeeper-2                           1/1     Running            0                  32m     10.104.20.235   4am-node22   <none>           <none>

This issue has appeared in chaos tests for many components, including MinIO, QueryNode, and etcd follower. The etcd follower case is somewhat puzzling, because that chaos does not cause any Milvus component to restart. The issue may occur even without chaos under certain concurrent request conditions.
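The restart observation can be checked against the pod listings above by comparing each pod's last restart time with its age. The sketch below parses rows in the `kubectl get pods -o wide` layout shown in this issue, including the Jenkins timestamp prefix. The regular expression and the helper names are assumptions made for this listing, and the sample rows are copied from it with column whitespace collapsed. Ages are printed at minute granularity, so a restart shown as "34m ago" on a 34m-old pod is treated as a startup restart.

```python
import re

# Assumed row layout: [timestamp] NAME READY STATUS RESTARTS [(X ago)] AGE ...
POD_RE = re.compile(
    r"^(?:\[[^\]]+\]\s+)?"              # optional Jenkins timestamp prefix
    r"(?P<name>\S+)\s+"
    r"(?P<ready>\d+/\d+)\s+"
    r"(?P<status>\S+)\s+"
    r"(?P<restarts>\d+)"
    r"(?:\s+\((?P<last>[^)]+) ago\))?"  # present only when restarts > 0
    r"\s+(?P<age>\S+)"
)
DURATION_RE = re.compile(r"(\d+)([dhms])")
UNIT_SECONDS = {"d": 86400, "h": 3600, "m": 60, "s": 1}


def to_seconds(text):
    """Convert a kubectl-style duration such as '8m29s' or '34m' to seconds."""
    return sum(int(n) * UNIT_SECONDS[u] for n, u in DURATION_RE.findall(text))


def restarted_after_startup(line):
    """Return (pod_name, bool), or None if the line is not a pod row."""
    m = POD_RE.match(line.strip())
    if m is None:
        return None
    last = m.group("last")
    recent = last is not None and to_seconds(last) < to_seconds(m.group("age"))
    return m.group("name"), recent


# Sample rows copied from the two listings above (whitespace collapsed).
SAMPLE = [
    "[2024-11-14T09:39:14.962Z] + kubectl get pods -o wide",
    "[2024-11-14T09:39:15.254Z] etcd-followers-pod-failure-18551-0 1/1 Running 2 (13m ago) 34m 10.104.27.207 4am-node31 <none> <none>",
    "[2024-11-14T09:39:15.254Z] etcd-followers-pod-failure-18551-milvus-querynode-784c5c44fxpxt 1/1 Running 2 (34m ago) 34m 10.104.25.243 4am-node30 <none> <none>",
    "[2024-11-14T08:56:04.495Z] querynode-pod-failure-18547-milvus-querynode-85456b679f-67pm6 1/1 Running 5 (8m29s ago) 32m 10.104.25.238 4am-node30 <none> <none>",
    "[2024-11-14T08:56:04.496Z] querynode-pod-failure-18547-minio-0 1/1 Running 0 32m 10.104.26.195 4am-node32 <none> <none>",
]

results = dict(r for r in map(restarted_after_startup, SAMPLE) if r is not None)
for name, recent in results.items():
    print(name, "restarted after startup" if recent else "no restart after startup")
```

On these rows, only the etcd pod in the etcd follower test and the QueryNode pod in the QueryNode test are flagged, which is consistent with the statement that the etcd follower chaos did not restart any Milvus component.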

Anything else?

This issue did not reproduce with the image master-20241112-f5b06a3c-amd64.

@zhuwenxing zhuwenxing added kind/bug Issues or changes related a bug needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Nov 15, 2024
@zhuwenxing zhuwenxing added priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. severity/critical Critical, lead to crash, data missing, wrong result, function totally doesn't work. labels Nov 15, 2024
@zhuwenxing zhuwenxing added this to the 2.5.0 milestone Nov 15, 2024
@yanliang567
Contributor

/assign @liliu-z
/unassign

@sre-ci-robot sre-ci-robot assigned liliu-z and unassigned yanliang567 Nov 15, 2024
@yanliang567 yanliang567 added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Nov 15, 2024
@liliu-z
Member

liliu-z commented Nov 15, 2024

/assign @weiliu1031
