[Bug]: The deleteBufferSizeProtection is enabled, but cold write rate is not triggered when lowWaterLevel is reached #37723

Open
1 task done
ThreadDao opened this issue Nov 15, 2024 · 1 comment
Assignees
Labels
kind/bug Issues or changes related a bug priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. triage/accepted Indicates an issue or PR is ready to be actively worked on.
Milestone

Comments

@ThreadDao
Contributor

Is there an existing issue for this?

  • I have searched the existing issues

Environment

- Milvus version: 2.4-20241114-d23da2db-amd64
- Deployment mode(standalone or cluster):cluster
- MQ type(rocksmq, pulsar or kafka):    
- SDK version(e.g. pymilvus v2.0.0rc2):
- OS(Ubuntu or CentOS): 
- CPU/Memory: 
- GPU: 
- Others:

Current Behavior

server config

  • qn: 5*8c32g
  • config
  config:
    dataCoord:
      enableActiveStandby: true
      segment:
        expansionRate: 1.15
        maxSize: 2048
        sealProportion: 0.12
    indexCoord:
      enableActiveStandby: true
    log:
      level: debug
    minio:
      accessKeyID: miniozong
      bucketName: bucket-zong
      rootPath: compact_3
      secretAccessKey: miniozong
    queryCoord:
      enableActiveStandby: true
    queryNode:
      levelZeroForwardPolicy: RemoteLoad
      streamingDeltaForwardPolicy: FilterByBF
    quotaAndLimits:
      dml:
        deleteRate:
          max: 2
        enabled: true
        insertRate:
          max: 16
      limitWriting:
        deleteBufferRowCountProtection:
          enabled: true
          highWaterLevel: 25000000
          lowWaterLevel: 12000000
        deleteBufferSizeProtection:
          enabled: true
          highWaterLevel: 1073741824
          lowWaterLevel: 268435456
        growingSegmentsSizeProtection:
          enabled: true
          highWaterLevel: 0.2
          lowWaterLevel: 0.1
          minRateRatio: 0.5
        l0SegmentsRowCountProtection:
          enabled: true
          highWaterLevel: 50000000
          lowWaterLevel: 25000000
        memProtection:
          dataNodeMemoryHighWaterLevel: 0.85
          dataNodeMemoryLowWaterLevel: 0.75
          queryNodeMemoryHighWaterLevel: 0.85
          queryNodeMemoryLowWaterLevel: 0.75
      limits:
        complexDeleteLimitEnable: true
    rootCoord:
      enableActiveStandby: true
    trace:
      exporter: jaeger
      jaeger:
        url: http://tempo-distributor.tempo:14268/api/traces
      sampleFraction: 1
  • refresh the quotaAndLimits config by etcdctl put
$ ./etcdctl --endpoints 10.104.16.184:2379 get --prefix compact-opt-100m-3/config
compact-opt-100m-3/config/dataCoord.compaction.taskPrioritizer
level
compact-opt-100m-3/config/quotaandlimits.limitwriting.deleteBufferRowCountProtection.enabled
true
compact-opt-100m-3/config/quotaandlimits.limitwriting.deleteBufferSizeProtection.enabled
true
compact-opt-100m-3/config/quotaandlimits.limitwriting.deleteBufferSizeProtection.highWaterLevel
536870912
compact-opt-100m-3/config/quotaandlimits.limitwriting.deleteBufferSizeProtection.lowWaterLevel
134217728
compact-opt-100m-3/config/quotaandlimits.limitwriting.l0SegmentsRowCountProtection.enabled
true
compact-opt-100m-3/config/quotaandlimits.limitwriting.l0SegmentsRowCountProtection.lowWaterLevel
12000000
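
For reference, the byte-valued overrides above can be converted to readable units. This is a minimal sketch, assuming the `etcdctl get --prefix` output alternates key and value lines as shown; `parse_etcd_overrides` and `to_mib` are hypothetical helper names, not part of Milvus or etcd.

```python
# Hypothetical helpers for reading the etcdctl output above; not part of Milvus or etcd.
ETCD_OUTPUT = """\
compact-opt-100m-3/config/quotaandlimits.limitwriting.deleteBufferSizeProtection.highWaterLevel
536870912
compact-opt-100m-3/config/quotaandlimits.limitwriting.deleteBufferSizeProtection.lowWaterLevel
134217728
"""


def parse_etcd_overrides(text, prefix):
    """Pair alternating key/value lines and strip the common key prefix."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    pairs = zip(lines[0::2], lines[1::2])
    return {
        (key[len(prefix):] if key.startswith(prefix) else key): value
        for key, value in pairs
    }


def to_mib(n_bytes):
    """Convert a byte count (int or numeric string) to MiB."""
    return int(n_bytes) / (1024 * 1024)


overrides = parse_etcd_overrides(ETCD_OUTPUT, "compact-opt-100m-3/config/")
for key, value in overrides.items():
    print(f"{key} = {value} bytes ({to_mib(value):.0f} MiB)")
```

With the values in this report, the effective deleteBufferSizeProtection thresholds are 128 MiB (low) and 512 MiB (high), half of the 256 MiB and 1 GiB set in the deployment config.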

test

  • collection schema
{'auto_id': False, 'description': '', 'fields': [{'name': 'id', 'description': '', 'type': <DataType.VARCHAR: 21>, 'params': {'max_length': 64}, 'is_primary': True, 'auto_id': False}, {'name': 'float_vector', 'description': '', 'type': <DataType.FLOAT_VECTOR: 101>, 'params': {'dim': 128}}], 'enable_dynamic_field': False}
  • delete 60M of the 100M VARCHAR pks with 'in' expressions during concurrent search
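
A minimal sketch of how such batched 'in' deletes might be built, assuming string primary keys in a field named `id` as in the schema above. The batch size, the sample keys, and the helper names are hypothetical; the report does not say how the test client batches its deletes.

```python
import json


def build_in_expr(field, pks):
    """Build a filter such as: id in ["a", "b"].

    json.dumps quotes and escapes the strings; this assumes the keys need
    no escaping beyond JSON string rules.
    """
    return f"{field} in {json.dumps(list(pks))}"


def batched(items, size):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


# Hypothetical sample keys; the real test deletes 60M of 100M VARCHAR pks.
sample_pks = [str(i) for i in range(5)]
exprs = [build_in_expr("id", batch) for batch in batched(sample_pks, 2)]
for expr in exprs:
    # With pymilvus, each expression would then be passed to Collection.delete(expr).
    print(expr)
```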

results

  1. After 37890000 pks were deleted, delete failed because writing was denied due to memory usage. The 5 querynodes were OOMKilled and 4 of them could not recover to Running
compact-opt-100m-3-milvus-datanode-7cdc6d474c-xjlpm            1/1     Running            0                25h     10.104.15.161   4am-node20   <none>           <none>
compact-opt-100m-3-milvus-indexnode-6ccb5877c6-2zr87           1/1     Running            0                25h     10.104.5.215    4am-node12   <none>           <none>
compact-opt-100m-3-milvus-indexnode-6ccb5877c6-fcq9c           1/1     Running            1 (24h ago)      25h     10.104.9.102    4am-node14   <none>           <none>
compact-opt-100m-3-milvus-indexnode-6ccb5877c6-fdl67           1/1     Running            0                25h     10.104.13.161   4am-node16   <none>           <none>
compact-opt-100m-3-milvus-mixcoord-5b8cc74bb5-nhxdh            1/1     Running            0                25h     10.104.4.101    4am-node11   <none>           <none>
compact-opt-100m-3-milvus-proxy-8687885695-t7lgz               1/1     Running            0                25h     10.104.18.44    4am-node25   <none>           <none>
compact-opt-100m-3-milvus-querynode-0-67754bc4b-4kxff          0/1     CrashLoopBackOff   16 (101s ago)    25h     10.104.34.25    4am-node37   <none>           <none>
compact-opt-100m-3-milvus-querynode-0-67754bc4b-7p9pb          0/1     CrashLoopBackOff   16 (114s ago)    25h     10.104.14.119   4am-node18   <none>           <none>
compact-opt-100m-3-milvus-querynode-0-67754bc4b-7x554          0/1     CrashLoopBackOff   16 (79s ago)     25h     10.104.6.222    4am-node13   <none>           <none>
compact-opt-100m-3-milvus-querynode-0-67754bc4b-g66bb          1/1     Running            2 (72m ago)      7h5m    10.104.13.112   4am-node16   <none>           <none>
compact-opt-100m-3-milvus-querynode-0-67754bc4b-h95z6          0/1     CrashLoopBackOff   16 (3m36s ago)   25h     10.104.9.83     4am-node14   <none>           <none>
  2. The deleteBufferSizeProtection is enabled, but cold write rate is not triggered when lowWaterLevel is reached
    [image]
    [image]
    [image]

Expected Behavior

  • deleteBufferSizeProtection works well
  • no queryNode OOM, or queryNodes recover after OOM
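
To make the first expectation concrete, here is a sketch of the usual low/high water-level pattern: no throttling below lowWaterLevel, writes denied at or above highWaterLevel, and a reduced rate in between. The linear interpolation and the function name `write_rate_factor` are assumptions for illustration, not taken from the Milvus source.

```python
def write_rate_factor(usage, low, high):
    """Return a multiplier in [0, 1] for the configured write rate.

    Assumed model: linear drop from 1.0 at `low` to 0.0 at `high`.
    """
    if low >= high:
        raise ValueError("low water level must be below high water level")
    if usage <= low:
        return 1.0
    if usage >= high:
        return 0.0
    return (high - usage) / (high - low)


LOW = 134217728      # lowWaterLevel from the etcd override, in bytes
HIGH = 536870912     # highWaterLevel from the etcd override, in bytes
DELETE_RATE_MAX = 2  # dml.deleteRate.max from the config (units as configured)

for usage in (LOW // 2, LOW, (LOW + HIGH) // 2, HIGH):
    factor = write_rate_factor(usage, LOW, HIGH)
    print(usage, factor, DELETE_RATE_MAX * factor)
```

Under this assumed model, a delete buffer at the midpoint between the two levels would halve the allowed delete rate; the report observes no such reduction once lowWaterLevel is reached.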

Steps To Reproduce

No response

Milvus Log

No response

Anything else?

No response

@ThreadDao ThreadDao added kind/bug Issues or changes related a bug needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Nov 15, 2024
@ThreadDao ThreadDao added the priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. label Nov 15, 2024
@ThreadDao ThreadDao added this to the 2.4.16 milestone Nov 15, 2024
@yanliang567
Contributor

/unassign

@yanliang567 yanliang567 added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Nov 15, 2024

4 participants