
gRPC using reflect race on registering types #3961

Open
jason-speck opened this issue Sep 18, 2024 · 4 comments


jason-speck commented Sep 18, 2024

Brief summary

We have tests that use the constant-arrival-rate executor against a number of gRPC applications, using reflection. Most of these run fine, but against some applications we occasionally receive a panic:

panic: proto: message proto.com.XXXXX.XXXXX.XXXXX.MyService1 is already registered

This is not completely deterministic, but for the applications where it occurs, the probability increases as we increase the arrival rate.

As a workaround we are switching these tests to ramping-arrival-rate, but ideally we would like to be able to use constant-arrival-rate.

(I've obfuscated actual host and service names in all output)

k6 version

v0.53.0

OS

Mariner 2.0.20240628

Docker version and image (if applicable)

n/a

Steps to reproduce the problem

version:

$ k6 --version
k6 v0.53.0 (go1.23.1, linux/amd64)
Extensions:
  github.com/grafana/xk6-sql v0.4.1, k6/x/sql [js]
  github.com/mostafa/xk6-kafka v0.27.0, k6/x/kafka [js]
  github.com/thotasrinath/xk6-couchbase v0.0.8, k6/x/couchbase [js]

OS:

$ cat lsb-release 
DISTRIB_ID="Mariner"
DISTRIB_RELEASE="2.0.20240628"
DISTRIB_CODENAME=Mariner
DISTRIB_DESCRIPTION="CBL-Mariner 2.0.20240628"

$ cat mariner-release 
CBL-Mariner 2.0.20240628
MARINER_BUILD_NUMBER=baf1849

$ uname -a
Linux XXXX-XX999.XXX.XXXX.com 5.15.160.1-1.cm2 #1 SMP Fri Jun 28 10:31:37 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux

Simple test script. See attached for the script and output from several runs. Note that the panic line may show different gRPC endpoints; it doesn't always show the same one (I gather this is related to reflection).

$ /path/to/k6/intel/k6 run   /tmp/XXXXX/k6.js  

          /\      |‾‾| /‾‾/   /‾‾/   
     /\  /  \     |  |/  /   /  /    
    /  \/    \    |     (   /   ‾‾\  
   /          \   |  |\  \ |  (‾)  | 
  / __________ \  |__| \__\ \_____/ .io

     execution: local
        script: /tmp/XXXXX/k6.js
        output: -

     scenarios: (100.00%) 1 scenario, 50 max VUs, 35s max duration (incl. graceful stop):
              * constant-qps: 20000.00 iterations/s for 5s (maxVUs: 50, gracefulStop: 30s)

WARN[0000] Insufficient VUs, reached 50 active VUs and cannot initialize more  executor=constant-arrival-rate scenario=constant-qps
panic: proto: message proto.com.XXXX.XXXX.XXXX.SomeServiceName is already registered

Expected behaviour

Test executes.

Actual behaviour

See attached output files.
k6.js.txt
k6.out.1.txt
k6.out.2.txt
k6.out.3.txt

@mstoykov (Contributor)

Hi @jason-speck, thanks for reporting this 🙇

This doesn't really have anything to do with the executors, apart from some of them likely making the race more common.

By the looks of it, this is due to a race in the registration of types when using reflection:

_, errFind := protoregistry.GlobalTypes.FindMessageByName(message.FullName())
if errors.Is(errFind, protoregistry.NotFound) {
    err = protoregistry.GlobalTypes.RegisterMessage(dynamicpb.NewMessageType(message))
    if err != nil {
        return false
    }
}

While the underlying code

// RegisterMessage registers the provided message type.
//
// If a naming conflict occurs, the type is not registered and an error is returned.
func (r *Types) RegisterMessage(mt protoreflect.MessageType) error {
    // Under rare circumstances getting the descriptor might recursively
    // examine the registry, so fetch it before locking.
    md := mt.Descriptor()
    if r == GlobalTypes {
        globalMutex.Lock()
        defer globalMutex.Unlock()
    }
    if err := r.register("message", md, mt); err != nil {
        return err
    }
    r.numMessages++
    return nil
}

does have a mutex, it only locks inside RegisterMessage itself, after our not-found check has already run. So two different VUs doing gRPC requests can reach this point together: one registers the type while holding the lock, while the other waits and then tries to register it again. The latter gets the error you are seeing.
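To make the window concrete, here is a minimal sketch of a fix that serializes the lookup and the registration. This is not the actual k6 code; the package name, registerMu, and registerMessage are illustrative:

package grpcext // hypothetical package, for illustration only

import (
    "errors"
    "sync"

    "google.golang.org/protobuf/reflect/protoreflect"
    "google.golang.org/protobuf/reflect/protoregistry"
    "google.golang.org/protobuf/types/dynamicpb"
)

// registerMu makes the "is it registered?" check and the registration
// atomic with respect to each other, closing the window in which two VUs
// can both see NotFound and both attempt to register the same type.
var registerMu sync.Mutex

func registerMessage(md protoreflect.MessageDescriptor) error {
    registerMu.Lock()
    defer registerMu.Unlock()

    if _, err := protoregistry.GlobalTypes.FindMessageByName(md.FullName()); err == nil {
        return nil // already registered, nothing to do
    } else if !errors.Is(err, protoregistry.NotFound) {
        return err // some unrelated lookup failure
    }
    return protoregistry.GlobalTypes.RegisterMessage(dynamicpb.NewMessageType(md))
}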

I think that:

  1. we probably shouldn't be using GlobalTypes to begin with (a sketch of this follows the list)
  2. we should hold the lock around the not-found check (essentially the sketch above)
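And a sketch of what option 1 could look like: a registry private to k6's gRPC module instead of the process-wide GlobalTypes. Again, the names are illustrative, and the same imports as in the sketch above are assumed:

// localTypes is owned by the module, so dynamically built types can never
// collide with anything else registered in protoregistry.GlobalTypes by
// other code in the process. Note that, per the RegisterMessage source
// quoted above, Types only locks globalMutex when the receiver is
// GlobalTypes, so a private registry needs its own mutex anyway.
var (
    localMu    sync.Mutex
    localTypes protoregistry.Types
)

func registerMessageLocally(md protoreflect.MessageDescriptor) error {
    localMu.Lock()
    defer localMu.Unlock()

    if _, err := localTypes.FindMessageByName(md.FullName()); err == nil {
        return nil // already in our private registry
    } else if !errors.Is(err, protoregistry.NotFound) {
        return err
    }
    return localTypes.RegisterMessage(dynamicpb.NewMessageType(md))
}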

cc @olegbespalov

@jason-speck you can probably try the proposal from the protobuf FAQ of running with GOLANG_PROTOBUF_REGISTRATION_CONFLICT=warn k6 ... so this surfaces as a warning instead of a panic.

@mstoykov mstoykov changed the title panic on some constant-arrival-rate tests on grpc services gRPC using reflect race on registering types Sep 30, 2024
@mstoykov mstoykov added this to the v0.55.0 milestone Sep 30, 2024
@mstoykov mstoykov removed their assignment Oct 3, 2024

sharathjag commented Oct 30, 2024

@mstoykov will this be part of the Nov 11 release?

Also, regarding:

I think that:
  1. we probably shouldn't be using GlobalTypes to begin with
  2. we should hold the lock around the not-found check

Do we need both 1 and 2, or just 2?

@mstoykov (Contributor)

Hi @sharathjag, nobody has picked this up yet, so it seems like it currently won't be part of that release.

I would argue for both 1 and 2, but at least 2 is needed to fix the issue. I would still argue it should be both, so that we do not mix our types with the global ones.

You are free to work on this if you want, but we might still decide to postpone merging the PR until after the v0.55.0 release, especially if the changes turn out to be big and the PR only comes next week.

oleiade (Member) commented Nov 4, 2024

With the release being one week away, I'll move this to the next one (to avoid slowing down the release process and risking last-minute issues) 🙇🏻

@oleiade oleiade modified the milestones: v0.55.0, v0.56.0 Nov 4, 2024