test: binary/binary-boundary.test.py test gets stuck from time to time #73

Closed
ylobankov opened this issue Nov 16, 2021 · 8 comments · Fixed by #79 or #94
Assignees
Labels
bug Something isn't working

Comments

@ylobankov
Member

ylobankov commented Nov 16, 2021

From time to time the binary/binary-boundary.test.py test gets stuck:

$ make test-memcached
[  3%] Built target generate_proto_txt_parser_c
[ 11%] Built target msgpuck
[ 55%] Built target small
[100%] Built target internalso
Started ./test-run.py --force
Collecting tests in 'binary': memcached binary tests.
Found 4 tests.
Collecting tests in 'capable': memcapable integration.
Found 2 tests.
Collecting tests in 'sasl': memcached binary sasl tests.
Found 1 tests.
Collecting tests in 'text': memcached text tests.
Found 10 tests.
Installing the server ...
    Found executable at /usr/bin/tarantool
    Found tarantoolctl at  /usr/bin/tarantoolctl
    Creating and populating working directory in /home/ubuntu/actions-runner/_work/tarantool/tarantool/test/var ...
    Found old vardir, deleting ...
Starting the server ...
Starting /home/ubuntu/actions-runner/_work/tarantool/tarantool/test/var/binary.lua
Tarantool 2.10.0-beta1-225-g816e1dbd1
Target: Linux-x86_64-RelWithDebInfo
Build options: cmake . -DCMAKE_INSTALL_PREFIX=/usr -DENABLE_BACKTRACE=ON
Compiler: /usr/bin/cc /usr/lib/ccache/g++
C_FLAGS:-g -O2 -fdebug-prefix-map=/build/tarantool-2.10.0~beta1.225.dev=. -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -fexceptions -funwind-tables -fno-omit-frame-pointer -fno-stack-protector -fno-common -fopenmp -msse2 -std=c11 -Wall -Wextra -Wno-strict-aliasing -Wno-char-subscripts -Wno-format-truncation -Wno-gnu-alignof-expression -fno-gnu89-inline -Wno-cast-function-type
CXX_FLAGS:-g -O2 -fdebug-prefix-map=/build/tarantool-2.10.0~beta1.225.dev=. -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -fexceptions -funwind-tables -fno-omit-frame-pointer -fno-stack-protector -fno-common -fopenmp -msse2 -std=c++11 -Wall -Wextra -Wno-strict-aliasing -Wno-char-subscripts -Wno-format-truncation -Wno-invalid-offsetof -Wno-gnu-alignof-expression -Wno-cast-function-type
================================================================================
TEST                                            PARAMS		RESULT
---------------------------------------------------------------------------
binary/binary-boundary.test.py

It looks like we hang at some iteration, because the test/var/binary-boundary.result file ends as follows:

#---------------------# test protocol boundary overruns #---------------------#
iteration 1900
iteration 1901
iteration 1902
iteration 1903
iteration 1904
iteration 1905
iteration 1906
iteration 1907
iteration 1908
iteration 1909
iteration 1910
iteration 1911
iteration 1912
iteration 1913
iteration 1914
iteration 1915
iteration 1916
iteration 1917
iteration 1918
iteration 1919
iteration 1920
iteration 1921
iteration 1922
iteration 1923
iteration 1924
iteration 1925
iteration 1926
iteration 1927
iteration 1928
iteration 1929
iteration 1930
iteration 1931
iteration 1932
iteration 1933
iteration 1934
iteration 1935
iteration 1936
iteration 1937
iteration 1938
iteration 1939
iteration 1940
iteration 1941
iteration 1942
iteration 1943
iteration 1944
iteration 1945
iteration 1946
iteration 1947
iteration 1948
iteration 1949
iteration 1950
iteration 1951
iteration 1952
iteration 1953
iteration 1954

Unfortunately, I don't have exact repro steps to make this situation happen. But running the tests in a loop reproduces this issue with high probability:

for i in {1..50}; do make test-memcached; done

I did some debugging and found that we hang in this loop, or, to be more precise, here. If we dig deeper, it turns out that we actually get stuck here.

Please note that #51 may happen because of the same problem.

@ligurio
Member

ligurio commented Nov 18, 2021

But running the tests in a loop reproduces this issue with high probability:

How many times out of 50?

@ylobankov
Member Author

But running the tests in a loop reproduces this issue with high probability:

How many times out of 50?

Sorry, I didn't take such measurements.

@ligurio
Member

ligurio commented Nov 30, 2021

I need more details about the environment. With a script that runs the test in a loop (for i in {1..50}; do make test-memcached; done), all 50 iterations passed without failures.

sergeyb@pony:~/sources/MRG/memcached$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 20.04.3 LTS
Release:        20.04
Codename:       focal
$ tarantool --version
Tarantool 2.10.0-beta1-0-g7da4b1438
Target: Linux-x86_64-RelWithDebInfo
Build options: cmake . -DCMAKE_INSTALL_PREFIX=/usr -DENABLE_BACKTRACE=ON
Compiler: /usr/bin/cc /usr/lib/ccache/g++
C_FLAGS:-g -O2 -fdebug-prefix-map=/build/tarantool-2.10.0~beta1=. -specs=/usr/share/dpkg/no-pie-compile.specs -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -fexceptions -funwind-tables -fno-omit-frame-pointer -fno-stack-protector -fno-common -fopenmp -msse2 -std=c11 -Wall -Wextra -Wno-strict-aliasing -Wno-char-subscripts -Wno-format-truncation -Wno-gnu-alignof-expression -fno-gnu89-inline -Wno-cast-function-type
CXX_FLAGS:-g -O2 -fdebug-prefix-map=/build/tarantool-2.10.0~beta1=. -specs=/usr/share/dpkg/no-pie-compile.specs -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -fexceptions -funwind-tables -fno-omit-frame-pointer -fno-stack-protector -fno-common -fopenmp -msse2 -std=c++11 -Wall -Wextra -Wno-strict-aliasing -Wno-char-subscripts -Wno-format-truncation -Wno-invalid-offsetof -Wno-gnu-alignof-expression -Wno-cast-function-type
sergeyb@pony:~/sources/MRG/memcached$ 

tarantool-dev 2.10.0~beta1-1 amd64
libevent-dev 2.1.11-stable-1
libsasl2-dev 2.1.27+dfsg-2 amd64

@ylobankov, please assist with a repro.

ligurio added a commit that referenced this issue Dec 2, 2021
The test is unstable, and to set up a simple CI (#75) we need to disable it.
A fix is planned in #73.
ligurio added a commit that referenced this issue Dec 2, 2021
ligurio added a commit that referenced this issue Dec 2, 2021
@ylobankov
Member Author

I need more details about the environment. With a script that runs the test in a loop (for i in {1..50}; do make test-memcached; done), all 50 iterations passed without failures.

@ylobankov, please assist with a repro.

I reproduced this bug on Ubuntu 20.04 quite often. If you still have issues reproducing it, ping me and I will provide the server.

@ligurio
Member

ligurio commented Dec 13, 2021

How to reproduce the problem:

  1. Apply the patch below to the test:
--- test/binary/binary-boundary.test.py	2021-12-02 19:09:10.852429691 +0300
+++ test/binary/binary-boundary.py	2021-12-11 10:45:11.327712154 +0300
@@ -4,19 +4,19 @@
 import inspect
 import traceback
 
-saved_path = sys.path[:]
-sys.path.append(os.path.dirname(os.path.abspath(inspect.getsourcefile(lambda:0))))
+root = os.getcwd()
+sys.path.append(os.path.join(root, 'test-run'))
 
 from internal.memcached_connection import MemcachedBinaryConnection
 from internal.memcached_connection import STATUS, COMMANDS
 
-mc = MemcachedBinaryConnection("127.0.0.1", iproto.py_con.port)
+mc = MemcachedBinaryConnection("127.0.0.1", 8080)
 
 def iequal(left, right, level = 1):
     if (left != right):
         tb = traceback.extract_stack()[-(level + 1)]
-        print "Error on line %s:%d: %s not equal %s" % (tb[0], tb[1],
-                repr(left), repr(right))
+        print("Error on line %s:%d: %s not equal %s" % (tb[0], tb[1],
+                repr(left), repr(right)))
         if (isinstance(left, basestring)):
             if (len(left) != len(right)):
                 print("length is different")
@@ -28,28 +28,29 @@
             iequal(res.get('val', val), val, level + 1)
 
 def check(key, flags, val, level = 0):
+    print("Get", key)
     res = mc.get(key)
-    __check(res[0], flags, val, level + 1)
+    #__check(res[0], flags, val, level + 1)
 
 print("""#---------------------# test protocol boundary overruns #---------------------#""")
 
-for i in range(1900, 2100):
+for i in range(1950, 2000):
     print ("iteration %d" % i)
     key  = "test_key_%d" % i
     val  = "x" * i
     mc.setq(key, val, flags=82, nosend=True)
     mc.setq("alt_%s" % key, "blah", flags=82, nosend=True)
-    data = "".join(mc.commands)
+    data = b"".join(mc.commands)
     mc.commands = []
 
     if (len(data) > 2024):
         for j in range(2024, min(2096, len(data))):
+            print("send", data[:j])
             mc.socket.sendall(data[:j])
             time.sleep(0.00001)
+            print("send", data[j:])
             mc.socket.sendall(data[j:])
     else:
         mc.socket.sendall(data)
     check(key, 82, val)
-    check("alt_%s" % key, 82, "blah")
-
-sys.path = saved_path
+    #check("alt_%s" % key, 82, "blah")
  2. Create a Tarantool instance file repro.lua with the content below:
box.cfg{
    log_level = 7,
    feedback_enabled = false,
    --log = 'tarantool.log',
    --background = true,
    --pid_file = '/home/sergeyb/sources/MRG/memcached/pid',
}
mc = require('memcached')
m = mc.create('instance_1', '8080', {
    expire_enabled = false,
    protocol = 'binary',
    verbosity = 3
})

require('console'):start()
  3. Start Tarantool: tarantool repro.lua
  4. Run the test: python test/binary/binary-boundary.py
  5. The test gets stuck with the message: ('Get', 'test_key_1952')

What's going on

  • the test sends two commands at once (setq and setq):
    mc.setq(key, val, flags=82, nosend=True)                                     
    mc.setq("alt_%s" % key, "blah", flags=82, nosend=True)                       
    data = b"".join(mc.commands)                                                 
    mc.commands = []                                                             
                                                                                 
    if (len(data) > 2024):                                                       
        for j in range(2024, min(2096, len(data))):                              
            print("send", data[:j])                                              
            mc.socket.sendall(data[:j])                                          
            time.sleep(0.00001)                                                  
            print("send", data[j:])                                              
            mc.socket.sendall(data[j:])  
  • memcached successfully processed them (check it with the command box.space.__mc_instance_1:get('test_key_1952') in Tarantool's console) and reads a request with GET for key test_key_1952:
2021-12-13 14:55:08.799 [568488] main/113/server/127.0.0.1:36442 D> tuple_delete(0x7f4a0e00f704)                                                              
2021-12-13 14:55:08.799 [568488] main/113/server/127.0.0.1:36442 D> memtx_tuple_delete(0x7f4a0e00f704)                                                        
2021-12-13 14:55:08.799 [568488] main/113/server/127.0.0.1:36442 D> memcached_loop:205                                                                        
2021-12-13 14:55:08.799 [568488] main/113/server/127.0.0.1:36442 D> memcached_loop:210                                                                        
2021-12-13 14:55:08.799 [568488] main/113/server/127.0.0.1:36442 D> ibuf_used 2087, rc 0, batch_count 19                                                      
2021-12-13 14:55:08.799 [568488] main/113/server/127.0.0.1:36442 D> memcached_loop:222                                                                        
2021-12-13 14:55:08.799 [568488] main/113/server/127.0.0.1:36442 D> memcached_loop:176 parse_request                                                          
2021-12-13 14:55:08.799 [568488] main/113/server/127.0.0.1:36442 D> memcached_loop:202                                                                        
2021-12-13 14:55:08.799 [568488] main/113/server/127.0.0.1:36442 D> memcached_bin_process
2021-12-13 14:55:08.799 [568488] main/113/server/127.0.0.1:36442 D> memcached_bin_process_set
2021-12-13 14:55:08.799 [568488] main/113/server/127.0.0.1:36442 D> SETQ 'test_key_1952' '<DATA>'
375731712, expire - 0                                                                                                                                         
2021-12-13 14:55:08.799 [568488] main/113/server/127.0.0.1:36442 D> opaque - 7, cas - 0
2021-12-13 14:55:08.799 [568488] main/113/server/127.0.0.1:36442 D> memtx_tuple_new(1992) = 0x7f4a0e0c2428
2021-12-13 14:55:08.799 [568488] main D> cpipe_flush_cb: locking &endpoint->mutex
2021-12-13 14:55:08.799 [568488] main D> cpipe_flush_cb: unlocking &endpoint->mutex
2021-12-13 14:55:08.799 [568488] wal/101/main D> cbus_endpoint_fetch: locking &endpoint->mutex
2021-12-13 14:55:08.799 [568488] wal/101/main D> cbus_endpoint_fetch: unlocking &endpoint->mutex
2021-12-13 14:55:08.799 [568488] wal D> cpipe_flush_cb: locking &endpoint->mutex
2021-12-13 14:55:08.799 [568488] wal D> cpipe_flush_cb: unlocking &endpoint->mutex
2021-12-13 14:55:08.799 [568488] main D> cbus_endpoint_fetch: locking &endpoint->mutex
2021-12-13 14:55:08.799 [568488] main D> cbus_endpoint_fetch: unlocking &endpoint->mutex
2021-12-13 14:55:08.799 [568488] main/113/server/127.0.0.1:36442 D> tuple_delete(0x7f4a0e0c141c)
2021-12-13 14:55:08.799 [568488] main/113/server/127.0.0.1:36442 D>memtx_tuple_delete(0x7f4a0e0c141c)
2021-12-13 14:55:08.799 [568488] main/113/server/127.0.0.1:36442 D> memtx_tuple_delete(0x7f4a0e0c141c)
2021-12-13 14:55:08.799 [568488] main/113/server/127.0.0.1:36442 D> memcached_loop:205
2021-12-13 14:55:08.799 [568488] main/113/server/127.0.0.1:36442 D> memcached_loop:210
2021-12-13 14:55:08.799 [568488] main/113/server/127.0.0.1:36442 D> ibuf_used 90, rc 0, batch_count 20
2021-12-13 14:55:08.799 [568488] main/113/server/127.0.0.1:36442 D> ibuf_used 90
2021-12-13 14:55:08.799 [568488] main/113/server/127.0.0.1:36442 D> memcached_loop: 230 memcached_flush
2021-12-13 14:55:08.799 [568488] main/113/server/127.0.0.1:36442 D> memcached_loop:232
2021-12-13 14:55:08.799 [568488] main/113/server/127.0.0.1:36442 D> memcached_loop_read to_read 1
2021-12-13 14:55:08.799 [568488] main/113/server/127.0.0.1:36442 D> mnet_read_ahead() start
2021-12-13 14:55:08.799 [568488] main/113/server/127.0.0.1:36442 D> mnet_read_ahead: nrd -1
2021-12-13 14:55:08.799 [568488] main/113/server/127.0.0.1:36442 D> mnet_read_ahead: coio_wait begin
  • batch_count is set to 20 by default, and once the condition below no longer holds we fall out of the if..then..else and go back to the next memcached_loop_read() at the beginning of the loop, see internal/memcached.c (a small Python model of this logic follows the snippet below):
		} else if (rc == 0 && ibuf_used(con->in) > 0 &&
			   batch_count < con->cfg->batch_count) {
			batch_count++;
			goto next;
		}
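To make the batching behavior easier to follow, here is a minimal Python model of the logic in the snippet above (purely illustrative, not the real C code): buffered requests are drained only while batch_count stays below the limit; once the limit is hit, the loop flushes and goes back to a blocking read, leaving the rest of the buffer unprocessed.

# Toy model of memcached_loop()'s batching; BATCH_LIMIT mirrors
# con->cfg->batch_count (20 by default). Purely illustrative.
BATCH_LIMIT = 20

def drain(buffered_requests):
    """Process buffered requests until the batch limit is reached."""
    batch_count = 0
    processed = []
    i = 0
    while i < len(buffered_requests):
        processed.append(buffered_requests[i])  # parse_request + process_request
        i += 1
        if i < len(buffered_requests) and batch_count < BATCH_LIMIT:
            batch_count += 1                    # "goto next": keep draining the buffer
            continue
        break                                   # fall out: flush, go back to the read
    return processed, buffered_requests[i:]

done, leftover = drain(["setq"] * 21 + ["get"])  # 21 quiet sets and one get
print(len(done), "requests processed, leftover:", leftover)
# -> 21 requests processed, leftover: ['get']  (the get waits for the next read)

With only 20 queued commands the whole buffer is drained and the get is answered, which is why 21 commands, not 20, are needed to hit the stall.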

ligurio added a commit that referenced this issue Dec 15, 2021
The test is unstable, and to set up a simple CI (#75) we need to disable it.
A fix is planned in #73.
ligurio added a commit that referenced this issue Dec 16, 2021
ligurio added a commit that referenced this issue Dec 16, 2021
@ligurio ligurio reopened this Dec 16, 2021
@ligurio
Member

ligurio commented Dec 16, 2021

Closed accidentally.

@ligurio
Member

ligurio commented Dec 17, 2021

memcached runs an endless read-parse-process-flush loop for each network connection.
The patch below adds debug printing that helps to understand which operation is executing in the loop, the current size of the ibuf with raw data, and the current value of batch_count:

diff --git a/memcached/internal/memcached.c b/memcached/internal/memcached.c
index 3f760c7..33b667b 100644
--- a/memcached/internal/memcached.c
+++ b/memcached/internal/memcached.c
@@ -154,6 +154,7 @@ memcached_loop(struct memcached_connection *con)
 	int batch_count = 0;
 
 	for (;;) {
+		say_info("%s, ibuf_used %d, batch_count %d", __func__, ibuf_used(con->in), batch_count);
 		rc = memcached_loop_read(con, to_read);
 		if (rc == -1) {
 			/**
@@ -166,6 +167,7 @@ memcached_loop(struct memcached_connection *con)
 next:
 		con->noreply = false;
 		con->noprocess = false;
+		say_info("%s, ibuf_used %d, batch_count %d", __func__, ibuf_used(con->in), batch_count);
 		rc = con->cb.parse_request(con);
 		if (rc == -1) {
 			memcached_loop_error(con);
@@ -177,6 +179,7 @@ next:
 			} else {
 				memcached_skip_request(con);
 			}
+		        say_info("%s, ibuf_used %d, batch_count %d", __func__, ibuf_used(con->in), batch_count);
 			memcached_flush(con);
 			batch_count = 0;
 			continue;
@@ -188,6 +191,7 @@ next:
 		assert(!con->close_connection);
 		rc = 0;
 		if (!con->noprocess) {
+		        say_info("%s, ibuf_used %d, batch_count %d", __func__, ibuf_used(con->in), batch_count);
 			rc = con->cb.process_request(con);
 			memcached_connection_gc(con);
 		}
@@ -205,11 +209,13 @@ next:
 		}
 		/* Write back answer */
 		if (!con->noreply)
+		        say_info("%s, ibuf_used %d", __func__, ibuf_used(con->in));
 			memcached_flush(con);
 		fiber_reschedule();
 		batch_count = 0;
 		continue;
 	}
+	say_info("%s, ibuf_used %d", __func__, ibuf_used(con->in));
 	memcached_flush(con);
 }

One can apply the patch and start Tarantool with memcached listening on port 8080 and accepting connections using the memcached text protocol:

tarantool -l 'memcached' -e "box.cfg{} memcached.create('instance_1', '8080', { protocol = 'text', verbosity = 3 }) require('console'):start()"

We can talk to memcached with text commands. For example, with the command below we set key tarantool to value database and read it back with the get command:

$ printf "set tarantool 0 0 8\r\ndatabase\r\nget tarantool\r\n" | nc 127.0.0.1 8080
STORED
VALUE tarantool 0 8
database
END

Tarantool prints the debug output:

tarantool> 2021-12-17 19:06:45.961 [1350583] main/121/server/127.0.0.1:37730 I> memcached_flush(), ibuf_used 0
2021-12-17 19:06:47.545 [1350583] main/123/server/127.0.0.1:37756 I> memcached_loop_read(), ibuf_used 0, batch_count 0
2021-12-17 19:06:47.545 [1350583] main/123/server/127.0.0.1:37756 I> parse_request(), ibuf_used 46, batch_count 0
2021-12-17 19:06:47.545 [1350583] main/123/server/127.0.0.1:37756 I> process_request(), ibuf_used 46, batch_count 0
2021-12-17 19:06:47.546 [1350583] main/123/server/127.0.0.1:37756 I> parse_request(), ibuf_used 15, batch_count 1
2021-12-17 19:06:47.546 [1350583] main/123/server/127.0.0.1:37756 I> process_request(), ibuf_used 15, batch_count 1
2021-12-17 19:06:47.546 [1350583] main/123/server/127.0.0.1:37756 I> memcached_flush(), ibuf_used 0
2021-12-17 19:06:47.546 [1350583] main/123/server/127.0.0.1:37756 I> memcached_loop_read(), ibuf_used 0, batch_count 0

The bug with the stalled get command can be reproduced with a sequence of 21 set commands (the default batch_count is 20, but due to an off-by-one error we need to send 21 instead of 20) and a final get:

printf "set foo 0 0 1\r\n1\r\nset foo 0 0 1\r\n1\r\nset foo 0 0 1\r\n1\r\nset foo 0 0 1\r\n1\r\nset foo 0 0 1\r\n1\r\nset foo 0 0 1\r\n1\r\nset foo 0 0 1\r\n1\r\nset foo 0 0 1\r\n1\r\nset foo 0 0 1\r\n1\r\nset foo 0 0 1\r\n1\r\nset foo 0 0 1\r\n1\r\nset foo 0 0 1\r\n1\r\nset foo 0 0 1\r\n1\r\nset foo 0 0 1\r\n1\r\nset foo 0 0 1\r\n1\r\nset foo 0 0 1\r\n1\r\nset foo 0 0 1\r\n1\r\nset foo 0 0 1\r\n1\r\nset foo 0 0 1\r\n1\r\nset foo 0 0 1\r\n1\r\nset foo 0 0 1\r\n1\r\nget foo\r\n" | nc 127.0.0.1 8080
<'STORED' repeats 20 times>                                                                                                                                  
STORED                                                                                                                                  
^C

In Tarantool session:

2021-12-17 19:41:08.778 [1356087] main/121/server/127.0.0.1:37948 I> memcached_loop_read(), ibuf_used 0, batch_count 0                  
2021-12-17 19:41:08.778 [1356087] main/121/server/127.0.0.1:37948 I> parse_request(), ibuf_used 387, batch_count 0       
2021-12-17 19:41:08.778 [1356087] main/121/server/127.0.0.1:37948 I> process_request(), ibuf_used 387, batch_count 0                    
2021-12-17 19:41:08.779 [1356087] main/121/server/127.0.0.1:37948 I> parse_request(), ibuf_used 369, batch_count 1                      
2021-12-17 19:41:08.779 [1356087] main/121/server/127.0.0.1:37948 I> process_request(), ibuf_used 369, batch_count 1                    
2021-12-17 19:41:08.779 [1356087] main/121/server/127.0.0.1:37948 I> parse_request(), ibuf_used 351, batch_count 2                      
2021-12-17 19:41:08.779 [1356087] main/121/server/127.0.0.1:37948 I> process_request(), ibuf_used 351, batch_count 2                    
2021-12-17 19:41:08.779 [1356087] main/121/server/127.0.0.1:37948 I> parse_request(), ibuf_used 333, batch_count 3                      
2021-12-17 19:41:08.779 [1356087] main/121/server/127.0.0.1:37948 I> process_request(), ibuf_used 333, batch_count 3                    
2021-12-17 19:41:08.780 [1356087] main/121/server/127.0.0.1:37948 I> parse_request(), ibuf_used 315, batch_count 4                      
2021-12-17 19:41:08.780 [1356087] main/121/server/127.0.0.1:37948 I> process_request(), ibuf_used 315, batch_count 4                    
2021-12-17 19:41:08.780 [1356087] main/121/server/127.0.0.1:37948 I> parse_request(), ibuf_used 297, batch_count 5                      
2021-12-17 19:41:08.780 [1356087] main/121/server/127.0.0.1:37948 I> process_request(), ibuf_used 297, batch_count 5                    
2021-12-17 19:41:08.780 [1356087] main/121/server/127.0.0.1:37948 I> parse_request(), ibuf_used 279, batch_count 6                      
2021-12-17 19:41:08.780 [1356087] main/121/server/127.0.0.1:37948 I> process_request(), ibuf_used 279, batch_count 6                    
2021-12-17 19:41:08.780 [1356087] main/121/server/127.0.0.1:37948 I> parse_request(), ibuf_used 261, batch_count 7                      
2021-12-17 19:41:08.780 [1356087] main/121/server/127.0.0.1:37948 I> process_request(), ibuf_used 261, batch_count 7                    
2021-12-17 19:41:08.780 [1356087] main/121/server/127.0.0.1:37948 I> parse_request(), ibuf_used 243, batch_count 8                      
2021-12-17 19:41:08.780 [1356087] main/121/server/127.0.0.1:37948 I> process_request(), ibuf_used 243, batch_count 8                    
2021-12-17 19:41:08.781 [1356087] main/121/server/127.0.0.1:37948 I> parse_request(), ibuf_used 225, batch_count 9                      
2021-12-17 19:41:08.781 [1356087] main/121/server/127.0.0.1:37948 I> process_request(), ibuf_used 225, batch_count 9                    
2021-12-17 19:41:08.781 [1356087] main/121/server/127.0.0.1:37948 I> parse_request(), ibuf_used 207, batch_count 10                     
2021-12-17 19:41:08.781 [1356087] main/121/server/127.0.0.1:37948 I> process_request(), ibuf_used 207, batch_count 10    
2021-12-17 19:41:08.781 [1356087] main/121/server/127.0.0.1:37948 I> process_request(), ibuf_used 189, batch_count 11                   
2021-12-17 19:41:08.781 [1356087] main/121/server/127.0.0.1:37948 I> parse_request(), ibuf_used 171, batch_count 12                     
2021-12-17 19:41:08.781 [1356087] main/121/server/127.0.0.1:37948 I> process_request(), ibuf_used 171, batch_count 12                   
2021-12-17 19:41:08.781 [1356087] main/121/server/127.0.0.1:37948 I> parse_request(), ibuf_used 153, batch_count 13                     
2021-12-17 19:41:08.781 [1356087] main/121/server/127.0.0.1:37948 I> process_request(), ibuf_used 153, batch_count 13                   
2021-12-17 19:41:08.781 [1356087] main/121/server/127.0.0.1:37948 I> parse_request(), ibuf_used 135, batch_count 14                     
2021-12-17 19:41:08.781 [1356087] main/121/server/127.0.0.1:37948 I> process_request(), ibuf_used 135, batch_count 14                   
2021-12-17 19:41:08.781 [1356087] main/121/server/127.0.0.1:37948 I> parse_request(), ibuf_used 117, batch_count 15                     
2021-12-17 19:41:08.782 [1356087] main/121/server/127.0.0.1:37948 I> process_request(), ibuf_used 117, batch_count 15                   
2021-12-17 19:41:08.782 [1356087] main/121/server/127.0.0.1:37948 I> parse_request(), ibuf_used 99, batch_count 16                      
2021-12-17 19:41:08.782 [1356087] main/121/server/127.0.0.1:37948 I> process_request(), ibuf_used 99, batch_count 16  
2021-12-17 19:41:08.782 [1356087] main/121/server/127.0.0.1:37948 I> parse_request(), ibuf_used 81, batch_count 17
2021-12-17 19:41:08.782 [1356087] main/121/server/127.0.0.1:37948 I> process_request(), ibuf_used 81, batch_count 17
2021-12-17 19:41:08.782 [1356087] main/121/server/127.0.0.1:37948 I> parse_request(), ibuf_used 63, batch_count 18
2021-12-17 19:41:08.782 [1356087] main/121/server/127.0.0.1:37948 I> process_request(), ibuf_used 63, batch_count 18
2021-12-17 19:41:08.782 [1356087] main/121/server/127.0.0.1:37948 I> parse_request(), ibuf_used 45, batch_count 19
2021-12-17 19:41:08.782 [1356087] main/121/server/127.0.0.1:37948 I> process_request(), ibuf_used 45, batch_count 19
2021-12-17 19:41:08.783 [1356087] main/121/server/127.0.0.1:37948 I> parse_request(), ibuf_used 27, batch_count 20
2021-12-17 19:41:08.783 [1356087] main/121/server/127.0.0.1:37948 I> process_request(), ibuf_used 27, batch_count 20
2021-12-17 19:41:08.783 [1356087] main/121/server/127.0.0.1:37948 I> memcached_flush() (write back answer), ibuf_used 9
2021-12-17 19:41:08.783 [1356087] main/121/server/127.0.0.1:37948 I> memcached_loop_read(), ibuf_used 9, batch_count 0
2021-12-17 19:43:28.736 [1356087] main/121/server/127.0.0.1:37948 I> memcached_flush(), ibuf_used 9

In Tarantool's log we see that memcached processed all set commands; when the batch limit is reached it executes memcached_flush() and, with the get command still in the buffer, goes back to a network read. NOTE: memcached_flush() usually sends answers to the client. In our case it sends the confirmations (the STORED lines printed on the console); in the original test we use setq, which doesn't require answers, but setq is implemented only in the binary protocol and I cannot use it to demonstrate the bug with nc.
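setq cannot be demonstrated with nc, but the same stall can be shown with a hypothetical standalone Python sketch that crafts binary-protocol packets by hand. It assumes the memcached instance from repro.lua above is listening on 127.0.0.1:8080 and uses the standard memcached binary-protocol layout (24-byte request header; SETQ = 0x11, GET = 0x00); it is an illustration, not part of the test suite.

import socket
import struct

def bin_request(opcode, key=b"", value=b"", extras=b""):
    """Build a memcached binary-protocol request: 24-byte header + body."""
    body = extras + key + value
    header = struct.pack("!BBHBBHIIQ",
                         0x80,        # magic: request
                         opcode,
                         len(key),
                         len(extras),
                         0,           # data type
                         0,           # vbucket id
                         len(body),   # total body length
                         0,           # opaque
                         0)           # cas
    return header + body

SETQ, GET = 0x11, 0x00

payload = b""
for i in range(21):                              # batch_count + 1 quiet stores
    extras = struct.pack("!II", 82, 0)           # flags = 82, expire = 0
    payload += bin_request(SETQ, b"key_%d" % i, b"x", extras)
payload += bin_request(GET, b"key_0")            # retrieval request at the end

with socket.create_connection(("127.0.0.1", 8080)) as sock:
    sock.sendall(payload)            # all 22 requests land in memcached's ibuf at once
    sock.settimeout(2.0)
    try:
        print(sock.recv(4096))
    except socket.timeout:
        print("no reply to GET: the request is left in the input buffer")

The SETQ requests produce no responses, so if the whole buffer were drained the only reply would be the GET response; instead recv() times out because the GET stays buffered behind the batch limit.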

According to the description of the memcached protocol, there are several retrieval commands, all of them variations of the get command: "get", "gets", "gat", and "gats". I have successfully reproduced the issue with gat instead of get.

@ligurio
Member

ligurio commented Dec 18, 2021

State transitions in the processing loop:
[states.png: state-transition diagram of memcached_loop(), rendered from the GraphViz source below]

GraphViz text (dot -Tpng -o states.png states.gv.txt)
digraph {
    START [peripheries=2]
    memcached_loop_read
    parse_request
    process_request
    close_connection
 
    START -> memcached_loop_read
    memcached_loop_read -> close_connection [label="err"]
    memcached_loop_read -> parse_request [label="no err"]
    parse_request -> close_connection [label="rc == -1, close_connection == true"]
    parse_request -> memcached_loop_read [label="rc == -1, close_connection == false"]
    parse_request -> memcached_loop_read [label="rc > 0"]
    parse_request -> process_request [label="rc == 0, noprocess == false"]
    parse_request -> close_connection [label="rc == 0, close_connection == true"]
    parse_request -> parse_request [label="rc == 0, ibuf_used > 0, batch_count < 20"]
    parse_request -> memcached_flush
    memcached_flush -> memcached_loop_read
}

ligurio added a commit that referenced this issue Dec 18, 2021
Summary: memcached gets stuck processing a request when the number of
processed commands reaches (batch_count + 1) and the buffer holds a
request with a retrieval command ("get", "gets", "gat", or "gats").

For each incoming connection memcached starts an endless loop in the
function memcached_loop(), where it reads network data into a buffer
(memcached_loop_read), parses commands from the buffer one by one
(parse_request), processes commands (process_request), and writes
answers to the network socket (memcached_flush).

Imagine we have a sequence of (batch_count + 1) SETQ commands (SETQ is
the same as a SET command, but it does not require a confirmation) and
a GET command that should return a value set by one of the previous
SETQ commands. Processing of the SETQ commands finishes successfully.
After processing the latest SETQ request we have a GET command in the
buffer and batch_count has reached 20. In that case control flow
returns to the start of the read-parse-process-write loop, where we
get stuck in memcached_loop_read() and wait for the next requests
instead of replying to the client with the value of the requested key.

    rc = con->cb.process_request(con);
    ...
    if (rc == -1)
            memcached_loop_error(con);
    if (con->close_connection) {
            break;
    } else if (rc == 0 && ibuf_used(con->in) > 0 &&
               batch_count < con->cfg->batch_count) {
            goto next;
    }
    ...
    continue;

Fixes #73
ligurio added a commit that referenced this issue Dec 20, 2021
ligurio added a commit that referenced this issue Dec 20, 2021
ligurio added a commit that referenced this issue Dec 27, 2021
Summary: memcached gets stuck processing a request when the number of
processed commands reaches (batch_count + 1) and the buffer holds a
request with a retrieval command ("get", "gets", "gat", or "gats").

For each incoming connection memcached starts an endless loop in the
function memcached_loop(), where it reads network data into a buffer
(memcached_loop_read), parses commands from the buffer one by one
(parse_request), processes commands (process_request), and writes
answers to the network socket (memcached_flush).

Imagine we have a sequence of con->cfg->batch_count + 1 requests
(batch_count is a constant equal to 20) with SETQ commands (SETQ is the
same as a SET command, but it does not require a confirmation) and a
single GET command that should return a value set by one of the
previous SETQ commands. Processing of the SETQ commands finishes
successfully. After processing the latest SETQ request we have a GET
command in the buffer and the batch_count counter has become equal
to 20 (its initial value is 0). In that case control flow returns to
the start of the read-parse-process-write loop, where we get stuck in
memcached_loop_read() and wait for the next requests instead of
replying to the client with the value of the requested key.

    rc = con->cb.process_request(con);
    ...
    if (rc == -1)
            memcached_loop_error(con);
    if (con->close_connection) {
            break;
    } else if (rc == 0 && ibuf_used(con->in) > 0 &&
               batch_count < con->cfg->batch_count) {
            goto next;
    }
    ...
    continue;

Fixes #73
ligurio added a commit that referenced this issue Dec 27, 2021
ligurio added a commit that referenced this issue Dec 28, 2021