test: binary/binary-boundary.test.py test gets stuck from time to time #73

Closed
ylobankov opened this issue Nov 16, 2021 · 8 comments · Fixed by #79 or #94
Assignees
Labels
bug Something isn't working

Comments

@ylobankov
Member

ylobankov commented Nov 16, 2021

From time to time the binary/binary-boundary.test.py test gets stuck:

$ make test-memcached
[  3%] Built target generate_proto_txt_parser_c
[ 11%] Built target msgpuck
[ 55%] Built target small
[100%] Built target internalso
Started ./test-run.py --force
Collecting tests in 'binary': memcached binary tests.
Found 4 tests.
Collecting tests in 'capable': memcapable integration.
Found 2 tests.
Collecting tests in 'sasl': memcached binary sasl tests.
Found 1 tests.
Collecting tests in 'text': memcached text tests.
Found 10 tests.
Installing the server ...
    Found executable at /usr/bin/tarantool
    Found tarantoolctl at  /usr/bin/tarantoolctl
    Creating and populating working directory in /home/ubuntu/actions-runner/_work/tarantool/tarantool/test/var ...
    Found old vardir, deleting ...
Starting the server ...
Starting /home/ubuntu/actions-runner/_work/tarantool/tarantool/test/var/binary.lua
Tarantool 2.10.0-beta1-225-g816e1dbd1
Target: Linux-x86_64-RelWithDebInfo
Build options: cmake . -DCMAKE_INSTALL_PREFIX=/usr -DENABLE_BACKTRACE=ON
Compiler: /usr/bin/cc /usr/lib/ccache/g++
C_FLAGS:-g -O2 -fdebug-prefix-map=/build/tarantool-2.10.0~beta1.225.dev=. -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -fexceptions -funwind-tables -fno-omit-frame-pointer -fno-stack-protector -fno-common -fopenmp -msse2 -std=c11 -Wall -Wextra -Wno-strict-aliasing -Wno-char-subscripts -Wno-format-truncation -Wno-gnu-alignof-expression -fno-gnu89-inline -Wno-cast-function-type
CXX_FLAGS:-g -O2 -fdebug-prefix-map=/build/tarantool-2.10.0~beta1.225.dev=. -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -fexceptions -funwind-tables -fno-omit-frame-pointer -fno-stack-protector -fno-common -fopenmp -msse2 -std=c++11 -Wall -Wextra -Wno-strict-aliasing -Wno-char-subscripts -Wno-format-truncation -Wno-invalid-offsetof -Wno-gnu-alignof-expression -Wno-cast-function-type
================================================================================
TEST                                            PARAMS		RESULT
---------------------------------------------------------------------------
binary/binary-boundary.test.py

It looks like we hang at some iteration, because the test/var/binary-boundary.result file ends as follows:

#---------------------# test protocol boundary overruns #---------------------#
iteration 1900
iteration 1901
iteration 1902
iteration 1903
iteration 1904
iteration 1905
iteration 1906
iteration 1907
iteration 1908
iteration 1909
iteration 1910
iteration 1911
iteration 1912
iteration 1913
iteration 1914
iteration 1915
iteration 1916
iteration 1917
iteration 1918
iteration 1919
iteration 1920
iteration 1921
iteration 1922
iteration 1923
iteration 1924
iteration 1925
iteration 1926
iteration 1927
iteration 1928
iteration 1929
iteration 1930
iteration 1931
iteration 1932
iteration 1933
iteration 1934
iteration 1935
iteration 1936
iteration 1937
iteration 1938
iteration 1939
iteration 1940
iteration 1941
iteration 1942
iteration 1943
iteration 1944
iteration 1945
iteration 1946
iteration 1947
iteration 1948
iteration 1949
iteration 1950
iteration 1951
iteration 1952
iteration 1953
iteration 1954

Unfortunately, I don't have exact repro steps to make this situation happen. But running the tests in a loop reproduces this issue with high probability:

for i in {1..50}; do make test-memcached; done

I did some debugging and found that we hang in this loop, or, to be more precise, here. If we dig deeper, it turns out that we actually get stuck here.

Please note that #51 may happen because of the same problem.

@ligurio
Member

ligurio commented Nov 18, 2021

But running the tests in a loop reproduces this issue with high probability:

How many times out of 50?

@ylobankov
Member Author

But running the tests in a loop reproduces this issue with high probability:

How many times out of 50?

Sorry, I didn't take such measurements.

@ligurio
Member

ligurio commented Nov 30, 2021

I need more details about the environment. With a script that runs the test in a loop (for i in {1..50}; do make test-memcached; done), all 50 iterations passed without failures.

sergeyb@pony:~/sources/MRG/memcached$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 20.04.3 LTS
Release:        20.04
Codename:       focal
$ tarantool --version
Tarantool 2.10.0-beta1-0-g7da4b1438
Target: Linux-x86_64-RelWithDebInfo
Build options: cmake . -DCMAKE_INSTALL_PREFIX=/usr -DENABLE_BACKTRACE=ON
Compiler: /usr/bin/cc /usr/lib/ccache/g++
C_FLAGS:-g -O2 -fdebug-prefix-map=/build/tarantool-2.10.0~beta1=. -specs=/usr/share/dpkg/no-pie-compile.specs -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -fexceptions -funwind-tables -fno-omit-frame-pointer -fno-stack-protector -fno-common -fopenmp -msse2 -std=c11 -Wall -Wextra -Wno-strict-aliasing -Wno-char-subscripts -Wno-format-truncation -Wno-gnu-alignof-expression -fno-gnu89-inline -Wno-cast-function-type
CXX_FLAGS:-g -O2 -fdebug-prefix-map=/build/tarantool-2.10.0~beta1=. -specs=/usr/share/dpkg/no-pie-compile.specs -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -fexceptions -funwind-tables -fno-omit-frame-pointer -fno-stack-protector -fno-common -fopenmp -msse2 -std=c++11 -Wall -Wextra -Wno-strict-aliasing -Wno-char-subscripts -Wno-format-truncation -Wno-invalid-offsetof -Wno-gnu-alignof-expression -Wno-cast-function-type
sergeyb@pony:~/sources/MRG/memcached$ 

tarantool-dev 2.10.0~beta1-1 amd64
libevent-dev 2.1.11-stable-1
libsasl2-dev 2.1.27+dfsg-2 amd64

@ylobankov, please assist with a repro.

ligurio added a commit that referenced this issue Dec 2, 2021
The test is unstable, and to set up a simple CI (#75) we need to disable it.
A fix is planned in #73.
ligurio added a commit that referenced this issue Dec 2, 2021
ligurio added a commit that referenced this issue Dec 2, 2021
@ylobankov
Member Author

I need more details about the environment. With a script that runs the test in a loop (for i in {1..50}; do make test-memcached; done), all 50 iterations passed without failures.

@ylobankov, please assist with a repro.

I reproduced this bug on Ubuntu 20.04 quite often. If you still have issues reproducing it, ping me and I will provide the server.

@ligurio
Member

ligurio commented Dec 13, 2021

How to reproduce the problem:

  1. Apply the patch below to the test:
--- test/binary/binary-boundary.test.py	2021-12-02 19:09:10.852429691 +0300
+++ test/binary/binary-boundary.py	2021-12-11 10:45:11.327712154 +0300
@@ -4,19 +4,19 @@
 import inspect
 import traceback
 
-saved_path = sys.path[:]
-sys.path.append(os.path.dirname(os.path.abspath(inspect.getsourcefile(lambda:0))))
+root = os.getcwd()
+sys.path.append(os.path.join(root, 'test-run'))
 
 from internal.memcached_connection import MemcachedBinaryConnection
 from internal.memcached_connection import STATUS, COMMANDS
 
-mc = MemcachedBinaryConnection("127.0.0.1", iproto.py_con.port)
+mc = MemcachedBinaryConnection("127.0.0.1", 8080)
 
 def iequal(left, right, level = 1):
     if (left != right):
         tb = traceback.extract_stack()[-(level + 1)]
-        print "Error on line %s:%d: %s not equal %s" % (tb[0], tb[1],
-                repr(left), repr(right))
+        print("Error on line %s:%d: %s not equal %s" % (tb[0], tb[1],
+                repr(left), repr(right)))
         if (isinstance(left, basestring)):
             if (len(left) != len(right)):
                 print("length is different")
@@ -28,28 +28,29 @@
             iequal(res.get('val', val), val, level + 1)
 
 def check(key, flags, val, level = 0):
+    print("Get", key)
     res = mc.get(key)
-    __check(res[0], flags, val, level + 1)
+    #__check(res[0], flags, val, level + 1)
 
 print("""#---------------------# test protocol boundary overruns #---------------------#""")
 
-for i in range(1900, 2100):
+for i in range(1950, 2000):
     print ("iteration %d" % i)
     key  = "test_key_%d" % i
     val  = "x" * i
     mc.setq(key, val, flags=82, nosend=True)
     mc.setq("alt_%s" % key, "blah", flags=82, nosend=True)
-    data = "".join(mc.commands)
+    data = b"".join(mc.commands)
     mc.commands = []
 
     if (len(data) > 2024):
         for j in range(2024, min(2096, len(data))):
+            print("send", data[:j])
             mc.socket.sendall(data[:j])
             time.sleep(0.00001)
+            print("send", data[j:])
             mc.socket.sendall(data[j:])
     else:
         mc.socket.sendall(data)
     check(key, 82, val)
-    check("alt_%s" % key, 82, "blah")
-
-sys.path = saved_path
+    #check("alt_%s" % key, 82, "blah")
  2. Create a Tarantool instance file repro.lua with the content below:
box.cfg{
    log_level = 7,
    feedback_enabled = false,
    --log = 'tarantool.log',
    --background = true,
    --pid_file = '/home/sergeyb/sources/MRG/memcached/pid',
}
mc = require('memcached')
m = mc.create('instance_1', '8080', {
    expire_enabled = false,
    protocol = 'binary',
    verbosity = 3
})

require('console'):start()
  3. Start Tarantool: tarantool repro.lua
  4. Run the test: python test/binary/binary-boundary.py
  5. The test gets stuck with the message: ('Get', 'test_key_1952')

What's going on

  • the test sends two commands at once (setq and setq):
    mc.setq(key, val, flags=82, nosend=True)                                     
    mc.setq("alt_%s" % key, "blah", flags=82, nosend=True)                       
    data = b"".join(mc.commands)                                                 
    mc.commands = []                                                             
                                                                                 
    if (len(data) > 2024):                                                       
        for j in range(2024, min(2096, len(data))):                              
            print("send", data[:j])                                              
            mc.socket.sendall(data[:j])                                          
            time.sleep(0.00001)                                                  
            print("send", data[j:])                                              
            mc.socket.sendall(data[j:])  
  • memcached successfully processed them (check it with the command box.space.__mc_instance_1:get('test_key_1952') in Tarantool's console) and reads a request with GET for key test_key_1952:
2021-12-13 14:55:08.799 [568488] main/113/server/127.0.0.1:36442 D> tuple_delete(0x7f4a0e00f704)                                                              
2021-12-13 14:55:08.799 [568488] main/113/server/127.0.0.1:36442 D> memtx_tuple_delete(0x7f4a0e00f704)                                                        
2021-12-13 14:55:08.799 [568488] main/113/server/127.0.0.1:36442 D> memcached_loop:205                                                                        
2021-12-13 14:55:08.799 [568488] main/113/server/127.0.0.1:36442 D> memcached_loop:210                                                                        
2021-12-13 14:55:08.799 [568488] main/113/server/127.0.0.1:36442 D> ibuf_used 2087, rc 0, batch_count 19                                                      
2021-12-13 14:55:08.799 [568488] main/113/server/127.0.0.1:36442 D> memcached_loop:222                                                                        
2021-12-13 14:55:08.799 [568488] main/113/server/127.0.0.1:36442 D> memcached_loop:176 parse_request                                                          
2021-12-13 14:55:08.799 [568488] main/113/server/127.0.0.1:36442 D> memcached_loop:202                                                                        
2021-12-13 14:55:08.799 [568488] main/113/server/127.0.0.1:36442 D> memcached_bin_process
2021-12-13 14:55:08.799 [568488] main/113/server/127.0.0.1:36442 D> memcached_bin_process_set
2021-12-13 14:55:08.799 [568488] main/113/server/127.0.0.1:36442 D> SETQ 'test_key_1952' '<DATA>'
375731712, expire - 0                                                                                                                                         
2021-12-13 14:55:08.799 [568488] main/113/server/127.0.0.1:36442 D> opaque - 7, cas - 0
2021-12-13 14:55:08.799 [568488] main/113/server/127.0.0.1:36442 D> memtx_tuple_new(1992) = 0x7f4a0e0c2428
2021-12-13 14:55:08.799 [568488] main D> cpipe_flush_cb: locking &endpoint->mutex
2021-12-13 14:55:08.799 [568488] main D> cpipe_flush_cb: unlocking &endpoint->mutex
2021-12-13 14:55:08.799 [568488] wal/101/main D> cbus_endpoint_fetch: locking &endpoint->mutex
2021-12-13 14:55:08.799 [568488] wal/101/main D> cbus_endpoint_fetch: unlocking &endpoint->mutex
2021-12-13 14:55:08.799 [568488] wal D> cpipe_flush_cb: locking &endpoint->mutex
2021-12-13 14:55:08.799 [568488] wal D> cpipe_flush_cb: unlocking &endpoint->mutex
2021-12-13 14:55:08.799 [568488] main D> cbus_endpoint_fetch: locking &endpoint->mutex
2021-12-13 14:55:08.799 [568488] main D> cbus_endpoint_fetch: unlocking &endpoint->mutex
2021-12-13 14:55:08.799 [568488] main/113/server/127.0.0.1:36442 D> tuple_delete(0x7f4a0e0c141c)
2021-12-13 14:55:08.799 [568488] main/113/server/127.0.0.1:36442 D>memtx_tuple_delete(0x7f4a0e0c141c)
2021-12-13 14:55:08.799 [568488] main/113/server/127.0.0.1:36442 D> memtx_tuple_delete(0x7f4a0e0c141c)
2021-12-13 14:55:08.799 [568488] main/113/server/127.0.0.1:36442 D> memcached_loop:205
2021-12-13 14:55:08.799 [568488] main/113/server/127.0.0.1:36442 D> memcached_loop:210
2021-12-13 14:55:08.799 [568488] main/113/server/127.0.0.1:36442 D> ibuf_used 90, rc 0, batch_count 20
2021-12-13 14:55:08.799 [568488] main/113/server/127.0.0.1:36442 D> ibuf_used 90
2021-12-13 14:55:08.799 [568488] main/113/server/127.0.0.1:36442 D> memcached_loop: 230 memcached_flush
2021-12-13 14:55:08.799 [568488] main/113/server/127.0.0.1:36442 D> memcached_loop:232
2021-12-13 14:55:08.799 [568488] main/113/server/127.0.0.1:36442 D> memcached_loop_read to_read 1
2021-12-13 14:55:08.799 [568488] main/113/server/127.0.0.1:36442 D> mnet_read_ahead() start
2021-12-13 14:55:08.799 [568488] main/113/server/127.0.0.1:36442 D> mnet_read_ahead: nrd -1
2021-12-13 14:55:08.799 [568488] main/113/server/127.0.0.1:36442 D> mnet_read_ahead: coio_wait begin
  • batch_count is set to 20 by default, and once the condition below no longer holds we fall out of the if..then..else and go back to the next memcached_loop_read() at the beginning of the loop, see internal/memcached.c (a small Python model of this logic follows the snippet below):
		} else if (rc == 0 && ibuf_used(con->in) > 0 &&
			   batch_count < con->cfg->batch_count) {
			batch_count++;
			goto next;
		}
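To make the batching behavior easier to follow, here is a minimal Python model of the logic in the snippet above (purely illustrative, not the real C code): buffered requests are drained only while batch_count stays below the limit; once the limit is hit, the loop flushes and goes back to a blocking read, leaving the rest of the buffer unprocessed.

# Toy model of memcached_loop()'s batching; BATCH_LIMIT mirrors
# con->cfg->batch_count (20 by default). Purely illustrative.
BATCH_LIMIT = 20

def drain(buffered_requests):
    """Process buffered requests until the batch limit is reached."""
    batch_count = 0
    processed = []
    i = 0
    while i < len(buffered_requests):
        processed.append(buffered_requests[i])  # parse_request + process_request
        i += 1
        if i < len(buffered_requests) and batch_count < BATCH_LIMIT:
            batch_count += 1                    # "goto next": keep draining the buffer
            continue
        break                                   # fall out: flush, go back to the read
    return processed, buffered_requests[i:]

done, leftover = drain(["setq"] * 21 + ["get"])  # 21 quiet sets and one get
print(len(done), "requests processed, leftover:", leftover)
# -> 21 requests processed, leftover: ['get']  (the get waits for the next read)

With only 20 queued commands the whole buffer is drained and the get is answered, which is why 21 commands, not 20, are needed to hit the stall.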

ligurio added a commit that referenced this issue Dec 15, 2021
The test is unstable, and to set up a simple CI (#75) we need to disable it.
A fix is planned in #73.
ligurio added a commit that referenced this issue Dec 16, 2021
ligurio added a commit that referenced this issue Dec 16, 2021
@ligurio ligurio reopened this Dec 16, 2021
@ligurio
Member

ligurio commented Dec 16, 2021

Closed accidentally.

@ligurio
Member

ligurio commented Dec 17, 2021

memcached runs an endless read-parse-process-flush loop for each network connection.
The patch below adds debug printing that helps to understand which operation is executing in the loop, the current size of the ibuf with raw data, and the current value of batch_count:

diff --git a/memcached/internal/memcached.c b/memcached/internal/memcached.c
index 3f760c7..33b667b 100644
--- a/memcached/internal/memcached.c
+++ b/memcached/internal/memcached.c
@@ -154,6 +154,7 @@ memcached_loop(struct memcached_connection *con)
 	int batch_count = 0;
 
 	for (;;) {
+		say_info("%s, ibuf_used %d, batch_count %d", __func__, ibuf_used(con->in), batch_count);
 		rc = memcached_loop_read(con, to_read);
 		if (rc == -1) {
 			/**
@@ -166,6 +167,7 @@ memcached_loop(struct memcached_connection *con)
 next:
 		con->noreply = false;
 		con->noprocess = false;
+		say_info("%s, ibuf_used %d, batch_count %d", __func__, ibuf_used(con->in), batch_count);
 		rc = con->cb.parse_request(con);
 		if (rc == -1) {
 			memcached_loop_error(con);
@@ -177,6 +179,7 @@ next:
 			} else {
 				memcached_skip_request(con);
 			}
+		        say_info("%s, ibuf_used %d, batch_count %d", __func__, ibuf_used(con->in), batch_count);
 			memcached_flush(con);
 			batch_count = 0;
 			continue;
@@ -188,6 +191,7 @@ next:
 		assert(!con->close_connection);
 		rc = 0;
 		if (!con->noprocess) {
+		        say_info("%s, ibuf_used %d, batch_count %d", __func__, ibuf_used(con->in), batch_count);
 			rc = con->cb.process_request(con);
 			memcached_connection_gc(con);
 		}
@@ -205,11 +209,13 @@ next:
 		}
 		/* Write back answer */
 		if (!con->noreply)
+		        say_info("%s, ibuf_used %d", __func__, ibuf_used(con->in));
 			memcached_flush(con);
 		fiber_reschedule();
 		batch_count = 0;
 		continue;
 	}
+	say_info("%s, ibuf_used %d", __func__, ibuf_used(con->in));
 	memcached_flush(con);
 }

One can apply the patch and start Tarantool with memcached listening on port 8080 and accepting connections using the memcached text protocol:

tarantool -l 'memcached' -e "box.cfg{} memcached.create('instance_1', '8080', { protocol = 'text', verbosity = 3 }) require('console'):start()"

We can talk to memcached with text commands. For example, with the command below we set key tarantool to value database and read it back with the get command:

$ printf "set tarantool 0 0 8\r\ndatabase\r\nget tarantool\r\n" | nc 127.0.0.1 8080
STORED
VALUE tarantool 0 8
database
END

Tarantool prints the debug output:

tarantool> 2021-12-17 19:06:45.961 [1350583] main/121/server/127.0.0.1:37730 I> memcached_flush(), ibuf_used 0
2021-12-17 19:06:47.545 [1350583] main/123/server/127.0.0.1:37756 I> memcached_loop_read(), ibuf_used 0, batch_count 0
2021-12-17 19:06:47.545 [1350583] main/123/server/127.0.0.1:37756 I> parse_request(), ibuf_used 46, batch_count 0
2021-12-17 19:06:47.545 [1350583] main/123/server/127.0.0.1:37756 I> process_request(), ibuf_used 46, batch_count 0
2021-12-17 19:06:47.546 [1350583] main/123/server/127.0.0.1:37756 I> parse_request(), ibuf_used 15, batch_count 1
2021-12-17 19:06:47.546 [1350583] main/123/server/127.0.0.1:37756 I> process_request(), ibuf_used 15, batch_count 1
2021-12-17 19:06:47.546 [1350583] main/123/server/127.0.0.1:37756 I> memcached_flush(), ibuf_used 0
2021-12-17 19:06:47.546 [1350583] main/123/server/127.0.0.1:37756 I> memcached_loop_read(), ibuf_used 0, batch_count 0

The bug with the stalled get command can be reproduced with a sequence of 21 set commands (the default batch_count is 20, but due to an off-by-one error we need to send 21 instead of 20) and a final get:

printf "set foo 0 0 1\r\n1\r\nset foo 0 0 1\r\n1\r\nset foo 0 0 1\r\n1\r\nset foo 0 0 1\r\n1\r\nset foo 0 0 1\r\n1\r\nset foo 0 0 1\r\n1\r\nset foo 0 0 1\r\n1\r\nset foo 0 0 1\r\n1\r\nset foo 0 0 1\r\n1\r\nset foo 0 0 1\r\n1\r\nset foo 0 0 1\r\n1\r\nset foo 0 0 1\r\n1\r\nset foo 0 0 1\r\n1\r\nset foo 0 0 1\r\n1\r\nset foo 0 0 1\r\n1\r\nset foo 0 0 1\r\n1\r\nset foo 0 0 1\r\n1\r\nset foo 0 0 1\r\n1\r\nset foo 0 0 1\r\n1\r\nset foo 0 0 1\r\n1\r\nset foo 0 0 1\r\n1\r\nget foo\r\n" | nc 127.0.0.1 8080
<'STORED' repeats 20 times>                                                                                                                                  
STORED                                                                                                                                  
^C

In Tarantool session:

2021-12-17 19:41:08.778 [1356087] main/121/server/127.0.0.1:37948 I> memcached_loop_read(), ibuf_used 0, batch_count 0                  
2021-12-17 19:41:08.778 [1356087] main/121/server/127.0.0.1:37948 I> parse_request(), ibuf_used 387, batch_count 0       
2021-12-17 19:41:08.778 [1356087] main/121/server/127.0.0.1:37948 I> process_request(), ibuf_used 387, batch_count 0                    
2021-12-17 19:41:08.779 [1356087] main/121/server/127.0.0.1:37948 I> parse_request(), ibuf_used 369, batch_count 1                      
2021-12-17 19:41:08.779 [1356087] main/121/server/127.0.0.1:37948 I> process_request(), ibuf_used 369, batch_count 1                    
2021-12-17 19:41:08.779 [1356087] main/121/server/127.0.0.1:37948 I> parse_request(), ibuf_used 351, batch_count 2                      
2021-12-17 19:41:08.779 [1356087] main/121/server/127.0.0.1:37948 I> process_request(), ibuf_used 351, batch_count 2                    
2021-12-17 19:41:08.779 [1356087] main/121/server/127.0.0.1:37948 I> parse_request(), ibuf_used 333, batch_count 3                      
2021-12-17 19:41:08.779 [1356087] main/121/server/127.0.0.1:37948 I> process_request(), ibuf_used 333, batch_count 3                    
2021-12-17 19:41:08.780 [1356087] main/121/server/127.0.0.1:37948 I> parse_request(), ibuf_used 315, batch_count 4                      
2021-12-17 19:41:08.780 [1356087] main/121/server/127.0.0.1:37948 I> process_request(), ibuf_used 315, batch_count 4                    
2021-12-17 19:41:08.780 [1356087] main/121/server/127.0.0.1:37948 I> parse_request(), ibuf_used 297, batch_count 5                      
2021-12-17 19:41:08.780 [1356087] main/121/server/127.0.0.1:37948 I> process_request(), ibuf_used 297, batch_count 5                    
2021-12-17 19:41:08.780 [1356087] main/121/server/127.0.0.1:37948 I> parse_request(), ibuf_used 279, batch_count 6                      
2021-12-17 19:41:08.780 [1356087] main/121/server/127.0.0.1:37948 I> process_request(), ibuf_used 279, batch_count 6                    
2021-12-17 19:41:08.780 [1356087] main/121/server/127.0.0.1:37948 I> parse_request(), ibuf_used 261, batch_count 7                      
2021-12-17 19:41:08.780 [1356087] main/121/server/127.0.0.1:37948 I> process_request(), ibuf_used 261, batch_count 7                    
2021-12-17 19:41:08.780 [1356087] main/121/server/127.0.0.1:37948 I> parse_request(), ibuf_used 243, batch_count 8                      
2021-12-17 19:41:08.780 [1356087] main/121/server/127.0.0.1:37948 I> process_request(), ibuf_used 243, batch_count 8                    
2021-12-17 19:41:08.781 [1356087] main/121/server/127.0.0.1:37948 I> parse_request(), ibuf_used 225, batch_count 9                      
2021-12-17 19:41:08.781 [1356087] main/121/server/127.0.0.1:37948 I> process_request(), ibuf_used 225, batch_count 9                    
2021-12-17 19:41:08.781 [1356087] main/121/server/127.0.0.1:37948 I> parse_request(), ibuf_used 207, batch_count 10                     
2021-12-17 19:41:08.781 [1356087] main/121/server/127.0.0.1:37948 I> process_request(), ibuf_used 207, batch_count 10    
2021-12-17 19:41:08.781 [1356087] main/121/server/127.0.0.1:37948 I> process_request(), ibuf_used 189, batch_count 11                   
2021-12-17 19:41:08.781 [1356087] main/121/server/127.0.0.1:37948 I> parse_request(), ibuf_used 171, batch_count 12                     
2021-12-17 19:41:08.781 [1356087] main/121/server/127.0.0.1:37948 I> process_request(), ibuf_used 171, batch_count 12                   
2021-12-17 19:41:08.781 [1356087] main/121/server/127.0.0.1:37948 I> parse_request(), ibuf_used 153, batch_count 13                     
2021-12-17 19:41:08.781 [1356087] main/121/server/127.0.0.1:37948 I> process_request(), ibuf_used 153, batch_count 13                   
2021-12-17 19:41:08.781 [1356087] main/121/server/127.0.0.1:37948 I> parse_request(), ibuf_used 135, batch_count 14                     
2021-12-17 19:41:08.781 [1356087] main/121/server/127.0.0.1:37948 I> process_request(), ibuf_used 135, batch_count 14                   
2021-12-17 19:41:08.781 [1356087] main/121/server/127.0.0.1:37948 I> parse_request(), ibuf_used 117, batch_count 15                     
2021-12-17 19:41:08.782 [1356087] main/121/server/127.0.0.1:37948 I> process_request(), ibuf_used 117, batch_count 15                   
2021-12-17 19:41:08.782 [1356087] main/121/server/127.0.0.1:37948 I> parse_request(), ibuf_used 99, batch_count 16                      
2021-12-17 19:41:08.782 [1356087] main/121/server/127.0.0.1:37948 I> process_request(), ibuf_used 99, batch_count 16  
2021-12-17 19:41:08.782 [1356087] main/121/server/127.0.0.1:37948 I> parse_request(), ibuf_used 81, batch_count 17
2021-12-17 19:41:08.782 [1356087] main/121/server/127.0.0.1:37948 I> process_request(), ibuf_used 81, batch_count 17
2021-12-17 19:41:08.782 [1356087] main/121/server/127.0.0.1:37948 I> parse_request(), ibuf_used 63, batch_count 18
2021-12-17 19:41:08.782 [1356087] main/121/server/127.0.0.1:37948 I> process_request(), ibuf_used 63, batch_count 18
2021-12-17 19:41:08.782 [1356087] main/121/server/127.0.0.1:37948 I> parse_request(), ibuf_used 45, batch_count 19
2021-12-17 19:41:08.782 [1356087] main/121/server/127.0.0.1:37948 I> process_request(), ibuf_used 45, batch_count 19
2021-12-17 19:41:08.783 [1356087] main/121/server/127.0.0.1:37948 I> parse_request(), ibuf_used 27, batch_count 20
2021-12-17 19:41:08.783 [1356087] main/121/server/127.0.0.1:37948 I> process_request(), ibuf_used 27, batch_count 20
2021-12-17 19:41:08.783 [1356087] main/121/server/127.0.0.1:37948 I> memcached_flush() (write back answer), ibuf_used 9
2021-12-17 19:41:08.783 [1356087] main/121/server/127.0.0.1:37948 I> memcached_loop_read(), ibuf_used 9, batch_count 0
2021-12-17 19:43:28.736 [1356087] main/121/server/127.0.0.1:37948 I> memcached_flush(), ibuf_used 9

In Tarantool's log we see that memcached processed all set commands; when the batch limit is reached it executes memcached_flush() and, with the get command still in the buffer, goes back to a network read. NOTE: memcached_flush() usually sends answers to the client. In our case it sends the confirmations (the STORED lines printed on the console); in the original test we use setq, which doesn't require answers, but setq is implemented only in the binary protocol and I cannot use it to demonstrate the bug with nc.
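setq cannot be demonstrated with nc, but the same stall can be shown with a hypothetical standalone Python sketch that crafts binary-protocol packets by hand. It assumes the memcached instance from repro.lua above is listening on 127.0.0.1:8080 and uses the standard memcached binary-protocol layout (24-byte request header; SETQ = 0x11, GET = 0x00); it is an illustration, not part of the test suite.

import socket
import struct

def bin_request(opcode, key=b"", value=b"", extras=b""):
    """Build a memcached binary-protocol request: 24-byte header + body."""
    body = extras + key + value
    header = struct.pack("!BBHBBHIIQ",
                         0x80,        # magic: request
                         opcode,
                         len(key),
                         len(extras),
                         0,           # data type
                         0,           # vbucket id
                         len(body),   # total body length
                         0,           # opaque
                         0)           # cas
    return header + body

SETQ, GET = 0x11, 0x00

payload = b""
for i in range(21):                              # batch_count + 1 quiet stores
    extras = struct.pack("!II", 82, 0)           # flags = 82, expire = 0
    payload += bin_request(SETQ, b"key_%d" % i, b"x", extras)
payload += bin_request(GET, b"key_0")            # retrieval request at the end

with socket.create_connection(("127.0.0.1", 8080)) as sock:
    sock.sendall(payload)            # all 22 requests land in memcached's ibuf at once
    sock.settimeout(2.0)
    try:
        print(sock.recv(4096))
    except socket.timeout:
        print("no reply to GET: the request is left in the input buffer")

The SETQ requests produce no responses, so if the whole buffer were drained the only reply would be the GET response; instead recv() times out because the GET stays buffered behind the batch limit.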

According to the description of the memcached protocol, there are several retrieval commands, all of them variations of the get command: "get", "gets", "gat", and "gats". I have successfully reproduced the issue with gat instead of get.

@ligurio
Member

ligurio commented Dec 18, 2021

State transitions in the processing loop:
[states.png: state-transition diagram of memcached_loop(), rendered from the GraphViz source below]

GraphViz text (dot -Tpng -o states.png states.gv.txt)
digraph {
    START [peripheries=2]
    memcached_loop_read
    parse_request
    process_request
    close_connection
 
    START -> memcached_loop_read
    memcached_loop_read -> close_connection [label="err"]
    memcached_loop_read -> parse_request [label="no err"]
    parse_request -> close_connection [label="rc == -1, close_connection == true"]
    parse_request -> memcached_loop_read [label="rc == -1, close_connection == false"]
    parse_request -> memcached_loop_read [label="rc > 0"]
    parse_request -> process_request [label="rc == 0, noprocess == false"]
    parse_request -> close_connection [label="rc == 0, close_connection == true"]
    parse_request -> parse_request [label="rc == 0, ibuf_used > 0, batch_count < 20"]
    parse_request -> memcached_flush
    memcached_flush -> memcached_loop_read
}

ligurio added a commit that referenced this issue Dec 18, 2021
Summary: memcached gets stuck processing a request when the number of
processed commands reaches (batch_count + 1) and the buffer holds a
request with a retrieval command ("get", "gets", "gat", or "gats").

For each incoming connection memcached starts an endless loop in the
function memcached_loop(), where it reads network data into a buffer
(memcached_loop_read), parses commands from the buffer one by one
(parse_request), processes commands (process_request), and writes
answers to the network socket (memcached_flush).

Imagine we have a sequence of (batch_count + 1) SETQ commands (SETQ is
the same as a SET command, but it does not require a confirmation) and
a GET command that should return a value set by one of the previous
SETQ commands. Processing of the SETQ commands finishes successfully.
After processing the latest SETQ request we have a GET command in the
buffer and batch_count has reached 20. In that case control flow
returns to the start of the read-parse-process-write loop, where we
get stuck in memcached_loop_read() and wait for the next requests
instead of replying to the client with the value of the requested key.

    rc = con->cb.process_request(con);
    ...
    if (rc == -1)
            memcached_loop_error(con);
    if (con->close_connection) {
            break;
    } else if (rc == 0 && ibuf_used(con->in) > 0 &&
               batch_count < con->cfg->batch_count) {
            goto next;
    }
    ...
    continue;

Fixes #73
ligurio added a commit that referenced this issue Dec 20, 2021
ligurio added a commit that referenced this issue Dec 20, 2021
ligurio added a commit that referenced this issue Dec 27, 2021
Summary: memcached gets stuck processing a request when the number of
processed commands reaches (batch_count + 1) and the buffer holds a
request with a retrieval command ("get", "gets", "gat", or "gats").

For each incoming connection memcached starts an endless loop in the
function memcached_loop(), where it reads network data into a buffer
(memcached_loop_read), parses commands from the buffer one by one
(parse_request), processes commands (process_request), and writes
answers to the network socket (memcached_flush).

Imagine we have a sequence of con->cfg->batch_count + 1 requests
(batch_count is a constant equal to 20) with SETQ commands (SETQ is the
same as a SET command, but it does not require a confirmation) and a
single GET command that should return a value set by one of the
previous SETQ commands. Processing of the SETQ commands finishes
successfully. After processing the latest SETQ request we have a GET
command in the buffer and the batch_count counter has become equal
to 20 (its initial value is 0). In that case control flow returns to
the start of the read-parse-process-write loop, where we get stuck in
memcached_loop_read() and wait for the next requests instead of
replying to the client with the value of the requested key.

    rc = con->cb.process_request(con);
    ...
    if (rc == -1)
            memcached_loop_error(con);
    if (con->close_connection) {
            break;
    } else if (rc == 0 && ibuf_used(con->in) > 0 &&
               batch_count < con->cfg->batch_count) {
            goto next;
    }
    ...
    continue;

Fixes #73
ligurio added a commit that referenced this issue Dec 27, 2021
ligurio added a commit that referenced this issue Dec 28, 2021