-
It would be good if box had a method like box.save_info() (or something similar) that saves all the information from the different places into a single plaintext/YAML/JSON file, plus a tool that can analyze that data and give recommendations. For example: too many connections, tune sysctl; tuples too big, increase memtx_min_tuple_size; and so on.
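The analysis side of this idea could be sketched as a small rule table applied to a saved dump. Everything here is an assumption for illustration: the field names (`connections`, `avg_tuple_bytes`), the thresholds, and the shape of the dump are hypothetical, since box.save_info() does not exist yet.

```lua
-- Hypothetical rules: field names and limits are illustrative only,
-- they do not correspond to real Tarantool counters.
local rules = {
    {
        field = 'connections',
        limit = 10000,
        advice = 'too many connections: tune sysctl',
    },
    {
        field = 'avg_tuple_bytes',
        limit = 1024,
        advice = 'tuples too big: increase memtx_min_tuple_size',
    },
}

-- Return the advice string of every rule whose field exceeds its limit.
-- Fields missing from the dump are skipped.
local function analyze(dump)
    local advice = {}
    for _, rule in ipairs(rules) do
        local value = dump[rule.field]
        if value ~= nil and value > rule.limit then
            table.insert(advice, rule.advice)
        end
    end
    return advice
end

-- Print the recommendations for a made-up dump.
for _, line in ipairs(analyze({connections = 20000, avg_tuple_bytes = 100})) do
    print(line)
end
```

Keeping the rules as data rather than code would let the tool ship new recommendations without changing the analyzer itself.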
-
Also
-
I want to be able to see all of these counters directly from the iproto thread, without going through tx.
-
It would be nice to be able to build histograms for some types of statistics, for example, for the maximum length of the stream queue.
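A minimal sketch of what such a histogram could look like, assuming fixed bucket bounds chosen by the operator. The names `new_histogram` and `observe` and the bounds are hypothetical; nothing here is an existing Tarantool API.

```lua
-- Create a histogram with the given ascending upper bounds plus one
-- overflow bucket for values above the last bound.
local function new_histogram(bounds)
    local h = {bounds = bounds, counts = {}, max = 0}
    for i = 1, #bounds + 1 do
        h.counts[i] = 0
    end
    return h
end

-- Count `value` in the first bucket whose upper bound is >= value;
-- values above every bound land in the overflow bucket.
local function observe(h, value)
    local idx = #h.bounds + 1
    for i, bound in ipairs(h.bounds) do
        if value <= bound then
            idx = i
            break
        end
    end
    h.counts[idx] = h.counts[idx] + 1
    if value > h.max then
        h.max = value
    end
end

-- Feed a few made-up stream queue lengths.
local queue_len = new_histogram({1, 10, 100, 1000})
for _, len in ipairs({0, 3, 7, 250, 5000}) do
    observe(queue_len, len)
end
```

Tracking `max` alongside the buckets keeps the single "maximum length" number available while the bucket counts show how often the queue actually gets long.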
-
Please add coio thread pool monitoring.
-
The Things I Want To Feel
Access to srv.info from anywhere (not only tx thread)
There are cases when I (as a chief incident officer) want to know about Tarantool:
Or, in other words:
The Cases
There are incidents where the user does not understand what is wrong. They come from different sides of Tarantool. I want to describe each case in a separate section. I want us to be able to understand every case using srv.info, both for a crashed and for a running Tarantool.
-
Userstory: The Readaheache Case
The user wants to suppress "readahead limit reached" messages and in this way increase performance.

    -- Server: raise the open-file limit, then listen with a 20 MB readahead.
    local ffi = require("ffi")
    local errno = require('errno')
    ffi.cdef[[
    typedef uint64_t rlim_t;
    typedef struct rlimit {
        rlim_t rlim_cur; /* Soft limit */
        rlim_t rlim_max; /* Hard limit (ceiling for rlim_cur) */
    } rlimit;
    int getrlimit(int resource, struct rlimit *rlim);
    int setrlimit(int resource, const struct rlimit *rlim);
    ]]
    local RLIMIT_CORE = 4
    local RLIMIT_NOFILE
    if ffi.os == 'Linux' then
        RLIMIT_NOFILE = 7
    elseif ffi.os == 'OSX' then
        RLIMIT_NOFILE = 8
    end
    local RLIM_INFINITY = bit.lshift(1ULL, 63) - 1
    local OPEN_MAX = 10240
    -- Returns the applied limits, or nil and an error message on failure.
    local function setrlimit(resource, limit)
        local rlimit = ffi.new('rlimit')
        rlimit.rlim_cur = limit.rlim_cur
        rlimit.rlim_max = limit.rlim_max
        local rc = ffi.C.setrlimit(resource, rlimit)
        if rc ~= 0 then
            return nil, errno.strerror()
        end
        return {rlim_cur = rlimit.rlim_cur,
                rlim_max = rlimit.rlim_max}
    end
    setrlimit(RLIMIT_NOFILE, {rlim_cur=10240, rlim_max=10240})
    box.cfg{listen=3301, readahead=1024*1024*20}
    box.schema.user.grant('guest', 'super', nil, nil, {if_not_exists=true})
    require('console').start() os.exit()

    -- Client: raise the open-file limit, then open 1000 connections.
    local netbox = require('net.box')
    local ffi = require("ffi")
    local errno = require('errno')
    ffi.cdef[[
    typedef uint64_t rlim_t;
    typedef struct rlimit {
        rlim_t rlim_cur; /* Soft limit */
        rlim_t rlim_max; /* Hard limit (ceiling for rlim_cur) */
    } rlimit;
    int getrlimit(int resource, struct rlimit *rlim);
    int setrlimit(int resource, const struct rlimit *rlim);
    ]]
    local RLIMIT_CORE = 4
    local RLIMIT_NOFILE
    if ffi.os == 'Linux' then
        RLIMIT_NOFILE = 7
    elseif ffi.os == 'OSX' then
        RLIMIT_NOFILE = 8
    end
    local RLIM_INFINITY = bit.lshift(1ULL, 63) - 1
    local OPEN_MAX = 10240
    -- Returns the applied limits, or nil and an error message on failure.
    local function setrlimit(resource, limit)
        local rlimit = ffi.new('rlimit')
        rlimit.rlim_cur = limit.rlim_cur
        rlimit.rlim_max = limit.rlim_max
        local rc = ffi.C.setrlimit(resource, rlimit)
        if rc ~= 0 then
            return nil, errno.strerror()
        end
        return {rlim_cur = rlimit.rlim_cur,
                rlim_max = rlimit.rlim_max}
    end
    setrlimit(RLIMIT_NOFILE, {rlim_cur=10240, rlim_max=10240})
    local clients = {}
    for i = 1, 1e3 do
        local c = netbox.connect('127.0.0.1:3301')
        c:eval([[ return 1 ]])
        table.insert(clients, c)
    end
    require('console').start() os.exit()
-
The Infinite Loop
-
The Memory Usage
After the OOM killer has fired, it is hard to tell which subsystem was the cause.
-
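The Worst Fullscans

    -- Selects the whole `test` space just to count its rows,
    -- then does one point lookup in `other` per row.
    for i = 1, #box.space.test:select() do
        box.space.other:get(i)
    end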
-
Problem description: Proposal:
One of the advantages of these metrics is that we can easily measure amplification: the number of internal operations per business request.
-
The Fullscan Userstory
Environment:
Situation:
Let's go:
Result: I want to see in fiber.top() which request is loading the storage system:
-
The Crashdump Userstory
Env:
Situation:
Result: I want to see Lua backtraces without any TLS magic. I have luajit.py, but I do not have a LuaState for every fiber.
-
Additional info about «applier in separate thread»
-
Crashdump Userstory 2
Prerequisites
Problem
Hypothesis
-
It would be good to see the number of transactions (box.commit calls). Or, better, the number of commits and rollbacks, because each transaction may contain many modifications.
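Until such counters exist, the idea can be approximated in application code. The sketch below is an assumption-laden illustration: `txn_stat` and `counted_txn` are hypothetical names, and the `atomic` parameter is meant to be box.atomic under Tarantool, replaced here by a plain stand-in so the snippet runs in any Lua.

```lua
-- Hypothetical application-level counters.
local txn_stat = {commits = 0, rollbacks = 0}

-- Stand-in for box.atomic: just calls fn. Under Tarantool one would
-- pass box.atomic instead, which rolls back when fn raises an error.
local function fake_atomic(fn, ...)
    return fn(...)
end

-- Run fn as one transaction via `atomic` and count the outcome.
-- Only the first return value of fn is passed through.
local function counted_txn(atomic, fn, ...)
    local ok, result = pcall(atomic, fn, ...)
    if ok then
        txn_stat.commits = txn_stat.commits + 1
        return result
    end
    txn_stat.rollbacks = txn_stat.rollbacks + 1
    error(result, 0) -- re-raise the original error unchanged
end

-- One successful and one failed transaction.
counted_txn(fake_atomic, function() return 'done' end)
pcall(counted_txn, fake_atomic, function() error('constraint violated') end)
```

This counts one unit per transaction regardless of how many modifications it contains, which is the distinction the comment above asks for.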
-
box.snapshot() can block the calling fiber for a while. It would be convenient to know who called the snapshot. Sometimes it is a migration or DDL tool/script. For example, etcd logs the message source on every line.
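One way to get the caller's location today is to wrap the function and ask the standard `debug` library where the call came from. This is only a sketch: `track_caller` is a hypothetical helper, and `slow_snapshot` stands in for box.snapshot so the snippet runs outside Tarantool.

```lua
-- Wrap fn so that every call reports where it was called from.
local function track_caller(name, fn, log)
    return function(...)
        -- Level 2 is the function that called this wrapper.
        local info = debug.getinfo(2, 'Sl')
        local where = info and (info.short_src .. ':' .. info.currentline) or '?'
        log(name .. ' called from ' .. where)
        return fn(...)
    end
end

local messages = {}
local function slow_snapshot() return 'ok' end -- stand-in for box.snapshot
local snapshot = track_caller('snapshot', slow_snapshot, function(msg)
    table.insert(messages, msg)
end)

snapshot()
print(messages[1])
```

The recorded location is a source file and line number, the same kind of per-line message source the comment above mentions for etcd.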
-
These days there are several reasons for an instance to be read-only. It would be convenient to have a single field, "reason the instance is read-only":
-
The runtime arena is another place where tuples are allocated, outside of readahead, memtx, and vinyl. Sometimes a router under a heavy workload makes the runtime arena grow bigger and bigger.
-
incident
-
Tarantool stat/info interface needs redesign
The current contents of box.stat* are overloaded. Rather simple and compact in early versions, they have now become an almost unmaintainable and unreadable monster. There are a lot of entry points for different kinds of information, and I believe there is no person who could name them all without the docs or a console.
What we have for now:
- box.stat() with box modifications (INSERT, REPLACE, UPSERT, UPDATE, DELETE), box queries (SELECT), iproto calls (EVAL, CALL), a weird ERROR (which counts... exceptions), SQL (PREPARE and EXECUTE) and low-level AUTH
- box.stat.net() with an interface similar to box.stat() and info about networking: connections, requests and bytes transmitted
- box.stat.sql() with a strange format, unlike the other .stat()'s
- box.info.sql(). Contains info about SQL caches?
- box.stat.vinyl() with yet another, hierarchical format (but worth a note: it's easy for a human to read). Better to call it counters. It is also accessible from box.info.vinyl()
- box.info.memory(): memory counters.
- box.runtime.info(): some kind of Lua memory.
- box.info.gc(): information about snapshot/xlog garbage collection. Not Lua GC.
- box.info.election: info about the Raft state
There are also several entry points that could be used to obtain information about internals, like box.slab.info()/box.slab.stats(), fiber.info(), fiber.top(), require 'tarantool' and maybe something I forget now. BTW: the need for an explicit require is very unfriendly for maintenance.
The documentation states
But this is far from reality.
And with this load of information, we still lack:
index:pairs())/scans (next())
error()'s
Also, there were several discussions:
So, I have some proposals:
human (admin or developer) or machine (internal or external metrics or monitoring)
{ human = true })
The proposal is to create another global entry point for maintenance (not box.).
I propose the srv. namespace.
Then, differentiate entry points for humans and for machines by simple optional arguments:
- srv.info: for humans
- srv.info({ format='yaml' | 'json' | 'rows' | ... }): for machines
Use the following hierarchy:
A lot of tickets are related to this:
It is also worth looking into closed tickets:
Keywords used to search issues: box.info, box.stat, fiber.info, slab
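To make the proposed split between srv.info for humans and srv.info({ format = ... }) for machines more concrete, here is a rough sketch in plain Lua. It is not an existing API: the `srv` table, the `collect` function, the section names and values are all made up, and only a `rows` format is implemented.

```lua
-- Hypothetical data source; sections and numbers are invented.
local function collect()
    return {
        memory = {data = 1024, index = 256},
        net = {connections = 3},
    }
end

-- Append {path, value} pairs for every leaf of a nested table.
local function flatten(tbl, prefix, rows)
    for key, value in pairs(tbl) do
        local path = prefix and (prefix .. '.' .. key) or key
        if type(value) == 'table' then
            flatten(value, path, rows)
        else
            table.insert(rows, {path, value})
        end
    end
    return rows
end

-- 'rows' format: flat list sorted by path, so the order is stable.
local function to_rows(tbl)
    local rows = flatten(tbl, nil, {})
    table.sort(rows, function(a, b) return a[1] < b[1] end)
    return rows
end

local formatters = {rows = to_rows}

local srv = {}
srv.info = setmetatable({}, {
    -- srv.info.memory: plain nested tables, convenient in a console.
    __index = function(_, section)
        return collect()[section]
    end,
    -- srv.info({format = 'rows'}): a stable shape for machines.
    __call = function(_, opts)
        local format = opts and opts.format or 'rows'
        local formatter = formatters[format]
        if formatter == nil then
            error('unsupported format: ' .. tostring(format), 2)
        end
        return formatter(collect())
    end,
})

for _, row in ipairs(srv.info({format = 'rows'})) do
    print(row[1], row[2])
end
```

Keeping the formatters in a table means 'yaml' or 'json' could be added later without touching the entry point itself.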