-
Notifications
You must be signed in to change notification settings - Fork 54
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Introduce LDMS Profiling Capability #1460
Conversation
c6fb393
to
36ae4a1
Compare
The PR includes #1463 because LDMS could crash without it. |
ldms/src/core/ldms.c
Outdated
-ENOSYS, /* receive */ | ||
0, /* stream_publish */ | ||
-ENOSYS, /* stream_subscribe */ | ||
-ENOSYS /* stream_unsubscribe */ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why are these error codes negative?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I made the error code negative because I used 1 as enabled and 0 as disabled. I'll add a comment that includes the meanings of 0 and 1.
ldms/src/core/ldms.c
Outdated
if (ops_err) | ||
bzero(ops_err, sizeof(int) * ops_cnt); | ||
for (i = 0; i < ops_cnt; i++) { | ||
if (enable_profiling[ops[i]] == -ENOSYS) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems like we are mixing -error and error in this logic. Note that the whole -error convention comes from the Linux kernel to distinguish between an "error code" and a "pointer". This convention was bad then and it still is, so IMHO we should avoid it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm addressing this.
const char *ldms_sockaddr_ntop(struct sockaddr *sa, char *buff, size_t sz); | ||
const char *ldms_addr_ntop(struct ldms_addr *addr, char *buff, size_t sz); | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need both of these interfaces?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The profiling code calls ldms_addr_ntop()
. After investigation, no paths call ldms_sockaddr_ntop()
, so I will remove the API. I also found that ldms_addr_ntop()
also declared in ldms_rail.h
. I'll remove the one in ldms_rail.h
. Thanks for pointing this out.
* | ||
* curr_updt_ctxt is NULL when there is no outstanding update on the set. | ||
* ldms doesn't update the set while there is an outstanding update on the set. | ||
*/ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry Mon, but I have no clue what is happening here based on the comment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll improve the comment. Currently, curr_updt_ctxt
contains profiling timestamps of the ongoing update operation on the set. The field is NULL when no update is in progress.
@@ -76,6 +76,9 @@ extern ovis_log_t xlog; | |||
ovis_log(xlog, OVIS_LERROR, fmt, ## __VA_ARGS__); \ | |||
} while (0); | |||
|
|||
/* The definition is in ldms.c. */ | |||
extern int enable_profiling[LDMS_XPRT_OP_COUNT]; | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If this it not a public interface, then the convention is to put an underscore or two in front of it. This indicates to the user that they should not use this symbol or its contents.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm working on the change.
@@ -86,13 +89,14 @@ static int __rail_sockaddr(ldms_t _r, struct sockaddr *local_sa, | |||
struct sockaddr *remote_sa, | |||
socklen_t *sa_len); | |||
static void __rail_close(ldms_t _r); | |||
static int __rail_send(ldms_t _r, char *msg_buf, size_t msg_len); | |||
static int __rail_send(ldms_t _r, char *msg_buf, size_t msg_len, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a style comment, but I don't understand the logic of putting '' in front of a static symbol. '' is meant to imply that the symbol is not a published interface, however, but definition, a 'static' symbol is not public.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see __
and static
serving different purposes. __
is for readability, and static
is for compilation/symbol access control. They're complementary rather than redundant.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@nichamon, by that logic every non-static symbol inside a .c file would be proceeded by an __, and that is not descriptive at all in my opinion. BTW, I am guilty of this too ... and recently.
@nichamon, this is a convention that is used throughout the kernel for interfaces that return pointers to determine whether it's an error value or a valid pointer. In the kernel everyone uses the same convention. In our logic, anyone interpreting the return value would have to do so by interface. In other words, it is not consistent. In user mode, the convention for interfaces returning integers: -1 means error, and then the application consults errno for more information. |
36ae4a1
to
a77173a
Compare
@@ -616,6 +617,7 @@ class LDMSD_Request(object): | |||
SET_SEC_MOD = 0x600 + 19 | |||
LOG_STATUS = 0x600 + 20 | |||
STATS_RESET = 0x600 + 21 | |||
PROFILING = 0x600 + 31 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@nichamon, is there some reasons we are bumping by 31 instead of 22?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@nichamon, is there some reasons we are bumping by 31 instead of 22?
It's the same number defined in ldmsd_request.h:168 in this change.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tom95858 I added the comment to explain why the ID skips the range.
a77173a
to
83d2b77
Compare
This commit introduces profiling capabilities to LDMS operations, including lookup, update, set_delete, and stream_publish. It collects timestamps for key events, such as API calls, request/response exchanges, and completion notifications. The feature is disabled by default but can be configured using the 'profilng' configuration command to enable, disable, or reset data collection. This enhancement will aid in performance analysis and optimization.
83d2b77
to
63ff4f5
Compare
No description provided.