Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Introduce LDMS Profiling Capability #1460

Merged
merged 2 commits into from
Dec 12, 2024
Merged

Conversation

nichamon
Copy link
Collaborator

@nichamon nichamon commented Oct 8, 2024

No description provided.

@nichamon
Copy link
Collaborator Author

nichamon commented Oct 9, 2024

The PR includes #1463 because LDMS could crash without it.

-ENOSYS, /* receive */
0, /* stream_publish */
-ENOSYS, /* stream_subscribe */
-ENOSYS /* stream_unsubscribe */
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are these error codes negative?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I made the error code negative because I used 1 as enabled and 0 as disabled. I'll add a comment that includes the meanings of 0 and 1.

if (ops_err)
bzero(ops_err, sizeof(int) * ops_cnt);
for (i = 0; i < ops_cnt; i++) {
if (enable_profiling[ops[i]] == -ENOSYS) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like we are mixing -error and error in this logic. Note that the whole -error convention comes from the Linux kernel to distinguish between an "error code" and a "pointer". This convention was bad then and it still is, so IMHO we should avoid it.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm addressing this.

const char *ldms_sockaddr_ntop(struct sockaddr *sa, char *buff, size_t sz);
const char *ldms_addr_ntop(struct ldms_addr *addr, char *buff, size_t sz);

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need both of these interfaces?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The profiling code calls ldms_addr_ntop(). After investigation, no paths call ldms_sockaddr_ntop(), so I will remove the API. I also found that ldms_addr_ntop() also declared in ldms_rail.h. I'll remove the one in ldms_rail.h. Thanks for pointing this out.

*
* curr_updt_ctxt is NULL when there is no outstanding update on the set.
* ldms doesn't update the set while there is an outstanding update on the set.
*/
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry Mon, but I have no clue what is happening here based on the comment.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll improve the comment. Currently, curr_updt_ctxt contains profiling timestamps of the ongoing update operation on the set. The field is NULL when no update is in progress.

@@ -76,6 +76,9 @@ extern ovis_log_t xlog;
ovis_log(xlog, OVIS_LERROR, fmt, ## __VA_ARGS__); \
} while (0);

/* The definition is in ldms.c. */
extern int enable_profiling[LDMS_XPRT_OP_COUNT];

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this it not a public interface, then the convention is to put an underscore or two in front of it. This indicates to the user that they should not use this symbol or its contents.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm working on the change.

@@ -86,13 +89,14 @@ static int __rail_sockaddr(ldms_t _r, struct sockaddr *local_sa,
struct sockaddr *remote_sa,
socklen_t *sa_len);
static void __rail_close(ldms_t _r);
static int __rail_send(ldms_t _r, char *msg_buf, size_t msg_len);
static int __rail_send(ldms_t _r, char *msg_buf, size_t msg_len,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a style comment, but I don't understand the logic of putting '' in front of a static symbol. '' is meant to imply that the symbol is not a published interface, however, but definition, a 'static' symbol is not public.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see __ and static serving different purposes. __ is for readability, and static is for compilation/symbol access control. They're complementary rather than redundant.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@nichamon, by that logic every non-static symbol inside a .c file would be proceeded by an __, and that is not descriptive at all in my opinion. BTW, I am guilty of this too ... and recently.

@tom95858
Copy link
Collaborator

@nichamon, this is a convention that is used throughout the kernel for interfaces that return pointers to determine whether it's an error value or a valid pointer. In the kernel everyone uses the same convention. In our logic, anyone interpreting the return value would have to do so by interface. In other words, it is not consistent. In user mode, the convention for interfaces returning integers: -1 means error, and then the application consults errno for more information.

@nichamon nichamon added this to the v4.5.1 milestone Oct 23, 2024
@nichamon nichamon requested a review from tom95858 October 29, 2024 19:16
@@ -616,6 +617,7 @@ class LDMSD_Request(object):
SET_SEC_MOD = 0x600 + 19
LOG_STATUS = 0x600 + 20
STATS_RESET = 0x600 + 21
PROFILING = 0x600 + 31
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@nichamon, is there some reasons we are bumping by 31 instead of 22?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@nichamon, is there some reasons we are bumping by 31 instead of 22?

It's the same number defined in ldmsd_request.h:168 in this change.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tom95858 I added the comment to explain why the ID skips the range.

This commit introduces profiling capabilities to LDMS operations,
including lookup, update, set_delete, and stream_publish. It collects
timestamps for key events, such as API calls, request/response
exchanges, and completion notifications. The feature is disabled by
default but can be configured using the 'profilng' configuration command
to enable, disable, or reset data collection. This enhancement will aid
in performance analysis and optimization.
@nichamon nichamon requested a review from tom95858 December 12, 2024 15:13
@tom95858 tom95858 merged commit 5160ffc into ovis-hpc:main Dec 12, 2024
14 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants