Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Avoid segment fault in ASIC/SDK health event handling when vendor SAI passes an invalid timestamp #3446

Open
wants to merge 3 commits into
base: master
Choose a base branch
from

Conversation

stephenxs
Copy link
Collaborator

@stephenxs stephenxs commented Jan 1, 2025

What I did

Prevent orchagent from being segment fault when it receives a timestamp indicating a time in the far future (2^31 years later) in the ASIC/SDK health event from the vendor SAI.
It's vendor SAI's failure to pass such a large timestamp but we need to protect such invalid input.

Why I did it

How I verified it

Mock test

Details if related

In case vendor SAI passed a very large timestamp, put_time can cause segment fault which can not be caught by try/catch infra
We check the difference between the timestamp from SAI and the current time and force to use current time if the gap is too large
By doing so, we can avoid the segment fault

@mssonicbld
Copy link
Collaborator

/azp run

Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@mssonicbld
Copy link
Collaborator

/azp run

Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@mssonicbld
Copy link
Collaborator

/azp run

Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@stephenxs
Copy link
Collaborator Author

The coverage report is inaccurate. The mock test covers all lines. see below back trace

#11 0x000055555573b2a0 in main ()
(gdb) l
1108                                                sai_switch_asic_sdk_health_severity_t severity,
1109                                                sai_timespec_t timestamp,
1110                                                sai_switch_asic_sdk_health_category_t category,
1111                                                sai_switch_health_data_t data,
1112                                                const sai_u8_list_t &description)
1113    {
1114        std::vector<swss::FieldValueTuple> values;
1115        const string &severity_str = switch_asic_sdk_health_event_severity_reverse_map.at(severity);
1116        const string &category_str = switch_asic_sdk_health_event_category_reverse_map.at(category);
1117        string description_str;
(gdb) 
1118        std::time_t t = (std::time_t)timestamp.tv_sec;
1119        const std::time_t now = std::time(0);
1120        const double year_in_seconds = 86400 * 365;
1121        stringstream time_ss;
1122
1123        /*
1124         * In case vendor SAI passed a very large timestamp, put_time can cause segment fault which can not be caught by try/catch infra
1125         * We check the difference between the timestamp from SAI and the current time and force to use current time if the gap is too large
1126         * By doing so, we can avoid the segment fault
1127         */
(gdb) 
1128        if (difftime(t, now) > year_in_seconds)
1129        {
1130            SWSS_LOG_ERROR("Invalid timestamp second %" PRIx64 " in received ASIC/SDK health event, reset to current time", timestamp.tv_sec);
1131            t = now;
1132        }
1133
1134        time_ss << std::put_time(std::localtime(&t), "%Y-%m-%d %H:%M:%S");
1135
1136        switch (data.data_type)
1137        {
(gdb) b 1130
Breakpoint 3 at 0x555555af3ade: file ../../orchagent/switchorch.cpp, line 1130.
(gdb) c
Continuing.

Thread 1 "tests" hit Breakpoint 3, SwitchOrch::onSwitchAsicSdkHealthEvent (this=0x55555661aee0, switch_id=<optimized out>, 
    severity=severity@entry=SAI_SWITCH_ASIC_SDK_HEALTH_SEVERITY_FATAL, timestamp=..., category=category@entry=SAI_SWITCH_ASIC_SDK_HEALTH_CATEGORY_FW, data=..., description=...)
    at ../../orchagent/switchorch.cpp:1130
1130            SWSS_LOG_ERROR("Invalid timestamp second %" PRIx64 " in received ASIC/SDK health event, reset to current time", timestamp.tv_sec);
(gdb) p t
$1 = 172479515853275099
(gdb) p now
$2 = 1735861776
(gdb) n
1131            t = now;
(gdb) 
1134        time_ss << std::put_time(std::localtime(&t), "%Y-%m-%d %H:%M:%S");
(gdb) p t
$3 = 1735861776
(gdb) bt
#0  SwitchOrch::onSwitchAsicSdkHealthEvent (this=0x55555661aee0, switch_id=<optimized out>, severity=severity@entry=SAI_SWITCH_ASIC_SDK_HEALTH_SEVERITY_FATAL, timestamp=..., 
    category=category@entry=SAI_SWITCH_ASIC_SDK_HEALTH_CATEGORY_FW, data=..., description=...) at ../../orchagent/switchorch.cpp:1134
#1  0x00005555559477bb in on_switch_asic_sdk_health_event (switch_id=<optimized out>, severity=severity@entry=SAI_SWITCH_ASIC_SDK_HEALTH_SEVERITY_FATAL, timestamp=..., 
    category=category@entry=SAI_SWITCH_ASIC_SDK_HEALTH_CATEGORY_FW, data=..., description=...) at ../../orchagent/notifications.cpp:79
#2  0x00005555558da537 in switchorch_test::SwitchOrchTest::checkAsicSdkHealthEvent (timestamp=..., expected_key="", this=<optimized out>) at switchorch_ut.cpp:138
#3  0x00005555558dabc8 in switchorch_test::SwitchOrchTest_SwitchOrchTestHandleEventInvalidTimeStamp_Test::TestBody (this=<optimized out>) at switchorch_ut.cpp:337
#4  0x0000555555e10f17 in void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) ()
#5  0x0000555555e02cde in testing::Test::Run() ()
#6  0x0000555555e02e95 in testing::TestInfo::Run() ()
#7  0x0000555555e03439 in testing::TestSuite::Run() ()
#8  0x0000555555e087df in testing::internal::UnitTestImpl::RunAllTests() ()
#9  0x0000555555e11487 in bool testing::internal::HandleExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>(testing::internal::UnitTestImpl*, bool (testing::internal::UnitTestImpl::*)(), char const*) ()
#10 0x0000555555e02f51 in testing::UnitTest::Run() ()
#11 0x000055555573b2a0 in main ()
(gdb) 

@stephenxs stephenxs requested review from kcudnik and prsunny January 2, 2025 23:52
@stephenxs stephenxs marked this pull request as ready for review January 3, 2025 05:02
@prsunny
Copy link
Collaborator

prsunny commented Jan 6, 2025

@stephenxs , able to fix coverage?

Comment on lines +1128 to +1132
if (difftime(t, now) > year_in_seconds)
{
SWSS_LOG_ERROR("Invalid timestamp second %" PRIx64 " in received ASIC/SDK health event, reset to current time", timestamp.tv_sec);
t = now;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think this could be make as static helper function to correct_time etc, and it would be easier to test something like this

@stephenxs stephenxs force-pushed the handle-invalid-timestamp branch from 608b1f2 to 4bab3b2 Compare January 9, 2025 10:39
@mssonicbld
Copy link
Collaborator

/azp run

Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@stephenxs
Copy link
Collaborator Author

@stephenxs , able to fix coverage?

@prsunny I think this is an issue in the coverage tool itself.
A new test case I added in the PR covers the lines and a gdb backtrace also proves that.
could someone from your team help to check the coverage issue?

… passes an invalid timestamp

Signed-off-by: Stephen Sun <[email protected]>
Signed-off-by: Stephen Sun <[email protected]>
Signed-off-by: Stephen Sun <[email protected]>
@stephenxs stephenxs force-pushed the handle-invalid-timestamp branch from 4bab3b2 to 81fdf54 Compare January 14, 2025 02:12
@mssonicbld
Copy link
Collaborator

/azp run

Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@stephenxs
Copy link
Collaborator Author

/azpw run

@mssonicbld
Copy link
Collaborator

/AzurePipelines run

Copy link

Azure Pipelines successfully started running 1 pipeline(s).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants