Is this AMF completely fault-tolerant ? #283

Open
dark-astra opened this issue Aug 18, 2024 · 5 comments

Comments

@dark-astra
dark-astra commented Aug 18, 2024

The 5G UE Registration call flow consists of multiple exchanges of Uplink and Downlink Transport Messages. If the AMF pod fails at some stage between these message exchanges, will the UE registration fail and restart with a new registration request to the new AMF, or can the UE resume the registration with the new AMF at the very step where it failed?

I see you are saving the context in MongoDB, but I wanted to know: is the statelessness procedural, where the context is saved only after the registration procedure completes, or is it saved after each message exchange with the RAN throughout the registration procedure, so that a registration that fails partway through need not be restarted from the beginning?

I did a simple experiment to test this:

I installed SD-Core using Aether OnRamp and configured gnbsim to run the REGISTRATION procedure for 10 UEs.
When gnbsim starts sending requests, I delete the AMF pod, and this causes the registrations to fail.

ok: [node1] => {
    "gNbsimPod.stdout_lines": [
        "time="2024-08-18T17:50:22Z" level=info msg="Profile Name: profile1 , Profile Type: register" category=Summary component=GNBSIM",
        "time="2024-08-18T17:50:22Z" level=info msg="Ue's Passed: 6 , Ue's Failed: 4" category=Summary component=GNBSIM",
        "time="2024-08-18T17:50:22Z" level=info msg="Profile Errors:" category=Summary component=GNBSIM",
        "time="2024-08-18T17:50:22Z" level=error msg="imsi , profile timeout" category=Summary component=GNBSIM",
        "time="2024-08-18T17:50:22Z" level=error msg="imsi , profile timeout" category=Summary component=GNBSIM",
        "time="2024-08-18T17:50:22Z" level=error msg="imsi , profile timeout" category=Summary component=GNBSIM",
        "time="2024-08-18T17:50:22Z" level=error msg="imsi , profile timeout" category=Summary component=GNBSIM",
        "time="2024-08-18T17:50:22Z" level=info msg="Profile Status: FAIL" category=Summary component=GNBSIM"
    ]
}
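
For anyone repeating this experiment, the summary can be tallied with a few lines of Python rather than read by eye. This is only a sketch: the `summarize` helper and the regular expression are hypothetical and assume the log format matches the lines pasted above (the IMSI values are missing from the pasted log and are left out here too).

```python
import re

# Matches the summary line exactly as it appears in the pasted gnbsim output.
SUMMARY_RE = re.compile(r"Ue's Passed:\s*(\d+)\s*,\s*Ue's Failed:\s*(\d+)")


def summarize(lines):
    """Return (passed, failed, timeouts) from gnbsim summary log lines."""
    passed = failed = timeouts = 0
    for line in lines:
        match = SUMMARY_RE.search(line)
        if match:
            passed, failed = int(match.group(1)), int(match.group(2))
        elif "level=error" in line and "profile timeout" in line:
            timeouts += 1
    return passed, failed, timeouts


# Sample lines copied from the output above (IMSI values elided in the source).
SAMPLE = '''\
time="2024-08-18T17:50:22Z" level=info msg="Ue's Passed: 6 , Ue's Failed: 4" category=Summary component=GNBSIM
time="2024-08-18T17:50:22Z" level=error msg="imsi , profile timeout" category=Summary component=GNBSIM
time="2024-08-18T17:50:22Z" level=error msg="imsi , profile timeout" category=Summary component=GNBSIM
time="2024-08-18T17:50:22Z" level=error msg="imsi , profile timeout" category=Summary component=GNBSIM
time="2024-08-18T17:50:22Z" level=error msg="imsi , profile timeout" category=Summary component=GNBSIM
'''.splitlines()

result = summarize(SAMPLE)
print(result)  # (6, 4, 4)
```

Here all four failures are profile timeouts, which is consistent with the UEs waiting for a response that never arrived rather than being rejected.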

My gnbsim-default.yaml looks like the following:

profiles:
  - profileType: register # UE Registration
    profileName: profile1
    enable: true
    gnbName: gnb1
    execInParallel: false
    startImsi: 208930100007487
    ueCount: 10
    defaultAs: "192.168.250.1"
    perUserTimeout: 10
    plmnId:
      mcc: 208
      mnc: 93
    opc: "981d464c7c52eb6e5036234984ad0bcf"
    key: "5122250214c33e723a5dd523fc145fc0"
    sequenceNumber: "16f3b3f70fc2"

@thakurajayL
Contributor

Hi @dark-astra. Thank you for trying out SD-Core/Aether OnRamp.

"I see you are saving the context in MongoDB, but I wanted to know, is the statelessness procedural, where after the registration procedure, the context is saved or is it saved after each message exchange with RAN all along the registration procedure, so that the registration, need not be started from the beginning if it fails, somewhere in between ?"

The context is stored only after the complete registration procedure, not after each message exchange.
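
To make the difference between the two options concrete, here is a toy model, not the AMF's actual code. The step list is a simplified set of uplink messages, and `run_registration`, the `store` dict (standing in for the database), and the `crash_after` parameter are all hypothetical names used for illustration.

```python
# Simplified uplink steps of a registration; the real call flow has more messages.
REGISTRATION_STEPS = [
    "RegistrationRequest",
    "AuthenticationResponse",
    "SecurityModeComplete",
    "RegistrationComplete",
]


def run_registration(store, ue_id, checkpoint_each_step, crash_after=None):
    """Process the steps for one UE; return False if the simulated AMF crashes.

    `store` stands in for the external database. Progress that was not
    written to it is lost when the crash happens.
    """
    done = store.get(ue_id, 0)  # resume from the last persisted step, if any
    for index in range(done, len(REGISTRATION_STEPS)):
        if crash_after is not None and index == crash_after:
            return False  # in-memory progress is lost here
        done = index + 1
        if checkpoint_each_step:
            store[ue_id] = done  # persist after every message
    store[ue_id] = done  # persist once the whole procedure has completed
    return True


# Context saved only at the end: a crash before step 3 leaves nothing to resume.
procedural_store = {}
run_registration(procedural_store, "ue1", checkpoint_each_step=False, crash_after=2)
print(procedural_store.get("ue1", 0))  # 0

# Context saved per message: a replacement AMF could continue from step 3.
per_step_store = {}
run_registration(per_step_store, "ue1", checkpoint_each_step=True, crash_after=2)
print(per_step_store["ue1"])  # 2
```

With the first behaviour, which is the one described in the answer above, a UE whose registration is interrupted has to start again from the first message.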

@dark-astra
Author

dark-astra commented Aug 20, 2024

@thakurajayL Deleting the AMF pod makes all subsequent UE registrations fail. gnbsim is not able to connect to the new AMF after the old AMF is deleted. Shouldn't gnbsim be able to process the subsequent UE registrations with the new AMF pod?

@thakurajayL
Contributor

Good question. There are multiple things involved here:

  1. If you are running the default gnbsim config, it sends one request at a time, and if gnbsim does not get a response it gets stuck. You will see the execInParallel setting at two levels. If you set it to true at the top level, all profiles run in parallel. If you set execInParallel to true within a profile, all the subscribers within that profile run in parallel.
  2. Regardless of whether execInParallel is true or false, if the AMF does not respond, signalling stops. There is a PR available for reconnecting to a new AMF, but it needs some corrections. This requires either a code fix or an update to the old PR.
  3. You can enable sctplb in the deployment. sctplb is stateless, so it can handle the crash: if the AMF crashes, the newly restarted AMF works with sctplb as is.
  4. Of course, in some cases sctplb needs to resend the message. We have support for retrying the service request; similar support needs to be added for other messages as well. We would be happy if you want to add the code.

Thanks
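
The resend behaviour mentioned in point 4 can be pictured as a bounded retry loop. The sketch below is generic Python, not sctplb's implementation: `forward_with_retry` and `FlakyBackend` are hypothetical names, and a `ConnectionError` stands in for "the AMF did not answer".

```python
import time


def forward_with_retry(send, message, max_attempts=3, delay_s=0.0, sleep=time.sleep):
    """Forward `message` via `send`; retry on ConnectionError.

    Returns (reply, attempts_used). Re-raises the last error once
    `max_attempts` sends have failed.
    """
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return send(message), attempt
        except ConnectionError as err:
            last_error = err
            if attempt < max_attempts:
                sleep(delay_s)  # give the restarted backend time to come up
    raise last_error


class FlakyBackend:
    """Hypothetical AMF stand-in that is down for the first `down_for` calls."""

    def __init__(self, down_for):
        self.down_for = down_for
        self.calls = 0

    def __call__(self, message):
        self.calls += 1
        if self.calls <= self.down_for:
            raise ConnectionError("AMF unavailable")
        return f"ack:{message}"


backend = FlakyBackend(down_for=2)
reply, attempts = forward_with_retry(backend, "RegistrationRequest")
print(reply, attempts)  # ack:RegistrationRequest 3
```

A retry like this only helps if the restarted AMF can process the resent message from scratch, which ties back to where in the procedure the context is saved.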

@dark-astra
Author

dark-astra commented Aug 26, 2024

@thakurajayL, Thanks for the detailed response.

Enabling the SCTP load balancer using the configuration has resolved the issue where the new AMF was not handling subsequent requests. However, I'm still encountering a problem: when the old AMF fails, a few UEs (around 3-4) experience timeouts or failures before the new AMF starts registering them again.

Even though I'm running the UE registrations sequentially, I'm puzzled as to why there are multiple timeouts or failures before the new AMF takes over.
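
One possible explanation, offered only as a back-of-the-envelope estimate: with sequential execution, each UE attempted while the AMF is down waits out its full timeout before the next UE starts. The sketch below assumes that perUserTimeout is in seconds and applies per UE; the 35-second outage is a made-up figure, not a measurement, and `estimate_failed_ues` is a hypothetical helper.

```python
import math


def estimate_failed_ues(amf_downtime_s, per_user_timeout_s):
    """Rough count of sequential UEs that time out during an AMF outage.

    Assumes UEs run one at a time and each failed UE blocks for the
    full timeout before the next one is attempted.
    """
    if amf_downtime_s <= 0:
        return 0
    return math.ceil(amf_downtime_s / per_user_timeout_s)


# perUserTimeout: 10 from the profile above; 35 s outage is hypothetical.
print(estimate_failed_ues(35, 10))  # 4
```

Under these assumptions, an AMF that takes roughly 30 to 40 seconds to become ready again would account for 3 to 4 timed-out UEs even with sequential registrations.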

Is there native support in gnbsim itself for retrying the service request?

This issue has been stale for 120 days and will be closed in 15 days. Comment to keep it open.
