-
Notifications
You must be signed in to change notification settings - Fork 278
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Failed to remove dummy SAI objects after warm-reboot causing orchagent crash #1429
Comments
hey, what do you mean dummy objects? syncd is creating objects based on real object id that could be discovered during warm boot to new firmware, then those objects will be created in asic-db without any attributes. Since there is some issue here, it could be due to a bug in brcm or syncd, but i having a lot of similar issues with brcm, i would lean to that this is a bug on their side, probably, the same objects get different object IDs after warm boot from your description it seems like you are doing :
and that second warm boot fails. for better analyze can you attach syslog from entire process from first cold boot to the last ?, if it's confidential, please send that to me directly on my email or teams |
Regarding dummy objects, I was referring to the ones created in
The problem seems to be that after the warm-reboot from 201811 to 202012, apart from the 5 real objects that are same as what existed on 201811 prior to upgrade, there are 5 additional objects that are the dummy ones with no attributes. I shared the saidump output earlier (after step 2); asic_db output showed the same:
I assume the 5 objects with oid ending with c60, c61, c62, c63 and c64 are the real objects; the 5 ones with oid ending with bd3, bd4, bd5, bd6 and bd7 are the dummy ones that are supposed to be transient, but remain there after warm-reboot is complete. Yes, you got the steps all correct. I have records for all 3 steps (one cold boot followed by two warm boots). I'll get you the logs separately shortly. Thanks! |
my suspicion is those additional 5 objects, are the same as they were on 202018, but broadcom changed something that their OIDs are differenet then previously, this actually happened before, and im not sure whether that was BUFFER_PROFILE as well, so what was happening since those buffer profiles were created by orchagend and assigned to other objects, they maintain this OID value, but when doing GET after warm boot, other OID value was returned, im 99% sure that this is the case |
please also provide recordings from /var/log/swss/sairedis.rec* files as they will be essential to have configuration |
Appreciate your input! I've sent the sairedis.rec files. If it's something to do with Broadcom SAI, I wouldn't expect any change from them as the version is pretty old. In this situation, for the upgrade to proceed with warm-reboot, we would be looking for sort of a workaround. Is there any fix or workaround available as you know? Specifically:
|
ok, so i analyzed logs you sent me, and those logs confirms my suspicion:
you have 5 profiles ending from b10 to b14
you have now 10 buffer profiles, and since OA generated only 5 buffer profiles, they were matched from previous existed ones:
you can check that with matching RID values from table RIDTOVID and VIDTORID but those tables exists only on redisdb not in saidump since saidump dont dump any non SAI related objects VID value oid:0x19000000000b12 from 201811 will have the same RID as VID buffer 0x19000000000c62 for 202012 so objects from b10 to b14 on 201811 were correctly matched with temporary objects on 202012 and reassigned values from c60 to c64 (they could have different order, but the range is correct) now for remaining objects from 202012 warm boot from bd3 to bd7, those 5 objects with null values in s2_asic_db, those are objects which are discovered during 1st warm boot after switch create when we do discovery logic warm boot succeeded, previous 5 buffer profiles were remapped to new temporary 5 buffer profiles, and since previously those buffer profiles had RID values created by OA already have RID in database (RID from 201811) this RID is different from 202012 RID when broadcom changed OID encoding (or bug endoding) so comparison logic is reassigning old RID since it's exists on ASIC_DB, but syncd is discovering new RID which don't match anything in APP_DB and creates dummy objects. those newly discovered RID values are probably correct ones (if broadcom changed encoding) and those RID which were returned in 201811 branch are now incorrect (unless enoding bug and both them should be the same) of course from our perspective RID values for those buffers should be the same in 201811 and in 202012 branches on 1st warm boot you can see what discovery returned:
as you can see that it discovered 10 buffer profiles, this IS A BUG from broadcom, it happens because those buffer profiles are assigned on different places and in one place it returns one RID value on same buffer profiles and 2nd RID value on other places you can see NEW (invalid RIDS) disovered on buffer profiles:
at this point there is inconsistent state in ASIC DB and syncd since 10 RIDs point to the same 5 objects, because of brcm bug
discovery on syncd still discovers 10 buffer profiles:
but since OA created only 5 of them now there is discrepency:
you got 10 in DB but 5 created, since comparison logic correctly matched those 5 buffer profiles (non dummy ones) you can see that comparison logic scheduled remove for those 5 dummy ones:
and that operation failed: (s3_syslog)
causing OA to crash, BRCM is returning here SAI_STATUS_INVALID_PARAMETER, so it means its that NEW discovered RID in 1st warm boot that resulted in created dummy buffer profiles, is not VALID this is broadcom BUG, it's caused via different RID values between 2 different version of their SDK, it's hard to tell whether they changed encoding on RID (vendors usually encode some special values inside RID for convenience) or this is a bug, and they forget to return correct value of RID in place where is returned the one for dummy object what you can do is to ask BRCM whether NEW RID values ARE correct:
those are the one that are discovered and created a new dummy objects. if you do first cold boot, you can look into RIDTOVID table in redis ASIC_DB and check what are looking values for buffer_profiles after cold boot (they should contain 19 in hex value as the one above) but since those are not existing after warm boot, im guessing that after cold boot RID values will have extra bit set to 1 (or couple bits), or they could have continuous values and not it's to broadcom to figure out, whether they forget to set to 1 extra bit or whether they wanted to remove those bits from RID encoding Please take a look to actually any cold/warm boot db and do a dump from RIDTOVID table and post here RID VALUES |
current logging in syncd does not show at which attribute each OID is discovered, but it could help to understand where the bug occurs on which attribute, based on SAI headers buffer_profile could be set here:
but it;s not said that it's not returned on some other attribute as a bug my guess would be on port_attr_qos |
we probably can make workaround on our side, but it will be problematic, since it's not only that we keep those extra discovered objects in ASIC_DB, those are also kept in syncd memory, so most likely fix would need to be in the syncd code to make up for that, since those objects are discovered on 202012 warm boot branch and to make that code change we would need to have those invalid RID values how they look like, and where they are discovered we would need to have a log information added after that : if succeeded, then we need to loggedd attribute ID (as string enum) and all RID values to make sure where those values are discovered then we can prepare image with workaround for this specific invlaid RIDs for that specific branch question is whether this is only for your specific case, or you have a bunch of devices in that state with different configurations first we would need to add logging, do a warm boot, see where those rids are, and then add workaround based on received data of course we can also prepare generic woarkaround that will ignore ALL dummy discovered BUFFER_PROFILES, maybe that's better ?? but probably it would be more strict to just restrict those also to specific attribute |
We will debug further to understand the RIDs changes at different stage. I also agree ignoring the dummy discovered BUFFER_PROFILES would be a simpler solution, we can control that during the first warm-reboot , this will make the system clean for future upgrades and warm-reboots. Do you have any pointers which part of the code should be changed for this approach? |
@kcudnik Thanks a lot for your analysis and suggestions! A few things I've tried so far:
As a result, after warm-reboot from 201811 to 202012, the ASIC_DB didn't have the dummy objects, but the next warm-reboot still failed most likely due to the discovered objects still in syncd memory.
The existing RIDs that came from before warm-reboot (i.e., oid:0x1900000006 to oid:0x190000000a) also have one of these two attributes. Another observation is that, with code boot followed by warm-reboot on 202012 only, the RIDs oid:0x1900000001 to oid:0x1900000005 never existed (only oid:0x1900000006 to oid:0x190000000a before and after warm-reboot). I'll get you ASIC_DB dumps and syslog with the added logs shortly. |
yes, i already know what needs to be changed, but it's not that straight forward, that this change will fix this, since if you first warmboot and ignore dummy RIDs during discovery process, it may still happen that when OA will do GET operation on those you still get an invalid RID since it's still in vendor SDK memory so better approach would be to ignore them during discovery and do 2nd warm boot, since it seems like invalid RIDs are then gone after that, but this does not guarantee that system will be stable, since if those invalid values are discovered, then there is a bug in vendor code, and most likely in long system uptime will lead inconsistency |
@Stephenxf manual remove from ASIC_DB will not help, since entire db also sits in syncd memory, so when you will do warmboot agian you will still have invalid rids in asic db again putted by syncd. skipping those methods you pointed its not enoug, they should noe be touched, invalid rid's should be ignored in discovery process, but this may be not enough, probably those RID's should be translated, since previous values are in syncd memory and asic db, then values restored from discovery on correct attributres: SAI_INGRESS_PRIORITY_GROUP_ATTR_BUFFER_PROFILE are those were then reassigned with new (dummy) ones because of vendor bug, it seems like vendor misscalculate those, so probably if they were from 0x6 to 0xa first then as a bug they are combined from 0x1 to 0x5, but they should stay at the same number, for this particular system we can just add 0x5 value to each to them and that should be enough, but in production that will be hard to estimate what happens actually and how many buffer profiles are there after deeper thought, just ignoring new dummy values in disocvery process ,maybe not enough, but we could try that if it will work, problem will be if we would need to modify any existing buffer profile after 1st warm boot, since previous RID seems to be invalid if you have easy way to modify discovery code, just skip adding discovered RID when the attribute is one of those 2 above to not discover old/new buffer profiles on those attributes and let see if this workaround will solve the problem but log those values with log level WARN to see if they will back to correct ones after 2nd warm boot in previous logs i saw that there was SET request on one of the buffer profiles in syslog, so this operation can fail since RID will not be valid let me know if you will be able to ignore those 2 attributes during discovery |
Sure, will try the change of skipping the discovered RIDs for attributes BTW, these 5 RIDs/attributes are supposed to be the ones created based on the RIDs prior to warm-reboot:
These are the 5 extra ones most likely introduced by bcmsai during warm-reboot to 202012:
|
this is interesting that in 2 cases first we have 3 ingress buffer profiles, and then 2 |
I analysed logs you provided and besides dummy buffer profiles, there is an extra operation:
but it's outside apply view, so we should be good here, i dont see any errors on that in syslog, so assummed OA changed this value, and RID value matches, so if you ignore those extra during discovery process, we should be good and from logs i found this:
it's weird that those values are in different places, but hey, its a bug :P |
@kcudnik it's looking promising now! I made the change to skip those two objects. It took me a few iterations to get the IDs and TYPEs correct. I guess I added too many logs so syslog started rate-limiting :) I just cleaned up the code a bit. A new image is being built. I'll get the syslog later. So basically this is the change:
Not sure why the attribute Following the first warm-reboot (upgrade from 201811 to 202012), I did warm-reboot a couple of more times, no issue identified, the number of Appreciate your great and consistent help! Let me get some logs and output for further analysis and confirmation. |
@kcudnik sent the logs and dumps separately. The workaround looks good so far with basic testing. Will verify further in the next few days. Now the discovered objects no longer have After warm-reboot from 201811 to 202012:
After one more warm-reboot from 202012 to 202012:
|
hey, not sure how you found this, because from all syslog analysis you provided all discover methods, return object type BUFFER_PROFILE on attribute SAI_QUEUE_ATTR_BUFFER_PROFILE_ID on object QUEUE, in my understanding you found a QUEUE, which on attribute BUFFER_PROFILE_ID returns object of type QUEUE instead of BUFFER_PROFILE ? from your syslog this is alll i found:
so at least object type matches on attribute, what don't match is the number of buffer profiles which is 10 |
ok, that's good, but that's not all, so because we ignored those buffer profiles, and ASIC_DB contains 5 buffer profiles as expected, the problem may still exists on those objects queues/ipg so what we want to achieve is this:
and those should be successful , but also if for example we remove that discovery workaorund, at 3rd warm boot to 202012, discovery will discover only 5 buffer profiles and those with RIDs that are in ASIC DB, but i actually doubt about that, i think the problem is still there and there is easy check for that actually, there is command that saiplayer can issue - it's "inspect asic" this will compare ASIC db with everything that is on the device and will report any errors in logs, you can probably run this on running system, by running "./saiplayer -i test.rec" command in syncd docker, test.rec can be empty file, but if everything is fine you should not see any error messages in syncd syslog, if there will be any missmatches between ASIC and ASIC_DB then you will see error messages in syslog giving you attr name and values with REDIS and actual SAI i highly recommend you to make this test after 2nd or 3rd warm boot, i probably expect that there will be errors on those buffer profiles not matching RID values since im some how not convinced this error would fix itself on vendor side please let me know how this test will go also test.rec can look like this:
and command can be like : ./saiplayer test.rec although i dont remember if 202012 branch contains inspect asic feature |
yeah, saiplayer is there in 202012. I ran
I also ran saiplayer on another device that is up from cold boot only. It showed similar errors with all objects, so I'm not sure if it can be used as an indicator if anything has gone wrong. The other thing is, if the workaround works for warm-reboot upgrade from 201811 to 202012, we may keep it in 202012 so additional warm-reboots will go through the same process. We also don't have plan to upgrade further beyond 202012 for these switches. If this is the case, do you still have the concern about the discovery? |
hmm, this code is not actually mine, but seems like there is some issue with translation, those values usually should match, i actually tested this on master on virtual switch and i don't get any failed matches, i checked 202012 version of inspect asic and it seems to be the same, but i think it's missing some translation, for example:
those 2 should be match but it's clearly that oid from REDIS is VID and from asic is RID, so there is translation step missing there, i will double check that once more and will prodice PR fixing that if necessary, not sure if should i make 202012 version too if we are not planning to move from 202012 to higher version then i guess we can keep workaround in place, but is there any official info on not planning to update 202012? since i work in this project too long to know that this can change any time XD |
Hey, there is issue with inspect assic RIDs are compared to VIDs, here is fix: you can try apply it and check again, you can still get some errors for pointers, since pointers will not match after warm boot, or at some point lists may not be equal, since some lists are allowed to be not in order, for example QOS information but this fix should be more than enough to find which attribures and objects ids report invalid OIDs for buffer profile |
@kcudnik Thanks for the patch for saiplayer! I applied it and did see some mismatches with warm-reboot upgrade 201811->202012, which were not seen with warm-reboot 202012->202012 after cold reboot. The mismatches are mainly on 3 objects on my setup:
I'll share more details and the output with you shortly. |
so i analyzed provided logs, and where we get and return not_implemented_1 it means that attribute at index 1 is not supported GET command, and in this case actually checking attributes is ignored, since code is doing GET for all attributes at once, so in no success operation, none of attributes are checked, this code should be updated to remove not supported attributes 1 by by 1 and check those where GET succeeded, and actually not implemented should be reported to brcm to fix this, since when we are doing GET on attribute X, that means this X attribute is in ASIC DB, and it was create or set by OA operations, so GET also should be supported on that operation. current logging is not enough so we don't even know which attribute is this. note here that getting attributes from ASIC_DB is in random order since this is a hash. from logs i see that not implemented attributes are on:
for those:
i;m guessing those values are changed because of counter counting those, but normally attributes create/set should not be changed only read_only attributes should be allowed to change this:
this is a bug definitely, there were 2 items on the list, there is 1 after
those i have no idea :( on warm boot:
those seems a bug also, returning not valid ethernet port name as empty string, and returning oid 0 on wred profile where it was set what is interesting, that none of those errors are actually reporting BUFFER_PROFILE OID issue, where i would expect that there will be 5 objects with different OID values, this is really interesting XD maybe they are not reported because you added code to ignore them on discovery ? but any way, even if you ignored them, then inspect ASIC should do a GET operation on all objects in ASCI_DB and they will return some of those invalid RIDs that's really interesting why we don't see it here hmmm i would suspect that those "dummy" ones are for example like on QUEUE and INGRESS_PRIORIRTY group which we see here:#1429 (comment) but those QUEUES and IPGS may be not actually configured by OA, i;m not sure if OA is configuring ALL visible QUEUES and IPGs on the device, in contrast discovery method id discovering all existing objects, so it could happen that let sya switch has 100 queues and OA configures only 80 of them, and then because of a bug 20 of not configured ones contains some bug and returns dome invalie buffer_profile ID, that would be mu suspicion |
SAI_STATUS_INVALID_ATTR_VALUE_MAX in terms of SAI codes is actually SAI_STATUS_ATTR_NOT_IMPLEMENTED_1 (because status codes are negative) |
hey, did brcm get back on this ? |
Hi @kcudnik sorry for the delayed response. I don't expect Broadcom to respond on this as the version in question is pretty dated. During the last few weeks we built a setup with traffic enabled to verify warm-reboot upgrade (201811->202012) with the workaround you helped us come up with. Some extensive tests were done and the results looked good. Last week we went ahead and did the same on multiple switches in production, and no issues were observed either. As we discussed earlier, shall I post the PR with this workaround? This is the diff (only applied to 202012):
|
Please move this work sound to workaround.cpp and use it from there |
Summary
We have a bunch of switches with old 201811-based image to be upgraded to our next available image based off 202012. The warm-reboot upgrade went through, but the next warm-reboot (same 202012-based image, no upgrade) would end up with orchagent crash. The cause of that is syncd trying to remove some dummy SAI objects created by the previous warm-reboot upgrade but failing with
SAI_STATUS_INVALID_PARAMETER
error.Details
1. With the 201811 image (before the warm-reboot upgrade), we have some QoS configs that create 5
SAI_OBJECT_TYPE_BUFFER_PROFILE
entries.2. After warm-reboot upgrade to 202012-based image, the 5 objects are still present (with different oid's), and 5 dummy objects are also created. This might be a result of BRCM SAI upgrade.
3. With one additional warm-reboot to the same 202012-based image, syncd detects discrepancy of
SAI_OBJECT_TYPE_BUFFER_PROFILE
between current view (10 objects) and temp view (5 objects).Then the attempt to remove the 5 dummy objects via
executeOperationsOnAsic
leads to SAI errors, eventually syncd stopped and orchagent exited.In this problem state, the ASIC_DB has 10 actual objects (including the 5 dummy ones) and 5 temp objects.
Next Steps
Can you please help us understand the potential issue here?
SAI_OBJECT_TYPE_BUFFER_PROFILE
objects created? This is different fromPORT_SERDES
objects, which were not there before upgrade but introduced as default objects after upgrade. Are these extra objects supposed to stay even after warm-reboot process is complete?remove()
API to remove those extra/dummy objects? If so, is it possibly a BRCM SAI bug that theremove()
operation fails?Thanks in advance for assistance.
The text was updated successfully, but these errors were encountered: