Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CA-388587: Fix filtering the xapi-clusterd db and log any errors #67

Merged
merged 5 commits into from
Feb 12, 2024

Conversation

bernhardkaindl
Copy link
Collaborator

@bernhardkaindl bernhardkaindl commented Feb 9, 2024

Fix filtering the XAPI Clustder DB as found by CA-388587 (found using testing by @ashwin9390)

  • Commit 1 and 1 setup unit testing:

    • Commit 1 enables enforced version of:
      1. Create a test and make it fail
      • Use Strict XFAIL only: An unexpected PASS is fail.
      1. Then fix the code.
      2. And remove the XFAIL marker at the same time (the strict XFAIL setting forces you to remove it)
  • The Fix 1 and Fix 2 commits reuse original patches by @edwintorok for this issue.

    • They fix the bugs and remove their respective XFAIL marker.
  • Fix 3 applies the lesson learnt:

    • to log errors during processing to the bug-tool logfile for analysis.

The tests that are created and fixed as described in the module docstring of the test case module.

Copy link

github-actions bot commented Feb 9, 2024

Coverage

Pytest Code coverage comment for Python 2.7
FileStmtsMissCoverMissing
xen-bugtool157727882%77–79, 84, 95, 100–101, 414, 523, 529, 552, 687, 705, 770–771, 785, 789, 804–805, 811–812, 817, 827, 836, 839–842, 848, 853–855, 858–859, 862–863, 866, 907–909, 1044–1058, 1060–1069, 1130, 1138, 1263, 1265, 1269–1270, 1293, 1311, 1314–1316, 1319–1323, 1326–1327, 1331–1333, 1339–1345, 1348–1352, 1354–1370, 1373–1381, 1384–1385, 1387, 1389–1390, 1392–1400, 1403, 1405–1408, 1410, 1412–1413, 1447, 1475, 1490, 1498–1499, 1562–1564, 1566–1571, 1574–1578, 1582–1583, 1585, 1602–1603, 1605, 1607–1608, 1610–1612, 1614–1621, 1623–1628, 1632–1633, 1635, 1637, 1639, 1675–1676, 1678, 1680–1682, 1685–1686, 1688, 1690–1692, 1695–1696, 1698, 1700–1703, 1706–1716, 1719–1723, 1732, 1745, 1755, 1759, 1794–1796, 1871, 1906, 1945, 2005, 2048, 2051–2052, 2090, 2097, 2103, 2105–2108, 2111–2121, 2212–2214, 2217–2219, 2250–2252, 2292–2296, 2308, 2326
integration
   __init__.py00100% 
   conftest.py180100% 
   namespace_container.py35197%85
   test_system_load.py180100% 
   test_xenserver_config.py150100% 
   utils.py711085%18, 31, 54–57, 61, 81, 84, 122
mocks/xen
   __init__.py00100% 
mocks/xen/lowlevel
   __init__.py00100% 
mocks/xen/lowlevel/xc
   __init__.py50100% 
unit
   __init__.py00100% 
   conftest.py501080%22, 51, 53–57, 59–61
   test_dir_list.py110100% 
   test_filter_xapi_clusterd_db.py780100% 
   test_fs_funcs.py100100% 
   test_load_plugins.py90100% 
   test_main.py119893%189, 271–273, 275–278
   test_output.py650100% 
   test_process_output.py110100% 
   test_snmp.py150100% 
   test_xapidb_filter.py210100% 
TOTAL212830785% 

Tests Skipped Failures Errors Time
33 1 💤 0 ❌ 0 🔥 2.421s ⏱️

Copy link

github-actions bot commented Feb 9, 2024

lindig
lindig previously approved these changes Feb 9, 2024
tests/unit/conftest.py Outdated Show resolved Hide resolved
@bernhardkaindl bernhardkaindl changed the title CA-388587: Fix capture of xapi-clusterd DB and log any errors CA-388587: Fix filtering the xapi-clusterd db and log any errors Feb 12, 2024
ashwin9390
ashwin9390 previously approved these changes Feb 12, 2024
Copy link
Collaborator

@ashwin9390 ashwin9390 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR has refactoring of the code ,descriptive documentation and additional test case .Overall code looks sufficient and clean.

bernhardkaindl and others added 4 commits February 12, 2024 15:12
The 'token' is an API token that is used to authenticate all API
requests to the xapi-clusterd daemon.

- Apply the fix originally developed and tested by Edwin.
- Update the tests:

  - Ensure that the fix works and keeps working across future changes.

Co-authored-by: Bernhard Kaindl <[email protected]>
Signed-off-by: Edwin Török <[email protected]>
… instead

Handle the situations where configs doesn't have "pems" keys:

Yangtze does not have REQ-403 and the REQ-403 API changed later to pass
the PEMs in files, so the clusterd DB only has their path names:

Ignore the pems keys if not there and sanitize the rest of the config.

- Applies the fix originally developed and tested by Edwin.
- Update the tests:

  - Ensure that the fix works and keeps working across future changes.

Co-authored-by: Bernhard Kaindl <[email protected]>
Signed-off-by: Edwin Török <[email protected]>
@bernhardkaindl
Copy link
Collaborator Author

@lindig, @ashwin9390 : I have only done clean-ups, streamlined the test fixtures and their use.

Maybe review (and approve) again?

try:
with open(XAPI_CLUSTERD, 'r') as f:
clusterd_data = json.load(f)

except Exception as e: # pylint: disable=broad-exception-caught
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why split the try/with here? AFAICT the end result is the same: we log the exception and return the empty string. And the traceback should tell us exactly which line raised the error, but maybe I'm missing something here, is there a reason to keep try/with split like this?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi, the reason was rather long-winded with at least two aspects:

The original reason for the try-split split was:
I had set .pylintrc max-try-statements=11 and this was exceeded by the added lines.

I've set this limit with an intention:

  • If it is exceeded, I want developers to think:
    • Should any error in all of this big try-statement really result in the same error-handling?

The max-size of the try-statements was the lever that I found for this, to make the code at least not regress further.

So I want to give devs (including myself) a heads-up, when their try-statements grow beyond a certain size, aka "out of leaps and bounds".

So I do not want to raise that limit, and I also like to avoid t just add a comment to disable the warning:

Firstly, it gives code a bad look when disabling warnings.
Secondly, I think that pylint implements this limit for a good reason (or I found):

max-try-statements=11 is what works for the status-report tool, and adding the additional code inside would have made this the largest try-statement in the status-report script.

Because these try-statements are combined with broad-except, (except Exception) they catch everything in the try-statement.

These large try-statements with broad-except have been a major pain point for me for finding Python3 bugs - the actual cause of Python3 bugs.

Errors in them have not been straight-forward to find because nothing tells you that there was even an error:

Usually they are even combined with output_data = "", and an empty file (in the happy case) is the only indication that something is wrong.

And this is also what was the case in this function, the broad Exception handler
threw away the KeyError that the expected pems was no longer in the JSON data.

So when testing the xapi-clusterdb, the only error indication was that the output file was empty.

So what I'd like to go for is that when such an Exception happens, I want a more targeted handling of the exception.

But it's just preference that I'd like to have a more specific Exception handling when the sources are completely different causes.

And the size was my only handle, I had. I set it so that it would trigger when the existing try-statements grow even larger.

Here is where I wanted to aim for:

    try:        
        with open(XAPI_CLUSTERD, 'r') as f:
            clusterd_data = json.load(f)
    except Exception as e:  # pylint: disable=broad-exception-caught
        return log_internal_error(e)

Any Exception raised here means either invalid JSON - so something that should be an alert that something is bad.

This Exception should actually go into a separate bugball file like "system-errors.out" or so.

In the other case any error its definitely a bug in bugtool itself that must be fixed.

Since the data has been red, just some fields are not there, it's a different kind of error for me.

that should automatically be directed to a developer (since that looks like an unexpected change in the JSON).

OTOH, a non-existing clusterd-db file is a normal case in the non-clustered case.
For the usual case where the file does not exist, we can just check if it is there.

But if the file is missing in the clusterd-db case, it would a problem that would prevent clustering from working.

So I did this with a look towards enabling more diagnostics later,
make it possible to direct the exception causes towards different channels.

As it has 3 approvals now, I'd merge and move forward. Smaller tweaks are of course always possible.

Copy link
Contributor

@edwintorok edwintorok Feb 13, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the real reason is to avoid broad try-excepts, can we do that here? There are only a limited number of exceptions that 'open' can raise, and KeyError is certainly not one we should catch, and I understand the desire to not mask these errors.

I think the reason for catching broad errors in status-report may be a good one though: if one of the status report gathering functions fail you don't want to give up on the entire script, there may still be valuable information you can gather from other files.

Perhaps a better approach (that could be addressed in a future PR) would be:

  • have some kind of top-level try/catch handler whenever we invoke a function to collect something for capability "X", and if that fails we log the exception somewhere (general xen-bugtool.out or the file that was supposed to contain our output, etc.), but keep going and capture other things. Then the broad try/except would only wrap a very small number of lines (i.e. 1 or 2, just invoke the gathering function for that capability)
  • you've already added a check on whether the file exists which avoids logging errors unnecessarily when they are not errors
  • status-report could keep a count of how many of these broad try/except it has caught and logged, and exit non-zero at the end, which should allow XenRT to detect this and fail (we should whether it'd still store the already created bugtool in that case, as that is still very useful in debugging what went wrong)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very good proposals! Yes, they should be implemented!

I created #72 to implement them and added my answers!

I only added you as an additional assignee to show that we should do this, if it adds an entry into your GitHub notifications that you like to avoid, of course just remove yourself.

Copy link

@lindig lindig left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It all looks reasonable but I'm not a good person to review this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants