Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix blackouts feature error and add new unit tests #41

Open
wants to merge 6 commits into
base: main
Choose a base branch
from

Conversation

alhasacademy96
Copy link
Contributor

@alhasacademy96 alhasacademy96 commented Sep 12, 2024

PR Description

From AAI 3.1.3 upwards it seems the ability to use blackouts with timeLimit = 0 is broken.

Specific Errors Produced:

  • (Primary) Invalid Blackout interval: The case where episodeLength (timeLimit) == 0 results in an exception throw, specifically in Lights.cs here:
foreach (var blackout in _blackouts)
                {
                    if (blackout < 0 || blackout >= _episodeLength)
                        throw new ArgumentException("Blackout interval is invalid.");
                }

In the LightStatus method, the step is compared against _nextFrameSwitch * agentDecisionInterval. If agentDecisionInterval is 0 (which is possible if timeLimit is 0), it will result in a division by zero error. Furthermore, the error occurs because If timeLimit is set to 0 and blackouts are specified (can be any value), it creates a contradiction. The blackouts are meant to occur at specific steps within the episode, but an episode with a timeLimit of 0 has no steps to begin with.

The error propagates to a division by zero error further in the stack trace, as described below.

  • (Secondary) Division by zero:
KeyNotFoundException: Tried to load arena 0 but it did not exist
AAI3EnvironmentManager.GetConfiguration (System.Int32 arenaID) (at Assets/Scripts/AAI3EnvironmentManager.cs:334)
TrainingArena.ResetArena () (at Assets/Scripts/TrainingArena.cs:143)
TrainingAgent.OnEpisodeBegin () (at Assets/Scripts/TrainingAgent.cs:753)
Unity.MLAgents.Agent._AgentReset () (at Library/PackageCache/[email protected]/Runtime/Agent.cs:1341)
Unity.MLAgents.Academy.ForcedFullReset () (at Library/PackageCache/[email protected]/Runtime/Academy.cs:567)
Unity.MLAgents.Academy.EnvironmentStep () (at Library/PackageCache/[email protected]/Runtime/Academy.cs:581)
Unity.MLAgents.AcademyFixedUpdateStepper.FixedUpdate () (at Library/PackageCache/[email protected]/Runtime/Academy.cs:43)

DivideByZeroException: Attempted to divide by zero.
TrainingArena.SetNextArenaID () (at Assets/Scripts/TrainingArena.cs:210)
TrainingArena.ResetArena () (at Assets/Scripts/TrainingArena.cs:141)
TrainingAgent.OnEpisodeBegin () (at Assets/Scripts/TrainingAgent.cs:753)
Unity.MLAgents.Agent._AgentReset () (at Library/PackageCache/[email protected]/Runtime/Agent.cs:1341)
Unity.MLAgents.Academy.ForcedFullReset () (at Library/PackageCache/[email protected]/Runtime/Academy.cs:567)
Unity.MLAgents.Academy.EnvironmentStep () (at Library/PackageCache/[email protected]/Runtime/Academy.cs:581)
Unity.MLAgents.AcademyFixedUpdateStepper.FixedUpdate () (at Library/PackageCache/[email protected]/Runtime/Academy.cs:43)

Indirectly, the division by zero is caused by the code in the LightStatus method - the condition step == _nextFrameSwitch * agentDecisionInterval can lead to a division by zero if agentDecisionInterval is zero (which was the case). Unity is simply unable to load the first arena into the arena configurations list as a result of the primary error.

Proposed change(s)

This PR fixes the error, introduces more robust code, and more tests for Lights.cs script (blackouts feature).

The proposed PR code in the Lights.cs file handles the scenario where timeLimit is 0 and blackouts are used correctly. It allows for infinite blackout patterns using negative values in the blackout sequence, and the episodeLength (timeLimit) does not impact the blackout functionality.

It adds robust code to handle edge cases such as:

  • Episode timeLimit is <0 (cannot be negative, must be >=0).
  • Blackout sequence must be in increasing order (i.e. blackouts: [20, 25] and NOT blackouts: [20, 5]). This is because blackouts operates with intervals (from frame 20 to frame 25 in increasing direction). An opposite case would defeat the purpose of the implementation.
  • Blackout values >0
  • Blackout values <= episode length.

Useful links (Github issues, discussion threads, etc.)

#42

Types of change(s)

  • Bug fix
  • New feature
  • Code refactor
  • Breaking change
  • Documentation update
  • Other (please describe)

Checklist

  • Added tests that prove my fix is effective or that my feature works
    • [See full list of unit tests in comments section below]
  • Updated the changelog (if applicable)
  • Updated the documentation (animal-ai/docs/Arena-Environment-Guide/blackouts)
    • [Added more descriptive info on blackouts and limitations]
  • Updated the migration guide (if applicable)

Other comments (if any)

Tests:
InfiniteEnumerator_IncrementValueOnMoveNext
Tests if the InfiniteEnumerator correctly increments the value when MoveNext() is called.
InfiniteEnumerator_ResetToZero

Tests if the InfiniteEnumerator resets to zero when Reset() is called.
LightsSwitch_InitializeWithValidParameters

Tests if the LightsSwitch can be initialized with valid parameters without throwing an exception.
LightsSwitch_HandleInfiniteBlackouts

Tests if the LightsSwitch correctly handles infinite blackouts.
LightsSwitch_ResetBehavior

Tests if the LightsSwitch resets its state correctly when Reset() is called.
LightsSwitch_HandleEmptyBlackoutList

Tests if the LightsSwitch handles an empty blackout list correctly.
LightsSwitch_HandleDifferentAgentDecisionIntervals

Tests if the LightsSwitch handles different agent decision intervals correctly.
LightsSwitch_ThrowExceptionForInvalidBlackoutSequence

Tests if the LightsSwitch throws an exception for an invalid blackout sequence.
LightsSwitch_HandleMultipleBlackouts

Tests if the LightsSwitch handles multiple blackouts correctly.
LightsSwitch_HandleInfiniteBlackoutsWithNegativeNumber

Tests if the LightsSwitch handles infinite blackouts with a negative number correctly.
LightsSwitch_ThrowExceptionForNegativeEpisodeLength

Tests if the LightsSwitch throws an exception for a negative episode length.


Yaml config example with error reproducible:

!ArenaConfig
arenas:
  0: !Arena
    passMark: 2
    timeLimit: 0
    blackouts: [25, 60]
    items:
    - !Item
      name: Agent
      positions:
      - !Vector3 {x: 20, y: 1, z: 2}
      rotations: [0]
      frozenAgentDelays: [10]
    - !Item
      name: Wall
      positions:
      - !Vector3 {x: 20, y: 0, z: 10}
      - !Vector3 {x: 20, y: .5, z: 2.5}
      - !Vector3 {x: 30, y: 0, z: 30}
      - !Vector3 {x: 5, y: 0, z: 38}
      - !Vector3 {x: 35, y: 0, z: 38}
      - !Vector3 {x: 20, y: 0, z: 38}
      - !Vector3 {x: 20, y: 0, z: 39.9}
      rotations: [0,0,0,0,0,45,0]
      colors:
      - !RGB {r: 0, g: 0, b: 255}
      - !RGB {r: 0, g: 0, b: 255}
      - !RGB {r: -1, g: -1, b: -1} #color parameter
      - !RGB {r: 0, g: 0, b: 255}
      - !RGB {r: 0, g: 0, b: 255} 
      - !RGB {r: 222, g: 0, b: 0}
      - !RGB {r: 222, g: 0, b: 0}
      sizes:
      - !Vector3 {x: 2, y: .5, z: 19.5}
      - !Vector3 {x: 2, y: .5, z: 5}
      - !Vector3 {x: 10, y: 0.5, z: 1}
      - !Vector3 {x: 1, y: .5, z: 2}
      - !Vector3 {x: 1, y: .5, z: 2}
      - !Vector3 {x: 2, y: 2, z: 2}
      - !Vector3 {x: 10, y: 2, z: 0.1}

Screenshots (if any)

@alhasacademy96 alhasacademy96 added the bug fix A fix to a known bug label Sep 12, 2024
@alhasacademy96 alhasacademy96 marked this pull request as ready for review September 12, 2024 18:41
@alhasacademy96 alhasacademy96 linked an issue Sep 12, 2024 that may be closed by this pull request
Copy link
Contributor

@benaslater benaslater left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR seems a lot better than previous ones, thank you! I think for a future one it might benefit from having the bug fix commit really focused i.e. the only code that changes in that commit is code that needs to change to fix the bug, and any refactoring is done separately before / after. That would make it really simple for a reviewer to understand what you're changing, and think about how edge cases might affect the bug.

Sorry if it's just me, but I'm struggling to understand what was broken here:

the division by zero issue seems to be caused by the code in the LightStatus method - the condition step == _nextFrameSwitch * agentDecisionInterval

How does this condition cause a division by zero error, and how does the change you've made stop that from happening?

Assets/Scripts/Lights.cs Show resolved Hide resolved
Assets/Scripts/Lights.cs Show resolved Hide resolved
Assets/Scripts/Lights.cs Show resolved Hide resolved
Assets/Scripts/Lights.cs Show resolved Hide resolved
Assets/Scripts/Lights.cs Outdated Show resolved Hide resolved
Assets/Scripts/Lights.cs Show resolved Hide resolved
Assets/Scripts/Lights.cs Show resolved Hide resolved
Assets/Tests/EditMode/LightSwitchTests.cs Show resolved Hide resolved
Assets/Tests/EditMode/LightSwitchTests.cs Outdated Show resolved Hide resolved
@alhasacademy96
Copy link
Contributor Author

This PR seems a lot better than previous ones, thank you! I think for a future one it might benefit from having the bug fix commit really focused i.e. the only code that changes in that commit is code that needs to change to fix the bug, and any refactoring is done separately before / after. That would make it really simple for a reviewer to understand what you're changing, and think about how edge cases might affect the bug.

Sorry if it's just me, but I'm struggling to understand what was broken here:

the division by zero issue seems to be caused by the code in the LightStatus method - the condition step == _nextFrameSwitch * agentDecisionInterval

How does this condition cause a division by zero error, and how does the change you've made stop that from happening?

When episodeLength == 0, it’s impossible to have valid blackout values in the list because any blackout value (negative or non-negative) causes the validation to fail. This is due to the validation logic that checks if each blackout value is less than 0 or greater than or equal to episodeLength. With episodeLength == 0, any blackout value will fail this validation because it cannot be less than 0 and is not less than 0.

As a result, blackouts didd not work when episodeLength (timeLimit) is zero. This issue was caused by a miscalculation in the validation logic. The updated code now correctly handles this case by ensuring that episodeLength is greater than zero. This prevents configurations where blackouts cannot function due to an episode length of zero. By validating that episodeLength is greater than zero, the code re-enables the functionality of blackouts and properly validates the new blackouts list.

This is the most probable cause in my experience. Note that the logic is still sound. It's just when timeLimit ==0, this bug is apparent.

- removed unnecessary exception handling.
- removed explicit call to console.write to avoid unnecessary console logs
The method is now more streamlined and clearer.
@benaslater
Copy link
Contributor

Sorry, I'm still not sure I get it. What's being divided by zero?

When episodeLength == 0, it’s impossible to have valid blackout values in the list because any blackout value (negative or non-negative) causes the validation to fail.

I thought the error was a zero division error. Is there a validation failure and then a zero division error?

The updated code now correctly handles this case by ensuring that episodeLength is greater than zero.

How does this square with setting timeLimit to be zero for an infinite length episode - does that work here, or does this change block those kinds of configurations?

@alhasacademy96
Copy link
Contributor Author

alhasacademy96 commented Sep 17, 2024

I think it's understandable to be confused as I think I wasn't too clear nor comprehensive on what was going on. However, for me it's clear what the problem is and how it's been addressed. Let's get you up to scratch.

With the YAML provided above, let's go step by step (this is using the current main branch code) to reproduce and understand the problem and the fix.

  1. When we run the config we get this error:
Screenshot 2024-09-17 at 15 27 32

"An error occurred while loading or processing the YAML file: Blackout interval is invalid."

This is the result of the division by zero error just below the console log:
Screenshot 2024-09-17 at 15 29 56

The stack trace:

"DivideByZeroException: Attempted to divide by zero.
TrainingArena.SetNextArenaID () (at Assets/Scripts/TrainingArena.cs:210)
TrainingArena.ResetArena () (at Assets/Scripts/TrainingArena.cs:141)
TrainingAgent.OnEpisodeBegin () (at Assets/Scripts/TrainingAgent.cs:753)
Unity.MLAgents.Agent._AgentReset () (at Library/PackageCache/[email protected]/Runtime/Agent.cs:1341)
Unity.MLAgents.Academy.ForcedFullReset () (at Library/PackageCache/[email protected]/Runtime/Academy.cs:567)
Unity.MLAgents.Academy.EnvironmentStep () (at Library/PackageCache/[email protected]/Runtime/Academy.cs:581)
Unity.MLAgents.AcademyFixedUpdateStepper.FixedUpdate () (at Library/PackageCache/[email protected]/Runtime/Academy.cs:43)

"

Now, we know that there is indeed a division by zero error in the trace. It's being called in trainingarena.cs because the arena checks for a blackout explicitly in this script (line 42; private bool _lightStatus = true;) , this is why we're getting the error here and not in lights.cs. It's also becuase were not validating in lights.cs, which then would have shown the trace to lights.cs and not trainingarena.cs.

As I've explained in the PR description what the error was, I'll put it here for completeness:

Specific Error produced:

  • Division by zero: In the LightStatus method, the step is compared against _nextFrameSwitch * agentDecisionInterval. If agentDecisionInterval is 0 (which is possible if timeLimit is 0), it will result in a division by zero error. Furthermore, the error occurs because If timeLimit is set to 0 and blackouts are specified (can be any value), it creates a contradiction. The blackouts are meant to occur at specific steps within the episode, but an episode with a timeLimit of 0 has no steps to begin with.
    Specifically, the division by zero issue seems to be caused by the code in the LightStatus method - the condition step == _nextFrameSwitch * agentDecisionInterval can lead to a division by zero if agentDecisionInterval is zero (which was the case).

The PR mentions the specific error produced as we've reproduced. Now, the missing link here is the other part of why I think this was happening, specifically this point:

Sorry, I'm still not sure I get it. What's being divided by zero?

When episodeLength == 0, it’s impossible to have valid blackout values in the list because any blackout value (negative or non-negative) causes the validation to fail.

I thought the error was a zero division error. Is there a validation failure and then a zero division error?

The updated code now correctly handles this case by ensuring that episodeLength is greater than zero.

How does this square with setting timeLimit to be zero for an infinite length episode - does that work here, or does this change block those kinds of configurations?

The problem is that AAI doesnt know what to do with episodeLength == 0 and blackouts is initialised (as a list). This is the reason why episodeLength == 0 results in the blackouts feature not working properly FOR THIS use case, explained above like so:

When episodeLength == 0, it’s impossible to have valid blackout values in the list because any blackout value (negative or non-negative) causes the validation to fail. This is due to the validation logic that checks if each blackout value is less than 0 or greater than or equal to episodeLength. With episodeLength == 0, any blackout value will fail this validation because it cannot be less than 0 and is not less than 0.

Blackouts for all other timeLimit value works fine. Remember as this was also the main reason why this PR was raised. Furthermore, I myself wasn't aware of this functionality - as the code did not have logic to handle the case of timeLimit == 0 with blackouts intervals (which was removed in earlier modifications of the script as a mistake or to fix the previous bug).

I hope I shed some new light into this problem and the fix. I don't know if there is a need to simplify it further. I think the main thing to extract from this PR is that the lights.cs script and the functionality is more robust as a result and the original ability to have timeLimit ==0 and the blackouts working for this value works as intended now.

Edit: To clarify, the episode will go infinite if timeLimit ==0 for any arena, but this isnt coupled with blackouts nor does it have anything to do with it other than the case then InfiniteEnumerator() will produce infinite blackout intervals.

@benaslater
Copy link
Contributor

This is the result of the division by zero error just below the console log:

The "Blackout interval is invalid" error is not a result of the divide by zero error; it is caused by the explicit throw in Lights.cs here.

The zero division error happens in the second after "Blackout interval is invalid" - I think Unity is trying to handle the error we're throwing in an unhelpful way (The "Couldn't connect ... will perform inference instead" logline at the end of your first screenshot is interesting in that respect). So I feel that although these zero division errors are something that it would be worth looking into, I think they're a red herring and not actually related to the bug we're trying to fix here.

--

Would it be right to say that this commit introduced extra validation on blackout values, that didn't take into account the possibility of _episodeLength being zero, and so caused validation to fail on inputs where timeLimit is 0 and blackouts are included?

If so, could this PR have been an update of line 59 of Lights.cs to something like this? (With appropriate unit test changes)

if (blackout < 0)
    throw new ArgumentException("Blackout interval timestamps cannot be negative.");
// Don't check upper bound on blackout for infinite episodes (_episodeLength == 0)
if (_episodeLength > 0 && blackout >= _episodeLength)
    throw new ArgumentException("Blackout interval timesteps cannot be after episode end");

@alhasacademy96
Copy link
Contributor Author

  1. The log "Couldn't connect to trainer on port 5004 using API version 1.5.0. Will perform inference instead." has nothing to do with this and is showing up as there are no active trainers running - were not training agents (python).
UnityEngine.Debug:Log (object)
Unity.MLAgents.Academy:InitializeEnvironment () (at Library/PackageCache/[email protected]/Runtime/Academy.cs:478)
Unity.MLAgents.Academy:LazyInitialize () (at Library/PackageCache/[email protected]/Runtime/Academy.cs:292)
Unity.MLAgents.Academy:.ctor () (at Library/PackageCache/[email protected]/Runtime/Academy.cs:261)
Unity.MLAgents.Academy/<>c:<.cctor>b__86_0 () (at Library/PackageCache/[email protected]/Runtime/Academy.cs:117)
System.Lazy`1<Unity.MLAgents.Academy>:get_Value ()
Unity.MLAgents.Academy:get_Instance () (at Library/PackageCache/[email protected]/Runtime/Academy.cs:132)
Unity.MLAgents.Agent:LazyInitialize () (at Library/PackageCache/[email protected]/Runtime/Agent.cs:485)
Unity.MLAgents.Agent:OnEnable () (at Library/PackageCache/[email protected]/Runtime/Agent.cs:399)
AAI3EnvironmentManager:Awake () (at Assets/Scripts/AAI3EnvironmentManager.cs:150)

Unity will throw a generic error like division by zero when arena 0 isnt added to the arena configurations (the error makes sense as you can't divide by 0, but it's generic in the sense that it's not helpful much).

  1. I seem to have missed the first error trace as you've highlighted, which is the direct cause of the division by zero error because the arena is simply not loaded to the arena configurations as arena 0. It's always good to have a second pair of eyes!

The root cause of this error is two fold:

  • With a yaml like above, with timeLimit == 0 will result in the error because the arena will fail to load. I think we partially focused on the wrong error here.
  • The lights.cs script was skipping through the validation of the blackouts list when episodeLength (timeLimit) == 0, meaning the interval of the blackouts list is longer than the episode length, which can't be the case.

In this code, were checking if each value in the list is more or equal to timeLimit:

foreach (var blackout in _blackouts)
{
    if (blackout < 0 || blackout >= _episodeLength)
        throw new ArgumentException("Blackout interval is invalid.");
}

This will result in an error because we're not accepting episodeLength as a value of 0.

"Would it be right to say that this commit introduced extra validation on blackout values, that didn't take into account the possibility of _episodeLength being zero, and so caused validation to fail on inputs where timeLimit is 0 and blackouts are included?"

The proposed changes already account for cases to handle such as this so and the tests are updated already in my previous commits.

The original tests didnt account for episodeLenth == 0 as I didn't know this was an acceptable value due to the lack of the feature being used and no documentation, which is why the bug propagated this far. This has been resolved in code, hand-in-hand with better tests.

I hope this makes sense now.

@benaslater
Copy link
Contributor

Okay, thank you I think I've gotten my head around it now! Does this change still keep the validation in the code block in your most recent comment when the timeLimit isn't set to zero (i.e. throw if blackout >= _episodeLength)? I think this relates to our discussion here.

As a side note, I wonder if we should really only throw if blackout > _episodeLength, since it might be a reasonable use to have a blackout going to the end of the episode.

Only throw if blackout > _episodeLength, as blackout == _episodeLength is acceptable.
---

Edge case would be if: blackouts: [10, 100] -> meaning exactly equal to timeLimit (assuming timeLimit == 100)

An invalid case would be:
(blackouts: [10, 101] -> Invalid, as the blackout exceeds the timeLimit of 100.)
@alhasacademy96
Copy link
Contributor Author

The new check throws an exception only if the blackout sequence has a value greater than _episodeLength, allowing a blackout to extend up to and including the end of the episode (blackout == _episodeLength). See latest commit.

@benaslater
Copy link
Contributor

Okay, thank you for chatting through with me. Assuming this tests alright I'm happy with this one!

@benaslater benaslater self-requested a review September 19, 2024 12:47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug fix A fix to a known bug
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Blackouts - division by zero error when using timeLimit = 0
2 participants