Skip to content

Commit

Permalink
Wait for app to start with torchx log (#763)
Browse files Browse the repository at this point in the history
Summary:
Pull Request resolved: #763

Right now it silently times out after waiting for 10s and then (depending on the scheduler) fails because the job hasn't actually started; I'm trying to use `--log` with the `run` command and inevitably run into this.

I wasn't sure if we want to explicitly timeout, or what the best ergonomics would be -- I can increase the timeout to 10 minutes and blow up if it doesn't start in that time, or change it to a while loop (as done here) -- wanted to check :).

Differential Revision: D48840617

fbshipit-source-id: 8c2326d07d4df8c69a15e3d747ba3b5162ab6bec
  • Loading branch information
kunalb authored and facebook-github-bot committed Aug 31, 2023
1 parent b9a1d0d commit a8d4d2a
Showing 1 changed file with 5 additions and 3 deletions.
8 changes: 5 additions & 3 deletions torchx/cli/cmd_log.py
Original file line number Diff line number Diff line change
Expand Up @@ -101,12 +101,14 @@ def get_logs(
if len(path) == 4:
replica_ids = [(role_name, int(id)) for id in path[3].split(",") if id]
else:
for i in range(10):
display_waiting = True
while True:
status = runner.status(app_handle)
if status and is_started(status.state):
break
if i == 0:
logger.info("Waiting for app to start before logging...")
elif display_waiting:
logger.info("Waiting for app to start before fetching logs...")
display_waiting = False

Check warning on line 111 in torchx/cli/cmd_log.py

View check run for this annotation

Codecov / codecov/patch

torchx/cli/cmd_log.py#L109-L111

Added lines #L109 - L111 were not covered by tests
time.sleep(1)

app = none_throws(runner.describe(app_handle))
Expand Down

0 comments on commit a8d4d2a

Please sign in to comment.