Network connection error freezes the runtime #635

Open
bajtos opened this issue Dec 11, 2024 · 1 comment
Labels
bug 🐛 Something isn't working

Comments


bajtos commented Dec 11, 2024

In some cases, zinniad gets stuck. After Station Core detects that Spark is stuck and sends SIGTERM, it takes several minutes until zinniad responds to the signal.

Timeline of the case captured in the logs below:

  • At 2024-12-10T13:31:39Z, Spark enters a 60-second sleep.
  • After ~5 minutes, Station Core detects inactivity and sends SIGTERM to kill Spark.
  • At 2024-12-10T13:47:45Z, Spark sends an HTTP request to check the current round. The request fails with a "connection reset" error.
  • At 2024-12-10T13:47:45Z, Spark enters another 60-second sleep.
  • At that time, the Zinnia main loop ends, the process exits, and Station Core detects the exit (via signal).
  • At 2024-12-10T13:47:45Z, Station Core starts Spark/Zinnia again.
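
To put a number on the gap in the timeline above, here is a quick sketch that subtracts the two timestamps. The variable names are mine, and the timestamps are copied from the bullets above.

```javascript
// Timestamps copied from the timeline above.
const sleepStarted = new Date('2024-12-10T13:31:39Z')
const nextActivity = new Date('2024-12-10T13:47:45Z')

// Subtracting two Date objects yields the difference in milliseconds.
const gapSeconds = (nextActivity - sleepStarted) / 1000
const minutes = Math.floor(gapSeconds / 60)
const seconds = gapSeconds % 60

console.log(`${minutes} min ${seconds} s`) // 16 min 6 s
```

If I read the timestamps right, that is about 16 minutes between the start of the 60-second sleep and the next log line from Spark.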

Logs:

[2024-12-10T13:31:39Z INFO  module:spark/main] Measurement submitted (id: [redacted])
{"type":"jobs-completed","total":[redacted],"rewardsScheduledForAddress":"[redacted]"}
[2024-12-10T13:31:39Z INFO  module:spark/main] Sleeping for 60 seconds before starting the next task...
{"type":"activity:error","module":"Zinnia","message":"Spark has been inactive for 5 minutes, restarting..."}
{"type":"activity:error","module":"spark/main","message":"SPARK failed reporting retrieval"}
[2024-12-10T13:47:45Z INFO  module:spark/main] 
[2024-12-10T13:47:45Z INFO  module:spark/main] Checking the current SPARK round...
[2024-12-10T13:47:45Z ERROR module:spark/main] Error: error sending request for url (https://api.filspark.com/rounds/current): connection error: connection reset
        at async mainFetch (ext:deno_fetch/26_fetch.js:277:12)
        at async fetch (ext:deno_fetch/26_fetch.js:504:7)
        at async Tasker.#updateCurrentRound (file:///Users/redacted/Library/Caches/app.filstation.desktop/sources/spark/lib/tasker.js:50:15)
        at async Tasker.next (file:///Users/redacted/Library/Caches/app.filstation.desktop/sources/spark/lib/tasker.js:44:5)
        at async Spark.getRetrieval (file:///Users/redacted/Library/Caches/app.filstation.desktop/sources/spark/lib/spark.js:40:23)
        at async Spark.nextRetrieval (file:///Users/redacted/Library/Caches/app.filstation.desktop/sources/spark/lib/spark.js:189:23)
        at async Spark.run (file:///Users/redacted/Library/Caches/app.filstation.desktop/sources/spark/lib/spark.js:208:9)
        at async file:///Users/redacted/Library/Caches/app.filstation.desktop/sources/spark/main.js:4:1
[2024-12-10T13:47:45Z INFO  module:spark/main] Sleeping for 60 seconds before starting the next task...
{"type":"activity:error","module":"Zinnia","message":"Spark crashed via signal SIGTERM"}
Zinnia main loop ended
[2024-12-10T13:47:45Z INFO  zinniad] Starting zinniad with config CliArgs { wallet_address: "[redacted]", station_id: "[redacted]", state_root: "/Users/redacted/Library/Application Support/app.filstation.desktop/modules/zinnia", cache_root: "/Users/redacted/Library/Caches/app.filstation.desktop/modules/zinnia", files: ["spark/main.js"] }
[2024-12-10T13:47:45Z INFO  lassie] Starting Lassie Daemon
[2024-12-10T13:47:45Z INFO  lassie] Lassie Daemon is listening on port 54326
{"type":"activity:info","module":"spark","message":"Spark started"}
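
Note that the JSON activity lines in the logs above carry no timestamp, so it is not possible to tell exactly when the inactivity was detected or when SIGTERM was sent. A minimal sketch of how a timestamp could be attached to each activity line follows. `formatActivity` and the `timestamp` field are hypothetical names, not something Station Core has today.

```javascript
// Hypothetical helper: serialise an activity event with an ISO 8601 timestamp.
// `now` is injectable so the output is deterministic in tests.
function formatActivity (activity, now = new Date()) {
  // `timestamp` goes first so it lines up visually with the other log lines.
  return JSON.stringify({ timestamp: now.toISOString(), ...activity })
}

// Example with an activity copied from the logs above.
console.log(formatActivity({
  type: 'activity:info',
  module: 'spark',
  message: 'Spark started'
}))
```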

bajtos commented Dec 11, 2024

What we can do:

  • Add more logs to understand where exactly Spark spent those 5 minutes.
  • Add timestamps to the log lines that print activities.
  • Change Station Core to use SIGKILL instead of SIGTERM so that Zinnia is killed immediately. (Maybe use SIGKILL only when we detect that the process got stuck.) This is fine because Zinnia does not implement graceful shutdown yet.
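
A rough sketch of the SIGKILL idea, in the variant where SIGKILL is used only as a fallback. It assumes Station Core manages Zinnia as a Node.js `ChildProcess`, which is an assumption on my part. `stopWithEscalation` and `graceMs` are hypothetical names, not existing APIs.

```javascript
const { once } = require('node:events')

// Hypothetical helper (not an existing Station Core API): send SIGTERM first,
// then escalate to SIGKILL if the child has not exited within `graceMs`.
// `child` is assumed to be a ChildProcess returned by child_process.spawn().
async function stopWithEscalation (child, { graceMs = 5000 } = {}) {
  // Nothing to do if the process has already exited.
  if (child.exitCode !== null || child.signalCode !== null) {
    return { code: child.exitCode, signal: child.signalCode }
  }

  const exited = once(child, 'exit') // resolves with [code, signal]
  child.kill('SIGTERM')
  const timer = setTimeout(() => child.kill('SIGKILL'), graceMs)

  try {
    const [code, signal] = await exited
    return { code, signal }
  } finally {
    clearTimeout(timer) // do not send SIGKILL after the child has exited
  }
}
```

A very small `graceMs` approximates the "always SIGKILL" variant, while a larger value leaves room for a graceful shutdown once Zinnia implements one.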

@bajtos bajtos added the bug 🐛 Something isn't working label Dec 11, 2024