Network connection error freezes the runtime #635

bajtos · 2024-12-11T10:37:15Z

In some cases, zinniad gets stuck and takes several minutes to respond to the SIGTERM signal that Station Core sends once it detects that Spark is stuck.

Timeline of the case captured in the logs below:

At 2024-12-10T13:31:39Z, Spark enters a 60-second sleep.
After ~5 minutes, Station Core detects inactivity and kills Spark.
At 2024-12-10T13:47:45Z, Spark sends an HTTP request to check the current round. The request fails with a "connection reset" error.
At 2024-12-10T13:47:45Z, Spark enters another 60-second sleep.
At that point, the Zinnia main loop ends, the process exits, and Station Core detects the exit (via signal).
At 2024-12-10T13:47:45Z, Station Core starts Spark/Zinnia again.

Logs:

[2024-12-10T13:31:39Z INFO  module:spark/main] Measurement submitted (id: [redacted])
{"type":"jobs-completed","total":[redacted],"rewardsScheduledForAddress":"[redacted]"}
[2024-12-10T13:31:39Z INFO  module:spark/main] Sleeping for 60 seconds before starting the next task...
{"type":"activity:error","module":"Zinnia","message":"Spark has been inactive for 5 minutes, restarting..."}
{"type":"activity:error","module":"spark/main","message":"SPARK failed reporting retrieval"}
[2024-12-10T13:47:45Z INFO  module:spark/main] 
[2024-12-10T13:47:45Z INFO  module:spark/main] Checking the current SPARK round...
[2024-12-10T13:47:45Z ERROR module:spark/main] Error: error sending request for url (https://api.filspark.com/rounds/current): connection error: connection reset
        at async mainFetch (ext:deno_fetch/26_fetch.js:277:12)
        at async fetch (ext:deno_fetch/26_fetch.js:504:7)
        at async Tasker.#updateCurrentRound (file:///Users/redacted/Library/Caches/app.filstation.desktop/sources/spark/lib/tasker.js:50:15)
        at async Tasker.next (file:///Users/redacted/Library/Caches/app.filstation.desktop/sources/spark/lib/tasker.js:44:5)
        at async Spark.getRetrieval (file:///Users/redacted/Library/Caches/app.filstation.desktop/sources/spark/lib/spark.js:40:23)
        at async Spark.nextRetrieval (file:///Users/redacted/Library/Caches/app.filstation.desktop/sources/spark/lib/spark.js:189:23)
        at async Spark.run (file:///Users/redacted/Library/Caches/app.filstation.desktop/sources/spark/lib/spark.js:208:9)
        at async file:///Users/redacted/Library/Caches/app.filstation.desktop/sources/spark/main.js:4:1
[2024-12-10T13:47:45Z INFO  module:spark/main] Sleeping for 60 seconds before starting the next task...
{"type":"activity:error","module":"Zinnia","message":"Spark crashed via signal SIGTERM"}
Zinnia main loop ended
[2024-12-10T13:47:45Z INFO  zinniad] Starting zinniad with config CliArgs { wallet_address: "[redacted]", station_id: "[redacted]", state_root: "/Users/redacted/Library/Application Support/app.filstation.desktop/modules/zinnia", cache_root: "/Users/redacted/Library/Caches/app.filstation.desktop/modules/zinnia", files: ["spark/main.js"] }
[2024-12-10T13:47:45Z INFO  lassie] Starting Lassie Daemon
[2024-12-10T13:47:45Z INFO  lassie] Lassie Daemon is listening on port 54326
{"type":"activity:info","module":"spark","message":"Spark started"}


bajtos · 2024-12-11T10:40:05Z

What we can do:

Add more logs to understand where exactly Spark spent those 5 minutes.
Add timestamps to the log lines that print activities.
Change Station Core to send SIGKILL instead of SIGTERM so that Zinnia is killed immediately, the hard way. (Maybe use SIGKILL only when we detect that the process got stuck.) This is fine because Zinnia has not yet implemented a graceful shutdown.
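
A rough sketch of what the last two ideas could look like on the Station Core side. This is only an illustration under assumptions: `formatActivity` and `stopModule` are hypothetical names, not existing Station Core functions, and `child` stands for any object shaped like a Node.js `ChildProcess` (something with a `kill(signal)` method).

```javascript
// Hypothetical helper (not an existing Station Core function):
// serialize an activity event with an ISO 8601 timestamp so that
// activity lines can be correlated with the module's own log lines.
function formatActivity (activity, now = new Date()) {
  // `timestamp` comes first for readability; a `timestamp` field already
  // present on `activity` would override it because of the spread order.
  return JSON.stringify({ timestamp: now.toISOString(), ...activity })
}

// Hypothetical helper (not an existing Station Core function):
// pick the signal based on why the module is being stopped.
// SIGKILL cannot be caught or ignored, so it also works when the
// runtime is stuck and does not react to SIGTERM.
function stopModule (child, { stuck = false } = {}) {
  const signal = stuck ? 'SIGKILL' : 'SIGTERM'
  child.kill(signal)
  return signal
}
```

With a real child created by `child_process.spawn`, the inactivity check would call `stopModule(child, { stuck: true })`, while a regular shutdown would keep using the default SIGTERM.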

bajtos added the bug 🐛 Something isn't working label Dec 11, 2024
