fluke download failures #213

Open
laraPPr opened this issue Sep 8, 2023 · 3 comments
@laraPPr
Collaborator

laraPPr commented Sep 8, 2023

Sometimes a build fails because EasyBuild crashes after it fails to download a file.

This happened a lot in the TensorFlow PR EESSI/software-layer#321:

  • 7194: failed to download libpng-1.6.37.tar.gz
  • 7200, 7199,...: there was also a problem with GitHub a few days ago, which caused a lot of build failures

It might be an option to implement a pre-build fetch phase, so that all the builds have the necessary sources.

Yes, that should probably become a feature in the bot. We can look into adding support for a pre-build fetch phase, for example, so we can instruct the bot to first try to fetch all sources before letting it submit build jobs.

Ideally the bot would then automatically submit a single fetch job first when it gets the instruction to build, but that may be a bit harder (since the fetch part could take a while, and we don't want to block the bot until the fetch is done).
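
One way to avoid blocking the bot while the fetch runs is to let the job scheduler enforce the ordering. The sketch below assumes the jobs are submitted through Slurm's `sbatch` (the thread does not say which scheduler is used), and the function names and script paths are hypothetical. It submits one fetch job, then submits every build job with a dependency on it, and returns immediately.

```python
import subprocess


def submit(cmd, run=subprocess.run):
    """Submit a job and return its id (sbatch --parsable prints 'id' or 'id;cluster')."""
    result = run(cmd, capture_output=True, text=True, check=True)
    return result.stdout.strip().split(";")[0]


def submit_fetch_then_builds(fetch_script, build_scripts, run=subprocess.run):
    """Submit one fetch job, then build jobs that only start if the fetch succeeded.

    Nothing here waits for the fetch to finish: the scheduler holds the
    build jobs until the fetch job has completed successfully.
    """
    fetch_id = submit(["sbatch", "--parsable", fetch_script], run)
    build_ids = [
        submit(["sbatch", "--parsable", f"--dependency=afterok:{fetch_id}", script], run)
        for script in build_scripts
    ]
    return fetch_id, build_ids
```

If the fetch job fails, the dependent build jobs never start, so no build resources are spent on a run that would fail on a missing source anyway.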
@boegel
Contributor

boegel commented Sep 9, 2023

@trz42 Thoughts on this?
This could help significantly to avoid fluke build failures I think...

The bot could "prefetch" sources to a directory before launching any build jobs, so they don't need to download anything anymore.

@trz42
Contributor

trz42 commented Sep 17, 2023

I understand the issue, but I am a bit unsure what a good solution to this would be.

  1. If EasyBuild fails to download something, maybe it should have a means to retry? Like we added retry capability to some bot functions with an increasing delay between attempts.
  2. I don't see how this could be easily implemented at the bot-side. The bot does not know much about what it actually processes (PR for some repository that provides a bot/build.sh).
  3. If it did some kind of pre-processing, the whole job management would become significantly more complex than it is right now.
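
For reference, the retry idea from point 1 can be sketched roughly as follows. This is a generic illustration, not the bot's actual retry code and not an EasyBuild feature; the function name and parameters are made up for the example.

```python
import time


def retry(action, attempts=3, initial_delay=1.0, backoff=2.0, sleep=time.sleep):
    """Call action() until it succeeds or the attempts run out.

    The wait between attempts is multiplied by `backoff` after each failure.
    """
    delay = initial_delay
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except Exception:
            if attempt == attempts:
                raise  # out of attempts: let the last error propagate
            sleep(delay)
            delay *= backoff
```

With the defaults, a download that fails twice and then succeeds would wait 1 second and then 2 seconds before the third attempt.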

If the main issue is that a job sometimes fails late just because some source cannot be downloaded, a relatively easy way around that might be to improve the handling of sources in bot/build.sh or even EESSI-pilot-install-software.sh by adding a fetch phase before the actual build phase. That fetching could be repeated a couple of times. Even if it never succeeds, it might consume fewer resources because the failure is raised before the actual building begins.
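
A minimal sketch of that "fetch first, fail early" flow, written in Python for illustration only (the real scripts are shell scripts). The function name and the commands passed in are placeholders; which EasyBuild invocation would do the fetching is left open here.

```python
import subprocess
import time


def fetch_then_build(fetch_cmd, build_cmd, attempts=3, delay=1.0, run=subprocess.run):
    """Run fetch_cmd up to `attempts` times; run build_cmd only after a fetch succeeds.

    Returns the build's exit code, or 1 if every fetch attempt failed.
    """
    for attempt in range(1, attempts + 1):
        if run(fetch_cmd).returncode == 0:
            break  # sources are in place, move on to the build
        if attempt < attempts:
            time.sleep(delay * attempt)  # wait a little longer each time
    else:
        # No attempt succeeded: stop before any build resources are used.
        return 1
    return run(build_cmd).returncode
```

The point of the split is that a persistent download problem shows up within the first few minutes of the job instead of hours into a build.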

Note, I think because of 2 & 3 above, this is not easy to implement. Adjusted the difficulty label accordingly.

@boegel
Contributor

boegel commented Sep 21, 2023

The shared_fs_path configuration setting implemented in #214 largely fixes this by making it possible to use a shared directory for $EASYBUILD_SOURCEPATH; see also EESSI/software-layer#337.
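
To illustrate the idea (this is not the code from #214 or EESSI/software-layer#337), a job environment could be derived from the setting roughly like this. The `[buildenv]` section name, the `sources` subdirectory, and the function name are assumptions made for the sketch; only `shared_fs_path` and `$EASYBUILD_SOURCEPATH` come from the comment above.

```python
import configparser
import os


def build_job_env(cfg_text, base_env=None):
    """Return a job environment with EASYBUILD_SOURCEPATH on the shared filesystem."""
    cfg = configparser.ConfigParser()
    cfg.read_string(cfg_text)
    env = dict(os.environ if base_env is None else base_env)
    # Assumed layout: [buildenv] section, sources kept in a "sources" subdirectory.
    shared = cfg.get("buildenv", "shared_fs_path", fallback=None)
    if shared:
        env["EASYBUILD_SOURCEPATH"] = os.path.join(shared, "sources")
    return env
```

Because every build job then reads from and writes to the same source directory, a tarball downloaded once by any job is available to all later jobs, so a temporary download outage only hurts sources that nobody has fetched yet.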
