CI: Add runner with Ubuntu on ARM64 #634

mmuetzel · 2025-01-22T16:37:47Z

GitHub started hosting runners with Ubuntu on ARM64 processors for open source projects for free:
https://github.blog/changelog/2025-01-16-linux-arm64-hosted-runners-now-available-for-free-in-public-repositories-public-preview/

Add one configuration that uses these runners to the build matrix.

According to their blog post, the arm64 runners are more efficient and potentially faster than the x86_64 runners. (But they are still in a preview phase, and it may take some time for GitHub to better scale and balance the load.) If that turns out to be true, it should be easy to switch more configurations in that workflow to the arm64 runners (and maybe keep only one or two running on x86_64).

mmuetzel · 2025-01-23T08:21:36Z

Oof. More tests are failing on that runner than I had hoped.
At least some of the failing tests (H1BasisEvaluation, SD_H1BasisEvaluation, pointload2) also fail on macOS (Apple Silicon). Maybe there are some assumptions somewhere that only hold for Intel/AMD processors?

I don't have physical access to ARM64 hardware, so I'm not sure whether I can help track any of that down.
Maybe valgrind or some sanitizers would be able to find something that is also odd on Intel/AMD?

juharu · 2025-01-23T08:28:05Z

Hi,

Thanks again for your work! Certainly there shouldn't be anything processor-specific here. I can also try running with valgrind. Have you checked it's not a stack size problem?

Br, Juha

juharu · 2025-01-23T08:32:05Z

valgrind runs are clean on my Ubuntu laptop ...

mmuetzel · 2025-01-23T08:57:22Z

> Have you checked it's not a stack size problem?

I might be wrong. But wouldn't that crash the program?
It looks like the numerics don't behave as expected. E.g., the last time steps before the test ConstantParamFunc failed:

  MAIN: 
  MAIN: -------------------------------------
  MAIN: Time: 34/50:   6.800E+00
  MAIN: -------------------------------------
  MAIN: 
  HeatSolver: Solving the energy equation for temperature
  HeatSolve: 
  HeatSolve: 
  HeatSolve: -------------------------------------
  HeatSolve:  TEMPERATURE ITERATION           1
  HeatSolve: -------------------------------------
  HeatSolve: 
  HeatSolve: Starting Assembly...
  HeatSolve: Assembly:
  HeatSolve: Assembly done
  ComputeChange: NS (ITER=1) (NRM,RELC): ( 0.16009565     0.62252546E-01 ) :: heat equation
  HeatSolve: iter:    1 Assembly: (s)    0.00    0.00
  HeatSolve: iter:    1 Solve:    (s)    0.00    0.00
  HeatSolve:  Result Norm   :   0.16009565005742657
  HeatSolve:  Relative Change :    6.2252546152800785E-002
  MAIN: 
  MAIN: -------------------------------------
  MAIN: Time: 35/50:   7.000E+00
  MAIN: -------------------------------------
  MAIN: 
  HeatSolver: Solving the energy equation for temperature
  HeatSolve: 
  HeatSolve: 
  HeatSolve: -------------------------------------
  HeatSolve:  TEMPERATURE ITERATION           1
  HeatSolve: -------------------------------------
  HeatSolve: 
  HeatSolve: Starting Assembly...
  HeatSolve: Assembly:
  HeatSolve: Assembly done
  ERROR:: IterSolve: Numerical Error: System diverged over maximum tolerance.

I don't know what could be causing that though...

juharu · 2025-01-23T09:00:48Z

Right, somewhat harder to figure out then, I guess, at least with nothing to test on...

juharu · 2025-01-23T09:54:08Z

The following tests FAILED:

All the following tests suddenly break the linear system iterator. All use the same BiCGStab scheme. I don't know if there is something wrong with it: maybe it is miscompiled, or maybe the numerical scheme is at fault (it's known to have some issues; the specifics escape me at the moment). The remedy might be to change the iterator scheme, e.g. BiCGStab -> BiCGStabL?

	126 - ConstantParamFunc (Failed)                        serial
	155 - ConvergenceControl (Failed)                       serial transient
	301 - HeatControlExplicit (Failed)                      control quick serial
	433 - OptimizeSimplexFourHeaters (Failed)               control serial
	434 - OptimizeSimplexFourHeatersInt (Failed)            control serial
	680 - TransientCostFourHeaters (Failed)                 serial
	801 - fsi_box (Failed)                                  elasticsolve fsi serial transient

I'll try to look at these, but mostly these might be inconsequential ...

	292 - H1BasisEvaluation (Failed)                        benchmark serial
	540 - SD_H1BasisEvaluation (Failed)                     benchmark serendipity serial

Slightly different result norm. Someone should have a look at whether it is OK anyway...

	920 - pointload2 (Failed)                               serial

juharu · 2025-01-23T10:20:32Z

I switched the iterator on the ConstParamFunc, ConvergenceControl, HeatControlExplicit, TransientCostFourHeater and fsi_box tests (in devel now). I left out the OptimizeSimplex- tests for the time being. The result norms of these tests seem to be somewhat sensitive to changes in:

1) Iterator scheme
2) Convergence tolerance
3) Preconditioning

Br, Juha


mmuetzel · 2025-01-23T10:28:16Z

> I switched the iterator on the ConstParamFunc, ConvergenceControl, HeatControlExplicit, TransientCostFourHeater and fsi_box tests (in devel now).

Thank you for looking into this.

I rebased this PR on top of your latest changes. Let's see if this will make a difference.

mmuetzel · 2025-01-23T11:22:14Z

That doesn't seem to have made much difference.
Still the following test failures:

The following tests FAILED:
	126 - ConstantParamFunc (Failed)                        serial
	155 - ConvergenceControl (Failed)                       serial transient
	292 - H1BasisEvaluation (Failed)                        benchmark serial
	301 - HeatControlExplicit (Failed)                      control quick serial
	433 - OptimizeSimplexFourHeaters (Failed)               control serial
	434 - OptimizeSimplexFourHeatersInt (Failed)            control serial
	540 - SD_H1BasisEvaluation (Failed)                     benchmark serendipity serial
	680 - TransientCostFourHeaters (Failed)                 serial
	801 - fsi_box (Failed)                                  elasticsolve fsi serial transient
	920 - pointload2 (Failed)                               serial
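
To compare two such failure summaries (for example, before and after the rebase), a small script can help. This is only a sketch: the names `parse_failed_tests` and `diff_failures` are made up for illustration, and it assumes the log contains a "The following tests FAILED:" block in the format shown above.

```python
import re

# One summary line looks like: "\t126 - ConstantParamFunc (Failed)    serial"
FAILED_LINE = re.compile(r"^\s*(\d+)\s+-\s+(\S+)\s+\(([^)]*)\)\s*(.*)$")


def parse_failed_tests(log_text):
    """Return {test name: list of labels} from a failure summary block."""
    failed = {}
    in_summary = False
    for line in log_text.splitlines():
        if "The following tests FAILED" in line:
            in_summary = True
            continue
        # Ignore everything before the summary and blank lines inside it.
        if not in_summary or not line.strip():
            continue
        match = FAILED_LINE.match(line)
        if match is None:
            break  # the first non-matching line ends the summary block
        _number, name, _status, labels = match.groups()
        failed[name] = labels.split()
    return failed


def diff_failures(before, after):
    """Return (fixed, new) sets of test names between two parsed summaries."""
    return set(before) - set(after), set(after) - set(before)
```

If both sets returned by `diff_failures` are empty, the change made no difference to which tests fail, which matches the observation above.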

juharu · 2025-01-23T11:31:34Z

OK, thanks, I'll have a look at the log.

juharu · 2025-01-23T12:03:02Z

So it seems to be something more complicated than the iterator failure. Can I somehow misuse the git "push" triggered "Action" to do debugging (without waiting for too long)?

mmuetzel · 2025-01-23T13:44:56Z

> Can I somehow misuse the git "push" triggered "Action" to do debugging (without waiting for too long)?

Debugging using only GitHub Actions is unfortunately pretty tedious. I'm not aware of any option to log into the runners and run commands manually.
I would be very interested if someone finds out how to do that.

The next "best" thing (still pretty bad), which I mostly resort to, is:

1. Fork the repository to my own user account.
2. Enable running actions on that fork.
3. Optionally, disable any workflows that I'm currently not interested in. (This can be done on the "three-dot-menu" after selecting the respective workflow on the "Actions" tab.)
4. Optionally, modify the "configure" step of the workflow file in that fork to disable some build options (e.g., the GUI) to reduce the build time.
5. Optionally, modify the "check" step of the workflow file in that fork to reduce the number of tests that are being run.
6. Optionally, modify the source files in that fork to output more intermediate results.
7. Then iterate until there is some useful information.

That is still quite tedious and time consuming. It would be much easier if I knew how to stop in a debugger or anything like that in a GitHub action.
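
For the step of reducing the number of tests, one option is to select tests by name. The sketch below assumes that the "check" step ultimately calls `ctest` (an assumption; the workflow file is not shown in this thread) and relies on CTest's `-R` option (regular expression on test names) and `--output-on-failure`. The helper name `ctest_command` is hypothetical, and test names are assumed to consist only of letters, digits and underscores.

```python
import re
import shlex


def ctest_command(test_names, extra_args=("--output-on-failure",)):
    """Build a ctest invocation that runs only the named tests."""
    if not test_names:
        raise ValueError("at least one test name is required")
    # Anchor the alternation so that, e.g., H1BasisEvaluation does not
    # also select SD_H1BasisEvaluation.
    pattern = "^(" + "|".join(re.escape(name) for name in test_names) + ")$"
    return ["ctest", "-R", pattern, *extra_args]


if __name__ == "__main__":
    cmd = ctest_command(["H1BasisEvaluation", "pointload2"])
    # Print a shell-quoted form that could be pasted into a workflow step.
    print(shlex.join(cmd))
```

Running only the handful of failing tests this way should shorten each iteration in the fork considerably, at the cost of not noticing regressions elsewhere.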

juharu · 2025-01-24T12:15:20Z

An observation about the linear system iterator failures on the ARM64 platform: it seems that the DNRM2() function used by all iterative methods fails at random intervals. I don't know the reason. I think this is linked in from the OpenBLAS library here? If I use my own norm function to replace it, everything starts to work. Other OpenBLAS functions don't seem to have problems.
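
One way to cross-check a suspicious DNRM2() result is to compare it against an independent 2-norm. The replacement norm function mentioned above is not shown in this thread, so the following is only an illustrative sketch (in Python rather than Elmer's Fortran; `scaled_norm2` and `norms_agree` are made-up names). It scales by the largest magnitude so that squaring the entries cannot overflow.

```python
import math


def scaled_norm2(x):
    """Euclidean norm of a sequence of floats, scaled to avoid overflow."""
    scale = max((abs(v) for v in x), default=0.0)
    # Nothing to scale for an empty/zero vector; inf and NaN pass through.
    if scale == 0.0 or math.isinf(scale) or math.isnan(scale):
        return scale
    return scale * math.sqrt(sum((v / scale) ** 2 for v in x))


def norms_agree(x, reference_norm, rel_tol=1e-12):
    """Return True if a norm computed elsewhere matches scaled_norm2(x)."""
    return math.isclose(reference_norm, scaled_norm2(x), rel_tol=rel_tol)
```

Logging the cases where `norms_agree` returns False for the value reported by the library routine would show whether the mismatch is really intermittent, and for which vectors it occurs.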

juharu · 2025-01-24T12:19:06Z

Tests ConstParamFunc, ConvergenceControl, HeatControlExplicit, TransientCostFourHeater, fsi_box, OptimizeSimplexFourHeaters, OptimizeSimplexFourHeatersInt all work out of the box after this change. This is something not seen on any other platform.

mmuetzel force-pushed the ci-ubuntu branch from f6914eb to 36dee09 · January 23, 2025 10:27
