MAINT: use Github based GPU instance #181

Merged
merged 15 commits into from
Jun 13, 2024

Contributor

@mmcky mmcky commented May 9, 2024

This PR makes use of the Tesla T4 instance now available on GitHub Actions as a beta instance.

  • uses GitHub Actions supplied instances
  • deploys from GitHub containers for our Docker environment to speed up builds

this migrates

  • ci
  • publish
  • cache

to run on GitHub actions.


netlify bot commented May 9, 2024

Deploy Preview for incomparable-parfait-2417f8 ready!

| Name | Link |
| --- | --- |
| 🔨 Latest commit | a672c97 |
| 🔍 Latest deploy log | https://app.netlify.com/sites/incomparable-parfait-2417f8/deploys/666a2c387abc8e0008e222c7 |
| 😎 Deploy Preview | https://deploy-preview-181--incomparable-parfait-2417f8.netlify.app |


github-actions bot commented May 9, 2024

@github-actions github-actions bot temporarily deployed to pull request May 9, 2024 05:03 Inactive
Contributor Author

mmcky commented May 9, 2024

  • re-enable build cache
  • remove docker dependency and test local builds using anaconda (simpler)

@github-actions github-actions bot temporarily deployed to pull request May 9, 2024 05:57 Inactive
Contributor Author

mmcky commented May 9, 2024

The results between EC2 (left) and GitHub Actions (right)

Screenshot 2024-05-09 at 4 00 54 PM

@jstac there is a really interesting mix of timing results here between the V100 on EC2 and the T4 on GitHub. Many times are lower with a few exceptions such as wealth_dynamics. I will try and understand the root causes.

Contributor Author

mmcky commented May 9, 2024

Just triggered a new publish so we are comparing like for like. (https://github.com/QuantEcon/lecture-jax/releases/tag/publish-2024may09)

Contributor

kp992 commented May 9, 2024

This is interesting @mmcky! Now since we are using GA's GPU for trial, shall we compare the costs that we were having through AWS and now on GA -- maybe need to figure out how are we going to compare the costs since I believe it will depend on the frequency of commit push?

Contributor Author

mmcky commented May 9, 2024

> This is interesting @mmcky! Now since we are using GA's GPU for trial, shall we compare the costs that we were having through AWS and now on GA -- maybe need to figure out how are we going to compare the costs since I believe it will depend on the frequency of commit push?

that's right @kp992 -- the pricing is:

| Service | Cost | Units |
| --- | --- | --- |
| EC2 (p2.xlarge) | $0.90 | per instance hour |
| GA (Ubuntu GPU 4-core) | $0.07 | per minute |

So if we have a 10 minute job then

| Service | Cost |
| --- | --- |
| EC2 (p2.xlarge) | $0.90 |
| GA (Ubuntu GPU 4-core) | $0.70 |

so the pricing really depends on the frequency of long runs vs short runs. Honestly, while the per-hour price on GA is a LOT higher ($0.07 × 60 = $4.20 versus $0.90), I think it will work out to be pretty similar.
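
A rough way to see where the two prices cross over is sketched below. It uses only the two rates quoted in this comment. The billing rules are assumptions, not confirmed pricing terms: EC2 is charged for each started hour in full (which is what the 10 minute example implies) and GA is rounded up to whole minutes. The function names are hypothetical.

```python
import math

EC2_PER_HOUR = 0.90   # p2.xlarge on-demand rate quoted in the comment
GA_PER_MINUTE = 0.07  # GA Ubuntu GPU 4-core rate quoted in the comment


def ga_cost(minutes: float) -> float:
    """GA cost, ASSUMING billing rounds up to whole minutes."""
    return math.ceil(minutes) * GA_PER_MINUTE


def ec2_cost(minutes: float) -> float:
    """EC2 cost, ASSUMING every started hour is billed in full."""
    return math.ceil(minutes / 60) * EC2_PER_HOUR


# Print both costs for a few job lengths (in minutes).
for minutes in (5, 10, 13, 30, 60):
    print(f"{minutes:>3} min  GA ${ga_cost(minutes):.2f}  EC2 ${ec2_cost(minutes):.2f}")
```

Under these assumptions the break-even is $0.90 / $0.07 ≈ 12.9 minutes: a job of 12 minutes or less is cheaper on GA, and a job of 13 to 60 minutes is cheaper on EC2.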

Contributor Author

mmcky commented May 9, 2024

@kp992 this is the like-for-like time comparisons now with the current live site.

Screenshot 2024-05-10 at 9 08 40 AM

still an interesting mix of performance differences.

Machine Details:

EC2:

| Name | GPUs | vCPUs | RAM (GiB) | Network Bandwidth | Price/Hour* | RI Price/Hour** |
| --- | --- | --- | --- | --- | --- | --- |
| p2.xlarge | 1 | 4 | 61 | High | $0.900 | $0.425 |

Github:

| CPU | GPU | GPU card | Memory (RAM) | GPU memory (VRAM) | Storage (SSD) | Operating system (OS) |
| --- | --- | --- | --- | --- | --- | --- |
| 4 | 1 | Tesla T4 | 28 GB | 16 GB | 176 GB | Ubuntu, Windows |

So it appears we are running on a machine with less RAM (28 GB versus 61 GiB), which is interesting.

Contributor Author

mmcky commented May 10, 2024

  • remove the docker container layer to see if that speeds up compute times

Contributor Author

mmcky commented May 10, 2024

Currently the kernel is dying when installing directly onto the VM provided by GitHub (rather than using our Docker container). It would be quicker and more efficient to get this route working.

@github-actions github-actions bot temporarily deployed to pull request May 10, 2024 02:50 Inactive
Contributor

kp992 commented May 13, 2024

Thanks @mmcky, are we moving forward with moving to Github Actions VM for all our repos using AWS?

Contributor Author

mmcky commented May 13, 2024

> Thanks @mmcky, are we moving forward with moving to Github Actions VM for all our repos using AWS?

I would like to if we can -- as that is less to maintain. But currently I am getting issues with kernels dying, which suggests that the jax install isn't working properly (without a container).
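
One way to narrow down a dying kernel, sketched here under assumptions, is to run the jax import in a child interpreter so that a crash there does not take the Jupyter kernel down with it. `probe` and `JAX_SNIPPET` are hypothetical names for illustration; `jax.default_backend()` and `jax.devices()` are existing JAX calls.

```python
import subprocess
import sys

# Snippet to run in the child process; it needs jax installed on the runner.
JAX_SNIPPET = "import jax; print(jax.default_backend()); print(jax.devices())"


def probe(snippet: str, timeout: float = 120.0) -> dict:
    """Run `snippet` in a fresh interpreter and report how it exited."""
    proc = subprocess.run(
        [sys.executable, "-c", snippet],
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    return {
        "returncode": proc.returncode,  # negative on POSIX: killed by a signal
        "stdout": proc.stdout.strip(),
        "stderr": proc.stderr.strip(),
    }
```

Calling `probe(JAX_SNIPPET)` on the VM and again inside the container would show whether the import itself fails (non-zero return code with a traceback in `stderr`) or the process is killed outright (negative return code on Linux).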

@github-actions github-actions bot temporarily deployed to pull request May 13, 2024 05:36 Inactive
@github-actions github-actions bot temporarily deployed to pull request May 13, 2024 06:10 Inactive
Contributor Author

mmcky commented May 21, 2024

The driver versions under docker are:

NVIDIA-SMI 535.54.03              Driver Version: 535.54.03    CUDA Version: 12.3    

and when using the native VM

 NVIDIA-SMI 535.54.03              Driver Version: 535.54.03    CUDA Version: 12.2  

so the CUDA version is likely causing the issue?
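
To make the comparison less error-prone than reading the two banners by eye, the header line can be parsed and diffed. This is a minimal sketch: the regular expression is an assumption based only on the two lines pasted above, and `parse_header` and `differing_fields` are hypothetical helper names.

```python
import re
from typing import NamedTuple

# Pattern ASSUMED from the two header lines quoted in this comment.
HEADER = re.compile(
    r"NVIDIA-SMI\s+(?P<smi>\S+)\s+"
    r"Driver Version:\s+(?P<driver>\S+)\s+"
    r"CUDA Version:\s+(?P<cuda>\S+)"
)


class SmiHeader(NamedTuple):
    smi: str
    driver: str
    cuda: str


def parse_header(text: str) -> SmiHeader:
    """Extract the three version fields from an nvidia-smi header line."""
    match = HEADER.search(text)
    if match is None:
        raise ValueError("no nvidia-smi header line found")
    return SmiHeader(**match.groupdict())


def differing_fields(a: SmiHeader, b: SmiHeader) -> list:
    """Return the names of the fields whose values differ."""
    return [name for name in SmiHeader._fields if getattr(a, name) != getattr(b, name)]


docker = parse_header("NVIDIA-SMI 535.54.03              Driver Version: 535.54.03    CUDA Version: 12.3    ")
native = parse_header(" NVIDIA-SMI 535.54.03              Driver Version: 535.54.03    CUDA Version: 12.2  ")
print(differing_fields(docker, native))  # ['cuda']
```

For the two lines above, only the CUDA version differs; the driver version is identical in both environments.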

Contributor Author

mmcky commented May 21, 2024

@kp992 any ideas on why the jupyter kernel is dying when running directly on the VM but the docker container is OK?

  • @mmcky can we host the docker container on github to speed up the compute?

Contributor

kp992 commented May 21, 2024

Thanks @mmcky, I will try to look into it. I will create a new PR on top of these commits so I can test and play around separately.

Contributor Author

mmcky commented Jun 10, 2024

@mmcky working on using github containers to store the docker container here
QuantEcon/lecture-python.docker#4

@mmcky mmcky added the in-work label Jun 11, 2024
Contributor Author

mmcky commented Jun 11, 2024

@kp992 the fetch from github containers is about 10min. That is pretty good right?

Contributor Author

mmcky commented Jun 11, 2024

@kp992 it looks like these instances may have CUDA=12.3 installed. Our Docker container is configured for CUDA=12.5, so there are a lot of ptxas warnings. We may need to adjust the Docker container to suit this context (or upgrade the CUDA drivers). I think a CUDA upgrade would require a reboot, but I am looking into it.
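
A small check for this kind of mismatch could look like the sketch below. It compares version strings numerically rather than as text, so that "12.10" sorts after "12.9". The helper names are hypothetical, and the only rule encoded is the one described in this comment: a container CUDA version newer than the one the instance reports is flagged.

```python
def parse_version(text: str) -> tuple:
    """Turn a dotted version such as '12.5' into a tuple of ints, e.g. (12, 5)."""
    return tuple(int(part) for part in text.strip().split("."))


def container_newer_than_host(container_cuda: str, host_cuda: str) -> bool:
    """True when the container's CUDA version is newer than the host's."""
    return parse_version(container_cuda) > parse_version(host_cuda)


print(container_newer_than_host("12.5", "12.3"))  # True
print(container_newer_than_host("12.2", "12.3"))  # False
```

With the versions reported here (container 12.5, instance 12.3) the check returns `True`, which matches the case where the warnings appeared.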

Contributor Author

mmcky commented Jun 11, 2024

@kp992 looks like the newer CUDA driver is working. Will post a speed comparison with the current live site once I get the preview.

@github-actions github-actions bot temporarily deployed to pull request June 11, 2024 06:02 Inactive
Contributor Author

mmcky commented Jun 11, 2024

@jstac and @kp992 here are the latest results moving our computations to the GitHub based GPU instance. LHS = current live site (built on EC2) and RHS = this PR (built on Github + using CUDA=12.5 driver). Many times are improved except for Wealth Dynamics (@kp992 would you mind reviewing this lecture to see why this might be?)

Screenshot 2024-06-11 at 4 10 39 PM

Contributor

jstac commented Jun 12, 2024

Thanks @mmcky , good to know.

Contributor

kp992 commented Jun 12, 2024

Thanks @mmcky, this looks great. I can look at the wealth dynamics timings difference.

Contributor Author

mmcky commented Jun 12, 2024

thanks @jstac and @kp992. I am doing one final round of review on this and then I will migrate to use github instances for this lecture series as well.

@github-actions github-actions bot temporarily deployed to pull request June 12, 2024 04:52 Inactive
Contributor Author

mmcky commented Jun 12, 2024

  • check this closely as the nvidia-smi is reporting the following and the docker container is using CUDA=12.5

NVIDIA-SMI 470.182.03 Driver Version: 470.182.03 CUDA Version: 12.3

AH HA! That page hasn't been re-executed, as the date is from the 6th of June. This will refresh in a full build.

Contributor Author

mmcky commented Jun 12, 2024

@kp992 I think this is ready. If you can cast your eye over it one more time then I'll merge.

@github-actions github-actions bot temporarily deployed to pull request June 12, 2024 23:59 Inactive
@mmcky mmcky added ready and removed in-work labels Jun 13, 2024
Contributor

@kp992 kp992 left a comment

Looks perfect! Thanks @mmcky

@mmcky mmcky merged commit c995915 into main Jun 13, 2024
6 checks passed
@mmcky mmcky deleted the maint-gpu-runner branch June 13, 2024 03:30
3 participants