Right now the bot can take in Python and CUDA programs, run them, and post stdout. Instead, we'd like the runtime and ncu outputs of a solution to be saved in an artifact so we can render a leaderboard and have people compete on producing the fastest kernel for a given target.
There are 2 goals:
Produce a large set of useful kernels that can be used to train an LLM to produce better kernels
Create a fun feedback loop where gpu mode folks can go from watching a lecture to writing their first performant kernel
GPU type: a softmax kernel tuned for an H100 won't look like one tuned for a T4
Dtype: fp8, fp16
Input shape
One benefit of having a community, per Jordan Jurafsky, is that people can tell us which kernels they find interesting; the "fastest softmax in the west" could be an interesting dimension for a first practice round.
Table schema
We'd likely need a few tables where new kernels can be added; some likely columns are:
Problem table: reference code or a UUID for the specific problem setting. The idea is that companies and individuals can submit "interesting kernels" they want people to compete on
Submission information: the code for the submission, the Discord username of the person who submitted it, and the time of submission
UUID to run information, which would include stdout and ncu outputs
We also need some versioning for our benchmarking setup
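As a rough starting point, the three tables might look like the following sqlite sketch. All table and column names here are assumptions, not a locked-down schema; the benchmark_version column on runs is one way to carry the versioning mentioned above.

```python
import sqlite3

# Illustrative schema only: names and types are assumptions, not a final design.
SCHEMA = """
CREATE TABLE problems (
    problem_id     TEXT PRIMARY KEY,  -- UUID for the specific problem setting
    name           TEXT NOT NULL,
    reference_code TEXT NOT NULL,     -- reference implementation to check correctness against
    created_by     TEXT NOT NULL      -- Discord name of the problem author
);
CREATE TABLE submissions (
    submission_id    TEXT PRIMARY KEY,
    problem_id       TEXT NOT NULL REFERENCES problems(problem_id),
    code             TEXT NOT NULL,   -- the submitted kernel
    discord_username TEXT NOT NULL,
    submitted_at     TEXT NOT NULL    -- time of submission
);
CREATE TABLE runs (
    run_id            TEXT PRIMARY KEY,  -- UUID to run information
    submission_id     TEXT NOT NULL REFERENCES submissions(submission_id),
    stdout            TEXT,
    ncu_output        TEXT,
    wall_clock_ms     REAL,
    benchmark_version TEXT NOT NULL      -- versioning for the benchmarking setup
);
"""

def make_db() -> sqlite3.Connection:
    """Create an in-memory database with the sketched leaderboard schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn
```

Splitting submissions from runs means a submission can be re-run (e.g. after a benchmark-version bump) without losing the original code or author.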
NCU outputs
One thing we'd also like to learn is which NCU outputs experts look at when figuring out how to optimize their kernels, so we're open to ideas for how to add telemetry that shows what people are looking at:
Make each NCU output a field that people have to expand
Put the whole results in the equivalent of an online Excel spreadsheet, see where people move their mouse, and collect that data
Discord submission flow
Similarly to /run modal/github train.py, we want a run leaderboard <kernel_problem> <dtype> <GPU> <shape> train.py command
GPU could probably be implicit; we can run "all"
Then the backend needs to take in this kernel, run it, and make sure it matches the correctness of a reference; if it does, time it and rank it among all existing solutions in the leaderboard.
Optionally, we'd also want run leaderboard <kernel_problem> without a train.py to show the top entries with links to their code.
And finally, a run new_leaderboard_problem where people fill in a few fields: the problem name, the reference solution, an optional bounty, and the Discord name of the person who created that kernel.
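Before any Discord plumbing, the three command shapes above could be handled by a plain string parser. This is a sketch only: the action names and returned dict layout are assumptions, and nothing Discord-specific is modeled.

```python
import shlex

def parse_command(message: str) -> dict:
    """Parse the three proposed leaderboard commands from a raw message string.

    Forms handled (argument order follows the proposal above):
      run leaderboard <kernel_problem> <dtype> <GPU> <shape> train.py
      run leaderboard <kernel_problem>
      run new_leaderboard_problem <fields...>
    """
    parts = shlex.split(message)
    if parts[:2] == ["run", "new_leaderboard_problem"]:
        # remaining fields: problem name, reference solution, optional bounty, creator
        return {"action": "new_problem", "fields": parts[2:]}
    if parts[:2] == ["run", "leaderboard"]:
        args = parts[2:]
        if len(args) == 1:
            # no train.py attached: just show the top entries for this problem
            return {"action": "show_top", "problem": args[0]}
        if len(args) == 5:
            problem, dtype, gpu, shape, script = args
            return {"action": "submit", "problem": problem, "dtype": dtype,
                    "gpu": gpu, "shape": shape, "script": script}
    raise ValueError(f"unrecognized command: {message!r}")
```

For example, parse_command("run leaderboard softmax fp16 H100 1024x1024 train.py") yields a submit action with all four dimensions filled in.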
What is a reference?
On random inputs
PyTorch code
Some tolerance values
Optional: Latency target
Cold starts
Number of runs to average
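Putting the pieces above together (random inputs, tolerances, a warmup for cold starts, a run count to average), a correctness-then-timing harness might look like this CPU-only sketch. The function names, defaults, and the pure-Python softmax are all illustrative stand-ins; a real harness would run the PyTorch reference and the submitted kernel on the GPU.

```python
import math
import random
import time

def softmax_ref(xs):
    """Stand-in reference implementation (the real one would be PyTorch code)."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def evaluate(candidate, n=1024, rtol=1e-4, atol=1e-6, n_runs=10, seed=0):
    """Check a candidate against the reference on random inputs, then time it.

    Returns (correct, mean_seconds); tolerances and run count are assumed defaults.
    Incorrect submissions are never timed.
    """
    rng = random.Random(seed)
    xs = [rng.gauss(0.0, 1.0) for _ in range(n)]
    ref = softmax_ref(xs)
    got = candidate(xs)
    if any(abs(g - r) > atol + rtol * abs(r) for g, r in zip(got, ref)):
        return False, None
    candidate(xs)  # one warmup run to absorb cold starts
    start = time.perf_counter()
    for _ in range(n_runs):
        candidate(xs)
    return True, (time.perf_counter() - start) / n_runs
```

The mean over n_runs is what would feed the leaderboard ranking; the optional latency target could be checked against the same number.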
If the benchmarking methodology proves incorrect, how do we track and invalidate or rerun old results?
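One way to handle that: stamp every run with the benchmark-setup version and invalidate in bulk whenever the methodology changes. A minimal sketch, with assumed field names:

```python
# Bump this constant whenever the benchmarking methodology changes;
# the version string format is an assumption.
BENCHMARK_VERSION = "2024.12"

def invalidate_stale(runs, current=BENCHMARK_VERSION):
    """Mark runs from older benchmark versions invalid; return them for re-running.

    `runs` is a list of dicts each carrying a "benchmark_version" field,
    mirroring the versioning column suggested for the runs table.
    """
    stale = []
    for run in runs:
        run["valid"] = run["benchmark_version"] == current
        if not run["valid"]:
            stale.append(run)
    return stale
```

The returned list doubles as the re-run queue, so old leaderboard entries are refreshed rather than silently dropped.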
Top-level metrics: wall clock, ncu output, peak memory
Meeting minutes Dec 5
Compiler flags should be user-determined
Schema really needs to be locked down
Do people submit the launcher or just the kernel?
What do driver scripts look like for different languages?
How to include unverified submissions: don't save their code, mark them as unverified but keep the result as a reference bar
LLM-generated kernels: have Claude as a baseline
Website without Discord so people can investigate the results. I agree that Discord is the submission flow and the website is for result investigation.
We'd be targeting a launch in January 2025.

Dimension of the competition

Even for a simple kernel like softmax, llm.c has over 700 LOC dedicated to various variants (https://github.com/karpathy/llm.c/blob/master/dev/cuda/softmax_forward.cu), so we could start with a competition to produce the fastest softmax; even for softmax, users can compete on GPU type, dtype, and input shape.