[EPIC] A collection of items to improve developer / CI speed #13813

Open · 6 tasks
alamb opened this issue Dec 17, 2024 · 8 comments
Labels: enhancement (New feature or request)

Comments

@alamb (Contributor) commented Dec 17, 2024

Is your feature request related to a problem or challenge?

As DataFusion becomes more mature and gains more features and tests, the amount of time it takes to build and test the system has been increasing.

This has several downsides:

The barrier to entry for new contributors is higher.

The resources required to build / link DataFusion are now very large, which means some people may not be able to run builds or tests at all. For example, @Alexhuszagh reports on #13693 (comment):

> I only tested this locally with the failing datasource::physical_plan::json::tests::test_chunked_json tests, but it should be the same with all the remaining errors. My test machine doesn't have enough RAM to link the entire core test suite.

We are likely wasting lots of resources running CI tests

By my count, the CI checks on the most recent commit (link) require over 200 hours of runner time.

We are lucky the ASF gets many runner credits from GitHub, but this level of usage is very wasteful in my opinion, and likely unsustainable as we contemplate adding additional testing.

Larger binary size

I have noticed that the datafusion-cli binary is now almost 100MB; it used to be 60MB.

Also, people such as @g3blv have noticed that the WASM build size has increased by 50%; see #9834 (comment).

Decreased developer productivity due to longer cycle times

As DataFusion gets bigger, I have noticed that recompiling DataFusion takes longer and longer, for example when running:

cargo test ...

Describe the solution you'd like

I would like to make building and testing DataFusion "easier" and leaner. This ticket tracks work to improve things in this area.

Describe alternatives you've considered

Additional context

No response

alamb added the enhancement (New feature or request) label Dec 17, 2024
alamb changed the title from "Improved developer / CI experience" to "[EPIC] Improved developer / CI speed" Dec 17, 2024
alamb changed the title from "[EPIC] Improved developer / CI speed" to "[EPIC] A collection of items to improve developer / CI speed" Dec 17, 2024
@Alexhuszagh (Contributor) commented

If this could be improved, it would be wonderful. Arrow and DataFusion are very important projects using lexical, and I want to test all my code against them and other high-impact projects as an additional precaution on top of our own internal checks, so being able to run these tests locally would be a huge benefit on my end.

@Omega359 (Contributor) commented

Is there a way to test changes to the CI pipeline without actually checking in files to DF?

@alamb (Contributor, Author) commented Dec 17, 2024

> Is there a way to test changes to the CI pipeline without actually checking in files to DF?

We could in theory change the triggering rules for the jobs so that they also run on any changes to .github; a sketch of what that might look like is below.
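
A minimal sketch of that trigger change, assuming a typical Rust CI workflow (the file name and path globs here are illustrative, not DataFusion's actual configuration):

```yaml
# .github/workflows/rust.yml (hypothetical excerpt)
# Adding ".github/**" to the paths filter makes PRs that only touch
# the CI definitions still run the full pipeline, so workflow changes
# can be tested before merge without touching source files.
on:
  pull_request:
    paths:
      - "**/*.rs"
      - "**/Cargo.toml"
      - ".github/**"
```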

@Omega359 (Contributor) commented

If no one beats me to it, I can likely take a look at the CI pipeline during the holiday break. I still want to finish the sqlite test integration first, if possible.

@alamb (Contributor, Author) commented Dec 17, 2024

I also hope to hack on various parts of this EPIC during the break.

I have some other ideas as well (like reducing the number of distinct binaries, for example; a sketch of that technique follows).
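
One standard Cargo-level way to reduce the number of test binaries (a sketch of the general technique, not necessarily the specific idea alamb has in mind; the module names are hypothetical): Cargo compiles and links every top-level file in tests/ as its own binary, so moving those files under a single directory with one main.rs makes the whole suite link once.

```rust
// tests/it/main.rs -- hypothetical layout.
// Cargo treats a tests/ subdirectory containing main.rs as ONE test
// target, so the modules below link into a single binary instead of
// one binary per top-level tests/*.rs file. Each `mod` line pulls in
// tests/it/<name>.rs.
mod dataframe;
mod parquet;
mod sql;
```

Fewer link steps also means lower peak memory during linking, which speaks directly to the RAM problem quoted above.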

@findepi (Member) commented Dec 18, 2024

> Is there a way to test changes to the CI pipeline without actually checking in files to DF?

@Omega359 you can enable the workflow(s) on your fork.

@Omega359 (Contributor) commented Dec 20, 2024

I've managed to knock some time off of the CI runs on a side branch (I need to rebase off of main before pushing a PR), but here is what I've managed to get to so far:

An example of a PR run now:
[screenshot: PR CI run, ~37 minutes total]

My side branch on the second run (cache populated):
[screenshot: side-branch CI run, ~19 minutes total]

cargo test doc and hash collisions are the two jobs that take the most time right now, and much of that is DataFusion compilation that can't really be cached (it would blow past the 10GB cache limit, I think).

Most of the improvement comes from better caching, splitting up the Linux build test into multiple jobs (it's all cargo checks), and switching to nextest for many tests (see the sketch below). I also disabled the Intel Mac build test.
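
For reference, a rough sketch of the kind of caching-plus-nextest job described above (the actions named are real, but the exact steps are illustrative, not the contents of the eventual PR):

```yaml
# Illustrative excerpt of one of the split Linux jobs.
linux-test:
  runs-on: ubuntu-latest
  steps:
    - uses: actions/checkout@v4
    # Caches ~/.cargo and the target/ directory keyed on the lockfile,
    # so unchanged dependencies are not rebuilt on every run.
    - uses: Swatinem/rust-cache@v2
    # cargo-nextest runs each test in its own process, in parallel,
    # and is usually faster than the default harness on large suites.
    - run: cargo install cargo-nextest --locked
    - run: cargo nextest run --workspace
```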

I hope to submit a PR this weekend for this.

@alamb (Contributor, Author) commented Dec 20, 2024

> I've managed to knock some time off of the CI runs on a side branch (I need to rebase off of main before pushing a PR), but here is what I've managed to get to so far:

LOL, your screenshots show 37m --> 19m (which is a 2x improvement by my accounting), so that is a somewhat modest description :)

> cargo test doc and hash collisions are the two jobs that take the most time right now, and much …

I think the hash collisions job fails so irregularly that we should consider simply running it on main (not on all PRs), as discussed in #13845. While that does defer the potential for finding issues, I think the tradeoff might be worth it in this specific case; the trigger change is sketched below.
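
Mechanically, moving a job from every PR to main-only is just a trigger change; a hedged sketch (the workflow file and name are hypothetical):

```yaml
# .github/workflows/hash_collisions.yml (hypothetical)
# Run the expensive hash-collision tests only on pushes to main,
# instead of on every pull request.
name: hash-collisions
on:
  push:
    branches: [main]
```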
