all: faster builds by caching the build cache ⚡ #41033

Draft
wants to merge 21 commits into base: main
Conversation

@shmsr (Member) commented Sep 27, 2024

Proposed commit message

I tested the change by building the agentbeat binary for linux/amd64, where my Docker host is darwin/arm64. After debugging, I found that most of the time is spent in the cross-compilation part during packaging. So, to isolate and benchmark the change, I pulled out only the docker command that's responsible for building the binaries:

docker run --ulimit nofile=262144:262144 \
  --env EXEC_UID=501 \
  --env EXEC_GID=20 \
  -v /Users/subhamsarkar/go/pkg/mod:/go/pkg/mod:ro \
  --env DEV=false \
  --rm \
  --env GOGC=off \
  --env GOFLAGS="-mod=readonly -buildvcs=false" \
  --env MAGEFILE_VERBOSE=true \
  --env MAGEFILE_TIMEOUT= \
  --env SNAPSHOT=false \
  -v /Users/subhamsarkar/go/src/github.com/elastic/beats:/go/src/github.com/elastic/beats \
  -w /go/src/github.com/elastic/beats/x-pack/agentbeat \
  docker.elastic.co/beats-dev/golang-crossbuild:1.22.6-main-debian10 \
  --build-cmd "build/mage-linux-amd64 golangCrossBuild" \
  --platforms linux/amd64

It was taking ~4m30s before. After the change, the 1st run is similar to before, but subsequent runs take only ~25s ⚡
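
Roughly, the same invocation with the build cache persisted looks like the sketch below; the --mount flag and the cache paths mirror the change in this PR, other flags are omitted for brevity, and it assumes the repository root as the working directory:

# Bind-mount a per-platform host directory as the container's Go build cache so it
# survives between container runs. Docker does not auto-create bind-mount sources,
# so create the directory first.
mkdir -p "$PWD/build/.go-build/linux/amd64"
docker run --rm \
  -v "$PWD:/go/src/github.com/elastic/beats" \
  -w /go/src/github.com/elastic/beats/x-pack/agentbeat \
  --mount "type=bind,source=$PWD/build/.go-build/linux/amd64,target=/root/.cache/go-build" \
  docker.elastic.co/beats-dev/golang-crossbuild:1.22.6-main-debian10 \
  --build-cmd "build/mage-linux-amd64 golangCrossBuild" \
  --platforms linux/amd64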

Along with the introduction of build caching, I've made some general improvements, including fixing code smells, better error messages, and tiny optimizations.

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding changes to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc.

botelastic bot added the needs_team label (Indicates that the issue/PR needs a Team:* label) on Sep 27, 2024
botelastic bot commented Sep 27, 2024

This pull request doesn't have a Team:<team> label.

shmsr changed the title from "all: super fast crossbuilds ⚡" to "all: super fast crossbuilds by caching the build cache ⚡" on Sep 27, 2024
@mergify mergify bot assigned shmsr Sep 27, 2024
mergify bot (Contributor) commented Sep 27, 2024

This pull request does not have a backport label.
If this is a bug or security fix, could you label this PR @shmsr? 🙏.
For such, you'll need to label your PR with:

  • The upcoming major version of the Elastic Stack
  • The upcoming minor version of the Elastic Stack (if you're not pushing a breaking change)

To fix up this pull request, you need to add the backport labels for the needed branches, such as:

  • backport-8./d is the label to automatically backport to the 8./d branch. /d is the digit

mergify bot (Contributor) commented Sep 27, 2024

backport-8.x has been added to help with the transition to the new branch 8.x.
If you don't need it, please use the backport-skip label and remove the backport-8.x label.

mergify bot added the backport-8.x label (Automated backport to the 8.x branch with mergify) on Sep 27, 2024
Comment on lines 341 to 373
// To speed up cross-compilation, we need to persist the build cache so that subsequent builds
// for the same arch are faster (⚡).
//
// As we want to persist the build cache, we need to mount the cache directory to the Docker host.
// This is done by mounting the host directory to the container.
//
// Path of the cache directory on the host:
// <repoInfo.RootDir>/build/.go-build/<b.Platform>
// Example: <rootdir>/build/.go-build/linux/amd64
//
// As per: https://docs.docker.com/engine/storage/bind-mounts/#differences-between--v-and---mount-behavior
// If the directory doesn't exist, Docker does not automatically create it for you, but generates an error.
// So, we need to create the directory before mounting it.
//
// Also, in the container, the cache directory is mounted to /root/.cache/go-build.
buildCacheHostDir := filepath.Join(repoInfo.RootDir, "build", ".go-build", b.Platform)
buildCacheContainerDir := "/root/.cache/go-build"
if err = os.MkdirAll(buildCacheHostDir, 0755); err != nil {
return fmt.Errorf("failed to create directory %s: %w", buildCacheHostDir, err)
}

// Common arguments
args = append(args,
"--rm",
"--env", "GOFLAGS=-mod=readonly -buildvcs=false",
"--env", "MAGEFILE_VERBOSE="+verbose,
"--env", "MAGEFILE_TIMEOUT="+EnvOr("MAGEFILE_TIMEOUT", ""),
"--env", fmt.Sprintf("SNAPSHOT=%v", Snapshot),
"--env", "SNAPSHOT="+strconv.FormatBool(Snapshot),

// To persist the build cache, we need to mount the host cache directory into the container.
// With docker run, the available mount types are: bind, volume, and tmpfs. For our use case,
// we have decided to use the bind mount type.
"--mount", fmt.Sprintf("type=bind,source=%s,target=%s", buildCacheHostDir, buildCacheContainerDir),
@shmsr (Member Author) commented Sep 27, 2024

The speed-up from persisting the build cache is in this hunk.

shmsr changed the title from "all: super fast crossbuilds by caching the build cache ⚡" to "all: faster crossbuilds by caching the build cache ⚡" on Sep 27, 2024
Comment on lines -71 to -73
echo "~~~ Downloading artifacts"
buildkite-agent artifact download x-pack/agentbeat/build/distributions/** . --step 'agentbeat-package-linux'
ls -lah x-pack/agentbeat/build/distributions/
@shmsr (Member Author)

I don't know why we were doing this. For the integration test, the test binaries will be created, which is already handled in mage goIntegTest.

Comment on lines +559 to +562
// Parallelize conservatively to avoid overloading the host.
if maxParallel >= 2 {
return maxParallel / 2
}
@shmsr (Member Author) commented Sep 28, 2024

This needs to be discussed. I made this change, i.e., n/2 (compared to ~n before), because there were cases where the BK agent was lost. From the Slack threads and the BK notice about "Agent lost", it seems to be happening a lot, so maybe we should keep this change.

Now it works with the change. Cross-compilations are quite expensive; I think we shouldn't spawn all of them at once.

Member

I have a suspicion that the BK agents being lost is related to a more infra-level problem; I recall seeing some communication about this.

IMO we should use as much CPU as possible to ensure our builds are as fast as possible. The Buildkite agent's only job is to compile and test things quickly.

@shmsr (Member Author)

I agree with your point. However, excessive parallelization can be problematic in certain scenarios. Prior to implementing this change, I frequently encountered the "Agent Lost" error; after the change, none! So I thought I'd keep the change and discuss it here.

Certainly, I can remove this change for now and revisit it later. The primary focus of this PR is to discuss the persistence of the build cache.

@shmsr (Member Author) commented Sep 28, 2024

After playing a lot with BK, I think I'll need some help with how things work in CI. Locally, it has been working great. For example, inside x-pack/agentbeat, if I run PLATFORMS=linux/amd64 PACKAGES=tar.gz mage -v package, it takes around ~1m24s (2nd run onwards) compared to ~5m42s earlier, which is a significant drop: roughly more than 4 minutes off the build time (commands below). I think we should introduce build caching in CI as well.
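
Roughly, the local benchmark is just running the package target twice; the timings are the ones quoted above:

# Package agentbeat for a single platform; the first run still populates the build cache,
# subsequent runs reuse it (~1m24s from the 2nd run onwards, vs ~5m42s before the change).
cd x-pack/agentbeat
time PLATFORMS=linux/amd64 PACKAGES=tar.gz mage -v package
time PLATFORMS=linux/amd64 PACKAGES=tar.gz mage -v package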

Here's what I need help with:

  • Review the changes in the BK pipeline.
  • I see there's some stale module and build cache in BK. From the date, it looks to be about 2 days old. How is it currently being done?
  • What's the ideal way to use the build cache (platform-wise as well) and the module cache across multiple steps? Suppose I have warmed up the cache in the first step and now want it to be available in the 2nd step. I tried uploading it as an artifact after reading about that somewhere, but it takes a lot of time because of the upload.
  • How can I persist the build-cache generated in CI? If this happens, it will significantly drop the build time.
  • Also, for integration tests, I see that -count=1 is being used with go test, so even if we have the cache, it won't be used (see the example below). Is that intentional? Is it fine to remove it? I think there should be a pipeline that runs the integration tests with -count=1 after merging to main.
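
For reference, this is standard go test behaviour rather than anything Beats-specific:

# go test caches the results of passing test runs; -count=1 forces every test to re-run
# and bypasses that cache.
go test ./...           # eligible results may be served from the cache, shown as "(cached)"
go test -count=1 ./...  # always re-runs the tests, never reuses cached results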

@shmsr shmsr marked this pull request as ready for review September 29, 2024 08:14
@shmsr shmsr requested review from a team as code owners September 29, 2024 08:14
@shmsr shmsr marked this pull request as draft September 29, 2024 08:15
Comment on lines +341 to +379
// To speed up cross-compilation, we need to persist the build cache so that subsequent builds
// for the same arch are faster (⚡).
//
// As we want to persist the build cache, we need to mount the cache directory to the Docker host.
// This is done by mounting the host directory to the container.
//
// Path of the cache directory on the host:
// <os.TempDir>/build/.go-build/<b.Platform>
// Example: /tmp/build/.go-build/linux/amd64
// The reason for using <os.TempDir> rather than <repoInfo.RootDir> as the base is that for
// builds happening on CI, the paths look similar to:
// /opt/buildkite-agent/builds/bk-agent-prod-gcp-1727515099712207954/elastic/beats-xpack-agentbeat/
// where bk-agent-prod-gcp-1727515099712207954 is the agent, so it keeps changing. Even if we did cache the
// build, it would be useless because the cache directory would be different for every build.
//
// As per: https://docs.docker.com/engine/storage/bind-mounts/#differences-between--v-and---mount-behavior
// If the directory doesn't exist, Docker does not automatically create it for you, but generates an error.
// So, we need to create the directory before mounting it.
//
// Also, in the container, the cache directory is mounted to /root/.cache/go-build.
buildCacheHostDir := filepath.Join(os.TempDir(), "build", ".go-build", b.Platform)
buildCacheContainerDir := "/root/.cache/go-build"

if err = os.MkdirAll(buildCacheHostDir, 0755); err != nil {
return fmt.Errorf("failed to create directory %s: %w", buildCacheHostDir, err)
}

// Common arguments
args = append(args,
"--rm",
"--env", "GOFLAGS=-mod=readonly -buildvcs=false",
"--env", "MAGEFILE_VERBOSE="+verbose,
"--env", "MAGEFILE_TIMEOUT="+EnvOr("MAGEFILE_TIMEOUT", ""),
"--env", fmt.Sprintf("SNAPSHOT=%v", Snapshot),
"--env", "SNAPSHOT="+strconv.FormatBool(Snapshot),

// To persist the build cache, we need to mount the host cache directory into the container.
// With docker run, the available mount types are: bind, volume, and tmpfs. For our use case,
// we have decided to use the bind mount type.
"--mount", fmt.Sprintf("type=bind,source=%s,target=%s", buildCacheHostDir, buildCacheContainerDir),
@shmsr (Member Author)

This is the part where build caching is happening.

@shmsr (Member Author) commented Sep 30, 2024

shmsr changed the title from "all: faster crossbuilds by caching the build cache ⚡" to "all: faster builds by caching the build cache ⚡" on Sep 30, 2024
@alexsapran (Contributor)

Why are we making this change?

The CI agents are ephemeral. So, any caching between runs is lost unless we share the cache.

Are we improving a local dev workflow that we want to make re-building faster or improving CI timing?

@shmsr (Member Author) commented Oct 7, 2024

> Why are we making this change?
>
> The CI agents are ephemeral. So, any caching between runs is lost unless we share the cache.
>
> Are we improving a local dev workflow that we want to make re-building faster or improving CI timing?

Yeah, currently the change works well in dev but not in CI, for the reason you've mentioned.

See the 3rd and 4th point here: #41033 (comment)

So I need some help to do this in CI. We can use this but that would take some more work.

@alexsapran (Contributor)

> See the 3rd and 4th point here: #41033 (comment)

Answering each bullet:

> How can I persist the build-cache generated in CI? If this happens, it will significantly drop the build time.

Why would we need to persist the build cache in CI, and what problem are we solving?
Are we trying to optimise a specific step that we identified as having an issue, or is it generally good practice?

> So I need some help to do this in CI. We can use #41033 (comment) but that would take some more work.

Again, I agree with you, and this is why I am asking you to define the problem we are solving.

Overall, I see this as follows:
We can fix the local dev use case by improving and reusing the cache, which could conclude this work.
Then, identify the steps to improve each step's time and see what we can do. That can be a separate, dedicated PR.
Because we use ephemeral agents in CI, caching is a challenge: what if we need to evict something from the cache? Is the caching unique to a specific build, or reusable between different builds?

We maintain a set of custom VM images for Beats that come pre-bundled with container images and Go packages to help improve CI times. Can we improve something there?
It depends on the problem we want to solve.

@shmsr (Member Author) commented Oct 8, 2024

Thank you, @alexsapran. That makes sense. I propose we proceed as follows:

  1. General improvements such as fixing code smells and micro-optimizations in dev-tools/mage/* (low priority; we can defer this for now). Not so important.
  2. Faster builds in the development environment (implementing build caching). This will improve build times considerably, and if the cache ever needs to be evicted, that is easy to do in a dev environment. It's a safe change IMO and good to have.
  3. Exploration of faster builds in the CI environment (utilizing GOCACHEPROG or alternative solutions; see the rough sketch below). Lots of discussion is needed here, as you also mentioned.
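
A very rough sketch of what point 3 might look like with GOCACHEPROG; the helper binary and its flag below are hypothetical placeholders, not an existing tool:

# GOCACHEPROG makes the go tool delegate its build cache to a child process that speaks
# a JSON protocol over stdin/stdout, so the cache could live in shared or remote storage.
# "go-cache-helper" and --backend are made-up placeholders for whatever implementation we pick.
export GOEXPERIMENT=cacheprog   # needed on Go toolchains where cacheprog is still experimental
export GOCACHEPROG="go-cache-helper --backend=<shared-cache-url>"
go build ./...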

Changes for points 1 and 2 are ready. I can split these into separate PRs. Additionally, I'll create an issue to track the work for point 3.

What do you think?

@alexsapran (Contributor)

> Changes for points 1 and 2 are ready. I can split these into separate PRs. Additionally, I'll create an issue to track the work for point 3.

From my side, splitting the PR is optional. Thanks in advance for raising the issue for 3.

@shmsr shmsr mentioned this pull request Oct 8, 2024
mergify bot (Contributor) commented Oct 9, 2024

This pull request is now in conflict. Could you fix it? 🙏
To fix up this pull request, you can check it out locally. See the documentation: https://help.github.com/articles/checking-out-pull-requests-locally/

git fetch upstream
git checkout -b shmsr-improve-builder upstream/shmsr-improve-builder
git merge upstream/main
git push upstream shmsr-improve-builder

1 similar comment from the mergify bot followed on Oct 10, 2024.

Labels
backport-8.x (Automated backport to the 8.x branch with mergify)
needs_team (Indicates that the issue/PR needs a Team:* label)
3 participants