Something I've increasingly wondered is whether the model of CI where a totally pristine container (or VM) gets spun up on each change for each test set imposes a floor on how fast CI can run.
Each job will always have to run a clone, always pay the cost of either bootstrapping a toolchain or downloading a giant container with the toolchain, and always have to download a big remote cache.
If I had infinite time, I'd build a CI system that maintained some state (gasp!) about the build and routed each job to a runner that already had most of its local build cache downloaded, the source cloned, and the toolchain bootstrapped.
You'd love a service like that, right up until some weird thing works in CI but not locally (or vice versa). That's why things are built from scratch every time: to prevent exactly those issues.
npm was (still is?) famously bad at installing dependencies, to the point where sometimes the fix is to delete node_modules and simply reinstall. Back when npm was even more brittle (yes, that was possible), it was nearly impossible to maintain caches of node_modules directories, because they ended up different from what a fresh install into an empty node_modules would have produced.
I think Nix could be leveraged to resolve this. If the dependencies aren't a perfect match, it downloads only the ones that _differ_ and reuses anything already present in the local store.
So the infra concerns are identical. Wipe any state your application itself uses (clean slate, e.g. a local DB), but the VM itself can effectively be persistent (perhaps shut off when not in use to reduce spend)?
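A minimal sketch of that split on a persistent runner, assuming a hypothetical throwaway Postgres database for per-job application state and a `make test` entry point; the VM, toolchain, and build caches are simply left in place between jobs:

```python
import subprocess

APP_DB = "ci_app_db"  # assumed name of the job-local test database

def reset_app_state() -> None:
    """Recreate application state from scratch for each job."""
    subprocess.run(["dropdb", "--if-exists", APP_DB], check=True)
    subprocess.run(["createdb", APP_DB], check=True)

def run_job() -> None:
    reset_app_state()                              # clean application state
    subprocess.run(["make", "test"], check=True)   # warm toolchain and build cache

if __name__ == "__main__":
    run_job()
```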
I mean, given that my full build takes hours but my incremental build takes seconds--and given that my build system itself tends to only mess up the incremental build a few times a year (and mostly in ways I can predict), I'd totally be OK with "correctness once a day" or "correctness on demand" in exchange for having the CI feel like something that I can use constantly. It isn't like I am locally developing or testing with "correctness each and every time", no matter how cool that sounds: I'd get nothing done!
This really depends a lot on context and there's no right or wrong answer here.
If you're working on something safety critical you'll want correctness every time. For most things short of that it's a trade-off between risk, time, and money, each of which can be traded against the others depending on context.
A small change in a dependency essentially bubbles, or chains, out to all dependent steps. I.e., a change to the fizzbuzz source means we inherently must run the fizzbuzz tests. This cascades into your integration tests — we must run the integration tests that include fizzbuzz … but those now need all the other components involved; so it bubbles out to all reverse dependencies (i.e., we need to build the bazqux service, since it is in the integration test with fizzbuzz …) and now I'm building a large portion of my dependency graph.
And in practice, to keep the logic in CI reasonably simple … the answer is "build it all".
(If I had better content-aware builds, I could cache them: I could say, ah, bazqux's source hashes to $X, and we already have a build for that hash, excellent. In practice, this is really hard. It would work if all of bazqux were confined to some subtree, but inevitably one file decides to include some source from outside the spiritual root of bazqux, and now bazqux's hash is effectively "the entire tree", which by definition we've never built.)
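A minimal sketch of that content-hash lookup, under the optimistic assumption that a component's inputs really are confined to one subtree and that artifacts are stored under their input hash (the paths and cache location are hypothetical):

```python
import hashlib
from pathlib import Path

CACHE_DIR = Path("/var/cache/builds")  # assumed content-addressed artifact cache

def tree_hash(root: Path) -> str:
    """Hash every file under `root` (relative path + contents) into one digest."""
    digest = hashlib.sha256()
    for path in sorted(root.rglob("*")):
        if path.is_file():
            digest.update(str(path.relative_to(root)).encode())
            digest.update(path.read_bytes())
    return digest.hexdigest()

def cached_artifact(component_root: Path) -> Path | None:
    """Return a previously built artifact for this exact source state, if any."""
    artifact = CACHE_DIR / tree_hash(component_root)
    return artifact if artifact.exists() else None

# Usage: only rebuild bazqux when its subtree actually changed.
if cached_artifact(Path("services/bazqux")) is None:
    print("cache miss: build bazqux")  # invoke the real build here
```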
I work in games; our repository is ~100 GB (a 20-minute download) and a clean compile takes 2 hours on a 16-core machine with 32 GB of RAM (c6i.4xlarge, for any AWS friends). Actually building a runnable version of the game takes two clean compiles (one editor and one client) plus an asset-processing step that takes about another 2 hours from clean.
Our toolchain install takes about 30 minutes (although that includes snapshotting the EBS volume to make an AMI out of it).
That's ~7 hours for a clean build.
We have a somewhat better system than this: our base AMI contains the entire toolchain, and we do an initial clone when baking the AMI to get the bulk of the download done too. We store all the intermediates on a separate drive; we just mount it, build incrementally, and unmount again. Sometimes we end up with duplicated work, but overall it works pretty well. Our full builds are down from 7 hours (in theory) to about 30 minutes, including artifact deployments.
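A rough sketch of that mount/build/unmount cycle, assuming a hypothetical intermediates volume at /dev/xvdf, a ./build.sh entry point, and a per-volume lock so two runners don't build against the same cache at once:

```python
import fcntl
import subprocess
from contextlib import contextmanager

DEVICE = "/dev/xvdf"               # assumed volume holding build intermediates
MOUNT_POINT = "/mnt/intermediates"

@contextmanager
def intermediates_volume():
    """Attach the cache volume for the duration of one incremental build."""
    subprocess.run(["mount", DEVICE, MOUNT_POINT], check=True)
    lock = open(f"{MOUNT_POINT}/.lock", "w")
    try:
        fcntl.flock(lock, fcntl.LOCK_EX)    # one build per volume at a time
        yield MOUNT_POINT
    finally:
        lock.close()
        subprocess.run(["umount", MOUNT_POINT], check=True)

with intermediates_volume() as cache_dir:
    # Incremental build reusing whatever intermediates are already on disk.
    subprocess.run(["./build.sh", "--intermediates", cache_dir], check=True)
```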
This is how CI systems have traditionally behaved. Just install a Jenkins agent on any computer/VM and it will maintain a persistent workspace on disk for each job to reuse in incremental builds. Countless other tools work the same way. This also solves build isolation if your CI only checks out the code and then launches a constrained Docker container to run the build. It can easily be extended with persistent network disks and scaled-up workers, but that's usually not worth the cost.
It's baffling to see this new trend of YAML actions running on pristine workers, redownloading the whole npm universe from scratch on every change, birthing hundreds of startups trying to "solve" CI by presenting solutions to non-problems and then wrapping things in even more layers of lock-in and micro-VMs, detaching you ever further from the actual integration.
While Jenkins might not be the best tool in the world, the industry needs a wake-up call on how to simplify and keep in touch with reality, not hide behind layers of SaaS abstractions.
Agreed, this is more or less the inspiration behind Depot (https://depot.dev). Today it builds Docker images with this philosophy, but we'll be expanding to other more general inputs as well. Builds get routed to runner instances pre-configured to build as fast as possible, with local SSD cache and pre-installed toolchains, but without needing to set up any of that orchestration yourself.
- Watch which files are read (at the OS level) during each step, and snapshot the entire RAM/disk state of the MicroVM
- When you next push, just skip ahead to the latest snapshot
In practice this gives you a generalized version of "cache keys", where you snapshot the VM as it builds and then restore the most appropriate snapshot for any given change.
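A toy sketch of that matching logic, assuming each snapshot records a digest of every file the build had read at the moment it was taken (the snapshot list and the actual restore step are hypothetical):

```python
import hashlib
from pathlib import Path

def file_digest(path: Path) -> str:
    return hashlib.sha256(path.read_bytes()).hexdigest()

# Each snapshot records which files had been read, and their hashes, e.g.
# snapshots = [{"id": "snap-1", "reads": {"src/app.c": "<sha256>"}}, ...]
def best_snapshot(snapshots: list[dict], workdir: Path) -> str | None:
    """Pick the newest snapshot whose recorded inputs still match what's on disk."""
    for snap in reversed(snapshots):            # list is ordered oldest to newest
        if all(
            (workdir / rel).is_file() and file_digest(workdir / rel) == digest
            for rel, digest in snap["reads"].items()
        ):
            return snap["id"]                   # restore this VM state and resume
    return None                                 # nothing matches: start from scratch
```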
I have zero experience with Bazel, but I believe it offers mechanisms similar to this? Or at least a mechanism that makes this "somewhat safe"?
Yes it does, but one should be warned that adopting Bazel isn't the lightest decision to make. But yeah, the CI experience is one of its best attributes.
We are using Bazel with GitHub self-hosted runners and have consistently low build times with a growing codebase and test suite, as Bazel only re-builds and re-tests what is affected by a change.
The CI experience compared to, e.g., naive caching of some directories with GitHub-managed runners is amazing, and it's probably the most reliable build/test setup I've had. The most common failure of the build system itself (still rare, at roughly once a week) is network issues with one of the package managers, rather than quirks introduced by one of the engineers (and there would be a straightforward path to preventing those failures too; we just haven't bothered to set that up yet).
> Each job will always have to run a clone, always pay the cost of either bootstrapping a toolchain or downloading a giant container with the toolchain, and always have to download a big remote cache.
Couldn’t this be addressed if every node had a local caching proxy server container/VM, and all the other containers/VMs on the node used it for Git checkouts, image/package downloads, etc?
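A toy illustration of that pull-through idea, assuming a hypothetical node-local cache directory shared by every job on the machine; a real setup would use a registry mirror or HTTP proxy, but the lookup is the same:

```python
import hashlib
import urllib.request
from pathlib import Path

NODE_CACHE = Path("/var/cache/ci-proxy")  # assumed node-local shared cache

def fetch(url: str) -> bytes:
    """Return the artifact at `url`, downloading it at most once per node."""
    NODE_CACHE.mkdir(parents=True, exist_ok=True)
    cached = NODE_CACHE / hashlib.sha256(url.encode()).hexdigest()
    if cached.exists():
        return cached.read_bytes()            # every later job hits local disk
    data = urllib.request.urlopen(url).read()
    cached.write_bytes(data)
    return data

# First job on the node pays for the download; all later jobs read from disk.
toolchain = fetch("https://example.com/toolchain.tar.gz")  # placeholder URL
```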
I'm using Buildkite, which lets me run the workers myself. These are long-lived Ubuntu systems set up with the same code we use in dev and production, running all the same software dependencies. Tests are fast and it works pretty nicely.
Self-hosted runners are brilliant, but have a poor security model for running containers or building them within a job. Whilst we're focusing on GitHub Actions at the moment, the same problems exist for GitLab CI, Drone, Bitbucket and Azure DevOps. We explain why in the FAQ (link in the post).
That is a benefit over DinD and socket sharing; however, it doesn't allow for running containers or K8s itself within a job. Any tooling that depends on running "docker" (the CLI) will also break or need adapting.