Something I've increasingly wondered is whether the model of CI where a totally pristine container (or VM) gets spun up on each change for each test set imposes a floor on how fast CI can run.
Each job will always have to run a clone, always pay the cost of either bootstrapping a toolchain or downloading a giant container with the toolchain, and always have to download a big remote cache.
If I had infinite time, I'd build a CI system that maintained some state (gasp!) about the build and routed each job to a runner that already had most of its local build cache downloaded, the source cloned, and the toolchain bootstrapped.
You'd love a service like that, right up until some weird thing works in CI but not locally (or vice versa). That's why things are built from scratch every time: to prevent exactly those issues.
npm was (still is?) famously bad at installing dependencies, to the point where sometimes the fix is to delete node_modules and simply reinstall. Back when npm was even more brittle (yes, that was possible), it was nearly impossible to maintain caches of node_modules directories, because they ended up different from what a fresh install into an empty node_modules would have produced.
I think Nix could be leveraged to resolve this. If the dependencies aren't a perfect match, it downloads only the ones that _differ_ and reuses anything already present in the local store.
So the infra concerns are identical. Wipe any state your application itself uses (clean slate, e.g. a local DB), but the VM itself can effectively be persistent (perhaps shut off when not in use to reduce spend)?
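A minimal sketch of that split on a persistent runner, assuming a hypothetical throwaway Postgres database for per-job application state and a `make test` entry point; the VM, toolchain, and build caches are simply left in place between jobs:

```python
import subprocess

APP_DB = "ci_app_db"  # assumed name of the job-local test database

def reset_app_state() -> None:
    """Recreate application state from scratch for each job."""
    subprocess.run(["dropdb", "--if-exists", APP_DB], check=True)
    subprocess.run(["createdb", APP_DB], check=True)

def run_job() -> None:
    reset_app_state()                              # clean application state
    subprocess.run(["make", "test"], check=True)   # warm toolchain and build cache

if __name__ == "__main__":
    run_job()
```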
I mean, given that my full build takes hours but my incremental build takes seconds--and given that my build system itself tends to only mess up the incremental build a few times a year (and mostly in ways I can predict), I'd totally be OK with "correctness once a day" or "correctness on demand" in exchange for having the CI feel like something that I can use constantly. It isn't like I am locally developing or testing with "correctness each and every time", no matter how cool that sounds: I'd get nothing done!
This really depends a lot on context and there's no right or wrong answer here.
If you're working on something safety critical you'll want correctness every time. For most things short of that it's a trade-off between risk, time, and money, each of which can be traded against the others depending on context.
A small change in a dependency essentially bubbles, or chains, out to all dependent steps. I.e., a change to the fizzbuzz source means we inherently must run the fizzbuzz tests. This cascades into your integration tests — we must run the integration tests that include fizzbuzz … but those now need all the other components involved; so it bubbles out to all reverse dependencies (i.e., we need to build the bazqux service, since it is in the integration test with fizzbuzz …) and now I'm building a large portion of my dependency graph.
And in practice, to keep the logic in CI reasonably simple … the answer is "build it all".
(If I had better content-aware builds, I could cache them: I could say, ah, bazqux's source hashes to $X, and we already have a build for that hash, excellent. In practice, this is really hard. It would work if all of bazqux were confined to some subtree, but inevitably one file decides to include some source from outside the spiritual root of bazqux, and now bazqux's hash is effectively "the entire tree", which by definition we've never built.)
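A minimal sketch of that content-hash lookup, under the optimistic assumption that a component's inputs really are confined to one subtree and that artifacts are stored under their input hash (the paths and cache location are hypothetical):

```python
import hashlib
from pathlib import Path

CACHE_DIR = Path("/var/cache/builds")  # assumed content-addressed artifact cache

def tree_hash(root: Path) -> str:
    """Hash every file under `root` (relative path + contents) into one digest."""
    digest = hashlib.sha256()
    for path in sorted(root.rglob("*")):
        if path.is_file():
            digest.update(str(path.relative_to(root)).encode())
            digest.update(path.read_bytes())
    return digest.hexdigest()

def cached_artifact(component_root: Path) -> Path | None:
    """Return a previously built artifact for this exact source state, if any."""
    artifact = CACHE_DIR / tree_hash(component_root)
    return artifact if artifact.exists() else None

# Usage: only rebuild bazqux when its subtree actually changed.
if cached_artifact(Path("services/bazqux")) is None:
    print("cache miss: build bazqux")  # invoke the real build here
```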
I work in games; our repository is ~100 GB (a 20-minute download) and a clean compile takes 2 hours on a 16-core machine with 32 GB of RAM (c6i.4xlarge, for any AWS friends). Actually building a runnable version of the game takes two clean compiles (one editor and one client) plus an asset-processing step that takes about another 2 hours from clean.
Our toolchain install takes about 30 minutes (although that includes snapshotting the EBS volume to make an AMI out of it).
That's ~7 hours for a clean build.
We have a somewhat better system than this: our base AMI contains the entire toolchain, and we do an initial clone when baking the AMI to get the bulk of the download done too. We store all the intermediates on a separate drive; we just mount it, build incrementally, and unmount again. Sometimes we end up with duplicated work, but overall it works pretty well. Our full builds are down from 7 hours (in theory) to about 30 minutes, including artifact deployments.
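A rough sketch of that mount/build/unmount cycle, assuming a hypothetical intermediates volume at /dev/xvdf, a ./build.sh entry point, and a per-volume lock so two runners don't build against the same cache at once:

```python
import fcntl
import subprocess
from contextlib import contextmanager

DEVICE = "/dev/xvdf"               # assumed volume holding build intermediates
MOUNT_POINT = "/mnt/intermediates"

@contextmanager
def intermediates_volume():
    """Attach the cache volume for the duration of one incremental build."""
    subprocess.run(["mount", DEVICE, MOUNT_POINT], check=True)
    lock = open(f"{MOUNT_POINT}/.lock", "w")
    try:
        fcntl.flock(lock, fcntl.LOCK_EX)    # one build per volume at a time
        yield MOUNT_POINT
    finally:
        lock.close()
        subprocess.run(["umount", MOUNT_POINT], check=True)

with intermediates_volume() as cache_dir:
    # Incremental build reusing whatever intermediates are already on disk.
    subprocess.run(["./build.sh", "--intermediates", cache_dir], check=True)
```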
This is how CI systems have traditionally behaved. Just install a Jenkins agent on any computer/VM and it will maintain a persistent workspace on disk for each job to reuse in incremental builds. Countless other tools work the same way. This also solves build isolation if your CI only checks out the code and then launches a constrained Docker container to run the build. It can easily be extended with persistent network disks and scaled-up workers, but that's usually not worth the cost.
It's baffling to see this new trend of YAML actions running on pristine workers, redownloading the whole npm universe from scratch on every change, birthing hundreds of startups trying to "solve" CI by presenting solutions to non-problems and then wrapping things in even more layers of lock-in and micro-VMs, detaching you ever further from the actual integration.
While Jenkins might not be the best tool in the world, the industry needs a wake-up call on how to simplify and keep in touch with reality, not hide behind layers of SaaS abstractions.
Agreed, this is more or less the inspiration behind Depot (https://depot.dev). Today it builds Docker images with this philosophy, but we'll be expanding to other more general inputs as well. Builds get routed to runner instances pre-configured to build as fast as possible, with local SSD cache and pre-installed toolchains, but without needing to set up any of that orchestration yourself.
- Watch which files are read (at the OS level) during each step, and snapshot the entire RAM/disk state of the MicroVM
- When you next push, just skip ahead to the latest snapshot
In practice this gives you a generalized version of "cache keys", where you snapshot the VM as it builds and then restore the most appropriate snapshot for any given change.
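A toy sketch of that matching logic, assuming each snapshot records a digest of every file the build had read at the moment it was taken (the snapshot list and the actual restore step are hypothetical):

```python
import hashlib
from pathlib import Path

def file_digest(path: Path) -> str:
    return hashlib.sha256(path.read_bytes()).hexdigest()

# Each snapshot records which files had been read, and their hashes, e.g.
# snapshots = [{"id": "snap-1", "reads": {"src/app.c": "<sha256>"}}, ...]
def best_snapshot(snapshots: list[dict], workdir: Path) -> str | None:
    """Pick the newest snapshot whose recorded inputs still match what's on disk."""
    for snap in reversed(snapshots):            # list is ordered oldest to newest
        if all(
            (workdir / rel).is_file() and file_digest(workdir / rel) == digest
            for rel, digest in snap["reads"].items()
        ):
            return snap["id"]                   # restore this VM state and resume
    return None                                 # nothing matches: start from scratch
```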
I have zero experience with Bazel, but I believe it offers mechanisms similar to this? Or at least a mechanism that makes this "somewhat safe"?
Yes it does, but one should be warned that adopting Bazel isn't the lightest decision to make. But yeah, the CI experience is one of its best attributes.
We are using Bazel with GitHub self-hosted runners and have consistently low build times with a growing codebase and test suite, as Bazel only re-builds and re-tests what is affected by a change.
The CI experience compared to, e.g., naive caching of some directories with GitHub-managed runners is amazing, and it's probably the most reliable build/test setup I've had. The most common failure of the build system itself (still rare, at roughly once a week) is network issues with one of the package managers, rather than quirks introduced by one of the engineers (and there would be a straightforward path to preventing those failures too; we just haven't bothered to set that up yet).
> Each job will always have to run a clone, always pay the cost of either bootstrapping a toolchain or downloading a giant container with the toolchain, and always have to download a big remote cache.
Couldn’t this be addressed if every node had a local caching proxy server container/VM, and all the other containers/VMs on the node used it for Git checkouts, image/package downloads, etc?
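A toy illustration of that pull-through idea, assuming a hypothetical node-local cache directory shared by every job on the machine; a real setup would use a registry mirror or HTTP proxy, but the lookup is the same:

```python
import hashlib
import urllib.request
from pathlib import Path

NODE_CACHE = Path("/var/cache/ci-proxy")  # assumed node-local shared cache

def fetch(url: str) -> bytes:
    """Return the artifact at `url`, downloading it at most once per node."""
    NODE_CACHE.mkdir(parents=True, exist_ok=True)
    cached = NODE_CACHE / hashlib.sha256(url.encode()).hexdigest()
    if cached.exists():
        return cached.read_bytes()            # every later job hits local disk
    data = urllib.request.urlopen(url).read()
    cached.write_bytes(data)
    return data

# First job on the node pays for the download; all later jobs read from disk.
toolchain = fetch("https://example.com/toolchain.tar.gz")  # placeholder URL
```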
I'm using Buildkite, which lets me run the workers myself. These are long-lived Ubuntu systems set up with the same code we use in dev and production, running all the same software dependencies. Tests are fast and it works pretty nicely.
Self-hosted runners are brilliant, but have a poor security model for running containers or building them within a job. Whilst we're focusing on GitHub Actions at the moment, the same problems exist for GitLab CI, Drone, Bitbucket and Azure DevOps. We explain why in the FAQ (link in the post).
That is a benefit over DinD and socket sharing; however, it doesn't allow for running containers or K8s itself within a job. Any tooling that depends on running "docker" (the CLI) will also break or need adapting.