I have a large Docker image that sometimes fails to build because of its size.
I'm using [Docker Build with Cache Action](https://github.com/whoan/docker-build-with-cache-action) for build caches, but sometimes this will fail because the cache plus the new layers is more than 14 GB.
I'm trying to catch this error and rebuild without the cache, but it seems some Docker artifacts (images, layers, or similar) linger from the first attempt. So I added a step that runs `docker system prune --force`, which only reclaimed around 700 MB. That wasn't enough, so I tried a more aggressive command, `docker system prune --force --all --volumes`. This appears to have removed some components that GitHub Actions requires, because the next step fails with:
```
Unable to find image 'e87b52:0c9a57e74a414c4bbe60bd043fcbf313' locally
```
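The relevant part of my workflow looks roughly like this (the job/step names, image name, and action version tag are illustrative, not my exact configuration):

```yaml
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Build with cache
        id: cached-build
        continue-on-error: true         # don't fail the job yet; record the outcome
        uses: whoan/docker-build-with-cache-action@v5
        with:
          image_name: my-image          # illustrative image name
      - name: Clean up before retry
        if: steps.cached-build.outcome == 'failure'
        run: docker system prune --force --all --volumes
      - name: Build without cache
        if: steps.cached-build.outcome == 'failure'
        run: docker build --no-cache -t my-image .
```

`steps.<id>.outcome` is the step's result before `continue-on-error` is applied, which is why the fallback steps can see the failure even though the job keeps running.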
I noticed you are using GitHub-hosted runners for your jobs. Each virtual machine typically provides about 14 GB of SSD disk space (see here); the actual available space may be a little more or less.
Because the layers of your image build total more than 14 GB, the build can easily exceed this limit and fail with the error "No space left on device".
I think using the `docker system prune` command to remove unused data may not help much. It's unlikely to free more than 14 GB, because the disk size is fixed when the runner starts up. And, as you mentioned, you may mistakenly remove important data or dependencies.
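To see how much space a prune actually reclaims, you can add a diagnostic step (the step name here is illustrative) before and after the cleanup:

```yaml
- name: Report disk usage
  run: |
    df -h /            # overall free space on the runner's root disk
    docker system df   # space used by images, containers, volumes, build cache
```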
Currently, as a workaround, I recommend using self-hosted runners for these jobs. You can install self-hosted runners on your own machines (or VMs) that have more disk space available.
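Pointing a job at a self-hosted runner is just a matter of changing `runs-on` (the `self-hosted` label is the default one GitHub assigns; your runner may use additional custom labels):

```yaml
jobs:
  build:
    runs-on: self-hosted   # routes the job to your own machine/VM
```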
I'm hoping to avoid self-hosted runners unless absolutely necessary.
The issue I experience is that if I build the image without caching, it works, but takes 25 minutes.
If I build with caching, it takes 5 minutes.
If I build after a change early in the process, the cache is downloaded anyway, and the build runs out of disk space because the cache plus the new layers is too much.
Here's what I'm hoping to achieve:

1. Build the image using the cache.
2. If that fails (for example, from running out of disk space), clean up.
3. Build the image again without the cache.
What I'm having trouble with is that step 3 fails because there are leftover Docker layers (or something) from step 1.
If I just build without a cache it works fine, but it always takes 25 minutes. I want both the cache and a fallback for when the cache is too big!
I have a workaround you could consider: split the cached build and the uncached fallback build you mentioned above into two jobs, so that they run on two separate runners.
For example, as a simple demo:

```yaml
jobs:
  job1:
    # steps that build the image with a cache
  job2:
    needs: job1        # run after job1
    if: failure()      # only when an upstream job (job1) has failed
    # steps that build the image without a cache
```
You just need to check the status of job1 in the `if` conditional of job2. Because each job in a workflow executes in a fresh instance of the virtual machine, you don't need to clean the cache or free up disk space.
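Filled out with the action from the question, the two-job workflow could look roughly like this (the image name and action version tag are assumptions to adjust for your setup):

```yaml
name: build

on: push

jobs:
  job1:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Build with cache
        uses: whoan/docker-build-with-cache-action@v5
        with:
          image_name: my-image   # illustrative image name

  job2:
    runs-on: ubuntu-latest
    needs: job1
    # failure() is true when a job earlier in the dependency chain failed;
    # this fresh VM has no leftover layers or cache from job1.
    if: failure()
    steps:
      - uses: actions/checkout@v4
      - name: Build without cache
        run: docker build --no-cache -t my-image .
```

Note that a plain expression like `needs.job1.result == 'failure'` would not fire on its own, because GitHub implicitly prepends `success()` to `if` conditionals that contain no status-check function; using `failure()` (or `always() && ...`) avoids that pitfall.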