Our Linux builds have started to fail because they are running out of disk space. While investigating the cause of the failures, I found that the amount of disk space on the Ubuntu images has been declining on a weekly basis. Here is what gets displayed in our logs for
df -h within a typical Docker container that we're using for our builds, which takes about 4 GB of disk space once everything is installed:
https://github.com/tensorflow/java/runs/515298915 (Mar 18, 2020)
https://github.com/tensorflow/java/runs/559103432 (Apr 4, 2020)
https://github.com/tensorflow/java/runs/581350731 (Apr 13, 2020)
So it looks like we're losing about 1 GB of space every week; at this rate we'll reach 0 GB in about 3 months, probably preventing the images from booting at all.
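To catch this kind of regression earlier, a diagnostic step along these lines could run at the start of each job. This is only a sketch: the 4 GB threshold mirrors our container's approximate footprint and is an assumption, as is GNU df with --output support on the Ubuntu runners.

```shell
# Sketch: log free space on / and warn when it drops below ~4 GB,
# the approximate footprint of our build container (assumed threshold).
avail_kb=$(df --output=avail -k / | tail -n 1 | tr -d ' ')
echo "Available on /: ${avail_kb} KB"
if [ "${avail_kb}" -lt $((4 * 1024 * 1024)) ]; then
  echo "WARNING: less than 4 GB free; the build will likely fail" >&2
fi
```

Logging this on every run would have made the week-over-week decline visible without digging through old build logs.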
Hi @saudet,
I checked your workflow YAML file; you are using the container nvidia/cuda:10.1-cudnn7-devel-centos7 to run your job.
Based on my test, the decreased disk space is taken up by this image. Please see my job YAML and workflow logs:
- name: df
  run: |
    docker system df
    docker run nvidia/cuda:10.1-cudnn7-devel-centos7
    docker system df
The size of the image is 3.5 GB.
Yes, I know, and I did mention that the container takes about 4 GB of space. The point is that 13 GB of space is less than the 14 GB that is supposed to be guaranteed:
Each virtual machine has the same hardware resources available.
- 2-core CPU
- 7 GB of RAM memory
- 14 GB of SSD disk space
If you only guarantee 13 GB, then the documentation should be updated. Do you agree?
As of at least yesterday, we’re now down to about 9 GB:
https://github.com/tensorflow/java/runs/599744540 (Apr 20, 2020)
overlay 84G 80G 3.7G 96% /
I doubt very much we’re supposed to have so little disk space to work with.
I had seen the same thing on our builds that started to fail. The following values are after
docker system prune:
/dev/sda1 84G 74G 9.9G 89% /
/dev/sda1 84G 70G 15G 84% /
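For anyone hitting the same failures, this is roughly the cleanup we run before each build to reclaim what space we can. A sketch only: it assumes the Docker CLI is on PATH and skips the prune otherwise.

```shell
# Sketch: reclaim space from stopped containers, dangling images, unused
# networks, and build cache, then record the host's view of the disk.
if command -v docker >/dev/null 2>&1; then
  docker system prune --force   # non-interactive cleanup of unused data
  docker system df              # show per-type Docker disk usage after pruning
fi
df -h /                         # compare this line across runs to spot the decline
```

Pruning only buys back a few gigabytes, though; it doesn't address the shrinking baseline of the runner image itself.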
So they fixed whatever they temporarily broke (I hope). It would have been nice to have the problem acknowledged, so that we knew a fix was underway.
You can see in the logs in my screenshot above that /dev/sdb1 is 14 GB with 41 MB used, yet the Avail column shows 13 GB. That figure is not accurate.
Sorry for the bad experience. There is an open issue in the virtual-environments repo, and the engineering team is working on a fix for it: https://github.com/actions/virtual-environments/issues/709#issuecomment-616767507
Please wait for some time. Kindly let me know your current status.