Windows builds failing randomly

I have builds set up for Windows (windows-latest) and macOS - it’s a matrix, so 2 builds run for each.

Sometimes one of the Windows ones fails and cuts off messages like this:

2020-04-19T08:57:16.6185991Z -- Configuring done
2020-04-19T08:57:19.0547176Z -- Generating done
2020-04-19T08:57:19.6715826Z CMake Warning:
2020-04-19T08:57:19.6918171Z ##[error]Process completed with exit code 1.
2020-04-19T08:57:19.6939372Z Cleaning up orphan processes

…but the other Windows build works. Then I re-run everything using the “re-run” button and they all pass (or they swap and the other Windows one fails this time).

Seems like something is killing the process randomly but don’t have access to any additional info - just the cut-off logs.

This is happening frequently enough that it’s annoying because the macOS ones randomly get “cancelled by user” when nobody has touched it as well :slight_smile:

Windows builds take ~20-25 minutes, so manually re-running all the time is not a good option.

Any idea how to troubleshoot this?

1 Like

Could you please enable runner diagnostic logging and enable step debug logging in your workflow? Let’s check whether there are more information in detail. And if you don’t mind, please share a trouble workflow run with debug logs here?  

Please note that, the debug logging will be enabled for all the workflows in the current repository, after you getting a debug log for a failure workflow run, you could disable it.  

Thank you for the reply. I don’t have admin access to set that up. I’ll have to try to get the organization admin to do it.

In the meantime, here’s one that failed like this (without the extra debug information). And here is the workflow YAML.

When I re-run these, they usually pass the first time, but sometimes it takes a couple of tries.

I notice that it’s (often? always?) when it’s doing cmake, so I wonder if it’s a bug in the Windows cmake that’s installed.

I’ll update when I can get that extra information.

Since I can’t seem to attach anything here in the forum I uploaded the logs over here.

I took a look through them and can’t see anything obvious.

The one that failed this time is Windows Latest MSVC SCALAR_DOUBLE=OFF in step 7_Configure CMake.txt (sometimes it’s Windows Latest MSVC SCALAR_DOUBLE=ON that fails though).

...
2020-04-21T22:54:16.1043114Z -- QHULL found (include: C:/Miniconda3/envs/ccenv/Library/include, lib: optimized;C:/Miniconda3/envs/ccenv/Library/lib/qhullstatic.lib;debug;C:/Miniconda3/envs/ccenv/Library/lib/qhullstatic.lib)
2020-04-21T22:54:16.1043705Z -- looking for PCL_COMMON
2020-04-21T22:54:16.1058916Z -- Found PCL_COMMON: C:/Miniconda3/envs/ccenv/Library/lib/pcl_common_release.lib
2020-04-21T22:54:16.8933755Z CMake Warning (dev) at C:/Program Files/CMake/share/cmake-3.17/Modules/FindPackageHandleStandardArgs.cmake:272 (message):
2020-04-21T22:54:19.1474864Z ##[error]Process completed with exit code 1.
2020-04-21T22:54:19.1478756Z ##[debug]Finishing: Configure CMake

I don’t see anything in the additional Runner and Worker logs that looks strange, but I don’t know what I’m looking for exactly.

I have forked your repo to my org. I met the same error as yours. I am trying to involve github engineering team to further investigate this issue. There might be some time delay. Appreciate your patience.

Great - thank you for your time.

This was a very large time sink yesterday as I had to re-run dozens of times.

If you (or they) have any suggestions on how to reduce the build time on Windows (~25 minutes) I’d love to hear that as well :slight_smile:

I have another run that keeps failing - I’ve tried it at least a dozen times today. I know it should pass because it passed the PR request stage, was merged, and now the merged CI keeps failing :slight_smile:

I turned on the logging and found this in one of the “worker” logs - maybe it will help?

[2020-04-23 17:14:49Z INFO HostContext] No proxy settings were found based on environmental variables (http_proxy/https_proxy/HTTP_PROXY/HTTPS_PROXY)
[2020-04-23 17:14:49Z INFO HostContext] Well known directory 'Bin': 'C:\runners\2.169.1\bin'
[2020-04-23 17:14:49Z INFO HostContext] Well known directory 'Root': 'C:\runners\2.169.1'
[2020-04-23 17:14:49Z INFO HostContext] Well known config file 'Credentials': 'C:\runners\2.169.1\.credentials'
[2020-04-23 17:14:49Z INFO Worker] Version: 2.169.1
[2020-04-23 17:14:49Z INFO Worker] Commit: 4f840647b375b65a4efa361abc1b84f783442de6
[2020-04-23 17:14:49Z INFO Worker] Culture: en-US
[2020-04-23 17:14:49Z INFO Worker] UI Culture: en-US
[2020-04-23 17:14:49Z ERR Worker] System.ArgumentException: Handle has been disposed or is invalid. (Parameter 'pipeHandleAsString')
   at System.IO.Pipes.AnonymousPipeClientStream..ctor(PipeDirection direction, String pipeHandleAsString)
   at GitHub.Runner.Common.ProcessChannel.StartClient(String pipeNameInput, String pipeNameOutput)
   at GitHub.Runner.Worker.Worker.RunAsync(String pipeIn, String pipeOut)
   at GitHub.Runner.Worker.Program.MainAsync(IHostContext context, String[] args)

Thank you for looking deep into this issue . I am still trying to contact to senior engineer , your patience would be appreciated. 

Thank you.

We have had to re-run these things dozens of times (and you can’t just re-run the ones that fail - it’s all-or-nothing), so not only has this been very frustrating, it has been a major time sink and a waste of build resources at GitHub.

Combine this with the macOS runners randomly deciding someone cancelled them, and it shakes my confidence in Actions as a CI system. It kind of goes without saying that we need to be able to trust this and rely on it to be rock-solid.

1 Like

Hi @asmaloney !

Thank you for your patience.

Could you please check the investigation results below and see if it helps to solve the issue?

It turned out that by default CMake is not displaying its internal errors, and in order to find a reason of the fail we changed the workflow to show the CMake logs and saw there’s a compilation error from the C++ build itself

CMakeError.log content
Performing C++ SOURCE FILE Test COMPILER_HAS_DEPRECATED_ATTR failed with the following output:
Change Dir: D:/a/CloudCompare/CloudCompare/build/CMakeFiles/CMakeTmp
Run Build Command(s):C:/Miniconda3/envs/ccenv/Library/bin/ninja.exe cmTC_144a7 && [1/2] Building CXX object CMakeFiles\cmTC_144a7.dir\src.cxx.obj
FAILED: CMakeFiles/cmTC_144a7.dir/src.cxx.obj
C:\PROGRA~2\MICROS~1\2019\ENTERP~1\VC\Tools\MSVC\1425~1.286\bin\Hostx64\x64\cl.exe /nologo /TP /DWIN32 /D_WINDOWS /W3 /GR /EHsc -DCOMPILER_HAS_DEPRECATED_ATTR /MDd /Zi /Ob0 /Od /RTC1 -std:c++14 /showIncludes /FoCMakeFiles\cmTC_144a7.dir\src.cxx.obj /FdCMakeFiles\cmTC_144a7.dir\ /FS -c src.cxx
src.cxx(1): error C2065: ' __deprecated__': undeclared identifier
src.cxx(1): error C4430: missing type specifier - int assumed. Note: C++ does not support default-int
src.cxx(1): error C2448: ' __attribute__': function-style initializer appears to be a function definition
ninja: build stopped: subcommand failed.

The diff to show logs: dsame/CloudCompare@e8248d4 (idea is to surround cmake… with try…catch and output all the logs)
The sample build: https://github.com/dsame/CloudCompare/runs/645255014?check_suite_focus=true

Thanks for looking into it Alena.

…but the build you’re showing didn’t fail. What you’re showing here (and in the log) looks like the standard way CMake tests for compiler attributes. In that case it’s supposed to fail.

(The code it points at is cmake’s code for testing compilers for specific atttributes - not our code.)

The build works on a real Windows machine (the rest of the team are using them).

Right now with master it usually fails several times before it builds properly. Here’s a more recent example:

https://github.com/CloudCompare/CloudCompare/runs/618602537

If I keep re-running that one it will pass eventually.

So maybe try re-running it until it fails?

Did you see my comment above about the worker logs and pipes? That seems more likely to be an issue.

Your comment makes sense. We were actually able to reproduce the issue from your forked repo, and the only errors we were able to get are the ones in the build I provided in the answer above (that build I attached did not fail because of applied logic in try…catch):

CheckIncludeFile.c(1): fatal error C1083: Cannot open include file: 'pthread.h': No such file or directory
ninja: build stopped: subcommand failed.

Could you please add the suggested patch to your code, and provide the logs you will get? it will help us to understand what is happening in your build. 

> did not fail because of applied logic in try…catch

I’m sorry, but that’s not the case. As I mentioned above, the things you are showing as errors here will not cause the build to fail. They are normal - it’s the way that cmake determines what’s available on the machine.

So I don’t think you have reproduced the error with the try…catch yet. It fails in the cmake config process, so it never starts the build step. What about the other error I mentioned with the runner? Again - it seems a likely candidate.

I will see what else I can do at my end, but please try to reproduce it again. The symptom is (I think) that cmake is trying to produce a warning but instead something (runner?) produces an error which kills the build. It doesn’t happen all the time, but it happens very frequently. I’ve had to run them 16 or 17 times in some cases to get it to pass.

The reason builds fails is either CMake or another tool invoked by CMake writes to stderr. PowerShell triggers terminating exception then.

The reason build fails randomly is a race condition we did not identified exactly becaues the build system is too complex and not obvious. 

In order to identify the exact the problem of CMake fail i suggest you to apply the code of this PR: https://github.com/CloudCompare/CloudCompare/pull/1232 which intended to intercept the stderr stream and to do not let it miss the logs and PowerShell output. 

With this changes the CMake Configure will not break on the warnings but print them to the build log, so you can analyze them and find out the unexpected behaviour. 

The sample build: https://github.com/dsame/CloudCompare/runs/694200554?check_suite_focus=true

The messages cause PowerShell to fail becaues of CMake writes to stder are marked with “ERR” prefix:

Please, review the suggested PR and confirm it can help effectively to solve the problem

There’s a slightly modified build.yml which fails in case if there is any sdterr otput from the build. It seems to be too strict because but you might find it usefull

https://github.com/dsame/CloudCompare/blob/9888fc2a5e26d7c8bb0059f79f6d2bd1db881921/.github/workflows/build.yml

Please try to apply it your build in order to see the stderr output that was lost in your build and find out the differences in success and failed builds. 

Alternatively, as the recent Windows images have bash installed i’d advise to switch to it from PowerShell:
https://github.com/CloudCompare/CloudCompare/compare/CloudCompare:master…dsame:dsame/use-cmd?expand=1

With the bash you should not have the problem with CMake output to stderr

I like the idea of switching to bash since I’m completely unfamiliar with PowerShell anyways. It seems like the easiest option. I’ll do that and report back.

Thanks for your help Sergey!

After switching from PowerShell to bash as suggested by @dsame, we have not had any of the random failures on Windows we were seeing before.

Solution: Don’t use PowerShell.

Thanks @dsame, @alenasviridenko and @yanjingzhu for looking into it.