I developed a small web scraper that assembles data published on a government-run website, and writes it to one big CSV file (in order to facilitate analysis of that data).
I wanted this work to take place in GitHub Actions and the file to be published in the same Git repo.
The scope of work is not huge - making ~1000 HTTP requests to a server in Europe and producing a CSV file ~9 MB in size.
To view the repo in question follow any of the links below.
Run 1 - timed out after 6 hours of very slow execution - link
- Trigger type: schedule CRON expression (started at 23:45 UTC)
- Logging: App had debug logging enabled, which produced a build log file of 16 MB (uncompressed). So a little noisy.
Logs show something like this (lines truncated, every DEBUG statement is ~7000 characters long):
2020-11-17T23:46:36.0512253Z > Task :runScraper 2020-11-17T23:46:36.0515677Z 57 [main] [DEBUG] LoggingConnectionDecorator - Request ... 2020-11-17T23:46:37.0697393Z 1237 [main] [DEBUG] LoggingConnectionDecorator - Request ... 2020-11-17T23:46:49.7661161Z 2456 [main] [DEBUG] LoggingConnectionDecorator - Request: ... 2020-11-17T23:47:21.6486301Z 3516 [main] [DEBUG] LoggingConnectionDecorator - Request: ... 2020-11-17T23:47:54.3218651Z 4593 [main] [DEBUG] LoggingConnectionDecorator - Request: ... 2020-11-17T23:48:10.0014316Z 5609 [main] [DEBUG] LoggingConnectionDecorator - Request: ...
The numbers between the timestamp provided by GitHub Actions (leftmost) and
[main] (i.e. 57, 1237, 2456, 3516, …) are the number of milliseconds since application start.
So while my application believes it is executing 1 request every ~1s, in reality it takes: 1s, 12s, 32s, 33s, 16s, … (and all further requests take ~30s each), which adds up to 6 hours and the job terminates due to exceeding the 6 hour limit.
There was also a manually triggered development-time run with debug logging enabled and limited data set (2 requests only) run at 23:23 UTC and it also showed ~15s real-world time per request.
Run 2 - successful after ~17 minutes - link
- Trigger type: manual (started at 09:12 UTC)
- Logging: App had debug logging disabled, which produced barely any logs (~10 lines)
Looking at logs:
2020-11-18T09:15:54.3774218Z 147601 [main] [INFO] Main - Found 7625 permits for region Alytaus TP 2020-11-18T09:18:12.2767310Z 285516 [main] [INFO] Main - Found 7636 permits for region Kauno TP 2020-11-18T09:19:41.2766749Z 374497 [main] [INFO] Main - Found 5084 permits for region Klaipėdos TP 2020-11-18T09:21:48.2766867Z 501502 [main] [INFO] Main - Found 6869 permits for region Marijampolės TP 2020-11-18T09:21:48.8769642Z 502107 [main] [INFO] Main - Found 0 permits for region Miškų kontrolės skyrius 2020-11-18T09:25:15.3768629Z 708626 [main] [INFO] Main - Found 9635 permits for region Panevėžio TP 2020-11-18T09:27:03.7768508Z 817006 [main] [INFO] Main - Found 5820 permits for region Šiaulių TP 2020-11-18T09:28:33.7768503Z 907056 [main] [INFO] Main - Found 4824 permits for region Utenos TP 2020-11-18T09:30:01.5765216Z 994796 [main] [INFO] Main - Found 5173 permits for region Vilniaus TP 2020-11-18T09:30:01.5766979Z 994800 [main] [INFO] Main - Retrieved results: 52666
Some calculations are required to interpret timings here.
Each HTTP request results in 50 permits, so 52666 permits in total are roughly 1050 HTTP requests, each one on average taking ~1s of application time (i.e. same as in the 1st run).
However, in this run the “real-world time” is mostly in line with application time:
- 847199 (=994800-147601) application time ms between the first and the last logged statement are 11,12 minutes of application time apart
- Real-world time between first and last logged statement is around 14 minutes (from 09:15:54 to 09:30:01)
What could cause ~25x difference in execution time between Runs 1 and 2?
- Logging? Although I would not expect logging ~7000 chars to produce a slowdown of at least 20 seconds.
- Different job start times? I read that the shared workflow runners are hosted in US, and “Run 1” takes place somewhere towards the end of business day in US, while “Run 2” takes place during US night.
- Temporary glitch? Although GitHub status page does not show any incidents on that day, and development run with limited dataset also showed slow execution ~20 minutes before “Run 1”