If a long list of IDs would be useful to identify any trends, please let me know. Happy to dump some logs for a larger set and send them here or via the internal thread.
Thanks @mattipv4 – I think the IDs you submitted yesterday:
…aren’t as helpful as the literal
request_id values, like:
However, the patterns we have noticed (sometimes a deleted repo returns a 404, which bubbles up an error) don't seem to correlate with the ~2% of errors we're seeing as timeouts. For the timeouts, we still only have the same recommendation: decrease your counts and use pagination. But we're also trying to identify whether there truly is something we can take action on.
So if you do have more
request_id values to provide, please share them here or in the shared Slack thread. I'm in there, too =)
Looks like those might’ve been the error IDs on our end rather than yours, sorry, my bad!
Have just shared a log file with all the request IDs we saw in our last batch run in the internal thread (didn’t feel right to dump that file here).
Hey @mattipv4 – thanks again for your patience while we've reviewed both the request IDs you provided and the app behavior overall. I'm fairly confident that we've done what we can for now, so let's start with…
TL;DR: the vast majority of errors being returned are, as we initially mentioned, timeouts. There is a global timeout limit of 10s for any single GraphQL call. Query complexity and long-running field requests add up, resulting in the error message your app receives.
To address this: lower the number of PRs requested, and simplify the query by removing:
reviewDecision. Reducing the number of PRs, combined with removing any or all of the 3 fields listed here, should completely resolve the timeouts. There is also another error that gets bundled into the same external error your app receives, which you won't have any control over.
Let’s get into a bit more detail.
We profiled your query and traced the performance of individual fields, to try to isolate problem areas in either the query itself or the services that handle those respective requests. The reason for suggesting the removal of these fields:
…is that each was found to be markedly slower than the other fields in the query, all adding to the total time spent running the query. I'll talk more about what we intend to do to address this later, but for example…
We found that, at worst, these 3 fields alone could add up to 6s to the query! This isn't true in all cases, and we believe there is a caching concern that might explain why those fields sometimes take far longer than at other times. However, those were the standouts behind our recommendation to consider removing them.
Beyond completely removing those fields from this single query, we do recommend reducing the PRs requested per call from 100 to 50 and paginating through the results in the app, then lowering to 25 if you continue to see errors returned. I do know that this is what you were hoping to avoid, but with a complex GQL query and a 10s timeout that cannot be adjusted at this time, this is going to be the quickest path to resolving those timeouts.
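For reference, cursor-based pagination over PRs can be sketched roughly like this (a minimal sketch, assuming a hypothetical `run_query` executor and illustrative field names — this is not the app's actual query):

```python
from typing import Callable, Iterator

# Illustrative query shape; field names here are assumptions, not the app's real query.
PR_QUERY = """
query($login: String!, $first: Int!, $after: String) {
  user(login: $login) {
    pullRequests(first: $first, after: $after) {
      pageInfo { hasNextPage endCursor }
      nodes { number title }
    }
  }
}
"""

def iter_pull_requests(
    run_query: Callable[[str, dict], dict],  # injected executor, e.g. an HTTP POST to the GraphQL endpoint
    login: str,
    page_size: int = 50,  # smaller pages keep each call well under the 10s limit
) -> Iterator[dict]:
    """Yield PRs one page at a time using GraphQL cursor pagination."""
    cursor = None
    while True:
        data = run_query(PR_QUERY, {"login": login, "first": page_size, "after": cursor})
        conn = data["user"]["pullRequests"]
        yield from conn["nodes"]
        if not conn["pageInfo"]["hasNextPage"]:
            break
        cursor = conn["pageInfo"]["endCursor"]
```

Injecting `run_query` keeps the paging logic independent of whichever HTTP client the app actually uses.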
Through this effort, we identified a new error on our side, which accounted for 0.218% of events in our sampled timeframe. This error gets bundled into the same error message as the timeouts, and it's worth addressing on our side, especially since it impacts more than just this app.
There's nothing for you to take action on here. But do know that because of this research, we were able to identify these errors and have begun triaging how to better handle and communicate them, if not squash them entirely.
Again, we recommend reducing the PRs requested in your query from 100 to 50. We can also recommend removing one or more of the three fields mentioned above. Either or both of these steps should help you avoid the 10s timeout limit, which is responsible for 90%+ of the errors being returned.
Unfortunately, there isn't any simple or singular action we can take internally to avoid the timeouts being experienced. The action must be to adjust the query if we're to properly address this during the Hacktober timeframe.
That said, there is action for us to take, and we’ve already begun steps to address things. Specifically:
- We're opening a new issue for the error I mentioned, which bubbles up to the same generic error message
- We're reaching out to the various DB teams to address the performance of those 3 aforementioned fields
- There is another, as-yet-unmentioned error we discovered that can add ~10ms to a request
Unfortunately, the steps we're going to take won't be completed within the month: they were only escalated in the past day, and they require resources and input from multiple teams, so they most certainly won't land before the end of the month.
I know that we were hopeful to identify an alternative to adjusting the query. Hopefully with the information above, you can see why we’re making the recommendations that we are.
Very much appreciate your patience and your willingness to provide useful data to help us understand the situation!
Please be well, and keep us updated as you make adjustments to the query/app.
Thanks for the thorough update on this, it definitely all makes sense.
Unfortunately, we need those slow fields as part of Hacktoberfest itself, and can't drop them.
I will get to work on our end to start paginating to hopefully make the requests small enough that even with the slow fields we’re not seeing the errors anymore. Will report back on how that goes in a couple of days!
Thanks once again
Good morning! As promised, an update on how things are looking:
We rolled out a change last night just before our midnight batch run to reduce our query down to 25 PRs at a time, with logic to paginate up to 100 PRs for each user as needed.
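That capped paging loop looks roughly like this (a minimal sketch; `fetch_page` is a hypothetical wrapper around the GraphQL call, and our real code differs):

```python
from typing import Callable, List, Optional, Tuple

PAGE_SIZE = 25   # PRs per GraphQL call
MAX_PRS = 100    # cap per user, matching our previous single-call limit

def fetch_capped_prs(
    fetch_page: Callable[[int, Optional[str]], Tuple[List[dict], Optional[str]]],
) -> List[dict]:
    """Collect up to MAX_PRS pull requests in pages of PAGE_SIZE.

    fetch_page(first, after) -> (nodes, end_cursor) is a hypothetical wrapper
    around the GraphQL call; end_cursor is None when no pages remain.
    """
    prs: List[dict] = []
    cursor = None
    while len(prs) < MAX_PRS:
        # Never ask for more than we still need under the cap.
        nodes, cursor = fetch_page(min(PAGE_SIZE, MAX_PRS - len(prs)), cursor)
        prs.extend(nodes)
        if cursor is None:  # no more pages for this user
            break
    return prs[:MAX_PRS]
```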
To give some context: looking at our logs, the last two runs prior to this change saw a roughly 2.5% error rate for this timeout/bug message from the API.
After the change (and this, unfortunately, is slightly more approximate, as we now have to estimate how many calls we're making based on how many PRs each user has), we're seeing about a 0.15% error rate for this error, so significantly reduced!
I'll share a dump of request IDs from that 0.15% in our Slack thread, covering the last two batch runs we did (both with the pages-of-25 change), just in case they reveal anything different from what y'all have already found.
Also, just wanted to flag that we're seeing a few instances of us having to abort the request on our end after a 60s timeout (about 0.001%, so most definitely nothing major, but wanted to flag it all the same).
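For the curious, that 60s client-side abort is essentially this pattern (a minimal sketch with a hypothetical `call_with_deadline` helper; in practice an HTTP client's own request timeout is what actually cancels the call):

```python
import concurrent.futures

CLIENT_TIMEOUT_S = 60  # hypothetical hard deadline for one batch request

def call_with_deadline(fn, *args, timeout: float = CLIENT_TIMEOUT_S):
    """Run fn in a worker thread and give up after `timeout` seconds.

    Returns (result, None) on success, or (None, "timeout") if the deadline
    passes. Note the worker thread itself isn't killed; a real HTTP client
    would abort via its own request timeout rather than a wrapper like this.
    """
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = pool.submit(fn, *args)
    try:
        return future.result(timeout=timeout), None
    except concurrent.futures.TimeoutError:
        return None, "timeout"
    finally:
        # Don't block waiting on a still-running worker.
        pool.shutdown(wait=False)
```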
Thanks for the help once again!