Handling lost or missed webhook event deliveries

How do you guys deal with webhook event deliveries that were missed or lost (whatever the cause)?

Having written a bot for our specific needs, I now understand why even CircleCI sometimes doesn’t notice a push on a PR and doesn’t create a job. Events get lost. GitHub does not attempt to resend events at all. (They do mention this in the Marketplace docs.)

A full-blown five-nines HA solution is too expensive for us.

How does GitHub handle round-robin DNS? Does it retry the other servers if one does not respond? That would save us having to run an HA load balancer / proxy with a floating-IP setup, which ends up being the most expensive part for the minimal amount of compute we need to do.

Open to suggestions here. The only alternative I see is implementing a poll-based fallback that regularly checks for things that were missed, but that’s a lot of coding and a lot of API calls we’d need to make.
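For what it’s worth, a polling fallback can stay small if it only reconciles state (which PRs are open but never handled) rather than replaying every event. A minimal sketch, assuming a `handled` set the bot already maintains (the repo name and that set are hypothetical; this uses only the public REST “list pull requests” endpoint via the standard library):

```python
import json
import urllib.request

def fetch_open_prs(repo, token=None):
    """Return the numbers of currently open PRs for `repo` ("owner/name")."""
    url = f"https://api.github.com/repos/{repo}/pulls?state=open&per_page=100"
    req = urllib.request.Request(url, headers={"Accept": "application/vnd.github+json"})
    if token:
        req.add_header("Authorization", f"Bearer {token}")
    with urllib.request.urlopen(req) as resp:
        return {pr["number"] for pr in json.load(resp)}

def find_missed(open_prs, handled):
    """PRs that are open but for which no webhook event was ever seen."""
    return open_prs - handled
```

A cron-style job could call `fetch_open_prs`, diff against the PR numbers the bot has already linted, and enqueue only the difference — one list call per poll instead of walking the whole events API.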


If you want to be 100% certain to catch every event, then you’ll have to set up a polling-based solution. If you can share some information on what you’re designing, perhaps we could give some other suggestions.

Let us know.


The project is https://github.com/bioconda - we have a largish repo of recipes from which we build a life-sciences software distribution. The app handles a bunch of things. The part I’m having trouble with is the linting: it keeps missing PR open and PR sync events for some reason. Hosting on AWS helped, but I’m still missing PRs. We’ve seen the same with CircleCI, actually. In particular, during hackathons or similar “sprints” in which lots of pushes happen, events get lost.

Do you know whether the webhook delivery will honor round-robin DNS? If so, we could just have two front-ends accepting events and stuffing things into the queue to increase availability.

Mostly, we need to be certain the linter gets to run so that we can require the check_suite to be complete for merging.

Yes, webhook delivery honors round-robin DNS. The advantage to round-robin DNS is that it is the DNS system that handles the round-robin process. The application layer isn’t aware of the work going on under the hood.

I work on a few largish repositories that are similar in size to https://github.com/bioconda/bioconda-recipes and none of them have the problems you seem to be describing with regularly missing PR open and sync events. Are you using a framework that allows for handling multiple simultaneous connections without blocking? For example, one of my applications today received these two requests within 30 milliseconds of each other:

As you can see, each of them took over a full second to process, but both of them succeeded and returned a 200 response with nothing dropped.

If you can look at the webhook history and confirm that GitHub is simply not sending the webhook events on a consistent and regular basis, please contact private support at https://github.com/contact so that we can take a deeper look at what might be going on.


@lee-dohm wrote:

Yes, webhook delivery honors round-robin DNS. The advantage to round-robin DNS is that it is the DNS system that handles the round-robin process. The application layer isn’t aware of the work going on under the hood.

I guess my question wasn’t precise enough. Some background:

Round-robin DNS with multiple A records for a domain name was historically meant for load balancing: the DNS server hands out one of the IPs at random. These days there is application-layer support that, to help with availability, tries all IPs if one does not respond. Chrome, for example, does this: if one server is down, it tries the next (similar to what multiple MX records have done for mail all along). This isn’t handled by the network layer, though, and isn’t “transparent”. It’s typically an application-layer feature provided by the HTTP library (so that it can, e.g., react to a 50x error and just try the next IP).

If you do

host microsoft.com

you should see 5 IPs. Chrome will try them all in turn if the first fails to respond. (I think most browsers do)

ping microsoft.com

on the other hand, will only try the first IP, and will use that same IP on every further invocation.

So my question was really: does the code behind GitHub’s webhook delivery backend try alternate IPs if one is not responding (like Chrome), or does it just try one (like ping)?
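To make the distinction concrete, the Chrome-style behavior described above can be emulated on the sending side in a few lines. A sketch under stated assumptions: `send` stands in for an actual HTTP POST to one address, and the function names are made up for this example:

```python
import socket

def resolve_all(host, port=443):
    """All addresses behind a name, the way a browser's resolver sees them."""
    infos = socket.getaddrinfo(host, port, proto=socket.IPPROTO_TCP)
    # Deduplicate while keeping resolver order.
    return list(dict.fromkeys(info[4][0] for info in infos))

def deliver_to_any(ips, send):
    """Try each IP in turn until one delivery succeeds (Chrome-style).

    `send` is a callable taking an IP and raising OSError on failure.
    ping-style behavior would instead be: send(ips[0]) and give up."""
    last_error = None
    for ip in ips:
        try:
            return send(ip)
        except OSError as err:
            last_error = err  # remember the failure, fall through to the next IP
    raise last_error or OSError("no addresses to try")
```

If the delivery side behaved like `deliver_to_any`, running two receiver instances behind two A records would give cheap failover with no load balancer at all; if it behaves like ping, it wouldn’t help.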

I doubt we have a load problem; on the scale of things, we are very, very tiny. It’s an availability problem. The bot uses aiohttp and has multiple threads (IIRC), so it should have no issue with a lot of concurrent requests. But it may block due to programmer error, or simply because it’s rebooting. If the webhook delivery side does try multiple IPs, I can sidestep the whole “pull missed events after restart” issue by running n>1 instances and be done with it.
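Independent of the DNS question, one way to shrink the window where the bot can block is to acknowledge each delivery immediately and do the linting work from a queue. A framework-agnostic asyncio sketch of that ack-fast/process-later pattern (the handler and worker names are illustrative; in an aiohttp app the handler body would live inside the request handler):

```python
import asyncio

async def handle_webhook(payload, queue):
    """Webhook handler body: enqueue and return immediately.

    GitHub marks the delivery successful as soon as the 200 goes out,
    so the slow work must not happen before this returns."""
    await queue.put(payload)
    return 200

async def worker(queue, results):
    """Consume queued events; slow work here cannot stall new deliveries."""
    while True:
        payload = await queue.get()
        results.append(f"linted:{payload['number']}")  # stand-in for real work
        queue.task_done()

async def main(events):
    queue, results = asyncio.Queue(), []
    task = asyncio.create_task(worker(queue, results))
    statuses = [await handle_webhook(ev, queue) for ev in events]
    await queue.join()  # wait for the backlog to drain
    task.cancel()
    return statuses, results
```

The queue also gives a natural place to persist events, so a restart can drain what was accepted but not yet processed.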

It’s good to hear that this isn’t common, though. Means I need to look to our app to see why.

The webhook delivery system only attempts to deliver each event once.

Let us know what you find and if there are any other questions we can answer.

epruesse, I’m with you… I would love it if Enterprise Cloud had a value-add, “retry webhook deliveries 3 times”.

For some of the larger-scale projects I work on, we also experience missed events and have occasionally had to write jobs that hit the events API or check the data to be eventually consistent, which is no fun.

As a compromise, we have had good luck using Azure Logic Apps (similar to AWS Lambda etc.): our hook events are delivered to Logic Apps, which are much more available than our systems were, and we can then configure retry policies or even dead-letter queueing of those events on our end for later processing. It’s not free, but it was less expensive than the alternatives.
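The retry-plus-dead-letter behavior described above can also be approximated in a few lines on the receiving side. A minimal sketch, assuming arbitrary backoff constants and a plain list as the dead-letter store (both are choices made up for this example, not anything Logic Apps exposes):

```python
import time

def process_with_retry(event, handler, dead_letter, attempts=3, base_delay=0.1):
    """Run `handler(event)`; retry with exponential backoff, then dead-letter.

    Mirrors what a retry policy does: a few spaced attempts, then park the
    event for later reprocessing instead of dropping it on the floor."""
    for attempt in range(attempts):
        try:
            return handler(event)
        except Exception:
            if attempt < attempts - 1:
                time.sleep(base_delay * 2 ** attempt)  # 0.1s, 0.2s, ...
    dead_letter.append(event)  # keep the payload for a replay job
    return None
```

A separate cron job can then replay whatever accumulated in `dead_letter`, which is the same "eventually consistent" cleanup mentioned above, just fed from a known backlog instead of the events API.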

We also miss events due to the limitations of webhooks, such as very large payload events (large commits) that are never sent, as documented.