What is geo-replication?
GitHub Enterprise 2.11 introduced geo-replication, an enhancement of our high availability configuration. Whereas the latter consists of a primary and a warm standby replica, geo-replication allows adding active replicas (geo-replicas) to the setup. In a nutshell, they are able to serve read requests for source code, binary files, and GitHub Pages. Geo-replication does not require changes on the client side and is completely transparent to users.
For clients accessing GitHub Enterprise from a distant region, geo-replication can increase the speed of specific read operations significantly. They include Git and GitHub Pages requests as well as LFS, avatar, release asset, and repository archive downloads. It does so by transparently serving those requests from a local geo-replica, meaning that developers don’t need to configure new URLs.
Faster response times improve user experience and developer happiness. They also improve the performance of automated systems like CI servers, which need to fetch or clone repositories frequently.
Furthermore, geo-replication can improve the availability of GitHub Enterprise. Because every active replica is a full replica, it can be promoted to be the new primary in case of a massive disaster.
Due to the way geo-replication works, it may slow down Git write operations to GitHub Enterprise. For this reason, geo-replication is not generally suited for scaling GitHub Enterprise horizontally to improve its overall performance. Should you need to do so, we recommend vertical scaling of all nodes. But maybe you can avoid the need to scale at all? Often you can improve GitHub Enterprise’s performance by implementing scalable CI patterns or by enabling Git rate limits.
As a side note, we do support horizontal scaling via clustering for very big installations. However, clustering is not compatible with geo-replication. If you have any questions about scaling GitHub Enterprise, please contact Business Support.
How does geo-replication work?
Topology-wise, think of GitHub Enterprise’s replication as a star network with the primary being the hub and replicas being the spokes. This structure is supposed to be transparent to users, which is why geo-replication requires a DNS service that is aware of the geographic location of each node. Amazon’s Route 53 service, for example, can resolve GitHub Enterprise’s hostname to the geo-replica that is closest to the user’s location.
Active replicas receive replication data from the primary just like every replica does. Additionally, they act as a point of presence for incoming requests. After SSL termination, they proxy web requests like GitHub UI and API requests, as well as requests they cannot handle locally, to the primary. However, they are able to serve read and write requests for Git and LFS files as well as read requests for avatars, release assets, zipped repository archives, and GitHub Pages.
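To make the split between local serving and proxying concrete, here is a minimal sketch of the decision described above. It is a toy model, not GitHub Enterprise's actual routing code: the request labels (`git`, `lfs`, `api`, and so on), the `Handling` enum, and the `replica_has_data` flag are all names made up for illustration.

```python
from enum import Enum


class Handling(Enum):
    LOCAL = "serve from the geo-replica"
    PROXY = "proxy to the primary"


# Hypothetical labels for the request types named in the text.
LOCAL_READS = {"git", "lfs", "avatar", "release_asset", "repository_archive", "pages"}
LOCAL_WRITES = {"git", "lfs"}


def route(kind: str, is_write: bool, replica_has_data: bool = True) -> Handling:
    """Decide how an active replica would treat an incoming request."""
    # Anything the replica cannot handle locally goes to the primary.
    if not replica_has_data:
        return Handling.PROXY
    allowed = LOCAL_WRITES if is_write else LOCAL_READS
    # Web UI and API requests are in neither set, so they are always proxied.
    return Handling.LOCAL if kind in allowed else Handling.PROXY
```

In this model a Git fetch is handled locally, while an API call or an avatar upload falls through to the primary.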
Technology-wise, GitHub Enterprise distributes non-Git assets asynchronously within the replication network. They can then be served transparently by the closest replica that has already received the assets. For Git replication, however, it relies on Spokes, which comes into play in both high availability and geo-replication configurations. The diagram below gives an overview of a geo-replication configuration with one active replica:
Reading from Spokes
Spokes works on the application layer, and replication happens on a best-effort basis. It is considered synchronous when changes to the primary can be mirrored to the replica. An active replica can then serve Git read requests to clients in its vicinity.
However, forwarding data from the primary to a replica may not always succeed due to network disruptions or the replica becoming unavailable. In these cases, the primary eventually marks the replica offline for all traffic and then tries to reach it again. Once it succeeds, it marks the replica online again, so that the replica can continue to receive replication data and serve as a point of presence. Spokes continuously checks replicated data for consistency and repairs missing or corrupt data asynchronously. During such repairs, an active replica proxies requests it cannot serve itself to the primary.
Writing to Spokes
Spokes forwards Git write operations to every node that is currently online. The primary then accepts or rejects the operation by waiting for a success message from every replica. As a consequence, a Git push to a geo-replication setup cannot be faster than the time it takes for the slowest replica to acknowledge reception.
Of course, the network connection to a replica may fail or the replica may fail to complete a write operation. In these cases, the primary either runs into a network timeout or receives a failure message from the replica. Luckily, a Git push only needs to succeed on the primary for the primary to accept it overall.
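The effect on push latency can be illustrated with a small toy model. It is only a sketch of the behavior described above, not how Spokes is implemented: the `ReplicaResult` type, the `push_outcome` function, and the timeout parameter are assumptions made for this example.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class ReplicaResult:
    name: str
    # Seconds until the replica answered (success or failure message);
    # None means it never answered and the primary ran into its timeout.
    response_seconds: Optional[float]


def push_outcome(
    primary_ok: bool,
    primary_seconds: float,
    replicas: List[ReplicaResult],
    timeout_seconds: float,
) -> Tuple[bool, float]:
    """Return (accepted, total wait in seconds) for one Git push."""
    waits = [primary_seconds]
    for replica in replicas:
        answered = replica.response_seconds
        if answered is None or answered > timeout_seconds:
            waits.append(timeout_seconds)  # the primary gives up on this replica
        else:
            waits.append(answered)
    # The push is as slow as the slowest online node,
    # but only the primary's own result decides acceptance.
    return primary_ok, max(waits)
```

With two healthy replicas answering after 0.35 and 0.6 seconds, the push takes at least 0.6 seconds in this model. If one of them never answers, the push is still accepted, but only after the assumed timeout has elapsed.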
The following are recommendations accrued from interactions of our Business Support team with customers implementing geo-replication.
Network performance is important
Of all network ports you may need to open, we’d like to emphasize that GitHub Enterprise replication requires all nodes to be able to communicate with each other in two ways:
- SSH on port 122 for administrative shell access.
- UDP on port 1194 for replicating data via TLS secured connections inside a tunneled private network.
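A quick way to sanity-check the first requirement is to time a TCP connection to port 122 from one node's network to another. The following is a minimal sketch using only the Python standard library; the function name is made up for this example. It deliberately does not probe UDP port 1194, because a plain UDP probe cannot tell a filtered port from a silent one.

```python
import socket
import time
from typing import Optional


def tcp_connect_seconds(host: str, port: int = 122, timeout: float = 3.0) -> Optional[float]:
    """Return the TCP connect time in seconds, or None if the port is unreachable."""
    start = time.monotonic()
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return time.monotonic() - start
    except OSError:
        # Covers refused connections, timeouts, and DNS failures.
        return None
```

Calling it with the hostname of a replica returns a rough round-trip figure for the handshake, or `None` if a firewall or routing problem blocks the port.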
In corporate VPNs, GitHub Enterprise may need to share limited network capabilities with other applications. However, Git is quite sensitive to latency. Since Spokes uses Git for replication, this also applies to the network connecting nodes in a geo-replication setup. Thus, we recommend keeping the following network properties in mind:
- High reliability: Failed network connections may lead to replicas not being able to serve requests until asynchronous repairs succeed.
- Low latency: The lower the latency, the faster data is replicated and write operations are accepted.
- High data transfer rate: The higher the data transfer rate, the faster high write loads and big uploads are handled.
For reference, the network latency of a connection from the United Kingdom to New Zealand is around 300 ms. The maximum latency between the primary and a geo-replica should not be significantly higher. Furthermore, please make sure the primary resides in the geographical middle of your geo-replication setup. This helps with averaging the distances from the primary to each geo-replica and avoids an above-average latency to a single node.
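One way to reason about the "geographical middle" is to pick the site whose worst-case latency to all other sites is lowest. The sketch below does exactly that. The site names and the `best_primary` helper are hypothetical, and apart from the 300 ms United Kingdom to New Zealand figure quoted above, the latencies are illustrative assumptions rather than measurements.

```python
from typing import Dict


def best_primary(latency_ms: Dict[str, Dict[str, float]]) -> str:
    """Pick the site with the lowest worst-case latency to every other site."""
    return min(latency_ms, key=lambda site: max(latency_ms[site].values()))


# Round-trip latencies between candidate sites, in milliseconds.
# Only the uk <-> nz value comes from the text; the rest are made up.
EXAMPLE_LATENCY_MS = {
    "uk": {"nz": 300.0, "us_west": 140.0},
    "nz": {"uk": 300.0, "us_west": 130.0},
    "us_west": {"uk": 140.0, "nz": 130.0},
}
```

For the example data, `best_primary(EXAMPLE_LATENCY_MS)` selects `us_west`, because its slowest link (140 ms) is much better than the 300 ms link that either of the other two sites would have to live with.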
The availability of a geo-replication setup is equivalent to the availability of the primary GitHub Enterprise instance. If the primary is not available, active replicas won’t work anymore. On the other hand, if all replicas go down but the primary is available, it can still serve all requests.
Because of this, it makes sense to always include a high availability replica in your setup, just as you would without geo-replication. Such a non-activated, warm standby replica should reside close to the primary to keep network latency low and replication in sync. In case of a disaster, the failover process is the same as in a high availability configuration, too.
However, in case of a complete datacenter loss in one region, geo-replication offers the advantage of failing over to an active replica in a different geographic area. This may be of special interest if the primary and the high availability replica reside inside the same datacenter. After failover the new primary operates as a standalone instance and previous replicas can be re-added.
Based on a high availability configuration, we recommend adding a second replica in its target region, according to step 2 of this guide, and leaving it inactive. You should then configure the new replica to have its own datacenter (step 3), that is, any value other than default if you haven’t configured any datacenter yet. Datacenters in this sense are really about geographical regions. For example, the primary and its high availability replica should share a datacenter, whereas every geo-replica should have its own.
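The datacenter rule can be expressed as a small consistency check over a planned topology. This is only a sketch: the node dictionaries, the role labels, and the `datacenter_problems` helper are invented for illustration and do not correspond to any GitHub Enterprise tool or file format.

```python
from collections import Counter
from typing import Dict, List


def datacenter_problems(nodes: List[Dict[str, str]]) -> List[str]:
    """Return violations of the layout rule: the primary and its HA replica
    share a datacenter, and every geo-replica has a datacenter of its own."""
    problems = []
    primary_dc = next(n["datacenter"] for n in nodes if n["role"] == "primary")
    geo_dcs = Counter(n["datacenter"] for n in nodes if n["role"] == "geo_replica")
    for node in nodes:
        name, role, dc = node["name"], node["role"], node["datacenter"]
        if role == "ha_replica" and dc != primary_dc:
            problems.append(f"{name}: HA replica should share the datacenter of the primary")
        elif role == "geo_replica" and dc == primary_dc:
            problems.append(f"{name}: geo-replica should not use the datacenter of the primary")
        elif role == "geo_replica" and geo_dcs[dc] > 1:
            problems.append(f"{name}: datacenter {dc} is shared with another geo-replica")
    return problems
```

A plan with a primary and an HA replica in `default` plus geo-replicas in, say, `emea` and `apac` yields an empty list, while a geo-replica left in `default` is reported.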
After issuing the ghe-repl-start command, it may take a long time until the new replica is in sync with the primary. The duration mainly depends on the amount of data that needs to be transferred and the network performance. To shorten this time, you may want to restore a backup of the primary to the new replica. Doing so cuts down replication to changes since the backup was taken.
The Replication overview in the management console shows the average ping times from the primary to every replica. We also recommend measuring the data transfer rate to the new replica. Once you’re content with the network performance, you may activate the new geo-replica (steps 4 & 5). It is now able to serve local clients and you may add another geo-replica, if desired. The following screenshot shows the Replication overview of an example geo-replication setup with one high-availability replica and two geo-replicas.
Geo-replication helps to increase reading speeds in distant regions. This improves build times as well as coding fun. Watch out for network performance to the geo-replicas, though; it’s crucial. And, nope: geo-replication is usually not an appropriate mitigation for heavy polling.
Do you have questions regarding geo-replication? Reach out to Business Support!
What are your experiences with geo-replication? Feel free to comment below!