Understanding your graphs part 1 - System Health
A GitHub Enterprise virtual appliance consists of individual services running on a customized Linux operating system. Monitoring system resources such as CPU and memory (RAM), along with GitHub application and system service metrics, can help GitHub Enterprise administrators identify performance bottlenecks or unusual activity trends.
The GitHub Enterprise Management Console includes a Monitor dashboard located at http(s)://[hostname]/setup/monitor. This dashboard displays graphs created with data gathered by the built-in collectd service. Data used in the graphs is sampled every 10 seconds.
Each graph has an informational tooltip describing the graph, which is accessible by hovering over or clicking the i in the upper-left corner of each graph.
Graph data can also be forwarded to an external receiver by enabling collectd forwarding within the GitHub Enterprise Management Console. This allows you to build customized dashboards and alerts for your GitHub Enterprise graph data.
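If you do enable forwarding, the receiving end needs a collectd instance listening for network traffic. A minimal sketch of the receiver-side collectd.conf (the listen address here is a placeholder; 25826 is collectd's default network port):

```
# Receiver-side collectd.conf sketch -- the listen address is a placeholder
LoadPlugin network

<Plugin network>
  # 25826 is collectd's default network port
  Listen "0.0.0.0" "25826"
</Plugin>
```

From there, other collectd write plugins (for example, write_graphite) can feed the forwarded metrics into your own dashboarding and alerting stack.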
This article series will explore what each of the dashboard sections covers and which specific graph trends to watch out for. As each GitHub Enterprise system is unique in its user patterns and integrations, we encourage administrators to reach out to the GitHub Enterprise support team for help interpreting your specific instance’s monitor graphs if questions arise. The graph data is included within appliance Support Bundles, which can be shared with our support team.
The system health graphs provide a general overview of services and system resource utilization. The CPU, Memory, and Load Average graphs are useful for identifying trends or times when provisioned resources have been saturated.
- Abnormally high CPU utilization or prolonged spikes can mean your instance is under-provisioned.
- In the above example, the CPU was nearly 100% consumed by `user` for a period of time.
- Presence of CPU “steal” time on the CPU graph can be an indication that other virtual machines running on the same host system are saturating the underlying resources, causing the GitHub Enterprise system to wait for CPU cycles.
- User and System are generally the largest consumers of CPU time.
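To see the same breakdown outside the dashboard, the aggregate `cpu` line in `/proc/stat` can be split into the categories the graph plots. A rough illustrative sketch (the helper below is ours, not part of GitHub Enterprise):

```python
# Sketch: break down the aggregate "cpu" line from /proc/stat into the
# same categories the CPU graph plots (user, system, steal, and so on).
def cpu_breakdown(stat_line):
    """Return {field: percent of total jiffies} for a /proc/stat 'cpu' line."""
    fields = ("user", "nice", "system", "idle", "iowait",
              "irq", "softirq", "steal")
    values = [int(v) for v in stat_line.split()[1:1 + len(fields)]]
    total = sum(values)
    return {name: 100.0 * v / total for name, v in zip(fields, values)}

# Example: a mostly idle system showing some steal time
sample = "cpu 4000 0 1000 90000 500 0 0 4500"
breakdown = cpu_breakdown(sample)
```

In practice you would diff two snapshots taken a few seconds apart, since the counters in `/proc/stat` are cumulative since boot; any sustained `steal` percentage is worth investigating with your virtualization platform.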
- The Linux kernel provides a layer of in-memory disk caching, which is represented by “cache” on the graph. It is perfectly normal, and recommended, to have at least a few GB of cache overhead. The system will attempt to cache as much as possible, but applications can take this memory on demand. Because of this, we consider the total amount of available memory to be the sum of the “cached” and “free” values.
- Running out of available free + cached memory can lead to out of memory (OOM) events, causing services to terminate and unexpected application behavior.
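The free + cached definition above can be checked directly against `/proc/meminfo`. A sketch (this helper is illustrative only; note that modern kernels also expose a `MemAvailable` estimate that accounts for reclaimable memory more precisely):

```python
# Sketch: compute "available" memory as free + cached, mirroring how the
# Memory graph treats the page cache. Values in /proc/meminfo are in kB.
def available_kib(meminfo_text):
    """Sum the MemFree and Cached values from /proc/meminfo-style text."""
    wanted = {"MemFree", "Cached"}
    total = 0
    for line in meminfo_text.splitlines():
        key, _, rest = line.partition(":")
        if key in wanted:
            total += int(rest.split()[0])
    return total

sample = """MemTotal:       16384000 kB
MemFree:         1024000 kB
Cached:          4096000 kB"""
available = available_kib(sample)
```

Watching this sum trend toward zero is an early warning of the OOM events described above.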
- System Load Average is a measurement of the running task demand on the system. We recommend monitoring the fifteen-minute (`longterm`) system load average for values nearing or exceeding the number of CPU cores allocated to the virtual machine.
- When the load average rises above the number of CPU cores, it generally means that tasks are needing to wait for resources before they can run.
- Assuming the above example graph is a GitHub Enterprise system with 2 CPU cores, we can determine that processes are often waiting for resources.
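The fifteen-minute check described above is easy to script against `/proc/loadavg`. A sketch, with an assumed 90% threshold for "nearing" saturation:

```python
# Sketch: flag when the 15-minute ("longterm") load average nears or
# exceeds the number of CPU cores, per the guidance above.
def load_saturated(loadavg_line, cores, threshold=0.9):
    """True if the 15-minute load average is at or above threshold * cores."""
    longterm = float(loadavg_line.split()[2])  # fields are: 1m 5m 15m ...
    return longterm >= threshold * cores

# Example /proc/loadavg contents for a 2-core instance
sample = "2.45 2.10 1.95 3/412 21456"
saturated = load_saturated(sample, cores=2)
```

With 2 cores, a longterm value of 1.95 trips the threshold, matching the example interpretation above; the same line on an 8-core instance would not.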
- By clicking on `running` in the legend at the bottom of the graph, we can isolate different process states. In the above example we have selected `running`.
- The running process count will fluctuate with system activity. Sharp changes or drops could be expected depending on usage trends.
- Large or consistent numbers of `zombie` processes may indicate a service problem.
- It is expected to have processes in the `sleeping` state during normal operation.
- This graph represents the `max` number of open files, as well as the current number of `used` files.
- On a healthy system, the number of `used` files should never reach the `max` value. Reaching the `max` can indicate problems with a GitHub Enterprise service.
- Limiting maximum open files is a protection built into Linux to prevent runaway processes from impacting other services on the system.
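These same numbers can be read from `/proc/sys/fs/file-nr`, which reports three values: allocated handles, allocated-but-unused handles, and the system-wide maximum (field meanings per the proc(5) man page; the helper below is an illustrative sketch, not a GitHub Enterprise tool):

```python
# Sketch: parse /proc/sys/fs/file-nr, which holds three numbers:
# allocated handles, allocated-but-unused handles, and the system max.
def file_handle_usage(file_nr_text):
    """Return (used, max, percent_used) from /proc/sys/fs/file-nr text."""
    allocated, unused, maximum = (int(v) for v in file_nr_text.split())
    used = allocated - unused  # handles actually in use
    return used, maximum, 100.0 * used / maximum

used, maximum, pct = file_handle_usage("5120 0 262144")
```

On modern kernels the second field is typically 0; a `pct` trending toward 100 is the warning sign described above.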
- The `fork_rate` trend greatly depends on system activity, and will reach values upwards of 1000-2000 on busy systems.
- Large spikes beyond the observed averages should be investigated.
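A fork rate can be approximated from the cumulative `processes` counter in `/proc/stat`: take two snapshots a fixed interval apart and divide the difference by the interval. A sketch (using the dashboard's 10-second sampling interval as the example):

```python
# Sketch: forks per second between two /proc/stat snapshots, using the
# cumulative "processes" (total forks since boot) counter.
def fork_rate(stat_before, stat_after, interval):
    """Average forks/second over `interval` seconds between two snapshots."""
    def forks(text):
        for line in text.splitlines():
            if line.startswith("processes "):
                return int(line.split()[1])
        raise ValueError("no 'processes' line found")
    return (forks(stat_after) - forks(stat_before)) / interval

before = "cpu 1 2 3 4\nprocesses 100000\n"
after = "cpu 5 6 7 8\nprocesses 112000\n"
rate = fork_rate(before, after, interval=10)
```

A sustained rate well above your instance's observed baseline is the kind of spike worth investigating.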
Continue the conversation
There’s more to come in the “Understanding your graphs” mini-series. If you’d like to follow along, just subscribe to the “Understanding your graphs” label (link below). Please let us know if you have any questions in the comments.