Understanding your graphs part 1 - System Health

GitHub Staff

 

A GitHub Enterprise virtual appliance consists of individual services configured to run on a customized Linux operating system. Monitoring system resources such as CPU and memory (RAM), along with GitHub application and system service metrics, can help GitHub Enterprise administrators identify performance bottlenecks or unusual activity trends.

 

The GitHub Enterprise Management Console includes a Monitor dashboard located at http(s)://[hostname]/setup/monitor. This dashboard displays graphs created with data gathered by the built-in collectd service. Data used in the graphs is sampled every 10 seconds.

Each graph has an informational tooltip describing the graph, which is accessible by hovering over or clicking on the "i" in the upper left corner of the graph.

 

Graph data can also be forwarded to an external receiver by enabling collectd forwarding within the GitHub Enterprise Management Console. This lets you build customized dashboards and alerts from your GitHub Enterprise graph data.

 

This article series will explore what each of the dashboard sections covers and which specific graph trends to watch out for. As each GitHub Enterprise system is unique in its user patterns and integrations, we encourage administrators to reach out to the GitHub Enterprise support team for help interpreting their instance's monitor graphs if questions arise. The graph data is included in appliance Support Bundles, which can be shared with our support team.

 

System Health

 

The system health graphs provide a general overview of services and system resource utilization. The CPU, Memory, and Load Average graphs are useful for identifying trends or periods when provisioned resources were saturated.

 

CPU

 

graph-cpu

 

  • Abnormally high CPU utilization or prolonged spikes can mean your instance is under-provisioned.
  • In the above example, nearly 100% of CPU time was consumed by "user" for a period of time.
  • The presence of CPU "steal" time on the CPU graph can indicate that other virtual machines running on the same host system are saturating the underlying resources, causing the GitHub Enterprise system to wait for CPU cycles.
  • "User" and "System" are generally the largest consumers of CPU time.
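
To make the "user", "system" and "steal" categories more concrete, here is a minimal Python sketch of how such percentages can be derived from two samples of the kernel's cumulative CPU counters. It assumes the standard Linux /proc/stat layout; the counter values are hypothetical, and this is not necessarily how the appliance's collectd service computes its graph data.

```python
# Field order of the aggregate "cpu" line in /proc/stat on Linux.
FIELDS = ("user", "nice", "system", "idle", "iowait", "irq", "softirq", "steal")


def parse_cpu_line(line):
    """Turn the aggregate 'cpu ...' line from /proc/stat into a dict of tick counters."""
    parts = line.split()
    if parts[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    return dict(zip(FIELDS, (int(v) for v in parts[1:1 + len(FIELDS)])))


def cpu_percentages(before, after):
    """Percentage of CPU time spent in each state between two samples."""
    deltas = {name: after[name] - before[name] for name in FIELDS}
    total = sum(deltas.values())
    if total <= 0:
        return {name: 0.0 for name in FIELDS}
    return {name: 100.0 * delta / total for name, delta in deltas.items()}


# Hypothetical counter values taken ten seconds apart (illustration only).
first = parse_cpu_line("cpu 1000 0 500 8000 100 0 0 400")
second = parse_cpu_line("cpu 1900 0 550 8020 100 0 0 430")
usage = cpu_percentages(first, second)
```

With these made-up samples, "user" accounts for 90% of the interval and "steal" for 3%. A sustained non-zero steal share is the kind of signal described in the list above.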

 

Memory

 

graph-memory

 

  • The Linux kernel provides a layer of in-memory disk caching, which is represented by "cached" on the graph. It is perfectly normal, and recommended, to have at least a few GB of cache overhead. The system will attempt to cache as much as possible, but applications can take this memory on demand. Because of this, we consider the total amount of available memory to be the sum of the "cached" and "free" values.
  • Running out of available (free + cached) memory can lead to out of memory (OOM) events, causing services to terminate and unexpected application behavior.
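
The "free + cached" rule can be sketched in a few lines of Python. The example below parses text in the format of Linux's /proc/meminfo; the sample values are hypothetical, and we are assuming the graph's "free" and "cached" series correspond to the MemFree and Cached fields.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integer values (kB for most fields)."""
    values = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        fields = rest.split()
        if fields:
            values[key.strip()] = int(fields[0])
    return values


def available_kb(meminfo):
    """Free plus cached memory, following the definition of 'available' used above."""
    return meminfo["MemFree"] + meminfo["Cached"]


# Hypothetical sample, not taken from a real appliance.
SAMPLE = """\
MemTotal:       16384000 kB
MemFree:          512000 kB
Buffers:          128000 kB
Cached:          4096000 kB
"""

info = parse_meminfo(SAMPLE)
available = available_kb(info)
available_share = available / info["MemTotal"]
```

Here only about 3% of memory is strictly free, but roughly 28% is available once the cache is counted, which is why a low "free" value on its own is not a cause for concern.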

 

Load

 

graph-load

 

  • System Load Average is a measurement of the running task demand on the system.
  • We recommend monitoring the fifteen-minute (long-term) system load average for values nearing or exceeding the number of CPU cores allocated to the virtual machine.
  • When the load average rises above the number of CPU cores, it generally means that tasks have to wait for resources before they can run.
  • Assuming the above example graph is from a GitHub Enterprise system with 2 CPU cores, we can determine that processes are often waiting for resources.
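
The comparison between the fifteen-minute load average and the core count can be expressed as a small check. This is a rough Python sketch; the 0.8 "nearing" ratio is our own assumption for illustration, not a documented GitHub Enterprise threshold.

```python
import os


def load_per_core(load15, cores):
    """Fifteen-minute load average divided by the number of CPU cores."""
    if cores < 1:
        raise ValueError("cores must be at least 1")
    return load15 / cores


def load_status(load15, cores, warn_ratio=0.8):
    """Classify the load; warn_ratio is an assumed cut-off for 'nearing' the core count."""
    ratio = load_per_core(load15, cores)
    if ratio >= 1.0:
        return "saturated"
    if ratio >= warn_ratio:
        return "nearing"
    return "ok"


def current_status():
    """Check the machine this script runs on (Unix-like systems only)."""
    _, _, load15 = os.getloadavg()
    return load_status(load15, os.cpu_count() or 1)
```

For a system with 2 CPU cores, a hypothetical fifteen-minute load average of 3.2 would be classified as "saturated", 1.7 as "nearing", and 0.5 as "ok".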

 

Processes

 

graph-proc-run

 

  • By clicking on a state such as "running" in the legend at the bottom of the graph, we can isolate individual process states. In the above example we have selected running processes.
  • The running process count will fluctuate with system activity. Sharp rises or drops can be expected depending on usage trends.
  • Large or consistently present numbers of blocked or zombie processes may indicate a service problem.
  • It is expected to have processes in the sleeping state during normal operation.
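
To illustrate where these process states come from, the following Python sketch tallies states from lines in the format of Linux's /proc/[pid]/stat. The mapping from kernel state letters to the graph's legend names is our assumption, and the sample lines are hypothetical and truncated to the first few fields.

```python
from collections import Counter

# Assumed mapping from Linux state letters to the legend names on the graph.
STATE_NAMES = {
    "R": "running",
    "S": "sleeping",
    "D": "blocked",   # uninterruptible sleep, usually waiting on I/O
    "Z": "zombie",
    "T": "stopped",
}


def state_from_stat(stat_line):
    """Extract the state letter from a /proc/[pid]/stat line.

    The command name is wrapped in parentheses and may itself contain spaces
    or parentheses, so we split on the last ')' rather than on whitespace.
    """
    return stat_line.rsplit(")", 1)[1].split()[0]


def summarize_states(stat_lines):
    """Count processes per state name; unknown letters are grouped as 'other'."""
    counts = Counter()
    for line in stat_lines:
        counts[STATE_NAMES.get(state_from_stat(line), "other")] += 1
    return counts


# Hypothetical, truncated sample lines (illustration only).
SAMPLE_LINES = [
    "1 (systemd) S 0 1 1",
    "42 (git upload-pack) R 1 42 42",
    "77 (worker (old)) Z 1 77 77",
    "90 (nfsd) D 2 0 0",
    "91 (kworker/0:1) I 2 0 0",
]
summary = summarize_states(SAMPLE_LINES)
```

On a live Linux system, the same function could be fed the contents of each /proc/[pid]/stat file. A persistent non-zero "blocked" or "zombie" count is the pattern worth investigating.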

 

Files

 

graph-files

 

  • This graph shows the maximum number of open files allowed, as well as the number of files currently open.
  • On a healthy system, the number of used files should never reach the max value. Reaching the max can indicate problems with a GitHub Enterprise service.
  • Limiting the maximum number of open files is a protection built into Linux to prevent runaway processes from impacting other services on the system.
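
As a rough illustration of the used and max values, the sketch below parses text in the format of Linux's /proc/sys/fs/file-nr, which holds three numbers: allocated file handles, allocated but unused handles, and the system-wide maximum. We are assuming the graph reflects this system-wide limit; the sample values are hypothetical.

```python
def parse_file_nr(text):
    """Parse /proc/sys/fs/file-nr content into used and max file handle counts."""
    allocated, unused, maximum = (int(value) for value in text.split())
    return {"used": allocated - unused, "max": maximum}


def used_fraction(stats):
    """Share of the system-wide file handle limit currently in use."""
    return stats["used"] / stats["max"]


def read_file_nr(path="/proc/sys/fs/file-nr"):
    """Read the live values on a Linux system (not called in this sketch)."""
    with open(path) as handle:
        return parse_file_nr(handle.read())


# Hypothetical sample content (illustration only).
sample = parse_file_nr("9024\t0\t1000000\n")
fraction = used_fraction(sample)
```

In this made-up sample, less than 1% of the limit is in use. An alert on the fraction approaching 1.0 would catch the condition described above before the max is reached.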

 

Forks

 

graph-forks

 

  • The fork_rate trend depends heavily on system activity, and can reach values of 1000-2000 or more on busy systems.
  • Large spikes beyond the averages observed on your own instance should be investigated.
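
One way to spot spikes beyond the observed averages is to compare each sample against a baseline taken from the series itself. The Python sketch below turns cumulative fork counters (such as the "processes" value in Linux's /proc/stat) into per-second rates and flags samples well above the median. The factor of 3, the per-second unit, and the sample counters are all assumptions made for illustration.

```python
from statistics import median


def fork_rates(counters, interval_seconds):
    """Convert successive cumulative fork counters into forks-per-second rates."""
    return [(b - a) / interval_seconds for a, b in zip(counters, counters[1:])]


def find_spikes(rates, factor=3.0):
    """Return the indexes of rates that exceed factor times the median rate.

    The median is used as the baseline so that a single spike does not
    drag the baseline upward; factor is an assumed, tunable threshold.
    """
    if not rates:
        return []
    baseline = median(rates)
    return [index for index, rate in enumerate(rates) if rate > factor * baseline]


# Hypothetical cumulative counters sampled every 10 seconds (illustration only).
counters = [0, 1000, 2100, 3000, 4050, 5000, 14000]
rates = fork_rates(counters, 10)
spikes = find_spikes(rates)
```

With these made-up counters, the rates hover around 100 forks per second until the final interval jumps to 900, which is the only sample flagged. What counts as a spike on a real instance depends on its usual activity.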

 

Continue the conversation

 

There's more to come in the "Understanding your graphs" mini-series. If you'd like to follow along, just subscribe to the "Understanding your graphs" label (link below). Please let us know if you have any questions in the comments.