
Analyzing Git traffic using Governor

GitHub Staff

Introduction

 

GitHub Enterprise Server has an internal monitor and concurrency controller for Git processes called Governor, which keeps count of Git operations. A command-line utility to query Governor data (ghe-governor) was made available with GitHub Enterprise Server 2.11. Governor data files, located under /data/user/gitmon/, hold one hour of data per file and are retained for two weeks. The files contain timestamps in their names which you can use to confirm the time period they cover. Here is an example:

$ sudo sh -c "ls -l /data/user/gitmon/gitmon.*.db" | sed -n '1p;$p' | awk '{print $NF}' | grep -o "[0-9]*" | while read; do echo -n "$REPLY = "; date -d @${REPLY}; done
1551186000 = Tue Feb 26 13:00:00 UTC 2019
1552392000 = Tue Mar 12 12:00:00 UTC 2019
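To go the other way and find the data file that should cover a given moment, you can round the time down to the hour. This is a minimal sketch, not part of Governor: it assumes GNU date and that each file name carries the epoch of the start of its hour, as the listing above suggests. gitmon_file_for is a hypothetical helper name.

```shell
# Hypothetical helper: print the Governor data file expected to cover a UTC time.
# Assumes GNU date and that file names carry the epoch of the start of each hour.
gitmon_file_for() {
  ts=$(date -u -d "$1" +%s) || return 1
  # Round down to the start of the hour.
  echo "/data/user/gitmon/gitmon.$(( ts - ts % 3600 )).db"
}

gitmon_file_for "2019-02-26 13:25:00"
# /data/user/gitmon/gitmon.1551186000.db
```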

 

Usage

 

First, let's have a look at Governor's syntax. We will focus on common examples and queries later in this article.

 

Individual (Top) queries

 

Governor can find the top N records of Git queries for a given metric (column). The resulting table will be sorted by that column.

ghe-governor top <column> [options]

 

The column can be any of rt, cpu, disk_read, disk_write, disk, uploaded, received, net, rss, or cpu_busy.

 

Aggregate queries

 

Governor can find the top N groups of Git queries for a given grouping function and a given metric (column).

ghe-governor aggregate <grouping-function> <column> [options]

 

The grouping function can be any of hostname, program, repo, git_dir, via, ip, user_id, result_code, cloning, die_message, or die_message_raw.

 

The column can be any of count, rt, max_rt, avg_rt, avg_parallelism, max_parallelism, cpu, avg_cpu, disk, disk_read_bytes, disk_read_kb, avg_read_bytes, disk_write_bytes, disk_write_kb, avg_write_bytes, uploaded_bytes, uploaded_kb, received_bytes, received_kb, net, avg_uploaded, cpu_busy, or users.

 

Please see below for an explanation of some of the resulting table columns:

  • RT means response time. AVG RT is the average time in seconds that Git invocations took, and MAX RT is the running time in seconds of the longest-running invocation, within each group (for example, per host).
  • PL is parallelism: how many Git invocations are running at the same time. MAXPL and AVGPL are the maximum and average parallelism, respectively.
  • CPU/SEC is how many seconds of CPU time are used by Git alone per second of wall-clock time. This is the number of CPUs dedicated to Git, averaged over the entire duration of the query. You can divide the value by the actual number of CPU cores to get the Git-specific CPU utilization as a fraction. Unlike Unix system load, this number cannot exceed the actual number of CPU cores.
  • UPL is data that GitHub Enterprise Server uploaded -- i.e., client fetches and clones.
  • RCV is data that GitHub Enterprise Server received -- i.e., client pushes.
  • The READ, WRITE, UPL, and RCV columns are all in GB, but the rate is in MB/s.
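The division described for CPU/SEC can be worked through as follows. This is a minimal sketch: git_cpu_percent is a hypothetical helper, and the values 2.0 and 8 are made-up illustrations, not Governor output.

```shell
# Hypothetical helper: convert a CPU/SEC value into Git-specific CPU utilization.
# $1 = CPU/SEC from the result table, $2 = number of CPU cores (for example from nproc).
git_cpu_percent() {
  awk -v cpu_sec="$1" -v cores="$2" 'BEGIN { printf "%.1f%%\n", 100 * cpu_sec / cores }'
}

git_cpu_percent 2.0 8
# 25.0%
```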

 

Options for all queries

 

Every query type can be limited in scope in the following ways:

  • -j = set output format to JSON instead of an ASCII table
  • -n<N> = limit the output size to N (default: 20)
  • -t <timespec> = only consider Git invocations since a given start time (default: 48 hours ago). For fine-grained queries, you may want to use a tool such as https://www.epochconverter.com/ to convert a UTC time to a Unix epoch timestamp.
    • -t 1371614483 = Invocations since a given Unix timestamp (seconds since 1970)
    • -t 1371614483637 = Invocations since a given Java timestamp (milliseconds since 1970)
    • -t-1d = Invocations in the last day
    • -t-2h = Invocations in the last two hours
    • -t-20m = Invocations in the last twenty minutes
  • -u <timespec> = consider Git invocations up to a given end time (default: now)
  • -r <owner>/<repository> = consider only queries that match a given owner (user or organization) and repository. You can specify this option multiple times (logical OR).
  • -o <owner> = consider only queries that match a given owner (user or organization). You can specify this option multiple times (logical OR).
  • -V <protocol> = consider only queries arriving via a specific protocol (e.g. shell, git, blob edit, gitrpc, ssh, initial commit, web branch create, pull request branch delete button, or pull request merge button). You can specify this option multiple times (logical OR).
  • -P <program> = consider only queries that ran a given Git subprogram (e.g. rev-list, diff-tree, dgit-helper, show-ref, merge-base, log, diff, blame-tree, diff-pairs, upload-pack, shortlog, rev-parse, pack-objects, pack-refs, repack, cat-file, upload-file, ahead-behind, dgit-state, or for-each-ref). You can specify this option multiple times (logical OR).
  • -I <address> = consider only queries from a specific IP address. -I "" means local operations and is equivalent to -V shell. You can specify this option multiple times (logical OR).
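If you prefer to stay on the command line, GNU date can do the UTC-to-epoch conversion for -t and -u. This is a minimal sketch that assumes GNU date is available; to_epoch is a hypothetical helper and the time shown is only an illustration.

```shell
# Hypothetical helper: convert a UTC date and time into a Unix timestamp.
# Assumes GNU date, which accepts free-form input through -d.
to_epoch() {
  date -u -d "$1" +%s
}

to_epoch "2019-03-01 09:00:00"
# 1551430800
```

The result can then be passed along, for example: ghe-governor aggregate repo count -t "$(to_epoch '2019-03-01 09:00:00')" -u "$(to_epoch '2019-03-01 10:00:00')"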

 

The following are long options for aggregate queries:

  • --count-only = only show the KEY and COUNT columns
  • --distinct-users = also show the #USERS column

 

Example queries

 

Now that we know Governor's syntax, let's have a look at typical usage scenarios and example queries.

 

Analyzing Git traffic

 

The overall summary provides the total and average number of Git requests over a recorded period:

ghe-governor-summary

 

The following set of sample commands may help to identify Git traffic patterns or spikes in activity. They make use of the count metric, which is a good reference point to know what is being requested the most.

  • Count of all Git operations by repository:
    ghe-governor aggregate repo count
    
  • Count of Git clones and fetches by repository:
    ghe-governor aggregate repo count -P upload-pack
    
  • Count of Git pushes by repository:
    ghe-governor aggregate repo count -P receive-pack
    
  • Number of users pushing code to an organization:
    ghe-governor aggregate user_id count -o <organization> -P receive-pack --distinct-users
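
These counts are often easier to read side by side. The sketch below runs the repository count once for fetches and once for pushes over the last day, using only options described above. GOVERNOR is a hypothetical override variable, not something the utility itself reads: set it to echo to preview the commands without running them.

```shell
# GOVERNOR is a hypothetical override; it defaults to the real utility.
GOVERNOR=${GOVERNOR:-ghe-governor}

# Top 10 repositories by operation count in the last day, per Git program.
traffic_by_program() {
  for program in upload-pack receive-pack; do
    echo "== $program =="
    "$GOVERNOR" aggregate repo count -P "$program" -t-1d -n10
  done
}
```

Run traffic_by_program on the appliance to print both tables in sequence.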
    

 

To dive a bit deeper, the following queries indicate the actual volume of Git traffic:

  • Average amount of cloned and fetched data by user id, IP address or repository:
    ghe-governor aggregate user_id avg_uploaded
    ghe-governor aggregate ip avg_uploaded
    ghe-governor aggregate repo avg_uploaded
    
  • Peak amount of cloned and fetched data by user id, IP address or repository:
    ghe-governor aggregate user_id uploaded_kb
    ghe-governor aggregate ip uploaded_kb
    ghe-governor aggregate repo uploaded_kb
    
  • The corresponding top query may be of interest, too (not grouped):
    ghe-governor top uploaded
    

 

Furthermore, you might be interested in bursts of concurrent clones. A thundering herd of clones can cause a spike in resource usage. You can check for concurrent clones by aggregating on max_parallelism (result table column MAXPL):

ghe-governor aggregate repo max_parallelism -P upload-pack

 

CPU Profiling

 

The traffic metrics above only go so far for performance profiling. Governor also collects CPU timing data, which helps diagnose high CPU utilization caused by Git operations.

  • Top repositories by CPU time:

    ghe-governor aggregate repo cpu
    
  • Top programs by CPU time for a single repository:

    Using the repository -r flag, you can see the CPU breakdown for individual repositories as well. This time we're interested in the program that used the most CPU time:

    ghe-governor aggregate program cpu -r <organization>/<repository>
    
  • Top IP addresses by CPU time for a single repository:

    Grouping by IP address and CPU time can help to identify continuous integration systems or users that are causing a performance hit:

    ghe-governor aggregate ip cpu -r <organization>/<repository>
    
  • General Governor records with the most CPU time (not grouped):

    ghe-governor top cpu
    

 

Disk usage

 

Sometimes, you want to find out which repository or program caused a specific disk write peak that you've seen. The following commands may be of help here.

  • Top repositories by disk write volume in a specific time interval:
    ghe-governor aggregate repo disk_write_kb -t <timespec> -u <timespec>
    
  • Top programs by disk write volume in a specific time interval:
    ghe-governor aggregate program disk_write_kb -t <timespec> -u <timespec>
    
  • General Governor records with the highest volumes of disk writes (not grouped):
    ghe-governor top disk_write
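
When you only know roughly when a peak occurred, you can derive the -t and -u values from the peak time and a margin. This is a minimal sketch assuming GNU date; peak_window is a hypothetical helper, and the time and 15-minute margin are illustrative.

```shell
# Hypothetical helper: print -t/-u arguments for a window around a UTC time.
# $1 = peak time in UTC, $2 = margin in minutes on each side. Assumes GNU date.
peak_window() {
  peak=$(date -u -d "$1" +%s) || return 1
  margin=$(( $2 * 60 ))
  echo "-t $(( peak - margin )) -u $(( peak + margin ))"
}

peak_window "2019-03-01 09:00:00" 15
# -t 1551429900 -u 1551431700
```

Leave the substitution unquoted so the shell splits it into separate arguments, for example: ghe-governor aggregate repo disk_write_kb $(peak_window "2019-03-01 09:00:00" 15)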
    

 

TL;DR

 

Governor ships with GitHub Enterprise Server and provides insight into how your developers use Git and what implications their behavior may have for your GitHub Enterprise Server instance. In Enterprise Support, we regularly rely on Governor to help us answer all kinds of questions related to Git usage. Now you can do the same.

What are your experiences with Governor? Feel free to comment below!
