Several concurrent requests to the same domain

I want to fetch some text files from a repo, but when I issue several fetches the responses take too long to arrive on the client side: about 1 min 20 s, which is prohibitively long for a web app.

I am not sure, but I suspect this happens because the browser limits the number of concurrent requests to raw.githubusercontent.com.

Does GitHub have multiple domains? Is there an alternative domain I could send my requests to, please?

Thanks

<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <title>Title</title>
</head>
<body>
<script>

    var filenames = [
        {'url': "https://raw.githubusercontent.com/acycliq/testing/master/firstDataset/firstDataset_0.tsv"},
        {'url': "https://raw.githubusercontent.com/acycliq/testing/master/firstDataset/firstDataset_1.tsv"},
        {'url': "https://raw.githubusercontent.com/acycliq/testing/master/firstDataset/firstDataset_2.tsv"},
        {'url': "https://raw.githubusercontent.com/acycliq/testing/master/firstDataset/firstDataset_3.tsv"},

        {'url': "https://raw.githubusercontent.com/acycliq/testing/master/secondDataset/secondDataset_0.tsv"},
        {'url': "https://raw.githubusercontent.com/acycliq/testing/master/secondDataset/secondDataset_1.tsv"},
        {'url': "https://raw.githubusercontent.com/acycliq/testing/master/secondDataset/secondDataset_2.tsv"},
        {'url': "https://raw.githubusercontent.com/acycliq/testing/master/secondDataset/secondDataset_3.tsv"}
    ];


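    // Start all requests at once. Promise.all resolves only after every
    // fetch has resolved, i.e. once each response's headers have arrived.
    var fetchData = (data) => {
        var urls = data.map(d => d.url);
        return Promise.all(
            urls.map(url => fetch(url))
        );
    };

    // Declared with var so that app is not an implicit global.
    var app = async function () {
        console.log('ping');
        var results = await fetchData(filenames);
        console.log(results.length);
        console.log('pong');
    };
    app();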


</script>
</body>
</html>

:wave: @acycliq: There are a number of publicly available URLs that GitHub relies on as part of its application's functionality. raw.githubusercontent.com is one of those URLs, but we don't recommend relying on it for production use cases, as the code snippet does, because that service is subject to change at any time.

The GitHub API provides several documented and established resources that your application can rely on. I highly recommend using our Contents API to get repository content. Files and symlinks support a custom media type for retrieving the raw content or rendered HTML (when supported). This interface is great for fetching contents that are 1MB or less in size.
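A minimal sketch of what a Contents API request with the raw media type might look like from the browser. The helper names `buildContentsRequest` and `fetchRawContent` and the `fetchImpl` parameter are hypothetical, and the `Accept` value is an assumption based on GitHub's documented custom media types, so it is worth checking against the current API docs before relying on it.

```javascript
// Build the URL and request options for a Contents API call that asks
// for the raw file content instead of the default JSON description.
// `ref` (branch, tag or commit) is optional.
function buildContentsRequest(owner, repo, path, ref) {
    // Encode each path segment separately so the slashes are preserved.
    const encodedPath = path.split('/').map(encodeURIComponent).join('/');
    let url = `https://api.github.com/repos/${owner}/${repo}/contents/${encodedPath}`;
    if (ref) {
        url += `?ref=${encodeURIComponent(ref)}`;
    }
    return {
        url,
        // Assumed media type for raw content; verify against the API docs.
        options: { headers: { Accept: 'application/vnd.github.raw+json' } }
    };
}

// Fetch the file as text. `fetchImpl` defaults to the global fetch and
// exists only so that a stub can be passed in for testing.
async function fetchRawContent(owner, repo, path, ref, fetchImpl = fetch) {
    const { url, options } = buildContentsRequest(owner, repo, path, ref);
    const response = await fetchImpl(url, options);
    if (!response.ok) {
        throw new Error(`GitHub API returned ${response.status} for ${url}`);
    }
    return response.text();
}

// Example usage (path is a placeholder for a file within the size limit):
// fetchRawContent('acycliq', 'testing', 'some/small_file.tsv')
//     .then(text => console.log(text.length));
```

Unauthenticated API requests are rate limited, so a data loader that calls this often would probably need to send a token as well.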

I mention the 1MB limit because I noticed something interesting when I made this curl request to one of the specified resources:

curl https://api.github.com/repos/acycliq/testing/contents/firstDataset/firstDataset_0.tsv

At the time of writing, the API returns this response:

{
  "message": "This API returns blobs up to 1 MB in size. The requested blob is too large to fetch via the API, but you can use the Git Data API to request blobs up to 100 MB in size.",
  "errors": [
    {
      "resource": "Blob",
      "field": "data",
      "code": "too_large"
    }
  ],
  "documentation_url": "https://docs.github.com/rest/reference/repos#get-repository-content"
}

This particular file is 75.5MB, which exceeds the 1MB limit for the Contents API.

We also have a Blobs API that exposes a "Get a blob" endpoint, which supports blobs up to 100MB in size:

curl -v https://api.github.com/repos/acycliq/testing/git/blobs/97352f333a61d4ac8db3a24744f948c081a26624
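The same request could be issued from the web app with `fetch`. This is only a sketch: `buildBlobRequest` and `fetchBlobText` are hypothetical helper names, `fetchImpl` is a hypothetical parameter for testing, and the `Accept` header is an assumption that asks for the raw blob data rather than the default JSON response (which carries the content base64-encoded).

```javascript
// Build the URL and request options for the "Get a blob" endpoint.
function buildBlobRequest(owner, repo, sha) {
    return {
        url: `https://api.github.com/repos/${owner}/${repo}/git/blobs/${sha}`,
        // Assumed media type for raw blob data; verify against the API docs.
        options: { headers: { Accept: 'application/vnd.github.raw+json' } }
    };
}

// Fetch a blob as text. `fetchImpl` defaults to the global fetch and is
// only there so a stub can be substituted in tests.
async function fetchBlobText(owner, repo, sha, fetchImpl = fetch) {
    const { url, options } = buildBlobRequest(owner, repo, sha);
    const response = await fetchImpl(url, options);
    if (!response.ok) {
        throw new Error(`GitHub API returned ${response.status} for ${url}`);
    }
    return response.text();
}

// Example usage with the blob SHA from the curl command above:
// fetchBlobText('acycliq', 'testing', '97352f333a61d4ac8db3a24744f948c081a26624')
//     .then(text => console.log(text.length));
```

Note that the blob is addressed by its SHA rather than by its path, so the SHA has to be known or looked up first.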

Does this help with what you’re looking to do?

@francisfuzz: Thanks for looking into this. My main issue is that I fetch all the files in parallel. This is part of a data loader that feeds the data to a web application. Since most browsers allow up to 6 concurrent connections per domain, there is a significant delay before I can start streaming the data. In Firefox, for example, where you can increase the maximum number of connections by tweaking its config settings, streaming starts almost instantaneously. I was hoping that GitHub would have multiple domains, like raw2.githubusercontent.com, etc., so I could split my fetches across several domains.
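For illustration, here is a hedged sketch of a loader that hands each response to a callback as soon as it arrives, instead of waiting for `Promise.all` to settle before touching any of them. It does not lift the browser's connection limit; it only lets processing begin on whichever responses come back first. `fetchEach`, `onResponse` and `fetchImpl` are hypothetical names that do not appear in the original snippet.

```javascript
// Start every request immediately and pass each response to `onResponse`
// as soon as that particular response arrives. The returned promise
// resolves with the callback results, in the same order as `urls`.
// `fetchImpl` defaults to the global fetch; it is a parameter only so
// that a stub can be used in tests.
function fetchEach(urls, onResponse, fetchImpl = fetch) {
    return Promise.all(
        urls.map(url =>
            fetchImpl(url).then(response => {
                if (!response.ok) {
                    throw new Error(`Request for ${url} failed with status ${response.status}`);
                }
                return onResponse(url, response);
            })
        )
    );
}

// Example usage with the `filenames` array from the question:
// fetchEach(filenames.map(d => d.url), (url, response) => {
//     console.log('started receiving', url);
//     return response.text();
// }).then(texts => console.log(texts.length));
```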

@acycliq: Thanks for sharing that additional context. I can confirm that there aren't multiple domains; the Contents API would be your best bet to get the data you're looking for from a repository. Alternatively, it may be worthwhile to check out a database-as-a-service rather than using a GitHub repository for fetching those datasets.