Back for another Ecosystem-related update. In my first update I mentioned that folks using various notebooks would have something to look forward to. Unfortunately, I can’t share it publicly yet! Good things are still coming, though.
Because this space is so heavily influenced by our users, I’m happy to elevate a user in our Ecosystem space who is deserving of kudos. With that, please join me in celebrating:
Hugh Byrne !!
Hugh has been incredibly active in this corner of our forums, and I’m very grateful for his activity here.
Thank you Hugh!
There wasn’t a lot in this Universe that directly relates to Ecosystem, but ICYMI:
…has a fantastic breakdown of new things to be excited about, post-Universe!
Are you hoping to leverage GitHub data for a project/application?
Unfortunately, we don’t have datasets available. We can, however, offer some advice on collecting data you could leverage.
Please use the GitHub API for collecting the data you need and don’t scrape the website: About GitHub's APIs - GitHub Docs. The GitHub API was built and optimized for programmatic data access, while the website was optimized for humans.
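As a rough illustration of the API route, here is a minimal Python sketch using only the standard library. The helper names (`build_request`, `fetch_json`) and the `User-Agent` string are made up for this example; the `Accept` media type and the `X-RateLimit-Remaining` response header are what the REST API docs describe, but please check the current docs before relying on them.

```python
import json
import urllib.request

API_ROOT = "https://api.github.com"


def build_request(path, token=None):
    """Build a request for a REST API path such as "/repos/OWNER/REPO".

    `token` is optional; when given it is sent as a bearer token.
    """
    headers = {
        "Accept": "application/vnd.github+json",
        # Hypothetical client name, chosen only for this sketch.
        "User-Agent": "data-collection-sketch",
    }
    if token:
        headers["Authorization"] = f"Bearer {token}"
    return urllib.request.Request(f"{API_ROOT}{path}", headers=headers)


def fetch_json(path, token=None):
    """Perform the request and return (parsed JSON, remaining rate limit)."""
    with urllib.request.urlopen(build_request(path, token)) as resp:
        remaining = resp.headers.get("X-RateLimit-Remaining")
        return json.load(resp), remaining


# Example call (needs network access, so it is left commented out):
# repo, remaining = fetch_json("/repos/libgit2/libgit2")
# print(repo["stargazers_count"], remaining)
```

Keeping an eye on the remaining rate limit lets a collector slow down before it gets throttled, rather than after.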
If you intend to collect lots of Git data from a single GitHub repository, such as commits and file contents, please clone the repository locally instead of using the GitHub API. You can then get all the data you need from the local clone, e.g. using a Git library like libgit2 (https://libgit2.org). That will be much faster than using the GitHub API and you won’t need to worry about API rate limits. You’re allowed to clone as many repositories as you want, but please do so at a reasonable pace.
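To make the clone-then-read approach concrete, here is a small Python sketch. Note that it shells out to the `git` command line rather than using libgit2, purely to stay within the standard library; the function names and the tab-separated log format are choices made for this example, not anything prescribed.

```python
import subprocess

# Commit hash, author name, and subject, separated by tabs (%x09).
LOG_FORMAT = "%H%x09%an%x09%s"


def clone(url, dest):
    """Clone a repository once; all later reads hit the local copy."""
    subprocess.run(["git", "clone", "--quiet", url, dest], check=True)


def parse_log(text):
    """Turn `git log` output in LOG_FORMAT into a list of dicts."""
    commits = []
    for line in text.splitlines():
        if not line:
            continue
        # Split at most twice so tabs inside the subject are preserved.
        sha, author, subject = line.split("\t", 2)
        commits.append({"sha": sha, "author": author, "subject": subject})
    return commits


def read_commits(repo_dir):
    """Read the commit history from a local clone, with no API calls."""
    result = subprocess.run(
        ["git", "-C", repo_dir, "log", f"--pretty=format:{LOG_FORMAT}"],
        check=True,
        capture_output=True,
        text=True,
    )
    return parse_log(result.stdout)
```

If you are cloning many repositories in a loop, adding a short pause between `clone` calls is one simple way to keep the pace reasonable.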
Use data from archives like:
- http://ghtorrent.org/, and
- Google Cloud Blog - News, Features and Announcements.
They might not have all the data you need, but they might have at least some of it, which would reduce the number of GitHub API requests you need to make.
Focus your initial data collection on a subset of all projects (e.g. those with at least a few stars or followers, or those which are not forks). That would allow you to collect data for a relevant subset sooner to start your analysis, and you can collect the rest later.
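One way to pick that first subset, sketched in Python: the `fork` and `stargazers_count` keys are fields on repository objects returned by the REST API, while the function name, the threshold of 5 stars, and the sample data are just assumptions for illustration.

```python
def select_initial_subset(repos, min_stars=5):
    """Keep non-fork repositories with at least `min_stars` stars.

    `repos` is a list of repository dicts, e.g. as returned by the REST API.
    Missing keys are treated as "not a fork" and "zero stars".
    """
    return [
        repo
        for repo in repos
        if not repo.get("fork", False)
        and repo.get("stargazers_count", 0) >= min_stars
    ]


# Made-up sample data, not real repositories.
sample_repos = [
    {"full_name": "a/popular", "fork": False, "stargazers_count": 120},
    {"full_name": "b/forked", "fork": True, "stargazers_count": 300},
    {"full_name": "c/quiet", "fork": False, "stargazers_count": 1},
]
first_batch = select_initial_subset(sample_repos)
```

Everything the filter leaves out can be queued for a later pass once the first analysis is underway.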
Finally, please make sure you read the terms of service for the API: GitHub Terms of Service - GitHub Docs
That’s all for now!
Be well, friends