What: The State of the Octoverse explores a year of change with new deep dives into developer productivity, security, and how we build communities on GitHub. During this AMA, Senior Data Scientist Derek Jedamski @DJedamski will answer questions and share insights about the report, from build to execution.
Where: The AMA will take place in this event topic. The topic will open 10 minutes before the start time.
Derek Jedamski is a Senior Data Scientist on the Data Science team at GitHub. He was one of the primary contributors to the analysis behind the State of the Octoverse report in 2020. The Data Science team at GitHub supports a broad array of product areas. Derek is currently primarily supporting the Security Product space for all of their data needs. Prior to GitHub, he worked at various tech and fintech companies, primarily focused on machine learning.
Hey everybody! I’m Derek - Senior Data Scientist at GitHub. I’ve been at GitHub for almost four years now and I was one of the primary contributors to the analysis for the State of the Octoverse report in 2020.
Looking forward to spending the next hour with you. Feel free to ask me anything about Octoverse, GitHub, or whatever else comes to mind!
Is your map wrong, or did you count the European part of Russia in the stats for Asia?
Hi @rimutaka - you are right, good catch! We did count Russia entirely under the stats for Asia. Without leveraging granular location data for users, we can’t appropriately assign usage patterns to the European part of Russia vs the Asian part of Russia. With that said - as you appropriately call out, in the future perhaps we should consider allocating these usage patterns to where the majority of the population resides (instead of based on majority by land mass). Thanks for the feedback!
I am working on a side project studying all sorts of developer interaction via GitHub. So far I have downloaded 6,392,268 repos, which is a minuscule amount. Is there a better way of accessing the code other than downloading it?
You keep your data on AWS S3, I presume. Is it possible to get direct read access to it, on a requester-pays basis?
That’s a great question. Unfortunately, I do not have a very satisfying answer for you, but I can tell you it is something we are thinking a lot about. We take the privacy of our users’ data very seriously, but at the same time we recognize the value of some of this aggregated data to the public. The challenge is in finding the right line between the two.
Having direct EC2 <-> S3 access would allow others to analyse all sorts of aspects of the community and the code without incurring the cost of storing it, and it would also improve latency. EC2 to S3 is lightning fast compared to going via an API.
That’s a lot of repos! I would be interested to hear what specifically you are studying in regards to these developer interactions.
Sadly, there is not a better path to accessing this data at this point, though I will say we certainly feel your pain.
I suppose I will just reiterate that we are thinking a lot about the exact pain points you are mentioning. We are working through possible solutions that would allow access to some of this data, fully secured and anonymized, in a way that would be useful to users who want to analyze it, such as yourself.
So sadly, the answer is: there is not a better way right now but we feel your pain and we are thinking through how we can address this.