Feedback on Changes to Code Search Indexing

https://github.blog/changelog/2020-12-17-changes-to-code-search-indexing/ says:

Starting today, GitHub Code Search will only index repositories that have had recent activity within the last year. Recent activity for a repository means that it has had a commit or has shown up in a search result. If the repository does not have any activity for an entire year, the repository will be purged from the Code Search index. This change will enable the most relevant content for developers to surface in the code search index as well as keeping code search queries fast for all customers.

I’m disappointed to hear this, as it doesn’t match my common use-cases for GitHub code search.

I often use code search to find examples of how to use an obscure or undocumented feature of a library I am using. The solutions for these searches frequently come from old, abandoned repositories - if a library feature is being actively used by current projects it’s much more likely to have documentation which means I don’t need to turn to code search.

I also use it for my own code all the time - with the user:simonw prefix. I have a ton of repositories from old projects that I haven’t touched in years which I still like to refer back to for examples of how I solved old problems.

I recently built my own code search engine, https://github.com/simonw/datasette-ripgrep - which I can use to find examples and solutions in my older projects. I’m still disappointed that I won’t be able to use GitHub code search for that any more.

I understand that running a high-quality index at this scale is expensive and complicated. Any chance I might be able to keep code search indexing of my older repositories as part of my paid GitHub plan, or as part of having a paid organization on GitHub?

Oh hang on, I misunderstood the announcement:

Recent activity for a repository means that it has had a commit or has shown up in a search result

This is less bad, because it means that older repos that have useful content in them are much more likely to stay indexed provided they match at least one search per year.
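For anyone else who skimmed past it, here is a rough sketch of the rule as I now read it. This is only my interpretation of the announcement, not GitHub's actual implementation: the function name `stays_indexed`, the 365-day window, and the idea that the two timestamps are tracked per repository are all assumptions made for illustration.

```python
from datetime import datetime, timedelta
from typing import Optional

# Assumption: "an entire year" is treated as a rolling 365-day window.
INDEX_WINDOW = timedelta(days=365)


def stays_indexed(
    last_commit: Optional[datetime],
    last_search_hit: Optional[datetime],
    now: datetime,
) -> bool:
    """Return True if either kind of activity falls inside the window."""
    # Collect whichever activity timestamps exist for this repository.
    activity = [t for t in (last_commit, last_search_hit) if t is not None]
    if not activity:
        return False
    # The most recent activity of either kind decides the outcome.
    return now - max(activity) <= INDEX_WINDOW


now = datetime(2020, 12, 17)
old_commit = datetime(2015, 6, 1)

# An abandoned repo with no search hits would drop out of the index.
print(stays_indexed(old_commit, None, now))

# The same repo stays indexed if it matched a search a few months ago.
print(stays_indexed(old_commit, datetime(2020, 9, 1), now))
```

The point is that the two conditions are combined with "or", so a single search hit per year would be enough to keep an otherwise untouched repository in the index.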

Suggestion: update the post to make that detail a little less easy to miss!

Disappointing if there is no option to avoid this.

The most common reason I use GitHub code search is when there is some properly obscure method that I want to see in action. That’s generally when I’m using legacy Apple methods, which typically appear only in old code.

This is a real downer. I use code search all the time to find obscure but very valuable code and information. There have been several times when scouring the internet for technical information turned up nothing and code search saved the day. My anecdotal experience is that the vast majority of relevant code search hits were last edited more than a year ago. The majority of the code on GitHub lives in a repo that was last edited more than a year ago. [1]

I understand that sometimes tough business decisions need to be made, and I can’t imagine any engineer at GitHub actually wants to make this change. I want to strongly object to the language used in the announcement. In particular:

This change will enable the most relevant content for developers to surface in the code search index…

I find this dishonest and offensive. It is obviously untrue. Your announcement would be vastly improved if you delete this.

…as well as keeping code search queries fast for all customers.

This at least has the virtue of being true, at least conditionally. Still, as a general statement the FOSS community I live in is allergic to weasel words. The bitter pill would be easier to swallow without this. Better yet, just be honest and forthright:

We do not have the resources to continue code search as it currently is, and we regret having to make this difficult decision.

That would have been so much better—refreshing, even. It is also true (I imagine).

You are programmers, so you already know the disadvantages this change brings. One that I feel is particularly relevant to GitHub’s identity is the impact it will have on the social value of GitHub. For me—and I am surely not alone—the most interesting and rewarding social interactions I have on GitHub come from discovering someone else’s project. Other offerings also have issues and pull requests and similar features. What GitHub has that others do not is the vast ocean of people and projects already on the platform. Reducing accessibility to that resource diminishes your value proposition relative to alternative platforms.

I am disappointed in this change, and I am sorry you had to make the decision.

Best wishes,

Robert


[1] A different metric is the number of repositories that are over a year old. That number was reported to be two-thirds in 2018.

To me it seems perfect. Since “sort on indexed date” is the only way to get recent results from code search, it was annoying to see code from 10 years ago in the results; it didn’t make any sense.

The real improvement though would be to have the date of the document (date of the commit) in the search results.

If the repository does not have any activity for an entire year, the repository will be purged from the Code Search index.

This kills 100% of my use-cases for GitHub Search. Abandoned repos are excellent reference material for how to use long-term stable but poorly documented stuff, like X11 APIs.

Lack of commits doesn’t imply lack of research value. Sometimes it implies feature stability, or boredom/illness/death on part of the author.

This is horrible horrible news. I regularly use search to find valuable content and code that will now become inaccessible forever.

Unindexed content is dead content. A sad day indeed.

Does anyone have any idea exactly how many repos or how much code this actually affects? Some statistics would be awesome! :bar_chart:

Maybe someone at GitHub could help answer this and aid in the community’s understanding of how far reaching these changes really are? :thinking:

Bump, particularly for a GitHub employee’s response to abaines-phx.

Edit: ah, an old repo I was just searching said “try again later [due to lack of index] or contact us if it’s still not working [link to community]”, and then I found this thread (after waiting what I thought was a few minutes). I think search eventually worked after maybe O(10 min).

We are not able to provide any hard numbers here.
We are still indexing content that remains demonstrably useful. However, indexing all content would not make for an efficient search function.

@ernest-phillips

It’s hard to take this announcement seriously when GitHub can’t (or can’t be bothered to) provide any sort of summary data to support this decision. What constitutes “demonstrably useful”? And who decides that?

With regard to the implications of the announcement itself: this change is objectively un-good without at least further clarification of the true impact and some utility controls to adjust the indexing. There are many, many organizations whose businesses rely in some way on legacy code-bases (for better or worse!).

Sometimes nobody touches a repo for several years, then suddenly a critical security vulnerability or bug must be fixed to satisfy the demands of a paying customer. As others have stated, unilaterally shrinking the indexing window can cause headaches in cases like this.

As someone who uses GitHub Enterprise, I’d like to see (at a minimum) the ability to decide for myself what is indexed, for how long, and what our organization’s tolerance for speed is. Yeesh… at least give us a way to bring an expired repo back into the indexing window without filing a support case…

Given the probable technical limitations of providing such controls with the current search implementation, perhaps it would be worthwhile instead to engage with the hundreds of customers who have been pushing for fixes to GitHub search for YEARS: https://github.com/isaacs/github/issues/908

Pointed comments, yes, but this dictatorial approach to changing core functionality is ridiculous.

In my case, I have a repo whose latest commit is just 1 month old, and the web code search feature on GitHub doesn’t work for it at all.

I’d opened a new thread because I wasn’t aware of this one; my thread was moved here by a mod, thanks!