Best practices for adding advanced repository metadata?

Before you know it, your GitHub organization is a mess: thousands of repositories with different naming conventions, belonging to different departments and applications, written in different programming languages or containing just config files, and used at different levels of the organization. And of all the people who have access, who is primarily responsible? I’m looking for ways to bring some order to the chaos.

What would be good ways to add relevant metadata to the repositories, so that it can be indexed by a bot of some sort? Let me share some solutions I encountered and why I feel they are not the right fit:
A lot can be added to a README, with the benefit of showing up at first glance. But it won’t be very structured, making it difficult to parse.

GitHub Topics
GitHub topics allow you to set labels on a repository. They are great for making repositories more discoverable, but they are not structured, and the UI gets crowded after a couple of topics.

The good old MAINTAINERS file. It helps clarify who is responsible, but doesn’t allow for much more. It is also free-format, although that can be checked, of course.

The CODEOWNERS file is sort of the newer and more advanced version of a MAINTAINERS file. But it also includes information on scope within the repository, making it more difficult to parse. And entire groups can be referenced, making it more difficult to pinpoint an exact person. Like the MAINTAINERS file, its scope is limited to the maintainers, and it is not fit for other information.
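To illustrate why the per-path scope and group references make this harder to index, here is a minimal sketch of a parser a bot could use. It assumes the common CODEOWNERS layout of one pattern followed by its owners per line, with `#` starting a comment; the function names, the sample content, and the `@example-org/platform-team` team are hypothetical, and escaped `#` characters in patterns are not handled.

```python
def parse_codeowners(text: str) -> list[tuple[str, list[str]]]:
    """Return (pattern, owners) pairs from CODEOWNERS-style text."""
    rules = []
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and full-line comments
        tokens = []
        for token in line.split():
            if token.startswith("#"):
                break  # the rest of the line is an inline comment
            tokens.append(token)
        rules.append((tokens[0], tokens[1:]))
    return rules


def split_owners(owners: list[str]) -> tuple[list[str], list[str]]:
    """Separate individual owners from team references such as @org/team."""
    teams = [o for o in owners if o.startswith("@") and "/" in o]
    people = [o for o in owners if o not in teams]
    return people, teams


# Hypothetical sample content, not taken from a real repository.
SAMPLE = """\
# Default owners for everything
*       @alice @example-org/platform-team
/docs/  @bob  # documentation only
"""

if __name__ == "__main__":
    for pattern, owners in parse_codeowners(SAMPLE):
        people, teams = split_owners(owners)
        print(pattern, "people:", people, "teams:", teams)
```

Even with a parser like this, a team reference still has to be resolved to individual people through some other source, which is the limitation described above.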

publiccode.yml (publiccodeyml/publiccode.yml on GitHub), a metadata standard for public software, seems a good fit. Although it is intended for open source projects, it would also work within an internal organization. It is structured, so it is easy to parse. But as there might be different needs, it might be necessary to fork it into a custom solution. And it still requires tooling for parsing these files.

So, my question: how are others taking care of this? What tooling is commonly used?

The Code.gov project has defined a format for reporting information about code projects. This project is managed from the GSA/code-gov repository. More specifically, the data standard is defined in the GSA/code-gov-data repository, with JSON schemas defined in the schemas folder.
Edit: after a discussion on the TTS Slack, I found out that at the moment the code.json standard is considered sufficient and there are no resources available to accept improvement proposals. If you want changes, you would have to fork the project. Of course, this might change in the future.

The AWS CodeCommit Crawler for InnerSource Portal expects an innersource.json file to be present. As the name implies, that does seem to fit the use case I was aiming for. An example innersource.json file is provided in the repo.

The Backstage project looks for catalog-info.yaml files to be present. More details on the format can be found on the Descriptor Format of Catalog Entities page. Looking through GitHub, this format is already commonly used, as a GitHub search for catalog-info.yaml files shows. It supports a variety of kinds, including organizational types like groups and service-level definitions like APIs.
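As a rough idea of what an indexing bot could do with such a descriptor, here is a minimal sketch that checks an already parsed entity for a few fields. It assumes the file has been loaded into a plain dict by a YAML library of your choice; the `missing_fields` helper, the choice of which fields to require (in particular `spec.owner`), and the example values are my own assumptions, not rules taken from Backstage itself.

```python
def missing_fields(entity: dict) -> list[str]:
    """List the fields this hypothetical bot insists on but cannot find."""
    missing = [key for key in ("apiVersion", "kind") if not entity.get(key)]
    metadata = entity.get("metadata") or {}
    spec = entity.get("spec") or {}
    if not metadata.get("name"):
        missing.append("metadata.name")
    if not spec.get("owner"):
        missing.append("spec.owner")  # assumed policy: every repo names an owner
    return missing


# Hypothetical entity, shaped like a parsed catalog-info.yaml for a component.
EXAMPLE_ENTITY = {
    "apiVersion": "backstage.io/v1alpha1",
    "kind": "Component",
    "metadata": {"name": "billing-service", "tags": ["python"]},
    "spec": {"type": "service", "lifecycle": "production", "owner": "team-payments"},
}

if __name__ == "__main__":
    print(missing_fields(EXAMPLE_ENTITY))
    print(missing_fields({"kind": "Component", "metadata": {}}))
```

A check like this could run across all repositories in the organization and report the ones without a descriptor or without a named owner.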