Fundamental Issues with Maven Packages (and how to address them)

The Maven support in GitHub Packages is rather odd in a number of ways; much of the time, it feels like an NPM hosting product which was extended to allow Maven metadata. Some of those oddities are simply annoying and/or confusing, but others are more fundamentally problematic. Note that this post isn’t intended to be excoriating (I really like GitHub Packages in concept, and I’m using it across numerous projects public and private); it is meant to productively put forward possible solutions to the problems from the perspective of an end-user.

For context, I maintain numerous major open source projects in the Scala ecosystem, and I manage a company which has a sprawling polyrepo structure of public and private projects published using Maven. I’ve been building infrastructure tooling for over a decade, and I currently maintain the sbt-github-packages plugin (which provides GitHub Packages publication and resolution support for Scala’s Sbt build tool).

Resolvers

The biggest problem with Maven Packages today, by far, is the fact that it is effectively designed around the “one package per resolver” concept. Resolvers are specific to repositories, and a repository is generally going to correspond to a small number of packages at most. This is severely annoying for any polyrepo corporate setup (such as at my company), and it’s utterly disqualifying for any open source usage.

The problem is that, in order for users to depend upon packages published to GitHub Packages, they need to add a new resolver (e.g. https://mvn.pkg.github.com/djspiewak/shims) for every single repository that published something on which they depend. These explicit resolvers are required for both direct and transitive dependencies.
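To make the scale of this concrete, here is a rough sketch in Python (purely illustrative, not tied to any build tool) of how the resolver list grows. The `required_resolvers` helper and the idea of tagging each dependency with its owner and repository are my own assumptions for the example; only the URL shape comes from the resolver address above.

```python
# Illustrative sketch: each (owner, repository) that publishes a dependency
# needs its own resolver URL in the downstream build.
BASE = "https://mvn.pkg.github.com"

def resolver_url(owner: str, repo: str) -> str:
    """Build the per-repository resolver address."""
    return f"{BASE}/{owner}/{repo}"

def required_resolvers(dependencies):
    """dependencies: (owner, repo, artifact) tuples, direct AND transitive.

    Returns the distinct resolver URLs, in first-seen order.
    """
    urls = []
    for owner, repo, _artifact in dependencies:
        url = resolver_url(owner, repo)
        if url not in urls:
            urls.append(url)
    return urls

# Hypothetical dependency set for the sake of the example.
deps = [
    ("djspiewak", "shims", "shims_2.12"),
    ("slamdata", "tectonic", "tectonic_2.12"),
    ("slamdata", "tectonic", "tectonic-fs2_2.12"),  # same repo, no new resolver
]
resolvers = required_resolvers(deps)
```

The list only shrinks when two artifacts happen to live in the same repository, which is exactly the case that is uncommon in a polyrepo setup.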

The problems here are two-fold. First, from a user standpoint, having to manually add resolvers for every single direct and transitive dependency amounts to the same thing as removing transitive dependency functionality entirely, since all effective dependencies must be explicitly declared as resolvers (while direct dependencies are also declared as artifacts, redundantly). This is very, very annoying and very brittle. It’s not too much of a problem right now, since almost everyone publishes to Maven Central, but if GitHub Packages were to become widespread within the OSS community, builds would become all but unmaintainable, given the number of effective dependencies common in most builds (usually in the dozens-to-hundreds of artifacts).

Second, from a tooling standpoint, resolvers are incredibly slow. Every resolver necessitates a server round-trip to fetch metadata and validate (or invalidate) the existence of a particular artifact, and the list of resolvers must be traversed linearly, and then retraversed for every artifact being resolved! There isn’t a lot that can be done to optimize this without introducing severe cache consistency issues, particularly on repositories with frequent publication, which means that any build which has a nontrivial number of dependencies published to GitHub Packages will have pathologically long build times.
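As a back-of-the-envelope model of that cost (a deliberate simplification: real tools cache and may parallelize, and the function name here is my own), the following Python sketch counts one round-trip per resolver probed, walking the resolver list in order for each artifact until a hit.

```python
def count_round_trips(artifacts, resolvers):
    """Count metadata round-trips under a naive linear traversal.

    artifacts: list of artifact names to resolve.
    resolvers: ordered list of sets, each set holding the artifacts
               that resolver hosts.
    """
    trips = 0
    for artifact in artifacts:
        for hosted in resolvers:
            trips += 1            # one server round-trip per probe
            if artifact in hosted:
                break             # found; stop probing further resolvers
    return trips

artifacts = [f"lib{i}" for i in range(50)]

# One resolver per repository, each hosting exactly one artifact.
one_per_repo = [{name} for name in artifacts]

# A single aggregating resolver hosting everything.
aggregated = [set(artifacts)]

print(count_round_trips(artifacts, one_per_repo))  # 1275
print(count_round_trips(artifacts, aggregated))    # 50
```

With 50 artifacts spread over 50 resolvers, this model gives 1 + 2 + ... + 50 = 1275 probes, against 50 for a single aggregating resolver. The growth is quadratic in the number of repositories.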

To be clear, Bintray (the other major alternative to Maven Central for OSS) already has both of these problems to a degree, but conventionally, most users only have a single repository to which they publish all of their packages. This is in contrast to the GitHub Packages approach, where individual users may have dozens or even hundreds of repositories, each requiring its own resolver in downstream builds. Additionally, Bintray hasn’t really become ubiquitous in the Maven ecosystem, partially for these reasons, which ameliorates the problem.

If GitHub Packages were to become commonplace, the Maven ecosystem would, quite literally, collapse.

I can think of two possible solutions to these problems while still preserving GitHub’s meta-distributed nature:

  • Package repositories remain tied to the repository, but there is an organization/user-global mirror package address which aggregates all of the packages from all of the repositories. We can already publish to https://mvn.pkg.github.com/user (without any repository); there is basically no reason why we shouldn’t also be able to resolve from it. This would reduce the “one resolver per repository” problem to “one resolver per user”, which is not great, but at least no worse than Bintray is today. This also entirely solves the problem for corporate accounts (such as my company).
    • My company currently approximates this behavior by having a pair of special repositories which contain no code: one for public packages and one for private. This is clunky and annoying to say the least, and also violates the stated design goal of GitHub Packages of keeping artifacts and code together.
    • As a note, Packages already includes an organization-global check to ensure that packages are only ever published to a single repository per organization. So in other words, you already have the metadata necessary to do an organization-wide aggregating mirror; you should use it.
  • Extend the Maven POM specification with resolver information in published artifacts. This would require coordination with major tool authors: at a minimum, Maven, Ivy, Sbt, Gradle, and Lein. A published artifact could optionally include the set of resolvers with which it was built. During transitive resolution, tooling could take these resolvers into account as low priority repositories for anything at or below that level in the graph. This would be somewhat complicated to implement for most tools, but not impossible, and it would allow intended resolver information to be effectively paired with artifactId information (which is already published), while still allowing higher topological levels of the graph (such as the user’s own build) to override, facilitating forking and such. Overall I think this is a very good approach for the ecosystem and would ultimately liberate Maven as a whole from its effective strong centralization (since, as a practical matter today, having numerous resolvers is untenable). This doesn’t solve the performance problems associated with zillions of resolvers, but if resolvers are conventionally at the user/org level rather than the repository level, the performance issues should be relatively confined.

I strongly believe that both of these should be pursued, though obviously only the first one is implementable solely by GitHub; the latter requires coordination with the tooling ecosystem. If GitHub is serious about making GitHub Packages a meaningful option for the Maven ecosystem, public or private, these truly are necessary changes.
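For the second proposal, the priority rule I have in mind could look roughly like the Python sketch below. None of this exists in any tool today: `effective_resolvers`, the notion of "hints" carried by each published artifact, and the user-level URLs are all hypothetical.

```python
def effective_resolvers(build_resolvers, path_hints):
    """Compute the resolver search order for one artifact in the graph.

    build_resolvers: resolvers declared in the user's own build
                     (highest priority, so forks and overrides win).
    path_hints: one list of resolver hints per artifact on the path from
                the direct dependency down to the artifact being resolved,
                ordered top-down. Hints nearer the top take precedence.
    """
    ordered = []
    for group in [build_resolvers, *path_hints]:
        for url in group:
            if url not in ordered:   # keep the first (highest-priority) position
                ordered.append(url)
    return ordered

# Hypothetical example: the build declares one resolver, and two artifacts
# on the dependency path each carry hints from their own builds.
order = effective_resolvers(
    ["https://mvn.pkg.github.com/my-org"],
    [
        ["https://mvn.pkg.github.com/slamdata"],
        ["https://mvn.pkg.github.com/djspiewak",
         "https://mvn.pkg.github.com/slamdata"],
    ],
)
```

The point of ordering things this way is that a hint can only ever add a fallback location; it can never shadow something the user's build (or a higher-level dependency) has chosen.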

Release Atomicity

This is another big one. Even polyrepo-style organizations often have several artifacts contained within a single repository. Usually this number is not particularly high (usually single digit order of magnitude), but still more than just one. This, btw, completely eliminates the apparently intended “convenience” of publishing to the organization-level address (https://mvn.pkg.github.com/owner), since artifactIds will practically never correspond with repository names.

The problem here is a lack of staging support within the publication system. Not only does each repository correspond to multiple artifacts, but each artifact corresponds to multiple files! It is not at all unusual for a build tool to publish artifacts as they become available during the build process, even when the entire build has yet to fully complete. This means it is not at all unusual for a release to partially publish, failing after pushing some (or even most) of the artifacts, but not all.

Staging fixes these problems and is fully supported by both Sonatype Nexus and Bintray (though with caveats that make it clear that JFrog doesn’t actually understand how important this feature is). The idea behind staged releases is that, during a release, artifacts and files are published to a staging location and not made visible until all are fully uploaded, at which point a separate API call (e.g. “sonatypeReleaseAll”) is made which atomically moves the staged artifacts over to the main repository. This two-phase commit ensures an “all or nothing” semantic to releases.

Note that each staging process needs to carry an explicit (randomly assigned) ID, since multiple unrelated releases may be performed concurrently in separate builds, and each should be staged and released independently. This is something that Sonatype gets wrong even to this day (though, bizarrely, half of Nexus’ API does support this type of thing, it’s just that the other half doesn’t and so the whole thing falls over). In practice, such concurrent conflicts are rare, but they do happen.

Right now, GitHub Packages has absolutely no concept of staging. Things are released essentially as they are published, and this means that “half completed” releases are the norm across many projects. This really isn’t a tenable situation, particularly since public packages are (rightfully!) impossible to delete. Frankly, this lack of two-phase releasing alone makes GitHub Packages almost unusable, even if the aforementioned repositories problem were fixed.

My proposal to fix this: add three new API endpoints which manage staging. The first endpoint should create a staging release and produce an ID token which will then be passed to subsequent calls and publish POSTs. Any artifact file upload which contains such an ID will be added to the staging repository rather than immediately released. Once all uploads have been completed, a second API endpoint will be invoked to release the staging repository, which invalidates the ID token for subsequent usage and atomically makes all artifacts and files available for resolution. The third API endpoint would take a staging ID and simply delete the staging repository, representing a rollback semantic for a partially failed release (implementable only because staging repositories are not visible until fully released).
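To pin down the semantics I'm asking for, here is a small in-memory model in Python. It is not a description of any existing GitHub API: the class and method names are invented, and "atomic" here just means a single step within this toy model.

```python
import uuid

class StagingRegistry:
    """Toy model of the three proposed staging operations."""

    def __init__(self):
        self.released = {}   # path -> bytes, visible to resolvers
        self._staged = {}    # staging id -> {path: bytes}, invisible

    def create(self) -> str:
        """Endpoint 1: open a staging repository and return its random ID."""
        staging_id = uuid.uuid4().hex
        self._staged[staging_id] = {}
        return staging_id

    def upload(self, path, data, staging_id=None):
        """Publish a file; with a staging ID it stays invisible for now."""
        if staging_id is None:
            self.released[path] = data             # current behaviour
        else:
            self._staged[staging_id][path] = data  # KeyError if ID is invalid

    def release(self, staging_id):
        """Endpoint 2: make everything visible at once, invalidating the ID."""
        files = self._staged.pop(staging_id)
        self.released.update(files)

    def drop(self, staging_id):
        """Endpoint 3: roll back by discarding the staging repository."""
        self._staged.pop(staging_id)

registry = StagingRegistry()
sid = registry.create()
registry.upload("shims_2.12-1.2.3.pom", b"<project/>", sid)
registry.upload("shims_2.12-1.2.3.jar", b"...", sid)
visible_before = dict(registry.released)   # still empty at this point
registry.release(sid)
```

Because each staging ID is independent, two concurrent releases from separate builds never interfere with one another, and a failed build just calls `drop` (or never calls `release`).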

Packages is somewhat usable for corporate applications (such as my company) without this change, but it’s… less than ideal. It is categorically unusable for OSS in the Maven ecosystem without this kind of functionality.

Display Issues with Scala Artifacts

Due to incompatibilities between major versions of the Scala language, all artifacts in the Scala ecosystem are published with a major version suffix in their artifactId indicating the language version with which they are compatible. For example:

com.codecommit:shims_2.12:1.2.3

GitHub Packages internally handles such artifacts without a problem (as it should, since they are entirely compliant with the Maven POM specification). However, the web UI appears to have some display issues related to them. As an example: https://github.com/slamdata/tectonic/packages/121844 Note the entirely misleading XML fragment which is generated by the UI. The artifact ID is correct here (and accurately displayed by the title header), but the Maven POM fragment is wrong. This doesn’t bother me personally, but it is an issue raised by a large number of users of the sbt-github-packages plugin.

Also, fwiw, there are a lot of tools which use the Maven ecosystem which are not Maven and cannot use the XML fragment shown anyway. It is more conventional to display a tabbed interface in which several tool options are shown. For example, note the right hand side of this page: https://search.maven.org/artifact/com.slamdata/tectonic_2.12/6.0.0/jar
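For reference, this is roughly what I'd expect a tabbed display to render for the example coordinate above. The Python below is only a sketch (the `snippets` helper is made up), but the three output formats are the standard Maven, sbt, and Gradle dependency notations, with the `_2.12` suffix kept as part of the artifactId.

```python
def snippets(coordinate: str) -> dict:
    """Render dependency declarations for a 'group:artifact:version' string."""
    group_id, artifact_id, version = coordinate.split(":")
    maven = (
        "<dependency>\n"
        f"  <groupId>{group_id}</groupId>\n"
        f"  <artifactId>{artifact_id}</artifactId>\n"
        f"  <version>{version}</version>\n"
        "</dependency>"
    )
    return {
        "maven": maven,
        # Plain % keeps the full artifactId, Scala suffix included.
        "sbt": f'"{group_id}" % "{artifact_id}" % "{version}"',
        "gradle": f"implementation '{group_id}:{artifact_id}:{version}'",
    }

tabs = snippets("com.codecommit:shims_2.12:1.2.3")
for tool, text in tabs.items():
    print(f"--- {tool} ---")
    print(text)
```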

This is a minor issue, IMO, but still one worth addressing.

Public Packages Require Authentication

This is really just a silly thing, but seriously, why does the resolution of public packages require a token? The especially silly thing about it is the token itself doesn’t appear to require any particular permissions, it just has to be a valid token (app or personal) from some user, somewhere. It doesn’t even need to be the same user as the one given in the User header for the same HTTP request!

This is just weird. The only reason I can think of for why this was done is rate limiting, but you already allow unauthenticated cloning and other similar operations against your API, so I can’t really understand why you would prevent unauthenticated access in this one case. At the very least, this inhibits the usability of end-user tooling for GitHub Packages, since everyone needs to correctly configure their local environment with their personal token, even when they’re only acting as a user of someone else’s published artifacts.
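To illustrate what every consumer currently has to set up, here is a Python sketch that builds (but does not send) a request for a POM. I'm assuming HTTP Basic authentication with a username and token, the standard Maven repository path layout, and that the artifact lives in the repository shown; the function name and the placeholder credentials are made up for the example.

```python
import base64
import urllib.request

def pom_request(owner, repo, group_id, artifact_id, version, user, token):
    """Build an authenticated GET request for an artifact's POM file."""
    # Standard Maven layout: group/with/slashes/artifact/version/artifact-version.pom
    path = "/".join(
        [*group_id.split("."), artifact_id, version, f"{artifact_id}-{version}.pom"]
    )
    url = f"https://mvn.pkg.github.com/{owner}/{repo}/{path}"
    # Assumption: the token is sent as the password in HTTP Basic auth.
    credentials = base64.b64encode(f"{user}:{token}".encode()).decode()
    return urllib.request.Request(url, headers={"Authorization": f"Basic {credentials}"})

# Placeholder credentials; a real build would read the token from its environment.
req = pom_request(
    "slamdata", "tectonic", "com.slamdata", "tectonic_2.12", "6.0.0",
    user="some-user", token="some-token",
)
print(req.full_url)
```

For a public package, the `Authorization` header is the part that arguably shouldn't be necessary at all.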

Conclusion

GitHub Packages shows a lot of promise, and in theory I really like both a) the distributed nature enabled by GitHub’s existing user/org hierarchy and strong support for forking, and b) the fact that build artifacts are (theoretically) kept proximate to the code itself. However, in practice, the Maven implementation is almost unusable in most cases, and at best feels very much like it was designed for a very different ecosystem (NPM) and has only been retrofitted to Maven.

@djspiewak Thank you for all of your opinions and suggestions. We really care about your feedback on GitHub Packages. I would recommend sharing your ideas through the feedback form for GitHub Packages.

Thank you for helping us build a better GitHub.

As a matter of fact, these should be very easy since the package names have to be unique across the organization. I’ve tried to upload the same package name to different repos and get an error. It should be a matter of exposing the endpoint that shows them all.

@djspiewak - you said, “My company currently approximates this behavior by having a pair of special repositories which contain no code: one for public packages and one for private. This is clunky and annoying to say the least, and also violates the stated design goal of GitHub Packages of keeping artifacts and code together.” How did you achieve this; do you have any instructions for doing so?