I am completely new to GitHub, though I have been programming for 40 years and using git for about 2 years now. In very general terms, I must admit first of all that I still have big trouble getting the "git logic" and "git language" into my head - much more than with the many programming languages (and also CVS and SVN...) that I have already loaded into my head - so I would be very happy if somebody could explain in a not too "git-ish" language or slang!
My question is about proper project setup with git and GitHub - which is related but not identical. So far I have it in a local git repository that is accessible only for myself, on an external computer with backup, so it should not get lost. It is all written in C++, with CMake as the build system and Qt for the GUI parts.
Now I have a "base" version of the software that I want to publish as Open Source, but I also have an "extended" version that, for various reasons, I can only sell for a license fee. In order to get the code published, I put some effort in over the last months to split it into two projects: the "base" that is in one git repository, and the "extended" as a separate second project that depends on "base" through some CMake constructs. The advantage is that I can now publish "base" freely while "extended" remains completely hidden. Still - I have not yet found the time to actually do the publication.
Instead I found a customer for whom I am writing another "custom extension" for the "base" version. So far I simply made it a "branch" of "base", so I can switch between "master" and "customer", and if there is something that should be shared I can "merge" between the branches - very handy and helpful!
However: Now the "base" repository is again not in a state that I can publish it - because it contains in a branch the "customer" code that I am not supposed to publish!
In other words: I am back in the same trouble - and it would again be very tedious to split the codes...
Ideally I would even merge the "extended" version back into the main repository and have then three "branches": public "master", and private "extended" and "customer", but this is not possible! As far as I understand I have to either publish the entire repository (and expose also the "private branches"), or I have to do a huge effort to separate the open and closed parts of the code.
Now I was reading about some ways to deal with this kind of problem on both GitHub and GitLab, because you can not only "branch" but also have some kind of "more separate branch" that still allows you to merge code from one to the other - first of all from the public "master" to the private "children".
So finally the question: Is this true - and if yes: What is the name of this technology, and where can I find some instructions to make it happen?
I would very much like to understand the procedure in advance, before I jump into reorganizing my entire code again (always very error prone!).
Many thanks in advance for any helpful answers!!
Thank you for all the context to help me understand your particular problem better. Let me start off by suggesting @jwiegley's Git from the Bottom Up e-book as, in my opinion, the best way to "load git into your head".
To answer your specific question, "is it possible to have private branches?" Yes, it is possible, though not the way you seem to be thinking about them.
You are correct that it is not possible to have a published repository that contains a branch `foo` where the contents of branch `foo` are not accessible to people with read access to the repository. However, it is possible to have two instances of a single repository where one instance is public and the other is private. You can publish the branches that need to remain private to the private instance and publish the branches that can be public to the public instance.
With that said, I would not normally recommend this kind of setup because one would need to be highly disciplined to avoid accidentally publishing private branches to the public repository instance. You could create custom tooling to make mistakes less likely, but even if you eliminated the possibility that private branches would be exposed, there would still be the chance that code would be added to the wrong (public) branch and leaked in that way.
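As one possible form of such custom tooling, here is a minimal sketch of a git `pre-push` hook that refuses to push anything except `master` to a remote named `public`. The remote names, the temporary directories, and the file names are assumptions made for this illustration; local bare repositories stand in for the hosted ones. Note that the hook only checks branch names, so it does nothing against private code committed to `master` by mistake, and it is bypassed when pushing to a URL instead of the remote name.

```shell
#!/bin/sh
# Sketch only: remote names "public"/"private" and all paths are assumptions.
set -e
work=$(mktemp -d)
git init -q --bare "$work/public.git"    # stand-in for the public hosted repository
git init -q --bare "$work/private.git"   # stand-in for the private hosted repository
git init -q "$work/local"
cd "$work/local"
git symbolic-ref HEAD refs/heads/master  # make sure the first branch is called master
git config user.name "Demo"
git config user.email "demo@example.com"
git remote add public "$work/public.git"
git remote add private "$work/private.git"

# Install the hook. git passes the remote name as $1 and, on stdin,
# one line per ref: <local ref> <local sha> <remote ref> <remote sha>
mkdir -p .git/hooks
cat > .git/hooks/pre-push <<'EOF'
#!/bin/sh
remote="$1"
[ "$remote" = "public" ] || exit 0       # only guard the public remote
while read -r local_ref local_sha remote_ref remote_sha; do
    if [ "$remote_ref" != "refs/heads/master" ]; then
        echo "pre-push: refusing to push $local_ref to '$remote'" >&2
        exit 1
    fi
done
exit 0
EOF
chmod +x .git/hooks/pre-push

echo "public" > README.md
git add README.md
git commit -q -m "Initial commit"
git checkout -q -b private-branch
echo "secret" > secret.txt
git add secret.txt
git commit -q -m "Add private code"

git push -q public master                # allowed by the hook
git push -q private private-branch       # other remotes are not restricted
if git push -q public private-branch 2>/dev/null; then
    echo "leak: private-branch reached public"
else
    echo "blocked: private-branch stayed out of public"
fi
```

Hooks live in the local `.git` directory and are not copied by `git clone`, so every working copy would need the hook installed separately.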
There's no name for the feature or technology, because it is a consequence of how distributed version control systems, like git, are designed, specifically the "distributed" part. For the purpose of illustration, here's how this could be achieved. I'm going to start from an empty repository because it's easier 😀
    mkdir sample-split-repo
    cd sample-split-repo
    git init
    touch README.md
    git add --all .
    git commit -m "Initial commit"
At this point, we have a basic repository with the one `README.md` file in it. Let's add a file for code:
    touch script.rb
    # In an editor, add some simple statements to the file
    git add --all .
    git commit -m "Add public code"
Now let's create the two repositories. I've created `lee-dohm/split-repo-public` and `lee-dohm/split-repo-private` on GitHub for the purposes of this demonstration. Then we add both repositories as "remote" repositories of our local copy:
    git remote add public https://github.com/lee-dohm/split-repo-public.git
    git remote add private https://github.com/lee-dohm/split-repo-private.git
And then we push the current version of the `master` branch (containing only public code) to both repositories:
    git push public master
    git push private master
Now, let's create some private code on a separate private branch:
    git checkout -b private-branch
    # Make some changes to script.rb to add private code
    git commit -a -m "Add private code"
    git push private private-branch
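So, at this point, we have two repositories containing mostly the same stuff. Both are public so that you can see how this works, but in practice the split-repo-private repository would actually be private. The difference between them is that `private-branch` was pushed only to the private repository, so the public repository contains only `master`.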
Now, at any time, you can "promote" private code to public code by simply merging the code from the `private-branch` branch into `master` and pushing the latest copy of the `master` branch to the public repository.
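The whole walkthrough, including the "promotion" step, can be tried without touching GitHub. The following is a sketch under the assumption that two local bare repositories in a temporary directory stand in for the hosted ones; the paths and the file contents are made up for the illustration.

```shell
#!/bin/sh
# Sketch only: local bare repositories replace the two GitHub repositories.
set -e
work=$(mktemp -d)
git init -q --bare "$work/split-repo-public.git"
git init -q --bare "$work/split-repo-private.git"

git init -q "$work/sample-split-repo"
cd "$work/sample-split-repo"
git symbolic-ref HEAD refs/heads/master
git config user.name "Demo"
git config user.email "demo@example.com"

echo 'puts "public"' > script.rb
git add script.rb
git commit -q -m "Add public code"

git remote add public "$work/split-repo-public.git"
git remote add private "$work/split-repo-private.git"
git push -q public master
git push -q private master

# Private work happens on its own branch and goes to the private remote only
git checkout -q -b private-branch
echo 'puts "private"' >> script.rb
git commit -q -a -m "Add private code"
git push -q private private-branch

echo "Branches in the public repository before promotion:"
git --git-dir="$work/split-repo-public.git" for-each-ref --format='%(refname:short)'

# Promote: merge the private branch into master and publish master
git checkout -q master
git merge -q private-branch
git push -q public master

echo "History of public master after promotion:"
git --git-dir="$work/split-repo-public.git" log --oneline master
```

Keep in mind that a merge publishes the commits of the private branch, including their messages and intermediate states, not just the final file contents.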
I hope that helps and let me know if you have any questions.
Thank you very much for your friendly and very clear explanation - almost a little tutorial: I appreciate it very much!
Actually you are using the feature of having several "remotes" for one and the same local repository which I knew in principle, but somehow never really "trusted" - because I did not really know what I am doing. Maybe the best thing is indeed that I just play a little with your example, or with a similar one that I create locally - and where I cannot do any damage. Until I feel comfortable with it!
There was another track that I was considering, with "branches" a bit more separate - but I don't know if I can do all the necessary steps in that setup:
- Start with one local repository and move it to Github (or possibly Gitlab - no idea yet).
- Then make some kind of remote "clone" of the public repository and make that one private. Now I would also have two repositories on Github, both with the same content, but in some kind of "parent-child" relation.
- Now I can generate also a local version of that second Github repository, so also locally I would have two of them: one where I manage the private project, and one for the public.
But now the QUESTION is in this case: are there ways to "merge" changes in the public repository to the private clone? Because that is the intention: updates of the public project should be taken over also in the private.
The other way round would not normally happen - which is basically a question of adding every update to the right project locally and pushing it to the right Github remote.
For me this setup would still look a bit more "safe" than your proposal, which needs a bit more awareness of always doing the right thing! But ok, in my potential setup this kind of awareness is of course required as well.
Only the question would be: Is there an easy way to do such kind of "merge from the public"?
Actually it should be the case because I know that there are lots of public repositories on github, licensed in such a way that I am allowed to derive "closed" projects from them - and I assume that also there should be a way to "pull" public updates into my private project (although in "git language" the verb "pull" has a very specific meaning, I am using the verb now in the more general "english language"!).
Any comment on that idea? Is it a) possible and b) advisable in your eyes?
The short answer is: Yes, it is possible to do what you describe. But because git is a distributed version control system, there is no material difference between your solution and my solution.
The version control systems that you're used to, you mentioned CVS and SVN specifically, are designed in a client/server configuration. There is one central repository that is the single source of truth that acts as the server. Then there are many clients that interact with the server to update it with new content or changes. Clients can't exchange new content or changes between each other except through the central server. As a matter of fact, the clients never have a full copy of the repository locally.
Distributed version control systems, on the other hand, are designed so that every instance of the repository is a full "clone" of the entire repository. There is no "server" except by agreed-upon convention and any instance can exchange new content or changes with any other instance. All clones of any repository are peers. It's only that everyone's used to client/server version control architectures that makes it so that people typically describe networks of git repositories in the client/server way to make it easier for new people to understand.
For example, let's say that there are three members of a team working on a project together: Alice, Bob, and Clarice. They have a git repository on a local server and a cloned backup of it off-site that is kept up-to-date by a scheduled task. With this setup, all of these workflows are possible using git (or theoretically any other distributed version control system):
I hope those examples illustrate the possibilities that are available with this kind of system. It doesn't matter which repository is created "first" or "last", they're all equal partners in the network. It's only a question of which instance has which changes.
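To make the "equal partners" idea concrete, here is a small sketch of one such exchange: Bob takes a commit directly from Alice's clone without it ever passing through the shared repository. The directory layout, the remote name `alice`, and the file names are assumptions for this illustration only.

```shell
#!/bin/sh
# Sketch only: three repositories in a temporary directory play server, Alice and Bob.
set -e
work=$(mktemp -d)

# The team's shared repository
git init -q --bare "$work/server.git"
git --git-dir="$work/server.git" symbolic-ref HEAD refs/heads/master

# Alice creates the project and publishes the first version
git init -q "$work/alice"
cd "$work/alice"
git symbolic-ref HEAD refs/heads/master
git config user.name "Alice"
git config user.email "alice@example.com"
echo "v1" > notes.txt
git add notes.txt
git commit -q -m "First version"
git remote add origin "$work/server.git"
git push -q origin master

# Bob clones from the shared repository
git clone -q "$work/server.git" "$work/bob"

# Alice commits again but does NOT push to the shared repository
echo "v2" >> notes.txt
git commit -q -a -m "Work in progress"

# Bob adds Alice's clone as just another remote and pulls from it directly
cd "$work/bob"
git remote add alice "$work/alice"
git pull -q --ff-only alice master

git log --oneline
```

After this, Bob's clone has Alice's second commit while the shared repository still only has the first one.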
So, no matter what topology you choose for the network, the same level of care is needed to prevent mistakes.
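The setup asked about above, a private clone that takes over updates from the public repository, might look like the following sketch. The remote name `upstream`, the paths, and the file names are assumptions chosen for the example; in a real setup the private clone would additionally have a remote pointing at its own private hosted repository.

```shell
#!/bin/sh
# Sketch only: "public" and "private" are plain local repositories here.
set -e
work=$(mktemp -d)

# The public project with one commit
git init -q "$work/public"
cd "$work/public"
git symbolic-ref HEAD refs/heads/master
git config user.name "Demo"
git config user.email "demo@example.com"
echo "base" > base.txt
git add base.txt
git commit -q -m "Public base"

# The private project starts life as a clone of the public one
git clone -q "$work/public" "$work/private"
cd "$work/private"
git config user.name "Demo"
git config user.email "demo@example.com"
git remote rename origin upstream        # the public repository is only a source of updates
echo "extension" > extended.txt
git add extended.txt
git commit -q -m "Private extension"

# Meanwhile the public project moves on
cd "$work/public"
echo "fix" >> base.txt
git commit -q -a -m "Public fix"

# Take the public update over into the private project
cd "$work/private"
git fetch -q upstream
git merge -q --no-edit upstream/master

git log --oneline --graph
```

Nothing flows in the other direction unless someone explicitly pushes to the public repository, which is exactly the step that still needs care.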
I hope that answers your question. Let me know if you have any more.
Thanks again for your explanations! I know that learning one thought is always easier than a "way of thinking", and git is in that sense a "way of thinking", not just another version control system!
One thing I realize from your last explanations - which are not completely new for me, but it was not sufficiently in the "core" of my reasonings: If all the repositories - "local", "remote", whatever... - are more or less equivalent, the "natural thing" would always be to keep them all aligned as much as possible. Like when adding ink to a pot with water, the blue will initially be local, but eventually it would spread all over the pot and no differentiations remain.
The "unnatural" thing is then the intention to keep some "private" extension of a "public" repository separate. If it is just two projects - no problem: it is like having two pots with water with different amount of ink that is spreading. But if you want to keep a differentiation within one single pot, the physical comparison does not fit any more because in the case of water and ink it is impossible!
And in the case of git repositories, this is the point where "taking very much care" comes into the game!
For me the bottom line is: I have to keep this in mind, do some little local "playing around" with mini repos with mini changes, until I feel safe to do the same with my real code.
I also learn from this that there might be less of "git extension" if I go to Github (or also gitlab) than I thought: All these "pull request" etc. things that you find there are then more a certain communication strategy between users than a fundamental change in the overall logic - which is still "git logic".
Distributed version control systems, like git, do require a certain mental paradigm shift to really grok, it is true. But I think you're getting it now. At least, from your description, I think you understand the risk I was trying to convey in my first message 😀
Please do feel free to reach out with any more questions!
Thank you for all - you really helped me a good deal further! I am right now doing some little local tests with only few files and several repositories, and I am studying "Git from the Bottom Up" - which is exactly the kind of thing that I missed: So far most of my explanations were "do this - do that - it is all very easy" ... and at the end I felt like I understood a lot but nothing really at depth...
Only one more question - if you happen to have an answer: You initially proposed a setup that would solve my problem - saying at the same time that you would not recommend it! So the question still is whether there is a git setup that you would recommend indeed!? Because - the problem that I need to solve is real! And I believe that I am not the only one with a similar setup:
[public base project] -> [private extended project]
Actually my public base project is again derived from a public project, which is Paraview (see paraview.org), written in C++ and managed with CMake (where the makers of CMake happen to be the same as the makers of Paraview). They are solving the problem at the CMake level, not at the git level: You have a base project (Paraview), and then you have a separate project that "imports" the base project with the means of CMake. So at the git level there is no interference at all, and you can change the version of the underlying Paraview for your project, do some adaptations to the new version and recompile.
However, with this setup you still have to duplicate quite a number of source files and "settings files" of sorts, depending on the level of "intrusiveness" of the derived project of course (these then need to be reworked manually whenever Paraview is updated to a new version), and the project setup is sometimes overly complex, just in order to fit the pattern. (Which does not mean that the Paraview people have not done fantastic work: I am really really impressed day after day!)
But this is basically the reason why I was looking for my own solution at the git level - in order not to add even more complexity to this setup, where the first "derivation" is given (CMake based) and I am trying to do the second with the means of git:
[Paraview] -> [base project] -> [extended project]
Anyway, I am not stuck, so no real need to answer in detail - because I think I understand now both the shortcomings and strengths of the "CMake solution" and the "git solution". Unless you still have another hint or remark that I did not think of so far!
You've definitely got a good understanding of the choice in front of you.
By using a package manager, one can take several open- or closed-source components and combine them into a larger project. Each of those components is also probably composed of several lower-level components and so on. This allows the decision of "should this project be open or closed source?" to be made at the project level, rather than trying to remember which pieces of a project are supposed to be public or private.
I'm not sure if the project you're trying to build lends itself to that kind of methodology or if a package or dependency manager is available to you in the toolchain or language ecosystem you've chosen (or are restricted to for various other reasons). But using a system based on the "CMake solution" with the help of a dependency/package manager is what I have seen be successful most often for the kind of problem you're describing.
I made a lot of assumptions about your situation and the tools you're using. So if I'm stating the obvious, I apologize. I felt that a discussion of this problem space wasn't complete without mentioning dependency management. In any case, I hope that's helpful!