Is there a way to split a source file into 'regions' that resolve branch/fork merges independently?


I work on a fork of a heavily forked (>1000 forks) repository that requires frequent merges from the daily-updated base. During merges, certain source files (which essentially specify large arrays of similarly formatted but distinct, discrete data) often cause seemingly unnecessary conflicts when Git tries to merge content from different elements into each other. The order of the elements may differ between forks and is consequential to the program structure, so it can't be changed arbitrarily on child forks just to reduce the number of merge conflicts. This causes 10+ conflicts per week in some files that must be resolved manually, but that seem as though they ought to merge automatically, if only Git had more context about the source code.

In cases such as these, is there a way to specify metadata that makes Git treat individual regions of code as discrete entities (as if they were separate source files), so that changes are merged automatically only within a region rather than across the source file as a whole? Or is there another good way of resolving this problem? I looked into rerere, but that seems unsuitable when the precise content of the conflicts isn't known in advance. This seems like a common use case for heavily branched or forked repositories.

Thank-you for any advice!

Great question! Unfortunately, the answer is “not really, no.”

Source control systems since the days of our forefathers have essentially been line-oriented beasts: they detect differences between files using the diff utility. Since diff calculates the minimum number of lines that have to be changed to transform the "before" file into the "after" file, it often doesn't deal super-well with structured data that isn't strictly line-oriented, for example JSON or XML. This creates a tension between what makes the most sense from a language-design perspective and what makes sense from a "simple tools are more robust" perspective.
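To see why line-orientation hurts here, consider a toy data file (the element names are invented for illustration) where one element is merely moved. A line-based diff, sketched with Python's difflib, reports a deletion plus an insertion rather than "nothing semantically changed":

```python
import difflib

# Hypothetical data-file contents: the same three array elements,
# with "alpha" moved to the end. A line-based diff sees a delete
# plus an insert, not a reorder.
before = ["alpha = [1, 2]", "beta  = [3, 4]", "gamma = [5, 6]"]
after  = ["beta  = [3, 4]", "gamma = [5, 6]", "alpha = [1, 2]"]

diff = list(difflib.unified_diff(before, after, lineterm=""))
for line in diff:
    print(line)
```

When two forks each touch lines near such a moved element, Git's three-way merge sees overlapping line-level changes and raises a conflict, even though the elements themselves never collided.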

With that said, the line-oriented approach works really, really well given how little the tool has to know about the contents of the file. Essentially, the file has to be a text file, preferably with multiple lines of text. It helps if the tool knows the exact sequence of characters that signifies a line separator (though since the Windows line separator contains the Unix line separator inside it, it can wing it in some cases). Beyond that, it doesn't have to understand the syntax of the information in the file, and it doesn't need to know anything about the semantics of the file's contents.
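That "winging it" with line separators is easy to demonstrate: splitting Windows (CRLF) text on the Unix separator still finds every line boundary, each line just carries a stray carriage return.

```python
# A file saved with Windows line endings ("\r\n") still contains the
# Unix separator ("\n") inside each line ending, so a naive split on
# "\n" still finds the boundaries -- each piece just keeps a "\r".
windows_text = "first line\r\nsecond line\r\n"
unix_text = "first line\nsecond line\n"

print(windows_text.split("\n"))  # ['first line\r', 'second line\r', '']

# Python's splitlines() normalizes both conventions to the same result:
print(windows_text.splitlines() == unix_text.splitlines())  # True
```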

In order to do what you’re requesting, the diff++ utility (or whatever the replacement for diff is called) would need:

  • A completely foolproof way of detecting the file format of the text within the file
  • A parser for every possible file format, OR a parser for the most popular formats plus an extension mechanism that lets people define their own
  • A meta-language that allows people to signify implicit rules about the text (such as ordinality) that aren’t explicit in the file format
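To make those requirements concrete, here is a toy sketch of what a structure-aware merge might look like, assuming an invented "key = value" data-file format. It parses each file into discrete entries and runs a three-way merge per entry rather than per line, so edits to different entries never conflict:

```python
def parse(text):
    """Parse a toy 'key = value' data file into a dict of entries."""
    entries = {}
    for line in text.splitlines():
        if "=" in line:
            key, _, value = line.partition("=")
            entries[key.strip()] = value.strip()
    return entries

def merge(base, ours, theirs):
    """Three-way merge per entry: a change on one side wins; changes
    on both sides to different values are a per-entry conflict."""
    merged, conflicts = {}, []
    for key in {**base, **ours, **theirs}:
        b, o, t = base.get(key), ours.get(key), theirs.get(key)
        if o == t:            # both sides agree (or both deleted it)
            if o is not None:
                merged[key] = o
        elif o == b:          # only their side changed it
            if t is not None:
                merged[key] = t
        elif t == b:          # only our side changed it
            if o is not None:
                merged[key] = o
        else:                 # both sides changed it differently
            conflicts.append(key)
    return merged, conflicts

# Each side edits a *different* entry, so the merge succeeds even
# where a line-based merge of the raw text might conflict:
base   = "alpha = 1\nbeta = 2"
ours   = "alpha = 10\nbeta = 2"
theirs = "alpha = 1\nbeta = 20"
merged, conflicts = merge(parse(base), parse(ours), parse(theirs))
print(merged, conflicts)
```

Note that even this toy works only because the format, the entry boundaries, and the rule "entries are independent" were all baked in by hand, which is exactly the knowledge a general-purpose tool doesn't have.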

The reason is that you're asking the diff++ utility to understand the semantics of the contents of the file. For example, are these two JSON files different or the same?

  {
    "id": 5,
    "message": "Hello, world!"
  }

  {
    "message": "Hello, world!",
    "id": "5"
  }

Syntactically, they’re different because:

  • The ID changed from an integer to a string
  • The order of the keys changed

But semantically? Who knows! They could be very different if the order of keys is significant and there are restrictions on the data type of the ID key. Or they could be identical if neither of those matter.
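Even handing the problem to a real JSON parser only gets you halfway, as this small Python sketch of the two objects above shows: parsing erases the key-order difference but preserves the type difference, and the parser has no way of knowing whether either one matters to your program.

```python
import json

a = json.loads('{"id": 5, "message": "Hello, world!"}')
b = json.loads('{"message": "Hello, world!", "id": "5"}')

# Parsing into dicts erases key order, so that difference vanishes...
print(a == {"message": "Hello, world!", "id": 5})  # True

# ...but the int-vs-string change survives, and the parser can't tell
# you whether it's significant:
print(a == b)  # False, because 5 != "5"
```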

So what’s the solution? Well, you could attempt to redesign the content of the files in a way that makes the semantics explicit and line-oriented. But this would also probably make editing them or parsing them with other tools unnecessarily complicated.
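One well-known version of that redesign is the one-record-per-line encoding often called JSON Lines, sketched here with Python's json module: each element becomes exactly one line, so a line-based diff/merge treats elements as discrete units.

```python
import json

records = [
    {"id": 1, "message": "first"},
    {"id": 2, "message": "second"},
]

# One record per line: each element is a single line, so line-based
# tools now see elements as discrete units. sort_keys makes the
# encoding deterministic, which also helps diffs.
jsonl = "\n".join(json.dumps(r, sort_keys=True) for r in records)
print(jsonl)

# The encoding round-trips back to the original data:
decoded = [json.loads(line) for line in jsonl.splitlines()]
print(decoded == records)  # True
```

The trade-off is exactly the one mentioned above: one-line records are harder to read and edit by hand, and any multi-line element (nested structures, long strings) fights the format.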

Honestly, the best solution that we, as an industry, have been able to come up with for this problem in the past 50+ years is exactly what git delivers: it goes as far as it can to figure things out and knows when to give up and make a human take over so it doesn’t mess anything up.

Then again, git and diff were optimized for source code and not textual data files. So perhaps if the project used some other tool for managing data such as an ETL system, that might work out better?

Let me know if you have more questions.


Thank-you for the very detailed response! I hope a feature of this kind is considered in the future, but I see it's much more complex than I thought.