For example, it seems like most SCM people think that merging is about
getting the end result of two conflicting patches right.
In my opinion, that's the _least_ important part of a merge. Maybe the
kernel is very unusual in this, but basically true _conflicts_ are not
only rare, but they tend to be things you want a human to look at
regardless.
The important part of a merge is not how it handles conflicts (which need
to be verified by a human anyway if they are at all interesting), but that
it should meld the history together right so that you have a new solid
base for future merges.
In other words, the important part is the _trivial_ part: the naming of
the parents, and keeping track of their relationship. Not the clashes.
For example, CVS gets this part totally wrong. Sure, it can merge the
contents, but it totally ignores the important part, so once you've done a
merge, you're pretty much up shit creek wrt any subsequent merges in any
other direction. All the other CVS problems pale in comparison. Renames?
Just a detail.
And it looks like 99% of SCM people seem to think that the solution to
that is to be more clever about content merges. Which misses the point
entirely.
Don't get me wrong: content merges are nice, but they are _gravy_. They
are not important. You can do them manually if you have to. What's
important is that once you _have_ done them (manually or automatically),
the system had better be able to go on, knowing that they've been done.
I see that git has been updated to 'pass' the indent-block test, in the sense that it now merges without conflict, but the resulting indentation is not correct.
I have merge.tool set to bc3 (Beyond Compare 3), so I tried running 'git mergetool' for each of the failed cases. In the adjacent case, bc3 merged things correctly, and all I had to do was accept its merge. In the indent-block case, I just had to fix (some of) the spaces before accepting the merge. The only case where I had to do some real work was in dual-renames, but even then it was fairly trivial.
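For reference, the configuration involved (assuming Beyond Compare 3 is installed and on the PATH) is roughly:

    git config --global merge.tool bc3
    git mergetool    # steps through each conflicted file in bc3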
So, I agree with you. To me, it doesn't matter that git (or any other tool) sometimes gets content merging wrong. It _is_ gravy, and can be handled by other tools (bc3 in my case). What external tools can't do is manage your history.
I agree with Linus that creating a solid merge base is far more important than clever merging. But I think there is still room for improving on Git in this respect. A lot of what I'm about to say is inspired by this video from the Camp guys: http://projects.haskell.org/camp/unique
Git forces you to treat the history of a branch as a single linear sequence of commits. This is an unnecessary restriction when some of the changes in that sequence are totally independent of each other. For example, if two changes touch two completely different files and are unrelated, why should you be forced to sequence them in a particular order?
Here is a practical situation illustrating this limitation. Once in a while I'll want to patch a coworker's in-progress change into my working directory where I also have changes. Perhaps I want to build a binary with several experimental in-progress changes in it. Suppose my coworker's changes are totally independent of mine (say they touch completely different files).
I can do this in Git by applying his patch to my working directory (or doing a "git merge" with his branch). But now suppose I'm done with the experiment and want to back out my coworker's change, so my working directory is left with only my change. If I haven't made any more tweaks to my change in the meantime, I'm OK: I can just "git reset --hard HEAD^" to discard my coworker's change. But what if I made further changes to my change in the meantime? There's no easy way with Git to manipulate the two changes independently within the same branch, even though there are no actual dependencies between them.
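A minimal sketch of that workflow (the branch name 'coworker' is a placeholder):

    git merge coworker       # pull his in-progress change into my tree
    # ... build and run the experimental binary ...
    git reset --hard HEAD^   # back it out -- only safe if I committed nothing since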
Sure, you could create a separate branch for the merged thing. Every time you want to change your part, you switch back to your branch, make the change, then switch back to the merged branch and merge again. But who wants to be that disciplined? Who should have to be, when the computer could do the work of knowing that the two lines of change are independent of each other?
Git's ability to create stable and verifiable SHA1s is important, and I think that any future SCM will need to have this capability. But I don't think this implies that you have to treat the history in a strictly linear way. You could create SHA1 checkpoints when a particular person wants to publish and/or sign a tree and its contents, but still allow the individual commits to be treated in a more flexible way. The SHA1 checkpoints could act like barriers: each change is either part of the checkpoint or not, and the checkpoints could know their parent checkpoint(s) so that there is still a verifiable history available for auditing.
I hope an approach like this could make large projects like Linux more intuitive to follow. I always found it unfortunate that the graph of commits for any project with lots of merge activity is totally indecipherable. For example, here is a screenshot of Git's own Git repository: http://i.imgur.com/RyQm3.png If independent changes could be viewed independently, and if every merge didn't have to be an explicit commit, perhaps this could be easier to follow.
There are definitely lots of unanswered questions here and I don't claim to have all the answers. My point is just that I don't think Git is necessarily the last word in distributed version control.
I don't see why you think git's approach to history is flawed. If changes to separate areas are kept separate, how can anyone coordinate on a single set of changes? What if one patch doesn't affect an area, but depends on it staying the same? Sorry, I can't imagine why you would want a commit to contain anything other than the state of the entire repo.
That doesn't mean there aren't easy solutions to your problems, though.
For your first situation (test-merging in coworker commits), why are you doing it on your own dev branch? Say you're working on branch 'mine': just `git checkout -b 'mine-exp'` before merging things. Then you can continue to develop on 'mine' and/or mess with 'mine-exp'. Can't manipulate separate changes within the same branch? Create more branches!
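Concretely (branch names here are just placeholders):

    git checkout -b mine-exp mine   # experimental branch starting from 'mine'
    git merge coworker              # bring his changes in over there
    git checkout mine               # your own line of work stays untouched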
Want to look at the history of only certain paths? Just specify them at the end of your `git log` command (you can get something similar to gitk's output with `git log --oneline --graph --decorate`). If that's not enough for you, there's a whole section on 'History Simplification' in the git-log manpage. It's possible in gitk as well, under View>(New/Edit) view. The penultimate text area lets you narrow the history to commits that affect the specified files and directories.
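For example, to see only the commits touching particular paths (the paths here are made up):

    git log --oneline --graph --decorate -- src/net/ Makefile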
Could all of this be documented better? Maybe, but in the first case, creating branches is the git philosophy (they're only 40 bytes each). And limiting the log output is something that svn had, so I figured git did as well.
> I don't see why you think git's approach to history is flawed.
I didn't use the word "flawed" nor do I believe it is flawed. If anything I would call it "incomplete," since what I am describing is essentially a superset of Git's existing model. I'm dreaming about whether the future could be better than what we have today.
> Say you're working on branch 'mine', just `git checkout -b 'mine-exp'` before merging things.
I explored this option in my comment. The problem is that it requires discipline that is not fundamentally necessary. It imposes branching/merging busywork that I believe could be avoided.
Suppose I'm right and some of this branching/merging busywork could be avoided. Wouldn't we be in a better place than we are today? Isn't it worth exploring this possibility?
Well, if you're asking the user to keep track of anything more than the SHA1 of a commit, that's asking a lot on top of all the commands git already expects you to know.
> it requires discipline that is not fundamentally necessary
If you don't proactively make branches, then you're like me. In that case, I make liberal use of git add -p, git stash, and git reset --hard (the latter only when everything is stashed or committed, to move branch pointers around). And then I always make a mental note that next time I'm about to make risky changes, I'll make the branch first (I don't get to use git often enough for my habits to change, though). In my case it's usually because I start working on one thing and then, mid-course, decide to work on something else. Because I came from svn, I also forget how cheap commits are, and that it's possible to commit non-working code. That's definitely more desirable than creating commits that do too much.
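The commands for that mid-course correction, for what it's worth ('new-idea' is a placeholder):

    git stash                 # park the half-done work
    git checkout -b new-idea  # give the tangent its own branch
    git stash pop             # carry the parked changes over, if they belong here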
So, I disagree with both parts of your claim. It doesn't require significantly more discipline, provided you don't make the mistake of committing too much (e.g. always be sure to start working on an idea from a clean checkout). And the discipline required for the smoothest workflow _is_ fundamentally necessary regardless of what VCS you use, because no VCS can know what idea you're working on unless you tell it.
If you're often committing to the wrong place, I suggest adding the current branch name to your shell prompt, or getting familiar with `git cherry-pick`. If you can't keep track of what's been merged with what, gitk and various git-log options are your friend.
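One way to get the branch name into a bash prompt, assuming git's bundled git-prompt.sh is installed (its location varies by distribution):

    # in ~/.bashrc
    source /usr/share/git/contrib/completion/git-prompt.sh
    PS1='\w$(__git_ps1 " (%s)")\$ '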
Do you at least agree that git has the history pruning options you were asking for?
> Suppose I'm right and some of this branching/merging busywork could be avoided. Wouldn't we be in a better place than we are today? Isn't it worth exploring this possibility?
First of all, my proposed solutions can be followed _now_, regardless of the pursuit of yours.
Second, what you're describing sounds to me exactly like submodules, which require their own discipline and have their own set of problems (hence the recent inclusion of the git-subtree project). And if you think that maybe git can automatically decide what files go in each submodule, then good luck, because I think you're at a point where nothing can convince you otherwise.
Edit: I take back all I said about your idea being unconditionally too complicated. It seems like it's already being done by the darcs/camp projects. I'll have to check those out eventually.
> First of all, my proposed solutions can be followed _now_, regardless of the pursuit of yours.
That may be, but if Linus had thought this way there would be no Git. We'd still be trying to shoehorn CVS into doing what we want. Personally I think it's extremely gratifying to help make the future happen.
> Second, what you're describing sounds to me exactly like submodules
What? Not at all. Take five minutes and watch the video I linked to.
> That may be, but if Linus had thought this way there would be no Git.
No, because CVS did not track the necessary information. Git does.
> What? Not at all. Take five minutes and watch the video I linked to.
An oversight on my part. I didn't even see the video link the first time around, but when I saw it in another HN post, I added the following to my post:
>> Edit: I take back all I said about your idea being unconditionally too complicated. It seems like it's already being done by the darcs/camp projects. I'll have to check those out eventually.
So, it still seems like it's more complicated than git for the average case (everyone syncing to a common state), but it's obviously not too much information for any one person to keep track of in all cases, since these projects exist.
> You can use "git rebase -i" to remove the unwanted commit from the history.
True, but now you're forcing the user to make a bunch of yes/no decisions about what should be kept and what should be discarded. If you make the wrong decision at any point you can lose your work! (Git does keep the pre-rebase tip reachable via ORIG_HEAD and the reflog rather than as tags, so at least clutter doesn't accumulate, but you still have to know to look there.) And the VCS still hasn't helped you determine what lines of development are independent of each other.
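For reference, a sketch of removing a commit this way and undoing it if you get it wrong:

    git rebase -i HEAD~3        # delete the "pick" line for the unwanted commit
    # if you regret it, the pre-rebase tip is still reachable:
    git reset --hard ORIG_HEAD  # or dig it out of `git reflog`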
The workflow I am envisioning lets you visually see your working directory as a bunch of independent lines of development. Before you do anything, you can see that your coworker's change is independent of your own changes (because the VCS has analyzed the changes and knows that this is so), which gives you confidence that you can easily remove his change without affecting any of your changes. Then you have the option to simply remove his change, and it's gone.
This. In fact, we have a policy of never using merge, but always using rebase (to the point where we think 'git pull --rebase' should be the default behaviour of 'git pull').
+1, although we use "git fetch origin && git rebase -p origin/branchname" to avoid the nasty behaviour of 'git pull --rebase' where it rewrites all the commits of a merged branch onto the current branch instead of just redoing the merge commit. Looking back at an 18-month-old part of the history and seeing "feature branch x was merged here" is far more helpful than finding a bunch of duplicate commits, IMHO.
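As an aside, newer versions of git can make this the default, saving the two-command dance:

    git config --global pull.rebase merges   # `git pull` rebases, recreating merge commits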
Yes, using rebase to avoid spurious merge commits is also a useful practice, but I was referring to "git rebase -i" (interactive), which can be used to remove, split, squash, or edit commits, etc.
Sometimes you just don't have time to do clean commits (by committing only relevant changes with "git add -p"), or you fix something later which was logically part of a previous commit.
Being able to curate the commit history of your local clone, besides pure aesthetics, allows you to present a given contribution to other team members so that they can better understand what you wanted to do and perhaps review the commits.
Some of the more complicated merges, e.g. "adjacent lines" scare me. The comment says "They clearly don't conflict since they don't modify the same lines." And while that seems obvious for humans reading the given test case, it seems easy enough to construct a situation where that is not the case due to, for instance, a function call spread out over multiple lines.
Sadly, these merges require a fair amount of language-specific knowledge. That's not to say we can never expect merge tools to do this, but one has to be realistic.
More than language-specific knowledge is necessary. Let's say we have two branches to merge:
branch A: rename foo() to bar() and adapt calls
branch B: add baz(), which calls foo()
We can assume that there is no merge conflict here, since A touches various lines within major code blocks and B adds some lines between two code blocks.
Now show me one merge tool that understands that the call to foo() within baz() must also be renamed to bar(). Most tools will probably just merge and produce a broken build.
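A small script demonstrating the silent breakage (file and function names are made up; the default branch is assumed to be 'master', and sed's -i flag is GNU-style):

    git init demo && cd demo
    printf 'int foo(void) { return 1; }\n' > lib.c
    git add lib.c && git commit -m 'base'
    git checkout -b A && sed -i 's/foo/bar/g' lib.c && git commit -am 'rename foo to bar'
    git checkout -b B master
    printf 'int baz(void) { return foo(); }\n' >> lib.c && git commit -am 'add baz, which calls foo'
    git checkout A && git merge B   # merges cleanly; baz() still calls the now-gone foo()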
With a structured code representation (i.e. ASTs rather than flat text) this falls out naturally. The calls to foo identify foo via foo's GUID, not via the string "foo". So if you rename foo to bar you don't need to rename any call sites, since the GUID is still the same. Similarly, merging the branch B just works: the call to foo there is still pointing to the right foo (now called bar) via its GUID. When displaying a call to a function, the IDE looks up the name associated with the GUID.
Imagine you have a modularized compiler that can round-trip between raw text-based source and parse trees, as well as final binaries with associated metadata attached. In that case it's not too far-fetched to imagine version control systems that merge at the level of parse trees, which would allow them to detect the conflicts you describe.
... unless the reason you renamed 'foo' was so you could introduce another function called 'foo' which does foo properly/differently.
For a realistic example, suppose you decided that 'foo' should acquire a lock. So you rename all existing 'foo' to 'foo_nolock', and add a new wrapper 'foo' which takes the lock and calls 'foo_nolock'.
If your other branch called the original 'foo', it should probably now be calling 'foo_nolock', but after the merge it'll be calling the locking wrapper instead, and your compiler (or even your tests) may not be able to find that error.
This is why the round trip between source-code and parse tree is so great. Say branch A adds a call to foo(), and branch B swaps out foo() for foo_nolock(). You can tell from the round trip on branch A that there was a new reference to foo(). Then in branch B you can tell that the implementation of foo() has changed.
I'm not sure how you would represent such a conflict. A valid way to resolve it would be to tell the DVCS, "You dummy, this isn't a conflict, the author of branch B obviously wanted to change foo() for every call-site, even those he didn't know about." The normal diff-file syntax of "this branch added these lines, that branch removed those lines" wouldn't work.
However I don't think semantic parsing helps here. For example, suppose I'd told you (the feature branch developer) that I was going to change 'foo' so that it had locking semantics, and you had deliberately used 'foo' because of this. Now when we merge you definitely don't want your 'foo' to be changed to 'foo_nolock'. Alternately you can think of a case where I don't change all 'foo' to 'foo_nolock', so the VCS has no idea what the "rule" is.
I don't think it is appropriate to change a reference, like you say. If two people modify the same code, then you should signify a conflict and have someone resolve the issue by hand. There's no machine on earth that can tell whether you meant to call foo() or foo_nolock(). The point is to prevent a false positive (the worst thing by far when merging). If you modify the foo() function and I add a new reference to it, current line-based merge strategies will silently resolve that because our edits appear to be far apart, even though they are semantically conflicting. With some semantic analysis you can determine that manual resolution is much better. The point is to throw a conflict, not change a reference silently.
The point of better merge tools shouldn't be to automate merging with 100% correctness, that's an impossible task. Instead, the point should be to have a high level of accuracy in doing safe merges and in alerting a human being to an unsafe merge that requires resolution.
It's true that most of the time, merging files by "shuffling the decks together" works pretty well. Many small changes, maybe even most, are independent of each other.
But sometimes they're not, and there is no way on earth the merge tool can distinguish. E.g. I fix a bug by incrementing a counter in the caller; you fix it by incrementing in the subroutine. Merge: now we have a new bug, because the counter is incremented twice.
Another simple, irreconcilable issue is files that contain lists of things. I want to add two more items to the list; you add three. Merging, we have a 'conflict': we've both changed the end of the file. The merge here would have been trivial: add all the lines. But the merge tool cannot intuit from a text file which kind of change it was: a list edit or an algorithmic change.
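Interestingly, git does let you declare that intent per file, which rather proves the point that a human has to supply it: a .gitattributes entry can select the built-in 'union' merge driver, which keeps the added lines from both sides:

    # .gitattributes (the filename is just an example)
    ChangeLog merge=union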
A database instead of a text file can help, as long as the schema allows sufficiently complex description to help the merge tool, and that would require considerable foresight on the part of the database designer.
What I'd like to see at some point is language-aware merge tools that can both correctly merge stuff like that, and flag conflicting edits even if they don't touch the same source lines.
I thought the three-way merge tool was independent of the source control system. I'm pretty sure that's true for the four systems I've used: hg, tfs, perforce, and the horrible, horrible SLM. I can say, however, that SLM's default three-way merge just seemed to always do the right thing.
Tools for managing real conflicts seem more interesting. Most conflict-resolution tools seem to 'help' in a way that leaves me completely baffled. They automate the creation of unintentional edits rather than helping you understand the history of the changes that led to the conflict, or offering tracking and reversibility of what you're doing during the merge.
I've resorted to temporarily overlaying another source control system just to track my progress while resolving large, complicated merge conflicts.
Inasmuch as a merge is a 3-way comparison between a common parent and two branches, it is basically DVCS-agnostic. The really interesting thing here is that Darcs doesn't just do three-way merges: it actually tracks every change along the way. From what I understand, Darcs conceptually resolves conflicts as if you rewound one branch and played it on top of the other, and vice versa simultaneously. A conflict is considered resolved if these operations are commutative, that is, the order of commits doesn't affect the result. Manual intervention is required when this fails to be true. This is inherently more powerful than a three-way merge, because you have the entire history of each divergent branch to help you understand changes, instead of only the net effect of each.
See http://www.guiffy.com/SureMergeWP.html for another merge test suite with some background material. A year ago I tried a few of them in various diff tools; none passed all of the tests, including Guiffy, even though they claim to in the article. Some of the tests can also be considered subjective or non-resolvable, but it's still an eye-opener to see how poor the merge tools really are.
Btw, I thought merge-conflict handling was a feature of the diff tool, not the SCM?
Try Beyond Compare (I'm not affiliated, just a longtime customer). I just went through each test case and had no problems. Resolving one of the standard conflicts involved using 'align with <F7>' to separate the changes at the end of the file. Most of the pathological cases were solved automatically and without even conflict markers, and when they weren't, selecting a hunk, right clicking, and choosing 'take left then right' worked.
When a diff tool is included with the scm, it's hard for the average user to separate the functionality even if it's possible.
And odds are you'll probably be able to use different diff/merge tools to handle various file formats (plain text, XML, binary, etc.), so you won't have to rely on just one getting everything right.
2009: Git: Bram Cohen vs Linus Torvalds http://news.ycombinator.com/item?id=505876
which refers to
2007: A look back: Bram Cohen vs Linus Torvalds http://www.wincent.com/a/about/wincent/weblog/archives/2007/...
which refers to
2005: Re: Merge with git-pasky II. http://www.gelato.unsw.edu.au/archives/git/0504/2153.html
Where Linus says:
> For example, it seems like most SCM people think that merging is about getting the end result of two conflicting patches right.
> In my opinion, that's the _least_ important part of a merge. Maybe the kernel is very unusual in this, but basically true _conflicts_ are not only rare, but they tend to be things you want a human to look at regardless.
> The important part of a merge is not how it handles conflicts (which need to be verified by a human anyway if they are at all interesting), but that it should meld the history together right so that you have a new solid base for future merges.
> In other words, the important part is the _trivial_ part: the naming of the parents, and keeping track of their relationship. Not the clashes.
> For example, CVS gets this part totally wrong. Sure, it can merge the contents, but it totally ignores the important part, so once you've done a merge, you're pretty much up shit creek wrt any subsequent merges in any other direction. All the other CVS problems pale in comparison. Renames? Just a detail.
> And it looks like 99% of SCM people seem to think that the solution to that is to be more clever about content merges. Which misses the point entirely.
> Don't get me wrong: content merges are nice, but they are _gravy_. They are not important. You can do them manually if you have to. What's important is that once you _have_ done them (manually or automatically), the system had better be able to go on, knowing that they've been done.