On 1.: As others already noted, Git has a GC-mechanism, which means that objects...

btschaegg · on Oct 3, 2020

Something I just thought of:

If you've GC'd your repo and you're sure there are no references to old commits lying around (also keep in mind remote branches and so on!), this might help you discover large objects that are still in your "new" history:

If run in bash, this should print the 20 largest objects in your repo and their size (in bytes):

  git rev-list --all \
  | xargs -n1 git ls-tree -r \
  | awk '$2 == "blob" { print $3 }' \
  | sort -u \
  | while IFS= read -r blob; do
      # cat-file -s prints the object size without streaming its content
      echo "$blob $(git cat-file -s "$blob")"
  done \
  | sort -rnk2,2 \
  | head -20

If you find some blob that is too large, you could then search for its name like this:

  large_blob=<blob_id>

  git rev-list --all \
  | xargs -n1 git ls-tree -r \
  | grep -F "$large_blob"
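
The two steps above can probably be folded into one faster pass. This is only a sketch: `largest_blobs` is a made-up helper name, and it assumes a Git version whose `git cat-file --batch-check` accepts a custom format with `%(rest)`. It prints the blob ID, the size in bytes, and one path under which the blob was seen:

```shell
# Hypothetical helper (not a git command): list the N largest blobs
# reachable from any ref, as "<blob_id> <size_in_bytes> <path>".
largest_blobs() {
    local count="${1:-20}"
    # rev-list --objects prints "<id> <path>" for every reachable object;
    # cat-file echoes the path back via %(rest) next to type and size.
    git rev-list --objects --all \
    | git cat-file --batch-check='%(objecttype) %(objectname) %(objectsize) %(rest)' \
    | awk '$1 == "blob" { sub(/^blob /, ""); print }' \
    | sort -k2,2nr \
    | head -n "$count"
}
```

Calling `largest_blobs 20` inside the repo should give roughly the same top 20 as the loop above, without spawning one `git` process per commit and per blob. Each blob is listed once, so only one of its paths is shown.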

notafraudster · on Oct 3, 2020

RE #2:

(First, thanks for everything!)

Our regular routine at the time: Imagine production and dev are even. We add 5 regular commits to dev. We PR merge dev -> production. We do not do squash and merge. We do not delete dev. Dev and master are now even but master has the merge commit as well.

Fast forward several years. Looking at the history, we now see each of those 5 commits, for a run of about a year, appear multiple times (at the peak, 5 times): the exact same commit, same commit message, etc. This is supposed to be impossible with git and we don't know what caused it, but we assume it was user error, either from trying to resolve #1 or from people merging without knowing how to merge. Whatever caused it clearly occurred a few times, because for a while the commits appear 5 times, then 4, then 3, then 2, then back to normal, so I suspect it happened 4 times, duplicating a range of commits each time.

We'd like to go back and de-duplicate, keeping the whole history that led us to now but only having 1 of every commit. I don't fully care about branch history. I just care about blame and about commit counts for all the people who worked on the project. I am fine with the commit counts falling with the duplicate commits removed, but just not falling to 0 like it would if we started from scratch.

We are not worried about local copies (it's a collaborators-only repo that there'd be no reason for uninvolved people to fork) and right now we have no collaborators since the project is grant funded and we don't have an active grant.

That's half the reason I'm trying to clean up these nuisance issues now before other people are involved again.

btschaegg · on Oct 3, 2020

No problem. Glad to be of any help :)

I'm sadly still not all that sure I understand you correctly (and I'm afraid I can't tell you exactly what caused your issue), but:

If it is the commits that "do the work" that are duplicated (not the merges), I'd guess someone (or some script/tool) has already rewritten history a couple of times and not everybody was aware of that.

If your team didn't look much at the history structure and try to actively shape it in a certain way, this could just have happened because someone did an innocent `git pull` after the history of the branch had been altered on the server.

`git pull` by default equates to `git fetch` and `git merge`, so if your history was altered, the branch would contain copies of the original commits (with totally new IDs), and git would "knit" the two copies together in a new merge. That means that this probably has happened around five times (since you see five copies of the oldest commits).

In this case, the hashes of the duplicated commits should be different. If that's not the case, I'd guess you have a client that visualizes the history in a weird way and the problem is something else.
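
One way to check this might be to group commits by author, author date, and subject, and list every group that has more than one hash. This is a rough sketch: `duplicate_commits` is a made-up helper name, and it assumes the rewrites kept the author date intact (which `git rebase` and `git cherry-pick` do by default):

```shell
# Hypothetical helper: print groups of non-merge commits that share
# author email, author date, and subject but have different hashes.
duplicate_commits() {
    git log --all --no-merges --format='%H%x09%ae %at %s' \
    | awk -F'\t' '
        {
            # Everything after the first tab is the grouping key.
            key = substr($0, index($0, "\t") + 1)
            n[key]++
            ids[key] = ids[key] " " $1
        }
        END {
            for (k in n)
                if (n[k] > 1)
                    print n[k] "x " k ":" ids[k]
        }
    '
}
```

Each output line starts with the number of copies, followed by the shared metadata and the differing commit IDs. If this prints nothing, the duplicates are more likely a display quirk of the client.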

Cleaning this up might then be more cumbersome, since the points you'd need to "adjust" weren't committed "back to back". If you're willing to do the work by hand, you could just string together intact pieces of the history into a clean one by using `git rebase` with `-r` and `--onto` (or `git cherry-pick`, but I've never used that one for complex topologies, so I don't know how helpful it is there)...
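
A minimal sketch of one such "stringing together" step, assuming Git 2.18 or newer (for `-r`, i.e. `--rebase-merges`). `transplant_range` and the `backup-` prefix are made-up names for illustration:

```shell
# Hypothetical helper: replay the commits in <old_base>..<branch>
# on top of <new_base>, keeping a backup ref of the old branch tip.
# Usage: transplant_range <new_base> <old_base> <branch>
transplant_range() {
    local new_base="$1" old_base="$2" branch="$3"
    # Keep the original history reachable in case the rebase goes wrong.
    git branch "backup-$branch" "$branch" || return 1
    # -r tries to recreate merge commits inside the range
    # instead of flattening them.
    git rebase -r --onto "$new_base" "$old_base" "$branch"
}
```

Here `<old_base>` would be the last commit of a duplicated stretch and `<new_base>` the tip of the clean history built so far. Conflicts still have to be resolved by hand, and the backup branch can be deleted once the result looks right.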

btschaegg · on Oct 3, 2020

I just did some experiments, and I think this should help identify such merge commits from the commit IDs of two copies:

Run this in bash (if you're using Windows, use "Git Bash"). Make sure you replace the `<hash>` tokens with the appropriate commit IDs. The hash it prints should be the merge commit tying the two copies together.

  commit_1=<hash>
  commit_2=<hash>
  
  # NR == FNR is only true while awk reads the first input (portable; ARGIND is gawk-only)
  awk '
          NR == FNR { h[$1] = 1; next }
          h[$1] { last = $1 }
          END { print last }
      ' \
      <(git log --ancestry-path --format=%H "$commit_1"..HEAD) \
      <(git log --ancestry-path --format=%H "$commit_2"..HEAD)