Responding in hopes some Git non-novices are here and can give some quick advice.
I have a fairly large Git repo with 5 years of commits from numerous team members including a bunch of non-technical people who had never used Git before.
There were two major issues:
1. We started off storing binary files -- mostly images, but also a ton of raw data files that got versioned every day or two -- in this repo and it spiralled out of control size-wise. I ended up using "BFG" to nuke the binary files and I think that worked, but it still feels like the repo is way too large in terms of file size. Is it possible that orphaned old versions of files are floating around somewhere in .git/? What are the best practices here?
2. The branching strategy was badly wrong. We used two branches: production and dev. New commits are made to dev, dev is merged into production periodically via GitHub PRs. Dev is NOT deleted and we did not use a squash strategy on merges. Somehow this resulted in the repo having 2, 3, 4, or 5 copies of the same commit immediately in a row. I think that somehow a branch got merged into itself? I don't know how that would be possible, but I can tell you the symptom is that there's a period of time several years ago where every commit appears in quintuplicate, and this slowly decreases until every commit is just doubled, and then at some point we're back to a correct commit history. What's the likely cause here and what's the likely solution?
We would like to fix both these things while preserving history (obviously the problem could be immediately "solved" with rm -rf .git && git init, but we'd like to avoid that at least partially so that no one who worked on the project has their historical commit record broken and so we can still use blame to know who most recently touched some of the older stuff)
My own git-fu is not great, so, thanks for posting these exercises.
Regarding problem 2: It's hard to know what exactly you're seeing, but be aware that the output of "git log" is only pretending to be linear; it's really a tree/graph. ("git log --graph" can help, but the output can also be annoying to parse.) So when you see multiple copies of a commit "in a row", chances are that they are really present on different branches which were later merged. If they have the same commit ID, I would probably call this just a display quirk -- there's really only one "copy" in truth, but for some reason the history is confusing "git log". But if they have different commit IDs, that means (as far as I know) that someone has done a "git rebase" or a "git cherry-pick", which you didn't mention in your description.
As others already noted, Git has a GC-mechanism, which means that objects can still linger around in any copy of your repo for a while. And if you need to version binary files, you'd better use git-lfs or git-annex. Obviously, if you don't need them, just nuking them outright with BFG or `git filter-branch` is fine, too.
If you'd like to try git-lfs: it also includes tooling to retroactively migrate your repository [1], but that will rewrite your history (although BFG obviously does that, too).
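A hypothetical migration sketch (the file patterns are examples, not anything from your repo):

```
# Rewrites every ref so that matching files are stored as LFS pointers.
# Coordinate with anyone who still has clones, since all commit IDs change.
git lfs migrate import --include="*.png,*.jpg,*.csv" --everything
```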
If GCing your repo does not reduce its size, you'll probably have to hunt down any remnant branches and/or tags that might reference the old history and thus "keep it alive".
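A rough way to survey which refs (local branches, tags, remote-tracking branches) might still be pinning old objects:

```
# Lists all refs with their last commit date; anything stale and forgotten
# here can keep rewritten history reachable.
git for-each-ref --sort=-committerdate \
  --format='%(refname:short) %(committerdate:short)' \
  refs/heads refs/tags refs/remotes
```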
On 2.:
I'm not entirely certain I understand you correctly. Which commits were duplicated? The resulting merge commits?
Assuming that's the case, you could probably build a shell script to get rid of them if you wanted to (or use something like `git checkout prod; git rebase -ir <first_commit>` and remove the duplicated merges yourself).
But from a repo perspective, this shouldn't cause too much trouble (i.e. the additional space required will be negligible), and doing so would, again, mean that you'll rewrite history, potentially causing issues for others who still have local copies referencing your old commits.
Also, if you go the rebase route, make sure you understand the todo list Git will create for you. With `-r`, it preserves merges, but how they are represented is not very intuitive and you'll have to wrap your head around it first.
You could also try to achieve the same result with `git filter-branch` and `--commit-filter`. In this case, you'd probably want to write a script that only performs the `git commit-tree` command if the tree ID passed to the filter is not the same tree as the first parent commit was referencing already (this should weed out all commits that don't change anything).
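For the non-merge case, the git-filter-branch manpage actually ships a helper that does this, so you may not need to write the tree comparison yourself. A sketch (note that filter-branch is slow and nowadays largely superseded by git-filter-repo):

```
# git_commit_non_empty_tree only calls git commit-tree when the tree differs
# from the first parent's tree (for single-parent commits); otherwise it maps
# the commit to its parent, dropping commits that change nothing.
git filter-branch --commit-filter 'git_commit_non_empty_tree "$@"' -- --all
```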
Describing issues in commit histories as prose text is tricky :). If your problem looks different than I assumed, you could try to create a bogus Git repo that showcases the pattern in its history and put it on GitHub as a reference.
You can create "empty" commits for that via `git commit --allow-empty`.
If you've GCd your repo and you're sure there are no references to old commits lying around (also keep in mind remote branches and so on!), this might help you discover large objects that are still in your "new" history:
If run in bash, this should print the 20 largest objects in your repo and their size (in bytes):
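A sketch built from the usual `git rev-list`/`git cat-file` combination (paths containing spaces get truncated at the first space, which is fine for a quick check):

```
# Print size in bytes, object ID, and path for the 20 largest blobs
# reachable from any ref.
git rev-list --objects --all \
  | git cat-file --batch-check='%(objecttype) %(objectname) %(objectsize) %(rest)' \
  | awk '$1 == "blob" { print $3, $2, $4 }' \
  | sort -rn \
  | head -n 20
```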
Our regular routine at the time: Imagine production and dev are even. We add 5 regular commits to dev. We PR merge dev -> production. We do not do squash and merge. We do not delete dev. Dev and production are now even, except production also has the merge commit.
Flash forward several years. Looking at history, we now see each of those 5 commits for a run of about a year appear multiple times (at the peak, 5 times). The exact same commit, same commit message, etc. This is supposed to be impossible with git and we don't know what caused it, but assume it was user error, either from someone trying to resolve #1 or from people who didn't know how to merge. Whatever error caused it clearly occurred a few times, because for a while the commits appear 5 times, then 4, then 3, then 2, then back to normal, so I suspect that whatever the cause, it happened 4 times, duplicating a range of commits each time.
We'd like to go back and de-duplicate, keeping the whole history that led us to now but only having 1 of every commit. I don't fully care about branch history. I just care about blame and about commit counts for all the people who worked on the project. I am fine with the commit counts falling with the duplicate commits removed, but just not falling to 0 like it would if we started from scratch.
We are not worried about local copies (it's a collaborators-only repo, and there'd be no reason for uninvolved people to fork it), and right now we have no collaborators since the project is grant funded and we don't have an active grant.
That's half the reason I'm trying to clean up these nuisance issues now before other people are involved again.
I'm sadly still not all that sure I understand you correctly (and I'm afraid I can't tell you exactly what caused your issue), but:
If it is the commits that "do the work" that are duplicated (not the merges), I'd guess someone (or some script/tool) already rewrote history a couple of times and not everybody was aware of it.
If your team didn't look much at the history structure and try to actively shape it in a certain way, this could just have happened because someone did an innocent `git pull` after the history of the branch had been altered on the server.
`git pull` by default equates to `git fetch` and `git merge`, so if your history was altered, the branch would contain copies of the original commits (with totally new IDs), and git would "knit" the two copies together in a new merge. That means that this probably has happened around five times (since you see five copies of the oldest commits).
In this case, the hashes of the duplicated commits should be different. If that's not the case, I'd guess you have a client that visualizes the history in a weird way and the problem is something else.
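If an innocent pull really was the culprit, one guard against a repeat (a config sketch, not something specific to your setup):

```
# Make `git pull` refuse to create merge commits; it only fast-forwards and
# errors out otherwise, so rewritten history can't be silently re-merged.
git config --global pull.ff only
```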
Cleaning this up might then be more cumbersome, since the points you'd need to "adjust" weren't committed "back to back". If you're willing to do the work by hand, you could just string together intact pieces of the history into a clean one by using `git rebase` with `-r` and `--onto` (or `git cherry-pick`, but I've never used that one for complex topologies, so I don't know how helpful it is there)...
I just did some experiments, and I think this should help identify such merge commits from the commit IDs of two copies:
Run this in bash (if you're using Windows, use "Git Bash"). Make sure you replace the `<hash>` tokens with the appropriate commit IDs. The hash it prints should be the merge commit tying the two copies together.
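A sketch along those lines (replace `<hash1>`/`<hash2>` with the two copies; it assumes the affected branch is currently checked out):

```
# Walk the descendants of the first copy in topological order and print the
# first commit that also has the second copy as an ancestor -- normally the
# merge commit that knitted the two copies together.
for c in $(git rev-list --reverse --topo-order --ancestry-path "<hash1>..HEAD"); do
  if git merge-base --is-ancestor "<hash2>" "$c"; then
    echo "$c"
    break
  fi
done
```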
For 1), try using git-filter-repo (https://github.com/newren/git-filter-repo). This is the currently recommended alternative to previous tools like filter-branch, and it is much more user-friendly.
`git filter-repo --analyze` will generate a report of blobs stored in the repo at `.git/filter-repo/analysis/blob-shas-and-paths.txt`, and it's very easy to sort them by filesize and strip them out from there.
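A rough sketch of that workflow (the size threshold and path are just examples; run it on a fresh clone, since filter-repo by default refuses to rewrite anything that doesn't look like a fresh clone):

```
# Analyze first, then strip the offenders.
git filter-repo --analyze
less .git/filter-repo/analysis/blob-shas-and-paths.txt

# Drop every blob over 10 MB from all of history:
git filter-repo --strip-blobs-bigger-than 10M
# Or remove an entire directory of old data files:
git filter-repo --invert-paths --path raw_data/
```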
Regarding 2, that branching strategy sounds completely reasonable and simple. You wouldn't expect to delete the "dev" branch in that case, nor squash any commits, because "dev" is a shared branch that many people use. It sounds like a classic "test" or "pre-production" branch.
You would expect people to commit cleaned up commits ready to be merged to production into that branch. Any exploratory work would be done outside "dev" (either in developer local repositories or personal branches). It is also important that no commits are ever made to the production branch directly, otherwise your branches would become out of sync.
If you'd like the branches to stay identical over time you can stick to fast-forward merges. The downside is that you lose information about when merges to production were made, which may or may not be a problem for you.
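For example, a minimal fast-forward-only release routine might look like this (a sketch, assuming the branch names from the original post):

```
# Fast-forward production to dev; fails instead of creating a merge commit
# if production has diverged.
git checkout production
git merge --ff-only dev
git push origin production
```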
The problems with repeated commits are not caused by your merge strategy. They are caused by people doing strange things with Git. It might be hard to say exactly what happened after the fact, and without seeing the commit history we can only speculate. Perhaps many merges were made with unrelated branches, and what you are seeing is merge commits? Independently of how you proceed with this, you need to educate the users, otherwise they will keep polluting the commit history.
I didn't understand the part about removing the repository and took it to be a joke that went over my head. But obviously this can't be done without altering history. It is, after all, the history that is messed up. Fixing this means identifying which commits can be squashed or removed from history. You have to look at these commits to find out how. Then you just rebase all commits from the beginning onto a new history, and everyone involved needs to keep working on top of that. Just remember to give the old history a name with a branch or a tag before you get to work, so you have something to reset to should you mess up.
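For instance (the branch and tag names here are purely illustrative):

```
# Park the current state under a backup branch and tag before rewriting.
git branch backup/pre-cleanup production
git tag archive-pre-cleanup production
```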
> I didn't understand the part about removing the repository
He was just saying that it could all be "solved" by simply nuking the current repository and starting afresh from the current version of the code. They'd lose all history but they'd also have a clean slate and not have to deal with the errors of the past.
> obviously the problem could be immediately "solved" with rm -rf .git && git init, but...
No buts about it. This is the way to go. That commit history is mostly junk. Its value is extremely limited and it's hanging around only because of fear and nostalgia. Archive it somewhere as read-only so that the worry warts will hush up and can reference when they need to (which IRL will be once or never), but just get rid of it in the live repo. The headaches you'll save yourself by starting fresh will vastly outweigh the problems you're afraid will happen.
I’ve seen #2 happen once or twice but never did figure out exactly what the person who generated the duplicate commits had done. I suspect that they made some mistakes while rebasing a feature branch.
1. If the large objects are not referenced by any commit anymore, they should automatically get garbage collected eventually, but you can force this process with git gc (see the sketch after point 2): https://git-scm.com/docs/git-gc
2. Never seen that one. There's probably an arcane filter-branch command that could detect the duplicates and squash them, but personally I'd just leave it alone. How often do you look at years-old history anyway?
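Regarding point 1, a sketch for forcing the cleanup (do this on a clone you can afford to re-create, since expired reflogs are a safety net you lose):

```
# Drop reflog entries that keep unreferenced objects alive, then repack
# and prune aggressively.
git reflog expire --expire=now --all
git gc --prune=now --aggressive
```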
You should use git-lfs for these files. Some wizardry with `filter-branch` to move these files into LFS could probably debloat your repo once you do a fresh clone afterwards.
Yeah, we used git-lfs for a while but ultimately decided we didn't need to version the static files and just plopped them on S3 and added a step to the deploy to pull them in.
Basically, we initially thought we needed point-in-time versions of a bunch of data files, and later decided we didn't care about point-in-time.
Since I just saw this comment: Note that git-lfs stores your binaries in its own directory inside .git (I currently have no git-lfs-enabled repo at hand, but I think it was `.git/lfs/objects`).
That means that there might still be a lot of binaries inside your .git directory because LFS put them there. Those object files are not managed by Git itself, so forcing GCs through Git won't help in this regard, either.
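If that's the case, git-lfs has its own cleanup command; a sketch (the `--verify-remote` flag makes it only delete local copies the remote still has):

```
# Remove local LFS objects that are no longer referenced by recent commits.
git lfs prune --verify-remote
```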
Regarding 1, I’ve found this Bash script [1] to be helpful in the past. In short, it lists all Git objects sorted by size — a good gut check for whether BFG or git filter-branch worked as you intended.
Excited to give this a try and see if it will help me expand my day-to-day git-fu.
I like interactive learning resources like this. A similar one was posted to hn a while back for postgres and going through it taught me a lot. https://pgexercises.com/
This was great. The exercises towards the end definitely taught me some new tricks. `git reflog` and `git bisect` weren't part of my git lexicon, but they absolutely will be now.
> That is the first alias configured by the script described above. It initializes the first exercise that is on master branch. Read the instructions and solve it!
Any idea why that is? (It ultimately wasn't a problem once I switched to master, but without doing so there would be no configure.sh script either and I'm asking in case someone else runs into the same issue).
It looks like `.git/refs/remotes/origin/HEAD` doesn't exist, which might explain why a fresh clone results in a detached HEAD (i.e. Git should default to being on the branch referenced by this entry).
As for how you can end up in such a state on the remote, I'd actually be interested to know too.
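Locally, something like this should at least repair the missing pointer (assuming master really is the intended default branch):

```
# Recreate refs/remotes/origin/HEAD and get off the detached HEAD.
git remote set-head origin master
git checkout master
```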
Is it, though? If the instructions in your "How to start?" section don't work, that feels more like a bug. I went through all the exercises and didn't find a point where this was relevant.
Currently, the initial HEAD actually contains the solution for a future exercise. It was also created by a different user than the rest of the exercises.
Maybe the detached HEAD should contain a README with instructions like "You are in detached HEAD mode, reach the master branch to continue"
I liked that a few of the exercises had multiple solutions; it's often the case in real-world use that various approaches can work. The exercises themselves do cover many common situations such as having to edit a commit that's not the HEAD, or porting changes from a branch to another, or finding the source of a bug.
I had only ever used git-bisect once but had the same impression then as I did now with these exercises: that it was super powerful for this particular situation.
Is there a way to restart a challenge when you've made a mistake (e.g. added an extra commit)? `git start <challenge_name>` appears to keep the same state of the git branches associated with the challenge. I'm also noticing, when trying to roll back to a previous commit with `git reset HEAD --hard <commit_id>`, that I'm getting `fatal: Cannot do hard reset with paths.`