05 February 2014

Remove big files from Git repositories permanently

Everything is possible with Git, but I can't remember all the command line options. So here is my memory aid for making a repository slim again by removing unwanted files. I'm using git version 1.8.4 on Windows.
First you need to rewrite history:
git filter-branch --index-filter "git rm -r --cached --ignore-unmatch *.gem" --tag-name-filter cat -- --all
Note the -r and the use of wildcards inside the index-filter command. Combined with the other options, this finds and removes all *.gem files in all commits and tags. The command prints every file it has deleted. If it doesn't print anything useful, you have made an error!
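If you want to try the rewrite before touching a real repository, here is a minimal sketch that runs it against a throwaway repository. It assumes a POSIX shell (for example Git Bash on Windows); the scratch directory, the file names and the demo identity are made up for illustration. Inside the filter I quote '*.gem' so that git, not the shell, expands the wildcard.

```shell
#!/bin/sh
# Sketch: try the history rewrite on a throwaway repository.
# All names below (demo, big.gem, README.txt) are hypothetical.
set -e

demo=$(mktemp -d)
cd "$demo"
git init -q .
git config user.email "demo@example.com"
git config user.name "Demo"

# Commit 1: a normal file plus an unwanted *.gem file.
echo "hello" > README.txt
head -c 100000 /dev/urandom > big.gem
git add README.txt big.gem
git commit -q -m "add readme and gem"

# Commit 2: an unrelated change, so the gem is part of two commits.
echo "more" >> README.txt
git commit -q -am "update readme"

# Rewrite all refs. The environment variable only silences the warning
# (and delay) that newer git versions show for filter-branch; older
# versions ignore it.
FILTER_BRANCH_SQUELCH_WARNING=1 git filter-branch --index-filter \
  "git rm -r --cached --ignore-unmatch '*.gem'" \
  --tag-name-filter cat -- --all

# Should print nothing: no commit on the current branch touches *.gem.
git log --oneline -- '*.gem'
```

During the rewrite you should see a line like rm 'big.gem' for each commit that contained the file. That is the "useful" output to look for.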
Now delete the backup created by git filter-branch:
rd /q /s ".git/refs/original"
Some magic to get rid of orphaned objects inside the git repository:
git reflog expire --expire=now --all
git gc --prune=now
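To convince myself that the cleanup really drops the objects, a sketch like the following can help. It again assumes a POSIX shell and a throwaway repository with made-up names. Instead of rd, it deletes the backup refs with git update-ref, which should be equivalent for this purpose. It remembers the blob id of the unwanted file up front and checks afterwards whether git still knows it.

```shell
#!/bin/sh
# Sketch: check that the removed file's blob is gone after cleanup.
# All names below (demo, big.gem, README.txt) are hypothetical.
set -e

demo=$(mktemp -d)
cd "$demo"
git init -q .
git config user.email "demo@example.com"
git config user.name "Demo"

echo "hello" > README.txt
head -c 100000 /dev/urandom > big.gem
git add README.txt big.gem
git commit -q -m "add readme and gem"

# Remember the object id of the unwanted file before rewriting.
blob=$(git rev-parse HEAD:big.gem)

# Rewrite history (output discarded to keep the demo quiet).
FILTER_BRANCH_SQUELCH_WARNING=1 git filter-branch --index-filter \
  "git rm -r --cached --ignore-unmatch '*.gem'" \
  --tag-name-filter cat -- --all > /dev/null

echo "before cleanup:"
git count-objects -v

# Delete the backup refs that filter-branch left under refs/original.
git for-each-ref --format='%(refname)' refs/original |
while read -r ref; do
  git update-ref -d "$ref"
done

# Expire the reflogs and prune the now unreachable objects.
git reflog expire --expire=now --all
git gc -q --prune=now

echo "after cleanup:"
git count-objects -v

# cat-file -e exits non-zero when the object does not exist.
if git cat-file -e "$blob" 2>/dev/null; then
  status="still present"
else
  status="gone"
fi
echo "blob $blob is $status"
```

If the blob is still present, something still references it, typically a leftover ref, a reflog entry or a tag that was not rewritten.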
Verify that all files are really gone with git log -- *.gem and then repack your repository.
git gc --prune=now --aggressive
Finally, push your shrunken repository to the origin:
git push origin --force
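One caveat: which refs a plain git push origin --force updates depends on the push.default setting of your git version. Since the rewrite touched all branches and tags, it may be safer to push them explicitly. The sketch below shows this against a local bare repository standing in for the remote; it assumes a POSIX shell, and all paths and names are made up for illustration.

```shell
#!/bin/sh
# Sketch: force-push every branch and tag explicitly.
# A local bare repository plays the role of the remote (hypothetical paths).
set -e

work=$(mktemp -d)
remote="$work/origin.git"
git init -q --bare "$remote"

git init -q "$work/repo"
cd "$work/repo"
git config user.email "demo@example.com"
git config user.name "Demo"

echo "hello" > README.txt
git add README.txt
git commit -q -m "initial"
git tag v1
git branch feature
git remote add origin "$remote"

# --all pushes all branches, --tags pushes all tags.
# They are separate invocations because git does not accept both at once.
git push -q origin --force --all
git push -q origin --force --tags

# List what the remote now has.
git ls-remote origin
```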
The next time you clone the repository, you get the shrunken version.
Update: But as soon as you do a git pull (--rebase), all the unneeded and painfully removed objects are downloaded to your hard disk again. The only way to prevent this is to delete the repository on GitHub and replace it with the shrunken one (without changing names or URLs). Astonishingly, existing clones continued to work with the replaced repository.
Update2: GitHub itself now has a nice article explaining the process of cleaning/shrinking repositories, including a link to a tool called BFG Repo-Cleaner that is specialized for this task.