We finally moved Django to GitHub late yesterday. Here's a postmortem, to keep the community updated and for the benefit of any projects that take this leap in the future.
We've used Subversion to manage our code since originally open-sourcing in July 2005. Over the last few years, we started to feel Subversion's limitations, namely:
- The difficulty of branching. We used tools like svnmerge to keep track of which parts of branches had been updated from trunk, and some of us on the core team used Git/Mercurial on top of Subversion, but this was all unnecessarily complicated -- to the point of being stifling.
- Lack of decentralization. When I would hack on Django on an airplane, for example, I couldn't make a bunch of commits locally, then push all of those to the master repository; I'd have to put everything in a single commit. With Subversion, it's all or nothing -- you push everything to a centralized server as you do it (or you use a branch, but that's painful, as noted above).
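- Slowness. After you use Git for a while, Subversion feels sluggish. This is due to a bunch of design and implementation differences, the most noticeable being that Git keeps the full history locally, so operations like log, diff and commit don't wait on a network round trip.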
(Of course, it's 2012 now, and these are all obvious, well-documented points. To the people who responded to our GitHub news by saying "finally!" -- I totally agree.)
Aside from that, we had set up a GitHub mirror (now called django-old) a few years ago, and lots of people were getting code and forking it there anyway.
Why Git/GitHub, as opposed to Mercurial/Bitbucket or some other system? Because it's very well-made, and it's where the people are. Clearly GitHub has won the majority of open-source developers' mindshare. John Lennon said: "If I'd lived in Roman times, I'd have lived in Rome. Where else?" GitHub is Rome.
The authors file
The first thing we considered was to simply start using our existing GitHub mirror -- turn off the Subversion stuff and start committing there directly. But the problem there was that we'd never set up an authors file.
Basically, an authors file maps Subversion committer names to standard names and email addresses, so that GitHub knows that a commit by "adrian" in Subversion maps to the adrianholovaty GitHub account. With that mapping established, you get niceties like GitHub commits linking to appropriate GitHub user pages and displaying proper user avatar images. More importantly, it gives all of our contributors proper credit within the GitHub ecosystem for the full history of their work on Django -- which has value these days, considering companies are looking at GitHub involvement for job applicants, etc.
So the first step was creating that authors file, which Brian Rosner organized, with the help of several other people. We ended up accounting for every one of the 58 people who have ever committed to Django, except for somebody named "cell" who was given temporary commit access during a sprint six years ago.
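To make the format concrete: git-svn reads one `svn_username = Full Name <email>` entry per line. The following is only a minimal sketch, not the script we actually used, of how you might list committers that show up in `svn log -q` output but have no entry in the authors file yet. The function names, the sample log lines and the example.com address are all made up for illustration.

```shell
# Print each unique committer name found in `svn log -q` output read from stdin.
# Revision lines look like: r123 | username | 2005-07-13 12:00:00 -0500 (Wed, 13 Jul 2005)
list_svn_committers() {
    awk -F' [|] ' '/^r[0-9]+ [|] / { print $2 }' | sort -u
}

# Print committers from stdin that have no "username = ..." line in the authors file ($1).
missing_authors() {
    authors_file=$1
    list_svn_committers | while read -r name; do
        grep -q "^$name = " "$authors_file" || echo "$name"
    done
}

# Hypothetical authors file; the email address is a placeholder.
authors_example=$(mktemp)
cat > "$authors_example" <<'EOF'
adrian = Adrian Holovaty <adrian@example.com>
EOF

# Hypothetical `svn log -q` output with one mapped and one unmapped committer.
missing_authors "$authors_example" <<'EOF'
------------------------------------------------------------------------
r2 | cell | 2006-01-01 12:00:00 -0600 (Sun, 01 Jan 2006)
------------------------------------------------------------------------
r1 | adrian | 2005-07-13 12:00:00 -0500 (Wed, 13 Jul 2005)
------------------------------------------------------------------------
EOF
```

Against a real repository you would pipe `svn log -q <repository URL>` into `missing_authors authors.txt` instead of using the sample text.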
One crucial detail is that we couldn't simply change the commit data retroactively in the existing GitHub repository. That's because Git uses the committer data in creating hashes. Changing the commit data would change the hashes, which would break all existing forks of that repository. (We ended up breaking existing forks anyway, of course, but it was cleaner to do it from scratch.)
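If you want to see the hash dependency for yourself, here is a small sketch that builds two commits in a throwaway repository. They share the same tree, message, dates and committer, and differ only in the author field. The names, email addresses and date are placeholders, and the variable names are mine.

```shell
# Throwaway repository with a single staged file.
repo=$(mktemp -d)
git init -q "$repo"
echo "hello" > "$repo/file.txt"
git -C "$repo" add file.txt
tree=$(git -C "$repo" write-tree)

# Pin everything except the author, so the author is the only difference.
export GIT_COMMITTER_NAME="Example Committer" GIT_COMMITTER_EMAIL="committer@example.com"
export GIT_AUTHOR_DATE="2012-04-28 12:00:00 +0000" GIT_COMMITTER_DATE="2012-04-28 12:00:00 +0000"

# Same tree and message, once with a raw Subversion-style author...
hash_unmapped=$(GIT_AUTHOR_NAME="adrian" GIT_AUTHOR_EMAIL="adrian@localhost" \
    git -C "$repo" commit-tree -m "Initial commit" "$tree")
# ...and once with the author as an authors file would map it.
hash_mapped=$(GIT_AUTHOR_NAME="Adrian Holovaty" GIT_AUTHOR_EMAIL="adrian@example.com" \
    git -C "$repo" commit-tree -m "Initial commit" "$tree")

echo "unmapped author: $hash_unmapped"
echo "mapped author:   $hash_mapped"
```

The two hashes differ, and because every commit records its parent's hash, rewriting the authors on early commits changes the hash of everything after them.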
Nuts and bolts of the process
Once we finalized the authors file, doing the migration was actually kind of easy, thanks to git-svn. I took many missteps along the way, got a lot of help from people in #django-dev on IRC and ended up doing three dry runs. Here are the final steps I ended up taking:
1. Copied the Subversion repository from code.djangoproject.com to my laptop, to make the migration faster.
    # On the server:
    svnadmin dump /home/svn/django | gzip > svndump.gz

    # On my laptop:
    scp djangoproject.com:svndump.gz .
    gunzip svndump.gz
    svnadmin create /Users/adrian/code/django-svn
    svnadmin load /Users/adrian/code/django-svn < svndump
On my first run of git-svn, I ran it from my laptop and pointed it at code.djangoproject.com, and it took 3.5 hours! After I copied the repo to my laptop and tried it again, it took a little over an hour. But the caveat here is that I also changed the git-svn command between those two runs, so I'm not sure how much of the speed improvement was because of the local SVN repo.
2. Ran git-svn (with the correct arguments!).
    git svn --authors-file=authors.txt --trunk=trunk clone file:///Users/adrian/code/django-svn/django/ django-dry-run
This took a little over an hour, and it created a Git repository called django-dry-run. Note that authors.txt is the authors file, as explained above.
The trickiest thing about this was determining the correct arguments to use -- specifically, whether to use --branches explicitly or --stdlayout. As you can see, I ended up using neither.
Originally, the plan was to migrate all of the branches from our Subversion history -- classics such as magic-removal, new-admin, newforms-admin, unicode, queryset-refactor and multidb -- so that the branches' commit histories (which have all since been merged to trunk) could be preserved in our new Git history. Many of those branches were very involved, with a lot of commits, and there's a lot of value in being able to isolate specific commits in the branch, rather than one large merge commit. (Imagine you're investigating the original reason we added a line of code, for example.)
But as we discussed this over IRC, we decided it wasn't worth the effort: we could always do it later, and git-svn wouldn't actually do it the way we wanted. Ideally, I'd like these branches' histories to be migrated such that they're treated like merged branches in Git -- a merge commit that knows the individual commits on the branch. If you know how to pull this off, and it can be done without altering the Git hashes, please let me know.
3. Changed git-svn-id to point at code.djangoproject.com instead of my laptop.
    git filter-branch --msg-filter "sed \"s|^git-svn-id: file:///Users/adrian/code/django-svn/django/trunk|git-svn-id: http://code.djangoproject.com/svn/django/trunk|g\"" -- master
git-svn adds a "git-svn-id" section to each commit message in the resulting Git repository. It includes a URL pointing to the commit in the original Subversion repository, which is very useful.
But, because I did the import from a local repository, the git-svn-id's were all pointing at my laptop. So I ran git filter-branch to clean it up.
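To preview what that message filter does before rewriting any history, you can run the same sed expression over a sample commit message. This is just a sketch: the function name, the message text, the revision number and the all-zero repository UUID are placeholders rather than real Django data.

```shell
# The same substitution that was passed to `git filter-branch --msg-filter`,
# wrapped in a function (hypothetical name) so it can be tried on plain text.
rewrite_svn_id() {
    sed "s|^git-svn-id: file:///Users/adrian/code/django-svn/django/trunk|git-svn-id: http://code.djangoproject.com/svn/django/trunk|g"
}

# A made-up commit message in the shape git-svn produces:
# the original message, a blank line, then "git-svn-id: <url>@<revision> <uuid>".
sample_message='Fixed a typo in the docs.

git-svn-id: file:///Users/adrian/code/django-svn/django/trunk@100 00000000-0000-0000-0000-000000000000'

printf '%s\n' "$sample_message" | rewrite_svn_id
```

Only the URL prefix on the git-svn-id line changes. The revision number, the UUID and the rest of the message pass through untouched.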
4. Renamed old GitHub django repository to django-old.
(Done via the GitHub Web site.) This was the scary part, because it meant there was no turning back. :-)
Originally we'd talked about deleting the repository outright, but that would have deleted all pull requests and likely would have broken some other things. So I just renamed it to django-old. Not sure how long we'll keep this around.
5. Imported the new repository into GitHub.
    git remote add origin git@github.com:django/django.git
    git push -u origin master
I spotted an error in the repository after the first time I did it, so I had to delete it -- which I thought made for a rare and amusing screenshot:
Then I cleaned up the repository and did it again. I mistakenly created it as a private repository, so I marked it as public, which led GitHub to believe I had just open-sourced Django. :-)
And that's it!
- Final number of commits in our Subversion repository: 17,942.
- Size of Subversion repository: 339 MB gzipped. (That's for the dump file as generated by svnadmin dump.)
- Number of commits created in Git by git-svn: 11,883. (This is less than 17,942 because we only migrated trunk. Any commit to our repository that didn't touch Django trunk -- such as commits to the django_website project or commits to branches -- did not get migrated.)
- Number of forks of the old (mirror) GitHub repository, as of this writing: 783.
- The old Subversion repository will remain indefinitely, for the benefit of scripts out there that do automatic updates, and general stability of the Django world. There won't be any more commits there, obviously.
- If we ever need to dive into the history of one of the big merged branches -- such as magic-removal -- we can do so in the Subversion history. Or we can consider copying the branch history into Git somehow (see above).
- I'd like us to provide some documentation on how to convert your previous Django fork (from the django-old repository) to track the new repository. Any volunteers?
- We still have a bunch of work to do fixing places in our documentation and code.djangoproject.com that refer to Subversion. Bear with us.
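On converting a fork of django-old: one possible starting point, offered as a sketch rather than official instructions, is `git rebase --onto`. It replays the commits you made on top of the old history onto the equivalent commit in the new one. The example below sets up toy stand-ins in a temporary directory instead of touching the real repositories. The remote name `upstream`, the branch name `my-feature`, the `make_repo` helper and all file contents are hypothetical.

```shell
work=$(mktemp -d)
export GIT_AUTHOR_NAME="Example Dev" GIT_AUTHOR_EMAIL="dev@example.com"
export GIT_COMMITTER_NAME="Example Dev" GIT_COMMITTER_EMAIL="dev@example.com"

# make_repo DIR AUTHOR: create a one-commit repository on branch master.
# The two repositories below hold identical content but get different
# commit hashes, because only the author name differs.
make_repo() {
    git init -q "$1"
    git -C "$1" symbolic-ref HEAD refs/heads/master
    echo "print('hello')" > "$1/app.py"
    git -C "$1" add app.py
    GIT_AUTHOR_NAME="$2" git -C "$1" commit -q -m "Initial import"
}
make_repo "$work/old" "adrian"            # stands in for django-old
make_repo "$work/new" "Adrian Holovaty"   # stands in for the new repository

# A fork of the old repository with one local commit on a feature branch.
git clone -q "$work/old" "$work/fork"
git -C "$work/fork" checkout -q -b my-feature
echo "print('feature')" >> "$work/fork/app.py"
git -C "$work/fork" commit -q -am "Add my feature"

# Fetch the new history, then replay only the fork's own commits onto it.
git -C "$work/fork" remote add upstream "$work/new"
git -C "$work/fork" fetch -q upstream
git -C "$work/fork" rebase -q --onto upstream/master origin/master my-feature

git -C "$work/fork" log --format='%h %an %s' my-feature
```

This assumes your branch started from a commit that exists in both histories with identical content. A branch that started from an older trunk revision would need the matching old and new commits picked out by hand.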
Filing bugs / pull requests / the ticket system
GitHub's ticket system is a bit too simple for our needs, given the Django triage process, so we're sticking with our Trac installation, at least for the time being.
But, of course, we want to take advantage of GitHub pull requests at the same time. So we'll need to figure out the right balance between pull requests and Trac tickets, such that we maintain our sanity, we don't make people jump through hoops, and we optimize for contributor and committer productivity.
Personally, I want to avoid a situation (and culture) where we force contributors to use Trac if they post pull requests, especially ones that contain trivial changes. But at the same time, it'll likely become a maintenance nightmare if we have lots of tickets in two places, with no coordination. So, this is an open issue we'll be working to figure out. Jacob has been working on a technological solution.
Thanks to all the people who helped with this transition, and I look forward to the much happier development and collaboration experiences we get with GitHub. The commits and pull requests I've already handled have been a pleasure.
Posted by Dana Woodman on April 29, 2012, at 12:55 a.m.:
Thanks for the write up Adrian! It's nice to hear the details of why certain decisions were made. I appreciate your - and the Django community as a whole - openness and transparency. Hopefully the process of getting all the kinks worked out goes smoothly.
Posted by Brad on April 29, 2012, at 1:36 a.m.:
Thanks Adrian. I echo Dana's statement. We appreciate the work of you and others in the django community.
Posted by Brian Ray on April 29, 2012, at 2:11 a.m.:
That is pretty interesting. I do like git and github. I wonder how well it will work for such a large scale project like Django. Good luck.
Posted by Matt Todd on April 29, 2012, at 5:11 a.m.:
We're happy to have Django on GitHub! We think open source contributors will take to heart the old adage "when in Rome" and start forking and opening pull requests. :)
Posted by Animuchan on April 29, 2012, at 6:25 a.m.:
Nice. Subversion community mourns the loss :3 but the decision clearly was right.
Now that it became way easier to fork and pull-request, I'd guess there will be more incoming patches, too.
Posted by Diederik van der Boor on April 29, 2012, at 11:04 a.m.:
wow., this is f**cking awesome!! :D
Django on GitHub being a reality :D
For anyone considering moving all branches of their svn repository, this is what the KDE guys came up with: http://gitorious.org/svn2git/ It's an awesome tool to rewrite all weird branches using a ruleset. Since the branches of a svn repository can be stored anywhere, with arbitrary roots, such a tool is invaluable for migrating huge repositories.
Posted by Adam Nelson on April 29, 2012, at 3:34 p.m.:
Kudos on getting this done through all the technological (and political) hurdles. Your analogy of Github to Rome is spot-on - thanks for making it.
Also, congratulations for just doing it and worrying about the archival branches later (i.e. Getting Things Done). As you say, the SVN repository is still there. It's not like historians don't have to read more than one book to get the history of something - it seems totally fine to keep the SVN repo for posterity rather than be fastidious about bringing along every old branch into the current repo.
I predict that this will be viewed as a milestone in the Django community and a step-function increase in productivity will be the result.
Posted by Pedro Costa on April 30, 2012, at 10:36 a.m.:
The analogy of Github to Rome is nonsense! If "If I'd lived in Roman times, I'd have lived in Rome. Where else?", why would someone choose Django? RAILS was Rome and probably still is!!! ... but when switching from Java I chose Django (even before the 1.0 release) because it was a Python Framework!
IMHO, Bitbucket would be a better choice! They use Django! and are long time supporters! With Bitbucket git could also be used... but I again think that Hg would be a better choice: Easier!, better windows support (I'm a Mac user, but this is a point to consider!) and it's made with PYTHON.
Posted by Christian on April 30, 2012, at 12:24 p.m.:
This is what we've been waiting for a long time.
I don't know if Github is the right place for the django project - but GIT is for sure the right choice from my point of view.
Posted by Patrick Taylor on April 30, 2012, at 2:50 p.m.:
@Pedro, Bitbucket is quite good and I tend to agree that it's good to support those that support you, but the number of people that used Github for DVCS access to Django is at least an order of magnitude higher:
That's not a difference that I think the core team could ignore.
Posted by zodman on April 30, 2012, at 5:24 p.m.:
I think you could implement Jira with GitHub integration in place of Trac. Jira is very powerful and workflow-enabled.
Posted by Mitar on April 30, 2012, at 7:21 p.m.:
I would be interested in more details on how you are planning to integrate GitHub with Trac. I am also doing this just at this moment for one of our open source projects, where we will probably be syncing the GitHub repository with a local clone for Trac and using Trac tickets for development. I am familiar with Trac plugins and have made a similar plugin for syncing repositories for Bitbucket. And now I was thinking of making the same for GitHub: https://bitbucket.org/mitar/trac-bitbucketsync/overview
If you need any help or would like to collaborate on this matter, feel free to contact me. (Contact on my website.)
Posted by Diederik van der Boor on May 1, 2012, at 7:50 a.m.:
@Mitar: trac needs a locally stored git repository, which you can create using:
git clone --mirror http://example.com/path-to-repos/
That way, you get a 'bare' repository with all the server branches as local branches. Trac can read the contents of that repository. In a cronjob, you can run a 'git fetch' to keep the repository up to date.
Posted by Lurker on May 1, 2012, at 4:27 p.m.:
Very nice article about a git-svn/svn2git migration adventure, read by someone who is now forced to "make do" with the seriously rotten (buggy beyond belief) infrastructure provided by the Microsoft TFS "ecosystem" (SvnBridge), to try to achieve a sufficiently git-enabled (or possibly Mercurial) workflow against all odds - which would actually be *much* better than the previous VSS-hampered Something (yuck!). The jury is still out on whether a drastically patched-up SvnBridge or git-tfs (probably hosted on Windows server side, or perhaps the Mono Trojan Horse) is the way to go to achieve sufficient service quality, though. Still, definitely not my premier choice of tooling...
Posted by Mitar on May 2, 2012, at 7:29 a.m.:
@Diederik: I know, but you can also set up a GitHub service hook that calls Trac on every push to GitHub, to sync the local git repository. I have made a similar plugin (link in my previous post) for Bitbucket & Mercurial.
I would like to know what this technological solution Jacob has been working on is, because I could maybe help.
Posted by Rob on May 2, 2012, at 2:47 p.m.:
Thanks for posting your adventure in detail!
It's nice to see you talk about doing a migration over again, and why, because I've done that.
Examples I can remember doing:
- CVS to SVN
- Bugzilla (old, extra fields) to Bugzilla (then-new 3.0, standard fields)
- some other database migrations
- XML content migrations
- DTD to newer DTD
All of them required test runs and second runs. Testing with subsets has also been useful. It's important to remember as a community that, after initial analysis, the pattern of ((try, fail, fix) repeat) is a valuable tool in software development, way faster than analysis paralysis.
Posted by Eric Rohlfs on May 3, 2012, at 12:33 p.m.:
Posted by Mitar on May 4, 2012, at 12:52 p.m.:
I made this Trac plugin to keep GitHub repository synced with my local Trac git mirror repository:
It allows us to have GitHub repository and push there and work there, but still use Trac for ticketing and wiki.
Posted by Larry Martell on May 6, 2012, at 12:05 a.m.:
Last month I set up a system and when I cloned django from git I got version 1.4. Now I am setting up another system, and when I clone django from git I get 1.5, and my app is failing. How can I get 1.4? I see a django-nonrel / django-1.4 but it says "Work in progress 1.4 port, DON'T USE"
Posted by Ramiro on May 6, 2012, at 3:16 p.m.:
Look at https://github.com/django/djangoproject.com/tree/github/django_website/gitrachub
Posted by Paul on May 10, 2012, at 4:35 p.m.:
It would be a huge flop for the whole community if you forget to opensource the project afterwards =)
Posted by Andrés Torres Marroquín on June 6, 2012, at 12:49 p.m.:
This is awesome man, I think we can go forward faster because of GitHub pull requests.
Posted by Nicol on June 12, 2012, at 1:09 a.m.:
I have to highly applaud the WebFaction team for getting this working and making it available to their users. I have been trying with great difficulty to get this working on my own VPS till a friend (and WebFaction customer) told me to take a look here. The timing was almost perfect, as I had been dealing with this on and off for over three months and was despairing. This combination of Trac and Git alone is prompting me to seriously investigate migrating all my hosting to WebFaction. Well done WF!
Posted by qwerqwert on July 23, 2012, at 1:18 a.m.:
Your image http://www.holovaty.com/images/2012-04-28github2.png is a broken link (404).