Question
We're pretty happy with SVN right now, but Joel's tutorial intrigued me. So I was wondering - would it be feasible in our situation too?
The thing is - our SVN repository is HUGE. The software carries 15 years of legacy and has already survived several different source control systems. There are over 68,000 revisions (changesets), the source itself takes up over 100MB, and I can't even begin to guess how many GB the whole repository consumes.
The problem then is simple - a clone of the whole repository would probably take ages to make, and would consume far more space on the drive than is remotely sane. And since the very point of distributed version control is to have as many repositories as needed, I'm starting to get doubts.
How does Mercurial (or any other distributed version control) deal with this? Or are they unusable for such huge projects?
Added: To clarify - the whole thing is one monolithic beast of a project which compiles to a single .EXE and cannot be split up.
Added 2: Second thought - The Linux kernel repository uses git and is probably an order of magnitude or two bigger than mine. So how do they make it work?
Answer 1:
100MB of source code is less than the Linux kernel. The changelog between Linux kernel 2.6.33 and 2.6.34-rc1 alone has 6,604 commits. Your repository's scale doesn't sound intimidating to me.
- Linux kernel 2.6.34-rc1 uncompressed from .tar.bz2 archive: 445MB
- Linux kernel 2.6 head checked out from main Linus tree: 827MB
Twice as much, but still peanuts with the big hard drives we all have.
Answer 2:
Distributed version control for HUGE projects - is it feasible?
Absolutely! As you know, Linux is massive and uses Git. Mercurial is used for some major projects too, such as Python, Mozilla, OpenSolaris and Java.
We're pretty happy with SVN right now, but Joel's tutorial intrigued me. So I was wondering - would it be feasible in our situation too?
Yes. And if you're happy with Subversion now, you're probably not doing much branching and merging!
The thing is - our SVN repository is HUGE. [...] There are over 68,000 revisions (changesets), the source itself takes up over 100MB
As others have pointed out, that's actually not so big compared to many existing projects.
The problem then is simple - a clone of the whole repository would probably take ages to make, and would consume far more space on the drive that is remotely sane.
Both Git and Mercurial are very efficient at managing the storage, and their repositories take up far less space than the equivalent Subversion repo (having converted a few). And once you have an initial checkout, you're only pushing deltas around, which is very fast. They are both significantly faster in most operations. The initial clone is a one-time cost, so it doesn't really matter how long it takes (and I bet you'd be surprised!).
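To see the clone-once, pull-deltas model concretely, here is a minimal sketch in a throwaway Git repository (all paths and names are made up for the sketch; the flow is the same in Mercurial):

```shell
# Build a tiny "central" repo, clone it once, then pull only the new delta.
set -e
tmp=$(mktemp -d); cd "$tmp"

git init -q origin-repo
cd origin-repo
git config user.email dev@example.com
git config user.name Dev
echo "v1" > file.txt
git add file.txt
git commit -qm "initial import"        # stands in for your 68k-revision history
cd ..

git clone -q origin-repo work-copy     # the one-time full clone

cd origin-repo
echo "v2" >> file.txt
git commit -qam "upstream change"      # someone commits upstream
cd ../work-copy

git pull -q                            # only the new objects travel, not the whole repo
grep v2 file.txt
```

After the initial clone, every fetch or pull transfers only the compressed objects you don't already have, which is why day-to-day syncs stay fast even when the full history is large.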
And since the very point of distributed version control is to have as many repositories as needed, I'm starting to get doubts.
Disk space is cheap. Developer productivity matters far more. So what if the repo takes up 1GB? If you can work smarter, it's worth it.
How does Mercurial (or any other distributed version control) deal with this? Or are they unusable for such huge projects?
It's probably worth reading up on how projects that use Mercurial, such as Mozilla, managed the conversion process. Most of these have multiple repos, each containing major components. Mercurial and Git both support nested repositories too, and there are tools to manage the conversion process - Mercurial has built-in support for importing from most other systems.
Added: To clarify - the whole thing is one monolithic beast of a project which compiles to a single .EXE and cannot be split up.
That makes it easier, as you only need the one repository.
Added 2: Second thought - The Linux kernel repository uses git and is probably an order of magnitude or two bigger than mine. So how do they make it work?
Git is designed for raw speed. The on-disk format, the wire protocol, the in-memory algorithms are all heavily optimized. And they have developed sophisticated workflows, where patches flow from individual developers, up to subsystem maintainers, up to lieutenants, and eventually up to Linus. One of the best things about DVCS is that they are so flexible they enable all sorts of workflows.
I suggest you read the excellent book on Mercurial by Bryan O'Sullivan, which will get you up to speed fast. Download Mercurial and work through the examples, and play with it in some scratch repos to get a feel for it.
Then fire up the convert command to import your existing source repository. Then try making some local changes, commits, and branches; view logs; use the built-in web server; and so on. Then clone it to another box and push around some changes. Time the most common operations and see how it compares. You can do a complete evaluation at no cost but some of your time.
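That evaluation drill can be scripted end to end in a scratch directory. A minimal sketch with Git (the answer suggests Mercurial; the hg commands are near-identical, and every name below is made up):

```shell
set -e
tmp=$(mktemp -d); cd "$tmp"

# Scratch repo standing in for the imported project.
git init -q trial
cd trial
git config user.email dev@example.com
git config user.name Dev
echo "hello" > notes.txt
git add notes.txt
git commit -qm "first commit"

# Local branching and logs.
git checkout -qb experiment
echo "idea" >> notes.txt
git commit -qam "try an idea"
git log --oneline

# "Another box" is just a sibling directory here.
cd ..
git clone -q trial trial-clone
cd trial-clone
git config user.email dev2@example.com
git config user.name Dev2
echo "more" >> notes.txt
git commit -qam "change made on the clone"
git push -q origin HEAD:from-clone     # push back to a new branch on the original
```

Time each step against the same operations in your Subversion checkout; that side-by-side comparison is the whole evaluation.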
Answer 3:
Don't worry about repository space requirements. My anecdote: when I converted our codebase from SVN to git (full history, I think), I found that the clone used less space than just the SVN working directory. SVN keeps a pristine copy of all your checked-out files: look at $PWD/.svn/text-base/ in any SVN checkout. With git, the entire history takes less space.
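A quick way to check this claim on your own machine is to compare the size of the history store against the working tree. A toy sketch with made-up content (run the same measurement against a real conversion for meaningful numbers):

```shell
set -e
tmp=$(mktemp -d); cd "$tmp"
git init -q demo
cd demo
git config user.email dev@example.com
git config user.name Dev

# Five revisions of one file, standing in for real history.
for i in 1 2 3 4 5; do
  printf 'revision %s\n' "$i" > data.txt
  git add data.txt
  git commit -qm "rev $i"
done

git gc -q          # pack loose objects, as git does periodically anyway
du -sh .git        # the entire history, packed
du -sh data.txt    # the working copy
```

Contrast this with SVN, where the .svn/text-base pristine copies alone already double the checkout's footprint before any history is counted.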
What really surprised me was how network-efficient git is. I did a git clone of a project at a well-connected place, then took it home on a flash disk, and I keep it up to date with git fetch / git pull over just my puny little GPRS connection. I wouldn't dare do the same with an SVN-controlled project.
You really owe it to yourself to at least try it. I think you'll be amazed at just how wrong your centralised-VCS-centric assumptions were.
Answer 4:
Do you need all history? If you only need the last year or two, you could consider leaving the current repository in a read-only state for historical reference. Then create a new repository with only recent history by performing svnadmin dump with the lower bound revision, which forms the basis for your new distributed repository.
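A sketch of that dump-and-load flow (requires the Subversion admin tools; shown here on an empty scratch repository, with the cut-off revision as a placeholder you would replace with your real lower bound):

```shell
set -e
tmp=$(mktemp -d); cd "$tmp"

svnadmin create old-repo                   # stands in for your 68k-revision repo
LOWER=0                                    # e.g. the first revision of last year
UPPER=$(svnlook youngest old-repo)

# Dump only the recent range, then load it into a fresh repository.
svnadmin dump old-repo -r "$LOWER:$UPPER" --incremental -q > recent.dump
svnadmin create new-repo
svnadmin load -q new-repo < recent.dump

svnlook youngest new-repo                  # the trimmed repo is now current
```

The trimmed new-repo is then what you would feed to an importer such as hg convert or git svn clone as the basis for the distributed repository.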
I do agree with the other answer that 100MB working copy and 68K revisions isn't that big. Give it a shot.
Answer 5:
You say you're happy with SVN... so why change?
As far as distributed version control systems go, Linux uses git and Sun uses Mercurial. Both are impressively large source code repositories, and they work just fine. Yes, you end up with all revisions on every workstation, but that's the price you pay for decentralisation. Remember, storage is cheap - my development laptop currently has 1TB (2x500GB) of hard disk storage on board. Have you tried pulling your SVN repo into Git or Mercurial to see how much space it would actually take?
My question would be - are you ready as an organisation to go decentralised? For a software shop it usually makes much more sense to keep a central repository (regular backups, hook ups to CruiseControl or FishEye, easier to control and administer).
And if you just want something faster or more scalable than SVN, then just buy a commercial product - I've used both Perforce and Rational ClearCase and they scale up to huge projects without any problems.
Answer 6:
You could split your one huge repository into lots of smaller repositories, one for each module in your old repo. That way, developers would simply keep as repositories whatever SVN projects they would have checked out before, so not much more space is required than before.
Answer 7:
I am using git on a fairly large C#/.NET project (68 projects in 1 solution), and the TFS footprint of a fresh fetch of the full tree is ~500 MB. The git repo, storing a fair number of commits locally, weighs in at ~800 MB. The compaction and the way storage works internally in git are excellent. It is amazing to see so many changes packed into such a small amount of space.
Answer 8:
In my experience, Mercurial is pretty good at handling a large number of files and a huge history. The drawback is that you shouldn't check in files bigger than 10 MB. We used Mercurial to keep a history of our compiled DLLs. Putting binaries in source control isn't recommended, but we tried it anyway (it was a repository dedicated to the binaries). That repository was about 2 GB, and we weren't too sure we would be able to keep doing that in the future. Anyway, for source code, I don't think you need to worry.
Answer 9:
Git can obviously handle a project as big as yours since, as you pointed out, the Linux kernel alone is bigger.
The challenge with Mercurial and Git (I don't know whether you manage big files) is that, so far, they can't handle big files well.
I have experience moving a project your size (also around 15 years old) from CVS/SVN (a mix of the two, actually) into Plastic SCM for distributed and centralized development (the two workflows happening inside the same organization at the same time).
The move will never be seamless, since it's not only a technical problem but involves a lot of people (a project as big as yours probably involves several hundred developers, doesn't it?), but there are importers to automate the migration, and training can be done very fast too.
Answer 10:
No, it does not work. You don't want anything that requires significant storage on the client side. If the repository gets that large (for example, by storing images etc.), it requires more storage than a normal workstation has to be efficient.
You're better off with something centralized then. Simple math - it simply is not feasible to have tons of GB on every workstation AND be efficient there. It makes no sense.
Source: https://stackoverflow.com/questions/2476356/distributed-version-control-for-huge-projects-is-it-feasible