How are Git branches and tags stored on disk?

Posted by 拟墨画扇 on 2019-12-12 08:15:26

Question


I recently checked one of my Git repositories at work, which has more than 10,000 branches and more than 30,000 tags. The total size of the repo after a fresh clone is 12 GB. I am sure there is no reason to have 10,000 branches, so I suspect they occupy a considerable amount of disk space. My questions are as follows:

  1. How are branches and tags stored on disk: what data structure is used, and what information is stored for every branch?
  2. How do I get metadata about the branches, such as when a branch was created and what its size is?

Answer 1:


All Git references (branches, tags, notes, stashes, etc.) use the same system. It has two parts:

  • the references themselves, and
  • "reflogs"

Reflogs are stored in .git/logs/refs/ based on the reference-name, with one exception: reflogs for HEAD are stored in .git/logs/HEAD rather than .git/logs/refs/HEAD.

References come either "loose" or "packed". Packed refs are in .git/packed-refs, which is a flat file of (SHA-1, refname) pairs for simple refs, plus extra information for annotated tags. "Loose" refs are in .git/refs/name. These files contain either a raw SHA-1 (probably the most common), or the literal string ref: followed by the name of another reference for symbolic refs (usually only for HEAD but you can make others). Symbolic refs are not packed (or at least, I can't seem to make that happen :-) ).
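To make the packed-refs layout concrete, here is a minimal sketch of a parser for that flat file. It assumes the usual text form: an optional `#` header line, one `<SHA-1> <refname>` pair per line, and a line starting with `^` after an annotated tag giving the object the tag ultimately points at. The function name `parse_packed_refs` and the hashes in the sample are made up for illustration.

```python
def parse_packed_refs(text):
    """Parse the text of a packed-refs file into {refname: {"sha", "peeled"}}."""
    refs = {}
    last = None
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue  # blank line or header comment
        if line.startswith("^"):
            # "Peeled" line: the object the preceding annotated tag points at.
            if last is not None:
                refs[last]["peeled"] = line[1:]
            continue
        sha, name = line.split(" ", 1)
        refs[name] = {"sha": sha, "peeled": None}
        last = name
    return refs


# Hypothetical sample content; the hashes are placeholders, not real objects.
sample = (
    "# pack-refs with: peeled fully-peeled sorted\n"
    "1111111111111111111111111111111111111111 refs/heads/master\n"
    "2222222222222222222222222222222222222222 refs/tags/v1.0\n"
    "^3333333333333333333333333333333333333333\n"
)
refs = parse_packed_refs(sample)
print(len(refs), "refs;", "v1.0 peels to", refs["refs/tags/v1.0"]["peeled"])
```

Each entry is a single short line, which is why even tens of thousands of packed refs only add up to a few megabytes at most.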

Packing tags and "idle" branch heads (those that are not being updated actively) saves space and time. You can use git pack-refs to do this. However, git gc invokes git pack-refs for you, so generally you don't need to do this yourself.




Answer 2:


So, I’m going to expand on the topic a bit and explain how Git stores what. Doing so will explain what information is stored, and what exactly matters for the size of the repository. As a fair warning: this answer is rather long :)

Git objects

Git is essentially a database of objects. Those objects come in four different types and are all identified by a SHA1 hash of their contents (prefixed with a short header giving the object's type and size). The four types are blobs, trees, commits and tags.

Blob

A blob is the simplest type of object. It stores the content of a file. So for each file content you store within your Git repository, a single blob object exists in the object database. As it stores only the file content, and not metadata like file names, this is also the mechanism that prevents files with identical content from being stored multiple times.
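The deduplication follows directly from how a blob's ID is computed. As a small sketch (the helper name `git_blob_id` is made up here), the ID is the SHA1 of a `blob <size>` header, a NUL byte, and the raw file content; the file name never enters the hash:

```python
import hashlib


def git_blob_id(content: bytes) -> str:
    """Return the object ID Git would assign to a blob with this content."""
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


empty_id = git_blob_id(b"")
print(empty_id)  # e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 (the well-known empty blob)

# Two files with the same content, whatever their names, map to one blob.
readme_copy_a = git_blob_id(b"same content\n")
readme_copy_b = git_blob_id(b"same content\n")
print(readme_copy_a == readme_copy_b)
```

This assumes a repository using SHA1 object names, as described in this answer.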

Tree

Going one level up, the tree is the object that puts the blobs into a directory structure. A single tree corresponds to a single directory. It is essentially a list of files and subdirectories, with each entry containing a file mode, a file or directory name, and a reference to the Git object that belongs to the entry. For subdirectories, this reference points to the tree object that describes the subdirectory; for files, this reference points to the blob object storing the file contents.
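As a rough sketch of what a tree looks like in binary form: each entry is the mode, a space, the name, a NUL byte, and then the 20 raw bytes of the referenced object's ID. The helper name `git_tree_id` is made up, and the mode strings used (`100644` for a regular file, `40000` for a subdirectory) are the common ones, not the full set.

```python
import hashlib

EMPTY_BLOB = "e69de29bb2d1d6434b8b29ae775ad8c2e48c5391"


def git_tree_id(entries):
    """entries: iterable of (mode, name, hex_object_id) tuples."""

    def sort_key(entry):
        mode, name, _ = entry
        # Git orders entries by name, comparing directories as if "name/".
        return name.encode("utf-8") + (b"/" if mode == "40000" else b"")

    body = b""
    for mode, name, hex_id in sorted(entries, key=sort_key):
        body += mode.encode("ascii") + b" " + name.encode("utf-8") + b"\0"
        body += bytes.fromhex(hex_id)  # 20 raw bytes, not 40 hex characters
    header = b"tree " + str(len(body)).encode("ascii") + b"\0"
    return hashlib.sha1(header + body).hexdigest()


empty_tree = git_tree_id([])
print(empty_tree)  # 4b825dc642cb6eb9a060e54bf8d69288fbee4904 (the well-known empty tree)

# A directory holding two empty files; both entries point at the same blob.
two_files = git_tree_id([("100644", "b.txt", EMPTY_BLOB), ("100644", "a.txt", EMPTY_BLOB)])
print(two_files)
```

Because a tree refers to its children only by ID, an unchanged subdirectory is shared between commits instead of being stored again.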

Commit

Blobs and trees are already enough to represent a complete file system. To add the versioning on top of that, we have commit objects. Commit objects are created whenever you commit something in Git. Each commit represents a snapshot in the history of revisions.

It contains a reference to the tree object describing the root directory of the repository. This also means that every commit that actually introduces some changes at least requires a new tree object (likely more).

A commit also contains a reference to its parent commits. There is usually just a single parent (for a linear history), but a commit can have any number of parents; with more than one it is called a merge commit. Most workflows will only ever make you do merges with two parents, but you can really have any other number too.

And finally, a commit also contains the metadata you expect a commit to have: author and committer (name, email and time) and of course the commit message.
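Put together, the body of a commit object is a few header lines, a blank line, and the message. The sketch below parses that text form; `parse_commit` is a made-up helper, the parent hash and the identity are placeholders, and extra headers that real commits may carry (such as signatures) are simply ignored.

```python
def parse_commit(raw: str):
    """Split a commit object's text into its headers and message."""
    head, _, message = raw.partition("\n\n")
    commit = {"tree": None, "parents": [], "author": None,
              "committer": None, "message": message}
    for line in head.splitlines():
        key, _, value = line.partition(" ")
        if key == "parent":
            commit["parents"].append(value)  # zero, one, or several parents
        elif key in ("tree", "author", "committer"):
            commit[key] = value
    return commit


# Hypothetical commit body; the parent hash and the person are placeholders.
raw_commit = (
    "tree 4b825dc642cb6eb9a060e54bf8d69288fbee4904\n"
    "parent 1111111111111111111111111111111111111111\n"
    "author Jane Doe <jane@example.com> 1576138526 +0000\n"
    "committer Jane Doe <jane@example.com> 1576138526 +0000\n"
    "\n"
    "Remove all files\n"
)
commit = parse_commit(raw_commit)
print(commit["tree"], len(commit["parents"]), commit["message"].strip())
```

Note that nothing in a commit records which branch it was made on; it only knows its tree and its parents.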

That is all that is necessary to have a full version control system; but of course there is one more object type:

Tag

Tag objects are one way to store tags. To be precise, tag objects store annotated tags, which are tags that carry some meta information, similar to commits. They are created by git tag -a (or when creating a signed tag) and require a tag message. They also contain a reference to the object they point at (usually a commit) and a tagger (name and time).

References

Up until now, we have a full versioning system, with annotated tags, but all our objects are identified by their SHA1 hash. That’s of course a bit annoying to use, so we have some other thing to make it easier: References.

References come in different flavors, but the most important thing about them is this: They are simple text files containing 40 characters—the SHA1 hash of the object they are pointing to. Because they are this simple, they are very cheap, so working with many references is no problem at all. It creates no overhead and there is no reason not to use them.

There are usually three “types” of references: branches, tags and remote branches. They all work the same way and point to commit objects, except for annotated tags, which point to tag objects (normal, lightweight tags are just commit references too). The difference between them is how you create them and in which subpath of .git/refs/ they are stored. I won’t cover this now though, as this is explained in nearly every Git tutorial; just remember: references, including branches, are extremely cheap, so don’t hesitate to create them for just about everything.

Compression

Now because torek mentioned something about Git’s compression in his answer, I want to clarify this a bit. Unfortunately he mixed a few things up.

So, usually for new repositories, all Git objects are stored in .git/objects as files identified by their SHA1 hash. The first two characters are stripped from the filename and are used to partition the files into multiple folders, just so it gets a bit easier to navigate.
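The path rule described above can be sketched in a few lines. The helper names are made up; the sketch also assumes that a loose object file holds the zlib-compressed `<type> <size>` header, a NUL byte, and the content, which is how loose objects are normally written.

```python
import hashlib
import os
import zlib


def encode_loose_object(obj_type: str, content: bytes):
    """Return (object_id, file_bytes) for a loose object of the given type."""
    store = obj_type.encode("ascii") + b" " + str(len(content)).encode("ascii") + b"\0" + content
    object_id = hashlib.sha1(store).hexdigest()
    return object_id, zlib.compress(store)


def loose_object_path(git_dir: str, object_id: str) -> str:
    """First two hex characters name the folder, the remaining 38 the file."""
    return os.path.join(git_dir, "objects", object_id[:2], object_id[2:])


object_id, file_bytes = encode_loose_object("blob", b"")
print(loose_object_path(".git", object_id))
print(zlib.decompress(file_bytes))
```

With two hex characters there are at most 256 such folders, which keeps any single directory from growing too large.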

At some point, when the history gets bigger or when it is triggered by something else, Git will start to compress objects. It does this by packing multiple objects into a single pack file. How this exactly works is not really that important; it will reduce the amount of individual Git objects and efficiently store them in single, indexed archives (at this point, Git will use delta compression btw.). The pack files are then stored in .git/objects/pack and can easily get a few hundred MiB in size.

For references, the situation is somewhat similar, although a lot simpler. All current references are stored in .git/refs, e.g. branches in .git/refs/heads, tags in .git/refs/tags and remote branches in .git/refs/remotes/<remote>. As mentioned above, they are simple text files containing only the 40 character identifier of the object they are pointing at.

At some point, Git will move older references—of any type—into a single lookup file: .git/packed-refs. That file is just a long list of hashes and reference names, one entry per line. References that are kept in there are removed from the refs directory.
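The lookup order that results from this can be sketched as follows: a loose file under the Git directory wins, otherwise the name is looked up in packed-refs, and a value starting with `ref: ` is followed to another reference. This is only an illustration with made-up helper names and placeholder hashes, assuming the plain file-based layout described here; it is not how Git itself is implemented.

```python
import os
import tempfile


def read_packed_ref(git_dir, name):
    """Return the hash stored for `name` in packed-refs, or None."""
    path = os.path.join(git_dir, "packed-refs")
    if not os.path.isfile(path):
        return None
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if not line or line[0] in "#^":
                continue  # header comment or peeled-tag line
            sha, refname = line.split(" ", 1)
            if refname == name:
                return sha
    return None


def resolve_ref(git_dir, name, max_depth=5):
    """Resolve a reference name to a hash: loose file first, then packed-refs."""
    for _ in range(max_depth):
        loose = os.path.join(git_dir, *name.split("/"))
        if os.path.isfile(loose):
            with open(loose, encoding="utf-8") as f:
                value = f.read().strip()
        else:
            value = read_packed_ref(git_dir, name)
        if value is None:
            return None
        if value.startswith("ref: "):
            name = value[len("ref: "):]  # symbolic ref, e.g. HEAD
            continue
        return value
    return None


def write_file(git_dir, name, text):
    path = os.path.join(git_dir, *name.split("/"))
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)


# Build a throwaway directory that mimics the layout (placeholder hashes).
with tempfile.TemporaryDirectory() as git_dir:
    write_file(git_dir, "HEAD", "ref: refs/heads/main\n")
    write_file(git_dir, "packed-refs", "1" * 40 + " refs/heads/main\n")
    from_packed = resolve_ref(git_dir, "HEAD")
    # A later update to the branch creates a loose file that shadows the packed entry.
    write_file(git_dir, "refs/heads/main", "3" * 40 + "\n")
    from_loose = resolve_ref(git_dir, "HEAD")
    missing = resolve_ref(git_dir, "refs/heads/nope")

print(from_packed[:7], from_loose[:7], missing)
```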

Reflogs

Torek mentioned those as well. Reflogs are essentially just logs for references: they keep track of what happens to references. If you do anything that affects a reference (commit, checkout, reset, etc.) then a new log entry is added simply to log what happened. It also provides a way to go back after you did something wrong. A common use case for example is to access the reflog after accidentally resetting a branch to somewhere it wasn’t supposed to go. You can then use git reflog to look at the log and see where the reference was pointing before. As loose Git objects are not immediately deleted (objects that are part of the history are never deleted), you can usually restore the previous situation easily.

Reflogs are however local: They only keep track of what happens to your local repository. They are not shared with remotes, and are never transferred. A freshly cloned repository will have a reflog with a single entry, it being the clone action. Old entries also expire after a while (by default after 90 days, or 30 days for entries no longer reachable from the reference), so reflogs won’t become a storage problem.
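Each reflog is itself a plain text file with one line per update. As a small sketch, a line holds the old hash, the new hash, who made the change, a Unix timestamp with time zone, and then a tab followed by the message. The helper name `parse_reflog_line` and the sample values below are made up for illustration.

```python
def parse_reflog_line(line: str):
    """Split one reflog line into its fields."""
    meta, _, message = line.rstrip("\n").partition("\t")
    old, new, rest = meta.split(" ", 2)
    who, timestamp, tz = rest.rsplit(" ", 2)
    return {"old": old, "new": new, "who": who,
            "time": int(timestamp), "tz": tz, "message": message}


# Hypothetical first entry of a fresh clone; an all-zero old hash means
# the reference did not exist before.
sample_line = (
    "0" * 40 + " " + "1" * 40
    + " Jane Doe <jane@example.com> 1576138526 +0000"
    + "\tclone: from https://example.com/repo.git\n"
)
entry = parse_reflog_line(sample_line)
print(entry["who"], entry["time"], entry["message"])
```

The timestamp of the oldest surviving entry for a branch is about the closest thing to a local "creation time" you can find, and only as long as that entry has not expired.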

Some final words

So, getting back to your actual question. When you clone a repository, Git will usually already receive the repository in a packed format. This is already done to save transfer time. References are very cheap, so they are never the cause of big repositories. However, because of Git’s nature, a single current commit object is the entry point to a whole acyclic graph that eventually reaches the very first commit, the very first tree, and the very first blob. So a repository will always contain all the information for all revisions. That is what makes repositories with a long history big. Unfortunately, there is not really much you can do about it. Well, you could cut off older history at some point, but that will leave you with a shallow repository that is missing the older history (you do this by cloning with the --depth parameter).

And as for your second question, as I explained above, branches are just references to commits, and references are only pointers to Git objects. So no, there is not really any metadata about branches you can get from them. The only thing that might give you an idea is the first commit you made when branching off in your history. But having branches does not automatically mean that there is actually a branch kept in the history (fast-forward merging and rebasing work against it), and just because there is some branching-off in the history, that does not mean that the branch (the reference, the pointer) still exists.




Answer 3:


Note: regarding pack-refs, the process of creating them should be much faster with Git 2.2+ (November 2014)

See commit 9540ce5 by Jeff King (peff):

refs: write packed_refs file using stdio

We write each line of a new packed-refs file individually using a write() syscall (and sometimes 2, if the ref is peeled). Since each line is only about 50-100 bytes long, this creates a lot of system call overhead.

We can instead open a stdio handle around our descriptor and use fprintf to write to it. The extra buffering is not a problem for us, because nobody will read our new packed-refs file until we call commit_lock_file (by which point we have flushed everything).

On a pathological repository with 8.5 million refs, this dropped the time to run git pack-refs from 20s to 6s.


Update Sept 2016: Git 2.11+ will include chained tags in pack-refs ("chained tags and git clone --single-branch --branch tag")

And the same Git 2.11 will now make fuller use of pack bitmaps.

See commit 645c432, commit 702d1b9 (10 Sep 2016) by Kirill Smelkov (navytux).
Helped-by: Jeff King (peff).
(Merged by Junio C Hamano -- gitster -- in commit 7f109ef, 21 Sep 2016)

pack-objects: use reachability bitmap index when generating non-stdout pack

Pack bitmaps were introduced in Git 2.0 (commit 6b8fda2, Dec. 2013), from Google's work for JGit.

We use the bitmap API to perform the Counting Objects phase in pack-objects, rather than a traditional walk through the object graph.

Now (2016):

Starting from 6b8fda2 (pack-objects: use bitmaps when packing objects), if a repository has bitmap index, pack-objects can nicely speedup "Counting objects" graph traversal phase.
That however was done only for case when resultant pack is sent to stdout, not written into a file.

One might want to generate on-disk packfiles for a specialized object transfer.
It would be useful to have some way of overriding this heuristic:
to tell pack-objects that even though it should generate on-disk files, it is still OK to use the reachability bitmaps to do the traversal.


Note: Git 2.12 illustrates that using bitmaps has a side effect on git gc --auto

See commit 1c409a7, commit bdf56de (28 Dec 2016) by David Turner (csusbdt).
(Merged by Junio C Hamano -- gitster -- in commit cf417e2, 18 Jan 2017)

The bitmap index only works for single packs, so requesting an incremental repack with bitmap indexes makes no sense.

Incremental repacks are incompatible with bitmap indexes


Git 2.14 refines pack-objects

See commit da5a1f8, commit 9df4a60 (09 May 2017) by Jeff King (peff).
(Merged by Junio C Hamano -- gitster -- in commit 137a261, 29 May 2017)

pack-objects: disable pack reuse for object-selection options

If certain options like --honor-pack-keep, --local, or --incremental are used with pack-objects, then we need to feed each potential object to want_object_in_pack() to see if it should be filtered out.
But when the bitmap reuse_packfile optimization is in effect, we do not call that function at all, and in fact skip adding the objects to the to_pack list entirely.

This means we have a bug: for certain requests we will silently ignore those options and include objects in that pack that should not be there.

The problem has been present since the inception of the pack-reuse code in 6b8fda2 (pack-objects: use bitmaps when packing objects, 2013-12-21), but it was unlikely to come up in practice.
These options are generally used for on-disk packing, not transfer packs (which go to stdout), but we've never allowed pack reuse for non-stdout packs (until 645c432, we did not even use bitmaps, which the reuse optimization relies on; after that, we explicitly turned it off when not packing to stdout).



Source: https://stackoverflow.com/questions/20666331/how-git-branches-and-tags-are-stored-in-disks
