I'm wondering about what git is doing when it pushes up changes, and why it seems to occasionally push way more data than the changes I've made. I made some changes to two
I just realized that there is a very realistic scenario which can result in an unusually big push.
Which objects does push send? Those which do not yet exist on the server, or rather, those which it has not detected as existing. How does it check whether an object exists? At the beginning of a push, the server sends the references (branches and tags) which it has. So, for example, if they have the following commits:
CLIENT                                  SERVER
(foo) -----------> aaaaa1
                     |
(origin/master) -> aaaaa0               (master) -> aaaaa0
                     |                                |
                    ...                              ...
Then the client will get something like refs/heads/master aaaaa0, and find that it has to send only what is new in commit aaaaa1.
But if somebody has pushed anything to the remote master, the situation is different:
CLIENT                                  SERVER
(foo) -----------> aaaaa1               (master) --> aaaaa2
                     |                                /
(origin/master) -> aaaaa0                       aaaaa0
                     |                            |
                    ...                          ...
Here, the client gets refs/heads/master aaaaa2, but it does not know anything about aaaaa2, so it cannot deduce that aaaaa0 exists on the server. So, in this simple case of only 2 branches, the whole history will be sent instead of only the incremental part.
This is unlikely to happen in a mature, actively developed project, which has tags and many branches, some of which go stale and are no longer updated. The client usually already has the commits those stale references point to, so they still give it common ground with the server. Users there might send a bit more than necessary, but the difference is not as big as in your case and goes unnoticed. In very small teams, though, it can happen more often and the difference can be significant.
To avoid it, you could run git fetch before pushing. Then, in my example, the aaaaa2 commit would already exist on the client, and git push foo would know that it should not send aaaaa0 and older history.
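To make the three situations concrete, here is a toy model in Python. It is only a sketch of the reasoning above, not git's actual negotiation code: the PARENTS table, the "root" commit standing in for older history, and the function names are all made up for illustration.

```python
# Toy model of the push negotiation described above; this is not git's code.
# Commit graph as child -> parents. "root" is a made-up stand-in for the
# older history below aaaaa0.
PARENTS = {
    "root": [],
    "aaaaa0": ["root"],
    "aaaaa1": ["aaaaa0"],   # local branch foo
    "aaaaa2": ["aaaaa0"],   # pushed to the server's master by somebody else
}

def reachable(tip, have):
    """Return tip and its ancestors, limited to commits present in `have`."""
    seen, stack = set(), [tip]
    while stack:
        commit = stack.pop()
        if commit in have and commit not in seen:
            seen.add(commit)
            stack.extend(PARENTS[commit])
    return seen

def commits_to_send(local_tip, client_has, advertised_tips):
    """Commits reachable from local_tip that are not provably on the server."""
    on_server = set()
    for tip in advertised_tips:
        # An advertised tip the client has never seen tells it nothing.
        if tip in client_has:
            on_server |= reachable(tip, client_has)
    return reachable(local_tip, client_has) - on_server

client = {"root", "aaaaa0", "aaaaa1"}

# First diagram: the server's master is aaaaa0, which the client knows.
print(sorted(commits_to_send("aaaaa1", client, ["aaaaa0"])))
# Second diagram: the server's master is aaaaa2, unknown to the client.
print(sorted(commits_to_send("aaaaa1", client, ["aaaaa2"])))
# After a fetch, the client also has aaaaa2 and can exclude its ancestors.
print(sorted(commits_to_send("aaaaa1", client | {"aaaaa2"}, ["aaaaa2"])))
```

In this model only the second call returns the whole history; the first and third return just aaaaa1.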
Read here for the push implementation in the protocol.
PS: the recent git commit graph feature might help with it, but I have not tried it.
When I went to push that data up to origin, git turned that into over 47mb of data..
It looks like your repository contains a lot of binary data.
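git-push - Update remote refs along with associated objects
What are the associated objects? Every commit you make adds objects to the repository. Git does not pack them after every single commit: new objects are first written as loose files, and are later packed (for example by git gc) into files named XX.pack and XX.idx. A push likewise sends its objects as a pack.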
A good read about packing is here
The packed archive format .pack is designed to be self-contained so that it can be unpacked without any further information. Therefore, each object that a delta depends upon must be present within the pack. A pack index file .idx is generated for fast, random access to the objects in the pack. Placing both the index file .idx and the packed archive .pack in the pack subdirectory of $GIT_OBJECT_DIRECTORY (or any of the directories on $GIT_ALTERNATE_OBJECT_DIRECTORIES) enables Git to read from the pack archive.
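If you want to peek into a .pack file yourself, the first 12 bytes are easy to read: the signature PACK, a version number and the number of objects, the last two as 4-byte big-endian integers. Below is a small sketch in Python; parse_pack_header is my own helper name and fake_header is a made-up header, not data from a real repository.

```python
import struct

# A pack file starts with a 12-byte header: the signature b"PACK", a version
# number and the object count, the last two as 4-byte big-endian integers.
def parse_pack_header(data):
    signature, version, count = struct.unpack(">4sII", data[:12])
    if signature != b"PACK":
        raise ValueError("not a pack file")
    return version, count

# Hypothetical header for a version 2 pack holding 3 objects.
fake_header = struct.pack(">4sII", b"PACK", 2, 3)
print(parse_pack_header(fake_header))
```

On a real repository you could pass it the first 12 bytes of one of the files under .git/objects/pack/.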
When git packs your files, it does so in a smart way, so that extracting the data is very fast.
To achieve this, git uses pack heuristics, which basically look for similar content across the objects in your pack and store the shared part only once. For example, if you have the same header (a license agreement, say) in many files, git will "find" it and store it once.
All the files which include this license then contain a pointer to the header content. In this case git doesn't have to store the same content over and over, so the pack size is kept small.
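You can get a feel for the "store it once" effect with plain zlib in Python. To be clear, this is only a stand-in: git's delta compression is a different algorithm, and the header text, file contents and variable names below are invented for the demonstration.

```python
import random
import zlib

# Deterministic filler text standing in for a license header (made up).
rng = random.Random(0)
words = ["permission", "granted", "software", "copyright", "notice",
         "warranty", "liability", "copies", "license", "holder"]
header = " ".join(rng.choice(words) for _ in range(300)).encode()

file_a = header + b"\nprint('module a')\n"
file_b = header + b"\nprint('module b')\n"

# Compressing each file alone stores the header twice.
separate = len(zlib.compress(file_a)) + len(zlib.compress(file_b))
# Compressing them together lets the second header refer back to the first.
together = len(zlib.compress(file_a + file_b))

print(separate, together)
```

The second number comes out smaller because the repeated header costs almost nothing the second time, which is the same idea as the pointer to the shared header.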
This is one of the reasons why storing binary files in git is not recommended: the chance of finding similarity between them is very low, so the pack size will not be optimal.
Git also stores your data in a zlib-compressed format to reduce space, and binary files usually do not compress well either (size-wise).
Here is a sample of a git blob using zlib compression: