I have a repository with branches master and A and lots of merge activity between the two. How can I find the commit in my repository when branch A was created based on mast
Sometimes it is effectively impossible (with some exceptions of where you might be lucky to have additional data) and the solutions here wont work.
Git doesn't preserve ref history (which includes branches). It only stores the current position for each branch (the head). This means you can lose some branch history in git over time. Whenever you branch for example, it's immediately lost which branch was the original one. All a branch does is:
git checkout branch1 # refs/branch1 -> commit1
git checkout -b branch2 # branch2 -> commit1
You might assume that the first commited to is the branch. This tends to be the case but it's not always so. There's nothing stopping you from commiting to either branch first after the above operation. Additionally, git timestamps aren't guaranteed to be reliable. It's not until you commit to both that they truly become branches structurally.
While in diagrams we tend to number commits conceptually, git has no real stable concept of sequence when the commit tree branches. In this case you can assume the numbers (indicating order) are determined by timestamp (it might be fun to see how a git UI handles things when you set all the timestamps to the same).
This is what a human expect conceptually:
After branch:
C1 (B1)
/
-
\
C1 (B2)
After first commit:
C1 (B1)
/
-
\
C1 - C2 (B2)
This is what you actually get:
After branch:
- C1 (B1) (B2)
After first commit (human):
- C1 (B1)
\
C2 (B2)
After first commit (real):
- C1 (B1) - C2 (B2)
You would assume B1 to be the original branch but it could infact simply be a dead branch (someone did checkout -b but never committed to it). It's not until you commit to both that you get a legitimate branch structure within git:
Either:
/ - C2 (B1)
-- C1
\ - C3 (B2)
Or:
/ - C3 (B1)
-- C1
\ - C2 (B2)
You always know that C1 came before C2 and C3 but you never reliably know if C2 came before C3 or C3 came before C2 (because you can set the time on your workstation to anything for example). B1 and B2 is also misleading as you can't know which branch came first. You can make a very good and usually accurate guess at it in many cases. It is a bit like a race track. All things generally being equal with the cars then you can assume that a car that comes in a lap behind started a lap behind. We also have conventions that are very reliable, for example master will nearly always represent the longest lived branches although sadly I have seen cases where even this is not the case.
The example given here is a history preserving example:
Human:
- X - A - B - C - D - F (B1)
\ / \ /
G - H ----- I - J (B2)
Real:
B ----- C - D - F (B1)
/ / \ /
- X - A / \ /
\ / \ /
G - H ----- I - J (B2)
Real here is also misleading because we as humans read it left to right, root to leaf (ref). Git does not do that. Where we do (A->B) in our heads git does (A<-B or B->A). It reads it from ref to root. Refs can be anywhere but tend to be leafs, at least for active branches. A ref points to a commit and commits only contain a like to their parent/s, not to their children. When a commit is a merge commit it will have more than one parent. The first parent is always the original commit that was merged into. The other parents are always commits that were merged into the original commit.
Paths:
F->(D->(C->(B->(A->X)),(H->(G->(A->X))))),(I->(H->(G->(A->X))),(C->(B->(A->X)),(H->(G->(A->X)))))
J->(I->(H->(G->(A->X))),(C->(B->(A->X)),(H->(G->(A->X)))))
This is not a very efficient representation, rather an expression of all the paths git can take from each ref (B1 and B2).
Git's internal storage looks more like this (not that A as a parent appears twice):
F->D,I | D->C | C->B,H | B->A | A->X | J->I | I->H,C | H->G | G->A
If you dump a raw git commit you'll see zero or more parent fields. If there are zero, it means no parent and the commit is a root (you can actually have multiple roots). If there's one, it means there was no merge and it's not a root commit. If there is more than one it means that the commit is the result of a merge and all of the parents after the first are merge commits.
Paths simplified:
F->(D->C),I | J->I | I->H,C | C->(B->A),H | H->(G->A) | A->X
Paths first parents only:
F->(D->(C->(B->(A->X)))) | F->D->C->B->A->X
J->(I->(H->(G->(A->X))) | J->I->H->G->A->X
Or:
F->D->C | J->I | I->H | C->B->A | H->G->A | A->X
Paths first parents only simplified:
F->D->C->B->A | J->I->->G->A | A->X
Topological:
- X - A - B - C - D - F (B1)
\
G - H - I - J (B2)
When both hit A their chain will be the same, before that their chain will be entirely different. The first commit another two commits have in common is the common ancestor and from whence they diverged. there might be some confusion here between the terms commit, branch and ref. You can in fact merge a commit. This is what merge really does. A ref simply points to a commit and a branch is nothing more than a ref in the folder .git/refs/heads, the folder location is what determines that a ref is a branch rather than something else such as a tag.
Where you lose history is that merge will do one of two things depending on circumstances.
Consider:
/ - B (B1)
- A
\ - C (B2)
In this case a merge in either direction will create a new commit with the first parent as the commit pointed to by the current checked out branch and the second parent as the commit at the tip of the branch you merged into your current branch. It has to create a new commit as both branches have changes since their common ancestor that must be combined.
/ - B - D (B1)
- A /
\ --- C (B2)
At this point D (B1) now has both sets of changes from both branches (itself and B2). However the second branch doesn't have the changes from B1. If you merge the changes from B1 into B2 so that they are syncronised then you might expect something that looks like this (you can force git merge to do it like this however with --no-ff):
Expected:
/ - B - D (B1)
- A / \
\ --- C - E (B2)
Reality:
/ - B - D (B1) (B2)
- A /
\ --- C
You will get that even if B1 has additional commits. As long as there aren't changes in B2 that B1 doesn't have, the two branches will be merged. It does a fast forward which is like a rebase (rebases also eat or linearise history), except unlike a rebase as only one branch has a change set it doesn't have to apply a changeset from one branch on top of that from another.
From:
/ - B - D - E (B1)
- A /
\ --- C (B2)
To:
/ - B - D - E (B1) (B2)
- A /
\ --- C
If you cease work on B1 then things are largely fine for preserving history in the long run. Only B1 (which might be master) will advance typically so the location of B2 in B2's history successfully represents the point that it was merged into B1. This is what git expects you to do, to branch B from A, then you can merge A into B as much as you like as changes accumulate, however when merging B back into A, it's not expected that you will work on B and further. If you carry on working on your branch after fast forward merging it back into the branch you were working on then your erasing B's previous history each time. You're really creating a new branch each time after fast forward commit to source then commit to branch. You end up with when you fast forward commit is lots of branches/merges that you can see in the history and structure but without the ability to determine what the name of that branch was or if what looks like two separate branches is really the same branch.
0 1 2 3 4 (B1)
/-\ /-\ /-\ /-\ /
---- - - - -
\-/ \-/ \-/ \-/ \
5 6 7 8 9 (B2)
1 to 3 and 5 to 8 are structural branches that show up if you follow the history for either 4 or 9. There's no way in git to know which of this unnamed and unreferenced structural branches belong to with of the named and references branches as the end of the structure. You might assume from this drawing that 0 to 4 belongs to B1 and 4 to 9 belongs to B2 but apart from 4 and 9 was can't know which branch belongs to which branch, I've simply drawn it in a way that gives the illusion of that. 0 might belong to B2 and 5 might belong to B1. There are 16 different possibilies in this case of which named branch each of the structural branches could belong to. This is assuming that none of these structural branches came from a deleted branch or as a result of merging a branch into itself when pulling from master (the same branch name on two repos is infact two branches, a separate repository is like branching all branches).
There are a number of git strategies that work around this. You can force git merge to never fast forward and always create a merge branch. A horrible way to preserve branch history is with tags and/or branches (tags are really recommended) according to some convention of your choosing. I realy wouldn't recommend a dummy empty commit in the branch you're merging into. A very common convention is to not merge into an integration branch until you want to genuinely close your branch. This is a practice that people should attempt to adhere to as otherwise you're working around the point of having branches. However in the real world the ideal is not always practical meaning doing the right thing is not viable for every situation. If what you're doing on a branch is isolated that can work but otherwise you might be in a situation where when multiple developers are working one something they need to share their changes quickly (ideally you might really want to be working on one branch but not all situations suit that either and generally two people working on a branch is something you want to avoid).