Taking the following example:
>>> df1 = pd.DataFrame({"x":[1, 2, 3, 4, 5],
"y":[3, 4, 5, 6, 7]},
ind
First, the OP misunderstood the rows and columns in the dataframe. The question says:
"But the actual output considers rows that are found in both dataframes (the only common row element 'y')."
The OP thought the label y is a row label. However, y is a column name.
df1 = pd.DataFrame(
{"x":[1, 2, 3, 4, 5], # <-- looks like row x but actually col x
"y":[3, 4, 5, 6, 7]}, # <-- looks like row y but actually col y
index=['a', 'b', 'c', 'd', 'e'])
print(df1)
             \col   x    y
 index or row\
          a         1    3   |   a
          b         2    4   v   x
          c         3    5   r   i
          d         4    6   o   s
          e         5    7   w   0

                -> column
                  a x i s 1
It is very easy to be misled, since in the dictionary it looks like y and x are two rows.
If you generate df1 from a list of lists, it should be more intuitive:
df1 = pd.DataFrame([[1,3],
[2,4],
[3,5],
[4,6],
[5,7]],
index=['a', 'b', 'c', 'd', 'e'], columns=["x", "y"])
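To convince yourself that the two constructions really describe the same table, you can build both and compare them. This is just a quick check; the variable names labels, df_from_dict and df_from_rows are mine, not from the question.

```python
import pandas as pd

labels = ["a", "b", "c", "d", "e"]

# Dict form: each key is a column name, each list holds that column's values.
df_from_dict = pd.DataFrame(
    {"x": [1, 2, 3, 4, 5], "y": [3, 4, 5, 6, 7]},
    index=labels,
)

# List-of-lists form: each inner list is one row.
df_from_rows = pd.DataFrame(
    [[1, 3], [2, 4], [3, 5], [4, 6], [5, 7]],
    index=labels,
    columns=["x", "y"],
)

print(df_from_dict.equals(df_from_rows))  # True
print(list(df_from_dict.columns))         # ['x', 'y']
print(list(df_from_dict.index))           # ['a', 'b', 'c', 'd', 'e']
```

In both cases x and y end up as column labels, and a to e as row labels.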
So, back to the problem: concat is shorthand for concatenate (meaning to link together in a series or chain [source]). Performing concat along axis 0 means linking two objects along axis 0.
   1
   1   <-- series 1
   1
^  ^  ^
|  |  |               1
c  a  a               1
o  l  x               1
n  o  i   gives you   2
c  n  s               2
a  g  0               2
t  |  |
|  V  V
v
   2
   2   <--- series 2
   2
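The picture above can be reproduced with two small Series. A minimal sketch, where s1, s2 and stacked are illustrative names:

```python
import pandas as pd

s1 = pd.Series([1, 1, 1])  # series 1
s2 = pd.Series([2, 2, 2])  # series 2

# Link the two objects along axis 0, i.e. one after the other down the rows.
stacked = pd.concat([s1, s2], axis=0)

print(stacked.tolist())        # [1, 1, 1, 2, 2, 2]
print(stacked.index.tolist())  # [0, 1, 2, 0, 1, 2]
```

Note that the original row labels are kept, so they repeat in the result unless you pass ignore_index=True.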
So... I think you have the feeling now. What about the sum function in pandas? What does sum(axis=0) mean?
Suppose data looks like
1 2
1 2
1 2
Maybe...summing along axis 0, you may guess. Yes!!
^  ^  ^
|  |  |
s  a  a
u  l  x
m  o  i   gives you two values 3 6 !
|  n  s
v  g  0
   |  |
   V  V
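You can check this guess directly. A small sketch using the same three rows of data; the names data and col_totals are only for illustration:

```python
import pandas as pd

data = pd.DataFrame([[1, 2],
                     [1, 2],
                     [1, 2]])

# Sum along axis 0: walk down the rows, producing one total per column.
col_totals = data.sum(axis=0)

print(col_totals.tolist())  # [3, 6]
```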
What about dropna? Suppose you have data
1    2    NaN
NaN  3    5
2    4    6
and you only want to keep
2
3
4
The documentation says: "Return object with labels on given axis omitted where alternately any or all of the data are missing."
Should you use dropna(axis=0) or dropna(axis=1)? Think about it and try it out with
df = pd.DataFrame([[1, 2, np.nan],
[np.nan, 3, 5],
[2, 4, 6]])
# df.dropna(axis=0) or df.dropna(axis=1) ?
Hint: think about the word along.
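If you want to check your answer, one way is to run both calls and look at what survives. This is only a sketch; by_rows and by_cols are names I made up for the two results.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, 2, np.nan],
                   [np.nan, 3, 5],
                   [2, 4, 6]])

# axis=0: labels on axis 0 (row labels) are omitted when the row has a NaN.
by_rows = df.dropna(axis=0)

# axis=1: labels on axis 1 (column labels) are omitted when the column has a NaN.
by_cols = df.dropna(axis=1)

print(by_rows.shape)        # (1, 3)
print(by_cols.shape)        # (3, 1)
print(by_cols[1].tolist())  # [2, 3, 4]
```

Only the middle column has no missing values, so it is the one left when column labels are dropped.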
This is my trick with axis: just add the operation in your mind to make it sound clear:
If you “sum” through axis=0, you are summing all rows, and the output will be a single row with the same number of columns. If you “sum” through axis=1, you are summing all columns, and the output will be a single column with the same number of rows.
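Here is a small sketch of that trick on a frame with 2 rows and 3 columns, so the two results have visibly different lengths. The frame and the names down and across are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame([[1, 2, 3],
                   [4, 5, 6]],
                  index=["r0", "r1"],
                  columns=["a", "b", "c"])

down = df.sum(axis=0)    # collapses the rows: one value per column
across = df.sum(axis=1)  # collapses the columns: one value per row

print(down.tolist(), list(down.index))      # [5, 7, 9] ['a', 'b', 'c']
print(across.tolist(), list(across.index))  # [6, 15] ['r0', 'r1']
```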
Data:
In [55]: df1
Out[55]:
   x  y
a  1  3
b  2  4
c  3  5
d  4  6
e  5  7

In [56]: df2
Out[56]:
   y  z
b  1  9
c  3  8
d  5  7
e  7  6
f  9  5
Concatenated horizontally (axis=1), using index elements found in both DFs (aligned by indexes for joining):
In [57]: pd.concat([df1, df2], join='inner', axis=1)
Out[57]:
   x  y  y  z
b  2  4  1  9
c  3  5  3  8
d  4  6  5  7
e  5  7  7  6
Concatenated vertically (DEFAULT: axis=0), using columns found in both DFs:
In [58]: pd.concat([df1, df2], join='inner')
Out[58]:
   y
a  3
b  4
c  5
d  6
e  7
b  1
c  3
d  5
e  7
f  9
If you don't use the inner join method, you will have it this way:
In [62]: pd.concat([df1, df2])
Out[62]:
     x  y    z
a  1.0  3  NaN
b  2.0  4  NaN
c  3.0  5  NaN
d  4.0  6  NaN
e  5.0  7  NaN
b  NaN  1  9.0
c  NaN  3  8.0
d  NaN  5  7.0
e  NaN  7  6.0
f  NaN  9  5.0

In [63]: pd.concat([df1, df2], axis=1)
Out[63]:
     x    y    y    z
a  1.0  3.0  NaN  NaN
b  2.0  4.0  1.0  9.0
c  3.0  5.0  3.0  8.0
d  4.0  6.0  5.0  7.0
e  5.0  7.0  7.0  6.0
f  NaN  NaN  9.0  5.0
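To reproduce the outputs above in a script, the two frames can be rebuilt from the values shown. The constructor calls are my reconstruction from the printed tables, and the result names (inner_cols and so on) are only for illustration.

```python
import pandas as pd

# Rebuilt from the printed df1 and df2 above.
df1 = pd.DataFrame({"x": [1, 2, 3, 4, 5], "y": [3, 4, 5, 6, 7]},
                   index=list("abcde"))
df2 = pd.DataFrame({"y": [1, 3, 5, 7, 9], "z": [9, 8, 7, 6, 5]},
                   index=list("bcdef"))

inner_cols = pd.concat([df1, df2], join="inner", axis=1)  # keeps common row labels
inner_rows = pd.concat([df1, df2], join="inner")          # keeps common column labels
outer_rows = pd.concat([df1, df2])                        # default join is 'outer'
outer_cols = pd.concat([df1, df2], axis=1)

for name, frame in [("inner, axis=1", inner_cols),
                    ("inner, axis=0", inner_rows),
                    ("outer, axis=0", outer_rows),
                    ("outer, axis=1", outer_cols)]:
    print(name, frame.shape, list(frame.columns))
```

The shapes make the difference easy to see: the inner join shrinks the axis that is not being concatenated, while the outer join fills the gaps with NaN.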
Interpret axis=0 as applying the algorithm down each column, or to the row labels (the index). A more detailed schema here.
If you apply that general interpretation to your case, the algorithm here is concat. Thus, for axis=0, it means:
for each column, take all the rows down (across all the dataframes passed to concat), and concatenate them when the column is common to all of them (because you selected join='inner').
So the meaning would be to take each column x and concatenate it down the rows, stacking each chunk of rows one after another. However, x is not present everywhere, so it is not kept in the final result. The same applies to z. For y, the result is kept, since y is in all dataframes. This is the result you have.
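The same reasoning can be spelled out by hand: find the columns present in every frame, keep only those, then stack the rows. This is only a sketch of the interpretation, not how pandas implements concat internally; common, manual and builtin are illustrative names, and df1 and df2 are rebuilt from the values shown earlier.

```python
import pandas as pd

df1 = pd.DataFrame({"x": [1, 2, 3, 4, 5], "y": [3, 4, 5, 6, 7]},
                   index=list("abcde"))
df2 = pd.DataFrame({"y": [1, 3, 5, 7, 9], "z": [9, 8, 7, 6, 5]},
                   index=list("bcdef"))

# Columns that are present in all dataframes: only 'y' here.
common = df1.columns.intersection(df2.columns)

# Keep just those columns, then stack the chunks of rows one after another.
manual = pd.concat([df1[common], df2[common]], axis=0)

# What join='inner' does along axis=0.
builtin = pd.concat([df1, df2], join="inner", axis=0)

print(list(common))            # ['y']
print(manual.equals(builtin))  # True
```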
If someone needs a visual description, here is the image: