Pandas: Subtract row mean from each element in row

面向向阳花 2020-12-29 07:39

I have a dataframe with rows indexed by chemical element type and columns representing different samples. The values are floats representing the degree of presence of the ro

2 Answers
  •  囚心锁ツ
    2020-12-29 08:19

    In addition to @ajcr's excellent answer, you might want to consider rearranging how you store your data.

    The way you're doing it at the moment, with different samples in different columns, is the way it would be represented if you were using a spreadsheet, but this might not be the most helpful way to represent your data.

    Normally, each column represents a unique piece of information about a single real-world entity. The typical example of this kind of data is a person:

    id  name  hair_colour  age
    1   Bob   Brown        25
    

    Really, your different samples are different real-world entities.

    I would therefore suggest having a two-level index to describe each single piece of information. This makes manipulating your data in the way you want far more convenient.

    Thus:

    >>> import pandas as pd
    >>> df = pd.DataFrame([['Sn', 1, 2, 3], ['Pb', 2, 4, 6]],
    ...                   columns=['element', 'A', 'B', 'C']).set_index('element')
    >>> df.columns.name = 'sample'
    >>> df # This is how your DataFrame looks at the moment
    sample   A  B  C
    element         
    Sn       1  2  3
    Pb       2  4  6
    >>> # Now make those columns into a second level of index
    >>> df = df.stack()
    >>> df
    element  sample
    Sn       A         1
             B         2
             C         3
    Pb       A         2
             B         4
             C         6
    

    We now have all the delicious functionality of groupby at our disposal:

    >>> demean = lambda x: x - x.mean()
    >>> df.groupby(level='element').transform(demean)
    element  sample
    Sn       A        -1
             B         0
             C         1
    Pb       A        -2
             B         0
             C         2
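
    If you need the result back in the original one-column-per-sample layout, the stacked Series can be unstacked again. The sketch below uses the same toy data; the names `wide`, `stacked`, `demeaned_wide` and `direct` are only illustrative. As a sanity check it also computes the row-demeaned frame directly in the wide layout with `DataFrame.sub`, which is one common way to do this and not necessarily the approach in the other answer.

```python
import pandas as pd

# Same toy data as above, in the original wide layout.
wide = pd.DataFrame(
    {"A": [1, 2], "B": [2, 4], "C": [3, 6]},
    index=pd.Index(["Sn", "Pb"], name="element"),
)
wide.columns.name = "sample"

# Long layout: subtract each element's mean via groupby/transform.
stacked = wide.stack()
demeaned_long = stacked.groupby(level="element").transform(lambda x: x - x.mean())

# Pivot the 'sample' level back into columns. unstack may reorder the rows,
# so reindex to restore the original element order.
demeaned_wide = demeaned_long.unstack("sample").reindex(wide.index)

# Direct wide-layout computation: subtract each row's mean from that row.
direct = wide.sub(wide.mean(axis=1), axis=0)

# Both routes should agree element by element.
print((demeaned_wide == direct).all().all())
```

    The long layout costs an extra stack/unstack round trip here, but the groupby step stays the same no matter how many samples you add.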
    

    When you view your data in this way, you'll find that many, many use cases which used to be multi-column DataFrames are in fact MultiIndexed Series, and you have much more power over how the data is represented and transformed.
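
    As a rough illustration of that flexibility, here is a minimal sketch on the same toy values, built directly as a MultiIndexed Series. The names `s`, `sample_means` and `scaled` are made up for this example. Grouping by the other index level, or applying a different per-element transformation, needs no reshaping.

```python
import pandas as pd

# The toy data from above, constructed directly in the long layout.
s = pd.Series(
    [1, 2, 3, 2, 4, 6],
    index=pd.MultiIndex.from_product(
        [["Sn", "Pb"], ["A", "B", "C"]], names=["element", "sample"]
    ),
)

# Group by the other index level: mean across elements for each sample.
sample_means = s.groupby(level="sample").mean()

# Scale each value by the maximum for its element.
scaled = s / s.groupby(level="element").transform("max")

print(sample_means)
print(scaled)
```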
