Converting Index into MultiIndex (hierarchical index) in Pandas

广开言路 2021-02-04 08:49

In the data I am working with, the index is compound, i.e. it has both an item name and a timestamp, e.g. name@domain.com|2013-05-07 05:52:51 +0200.

I want to convert this compound index into a MultiIndex (hierarchical index).

3 Answers
  • 2021-02-04 08:52

    Once we have a DataFrame

    import pandas as pd
    df = pd.read_csv("input.csv", index_col=0)  # or from another source
    

    and a function mapping each index label to a tuple (the one below handles the example from this question)

    def process_index(k):
        return tuple(k.split("|"))
    

    we can create a hierarchical index in the following way:

    df.index = pd.MultiIndex.from_tuples([process_index(k) for k in df.index])
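
To try the from_tuples approach without a CSV file, here is a minimal sketch. The in-memory DataFrame, the "value" column and the level names are assumptions made up for illustration; they are not part of the original answer.

```python
import pandas as pd

# Hypothetical stand-in for input.csv: a compound "name|timestamp" index.
df = pd.DataFrame(
    {"value": [42, 43, 44]},
    index=[
        "name@domain.com|2013-05-07 05:52:51 +0200",
        "name@domain.com|2013-05-08 06:10:00 +0200",
        "other@domain.com|2013-05-07 07:00:00 +0200",
    ],
)


def process_index(k):
    # Split "name|timestamp" into a (name, timestamp) tuple.
    return tuple(k.split("|"))


# Build the two-level index; the level names are illustrative choices.
df.index = pd.MultiIndex.from_tuples(
    [process_index(k) for k in df.index], names=["e-mail", "date"]
)

print(df.index.nlevels)
```

Note that both levels still hold plain strings after the split; the timestamps are not parsed.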
    

    An alternative approach is to create two columns then set them as the index (the original index will be dropped):

    df['e-mail'] = [x.split("|")[0] for x in df.index] 
    df['date'] = [x.split("|")[1] for x in df.index]
    df = df.set_index(['e-mail', 'date'])
    

    or, even shorter:

    df['e-mail'], df['date'] = zip(*map(process_index, df.index))
    df = df.set_index(['e-mail', 'date'])
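
As a rough sketch of what the column-based variant gives you, the example below builds the index and then selects all rows for one address. The sample data and the "value" column are hypothetical, chosen only to make the snippet self-contained.

```python
import pandas as pd

# Hypothetical data with a compound "name|timestamp" index.
df = pd.DataFrame(
    {"value": [42, 43, 44]},
    index=[
        "name@domain.com|2013-05-07 05:52:51 +0200",
        "name@domain.com|2013-05-08 06:10:00 +0200",
        "other@domain.com|2013-05-07 07:00:00 +0200",
    ],
)

# Split each compound key once, then unzip into two sequences.
emails, dates = zip(*(k.split("|") for k in df.index))
df["e-mail"] = list(emails)
df["date"] = list(dates)

# set_index replaces the original compound index with the two columns.
df = df.set_index(["e-mail", "date"])

# Select every row for one address via the first index level.
subset = df.loc["name@domain.com"]
print(subset)
```

Selecting on the first level with .loc returns a frame indexed by the remaining level, here "date".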
    
  • 2021-02-04 08:52

    In pandas>=0.16.0, we can use the .str accessor on indices. This makes the following possible:

    df.index = pd.MultiIndex.from_tuples(df.index.str.split('|').tolist())
    

    (Note: I tried the more intuitive: pd.MultiIndex.from_arrays(df.index.str.split('|')) but for some reason that gives me errors.)
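
A possible alternative, assuming a reasonably recent pandas: passing expand=True to the split makes the .str accessor return a MultiIndex directly, with no intermediate list of tuples. The from_arrays attempt probably fails because from_arrays expects one array per level, whereas the split yields one list per row. The sample frame below is hypothetical.

```python
import pandas as pd

# Hypothetical data with a compound "name|timestamp" index.
df = pd.DataFrame(
    {"value": [42, 43]},
    index=[
        "name@domain.com|2013-05-07 05:52:51 +0200",
        "other@domain.com|2013-05-08 06:10:00 +0200",
    ],
)

# With expand=True, Index.str.split returns a MultiIndex
# (one level per piece) instead of an Index of lists.
df.index = df.index.str.split("|", expand=True)

print(df.index.nlevels)
```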

  • 2021-02-04 09:08

    My preference would be to initially read this in as a column (i.e. not as an index), then you can use the str split method:

    from io import StringIO
    import pandas as pd

    csv = '\n'.join(['name@domain.com|2013-05-07 05:52:51 +0200, 42'] * 3)
    df = pd.read_csv(StringIO(csv), header=None)
    
    In [13]: df[0].str.split('|')
    Out[13]:
    0    [name@domain.com, 2013-05-07 05:52:51 +0200]
    1    [name@domain.com, 2013-05-07 05:52:51 +0200]
    2    [name@domain.com, 2013-05-07 05:52:51 +0200]
    Name: 0, dtype: object
    

    And then feed this into a MultiIndex (perhaps this can be done more cleanly?):

    m = pd.MultiIndex.from_arrays(list(zip(*df[0].str.split('|'))))
    

    Delete the 0th column and set the index to the new MultiIndex:

    del df[0]
    df.index = m
    
    In [17]: df
    Out[17]:
                                                1
    name@domain.com 2013-05-07 05:52:51 +0200  42
                    2013-05-07 05:52:51 +0200  42
                    2013-05-07 05:52:51 +0200  42
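
One way this might be done more cleanly, offered as a sketch rather than the answerer's method: split the column with expand=True to get one column per piece, then build the index with MultiIndex.from_frame (available in newer pandas versions). The level names are assumptions for illustration.

```python
from io import StringIO

import pandas as pd

csv = "\n".join(["name@domain.com|2013-05-07 05:52:51 +0200, 42"] * 3)
df = pd.read_csv(StringIO(csv), header=None)

# expand=True yields one column per piece instead of a column of lists.
parts = df[0].str.split("|", expand=True)

# Level names are illustrative, not taken from the original answer.
m = pd.MultiIndex.from_frame(parts, names=["e-mail", "date"])

# Drop the compound column and attach the new index.
df = df.drop(columns=0)
df.index = m

print(df)
```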
    