Pandas: Modify a particular level of Multiindex

旧巷少年郎 2020-12-02 14:40

I have a dataframe with Multiindex and would like to modify one particular level of the Multiindex. For instance, the first level might be strings and I may want to remove t

3 answers
  • 2020-12-02 14:47

    As mentioned in the comments, indexes are immutable and must be remade when modifying, but you do not have to use reset_index for that, you can create a new multi-index directly:

    df.index = pd.MultiIndex.from_tuples([(x[0], x[1].replace(' ', ''), x[2]) for x in df.index])
    

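    This example is for a 3-level index where you want to modify the middle level. For an index with a different number of levels, change the size of the tuple accordingly. Note that from_tuples does not carry over the level names unless you pass names=df.index.names.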
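The tuple-based rebuild can be written so that it does not depend on the number of levels. The following is a minimal sketch, not part of the original answer: the helper name `rebuild_with_level` and the sample index are made up for illustration. It collects the row-wise values of every level, maps one of them, and reassembles the index with `pd.MultiIndex.from_arrays`.

```python
import pandas as pd


def rebuild_with_level(index, level, func):
    """Return a new MultiIndex with func applied to every value of one level.

    Hypothetical helper for illustration; `level` is an integer position.
    """
    # One array of row-wise values per level, however many levels there are.
    arrays = [index.get_level_values(i) for i in range(index.nlevels)]
    arrays[level] = arrays[level].map(func)
    # Passing names explicitly keeps the original level names.
    return pd.MultiIndex.from_arrays(arrays, names=index.names)


# Sample 3-level index (made-up data); the middle level contains spaces.
index = pd.MultiIndex.from_tuples(
    [("a", "x y", 1), ("b", "p q", 2)],
    names=["first", "second", "third"],
)
new_index = rebuild_with_level(index, level=1, func=lambda s: s.replace(" ", ""))
print(new_index)
```

Like the list comprehension, this touches every row, so it is likely slower than editing the level labels directly, but it does not require the mapped values to stay unique.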

    Update

    John's improvement is great performance-wise, but as mentioned in the comments it causes an error. So here's the corrected implementation with small improvements:

    # set_levels returns a new index, so assign it back.
    # (The `inplace` argument was deprecated and then removed in pandas 2.0.)
    df.index = df.index.set_levels(
        df.index.levels[0].str.replace(' ', ''),
        level=0,
    )
    

    If you'd like to use level names instead of numbers, you'll need another small variation:

    # Look up the position of the named level, then assign the new index back.
    df.index = df.index.set_levels(
        df.index.levels[df.index.names.index('level_name')].str.replace(' ', ''),
        level='level_name',
    )
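As a small worked example of the name-based variant, here is a sketch on a made-up two-level frame. The level names `outer` and `inner` and the data are assumptions for illustration only.

```python
import pandas as pd

# Made-up frame: the "outer" level has labels containing spaces.
df = pd.DataFrame(
    {"value": [1, 2, 3]},
    index=pd.MultiIndex.from_tuples(
        [("a b", "x"), ("a b", "y"), ("c d", "x")],
        names=["outer", "inner"],
    ),
)

# Find the position of the named level, clean its unique labels,
# and assign the resulting index back to the frame.
position = df.index.names.index("outer")
stripped = df.index.levels[position].str.replace(" ", "")
df.index = df.index.set_levels(stripped, level="outer")
print(df)
```

Only the two unique labels of `outer` are rewritten; the rows keep their existing assignment to those labels.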
    
  • 2020-12-02 14:52

    The other answers work fine. Depending on the structure of the multi-index, it can be considerably faster to apply a map directly on the levels instead of constructing a new multi-index.

    I use the following function to modify a particular index level. It also works on single-level indices.

    import pandas as pd

    def map_index_level(index, mapper, level=0):
        """
        Returns a new Index or MultiIndex, with the level values being mapped.
        """
        assert(isinstance(index, pd.Index))
        if isinstance(index, pd.MultiIndex):
            new_level = index.levels[level].map(mapper)
            new_index = index.set_levels(new_level, level=level)
        else:
            # Single level index.
            assert(level==0)
            new_index = index.map(mapper)
        return new_index
    

    Usage:

    df = pd.DataFrame([[1,2],[3,4]])
    df.index = pd.MultiIndex.from_product([["a"],["i","ii"]])
    df.columns = ["x","y"]
    
    df.index = map_index_level(index=df.index, mapper=str.upper, level=1)
    df.columns = map_index_level(index=df.columns, mapper={"x":"foo", "y":"bar"})
    
    # Result:
    #       foo  bar
    # a I     1    2
    #   II    3    4
    

    Note: The above works only if mapper preserves the uniqueness of level values! In the above example, mapper = {"i": "new", "ii": "new"} will fail in set_levels() with a ValueError: Level values must be unique. One could disable the integrity check by modifying the above code to:

    new_index = index.set_levels(new_level, level=level,
                                 verify_integrity=False)
    

    But it is better not to! See the docs of set_levels.
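To make the uniqueness caveat concrete, the sketch below reproduces the failure with the collapsing mapper from the note and then shows one possible alternative that keeps the integrity check on: mapping the row-wise values and rebuilding the index. The variable names are made up, and the rebuild is a swapped-in technique rather than part of the function above.

```python
import pandas as pd

index = pd.MultiIndex.from_product([["a"], ["i", "ii"]])
collapsing = {"i": "new", "ii": "new"}  # maps two labels onto one

# Mapping the unique labels yields duplicates, which set_levels rejects.
try:
    index.set_levels(index.levels[1].map(collapsing), level=1)
    raised = False
except ValueError as err:
    raised = True
    print("set_levels failed:", err)

# Alternative: map the per-row values and build a fresh index from them.
arrays = [index.get_level_values(i) for i in range(index.nlevels)]
arrays[1] = arrays[1].map(collapsing)
rebuilt = pd.MultiIndex.from_arrays(arrays, names=index.names)
print(rebuilt)
```

The rebuilt index is valid but contains repeated entries, so later lookups on it may return several rows.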

  • 2020-12-02 15:04

    Thanks to @cxrodgers's comment, I think the fastest way to do this is:

    df.index = df.index.set_levels(df.index.levels[0].str.replace(' ', ''), level=0)
    

    Old, longer answer:

    I found that the list comprehension suggested by @Shovalt works, but it felt slow on my machine (using a dataframe with >10,000 rows).

    Instead, I was able to use the .set_levels method, which was quite a bit faster for me.

    %timeit pd.MultiIndex.from_tuples([(x[0].replace(' ',''), x[1]) for x in df.index])
    1 loop, best of 3: 394 ms per loop
    
    %timeit df.index.set_levels(df.index.get_level_values(0).str.replace(' ',''), level=0)
    10 loops, best of 3: 134 ms per loop
    

    In actuality, I just needed to prepend some text. This was even faster with .set_levels:

    %timeit pd.MultiIndex.from_tuples([('00'+x[0], x[1]) for x in df.index])
    100 loops, best of 3: 5.18 ms per loop
    
    %timeit df.index.set_levels('00'+df.index.get_level_values(0), level=0)
    1000 loops, best of 3: 1.38 ms per loop
    
    %timeit df.index.set_levels('00'+df.index.levels[0], level=0)
    1000 loops, best of 3: 331 µs per loop
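A likely reason for the gap between the last two timings is the amount of data each expression touches: `levels[0]` holds only the unique labels, while `get_level_values(0)` returns one value per row. The sketch below uses a made-up index to show the difference in size; it is an illustration, not a benchmark.

```python
import pandas as pd

# Made-up index: 2 unique labels on level 0, repeated over 1000 rows each.
index = pd.MultiIndex.from_product([["a b", "c d"], range(1000)])

unique_labels = index.levels[0]          # one entry per distinct label
row_values = index.get_level_values(0)   # one entry per row
print(len(unique_labels), len(row_values))

# set_levels expects the unique labels, so only two strings are modified here.
new_index = index.set_levels("00" + unique_labels, level=0)
print(new_index[0])
```

Because `set_levels` replaces the label table rather than the per-row values, passing `levels[0]` is also the form that matches what the method expects.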
    

    This solution is based on the answer in the link from the comment by @denfromufa ...

    python - Multiindex and timezone - Frozen list error - Stack Overflow
