Convert pandas dataframe to NumPy array

别那么骄傲 2020-11-21 23:57

I am interested in knowing how to convert a pandas dataframe into a NumPy array.

dataframe:

import numpy as np
import pandas as pd

index = [1, 2, 3,         


        
15 answers
  • 2020-11-22 00:40

    df.to_numpy() is better than df.values; here's why.

    It's time to deprecate your usage of values and as_matrix().

    pandas v0.24.0 introduced two new methods for obtaining NumPy arrays from pandas objects:

    1. to_numpy(), which is defined on Index, Series, and DataFrame objects, and
    2. array, which is defined on Index and Series objects only.
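
    As a minimal sketch of the difference between the two (the variable names are illustrative and not taken from the release notes): to_numpy() hands back a NumPy ndarray, while array returns the pandas array object that backs a Series or Index.

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 2, 3], name="A")

as_ndarray = s.to_numpy()   # plain NumPy array
as_pandas_array = s.array   # pandas array object wrapping the same data

print(type(as_ndarray).__name__)
print(isinstance(as_pandas_array, pd.api.extensions.ExtensionArray))
```

    Reach for to_numpy() when the next step expects NumPy, and for array when you want to stay inside pandas' own array types.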

    If you visit the v0.24 docs for .values, you will see a big red warning that says:

    Warning: We recommend using DataFrame.to_numpy() instead.

    See this section of the v0.24.0 release notes, and this answer for more information.



    Towards Better Consistency: to_numpy()

    In the spirit of better consistency throughout the API, a new method to_numpy has been introduced to extract the underlying NumPy array from DataFrames.

    # Setup
    df = pd.DataFrame(data={'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9]}, 
                      index=['a', 'b', 'c'])
    
    # Convert the entire DataFrame
    df.to_numpy()
    # array([[1, 4, 7],
    #        [2, 5, 8],
    #        [3, 6, 9]])
    
    # Convert specific columns
    df[['A', 'C']].to_numpy()
    # array([[1, 7],
    #        [2, 8],
    #        [3, 9]])
    

    As mentioned above, this method is also defined on Index and Series objects (see here).

    df.index.to_numpy()
    # array(['a', 'b', 'c'], dtype=object)
    
    df['A'].to_numpy()
    #  array([1, 2, 3])
    

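    By default (copy=False), no copy is forced, so for a DataFrame with a single dtype the result can be a view, and any modifications made will affect the original. (Depending on the pandas version and its copy-on-write settings, that view may be read-only, in which case the assignment below raises an error instead.)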

    v = df.to_numpy()
    v[0, 0] = -1
     
    df
       A  B  C
    a -1  4  7
    b  2  5  8
    c  3  6  9
    

    If you need a copy instead, use to_numpy(copy=True).
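
    Whether the default call can return a view depends on the data. The sketch below (with made-up frames) shows two cases that do not depend on view semantics: an explicit copy is always independent of the DataFrame, and a mixed-dtype frame has to be combined into a new array on every call.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})

# An explicit copy can be modified without touching the DataFrame.
independent = df.to_numpy(copy=True)
independent[0, 0] = -1
print(df.loc[0, "A"])

# Mixed dtypes are upcast to a common dtype, so each call builds a new array.
mixed = pd.DataFrame({"A": [1, 2, 3], "B": [0.5, 1.5, 2.5]})
first = mixed.to_numpy()
second = mixed.to_numpy()
print(first.dtype)
print(np.shares_memory(first, second))
```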


    pandas >= 1.0 update for ExtensionTypes

    If you're using pandas 1.x, chances are you'll be dealing with extension types a lot more. You'll have to be a little more careful that these extension types are correctly converted.

    a = pd.array([1, 2, None], dtype="Int64")                                  
    a                                                                          
    
    <IntegerArray>
    [1, 2, <NA>]
    Length: 3, dtype: Int64 
    
    # Wrong
    a.to_numpy()                                                               
    # array([1, 2, <NA>], dtype=object)  # yuck, objects
    
    # Correct
    a.to_numpy(dtype='float', na_value=np.nan)                                 
    # array([ 1.,  2., nan])
    
    # Also correct
    a.to_numpy(dtype='int', na_value=-1)
    # array([ 1,  2, -1])
    

    This is called out in the docs.
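
    The same idea applies to a Series that holds a nullable extension dtype. A minimal sketch, assuming pandas >= 1.0 (the Series here is invented for illustration):

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 2, None], dtype="Int64")

# Ask for a concrete NumPy dtype and say what missing values should become.
as_float = s.to_numpy(dtype="float64", na_value=np.nan)

print(as_float.dtype)
print(np.isnan(as_float))
```

    Choosing the dtype and na_value explicitly avoids the object-dtype result shown above.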


    If you need the dtypes in the result...

    As shown in another answer, DataFrame.to_records is a good way to do this.

    df.to_records()
    # rec.array([('a', 1, 4, 7), ('b', 2, 5, 8), ('c', 3, 6, 9)],
    #           dtype=[('index', 'O'), ('A', '<i8'), ('B', '<i8'), ('C', '<i8')])
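
    If the index is not wanted in the result, to_records accepts index=False. A short sketch using the same example frame, which also shows that fields can be read by name:

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6], "C": [7, 8, 9]},
                  index=["a", "b", "c"])

rec = df.to_records(index=False)  # leave the index out of the record array

print(rec.dtype.names)
print(rec["A"])  # field access by key
print(rec.B)     # record arrays also expose fields as attributes
```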
    

    This cannot be done with to_numpy, unfortunately. However, as an alternative, you can use np.rec.fromrecords:

    v = df.reset_index()
    np.rec.fromrecords(v, names=v.columns.tolist())
    # rec.array([('a', 1, 4, 7), ('b', 2, 5, 8), ('c', 3, 6, 9)],
    #           dtype=[('index', '<U1'), ('A', '<i8'), ('B', '<i8'), ('C', '<i8')])
    

    Performance-wise, it's nearly the same (actually, using rec.fromrecords is a bit faster).

    df2 = pd.concat([df] * 10000)
    
    %timeit df2.to_records()
    # 12.9 ms ± 511 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
    
    %%timeit
    v = df2.reset_index()
    np.rec.fromrecords(v, names=v.columns.tolist())
    # 9.56 ms ± 291 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
    


    Rationale for Adding a New Method

    to_numpy() (in addition to array) was added as a result of discussions under two GitHub issues: GH19954 and GH23623.

    Specifically, the docs mention the rationale:

    [...] with .values it was unclear whether the returned value would be the actual array, some transformation of it, or one of pandas custom arrays (like Categorical). For example, with PeriodIndex, .values generates a new ndarray of period objects each time. [...]

    to_numpy aims to improve the consistency of the API, which is a major step in the right direction. .values will not be deprecated in the current version, but I expect this may happen at some point in the future, so I would urge users to migrate to the newer API as soon as they can.
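
    The ambiguity described in the quote is easy to reproduce. In this small sketch (the Series is made up for illustration), .values on a categorical Series returns a pandas Categorical, whereas to_numpy() returns an ndarray:

```python
import numpy as np
import pandas as pd

s = pd.Series(["x", "y", "x"], dtype="category")

print(type(s.values).__name__)      # Categorical
print(type(s.to_numpy()).__name__)  # ndarray
```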



    Critique of Other Solutions

    DataFrame.values has inconsistent behaviour, as already noted.

    DataFrame.get_values() is simply a wrapper around DataFrame.values, so everything said above applies.

    DataFrame.as_matrix() is deprecated now (and was removed in pandas 1.0); do NOT use it!

  • 2020-11-22 00:45

    I went through the answers above. The "as_matrix()" method works but it's obsolete now. For me, what worked was ".to_numpy()".

    This returns a multidimensional array. I prefer this method when reading data from an Excel sheet and needing to access data at any index. Hope this helps :)

  • 2020-11-22 00:46

    Here is my approach to making a structured array from a pandas DataFrame.

    Create the data frame

    import pandas as pd
    import numpy as np
    import six
    
    NaN = float('nan')
    ID = [1, 2, 3, 4, 5, 6, 7]
    A = [NaN, NaN, NaN, 0.1, 0.1, 0.1, 0.1]
    B = [0.2, NaN, 0.2, 0.2, 0.2, NaN, NaN]
    C = [NaN, 0.5, 0.5, NaN, 0.5, 0.5, NaN]
    columns = {'A':A, 'B':B, 'C':C}
    df = pd.DataFrame(columns, index=ID)
    df.index.name = 'ID'
    print(df)
    
          A    B    C
    ID               
    1   NaN  0.2  NaN
    2   NaN  NaN  0.5
    3   NaN  0.2  0.5
    4   0.1  0.2  NaN
    5   0.1  0.2  0.5
    6   0.1  NaN  0.5
    7   0.1  NaN  NaN
    

    Define a function to make a NumPy structured array (not a record array) from a pandas DataFrame.

    def df_to_sarray(df):
        """
        Convert a pandas DataFrame object to a numpy structured array.
        This is functionally equivalent to but more efficient than
        np.array(df.to_array())
    
        :param df: the data frame to convert
        :return: a numpy structured array representation of df
        """
    
        v = df.values
        cols = df.columns
    
        if six.PY2:  # python 2 needs .encode() but 3 does not
            types = [(cols[i].encode(), df[k].dtype.type) for (i, k) in enumerate(cols)]
        else:
            types = [(cols[i], df[k].dtype.type) for (i, k) in enumerate(cols)]
        dtype = np.dtype(types)
        z = np.zeros(v.shape[0], dtype)
        for (i, k) in enumerate(z.dtype.names):
            z[k] = v[:, i]
        return z
    

    Use reset_index to make a new data frame that includes the index as part of its data. Convert that data frame to a structured array.

    sa = df_to_sarray(df.reset_index())
    sa
    
    array([(1L, nan, 0.2, nan), (2L, nan, nan, 0.5), (3L, nan, 0.2, 0.5),
           (4L, 0.1, 0.2, nan), (5L, 0.1, 0.2, 0.5), (6L, 0.1, nan, 0.5),
           (7L, 0.1, nan, nan)], 
          dtype=[('ID', '<i8'), ('A', '<f8'), ('B', '<f8'), ('C', '<f8')])
    

    EDIT: Updated df_to_sarray to avoid error calling .encode() with python 3. Thanks to Joseph Garvin and halcyon for their comment and solution.
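
    On Python 3 only, the six check is no longer needed. The following is a sketch of the same column-by-column idea, not the author's function: df_to_structured is a hypothetical helper name, and it assumes every column has a plain NumPy dtype (no pandas extension dtypes).

```python
import numpy as np
import pandas as pd

def df_to_structured(df):
    """Hypothetical helper: build a plain structured ndarray column by column."""
    names = [str(col) for col in df.columns]  # field names must be str
    dtype = np.dtype([(name, df[col].dtype) for name, col in zip(names, df.columns)])
    out = np.empty(len(df), dtype=dtype)
    for name, col in zip(names, df.columns):
        out[name] = df[col].to_numpy()  # copy each column into its field
    return out

nan = float("nan")
df = pd.DataFrame({"A": [nan, 0.1], "B": [0.2, nan]},
                  index=pd.Index([1, 2], name="ID"))

sa = df_to_structured(df.reset_index())
print(sa.dtype.names)
print(sa["ID"])
```

    Filling one field at a time keeps each column's own dtype, which going through df.values would lose for mixed-dtype frames.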

  • 2020-11-22 00:46

    Further to meteore's answer, I found the code

    df.index = df.index.astype('i8')
    

    doesn't work for me. So I put my code here for the convenience of others stuck with this issue.

    city_cluster_df = pd.read_csv(text_filepath, encoding='utf-8')
    # The field 'city_en' holds strings, so it becomes an object field in the record array.
    city_cluster_arr = city_cluster_df[['city_en', 'lat', 'lon', 'cluster', 'cluster_filtered']].to_records()
    descr = city_cluster_arr.dtype.descr
    # Change 'city_en' to a fixed-width string type. Its position is 1 because
    # field 0 is the DataFrame's row index.
    descr[1] = (descr[1][0], "S20")
    newArr = city_cluster_arr.astype(np.dtype(descr))
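
    In pandas 0.24 and later, to_records also takes a column_dtypes argument, which may avoid rewriting the dtype description by hand. A minimal sketch with invented sample data (the column names mirror the ones above; "U20" is an assumed width for a 20-character unicode field):

```python
import pandas as pd

cities = pd.DataFrame({
    "city_en": ["Paris", "Tokyo"],
    "lat": [48.86, 35.68],
    "lon": [2.35, 139.69],
})

# Request a fixed-width unicode field instead of an object field.
rec = cities.to_records(index=False, column_dtypes={"city_en": "U20"})

print(rec.dtype["city_en"])
print(rec["city_en"])
```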
    
  • 2020-11-22 00:50

    Try this:

    import numpy as np
    a = np.asarray(df)
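
    For a quick check of what this gives, here is a small sketch on an example frame invented for illustration. np.asarray also accepts a dtype if a specific element type is needed:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})

a = np.asarray(df)               # same values as df.to_numpy()
b = np.asarray(df, dtype=float)  # request a float array explicitly

print(a.shape)
print(np.array_equal(a, df.to_numpy()))
print(b.dtype)
```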
    
  • 2020-11-22 00:50

    I just had a similar problem when exporting from a DataFrame to an ArcGIS table and stumbled on a solution from USGS (https://my.usgs.gov/confluence/display/cdi/pandas.DataFrame+to+ArcGIS+Table). In short, your problem has a similar solution:

    df
    
          A    B    C
    ID               
    1   NaN  0.2  NaN
    2   NaN  NaN  0.5
    3   NaN  0.2  0.5
    4   0.1  0.2  NaN
    5   0.1  0.2  0.5
    6   0.1  NaN  0.5
    7   0.1  NaN  NaN
    
    np_data = np.array(np.rec.fromrecords(df.values))
    np_names = df.dtypes.index.tolist()
    # Field names must be str on Python 3 (on Python 2, unicode names needed .encode('UTF8')).
    np_data.dtype.names = tuple(str(name) for name in np_names)
    
    np_data
    
    array([( nan,  0.2,  nan), ( nan,  nan,  0.5), ( nan,  0.2,  0.5),
           ( 0.1,  0.2,  nan), ( 0.1,  0.2,  0.5), ( 0.1,  nan,  0.5),
           ( 0.1,  nan,  nan)], 
          dtype=(numpy.record, [('A', '<f8'), ('B', '<f8'), ('C', '<f8')]))
    