DataFrame into numpy array with values comma separated

-上瘾入骨i 2021-01-14 21:53

The Scenario

I've read a CSV file (tab-separated) into a DataFrame, and I now need it as a numpy array for clustering, without changing the column types.

3 Answers
  • 2021-01-14 22:48

    It seems you first need read_csv to load the DataFrame, keeping only the second and third columns, and then convert it to a numpy array with values:

    import pandas as pd
    from sklearn.cluster import KMeans
    from io import StringIO

    temp=u"""col,iid,rat
    4,1,0
    5,2,4
    6,3,3
    7,4,1"""
    # after testing, replace StringIO(temp) with 'filename.csv'
    df = pd.read_csv(StringIO(temp), usecols = [1,2])
    print (df)
       iid  rat
    0    1    0
    1    2    4
    2    3    3
    3    4    1
    
    X = df.values 
    print (X)
    [[1 0]
     [2 4]
     [3 3]
     [4 1]]
    
    kmeans = KMeans(n_clusters=2)
    a = kmeans.fit(X)
    print (a)
    KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
        n_clusters=2, n_init=10, n_jobs=1, precompute_distances='auto',
        random_state=None, tol=0.0001, verbose=0)
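
For the tab-separated file described in the question, the same approach might look roughly like this. It is only a sketch: the sample data and the column names (uid, iid, rat) are assumptions standing in for the real file, and `to_numpy()` is used as the newer spelling of `.values`.

```python
from io import StringIO

import pandas as pd

# Hypothetical tab-separated sample standing in for the real file.
sample = "uid\tiid\trat\n196\t242\t3\n186\t302\t3\n22\t377\t1\n"

# sep="\t" tells read_csv the input is tab-delimited; usecols keeps only
# the second and third columns.
df_tab = pd.read_csv(StringIO(sample), sep="\t", usecols=["iid", "rat"])

# Both selected columns hold integers, so the array keeps an integer dtype.
X_tab = df_tab.to_numpy()
print(X_tab.dtype, X_tab.shape)
```

If every selected column is an integer, nothing is promoted to float; promotion only happens when the columns have mixed numeric types.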
    
  • 2021-01-14 22:54

    Why don't you just import the 'csv' as a numpy array?

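    import numpy as np

    def read_file(fname):
        # delimiter must be the tab character "\t";
        # unpack=True returns the transpose (one row per file column)
        return np.genfromtxt(fname, delimiter="\t", comments="%", unpack=True)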
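
A variant of this idea that keeps only two columns and asks for integers could look like the sketch below. The sample data, the header row, and the choice of columns 1 and 2 are assumptions for illustration, not details taken from the question's file.

```python
from io import StringIO

import numpy as np

# Hypothetical tab-separated sample with one header line.
sample = "uid\tiid\trat\n196\t242\t3\n186\t302\t3\n22\t377\t1\n"

# skip_header drops the column names, usecols keeps columns 1 and 2,
# and dtype=int stops genfromtxt from defaulting to float.
arr = np.genfromtxt(StringIO(sample), delimiter="\t", skip_header=1,
                    usecols=(1, 2), dtype=int)
print(arr.shape)
```

Without `unpack=True` the result has one row per line of the file, which is the layout clustering code usually expects.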
    
  • 2021-01-14 22:56

    Use label-based selection and the .values attribute of the resulting pandas object, which returns a numpy array:

    >>> df
       uid  iid  rat
    0  196  242  3.0
    1  186  302  3.0
    2   22  377  1.0
    >>> df.loc[:,['iid','rat']]
       iid  rat
    0  242  3.0
    1  302  3.0
    2  377  1.0
    >>> df.loc[:,['iid','rat']].values
    array([[ 242.,    3.],
           [ 302.,    3.],
           [ 377.,    1.]])
    

    Note, your integer column will get promoted to float.
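
The promotion can be checked directly, and a record array is one way to keep a separate dtype per column. This is a sketch on a made-up frame that mirrors the example above; `df_mixed`, `promoted` and `records` are illustrative names.

```python
import pandas as pd

# Hypothetical frame mirroring the example: one int column, one float column.
df_mixed = pd.DataFrame({"iid": [242, 302, 377], "rat": [3.0, 3.0, 1.0]})

# A plain 2-D array has a single dtype, so the ints are upcast to float64.
promoted = df_mixed[["iid", "rat"]].to_numpy()

# A record array stores one dtype per field, so iid stays an integer.
records = df_mixed[["iid", "rat"]].to_records(index=False)

print(promoted.dtype)
print(records.dtype)
```

scikit-learn estimators expect a plain 2-D numeric array, so for clustering the float array is usually the one to pass; the record array is mainly useful when the original types have to survive.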

    Also note, this particular selection could be approached in different ways:

    >>> df.iloc[:, 1:] # integer-position based
       iid  rat
    0  242  3.0
    1  302  3.0
    2  377  1.0
    >>> df[['iid','rat']] # plain indexing performs column-based selection
       iid  rat
    0  242  3.0
    1  302  3.0
    2  377  1.0
    

    I like label-based because it is more explicit.

    Edit

    The reason you aren't seeing commas is an artifact of how numpy arrays are printed:

    >>> df[['iid','rat']].values
    array([[ 242.,    3.],
           [ 302.,    3.],
           [ 377.,    1.]])
    >>> print(df[['iid','rat']].values)
    [[ 242.    3.]
     [ 302.    3.]
     [ 377.    1.]]
    

    And actually, it is the difference between the str and repr results of the numpy array:

    >>> print(repr(df[['iid','rat']].values))
    array([[ 242.,    3.],
           [ 302.,    3.],
           [ 377.,    1.]])
    >>> print(str(df[['iid','rat']].values))
    [[ 242.    3.]
     [ 302.    3.]
     [ 377.    1.]]
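
If the commas are wanted in printed output, the separator can be chosen explicitly. A small sketch, using a hand-built array in place of the DataFrame values:

```python
import numpy as np

# Stand-in for df[['iid','rat']].values from the example above.
values = np.array([[242.0, 3.0], [302.0, 3.0], [377.0, 1.0]])

# array2string lets you pick the separator used between elements.
with_commas = np.array2string(values, separator=", ")
print(with_commas)

# tolist() gives nested Python lists, which also print with commas.
as_lists = values.tolist()
print(as_lists)
```

Either way the array itself is unchanged; only its text representation differs.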
    