How to import csv data file into scikit-learn?

轮回少年 2021-01-30 03:16

As I understand it, scikit-learn accepts data in (n_samples, n_features) format, which is a 2D array. Assuming I have data in the form ...

Stock prices    in         


        
4 answers
  • 2021-01-30 03:43

    Look up the loadtxt function in NumPy; its documentation lists the optional arguments it accepts.

    For a CSV file, the only change needed is to set the delimiter:

    data =  np.loadtxt(fname = f, delimiter = ',')
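
    A minimal sketch of that call, assuming a small comma-separated file with one header row. The in-memory sample and its column names are made up for illustration and stand in for a real file; `skiprows=1` is one way to skip the header.

```python
import io
import numpy as np

# Hypothetical CSV content standing in for a file on disk.
sample = io.StringIO(
    "price,volume,rate\n"
    "10.5,200,0.1\n"
    "11.0,180,0.2\n"
)

# delimiter="," splits each line on commas; skiprows=1 skips the header line.
data = np.loadtxt(sample, delimiter=",", skiprows=1)
print(data.shape)  # (n_samples, n_features)
```

    Note that loadtxt expects every remaining value to be numeric, so text columns need a different loader.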
    
  • 2021-01-30 03:52

    This is not a CSV file; it is just a space-separated file. Assuming there are no missing values, you can easily load it into a NumPy array called data with

    import numpy as np
    
    # the with block closes the file once loading is done
    with open("filename.txt") as f:
        f.readline()  # skip the header
        data = np.loadtxt(f)
    

    If the stock price is what you want to predict (your y value, in scikit-learn terms), then you should split data using

    X = data[:, 1:]  # select columns 1 through end
    y = data[:, 0]   # select column 0, the stock price
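
    To check the shapes this split produces, here is a small sketch under the same assumptions (space-separated values, one header row, target in column 0). The sample text and feature names are hypothetical.

```python
import io
import numpy as np

# Hypothetical space-separated data; column 0 plays the role of the stock price.
sample = io.StringIO(
    "price feature_a feature_b\n"
    "10.5 1.0 2.0\n"
    "11.0 1.5 2.5\n"
    "11.7 2.0 3.0\n"
)

# Whitespace is the default delimiter for loadtxt.
data = np.loadtxt(sample, skiprows=1)

X = data[:, 1:]  # 2D array of shape (n_samples, n_features)
y = data[:, 0]   # 1D array of shape (n_samples,)
print(X.shape, y.shape)
```

    X stays two-dimensional and y is one-dimensional, which is the layout scikit-learn estimators take in fit(X, y).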
    

    Alternatively, you might be able to massage the standard Python csv module into handling this type of file.
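
    One possible way to do that, as a sketch: pass a space as the delimiter and set skipinitialspace=True so that runs of spaces are not read as empty fields. The sample content is hypothetical, and this assumes lines have no trailing spaces.

```python
import csv
import io

# Hypothetical space-separated content with uneven spacing between columns.
sample = io.StringIO(
    "price feature_a feature_b\n"
    "10.5  1.0  2.0\n"
    "11.0  1.5  2.5\n"
)

# skipinitialspace=True ignores the extra spaces that follow each delimiter.
reader = csv.reader(sample, delimiter=" ", skipinitialspace=True)
header = next(reader)  # first row holds the column names
rows = [[float(value) for value in row] for row in reader]
print(header)
print(rows)
```

    The resulting list of lists can be handed to np.array to get the 2D array that scikit-learn expects.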

  • 2021-01-30 03:53

    Use NumPy to load the CSV file:

    import numpy as np
    dataset = np.loadtxt('./example.csv', delimiter=',')
    
  • 2021-01-30 03:56

    A very good alternative to NumPy's loadtxt is read_csv from pandas. The data is loaded into a pandas DataFrame, with the big advantage that it can handle mixed data types, for example some columns containing text and others containing numbers. You can then easily select only the numeric columns and convert them to a NumPy array with to_numpy (called as_matrix in pandas versions before 1.0, where it was removed). pandas will also read and write Excel files and a number of other formats.

    If we have a csv file named "mydata.csv":

    point_latitude,point_longitude,line,construction,point_granularity
    30.102261, -81.711777, Residential, Masonry, 1
    30.063936, -81.707664, Residential, Masonry, 3
    30.089579, -81.700455, Residential, Wood   , 1
    30.063236, -81.707703, Residential, Wood   , 3
    30.060614, -81.702675, Residential, Wood   , 1
    

    This will read in the CSV and convert the numeric columns into a NumPy array for scikit-learn, then reverse the order of the columns and write the result out to an Excel spreadsheet:

    import numpy as np
    import pandas as pd
    
    input_file = "mydata.csv"
    
    
    # comma delimited is the default
    df = pd.read_csv(input_file, header = 0)
    
    # for space delimited use:
    # df = pd.read_csv(input_file, header = 0, delimiter = " ")
    
    # for tab delimited use:
    # df = pd.read_csv(input_file, header = 0, delimiter = "\t")
    
    # put the original column names in a python list
    original_headers = list(df.columns.values)
    
    # remove the non-numeric columns (public API instead of the private _get_numeric_data)
    df = df.select_dtypes(include="number")
    
    # put the numeric column names in a python list
    numeric_headers = list(df.columns.values)
    
    # create a numpy array with the numeric values for input into scikit-learn
    # (as_matrix was removed in pandas 1.0; to_numpy replaces it)
    numpy_array = df.to_numpy()
    
    # reverse the order of the columns
    numeric_headers.reverse()
    reverse_df = df[numeric_headers]
    
    # write the reverse_df to an Excel spreadsheet
    # (recent pandas versions no longer write .xls; .xlsx needs the openpyxl package)
    reverse_df.to_excel('path_to_file.xlsx')
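
    A condensed sketch of the numeric-column step, using two rows of the sample data above held in memory instead of a file. skipinitialspace=True is an addition not in the original code, included here so the spaces after each comma are dropped before parsing.

```python
import io
import pandas as pd

# Two rows from the sample above, held in memory instead of "mydata.csv".
sample = io.StringIO(
    "point_latitude,point_longitude,line,construction,point_granularity\n"
    "30.102261, -81.711777, Residential, Masonry, 1\n"
    "30.063936, -81.707664, Residential, Wood   , 3\n"
)

# skipinitialspace=True drops the spaces that follow each comma.
df = pd.read_csv(sample, skipinitialspace=True)

# Keep only the numeric columns, then convert them to a NumPy array.
numeric_df = df.select_dtypes(include="number")
features = numeric_df.to_numpy()

print(list(numeric_df.columns))
print(features.shape)
```

    The text columns (line and construction) are dropped, leaving a 2D array that can be passed to scikit-learn.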
    