Sample datasets in Pandas

醉酒成梦 2021-01-29 23:43

When using R it's handy to load "practice" datasets using

data(iris)

or

data(mtcars)

Is there something s

4 Answers
  • 2021-01-29 23:43

    The rpy2 module is made for this:

    from rpy2.robjects import r, pandas2ri
    pandas2ri.activate()
    
    r['iris'].head()
    

    yields

       Sepal.Length  Sepal.Width  Petal.Length  Petal.Width Species
    1           5.1          3.5           1.4          0.2  setosa
    2           4.9          3.0           1.4          0.2  setosa
    3           4.7          3.2           1.3          0.2  setosa
    4           4.6          3.1           1.5          0.2  setosa
    5           5.0          3.6           1.4          0.2  setosa
    

    Up to pandas 0.19 you could use pandas' own rpy interface:

    import pandas.rpy.common as rcom
    iris = rcom.load_data('iris')
    print(iris.head())
    

    yields

       Sepal.Length  Sepal.Width  Petal.Length  Petal.Width Species
    1           5.1          3.5           1.4          0.2  setosa
    2           4.9          3.0           1.4          0.2  setosa
    3           4.7          3.2           1.3          0.2  setosa
    4           4.6          3.1           1.5          0.2  setosa
    5           5.0          3.6           1.4          0.2  setosa
    

    rpy2 also provides a way to convert R objects into Python objects (the example below uses the rpy2 2.x name ri2py; in rpy2 3.x it was renamed to rpy2py):

    import pandas as pd
    import rpy2.robjects as ro
    import rpy2.robjects.conversion as conversion
    from rpy2.robjects import pandas2ri
    pandas2ri.activate()
    
    R = ro.r
    
    df = conversion.ri2py(R['mtcars'])
    print(df.head())
    

    yields

        mpg  cyl  disp   hp  drat     wt   qsec  vs  am  gear  carb
    0  21.0    6   160  110  3.90  2.620  16.46   0   1     4     4
    1  21.0    6   160  110  3.90  2.875  17.02   0   1     4     4
    2  22.8    4   108   93  3.85  2.320  18.61   1   1     4     1
    3  21.4    6   258  110  3.08  3.215  19.44   1   0     3     1
    4  18.7    8   360  175  3.15  3.440  17.02   0   0     3     2
    
  • 2021-01-29 23:59

    Any publicly available .csv file can be loaded into pandas directly from its URL. Here is an example using the iris dataset originally from the UCI archive.

    import pandas as pd
    
    file_name = "https://raw.githubusercontent.com/uiuc-cse/data-fa14/gh-pages/data/iris.csv"
    df = pd.read_csv(file_name)
    df.head()
    

    The output is the first five rows of the .csv file you just loaded from the given URL.

    >>> df.head()
       sepal_length  sepal_width  petal_length  petal_width species
    0           5.1          3.5           1.4          0.2  setosa
    1           4.9          3.0           1.4          0.2  setosa
    2           4.7          3.2           1.3          0.2  setosa
    3           4.6          3.1           1.5          0.2  setosa
    4           5.0          3.6           1.4          0.2  setosa
    

    A memorable short URL for the same is https://j​.mp/iriscsv. This short URL will work only if it's typed and not if it's copy-pasted.
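
    Reading from a URL needs a network connection on every call, so one option is to keep a local copy after the first download. The following is a minimal sketch, assuming a helper named load_csv_cached and a cache path of your own choosing; both are hypothetical and not part of pandas.

```python
from pathlib import Path

import pandas as pd


def load_csv_cached(url, cache_path):
    """Hypothetical helper: read a CSV from a local cache, downloading it once if needed."""
    cache_path = Path(cache_path)
    if cache_path.exists():
        # A cached copy exists, so no network access is required.
        return pd.read_csv(cache_path)
    # First call: fetch from the URL, then save a copy for later runs.
    df = pd.read_csv(url)
    cache_path.parent.mkdir(parents=True, exist_ok=True)
    df.to_csv(cache_path, index=False)
    return df


# Example call (needs network access the first time only):
#   iris = load_csv_cached(file_name, "data/iris.csv")
```

    The cached file is written without the index so that reading it back gives the same columns as the original download.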

  • 2021-01-30 00:04

    Since I originally wrote this answer, I have updated it with the many ways that are now available for accessing sample data sets in Python. Personally, I tend to stick with whatever package I am already using (usually seaborn or pandas). If you need offline access, installing the data set with Quilt seems to be the only option.

    Seaborn

    The brilliant plotting package seaborn has several built-in sample data sets.

    import seaborn as sns
    
    iris = sns.load_dataset('iris')
    iris.head()
    
       sepal_length  sepal_width  petal_length  petal_width species
    0           5.1          3.5           1.4          0.2  setosa
    1           4.9          3.0           1.4          0.2  setosa
    2           4.7          3.2           1.3          0.2  setosa
    3           4.6          3.1           1.5          0.2  setosa
    4           5.0          3.6           1.4          0.2  setosa
    

    Pandas

    If you do not want to import seaborn, but still want to access its sample data sets, you can use @andrewwowens's approach for the seaborn sample data:

    iris = pd.read_csv('https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv')
    

    Note that sns.load_dataset() changes the column type of categorical columns in some sample data sets, so the result may differ from what you get by reading the URL directly. The iris and tips sample data sets are also available in the pandas GitHub repo here.
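
    If you read the file yourself and want a categorical column, you can convert it by hand. This is a rough sketch using a few illustrative inline rows instead of the real file; the column name day and the category order are assumptions made for the example.

```python
import io

import pandas as pd

# A few illustrative rows standing in for a downloaded CSV file.
csv_text = """total_bill,tip,day
16.99,1.01,Sun
10.34,1.66,Thur
21.01,3.50,Sat
"""

tips = pd.read_csv(io.StringIO(csv_text))

# read_csv leaves text columns as plain strings; convert explicitly.
# The category order below is an assumption chosen for this example.
day_order = ["Thur", "Fri", "Sat", "Sun"]
tips["day"] = pd.Categorical(tips["day"], categories=day_order)

print(tips.dtypes)
```

    Declaring the categories up front keeps values that are absent from the sample (here Fri) as valid categories.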

    R sample datasets

    Since any dataset can be read via pd.read_csv(), it is possible to access all R's sample data sets by copying the URLs from this R data set repository.

    Additional ways of loading the R sample data sets include statsmodels

    import statsmodels.api as sm
    
    iris = sm.datasets.get_rdataset('iris').data
    

    and PyDataset

    from pydataset import data
    
    iris = data('iris')
    

    scikit-learn

    scikit-learn returns sample data as numpy arrays rather than a pandas data frame.

    from sklearn.datasets import load_iris
    
    iris = load_iris()
    # `iris.data` holds the numerical values
    # `iris.feature_names` holds the numerical column names
    # `iris.target` holds the categorical (species) values (as ints)
    # `iris.target_names` holds the unique categorical names
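
    If you would rather have a data frame, the arrays can be combined by hand. The sketch below assumes a helper named arrays_to_frame and a column name species, both hypothetical; newer scikit-learn releases also accept load_iris(as_frame=True), which may be simpler if your version supports it.

```python
import numpy as np
import pandas as pd


def arrays_to_frame(data, feature_names, target, target_names):
    """Hypothetical helper: combine scikit-learn style arrays into one DataFrame."""
    df = pd.DataFrame(data, columns=list(feature_names))
    # Look up each integer class code in the array of class names.
    df["species"] = np.asarray(target_names)[np.asarray(target)]
    return df


# With scikit-learn installed, usage would look like:
#   from sklearn.datasets import load_iris
#   iris = load_iris()
#   df = arrays_to_frame(iris.data, iris.feature_names,
#                        iris.target, iris.target_names)
```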
    

    Quilt

    Quilt is a dataset manager that includes many common sample datasets, such as several from the uciml sample repository. The quick start page shows how to install and import the iris data set:

    # In your terminal
    $ pip install quilt
    $ quilt install uciml/iris
    

    After installing a dataset, it is accessible locally, so this is the best option if you want to work with the data offline.

    import quilt.data.uciml.iris as ir
    
    iris = ir.tables.iris()
    
       sepal_length  sepal_width  petal_length  petal_width        class
    0           5.1          3.5           1.4          0.2  Iris-setosa
    1           4.9          3.0           1.4          0.2  Iris-setosa
    2           4.7          3.2           1.3          0.2  Iris-setosa
    3           4.6          3.1           1.5          0.2  Iris-setosa
    4           5.0          3.6           1.4          0.2  Iris-setosa
    

    Quilt also supports dataset versioning and includes a short description of each dataset.

  • 2021-01-30 00:06

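    The built-in pandas testing DataFrames are very convenient. Note that pandas.util.testing was deprecated in pandas 1.0, so the calls below apply to older releases.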

    makeMixedDataFrame():

    In [22]: import pandas as pd
    
    In [23]: pd.util.testing.makeMixedDataFrame()
    Out[23]:
         A    B     C          D
    0  0.0  0.0  foo1 2009-01-01
    1  1.0  1.0  foo2 2009-01-02
    2  2.0  0.0  foo3 2009-01-05
    3  3.0  1.0  foo4 2009-01-06
    4  4.0  0.0  foo5 2009-01-07
    

    Other testing DataFrame options:

    makeDataFrame():

    In [24]: pd.util.testing.makeDataFrame().head()
    Out[24]:
                       A         B         C         D
    acKoIvMLwE  0.121895 -0.781388  0.416125 -0.105779
    jc6UQeOO1K -0.542400  2.210908 -0.536521 -1.316355
    GlzjJESv7a  0.921131 -0.927859  0.995377  0.005149
    CMhwowHXdW  1.724349  0.604531 -1.453514 -0.289416
    ATr2ww0ctj  0.156038  0.597015  0.977537 -1.498532
    

    makeMissingDataframe():

    In [27]: pd.util.testing.makeMissingDataframe().head()
    Out[27]:
                       A         B         C         D
    qyXLpmp1Zg -1.034246  1.050093       NaN       NaN
    v7eFDnbQko  0.581576  1.334046 -0.576104 -0.579940
    fGiibeTEjx -1.166468 -1.146750 -0.711950 -0.205822
    Q8ETSRa6uY  0.461845 -2.112087  0.167380 -0.466719
    7XBSChaOyL -1.159962 -1.079996  1.585406 -1.411159
    

    makeTimeDataFrame():

    In [28]: pd.util.testing.makeTimeDataFrame().head()
    Out[28]:
                       A         B         C         D
    2000-01-03 -0.641226  0.912964  0.308781  0.551329
    2000-01-04  0.364452 -0.722959  0.322865  0.426233
    2000-01-05  1.042171  0.005285  0.156562  0.978620
    2000-01-06  0.749606 -0.128987 -0.312927  0.481170
    2000-01-07  0.945844 -0.854273  0.935350  1.165401
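
    On pandas versions where these helpers are unavailable, similar frames can be built from the public API. This is a minimal sketch, not the pandas implementation; the function names make_mixed_frame and make_time_frame and their parameters are hypothetical.

```python
import numpy as np
import pandas as pd


def make_mixed_frame():
    """Hypothetical stand-in for makeMixedDataFrame(): floats, strings and dates."""
    return pd.DataFrame(
        {
            "A": [0.0, 1.0, 2.0, 3.0, 4.0],
            "B": [0.0, 1.0, 0.0, 1.0, 0.0],
            "C": ["foo1", "foo2", "foo3", "foo4", "foo5"],
            # Five consecutive business days starting on 2009-01-01.
            "D": pd.bdate_range("2009-01-01", periods=5),
        }
    )


def make_time_frame(n_rows=30, seed=0):
    """Hypothetical stand-in for makeTimeDataFrame(): random floats on a business-day index."""
    rng = np.random.default_rng(seed)
    index = pd.bdate_range("2000-01-03", periods=n_rows)
    values = rng.standard_normal((n_rows, 4))
    return pd.DataFrame(values, index=index, columns=list("ABCD"))


print(make_mixed_frame())
print(make_time_frame().head())
```

    Passing a fixed seed makes the random values reproducible between runs, which the original helpers did not guarantee by default.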
    