Are there any example data sets for Python?

伪装坚强ぢ 2020-12-23 10:00

For quick testing, debugging, creating portable examples, and benchmarking, R has a large number of data sets available (in the base R datasets package). Is there an equivalent for Python?

6 Answers
  • 2020-12-23 10:39

    There are also datasets available from the Scikit-Learn library.

    from sklearn import datasets
    

    This package contains multiple datasets. Some of the toy datasets are:

    load_boston()          Load and return the boston house-prices dataset (regression); removed in scikit-learn 1.2.
    load_iris()            Load and return the iris dataset (classification).
    load_diabetes()        Load and return the diabetes dataset (regression).
    load_digits([n_class]) Load and return the digits dataset (classification).
    load_linnerud()        Load and return the linnerud dataset (multivariate regression).
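
    As a minimal sketch of how one of these loaders can be used (assuming scikit-learn is installed; the variable names are only illustrative), the toy datasets ship with the library and load without network access:

```python
from sklearn.datasets import load_iris

# Load the bundled iris data as a Bunch object (dict-like container).
iris = load_iris()

# Feature matrix and integer class labels as NumPy arrays.
X, y = iris.data, iris.target
print(X.shape)  # (150, 4)

# Human-readable class names, converted to plain Python strings.
class_names = [str(name) for name in iris.target_names]
print(class_names)
```

    The other loaders in the list follow the same pattern and return an object with at least `data` and `target` attributes.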
    
  • 2020-12-23 10:43

    PyMVPA is another module that provides easy access to datasets. See the link below.

    >>> from mvpa2.tutorial_suite import *
    >>> data = [[  1,  1, -1],
    ...         [  2,  0,  0],
    ...         [  3,  1,  1],
    ...         [  4,  0, -1]]
    >>> ds = Dataset(data)
    >>> ds.shape
    (4, 3)
    >>> len(ds)
    4
    

    Example from the link

    http://www.pymvpa.org/tutorial_datasets.html

  • 2020-12-23 10:49

    Concretely, using @tmthydvnprt's example:

    from sklearn import datasets
    iris = datasets.load_iris()
    

    The feature values themselves are then available as iris.data.

    http://scikit-learn.org/stable/datasets/

    Tested with Python 3.5.

  • You can use the rpy2 package to access all R datasets from Python.

    Set up the interface:

    >>> # Written for rpy2 2.x; in rpy2 3.x, ri2py was renamed rpy2py.
    >>> from rpy2.robjects import r, pandas2ri
    >>> def data(name):
    ...     return pandas2ri.ri2py(r[name])
    

    Then call data() with the name of any available dataset (just like in R):

    >>> df = data('iris')
    >>> df.describe()
           Sepal.Length  Sepal.Width  Petal.Length  Petal.Width
    count    150.000000   150.000000    150.000000   150.000000
    mean       5.843333     3.057333      3.758000     1.199333
    std        0.828066     0.435866      1.765298     0.762238
    min        4.300000     2.000000      1.000000     0.100000
    25%        5.100000     2.800000      1.600000     0.300000
    50%        5.800000     3.000000      4.350000     1.300000
    75%        6.400000     3.300000      5.100000     1.800000
    max        7.900000     4.400000      6.900000     2.500000
    

    To see a list of the available datasets with a description of each:

    >>> print(r.data())
    


    Note: rpy2 requires an R installation with the R_HOME variable set, and pandas must be installed as well.

    UPDATE:

    I just created PyDataset, a simple module that makes loading a dataset in Python as easy as it is in R (it does not require an R installation, only pandas).

    To start using it, install the module:

    $ pip install pydataset

    Then load any dataset you wish (currently around 757 datasets are available):

    from pydataset import data
    
    titanic = data('titanic')
    
  • 2020-12-23 10:52

    Following Joran's comment, I've since found the statsmodels module, which provides its own datasets package. The online documentation shows an example of how to import datasets available in R:

    import statsmodels.api as sm
    duncan_prestige = sm.datasets.get_rdataset("Duncan", "car")
    print(duncan_prestige.__doc__)
    
  • 2020-12-23 10:55

    I originally posted this over at the related question Sample Datasets in Pandas, but since it is relevant outside pandas I am including it here as well.

    There are now many ways to access sample data sets in Python. Personally, I tend to stick with whatever package I am already using (usually seaborn or pandas). If you need offline access, installing the data set with Quilt seems to be the only option.

    Seaborn

    The brilliant plotting package seaborn has several built-in sample data sets.

    import seaborn as sns
    
    iris = sns.load_dataset('iris')
    iris.head()
    
       sepal_length  sepal_width  petal_length  petal_width species
    0           5.1          3.5           1.4          0.2  setosa
    1           4.9          3.0           1.4          0.2  setosa
    2           4.7          3.2           1.3          0.2  setosa
    3           4.6          3.1           1.5          0.2  setosa
    4           5.0          3.6           1.4          0.2  setosa
    

    Pandas

    If you do not want to import seaborn, but still want to access its sample data sets, you can read the seaborn sample data from its URL:

    import pandas as pd

    iris = pd.read_csv('https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv')
    

    Note that sns.load_dataset() changes the column type of categorical columns in some sample data sets, so reading the same file directly from the URL might not give an identical result. The iris and tips sample data sets are also available in the pandas GitHub repository.
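
    If a categorical column is needed after reading a CSV directly, it can be cast explicitly. The following is only a sketch: the inline CSV and its column names are made up for illustration and stand in for a file fetched from a URL.

```python
import io

import pandas as pd

# A tiny inline stand-in for a downloaded CSV (made-up rows, illustration only).
csv_text = """day,total_bill
Thur,10.5
Fri,20.0
Thur,15.25
"""

df = pd.read_csv(io.StringIO(csv_text))

# read_csv leaves text columns as plain strings; cast explicitly to categorical.
df["day"] = df["day"].astype("category")

print(df["day"].dtype)                  # category
print(list(df["day"].cat.categories))   # ['Fri', 'Thur']
```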

    R sample datasets

    Since any CSV file can be read via pd.read_csv(), it is possible to access all of R's sample data sets by copying the URLs from this R data set repository.

    Additional ways of loading the R sample data sets include statsmodels

    import statsmodels.api as sm
    
    iris = sm.datasets.get_rdataset('iris').data
    

    and PyDataset

    from pydataset import data
    
    iris = data('iris')
    

    scikit-learn

    scikit-learn returns sample data as NumPy arrays rather than a pandas data frame.

    from sklearn.datasets import load_iris
    
    iris = load_iris()
    # `iris.data` holds the numerical values
    # `iris.feature_names` holds the names of the numerical columns
    # `iris.target` holds the categorical (species) values (as ints)
    # `iris.target_names` holds the unique categorical names
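
    If a data frame is preferred, the arrays can be combined by hand. This is a minimal sketch, assuming pandas and scikit-learn are installed; the column name `species` is my own choice, not something scikit-learn provides.

```python
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()

# Build a DataFrame from the numerical values and their column names.
df = pd.DataFrame(iris.data, columns=iris.feature_names)

# Map the integer targets back to species names as a categorical column.
df["species"] = pd.Categorical.from_codes(iris.target, iris.target_names)

print(df.head())
```

    Recent scikit-learn versions (0.23 and later) also accept load_iris(as_frame=True), which returns pandas objects directly.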
    

    Quilt

    Quilt is a dataset manager. It includes many common sample datasets, such as several from the uciml sample repository. The quick start page shows how to install and import the iris data set:

    # In your terminal
    $ pip install quilt
    $ quilt install uciml/iris
    

    After installing a dataset, it is accessible locally, so this is the best option if you want to work with the data offline.

    import quilt.data.uciml.iris as ir
    
    iris = ir.tables.iris()
    
       sepal_length  sepal_width  petal_length  petal_width        class
    0           5.1          3.5           1.4          0.2  Iris-setosa
    1           4.9          3.0           1.4          0.2  Iris-setosa
    2           4.7          3.2           1.3          0.2  Iris-setosa
    3           4.6          3.1           1.5          0.2  Iris-setosa
    4           5.0          3.6           1.4          0.2  Iris-setosa
    

    Quilt also supports dataset versioning and includes a short description of each dataset.
