How to import a table with headings into a data frame using the pandas module

后悔当初 2021-01-20 23:53

I'm trying to get information from a table on the internet, as shown below. I'm using Jupyter Notebook with Python 2.7. I want to use this information with Python's pandas module.

2 Answers
  • 2021-01-21 00:21

    Consider using an HTML scraper such as Python's lxml module (its etree.HTML() parser) to extract the table data and then load it into a pandas DataFrame. Automated helpers like pandas.read_html() exist, but this approach gives more control over quirks in the HTML content, such as the cell that spans several columns in the Feb 4 row. The code below uses an XPath expression that selects the <td> at each position in a table row, using the bracket index []:

    import requests
    import pandas as pd
    from lxml import etree
    
    # READ IN AND PARSE WEB DATA
    url = "https://finance.yahoo.com/q/hp?s=AAPL+Historical+Prices"    
    rq = requests.get(url)
    htmlpage = etree.HTML(rq.content)
    
    # INITIALIZE LISTS
    dates = []  
    openstock = []
    highstock = []
    lowstock = []
    closestock = []
    volume = []
    adjclose = []
    
    # ITERATE THROUGH SEVEN COLUMNS OF TABLE
    for i in range(1,8):
        htmltable = htmlpage.xpath("//tr[td/@class='yfnc_tabledata1']/td[{}]".format(i))
    
        # APPEND COLUMN DATA TO CORRESPONDING LIST
        for row in htmltable:
            if i == 1: dates.append(row.text)
            if i == 2: openstock.append(row.text)
            if i == 3: highstock.append(row.text)
            if i == 4: lowstock.append(row.text)
            if i == 5: closestock.append(row.text)
            if i == 6: volume.append(row.text)
            if i == 7: adjclose.append(row.text)
    
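    # CLEAN UP THE COLSPAN ROW (AT FEB. 4), WHICH ONLY FILLS THE FIRST TWO COLUMNS
    # NOTE: INDEX 7 IS HARD-CODED FOR THIS SNAPSHOT OF THE PAGE
    dates = [d for d in dates if d and len(d.strip()) > 3]
    del dates[7]
    del openstock[7]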
    
    # MIGRATE LISTS TO DATA FRAME
    df = pd.DataFrame({'Dates':dates,
                       'Open':openstock,
                       'High':highstock,
                       'Low':lowstock,                   
                       'Close':closestock,
                       'Volume':volume,
                       'AdjClose':adjclose})
    
    #   AdjClose   Close         Dates    High     Low    Open       Volume
    #0     93.99   93.99  Feb 12, 2016   94.50   93.01   94.19   40,121,700
    #1     93.70   93.70  Feb 11, 2016   94.72   92.59   93.79   49,686,200
    #2     94.27   94.27  Feb 10, 2016   96.35   94.10   95.92   42,245,000
    #3     94.99   94.99   Feb 9, 2016   95.94   93.93   94.29   44,331,200
    #4     95.01   95.01   Feb 8, 2016   95.70   93.04   93.13   54,021,400
    #5     94.02   94.02   Feb 5, 2016   96.92   93.69   96.52   46,418,100
    #...
    #61   111.73  112.34  Nov 13, 2015  115.57  112.27  115.20   45,812,400
    #62   115.10  115.72  Nov 12, 2015  116.82  115.65  116.26   32,525,600
    #63   115.48  116.11  Nov 11, 2015  117.42  115.21  116.37   45,218,000
    #64   116.14  116.77  Nov 10, 2015  118.07  116.06  116.90   59,127,900
    #65   119.92  120.57   Nov 9, 2015  121.81  120.05  120.96   33,871,400
    
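    The hard-coded index above is tied to one snapshot of the page. As a minimal sketch of a row-wise alternative, the code below keeps only rows that have all seven cells, so a row with a column-spanning cell is skipped wherever it appears, and then converts the scraped text to dates and numbers. The HTML is a hand-written stand-in for the page (the header row and the wording of the spanning cell are assumptions); the two data rows reuse values from the output above.

```python
import pandas as pd
from lxml import etree

# Hand-written stand-in for the page markup (an assumption, not the real page)
html = """
<table>
  <tr><th>Date</th><th>Open</th><th>High</th><th>Low</th><th>Close</th><th>Volume</th><th>Adj Close</th></tr>
  <tr><td class="yfnc_tabledata1">Feb 8, 2016</td><td class="yfnc_tabledata1">93.13</td><td class="yfnc_tabledata1">95.70</td><td class="yfnc_tabledata1">93.04</td><td class="yfnc_tabledata1">95.01</td><td class="yfnc_tabledata1">54,021,400</td><td class="yfnc_tabledata1">95.01</td></tr>
  <tr><td class="yfnc_tabledata1">Feb 5, 2016</td><td class="yfnc_tabledata1">96.52</td><td class="yfnc_tabledata1">96.92</td><td class="yfnc_tabledata1">93.69</td><td class="yfnc_tabledata1">94.02</td><td class="yfnc_tabledata1">46,418,100</td><td class="yfnc_tabledata1">94.02</td></tr>
  <tr><td class="yfnc_tabledata1">Feb 4, 2016</td><td class="yfnc_tabledata1" colspan="6">0.52 Dividend</td></tr>
</table>
"""

page = etree.HTML(html)

# Column headings come from the <th> cells of the header row
headers = [th.text for th in page.xpath("//tr/th")]

# Collect each data row as a list of cell texts; keep only complete rows
records = []
for row in page.xpath("//tr[td/@class='yfnc_tabledata1']"):
    cells = [td.text for td in row.xpath("td")]
    if len(cells) == len(headers):
        records.append(cells)

df = pd.DataFrame(records, columns=headers)

# Scraped cells are strings; convert them to proper types
df["Date"] = pd.to_datetime(df["Date"], format="%b %d, %Y")
df["Volume"] = df["Volume"].str.replace(",", "", regex=False).astype("int64")
price_cols = ["Open", "High", "Low", "Close", "Adj Close"]
df[price_cols] = df[price_cols].astype(float)

print(df.dtypes)
```

    Building the frame from whole rows also keeps the columns in page order and avoids the seven parallel lists.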
  • 2021-01-21 00:28

    The page links to a CSV file containing all of the data, which read_csv can parse directly:

    import pandas as pd
    
    df = pd.read_csv("http://real-chart.finance.yahoo.com/table.csv?s=AAPL&d=1&e=16&f=2016&g=d&a=11&b=12&c=1980&ignore=.csv")
    
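    To see what read_csv makes of a file in this layout without depending on the remote URL, here is a small sketch that parses the same columns from an in-memory string. The three rows are copied from the output printed further down in this answer; the variable names and the choice to sort oldest-first are illustrative assumptions.

```python
from io import StringIO

import pandas as pd

# Three rows in the same layout as the downloaded file
csv_text = """Date,Open,High,Low,Close,Volume,Adj Close
2016-02-12,94.190002,94.500000,93.010002,93.989998,40121700,93.989998
2016-02-11,93.790001,94.720001,92.589996,93.699997,49686200,93.699997
2016-02-10,95.919998,96.349998,94.099998,94.269997,42245000,94.269997
"""

# The first line becomes the column headings; Date is parsed to datetime64
df = pd.read_csv(StringIO(csv_text), parse_dates=["Date"])

# The file lists the newest rows first; sort oldest-first for time-series work
df = df.sort_values("Date").reset_index(drop=True)

print(df.dtypes)
print(df[["Date", "Close"]])
```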

    If you want a specific time period, change the parameters in the URL, i.e. s=AAPL&d=1&e=16&f=2016&g=d&a=11&b=12&c=1980. For example, changing 1980 to 2015:

    df = pd.read_csv("http://real-chart.finance.yahoo.com/table.csv?s=AAPL&d=1&e=16&f=2016&g=d&a=11&b=12&c=2015&ignore=.csv", parse_dates=[0])
    
    print(df)
    

    We get:

              Date        Open        High         Low       Close     Volume  \
    0   2016-02-12   94.190002   94.500000   93.010002   93.989998   40121700   
    1   2016-02-11   93.790001   94.720001   92.589996   93.699997   49686200   
    2   2016-02-10   95.919998   96.349998   94.099998   94.269997   42245000   
    3   2016-02-09   94.290001   95.940002   93.930000   94.989998   44331200   
    4   2016-02-08   93.129997   95.699997   93.040001   95.010002   54021400   
    5   2016-02-05   96.519997   96.919998   93.690002   94.019997   46418100   
    6   2016-02-04   95.860001   97.330002   95.190002   96.599998   46471700   
    7   2016-02-03   95.000000   96.839996   94.080002   96.349998   45964300   
    8   2016-02-02   95.419998   96.040001   94.279999   94.480003   37357200   
    9   2016-02-01   96.470001   96.709999   95.400002   96.430000   40943500   
    10  2016-01-29   94.790001   97.339996   94.349998   97.339996   64416500   
    11  2016-01-28   93.790001   94.519997   92.389999   94.089996   55678800   
    12  2016-01-27   96.040001   96.629997   93.339996   93.419998  133369700   
    13  2016-01-26   99.930000  100.879997   98.070000   99.989998   75077000   
    14  2016-01-25  101.519997  101.529999   99.209999   99.440002   51794500   
    15  2016-01-22   98.629997  101.459999   98.370003  101.419998   65800500   
    16  2016-01-21   97.059998   97.879997   94.940002   96.300003   52161500   
    17  2016-01-20   95.099998   98.190002   93.419998   96.790001   72334400   
    18  2016-01-19   98.410004   98.650002   95.500000   96.660004   53087700   
    19  2016-01-15   96.199997   97.709999   95.360001   97.129997   79833900   
    20  2016-01-14   97.959999  100.480003   95.739998   99.519997   63170100   
    21  2016-01-13  100.320000  101.190002   97.300003   97.389999   62439600   
    22  2016-01-12  100.550003  100.690002   98.839996   99.959999   49154200   
    23  2016-01-11   98.970001   99.059998   97.339996   98.529999   49739400   
    24  2016-01-08   98.550003   99.110001   96.760002   96.959999   70798000   
    25  2016-01-07   98.680000  100.129997   96.430000   96.449997   81094400   
    26  2016-01-06  100.559998  102.370003   99.870003  100.699997   68457400   
    27  2016-01-05  105.750000  105.849998  102.410004  102.709999   55791000   
    28  2016-01-04  102.610001  105.370003  102.000000  105.349998   67649400   
    29  2015-12-31  107.010002  107.029999  104.820000  105.260002   40912300   
    30  2015-12-30  108.580002  108.699997  107.180000  107.320000   25213800   
    31  2015-12-29  106.959999  109.430000  106.860001  108.739998   30931200   
    32  2015-12-28  107.589996  107.690002  106.180000  106.820000   26704200   
    33  2015-12-24  109.000000  109.000000  107.949997  108.029999   13596700   
    34  2015-12-23  107.269997  108.849998  107.199997  108.610001   32657400   
    35  2015-12-22  107.400002  107.720001  106.449997  107.230003   32789400   
    36  2015-12-21  107.279999  107.370003  105.570000  107.330002   47590600   
    37  2015-12-18  108.910004  109.519997  105.809998  106.029999   96453300   
    38  2015-12-17  112.019997  112.250000  108.980003  108.980003   44772800   
    39  2015-12-16  111.070000  111.989998  108.800003  111.339996   56238500   
    40  2015-12-15  111.940002  112.800003  110.349998  110.489998   52978100   
    41  2015-12-14  112.180000  112.680000  109.790001  112.480003   64318700   
    
         Adj Close  
    0    93.989998  
    1    93.699997  
    2    94.269997  
    3    94.989998  
    4    95.010002  
    5    94.019997  
    6    96.599998  
    7    95.830001  
    8    93.970098  
    9    95.909571  
    10   96.814656  
    11   93.582196  
    12   92.915814  
    13   99.450356  
    14   98.903329  
    15  100.872638  
    16   95.780276  
    17   96.267629  
    18   96.138333  
    19   96.605790  
    20   98.982891  
    21   96.864389  
    22   99.420519  
    23   97.998236  
    24   96.436710  
    25   95.929460  
    26  100.156523  
    27  102.155677  
    28  104.781429  
    29  104.691918  
    30  106.740798  
    31  108.153132  
    32  106.243496  
    33  107.446965  
    34  108.023837  
    35  106.651287  
    36  106.750746  
    37  105.457759  
    38  108.391842  
    39  110.739099  
    40  109.893688  
    41  111.872953  
    
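    The query string in the URLs above can also be assembled in code. The sketch below uses a hypothetical helper, build_history_url. The parameter meanings are inferred from the two example URLs and the returned rows (a/b/c as start month/day/year, d/e/f as end month/day/year, months counted from 0, g=d for daily rows) rather than taken from any official documentation.

```python
from datetime import date
from urllib.parse import urlencode

BASE_URL = "http://real-chart.finance.yahoo.com/table.csv"


def build_history_url(symbol, start, end):
    """Build a CSV download URL (hypothetical helper; parameter meanings are inferred)."""
    params = [
        ("s", symbol),
        ("d", end.month - 1),    # end month, assumed zero-based
        ("e", end.day),
        ("f", end.year),
        ("g", "d"),              # assumed to mean daily rows
        ("a", start.month - 1),  # start month, assumed zero-based
        ("b", start.day),
        ("c", start.year),
        ("ignore", ".csv"),
    ]
    return BASE_URL + "?" + urlencode(params)


url = build_history_url("AAPL", date(2015, 12, 12), date(2016, 2, 16))
print(url)
```

    With these dates the helper reproduces the second URL used in this answer, which can then be passed to pd.read_csv.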