Can I scrape the raw data from highcharts.js?

不思量自难忘° 2020-12-03 09:03

I want to scrape the data from a page that shows a graph using highcharts.js, and I have finished parsing all the pages needed to get to the following page. However,

1 answer
  • 2020-12-03 09:24

    The data is in a script tag. You can get the script tag using bs4 and a regex. You could also extract the data itself with a regex, but I like using js2xml to parse JavaScript into an XML tree:

    from bs4 import BeautifulSoup
    import requests
    import re
    import js2xml
    
    soup = BeautifulSoup(requests.get("http://www.worldweatheronline.com/brussels-weather-averages/be.aspx").content, "html.parser")
    script = soup.find("script", text=re.compile("Highcharts.Chart")).text
    # script = soup.find("script", text=re.compile("precipchartcontainer")).text if you want precipitation data
    parsed = js2xml.parse(script)
    print(js2xml.pretty_print(parsed))
    

    That gives you:

    <program>
      <functioncall>
        <function>
          <identifier name="$"/>
        </function>
        <arguments>
          <funcexpr>
            <identifier/>
            <parameters/>
            <body>
              <var name="chart"/>
              <functioncall>
                <function>
                  <dotaccessor>
                    <object>
                      <functioncall>
                        <function>
                          <identifier name="$"/>
                        </function>
                        <arguments>
                          <identifier name="document"/>
                        </arguments>
                      </functioncall>
                    </object>
                    <property>
                      <identifier name="ready"/>
                    </property>
                  </dotaccessor>
                </function>
                <arguments>
                  <funcexpr>
                    <identifier/>
                    <parameters/>
                    <body>
                      <assign operator="=">
                        <left>
                          <identifier name="chart"/>
                        </left>
                        <right>
                          <new>
                            <dotaccessor>
                              <object>
                                <identifier name="Highcharts"/>
                              </object>
                              <property>
                                <identifier name="Chart"/>
                              </property>
                            </dotaccessor>
                            <arguments>
                              <object>
                                <property name="chart">
                                  <object>
                                    <property name="renderTo">
                                      <string>tempchartcontainer</string>
                                    </property>
                                    <property name="type">
                                      <string>spline</string>
                                    </property>
                                  </object>
                                </property>
                                <property name="credits">
                                  <object>
                                    <property name="enabled">
                                      <boolean>false</boolean>
                                    </property>
                                  </object>
                                </property>
                                <property name="colors">
                                  <array>
                                    <string>#FF8533</string>
                                    <string>#4572A7</string>
                                  </array>
                                </property>
                                <property name="title">
                                  <object>
                                    <property name="text">
                                      <string>Average Temperature (°c) Graph for Brussels</string>
                                    </property>
                                  </object>
                                </property>
                                <property name="xAxis">
                                  <object>
                                    <property name="categories">
                                      <array>
                                        <string>January</string>
                                        <string>February</string>
                                        <string>March</string>
                                        <string>April</string>
                                        <string>May</string>
                                        <string>June</string>
                                        <string>July</string>
                                        <string>August</string>
                                        <string>September</string>
                                        <string>October</string>
                                        <string>November</string>
                                        <string>December</string>
                                      </array>
                                    </property>
                                    <property name="labels">
                                      <object>
                                        <property name="rotation">
                                          <number value="270"/>
                                        </property>
                                        <property name="y">
                                          <number value="40"/>
                                        </property>
                                      </object>
                                    </property>
                                  </object>
                                </property>
                                <property name="yAxis">
                                  <object>
                                    <property name="title">
                                      <object>
                                        <property name="text">
                                          <string>Temperature (°c)</string>
                                        </property>
                                      </object>
                                    </property>
                                  </object>
                                </property>
                                <property name="tooltip">
                                  <object>
                                    <property name="enabled">
                                      <boolean>true</boolean>
                                    </property>
                                  </object>
                                </property>
                                <property name="plotOptions">
                                  <object>
                                    <property name="spline">
                                      <object>
                                        <property name="dataLabels">
                                          <object>
                                            <property name="enabled">
                                              <boolean>true</boolean>
                                            </property>
                                          </object>
                                        </property>
                                        <property name="enableMouseTracking">
                                          <boolean>false</boolean>
                                        </property>
                                      </object>
                                    </property>
                                  </object>
                                </property>
                                <property name="series">
                                  <array>
                                    <object>
                                      <property name="name">
                                        <string>Average High Temp (°c)</string>
                                      </property>
                                      <property name="color">
                                        <string>#FF8533</string>
                                      </property>
                                      <property name="data">
                                        <array>
                                          <number value="6"/>
                                          <number value="8"/>
                                          <number value="11"/>
                                          <number value="14"/>
                                          <number value="19"/>
                                          <number value="21"/>
                                          <number value="23"/>
                                          <number value="23"/>
                                          <number value="19"/>
                                          <number value="15"/>
                                          <number value="9"/>
                                          <number value="6"/>
                                        </array>
                                      </property>
                                    </object>
                                    <object>
                                      <property name="name">
                                        <string>Average Low Temp (°c)</string>
                                      </property>
                                      <property name="color">
                                        <string>#4572A7</string>
                                      </property>
                                      <property name="data">
                                        <array>
                                          <number value="2"/>
                                          <number value="2"/>
                                          <number value="4"/>
                                          <number value="6"/>
                                          <number value="10"/>
                                          <number value="12"/>
                                          <number value="14"/>
                                          <number value="14"/>
                                          <number value="11"/>
                                          <number value="8"/>
                                          <number value="5"/>
                                          <number value="2"/>
                                        </array>
                                      </property>
                                    </object>
                                  </array>
                                </property>
                              </object>
                            </arguments>
                          </new>
                        </right>
                      </assign>
                    </body>
                  </funcexpr>
                </arguments>
              </functioncall>
            </body>
          </funcexpr>
        </arguments>
      </functioncall>
    </program>
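
To show how the tree above is laid out, here is a minimal sketch that pairs each series name with its values. It is an illustration rather than the answer's own code: it uses the standard-library `xml.etree.ElementTree` on a trimmed, hand-copied fragment of the output (the `SAMPLE` string and the `series_by_name` helper are hypothetical names), so it runs without a network request. The same paths should also work on the tree returned by `js2xml.parse`, assuming it is an lxml element.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the tree printed above: only the "series" property is kept,
# and each data array is shortened to its first three values.
SAMPLE = """
<object>
  <property name="series">
    <array>
      <object>
        <property name="name"><string>Average High Temp (°c)</string></property>
        <property name="data">
          <array><number value="6"/><number value="8"/><number value="11"/></array>
        </property>
      </object>
      <object>
        <property name="name"><string>Average Low Temp (°c)</string></property>
        <property name="data">
          <array><number value="2"/><number value="2"/><number value="4"/></array>
        </property>
      </object>
    </array>
  </property>
</object>
"""


def series_by_name(root):
    """Map each series name to its list of values, converted to float."""
    result = {}
    # Each <object> under the "series" array describes one plotted line.
    for obj in root.findall(".//property[@name='series']/array/object"):
        name = obj.findtext("property[@name='name']/string")
        values = [
            float(number.get("value"))
            for number in obj.findall("property[@name='data']/array/number")
        ]
        result[name] = values
    return result


series = series_by_name(ET.fromstring(SAMPLE))
print(series)
```

Keying on the series name avoids relying on the order in which the series appear in the script.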
    

    So to get all the data:

    In [28]: from bs4 import BeautifulSoup  
    In [29]: import requests
    In [30]: import re    
    In [31]: import js2xml    
    In [32]: from itertools import repeat    
    In [33]: from pprint import pprint as pp
    In [34]: soup = BeautifulSoup(requests.get("http://www.worldweatheronline.com/brussels-weather-averages/be.aspx").content, "html.parser")
    
    In [35]: script = soup.find("script", text=re.compile("Highcharts.Chart")).text
    
    In [36]: parsed = js2xml.parse(script)
    
    In [37]: data = [d.xpath(".//array/number/@value") for d in parsed.xpath("//property[@name='data']")]
    
    In [38]: categories = parsed.xpath("//property[@name='categories']//string/text()")
    
    In [39]: output =  list(zip(repeat(categories), data))    
    In [40]: pp(output)
    [(['January',
       'February',
       'March',
       'April',
       'May',
       'June',
       'July',
       'August',
       'September',
       'October',
       'November',
       'December'],
      ['6', '8', '11', '14', '19', '21', '23', '23', '19', '15', '9', '6']),
     (['January',
       'February',
       'March',
       'April',
       'May',
       'June',
       'July',
       'August',
       'September',
       'October',
       'November',
       'December'],
      ['2', '2', '4', '6', '10', '12', '14', '14', '11', '8', '5', '2'])]
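
The values come back as strings, and each series is paired with the whole list of months rather than month by month. A possible follow-up step, sketched here with the lists copied from the output above (the `to_month_maps` helper is a hypothetical name, not part of js2xml), converts the values to numbers and builds one month-to-value mapping per series:

```python
# Lists copied from the output above; in the session they would be the
# `categories` and `data` variables.
categories = [
    "January", "February", "March", "April", "May", "June",
    "July", "August", "September", "October", "November", "December",
]
data = [
    ["6", "8", "11", "14", "19", "21", "23", "23", "19", "15", "9", "6"],
    ["2", "2", "4", "6", "10", "12", "14", "14", "11", "8", "5", "2"],
]


def to_month_maps(categories, data):
    """Return one {month: value} dict per series, with values as floats."""
    # float() is used instead of int() so that decimal values would also parse.
    return [
        dict(zip(categories, (float(v) for v in values)))
        for values in data
    ]


high, low = to_month_maps(categories, data)
print(high["July"], low["July"])
```

The order of `data` follows the order of the series in the script, so `high` and `low` here assume the high-temperature series comes first, as it does in the tree shown earlier.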
    

    As I said, you could just use a regex, but I find js2xml more reliable because stray whitespace and similar formatting differences won't break it.
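
For comparison, a rough sketch of the regex-only route. The `SAMPLE_JS` string is a made-up stand-in for the chart script (the real script is not reproduced in this thread), and `DATA_RE` and `extract_data_arrays` are hypothetical names. The pattern allows optional whitespace around the colon, and `json.loads` handles uneven spacing inside the array, but this is still an assumption-laden shortcut: nested arrays, trailing commas, or non-JSON values would break it.

```python
import json
import re

# Hypothetical stand-in for the inline chart script, with uneven spacing.
SAMPLE_JS = """
series: [{
    name: 'Average High Temp',
    data: [6, 8, 11,14 ,  19]
}, {
    name: 'Average Low Temp',
    data:[2,2,4, 6, 10]
}]
"""

# Capture the bracketed list that follows each `data:` key.
DATA_RE = re.compile(r"data\s*:\s*(\[[^\]]*\])")


def extract_data_arrays(js_source):
    """Return every flat `data: [...]` array in the script as a Python list."""
    return [json.loads(match) for match in DATA_RE.findall(js_source)]


arrays = extract_data_arrays(SAMPLE_JS)
print(arrays)
```

This only recovers the numbers; the month names and series names would need their own patterns, which is part of why parsing the script into a tree tends to be less fragile.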
