How to scrape real-time streaming data with Python?

故里飘歌 2021-01-31 12:59

I was trying to scrape the number of flights from this webpage: https://www.flightradar24.com/56.16,-49.51

The number is highlighted in the picture below:

The num

3 Answers
  • 2021-01-31 13:18

    So based on what @Andre has found out, I wrote this code:

    import time
    
    import requests
    
    def get_count():
        url = "https://data-live.flightradar24.com/zones/fcgi/feed.js?bounds=59.09,52.64,-58.77,-47.71&faa=1&mlat=1&flarm=1&adsb=1&gnd=1&air=1&vehicles=1&estimated=1&maxage=7200&gliders=1&stats=1"
    
        # Send a browser-like User-Agent header, otherwise you will get a 403 HTTP error
        r = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'}, timeout=10)
        # Fail loudly on any other HTTP error instead of parsing an error page
        r.raise_for_status()
    
        # Parse the JSON
        data = r.json()
    
        # Sum the per-source counts to get the total number of flights
        return sum(data["stats"]["total"].values())
    
    while True:
        print(get_count())
        time.sleep(8)
    

    The code should be self-explanatory: all it does is print the current flight count every 8 seconds :)

    Note: The values are similar to the ones on the website, but not identical. This is because the Python script and the website are unlikely to send their requests at the same moment. If you want results that track the website more closely, make a request more often, for example every 4 seconds.
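
The polling interval from the note can be made a parameter. The following is a minimal sketch and not part of the original answer: `poll_counts` and its `fetch`, `interval`, `rounds`, and `sleep` parameters are hypothetical names, and `fetch` stands in for any callable that returns one count, such as `get_count` above.

```python
import time


def poll_counts(fetch, interval=4.0, rounds=3, sleep=time.sleep):
    """Call fetch() `rounds` times, pausing `interval` seconds between calls.

    fetch is any zero-argument callable returning a number (a hypothetical
    stand-in for get_count). A failed call is recorded as None so that one
    bad response does not end the loop.
    """
    results = []
    for i in range(rounds):
        try:
            results.append(fetch())
        except Exception:  # e.g. a network error or malformed JSON
            results.append(None)
        if i < rounds - 1:  # no need to wait after the last call
            sleep(interval)
    return results


if __name__ == "__main__":
    # Demo with a fake data source instead of the live feed
    fake = iter([11780, 11782, 11779])
    print(poll_counts(lambda: next(fake), interval=0.1))
```

Passing `sleep` as an argument is only there so the loop can be exercised without actually waiting; with the default it behaves like the `time.sleep` call in the answer.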

    Use this code as you want, extend it or whatever. Hope this helps!

  • 2021-01-31 13:25

    The problem with your approach is that the page first loads a view and then sends periodic requests to refresh the data. If you look at the Network tab of the developer console in Chrome (for example), you'll see requests to https://data-live.flightradar24.com/zones/fcgi/feed.js?bounds=59.09,52.64,-58.77,-47.71&faa=1&mlat=1&flarm=1&adsb=1&gnd=1&air=1&vehicles=1&estimated=1&maxage=7200&gliders=1&stats=1
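
The query string of that request can be assembled from its parts, which makes it easier to ask for a different map area. This is a rough sketch rather than a documented API: `build_feed_url` and `FEED_ENDPOINT` are hypothetical names, the parameter values are copied from the URL above, and the meaning of the four `bounds` numbers is not stated anywhere in this thread.

```python
from urllib.parse import urlencode

FEED_ENDPOINT = "https://data-live.flightradar24.com/zones/fcgi/feed.js"


def build_feed_url(bounds, endpoint=FEED_ENDPOINT):
    """Assemble the feed URL for a map area.

    bounds is the four-number tuple sent in the `bounds` parameter. Its
    meaning is not documented in this thread; north, south, west, east is
    only a guess based on the values.
    """
    params = {
        "bounds": ",".join(str(b) for b in bounds),
        "faa": 1, "mlat": 1, "flarm": 1, "adsb": 1, "gnd": 1, "air": 1,
        "vehicles": 1, "estimated": 1, "maxage": 7200, "gliders": 1,
        "stats": 1,
    }
    # safe="," keeps the commas in `bounds` unescaped, as in the browser request
    return endpoint + "?" + urlencode(params, safe=",")


if __name__ == "__main__":
    print(build_feed_url((59.09, 52.64, -58.77, -47.71)))
```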

    The response to that request is regular JSON:

    {
      "full_count": 11879,
      "version": 4,
      "afefdca": [
        "A86AB5",
        56.4288,
        -56.0721,
        233,
        38000,
        420,
        "0000",
        "T-F5M",
        "B763",
        "N641UA",
        1473852497,
        "LHR",
        "ORD",
        "UA929",
        0,
        0,
        "UAL929",
        0
      ],
      ...
      "aff19d9": [
        "A12F78",
        56.3235,
        -49.3597,
        251,
        36000,
        436,
        "0000",
        "F-EST",
        "B752",
        "N176AA",
        1473852497,
        "DUB",
        "JFK",
        "AA291",
        0,
        0,
        "AAL291",
        0
      ],
      "stats": {
        "total": {
          "ads-b": 8521,
          "mlat": 2045,
          "faa": 598,
          "flarm": 152,
          "estimated": 464
        },
        "visible": {
          "ads-b": 0,
          "mlat": 0,
          "faa": 6,
          "flarm": 0,
          "estimated": 3
        }
      }
    }
    

    I'm not sure whether this API is protected in any way, but I can access it with curl without any issues.
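
Once decoded, a response shaped like the sample above can be split into flight entries and totals. The sketch below assumes exactly that shape. `summarize_feed` and `describe` are hypothetical helpers, and the index positions used in `describe` are inferred from the sample values (LHR, ORD, UA929), not taken from any documentation.

```python
def summarize_feed(data):
    """Split a decoded feed response into flight entries and a total count."""
    # In the sample, each aircraft is a key whose value is a list; the other
    # keys ("full_count", "version", "stats") are metadata.
    flights = {key: value for key, value in data.items() if isinstance(value, list)}
    total = sum(data.get("stats", {}).get("total", {}).values())
    return flights, total


def describe(entry):
    """Label a few fields of one flight entry.

    The index positions are guesses based on the sample values, so treat
    them as assumptions.
    """
    return {
        "latitude": entry[1],
        "longitude": entry[2],
        "origin": entry[11],
        "destination": entry[12],
        "flight": entry[13],
    }


if __name__ == "__main__":
    sample = {
        "full_count": 11879,
        "version": 4,
        "afefdca": ["A86AB5", 56.4288, -56.0721, 233, 38000, 420, "0000",
                    "T-F5M", "B763", "N641UA", 1473852497, "LHR", "ORD",
                    "UA929", 0, 0, "UAL929", 0],
        "stats": {"total": {"ads-b": 8521, "mlat": 2045, "faa": 598,
                            "flarm": 152, "estimated": 464}},
    }
    flights, total = summarize_feed(sample)
    print(total, describe(flights["afefdca"]))
```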

    More info:

    • aviation.stackexchange - Is there an API to get real-time FAA flight data?
    • Flightradar24 Forum - API access (meaning your use case is probably discouraged)
  • 2021-01-31 13:39

    You can use Selenium to crawl a webpage whose dynamic content is added by JavaScript.

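    from bs4 import BeautifulSoup
    from selenium import webdriver
    
    # PhantomJS support was removed in Selenium 4, so headless Chrome is used here instead
    options = webdriver.ChromeOptions()
    options.add_argument("--headless")
    browser = webdriver.Chrome(options=options)
    
    try:
        browser.get('https://www.flightradar24.com/56.16,-49.51/3')
    
        # Parse the page source as rendered by the browser
        soup = BeautifulSoup(browser.page_source, "html.parser")
        result = soup.find_all("span", {"id": "menuPlanesValue"})
    
        for item in result:
            print(item.text)
    finally:
        # Always close the browser, even if loading or parsing fails
        browser.quit()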
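
Reading `page_source` right after `get()` may capture the page before the counter has been filled in, so an explicit wait can help. The following is a sketch under assumptions, not part of the original answer: `read_flight_count` and `parse_count` are hypothetical helpers, the element id is the one used above, and the counter text is assumed to be a number with optional thousands separators such as "11,879".

```python
import re


def parse_count(text):
    """Turn a counter label such as "11,879" into an int (None if no digits)."""
    digits = re.sub(r"\D", "", text)
    return int(digits) if digits else None


def read_flight_count(driver, element_id="menuPlanesValue", timeout=15):
    """Wait until the element has non-empty text, then return it as an int."""
    # Imported here so parse_count stays usable without Selenium installed
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait

    # until() retries while the element is missing or its text is still empty,
    # and raises TimeoutException once `timeout` seconds have passed
    text = WebDriverWait(driver, timeout).until(
        lambda d: d.find_element(By.ID, element_id).text.strip()
    )
    return parse_count(text)
```

With a driver created as in the answer, `read_flight_count(browser)` would be called after `browser.get(...)` and before `browser.quit()`.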
    