How do I get this information out of this website?

北海茫月 2021-01-17 06:15

I found this link: https://search.roblox.com/catalog/json?Category=2&Subcategory=2&SortType=4&Direction=2

The original is: https://www.roblox.com/catalog

2 answers
  •  有刺的猬
    2021-01-17 06:44

    You can scrape all the item information without BeautifulSoup or Selenium - you just need requests. That said, it's not entirely straightforward, so I'll try to break it down:

    When you visit a URL, your browser makes many requests to external resources. These resources are hosted on a server (or, nowadays, on several different servers), and they make up all the files/data that your browser needs to properly render the webpage. To list a few, these resources can be images, icons, scripts, HTML files, CSS files, fonts, audio, etc. For reference, loading www.google.com in my browser makes 36 requests in total to various resources.

    The very first resource you make a request to will always be the actual webpage itself, so an HTML-like file. The browser then figures out which other resources it needs to make requests to by looking at that HTML.

    For example's sake, let's say the webpage contains a table containing data we want to scrape. The very first thing we should ask ourselves is "How did that table get on that page?". What I mean is, there are different ways in which a webpage is populated with elements/html tags. Here's one such way:

    • The server receives a request from our browser for the page.html resource
    • That resource contains a table, and that table needs data, so the server communicates with a database to retrieve the data for the table
    • The server takes that table-data and bakes it into the HTML
    • Finally, the server serves that HTML file to you
    • What you receive is an HTML file with the table data baked in. There is no way for you to communicate with the previously mentioned database - this is fine and desirable
    • Your browser renders the HTML

    When scraping a page like this, using BeautifulSoup is standard procedure. You know that the data you're looking for is baked into the HTML, so BeautifulSoup will be able to see it.
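    To make the first case concrete, here is a minimal sketch of pulling rows out of HTML that already contains the data. It uses the standard library's html.parser as a stand-in for BeautifulSoup so that it runs without extra installs; the HTML snippet, the item names and the helper names (TableParser, extract_rows) are made up for illustration.

```python
from html.parser import HTMLParser

# Made-up HTML standing in for a server-rendered page: the rows are already in the markup.
PAGE_HTML = """
<html><body>
<table id="items">
  <tr><th>Name</th><th>Price</th></tr>
  <tr><td>Red Hat</td><td>400</td></tr>
  <tr><td>Blue Cape</td><td>250</td></tr>
</table>
</body></html>
"""


class TableParser(HTMLParser):
    """Collects the text of every <td>/<th> cell, grouped by table row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._in_cell = True
            self._row.append("")

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None

    def handle_data(self, data):
        # Only text inside a cell is kept; whitespace between tags is ignored.
        if self._in_cell:
            self._row[-1] += data.strip()


def extract_rows(html):
    parser = TableParser()
    parser.feed(html)
    return parser.rows


rows = extract_rows(PAGE_HTML)
print(rows)
```

    Because the data is part of the HTML the server sent, a plain parser sees every row. With BeautifulSoup the same idea would be a few lines shorter.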

    Here's another way in which webpages can be populated with elements:

    • The server receives a request from our browser for the page.html resource
    • That resource requires another resource - a script, whose job it is to populate the table with data at a later point in time
    • The server serves that HTML file to you (it contains no actual table-data)

    When I say "a later point in time", that time interval is negligible and practically unnoticeable for actual human beings using actual browsers to view pages. However, the server only served us a "bare-bones" HTML. It's just an empty template, and it's relying on a script to populate its table. That script makes a request to a web API, and the web API replies with the actual table-data. All of this takes a finite amount of time, and it can only start once the script resource is loaded to begin with.

    When scraping a page like this, BeautifulSoup on its own is not enough, because it will only see the "bare-bones" template HTML. This is typically where you would use Selenium to simulate a real browsing session.
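    The difference between the two cases can be sketched offline. In the snippet below, the template HTML, the script path and the JSON string are all made up; they only stand in for what a server and a web API might return.

```python
import json
import re

# Made-up "bare-bones" template: the table body is empty until a script fills it in.
TEMPLATE_HTML = """
<table id="items"><tbody></tbody></table>
<script src="/static/populate-table.js"></script>
"""

# Made-up JSON standing in for what that script would fetch from a web API.
API_RESPONSE = '{"data": [{"name": "Red Hat", "price": 400}, {"name": "Blue Cape", "price": 250}]}'

# Searching the template for table cells finds nothing at all.
cells_in_template = re.findall(r"<td[^>]*>(.*?)</td>", TEMPLATE_HTML, flags=re.S)

# The same data is directly available from the API response.
rows = [(item["name"], item["price"]) for item in json.loads(API_RESPONSE)["data"]]

print(cells_in_template)
print(rows)
```

    An HTML parser pointed at the template comes back empty-handed, while the JSON needs nothing more than json.loads.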

    To get back to your Roblox page: it is of the second type.

    The approach I'm suggesting (which is my favorite and, in my opinion, the one you should always try first) simply involves figuring out which web API the page's scripts are making requests to, and then imitating such a request to get the data you want. I prefer this approach because these web APIs often serve JSON, which is trivial to parse. It's also clean, because you only need one third-party module (requests).

    The first step is to log all the traffic/requests to resources that your browser makes. I'll be using Google Chrome, but other modern browsers probably have similar features:

    1. Open Google Chrome and navigate to the target page (https://www.roblox.com/catalog/?Category=2&Subcategory=2&SortType=4)
    2. Hit F12 to open the Chrome Developer Tools menu
    3. Click on the "Network" tab
    4. Click on the "Filter" button (The icon is funnel-shaped), and then change the filter selection from "All" to "XHR" (XMLHttpRequest or XHR resources are objects which interact with servers. We want to only look at XHR resources because they potentially communicate with web APIs)
    5. Click on the round "Record" button (or press CTRL + E) to enable logging - the icon should turn red once enabled
    6. Press CTRL + R to refresh the page and begin logging traffic
    7. After refreshing the page, you should see the resource-log start to fill up. This is a list of all resources our browser requested - we'll only see XHR objects though since we set up our filter (if you're curious, you can switch the filter back to "All" to see a list of all requests to resources made)

    Click on one of the items in the list. A panel should open on the right with several tabs. Click on the "Headers" tab to view the request URL, the request and response headers, as well as any cookies (view the "Cookies" tab for a prettier view). If the Request URL contains any query string parameters, you can also view them in a prettier format in this tab. Here's what that looks like: [screenshot of the "Headers" tab]
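    The query-string breakdown that the "Headers" tab shows can also be reproduced in Python with the standard library. A minimal sketch, using the search URL that appears later in this answer:

```python
from urllib.parse import parse_qs, urlencode, urlsplit

request_url = (
    "https://catalog.roblox.com/v1/search/items"
    "?category=Collectibles&limit=60&sortType=4&subcategory=Collectibles"
)

# Split the URL and turn its query string into a plain dictionary.
parts = urlsplit(request_url)
params = {name: values[0] for name, values in parse_qs(parts.query).items()}

for name, value in params.items():
    print(f"{name}: {value}")

# Rebuilding the URL from the dictionary gives back the original string.
rebuilt = f"{parts.scheme}://{parts.netloc}{parts.path}?{urlencode(params)}"
print(rebuilt == request_url)
```

    A dictionary like this is what you would later hand to requests as the params argument instead of gluing the URL together by hand.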

    This tab tells us everything we want to know about imitating our request. It tells us where we should make the request, and how our request should be formulated in order to be accepted. An ill-formed request will be rejected by the web API - not all web APIs care about the same header fields. For example, some web APIs desperately care about the "User-Agent" header, but in our case, this field is not required. The only reason I know that is because I copied and pasted request headers until the web API wouldn't reject my request anymore - in my solution I'll use the bare minimum to make a valid request.
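    That trial-and-error search for the bare minimum can be automated. The sketch below goes in the opposite direction to what I did by hand: it starts from everything the browser sent and drops one header at a time while the request keeps being accepted. minimal_headers and its send hook are hypothetical helpers, not part of requests, and fake_send only simulates a server so that the example runs offline.

```python
def minimal_headers(send, headers):
    """Drop each header in turn and leave it out if the request still succeeds.

    `send` is any callable that takes a headers dict and returns True when
    the server accepted the request (a hypothetical hook for illustration).
    """
    required = dict(headers)
    for name in list(headers):
        trial = {k: v for k, v in required.items() if k != name}
        if send(trial):
            required = trial
    return required


# Stand-in for a real request: this fake API only insists on two headers.
def fake_send(headers):
    return "Content-Type" in headers and "X-CSRF-TOKEN" in headers


# Example values only; a real set would be copied from the "Headers" tab.
browser_headers = {
    "User-Agent": "Mozilla/5.0",
    "Accept": "application/json, text/plain, */*",
    "Content-Type": "application/json;charset=UTF-8",
    "X-CSRF-TOKEN": "example-token",
}

print(minimal_headers(fake_send, browser_headers))
```

    Against a real API, send would wrap an actual request and report whether the response status indicates success. Keep in mind that this fires one request per header, so use it sparingly.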

    However, we still need to figure out which of these XHR objects is responsible for talking to the correct web API - the one that returns the actual information we want to scrape. Select any XHR object from the list and then click on the "Preview" tab to view a parsed version of the data returned by the web API. The assumption is that the web API returned JSON to us - you may have to expand and collapse the tree structure for a bit before you find what you're looking for, but once you do, you know this XHR object is the one whose request we need to imitate. I happen to know that the data we're interested in is in the XHR object named "details". Here's what part of the expanded JSON looks like in the "Preview" tab: [screenshot of the expanded JSON]
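    If expanding and collapsing the tree by hand gets tedious, you can copy the response body out of the browser and flatten it in Python instead. A minimal sketch: iter_paths is a hypothetical helper, and the sample is a trimmed-down structure shaped like the "details" response, with values taken from the output shown at the end of this answer.

```python
import json


def iter_paths(node, prefix=""):
    """Yield (path, value) for every leaf in a parsed JSON structure."""
    if isinstance(node, dict):
        for key, value in node.items():
            yield from iter_paths(value, f"{prefix}.{key}" if prefix else key)
    elif isinstance(node, list):
        for index, value in enumerate(node):
            yield from iter_paths(value, f"{prefix}[{index}]")
    else:
        yield prefix, node


# Trimmed-down sample; a real response has many more fields per item.
sample = json.loads(
    '{"data": [{"id": 76692143, "name": "Chaos Canyon Sugar Egg", "genres": ["All"]}]}'
)

for path, value in iter_paths(sample):
    print(path, "=", value)
```

    Each printed path tells you how to index into response.json() to reach that value. Empty lists and dictionaries have no leaves, so they do not show up in the listing.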

    As you can see, the response we got from this web API (https://catalog.roblox.com/v1/catalog/items/details) contains all the interesting data we want to scrape!

    This is where things get sort of esoteric, and specific to this particular webpage (everything up until now you can use to scrape stuff from other pages via web APIs). Here's what happens when you visit https://www.roblox.com/catalog/?Category=2&Subcategory=2&SortType=4:

    • Your browser gets some cookies that persist and a CSRF/XSRF Token is generated and baked into the HTML of the page
    • Eventually, one of the XHR objects (the one that starts with "items?") makes an HTTP GET request (cookies required!) to the web API https://catalog.roblox.com/v1/search/items?category=Collectibles&limit=60&sortType=4&subcategory=Collectibles (notice the query string parameters). The response is JSON containing a list of item descriptors: [screenshot of the JSON response]

    Then, some time later, another XHR object ("details") makes an HTTP POST request to the web API https://catalog.roblox.com/v1/catalog/items/details (refer to first and second screenshots). This request is only accepted by the web API if it contains the right cookies and the previously mentioned CSRF/XSRF token. In addition, this request also needs a payload containing the asset ids whose information we want to scrape - failure to provide this also results in a rejection.
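    Pulling the token out of the page HTML is a one-line regular expression search. The sketch below runs the pattern used in the script further down against a made-up snippet; the real markup around setToken(...) may look different.

```python
import re

# Same pattern as in the script, with the named group written out.
TOKEN_PATTERN = r"setToken\('(?P<csrf_token>[^\)]+)'\)"

# Made-up snippet for illustration; the token value is not a real one.
sample_html = "<script>setToken('AbC123xyz');</script>"

match = re.search(TOKEN_PATTERN, sample_html)
csrf_token = match.group("csrf_token") if match else None
print(csrf_token)
```

    If the site ever changes how it embeds the token, match will be None, which is the first thing to check when the POST request suddenly starts being rejected.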

    So, it's a bit tricky. The request of one XHR object depends on the response of another.
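    The dependency boils down to reshaping one JSON document into another. A minimal offline sketch, assuming the search response has a "data" list whose entries carry "id" and "itemType" (which is what the script below relies on); the id is taken from the output at the end of this answer.

```python
import json

# Trimmed-down sample shaped like the search response.
search_response = {"data": [{"id": 76692143, "itemType": "Asset"}]}

# Each descriptor is copied and given a "key" field of the form "<itemType>_<id>".
payload = {
    "items": [
        {**item, "key": f"{item['itemType']}_{item['id']}"}
        for item in search_response["data"]
    ]
}

print(json.dumps(payload))
```

    This dictionary, serialized as JSON, is the body of the POST request to the "details" endpoint.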

    So, here's the script. It first creates a requests.Session to keep track of cookies. We define a dictionary params (which is really just our query string) - you can change these values to suit your needs. The way it's written now, it pulls the first 60 items from the "Collectibles" category. Then, we get the CSRF/XSRF token from the HTML body with a regular expression. We get the ids of the first 60 items according to our params, and generate a dictionary/payload that the final web API request will accept. We make the final request, create a list of items (dictionaries), and print the keys and values of the first item of our query.

    def get_csrf_token(session):
    
        import re
    
        url = "https://www.roblox.com/catalog/"
    
        response = session.get(url)
        response.raise_for_status()
    
        # The named group (?P<csrf_token>...) captures the token passed to setToken('...')
        token_pattern = "setToken\\('(?P<csrf_token>[^\\)]+)'\\)"
    
        match = re.search(token_pattern, response.text)
        assert match
        return match.group("csrf_token")
    
    def get_assets(session, params):
    
        url = "https://catalog.roblox.com/v1/search/items"
    
        response = session.get(url, params=params)
        response.raise_for_status()
    
        # Build the payload for the details endpoint: each item descriptor plus a "<itemType>_<id>" key
        return {"items": [{**d, "key": f"{d['itemType']}_{d['id']}"} for d in response.json()["data"]]}
    
    def get_items(session, csrf_token, assets):
    
        import json
    
        url = "https://catalog.roblox.com/v1/catalog/items/details"
    
        headers = {
            "Content-Type": "application/json;charset=UTF-8",
            "X-CSRF-TOKEN": csrf_token
        }
    
        response = session.post(url, data=json.dumps(assets), headers=headers)
        response.raise_for_status()
    
        items = response.json()["data"]
        return items
    
    def main():
    
        import requests
    
        session = requests.Session()
    
        params = {
            "category": "Collectibles",
            "limit": "60",
            "sortType": "4",
            "subcategory": "Collectibles"
        }
    
        csrf_token = get_csrf_token(session)
        assets = get_assets(session, params)
        items = get_items(session, csrf_token, assets)
    
        first_item = items[0]
    
        for key, value in first_item.items():
            print(f"{key}: {value}")
        
        return 0
    
    
    if __name__ == "__main__":
        import sys
        sys.exit(main())
    

    Output:

    id: 76692143
    itemType: Asset
    assetType: 8
    name: Chaos Canyon Sugar Egg
    description: This highly collectible commemorative egg recalls that *other* classic ROBLOX level, the one that was never quite as popular as Crossroads.
    productId: 11837951
    genres: ['All']
    itemStatus: []
    itemRestrictions: ['Limited']
    creatorType: User
    creatorTargetId: 1
    creatorName: ROBLOX
    lowestPrice: 400
    purchaseCount: 7714
    favoriteCount: 2781