I found this link: https://search.roblox.com/catalog/json?Category=2&Subcategory=2&SortType=4&Direction=2
The original is: https://www.roblox.com/catalog
You can scrape all the item information without BeautifulSoup or Selenium - you just need requests. That being said, it's not super straightforward, so I'll try to break it down:
When you visit a URL, your browser makes many requests to external resources. These resources are hosted on a server (or, nowadays, on several different servers), and they make up all the files/data that your browser needs to properly render the webpage. Just to list a few, these resources can be images, icons, scripts, HTML files, CSS files, fonts, audio, etc. Just for reference, loading www.google.com in my browser makes 36 requests in total to various resources.
The very first resource you make a request to will always be the actual webpage itself, so an HTML-like file. The browser then figures out which other resources it needs to make requests to by looking at that HTML.
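To make that concrete, here is a minimal sketch of how the follow-up requests can be read off the HTML. The page and its file names are made up for the example, and it only looks at script, img and link tags, which is far less than a real browser does.

```python
from html.parser import HTMLParser

# Hypothetical page, made up for this example.
PAGE = """
<html>
  <head>
    <link rel="stylesheet" href="/static/site.css">
    <script src="/static/app.js"></script>
  </head>
  <body>
    <img src="/static/logo.png">
    <a href="/about">About</a>
  </body>
</html>
"""


class ResourceCollector(HTMLParser):
    """Collects URLs that a browser would fetch in order to render the page."""

    def __init__(self):
        super().__init__()
        self.resources = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        # Scripts and images are fetched via "src", stylesheets via <link href>.
        if tag in ("script", "img") and attrs.get("src"):
            self.resources.append(attrs["src"])
        elif tag == "link" and attrs.get("href"):
            self.resources.append(attrs["href"])


collector = ResourceCollector()
collector.feed(PAGE)
print(collector.resources)
```

The plain link to /about is not collected, because the browser only requests it if you click on it.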
For example's sake, let's say the webpage contains a table containing data we want to scrape. The very first thing we should ask ourselves is "How did that table get on that page?". What I mean is, there are different ways in which a webpage is populated with elements/html tags. Here's one such way:
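The browser makes a request to the page.html resource, and the server responds with HTML that already has the table and its data in it.
When scraping a page like this, using BeautifulSoup is standard procedure. You know that the data you're looking for is baked into the HTML, so BeautifulSoup will be able to see it.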
Here's another way in which webpages can be populated with elements:
The browser makes a request to the page.html resource, and the server responds with HTML whose table is still empty. At a later point in time, a script fills the table in.
When I say "a later point in time", that time interval is negligible and practically unnoticeable for actual human beings using actual browsers to view pages. However, the server only served us a "bare-bones" HTML. It's just an empty template, and it's relying on a script to populate its table. That script makes a request to a web API, and the web API replies with the actual table data. All of this takes a finite amount of time, and it can only start once the script resource is loaded to begin with.
When scraping a page like this, you cannot use BeautifulSoup, because it will only see the "bare-bones" template HTML. This is typically where you would use Selenium to simulate a real browsing session.
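The difference between the two kinds of pages can be sketched offline. Both HTML documents below are made up, and the standard library's HTMLParser stands in for BeautifulSoup so that the example has no dependencies. The point is the same either way: a parser only sees what the server actually sent.

```python
from html.parser import HTMLParser

# Both documents are made up for this example.
STATIC_PAGE = """
<table id="data">
  <tr><td>Alice</td><td>30</td></tr>
  <tr><td>Bob</td><td>25</td></tr>
</table>
"""

TEMPLATE_PAGE = """
<table id="data"></table>
<script src="/static/populate_table.js"></script>
"""


class CellCollector(HTMLParser):
    """Collects the text of every <td> cell found in the served HTML."""

    def __init__(self):
        super().__init__()
        self.in_cell = False
        self.cells = []

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self.in_cell = True

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_cell = False

    def handle_data(self, data):
        if self.in_cell:
            self.cells.append(data.strip())


def extract_cells(html):
    parser = CellCollector()
    parser.feed(html)
    return parser.cells


print(extract_cells(STATIC_PAGE))    # the data is baked into the HTML
print(extract_cells(TEMPLATE_PAGE))  # nothing to find: the script never ran
```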
To get back to your roblox page, this page is the second type.
The approach I'm suggesting (which is my favorite, and in my opinion, should be the approach you always try first) simply involves figuring out which web API the page's scripts are making requests to, and then imitating a request to get the data you want. This is my favorite approach because these web APIs often serve JSON, which is trivial to parse. It's super clean because you only need one third-party module (requests).
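As a small illustration of why JSON is so pleasant to work with, here is a sketch that parses a response body by hand. The body is a trimmed-down stand-in whose values are borrowed from the output at the end of this answer. With requests you would normally just call response.json() instead.

```python
import json

# Trimmed-down stand-in for a web API response body. The values come from
# the output shown at the end of this answer; most fields are left out.
body = (
    '{"data": [{"id": 76692143, "itemType": "Asset", '
    '"name": "Chaos Canyon Sugar Egg", "lowestPrice": 400}]}'
)

payload = json.loads(body)  # plain dicts and lists from here on
for item in payload["data"]:
    print(item["name"], item["lowestPrice"])
```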
The first step is to log all the traffic/requests to resources that your browser makes. I'll be using Google Chrome, but other modern browsers probably have similar features:
1. Open Chrome's developer tools and select the "Network" tab.
2. Visit the page (https://www.roblox.com/catalog/?Category=2&Subcategory=2&SortType=4).
3. Filter the logged requests to XHR (XMLHttpRequest or XHR resources are objects which interact with servers. We want to only look at XHR resources because they potentially communicate with web APIs).
Click on one of the items in the list. A panel should open on the right with several tabs. Click on the "Headers" tab to view the request URL, the request and response headers, as well as any cookies (view the "Cookies" tab for a prettier view). If the request URL contains any query string parameters, you can also view them in a prettier format in this tab. Here's what that looks like:
[Screenshot: the "Headers" tab of a selected XHR request]
This tab tells us everything we want to know about imitating our request. It tells us where we should make the request, and how our request should be formulated in order to be accepted. An ill-formed request will be rejected by the web API, and not all web APIs care about the same header fields. For example, some web APIs desperately care about the "User-Agent" header, but in our case, this field is not required. The only reason I know that is because I copied and pasted request headers until the web API wouldn't reject my request anymore - in my solution I'll use the bare minimum to make a valid request.
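That trial and error can be automated. The sketch below works in the opposite direction from what I did by hand: it starts with everything the browser sent and drops one header at a time. The function minimal_headers, the pretend API and the header values are all made up for illustration. In real use, is_accepted would send the request and check the status code.

```python
def minimal_headers(headers, is_accepted):
    """Drop headers one at a time, keeping only those the API insists on."""
    required = dict(headers)
    for name in list(headers):
        trial = {k: v for k, v in required.items() if k != name}
        if is_accepted(trial):
            required = trial  # the request still works without this header
    return required


# Stand-in for a real request: this pretend API only checks two headers.
def fake_api_accepts(headers):
    return "X-CSRF-TOKEN" in headers and "Content-Type" in headers


# Hypothetical set of headers copied from the browser.
browser_headers = {
    "Accept": "application/json, text/plain, */*",
    "Content-Type": "application/json;charset=UTF-8",
    "User-Agent": "Mozilla/5.0",
    "X-CSRF-TOKEN": "hypothetical-token",
}

needed = minimal_headers(browser_headers, fake_api_accepts)
print(needed)
```

Be gentle if you try this against a real server, since every trial is an actual request.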
However, we need to actually figure out which of these XHR objects is responsible for talking to the correct web API - the one that returns the actual information we want to scrape. Select any XHR object from the list and then click on the "Preview" tab to view a parsed version of the data returned by the web API. The assumption is that the web API returned JSON to us - you may have to expand and collapse the tree structure for a bit before you find what you're looking for, but once you do, you know this XHR object is the one whose request we need to imitate. I happen to know that the data we're interested in is in the XHR object named "details". Here's what part of the expanded JSON looks like in the "Preview" tab:
[Screenshot: expanded JSON in the "Preview" tab of the "details" request]
As you can see, the response we got from this web API (https://catalog.roblox.com/v1/catalog/items/details) contains all the interesting data we want to scrape!
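If clicking through the tree gets tedious, you can also save a response and search it for a value you can see on the rendered page. This is only a sketch: find_paths is a made-up helper, and the sample data is a trimmed stand-in built from the output at the end of this answer.

```python
def find_paths(node, needle, path="$"):
    """Yield the path of every string value that contains `needle`."""
    if isinstance(node, dict):
        for key, value in node.items():
            yield from find_paths(value, needle, f"{path}.{key}")
    elif isinstance(node, list):
        for index, value in enumerate(node):
            yield from find_paths(value, needle, f"{path}[{index}]")
    elif isinstance(node, str) and needle in node:
        yield path


# Trimmed stand-in for a parsed response.
response_json = {
    "data": [
        {"id": 76692143, "name": "Chaos Canyon Sugar Egg", "creatorName": "ROBLOX"},
    ]
}

paths = list(find_paths(response_json, "Sugar Egg"))
print(paths)
```

A response in which the search comes back empty is probably not the one you want to imitate.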
This is where things get sort of esoteric, and specific to this particular webpage (everything up until now you can use to scrape stuff from other pages via web APIs). Here's what happens when you visit https://www.roblox.com/catalog/?Category=2&Subcategory=2&SortType=4:
1. The browser requests the page itself. The HTML that comes back has a CSRF/XSRF token embedded in it.
2. An XHR object makes a request to the web API https://catalog.roblox.com/v1/search/items?category=Collectibles&limit=60&sortType=4&subcategory=Collectibles (notice the query string parameters). The response is JSON. It contains a list of item descriptors, each with an id and an item type.
3. Then, some time later, another XHR object ("details") makes an HTTP POST request to the web API https://catalog.roblox.com/v1/catalog/items/details (refer to the first and second screenshots). This request is only accepted by the web API if it contains the right cookies and the previously mentioned CSRF/XSRF token. In addition, this request also needs a payload containing the asset ids whose information we want to scrape - failure to provide this also results in a rejection.
So, it's a bit tricky. The request of one XHR object depends on the response of another.
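The query string of the search URL above is just a set of key/value pairs, and it helps to see it as a dictionary before writing any request code. A minimal sketch using only the standard library follows. Changing limit to 30 is purely an illustration, and I haven't checked which values the API accepts.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit

url = (
    "https://catalog.roblox.com/v1/search/items"
    "?category=Collectibles&limit=60&sortType=4&subcategory=Collectibles"
)

parts = urlsplit(url)
endpoint = f"{parts.scheme}://{parts.netloc}{parts.path}"
params = dict(parse_qsl(parts.query))

print(endpoint)
print(params)

# Changing a value and re-encoding gives the query string for another search.
params["limit"] = "30"
query = urlencode(params)
print(query)
```

requests does the encoding for you if you pass such a dictionary via the params argument.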
So, here's the script. It first creates a requests.Session to keep track of cookies. We define a dictionary params (which is really just our query string) - you can change these values to suit your needs. The way it's written now, it pulls the first 60 items from the "Collectibles" category. Then, we get the CSRF/XSRF token from the HTML body with a regular expression. We get the ids of the first 60 items according to our params, and generate a dictionary/payload that the final web API request will accept. We make the final request, create a list of items (dictionaries), and print the keys and values of the first item of our query.
import json
import re
import sys

import requests


def get_csrf_token(session):
    # The token sits in the page HTML inside a setToken('...') call.
    url = "https://www.roblox.com/catalog/"
    response = session.get(url)
    response.raise_for_status()

    token_pattern = r"setToken\('(?P<csrf_token>[^\)]+)'\)"
    match = re.search(token_pattern, response.text)
    if match is None:
        raise RuntimeError("CSRF token not found in the page HTML")
    return match.group("csrf_token")


def get_assets(session, params):
    # Search the catalog and reshape the results into the payload
    # that the "details" endpoint accepts.
    url = "https://catalog.roblox.com/v1/search/items"
    response = session.get(url, params=params)
    response.raise_for_status()

    return {"items": [{**d, "key": f"{d['itemType']}_{d['id']}"} for d in response.json()["data"]]}


def get_items(session, csrf_token, assets):
    # The POST is only accepted with the session cookies and the CSRF token.
    url = "https://catalog.roblox.com/v1/catalog/items/details"
    headers = {
        "Content-Type": "application/json;charset=UTF-8",
        "X-CSRF-TOKEN": csrf_token
    }

    response = session.post(url, data=json.dumps(assets), headers=headers)
    response.raise_for_status()

    return response.json()["data"]


def main():
    # The session keeps track of cookies between the three requests.
    session = requests.Session()

    # This is the query string of the search request.
    params = {
        "category": "Collectibles",
        "limit": "60",
        "sortType": "4",
        "subcategory": "Collectibles"
    }

    csrf_token = get_csrf_token(session)
    assets = get_assets(session, params)
    items = get_items(session, csrf_token, assets)

    first_item = items[0]
    for key, value in first_item.items():
        print(f"{key}: {value}")

    return 0


if __name__ == "__main__":
    sys.exit(main())
Output:
id: 76692143
itemType: Asset
assetType: 8
name: Chaos Canyon Sugar Egg
description: This highly collectible commemorative egg recalls that *other* classic ROBLOX level, the one that was never quite as popular as Crossroads.
productId: 11837951
genres: ['All']
itemStatus: []
itemRestrictions: ['Limited']
creatorType: User
creatorTargetId: 1
creatorName: ROBLOX
lowestPrice: 400
purchaseCount: 7714
favoriteCount: 2781
>>>
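If the script ever stops finding the token, the regular expression is the first thing I'd check, and you can do that without touching the network. The HTML fragment below is made up, so the real page and its token will look different. Only the setToken('...') shape is taken from the script.

```python
import re

# Same pattern shape as in get_csrf_token: capture whatever sits between
# the quotes of a setToken('...') call.
TOKEN_PATTERN = re.compile(r"setToken\('(?P<csrf_token>[^\)]+)'\)")

# Made-up HTML fragment, not copied from the real page.
html = "<script>setToken('abc123XYZ');</script>"

match = TOKEN_PATTERN.search(html)
token = match.group("csrf_token") if match else None
print(token)
```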