Question
I want to get a link to the kind of JSON document that some webpages download after they finish loading. For instance, on this webpage:
But it can be a very different document on a different webpage. Unfortunately, I can't find the link in the page source with Beautiful Soup.
So far I have tried this:
import requests
import json

data = {
    "Device[udid]": "",
    "API_KEY": "",
    "API_SECRET": "",
    "Device[change]": "",
    "fbToken": ""
}

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.119 Safari/537.36"
}

url = "https://data.electionsportal.ge/en/event_type/1/event/38/shape/69898/shape_type/1?data_type=official"

r = requests.post(url, data=data, headers=headers)
data = r.json()
But it returns a JSON decode error:
---------------------------------------------------------------------------
JSONDecodeError Traceback (most recent call last)
<ipython-input-72-189954289109> in <module>
17
18 r = requests.post(url, data=data, headers=headers)
---> 19 data = r.json()
20
C:\ProgramData\Anaconda3\lib\site-packages\requests\models.py in json(self, **kwargs)
895 # used.
896 pass
--> 897 return complexjson.loads(self.text, **kwargs)
898
899 @property
C:\ProgramData\Anaconda3\lib\json\__init__.py in loads(s, encoding, cls, object_hook, parse_float, parse_int, parse_constant, object_pairs_hook, **kw)
346 parse_int is None and parse_float is None and
347 parse_constant is None and object_pairs_hook is None and not kw):
--> 348 return _default_decoder.decode(s)
349 if cls is None:
350 cls = JSONDecoder
C:\ProgramData\Anaconda3\lib\json\decoder.py in decode(self, s, _w)
335
336 """
--> 337 obj, end = self.raw_decode(s, idx=_w(s, 0).end())
338 end = _w(s, end).end()
339 if end != len(s):
C:\ProgramData\Anaconda3\lib\json\decoder.py in raw_decode(self, s, idx)
353 obj, end = self.scan_once(s, idx)
354 except StopIteration as err:
--> 355 raise JSONDecodeError("Expecting value", s, err.value) from None
356 return obj, end
JSONDecodeError: Expecting value: line 1 column 1 (char 0)
Answer 1:
The JSON you are trying to find in the HTML content is loaded by the client through JavaScript with XMLHttpRequests. That means you will not be able to use BeautifulSoup to find the tag in the HTML that contains the URL: it is either inside a <script> block or loaded externally.
Besides, you are trying to parse a webpage written in HTML as if it were JSON, and to access a key (coins) that is not defined anywhere in the webpage or the JSON content.
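A quick way to confirm this is to inspect what the server actually returns before calling .json(). This is only a diagnostic sketch using the URL from the question; if the Content-Type is text/html and the body starts with an HTML doctype, the response cannot be decoded as JSON:

import requests

url = "https://data.electionsportal.ge/en/event_type/1/event/38/shape/69898/shape_type/1?data_type=official"
r = requests.get(url)

# Expecting something like "text/html; charset=utf-8" and a body beginning with
# "<!DOCTYPE html>", which is exactly why r.json() raises "Expecting value" at char 0.
print(r.status_code)
print(r.headers.get("Content-Type"))
print(r.text[:200])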
Solution
Load that JSON directly, without attempting to find the JSON URL with BeautifulSoup on the aforementioned website. By doing so, you will be able to call
r.json()
without errors. Otherwise, check out Selenium, a web driver that allows you to run JavaScript.
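As a minimal sketch of the first option: open the browser's developer tools, watch the Network tab (filtered to XHR) while the page loads, copy the URL of the JSON request, and fetch it directly. The URL below is a placeholder, not a real endpoint from the site:

import requests

# Placeholder URL: replace it with the XHR address copied from the Network tab.
json_url = "https://data.electionsportal.ge/path/to/some.json"

r = requests.get(json_url, headers={"User-Agent": "Mozilla/5.0"})
r.raise_for_status()
data = r.json()  # the body really is JSON now, so decoding succeeds
print(type(data))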
Hope that clears it up.
Answer 2:
This works for both links in your post:
from bs4 import BeautifulSoup
import requests

url = 'https://data.electionsportal.ge/en/event_type/1/event/38/shape/69898/shape_type/1?data_type=official'
r = requests.get(url)
soup = BeautifulSoup(r.text, 'html.parser')

# The first <script> tag contains assignments whose right-hand sides include
# relative paths to the JSON files; split it into statements on ';' and keep
# the part after '='.
splits = [item.split('=', 1)[-1] for item in str(soup.script).split(';')]
# Keep the values that mention 'json', drop the ones containing 'xxx', strip quotes.
filtered_splits = [item.replace('"', '') for item in splits if 'json' in item and 'xxx' not in item]
links_to_jsons = ["https://data.electionsportal.ge" + item for item in filtered_splits]

for item in links_to_jsons:
    r = requests.get(item)
    print(r.json())  # change as you want
Btw, I am guessing that you can construct the JSON links for another page by changing the number 69898 to the number that sits in the same position in that page's URL (as long as it is still on data.electionsportal.ge).
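If that guess holds, a small sketch would look like the following; it continues from the snippet above (so links_to_jsons is already defined), and the replacement shape id is a made-up placeholder:

import requests

other_shape_id = "70007"  # placeholder: take this from the other page's URL

for link in links_to_jsons:
    candidate = link.replace("69898", other_shape_id)
    resp = requests.get(candidate)
    if resp.ok:
        print(candidate)
        print(resp.json())  # will raise if the guessed URL does not return JSON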
Source: https://stackoverflow.com/questions/58239549/how-to-extract-xhr-response-data-from-the-a-website