Python - Issue Scraping with BeautifulSoup

Question

I'm trying to scrape the Stack Overflow jobs page with Beautiful Soup 4 and urllib as a personal project. I want to collect the links to all 50 jobs listed on each page, and I'm using a regex to identify them. Even though I reference the tag properly, I'm running into two specific issues:

Instead of the 50 links clearly visible in the source code, I get only 25 results each time (after accounting for and removing an irrelevant initial link).
The links in my output are ordered differently from the links in the source code.

Here's my code. Any help on this will be greatly appreciated:

import bs4
import urllib.request
import re

# Obtain the page source to parse
sauce = urllib.request.urlopen('https://stackoverflow.com/jobs?med=site-ui&ref=jobs-tab&sort=p&pg=0').read()

soup = bs4.BeautifulSoup(sauce, 'html.parser')

# Collect every JSON-LD <script> tag and flatten the result into one string
snippet = soup.find_all("script", type="application/ld+json")
strsnippet = str(snippet)

print(strsnippet)

# Pull anything that looks like an https URL out of that string
# (raw string so that \( and \) reach the regex engine unchanged)
joburls = re.findall(r'https://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', strsnippet)

print("Urls: ", joburls)
print(len(joburls))

Answer 1:

Disclaimer: I did some asking of my own for a part of this answer.

from bs4 import BeautifulSoup
import requests
import json

# note: link is slightly different; yours just redirects here
link = 'https://stackoverflow.com/jobs?med=site-ui&ref=jobs-tab&sort=p'
r = requests.get(link)
soup = BeautifulSoup(r.text, 'html.parser')

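# find() returns only the first matching <script> tag; its text is a JSON document
s = soup.find('script', type='application/ld+json')
# each entry of 'itemListElement' is a dict that carries the job link under 'url'
urls = [el['url'] for el in json.loads(s.text)['itemListElement']]

print(len(urls))
# Output: 50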

Process:

Use soup.find rather than soup.find_all. It returns a single bs4.element.Tag whose text is a JSON document.
json.loads(s.text) parses that text into a nested dict. The itemListElement key holds a list of dicts, one per job; collect the url value of each into a list.
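
To make the second step easier to follow without hitting the live page, here is a minimal offline sketch of the same extraction. The payload below is hypothetical: only the itemListElement and url keys come from the code above, while the other keys, the example.com URLs, and the helper name extract_urls are assumptions for illustration. In the real script the string would come from s.text.

```python
import json

# Hypothetical JSON-LD payload standing in for s.text.
# Only 'itemListElement' and 'url' are taken from the answer's code;
# every other key and all URLs here are made up for illustration.
sample_json_ld = """
{
  "@context": "http://schema.org",
  "@type": "ItemList",
  "itemListElement": [
    {"@type": "ListItem", "position": 1, "url": "https://example.com/jobs/1"},
    {"@type": "ListItem", "position": 2, "url": "https://example.com/jobs/2"},
    {"@type": "ListItem", "position": 3, "url": "https://example.com/jobs/3"}
  ]
}
"""


def extract_urls(json_ld_text):
    """Return the 'url' of each entry under 'itemListElement', in list order."""
    data = json.loads(json_ld_text)          # nested dict
    return [item["url"] for item in data["itemListElement"]]


urls = extract_urls(sample_json_ld)
print(len(urls))
print(urls)
```

Because the parsed JSON keeps the list order, the URLs come back in the same order as they appear in the payload, with no regex involved.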

Source: https://stackoverflow.com/questions/44957324/python-issue-scraping-with-beautifulsoup

Tags

python-3.x

web-scraping

beautifulsoup

urllib