Use Beautiful Soup in scraping multiple websites

Submitted by 我与影子孤独终老i on 2020-07-19 06:19:30

Question


I want to know why the lists all_links and all_titles never receive any records from the lists titles and links. I have also tried the .extend() method, and it didn't help.

import requests
from bs4 import BeautifulSoup
all_links = []
all_titles = []

def title_link(page_num):
    page = requests.get(
    'https://www.gumtree.pl/s-mieszkania-i-domy-sprzedam-i-kupie/warszawa/page-%d/v%dc9073l3200008p%d'
    % (page_num, page_num, page_num))
    soup = BeautifulSoup(page.content, 'html.parser')
    links = ['https://www.gumtree.pl' + link.get('href')
                for link in soup.find_all('a', class_ ="href-link tile-title-text")]
    titles = [flat.next_element for flat in soup.find_all('a', class_ = "href-link tile-title-text")] 
    print(titles)

for i in range(1,5+1):
    title_link(i)
    all_links = all_links + links
    all_titles = all_titles + titles
    i+=1
    print(all_links)

import pandas as pd
df = pd.DataFrame(data = {'title': all_titles ,'link': all_links})
df.head(100)
#df.to_csv("./gumtree_page_1.csv", sep=';',index=False, encoding = 'utf-8')
#df.to_excel('./gumtree_page_1.xlsx')

Answer 1:


When I ran your code, I got

NameError                                 Traceback (most recent call last)
<ipython-input-3-6fff0b33d73b> in <module>
     16 for i in range(1,5+1):
     17     title_link(i)
---> 18     all_links = all_links + links
     19     all_titles = all_titles + titles
     20     i+=1

NameError: name 'links' is not defined

That points to the problem: the variable links is not defined in the global scope (where you add it to all_links). You can read about Python scopes here. You'd need to return links and titles from title_link, something similar to this:

def title_link(page_num):
    # your code here
    return links, titles


for i in range(1,5+1):
    links, titles = title_link(i)
    all_links = all_links + links
    all_titles = all_titles + titles
    print(all_links)



Answer 2:


This code exhibits confusion about scoping. titles and links inside of title_link are local to that function. When the function ends, that data disappears and cannot be accessed from another scope, such as the top-level script. Use the return keyword to return values from functions. In this case, you'd need to return a pair of titles and links, like return titles, links.

Since functions should do one task only, having to return a pair reveals a possible design flaw. A function like title_link is overloaded and should probably be two separate functions, one to get titles and one to get links, as sketched below.
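
As a minimal illustration of that split (the helper names get_titles and get_links, and passing the parsed soup in as a parameter, are assumptions for this sketch, not part of the original code):

def get_links(soup):
    # Hypothetical helper: absolute URLs of the ads on one results page.
    return ['https://www.gumtree.pl' + a.get('href')
            for a in soup.find_all('a', class_='href-link tile-title-text')]

def get_titles(soup):
    # Hypothetical helper: visible title text of each ad on the page.
    return [a.next_element
            for a in soup.find_all('a', class_='href-link tile-title-text')]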

Having said that, the functions here seem like premature abstractions since the operations can be done directly.

Here's a suggested rewrite:

import pandas as pd
import requests
from bs4 import BeautifulSoup

url = "https://www.gumtree.pl/s-mieszkania-i-domy-sprzedam-i-kupie/warszawa/page-%d/v%dc9073l3200008p%d"
data = {"title": [], "link": []}

for i in range(1, 6):
    page = requests.get(url % (i, i, i))
    soup = BeautifulSoup(page.content, "html.parser")
    titles = soup.find_all("a", class_="href-link tile-title-text")
    data["title"].extend([x.next_element for x in titles])
    data["link"].extend("https://www.gumtree.pl" + x.get("href") for x in titles)

df = pd.DataFrame(data)
print(df.head(100))

Other remarks:

  • i+=1 is unnecessary; for loops move forward automatically in Python.
  • (1,5+1) is clearer as (1, 6).
  • List comprehensions are great, but if they run multiple lines, consider writing them as normal loops or creating an intermediate variable or two.
  • Imports belong at the top of the file; see PEP 8.
  • list.extend(other_list) is preferable to list = list + other_list, which is slow and memory-intensive because it creates a whole copy of the list; see the sketch below.
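
A minimal sketch of that last point (plain Python, no extra assumptions):

a = [1, 2]
b = [3, 4]

c = a + b      # builds a brand-new list by copying both operands; a is unchanged
a.extend(b)    # appends in place; a is now [1, 2, 3, 4] and no new list is created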



Answer 3:


Try this:

import requests
from bs4 import BeautifulSoup
all_links = []
all_titles = []

def title_link(page_num):
    page = requests.get(
    'https://www.gumtree.pl/s-mieszkania-i-domy-sprzedam-i-kupie/warszawa/page-%d/v%dc9073l3200008p%d'
    % (page_num, page_num, page_num))
    page.encoding = 'utf-8'
    soup = BeautifulSoup(page.content, 'html.parser', from_encoding='utf-8')
    links = ['https://www.gumtree.pl' + link.get('href')
             for link in soup.find_all('a', class_="href-link tile-title-text")]
    titles = [flat.next_element for flat in soup.find_all('a', class_="href-link tile-title-text")]
    print(titles)
    return links, titles

for i in range(1,5+1):
    links, titles = title_link(i)
    all_links.extend(links)
    all_titles.extend(titles)
    # i+=1 not needed in python
    print(all_links)

import pandas as pd
df = pd.DataFrame(data={'title': all_titles, 'link': all_links})
df.head(100)

I think you just needed to get links and titles out of title_link(page_num).

Edit: removed the manual incrementing per comments

Edit: changed the all_links = all_links + links to all_links.extend(links)

Edit: the website is utf-8 encoded, so I added page.encoding = 'utf-8' and, as an extra (probably unnecessary) measure, from_encoding='utf-8' to the BeautifulSoup call
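
If you want to confirm what encoding requests detected before forcing one, a quick check (a sketch; the exact output depends on the server's response headers):

import requests

resp = requests.get('https://www.gumtree.pl/')
print(resp.encoding)           # encoding declared in the HTTP headers, if any
print(resp.apparent_encoding)  # encoding guessed from the raw response bytes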



Source: https://stackoverflow.com/questions/60698567/use-beatiful-soup-in-scraping-multiple-websites
