Multiprocessing for web scraping won't start on Windows and Mac

Submitted by 二次信任 on 2020-02-25 21:56:48

Question


I asked a question here about multiprocessing a few days ago, and one user sent me the answer you can see below. The only problem is that the answer worked on his machine and does not work on mine.

I have tried it on Windows (Python 3.6) and on Mac (Python 3.8). I have run the code in the basic Python IDLE that came with the installation, in PyCharm on Windows, and in a Jupyter Notebook, and nothing happens. I have 32-bit Python. This is the code:

from bs4 import BeautifulSoup
import requests
from datetime import date, timedelta
from multiprocessing import Pool
import tqdm

headers = {'User-Agent':'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36'}

def parse(url):
    print("im in function")

    response = requests.get(url[4], headers = headers)
    soup = BeautifulSoup(response.text, 'html.parser')

    all_skier_names = soup.find_all("div", class_ = "g-xs-10 g-sm-9 g-md-4 g-lg-4 justify-left bold align-xs-top")
    all_countries = soup.find_all("span", class_ = "country__name-short")

    discipline = url[0]
    season = url[1]
    competition = url[2]
    gender = url[3]

    out = []
    for name, country in zip(all_skier_names , all_countries):
        skier_name = name.text.strip().title()
        country = country.text.strip()
        out.append([discipline, season,  competition,  gender,  country,  skier_name])

    return out

all_urls = [['Cross-Country', '2020', 'World Cup', 'M', 'https://www.fis-ski.com/DB/cross-country/cup-standings.html?sectorcode=CC&seasoncode=2020&cupcode=WC&disciplinecode=ALL&gendercode=M&nationcode='],
            ['Cross-Country', '2020', 'World Cup', 'L', 'https://www.fis-ski.com/DB/cross-country/cup-standings.html?sectorcode=CC&seasoncode=2020&cupcode=WC&disciplinecode=ALL&gendercode=L&nationcode='],
            ['Cross-Country', '2020', 'World Cup', 'M', 'https://www.fis-ski.com/DB/cross-country/cup-standings.html?sectorcode=CC&seasoncode=2020&cupcode=WC&disciplinecode=ALL&gendercode=M&nationcode='],
            ['Cross-Country', '2020', 'World Cup', 'L', 'https://www.fis-ski.com/DB/cross-country/cup-standings.html?sectorcode=CC&seasoncode=2020&cupcode=WC&disciplinecode=ALL&gendercode=L&nationcode=']]

with Pool(processes=2) as pool, tqdm.tqdm(total=len(all_urls)) as pbar:
    all_data = []
    print("im in pool")

    for data in pool.imap_unordered(parse, all_urls):
        print("im in data")

        all_data.extend(data)
        pbar.update()

print(all_data) 

The only thing I see when I run the code is the progress bar, which always stays at 0%:

  0%|          | 0/8 [00:00<?, ?it/s]

I put a couple of print statements in the parse(url) function and in the for loop at the end of the code, but still, the only thing that gets printed is "im in pool". It seems like the code does not enter the function at all, and it never reaches the for loop at the end.

The code should execute in 5-8 seconds, but I have been waiting for 10 minutes and nothing happens. I have also tried running it without the progress bar, but the result is the same.

Do you know what the problem is? Is it a problem with the version of Python I'm using (Python 3.6, 32-bit) or with the version of some library? I don't know what to do.
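One thing I have seen mentioned (I am not sure this is the cause): on Windows, and on macOS since Python 3.8, multiprocessing starts workers with the "spawn" method, which re-imports the main module in each child process. That means the Pool must be created under an if __name__ == "__main__" guard, and worker functions defined interactively in IDLE or a Jupyter cell may not be importable by the children at all. A minimal sketch of that restructuring, with the scraping replaced by a placeholder function just to show the shape:

```python
from multiprocessing import Pool

def parse(url):
    # placeholder for the real requests/BeautifulSoup work;
    # here it just echoes its input back as a one-element list
    return [url]

all_urls = ["a", "b", "c", "d"]

def main():
    all_data = []
    # With "spawn", each child re-imports this module from the top.
    # Creating the Pool at module level would make every child try to
    # create its own Pool; keeping it inside main(), called only under
    # the __main__ guard, avoids that.
    with Pool(processes=2) as pool:
        for data in pool.imap_unordered(parse, all_urls):
            all_data.extend(data)
    return all_data

if __name__ == "__main__":
    print(sorted(main()))
```

If that is the issue, it would also explain why the same code ran fine on the answerer's machine if he was on Linux, where the default start method is "fork" and no guard is needed.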

Source: https://stackoverflow.com/questions/59892469/multiprocessing-for-webscrapping-wont-start-on-windows-and-mac
