web-crawler

golang force net/http client to use IPv4 / IPv6

旧城冷巷雨未停 submitted on 2020-06-12 07:41:10
Question: I'm using Go 1.11's net/http and want to detect whether a domain is IPv6-only. What did you do? I created my own DialContext, because I want to detect whether a domain is IPv6-only. Code below:

    package main

    import (
        "errors"
        "fmt"
        "net"
        "net/http"
        "syscall"
        "time"
    )

    func ModifiedTransport() {
        var MyTransport = &http.Transport{
            DialContext: (&net.Dialer{
                Timeout:   30 * time.Second,
                KeepAlive: 30 * time.Second,
                DualStack: false,
                Control: func(network, address string, c syscall.RawConn) error {
                    if network ==
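
A hedged aside: independent of Go's Dialer, the "is this domain IPv6-only?" property can be checked at the resolver level. The sketch below is standalone Python, not the asker's Go approach; the test hostname is only an example.

    import socket

    def is_ipv6_only(host):
        """A domain is IPv6-only if it resolves AAAA records but no A records."""
        def resolves(family):
            try:
                return bool(socket.getaddrinfo(host, 443, family, socket.SOCK_STREAM))
            except socket.gaierror:
                return False
        return resolves(socket.AF_INET6) and not resolves(socket.AF_INET)

    print(is_ipv6_only("ipv6.google.com"))  # this host has historically had only AAAA records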

Is it possible to scrape all text messages from Whatsapp Web with Scrapy?

房东的猫 submitted on 2020-06-11 05:45:40
Question: I've been experimenting with web scraping using Scrapy, and I was interested in retrieving all text messages from all chats on WhatsApp to use as training data for a machine learning project. I know there are websites that block web crawlers/scrapers, so I would like to know whether it is possible to use Scrapy to obtain these messages, and if it isn't possible, what are some alternatives I can use? I understand that I can click the "Email chat" option for each chat, but this might not be practical for a large number of chats.
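
A hedged sketch of the usual alternative: WhatsApp Web is a login-gated, JavaScript-rendered app, so Scrapy's plain HTTP requests never see the chats; browser automation (e.g. Selenium) can read the rendered DOM after a manual QR-code login. The CSS selector below is an assumption and tends to break when WhatsApp updates its markup.

    from selenium import webdriver
    from selenium.webdriver.common.by import By

    driver = webdriver.Chrome()
    driver.get("https://web.whatsapp.com/")
    input("Scan the QR code with your phone, open a chat, then press Enter...")

    # "copyable-text"/"selectable-text" have historically wrapped message bubbles;
    # treat this selector as a guess to verify in the browser's dev tools.
    messages = driver.find_elements(By.CSS_SELECTOR, "div.copyable-text span.selectable-text")
    for m in messages:
        print(m.text)
    driver.quit()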

Removing null values from scraped data without removing the entire row

怎甘沉沦 submitted on 2020-06-01 07:38:07
Question: I am using Scrapy to scrape data from the New York Times website, but the scraped data is full of null values I don't want, so to clean my extracted data I changed the pipeline.py script. When I extract a single value or two, it works like a charm; but when I extract multiple values, since there is at least one null value in each extracted row, the algorithm ends up deleting almost all my data. Is there a way to stop this from happening? Here is my spider file:

    # -
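
A hedged sketch of one fix: instead of raising DropItem whenever any field is null (which discards the whole row), clean the item field by field in the pipeline. This assumes the spider yields plain dicts; the "N/A" placeholder and the pipeline name are arbitrary choices.

    # pipelines.py -- enable via ITEM_PIPELINES = {"myproject.pipelines.CleanNullsPipeline": 300}
    class CleanNullsPipeline:
        def process_item(self, item, spider):
            for field, value in item.items():
                if value is None or value == "" or value == []:
                    item[field] = "N/A"  # keep the row, just flag the missing cell
            return item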

How to iterate pages to scrape web news

落爺英雄遲暮 submitted on 2020-06-01 05:12:27
Question: I've been trying to figure out how to iterate pages to scrape multiple news articles. This is the page I want to scrape (and its following pages): https://www.startribune.com/search/?page=1&q=China%20COVID-19&refresh=true I tried out the code below, but it doesn't return a correct result:

    def scrap(url):
        user_agent = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64; Trident/7.0; Touch; rv:11.0) like Gecko'}
        urls = [f"{url}{x}" for x in range(1,10)]
        params = {'q': 'China%20COVID-19'}
        for
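
A hedged sketch of the pagination loop: one likely problem in the excerpt is that 'China%20COVID-19' is already URL-encoded, and requests encodes params again, so the server would receive 'China%2520COVID-19'. Passing the raw query and the page number through params avoids this; the anchor selector below is deliberately loose and would need narrowing to real article links.

    import requests
    from bs4 import BeautifulSoup

    headers = {"user-agent": "Mozilla/5.0"}
    for page in range(1, 10):
        params = {"page": page, "q": "China COVID-19", "refresh": "true"}
        resp = requests.get("https://www.startribune.com/search/",
                            params=params, headers=headers)
        soup = BeautifulSoup(resp.text, "lxml")
        for link in soup.select("a[href]"):  # narrow this to article links
            print(page, link["href"])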

How to extract all URLs in a website using BeautifulSoup

孤人 submitted on 2020-05-25 08:55:26
Question: I'm working on a project that requires extracting all links from a website. Using this code, I get all the links from a single URL:

    import requests
    from bs4 import BeautifulSoup, SoupStrainer

    source_code = requests.get('https://stackoverflow.com/')
    soup = BeautifulSoup(source_code.content, 'lxml')
    links = []
    for link in soup.find_all('a'):
        links.append(str(link))

The problem is that if I want to extract all URLs, I have to write another for loop and then another one... I want to extract all the URLs of the whole website.
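
A hedged sketch of the standard approach: replace the hand-written nested loops with a breadth-first crawl over a queue plus a visited set. The page cap and the same-domain filter are illustrative choices.

    from collections import deque
    from urllib.parse import urljoin, urlparse

    import requests
    from bs4 import BeautifulSoup

    def crawl(start_url, max_pages=50):
        domain = urlparse(start_url).netloc
        queue, seen = deque([start_url]), {start_url}
        while queue and len(seen) <= max_pages:
            url = queue.popleft()
            try:
                resp = requests.get(url, timeout=10)
            except requests.RequestException:
                continue  # skip unreachable pages
            soup = BeautifulSoup(resp.content, "lxml")
            for a in soup.find_all("a", href=True):
                link = urljoin(url, a["href"]).split("#")[0]
                if urlparse(link).netloc == domain and link not in seen:
                    seen.add(link)
                    queue.append(link)
        return seen

    print(crawl("https://stackoverflow.com/"))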

Python Selenium save data-id and data-name in file

北城以北 submitted on 2020-05-17 06:37:08
Question: I have this HTML code:

    <div id="availables" style="height: 472px; overflow-y: scroll;">
      <div class="_instance _personInstance _volunteer" data-id="980200" data-name="Name1">
        <div class="_addButtonPerson"> </div>
        Name1
      </div>
      <div class="_instance _personInstance _volunteer" data-id="14069" data-name="Name2">
        <div class="_addButtonPerson"> </div>
        Name2
      </div>
      <div class="_instance _personInstance _volunteer" data-id="514633" data-name="Name3">
        <div class="_addButtonPerson"> </div>
        Name3
      </div>
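
The excerpt cuts off before the question text, but the title asks how to save each element's data-id and data-name to a file. A hedged sketch with Selenium and the csv module (the URL is a placeholder; the selector is inferred from the HTML above):

    import csv

    from selenium import webdriver
    from selenium.webdriver.common.by import By

    driver = webdriver.Chrome()
    driver.get("https://example.com/volunteers")  # placeholder URL

    rows = [(el.get_attribute("data-id"), el.get_attribute("data-name"))
            for el in driver.find_elements(By.CSS_SELECTOR, "#availables div._personInstance")]

    with open("volunteers.csv", "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["data-id", "data-name"])
        writer.writerows(rows)
    driver.quit()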

Multiprocessing with threading?

若如初见. submitted on 2020-05-11 07:42:06
Question: When trying to make my script multi-threaded, I found out about multiprocessing. I wonder if there is a way to make multiprocessing work with threading?

    cpu 1 -> 3 threads (workers A, B, C)
    cpu 2 -> 3 threads (workers D, E, F)
    ...

I'm trying to do it myself but I am hitting many problems. Is there a way to make those two work together?

Answer 1: You can generate a number of Processes, and then spawn Threads from inside them. Each Process can handle almost anything the standard interpreter thread can handle,
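
A minimal sketch of the layout the answer describes, using the asker's 2 x 3 example (process and thread counts are illustrative):

    import multiprocessing
    import threading

    def thread_worker(label):
        print(f"worker {label} in {multiprocessing.current_process().name}")

    def process_worker(labels):
        # each process spawns its own group of threads
        threads = [threading.Thread(target=thread_worker, args=(l,)) for l in labels]
        for t in threads:
            t.start()
        for t in threads:
            t.join()

    if __name__ == "__main__":  # required for multiprocessing on Windows
        groups = [("A", "B", "C"), ("D", "E", "F")]
        procs = [multiprocessing.Process(target=process_worker, args=(g,)) for g in groups]
        for p in procs:
            p.start()
        for p in procs:
            p.join()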

Save a complete web page (incl. CSS, images) using Python/Selenium

空扰寡人 submitted on 2020-04-29 07:20:20
Question: I am using Python/Selenium to submit genetic sequences to an online database, and I want to save the full page of results I get back. Below is the code that gets me to the results I want:

    from selenium import webdriver

    URL = 'https://blast.ncbi.nlm.nih.gov/Blast.cgi?PROGRAM=blastx&PAGE_TYPE=BlastSearch&LINK_LOC=blasthome'
    SEQUENCE = 'CCTAAACTATAGAAGGACAGCTCAAACACAAAGTTACCTAAACTATAGAAGGACAGCTCAAACACAAAGTTACCTAAACTATAGAAGGACAGCTCAAACACAAAGTTACCTAAACTATAGAAGGACAGCTCAAACACAAAGTTACCTAAACTATAGAAGGACA
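
A hedged sketch of one common workaround: Selenium has no built-in "save complete page", so save the rendered HTML via driver.page_source and download referenced stylesheets and images separately. Rewriting asset URLs inside the saved HTML is left out, and the results URL is a placeholder.

    import os
    from urllib.parse import urljoin

    import requests
    from bs4 import BeautifulSoup
    from selenium import webdriver

    driver = webdriver.Chrome()
    driver.get("https://blast.ncbi.nlm.nih.gov/Blast.cgi")  # placeholder: navigate to the results page first

    html = driver.page_source
    os.makedirs("saved_page", exist_ok=True)
    with open(os.path.join("saved_page", "results.html"), "w", encoding="utf-8") as f:
        f.write(html)

    # fetch stylesheets (<link href>) and images (<img src>) next to the HTML
    soup = BeautifulSoup(html, "lxml")
    for tag, attr in (("link", "href"), ("img", "src")):
        for el in soup.find_all(tag):
            src = el.get(attr)
            if not src:
                continue
            asset_url = urljoin(driver.current_url, src)
            name = os.path.basename(asset_url.split("?")[0]) or "asset"
            try:
                data = requests.get(asset_url, timeout=10).content
            except requests.RequestException:
                continue
            with open(os.path.join("saved_page", name), "wb") as out:
                out.write(data)
    driver.quit()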

selenium.common.exceptions.WebDriverException: Message: Service

喜你入骨 submitted on 2020-04-13 07:50:35
Question: I ran into trouble when using Selenium to control Chrome. Here is my code:

    from selenium import webdriver
    driver = webdriver.Chrome()

When I try to run it, it succeeds at first and Chrome pops up on the screen. However, it shuts down after a few seconds.

    Traceback (most recent call last):
      File "<pyshell#3>", line 1, in <module>
        driver = webdriver.Chrome('C:\Program Files (x86)\Google\Chrome\chrome.exe')
      File "C:\Users\35273\AppData\Local\Programs\Python\Python35\lib\site-packages
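
A hedged reading of the traceback: webdriver.Chrome() was apparently given the path to the Chrome browser itself (chrome.exe), but in Selenium 3 the first positional argument is the path to the chromedriver executable, whose version must match the installed Chrome. A minimal sketch (the driver path is illustrative):

    from selenium import webdriver

    # point this at the real chromedriver.exe, not chrome.exe
    driver = webdriver.Chrome(r"C:\tools\chromedriver.exe")
    driver.get("https://www.python.org/")
    print(driver.title)
    driver.quit()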