Well
Mainly you have to separate two things: the 'scraper'/crawler, the Python lib/program/function that downloads the files/data from the webserver, and the parser that reads and interprets that data.
In my case I had to scrape some government info that is 'open' but not download/data friendly.
For this project I used Scrapy [1].
Mainly I set the 'start_urls', which are the URLs my robot will crawl/fetch, and then use a 'parse' function to retrieve/parse that data.
For parsing/retrieving you are going to need an HTML/lxml extractor, as 90% of your data will be HTML; a rough sketch of such a spider is below.
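This is a minimal sketch only, with a made-up spider name, URL and selectors (Scrapy's CSS selectors run on lxml under the hood):

```python
import scrapy

class GovDataSpider(scrapy.Spider):
    # Spider name and start URL are placeholders for illustration.
    name = "govdata"
    start_urls = ["http://example.com/reports"]

    def parse(self, response):
        # Extract the text of every table cell in each row; adjust the
        # selectors to whatever the real page actually looks like.
        for row in response.css("table tr"):
            yield {"cells": row.css("td::text").extract()}
```

You can run it with `scrapy runspider govdata_spider.py -o items.json` to dump whatever the parse function yields.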
Now focusing on your question:
For data crawling
- Scrapy [1]
- Requests [2]
- urllib [3]
For parsing data
- Scrapy (its built-in selectors use lxml) or Scrapy + another parser
- lxml [4]
- Beautiful Soup [5] (a small Requests + Beautiful Soup sketch follows this list)
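If Scrapy is heavier than you need, the same download-then-parse split works with Requests [2] for fetching and Beautiful Soup [5] (or lxml [4]) for parsing. A rough sketch, with a placeholder URL and a trivial extraction:

```python
import requests
from bs4 import BeautifulSoup

# Placeholder URL; swap in the page you actually need to scrape.
response = requests.get("http://example.com/reports")
soup = BeautifulSoup(response.text, "html.parser")

# Collect the text of every link on the page as a simple example.
links = [a.get_text(strip=True) for a in soup.find_all("a")]
print(links)
```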
And please remember, crawling and scraping are not only for the web; emails too. You can check another question about that here: [6]
[1] - http://scrapy.org/
[2] - http://docs.python-requests.org/en/latest/
[3] - http://docs.python.org/library/urllib.html
[4] - http://lxml.de/
[5] - http://www.crummy.com/software/BeautifulSoup/
[6] - Python read my outlook email mailbox and parse messages