Writing a program to scrape forums

Submitted by 自古美人都是妖i on 2019-12-13 12:09:48

Question


I need to write a program to scrape forums.

Should I write the program in Python using the Scrapy framework, or should I use PHP cURL? Also, is there a PHP equivalent to Scrapy?

Thanks


Answer 1:


I would choose Python because of its superior libxml2 bindings, specifically lxml.html and pyquery. Scrapy has its own libxml2 bindings; I haven't tested them, and skimming the Scrapy documentation didn't leave me very impressed (I've done lots of scraping using just these parsers and manual coding). With any of these you get a truly superior HTML parser and querying via XPath, and with lxml.html and pyquery (which is also built on lxml) you get CSS selectors as well.
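To make the XPath point concrete, here is a minimal sketch using lxml.html. The sample markup, the class names (`post`, `author`, `content`), and the helper `extract_posts` are all made up for illustration; a real forum will use different markup, so the queries would need adjusting. The sketch sticks to XPath because lxml's `.cssselect()` method needs the separate `cssselect` package to be installed.

```python
import lxml.html

# Hypothetical forum markup, invented for illustration only.
SAMPLE_PAGE = """
<html><body>
  <div class="post" id="p1">
    <span class="author">alice</span>
    <div class="content">First post</div>
  </div>
  <div class="post" id="p2">
    <span class="author">bob</span>
    <div class="content">A reply</div>
  </div>
</body></html>
"""


def extract_posts(html_text):
    """Return a list of (post_id, author, text) tuples from one page."""
    root = lxml.html.fromstring(html_text)
    posts = []
    # XPath: every <div> whose class attribute is exactly "post".
    for node in root.xpath('//div[@class="post"]'):
        # string(...) returns the text content of the first matching node.
        author = node.xpath('string(.//span[@class="author"])').strip()
        text = node.xpath('string(.//div[@class="content"])').strip()
        posts.append((node.get("id"), author, text))
    return posts


if __name__ == "__main__":
    for post in extract_posts(SAMPLE_PAGE):
        print(post)
```

With pyquery or the `cssselect` package installed, the same selection could be written as a CSS selector such as `div.post` instead of the XPath expression.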

If you are doing a small job scraping a forum, I'd skip the scraping framework and just do it by hand: it's easy, and parallelizing and the like are not really needed.
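As a rough idea of what "by hand" could look like, the sketch below fetches the pages of one thread sequentially, with no parallelism, using only the standard library. It assumes the forum marks its pagination link with `rel="next"`, which many sites do not; the names `NextLinkParser`, `fetch`, and `crawl_thread`, as well as the delay and page limit, are hypothetical choices for this example. The standard-library `html.parser` is used here only to keep the example dependency-free; it stands in for the lxml-based parsing recommended above.

```python
import time
import urllib.request
from html.parser import HTMLParser
from urllib.parse import urljoin


class NextLinkParser(HTMLParser):
    """Records the href of the first <a rel="next"> link on a page."""

    def __init__(self):
        super().__init__()
        self.next_href = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("rel") == "next" and self.next_href is None:
            self.next_href = attrs.get("href")


def fetch(url):
    """Download one page and return its body as text."""
    with urllib.request.urlopen(url, timeout=30) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def crawl_thread(start_url, fetch_page=fetch, delay=1.0, max_pages=50):
    """Fetch pages one at a time, following rel="next" links.

    Returns a list of (url, html_text) pairs in the order visited.
    """
    pages = []
    seen = set()
    url = start_url
    while url and url not in seen and len(pages) < max_pages:
        seen.add(url)
        html_text = fetch_page(url)
        pages.append((url, html_text))

        parser = NextLinkParser()
        parser.feed(html_text)
        # Resolve a relative link against the current page URL.
        url = urljoin(url, parser.next_href) if parser.next_href else None
        if url:
            time.sleep(delay)  # be polite: pause between requests
    return pages
```

Passing `fetch_page` as a parameter makes it possible to try the loop against canned pages before pointing it at a real site.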




Answer 2:


I wouldn't use PHP for a new application that I'm writing. I don't like the language for various reasons.

Also, its strength is as a server-side scripting language for delivering dynamic pages over the web, not as a general-purpose programming language. That's another minus point. I'd stick with Python.

As for which framework to use, there are lots of them around: Harvestman, Scrapy, etc. There's also the 80legs cloud-based crawler that you might be able to use.

Update: People have been downvoting this answer, probably because I said I didn't like PHP. Here's a list of reasons why. It's not entirely accurate, but it's a decent summary nevertheless: http://wiki.python.org/moin/PythonVsPhp



Source: https://stackoverflow.com/questions/2980519/writing-a-program-to-scrape-forums
