Is Erlang the right choice for a web crawler?

梦毁少年i 2021-02-04 09:06

I am planning to write a web crawler for an NLP project that reads the thread structure of a forum at a fixed interval and parses each thread that has new content. Vi

3 answers
  •  你的背包
    2021-02-04 09:57

    If you're familiar and comfortable with Erlang, I'd stick with it, although I'm not familiar with Erlang myself. With that noted, here are some pointers:

    1. Don't use regular expressions to parse HTML; use XPath instead.
      HTML, while structured, is messy in the wild, and regular expressions are a fairly slow and unreliable way to parse it.
    2. Decide what your crawler architecture will be and what your re-visit policy is.
    3. Find the selection policy that works best for you and implement it.
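
To illustrate point 1, here is a minimal sketch in Python (not Erlang) of pulling thread links out of a forum index with XPath-style queries rather than regular expressions. The markup, the class names `thread` and `replies`, and the function `extract_threads` are all made up for the example. It also assumes the page is well-formed, because the standard-library `xml.etree.ElementTree` only handles well-formed XML and a limited XPath subset; real-world HTML would likely need a more tolerant parser.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed forum index markup (not taken from any real site).
PAGE = """
<html><body>
  <div class="thread"><a href="/t/1">Erlang crawlers</a><span class="replies">12</span></div>
  <div class="thread"><a href="/t/2">Parsing HTML</a><span class="replies">3</span></div>
  <div class="ad"><a href="/promo">Buy now</a></div>
</body></html>
"""


def extract_threads(markup):
    """Return a list of (href, title, reply_count) tuples, one per thread block."""
    root = ET.fromstring(markup.strip())
    threads = []
    # ".//div[@class='thread']" selects every descendant <div> whose class
    # attribute is exactly "thread"; the "ad" block is skipped by the query itself.
    for div in root.findall(".//div[@class='thread']"):
        link = div.find("a")
        replies = div.find("span[@class='replies']")
        threads.append((link.get("href"), link.text, int(replies.text)))
    return threads


threads = extract_threads(PAGE)
```

The query describes the structure you want (a link inside a thread block), so it keeps working if attributes are reordered or whitespace changes, which is where regular expressions tend to break.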

    A web crawler is a fairly complex system to build, and you have to be concerned about speed, performance, scalability and concurrency. Some of the most notable crawlers are written in C++ and Java, but I have not heard of any crawlers written in Erlang.
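
For the re-visit policy in point 2, one simple option that fits the question (poll the thread list at an interval, parse only threads with new content) is to remember the reply count last seen for each thread and re-parse a thread only when that count changes. The sketch below is in Python and rests on assumptions: that the forum index exposes a reply count per thread, and that `fetch_index` and `parse_thread` are callables you would supply yourself. Both names are hypothetical.

```python
import time


def threads_to_parse(index, seen):
    """Return ids of threads that are new or whose reply count changed.

    index: {thread_id: reply_count} from the current poll.
    seen:  same shape, carried over from earlier polls; updated in place.
    """
    changed = [tid for tid, count in index.items() if seen.get(tid) != count]
    seen.update(index)
    return changed


def poll_forever(fetch_index, parse_thread, interval_seconds):
    """Hypothetical driver loop: fetch_index and parse_thread are supplied by the caller."""
    seen = {}
    while True:
        for tid in threads_to_parse(fetch_index(), seen):
            parse_thread(tid)
        time.sleep(interval_seconds)


# Two simulated polls with made-up data.
seen = {}
first = threads_to_parse({"/t/1": 12, "/t/2": 3}, seen)
second = threads_to_parse({"/t/1": 12, "/t/2": 4, "/t/3": 0}, seen)
```

On the first poll every thread counts as new; on the second only `/t/2` (one more reply) and `/t/3` (newly created) are returned, so unchanged threads are never fetched again.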
