Question

I'd like to learn:

- How does crawler4j work?
- Does it fetch a web page, then download and extract its content?
- What about the .db and .csv files and their structures?

Generally, what sequence of steps does it follow? Please, I'd like a descriptive answer.

Thanks
Answer 1:
General Crawler Process
The process for a typical multi-threaded crawler is as follows:
We have a queue data structure, which is called the frontier. Newly discovered URLs (or start points, so-called seeds) are added to this data structure. In addition, every URL is assigned a unique ID in order to determine whether it was previously visited. Crawler threads then obtain URLs from the frontier and schedule them for later processing.
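To make this concrete, here is a minimal in-memory sketch of such a frontier with a URL-seen test. This is my own illustration, not crawler4j's actual implementation (crawler4j keeps this state on disk in its crawl storage folder):

```java
import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.Optional;
import java.util.Queue;
import java.util.Set;

/** Minimal in-memory frontier: a FIFO queue plus a URL-seen test. */
public class Frontier {
    private final Queue<String> queue = new ArrayDeque<>();
    private final Set<String> seen = new HashSet<>(); // stands in for the unique-ID check

    /** Schedule a URL only if it has never been seen before. */
    public synchronized void schedule(String url) {
        if (seen.add(url)) {
            queue.add(url);
        }
    }

    /** Hand the next URL to a crawler thread, or empty once the frontier is drained. */
    public synchronized Optional<String> next() {
        return Optional.ofNullable(queue.poll());
    }
}
```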
The actual processing starts:

- The robots.txt for the given URL is determined and parsed to honour exclusion criteria and be a polite web crawler (configurable).
- Next, the thread will check for politeness, i.e. the time to wait before visiting the same host again.
- The actual URL is visited by the crawler and the content is downloaded (this can be literally everything).
- If we have HTML content, this content is parsed and potential new URLs are extracted and added to the frontier (in crawler4j this can be controlled via shouldVisit(...); see the sketch after this list).
- The whole process is repeated until no new URLs are added to the frontier.
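In crawler4j these hooks live on the WebCrawler class: shouldVisit(...) filters which extracted links enter the frontier, and visit(...) receives each downloaded page. Here is a minimal sketch using crawler4j's public API; the seed URL, storage folder, and domain filter are placeholder choices:

```java
import edu.uci.ics.crawler4j.crawler.CrawlConfig;
import edu.uci.ics.crawler4j.crawler.CrawlController;
import edu.uci.ics.crawler4j.crawler.Page;
import edu.uci.ics.crawler4j.crawler.WebCrawler;
import edu.uci.ics.crawler4j.fetcher.PageFetcher;
import edu.uci.ics.crawler4j.parser.HtmlParseData;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtServer;
import edu.uci.ics.crawler4j.url.WebURL;

public class MyCrawler extends WebCrawler {
    @Override
    public boolean shouldVisit(Page referringPage, WebURL url) {
        // Only follow links inside the seed domain (placeholder filter).
        return url.getURL().toLowerCase().startsWith("https://example.com/");
    }

    @Override
    public void visit(Page page) {
        // Called after the page has been downloaded and parsed.
        if (page.getParseData() instanceof HtmlParseData) {
            HtmlParseData html = (HtmlParseData) page.getParseData();
            System.out.println(page.getWebURL().getURL() + " -> "
                    + html.getOutgoingUrls().size() + " outgoing links");
        }
    }

    public static void main(String[] args) throws Exception {
        CrawlConfig config = new CrawlConfig();
        config.setCrawlStorageFolder("/tmp/crawl"); // where crawler4j keeps its on-disk state
        config.setPolitenessDelay(1000);            // ms to wait between requests to the same host

        PageFetcher fetcher = new PageFetcher(config);
        RobotstxtServer robots = new RobotstxtServer(new RobotstxtConfig(), fetcher);
        CrawlController controller = new CrawlController(config, fetcher, robots);

        controller.addSeed("https://example.com/");
        controller.start(MyCrawler.class, 4); // run 4 crawler threads
    }
}
```

Note how the politeness delay and the robots.txt handling from the list above are plain configuration here; the crawl ends once the frontier runs dry.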
General (Focused) Crawler Architecture
Besides the implementation details of crawler4j, a more or less general (focused) crawler architecture (on a single server/PC) looks like this:

[Figure: general (focused) crawler architecture]

Disclaimer: The image is my own work. Please respect this by referencing this post.
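As a rough textual stand-in for the picture, each crawler thread runs a fetch-parse-schedule loop against the shared frontier. A hypothetical sketch reusing the Frontier class above (robots.txt handling and the politeness delay are elided for brevity):

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** One crawler thread: drains the shared frontier and feeds new links back into it. */
public class CrawlerWorker implements Runnable {
    private static final Pattern LINK = Pattern.compile("href=\"(https?://[^\"]+)\"");
    private final Frontier frontier;
    private final HttpClient client = HttpClient.newHttpClient();

    public CrawlerWorker(Frontier frontier) {
        this.frontier = frontier;
    }

    @Override
    public void run() {
        Optional<String> url;
        while ((url = frontier.next()).isPresent()) {
            try {
                // Download the content (robots.txt and politeness checks would go here).
                HttpResponse<String> response = client.send(
                        HttpRequest.newBuilder(URI.create(url.get())).build(),
                        HttpResponse.BodyHandlers.ofString());
                // Extract potential new URLs and add them to the frontier.
                Matcher m = LINK.matcher(response.body());
                while (m.find()) {
                    frontier.schedule(m.group(1));
                }
            } catch (Exception e) {
                // A failed fetch must not kill the thread; skip this URL.
            }
        }
    }
}
```

Running several such workers on a thread pool reproduces the multi-threaded process described at the top of this answer.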
Source: https://stackoverflow.com/questions/53351712/what-sequence-of-steps-does-crawler4j-follow-to-fetch-data