Common Crawl - getting WARC file


Question


I would like to retrieve a web page using Common Crawl but am getting lost.

I would like to get the WARC file for www.example.com. I see that this link (http://index.commoncrawl.org/CC-MAIN-2017-34-index?url=https%3A%2F%2Fwww.example.com&output=json) produces the following JSON.

{"urlkey": "com,example)/", "timestamp": "20170820000102", "mime": "text/html", "digest": "B2LTWWPUOYAH7UIPQ7ZUPQ4VMBSVC36A", "filename": "crawl-data/CC-MAIN-2017-34/segments/1502886105955.66/robotstxt/CC-MAIN-20170819235943-20170820015943-00613.warc.gz", "mime-detected": "text/html", "status": "200", "offset": "1109728", "length": "1166", "url": "http://www.example.com"}

Can someone please point me in the right direction on how to use these JSON fields to retrieve the HTML?

Thanks for helping a noob!


Answer 1:


Take filename, offset and length from the JSON result to build an HTTP range request from $offset to ($offset+$length-1). Add https://commoncrawl.s3.amazonaws.com/ as a prefix to filename and decompress the result with gzip, e.g.

curl -s -r1109728-$((1109728+1166-1)) \
   "https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2017-34/segments/1502886105955.66/robotstxt/CC-MAIN-20170819235943-20170820015943-00613.warc.gz" \
| gzip -dc
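
The same range fetch can also be done in Python; this is only a sketch using the requests library, with the offset and length taken from the index JSON above:

import gzip
import requests

# Location of the record, taken from the index JSON
filename = "crawl-data/CC-MAIN-2017-34/segments/1502886105955.66/robotstxt/CC-MAIN-20170819235943-20170820015943-00613.warc.gz"
offset, length = 1109728, 1166

# Request only the bytes of this single record
url = "https://commoncrawl.s3.amazonaws.com/" + filename
headers = {"Range": "bytes={}-{}".format(offset, offset + length - 1)}
resp = requests.get(url, headers=headers)

# Each record is an independently gzipped member, so the slice decompresses on its own
print(gzip.decompress(resp.content).decode("utf-8", errors="replace"))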

Of course, on AWS this can also be done using Boto3 or the AWS CLI:

aws --no-sign-request s3api get-object \
 --bucket commoncrawl \
 --key crawl-data/CC-MAIN-2017-34/segments/1502886105955.66/robotstxt/CC-MAIN-20170819235943-20170820015943-00613.warc.gz \
 --range bytes=1109728-$((1109728+1166-1)) response.gz
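
With Boto3 the equivalent would be roughly the following sketch (anonymous access to the commoncrawl bucket, same key, offset and length as above):

import gzip
import boto3
from botocore import UNSIGNED
from botocore.config import Config

# Anonymous S3 client, the Boto3 counterpart of --no-sign-request
s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED))

key = "crawl-data/CC-MAIN-2017-34/segments/1502886105955.66/robotstxt/CC-MAIN-20170819235943-20170820015943-00613.warc.gz"
offset, length = 1109728, 1166

# Fetch only the byte range of this record and decompress it
obj = s3.get_object(Bucket="commoncrawl", Key=key,
                    Range="bytes={}-{}".format(offset, offset + length - 1))
print(gzip.decompress(obj["Body"].read()).decode("utf-8", errors="replace"))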

If it's only for a few documents and it doesn't matter that the documents are modified, you could use the index server directly: http://index.commoncrawl.org/CC-MAIN-2017-34/http://www.example.com



Source: https://stackoverflow.com/questions/46307663/common-crawl-getting-warc-file
