Common Crawl - getting a WARC file
Question

I would like to retrieve a web page from Common Crawl, but I am getting lost. Specifically, I want the WARC file for www.example.com. This index query (http://index.commoncrawl.org/CC-MAIN-2017-34-index?url=https%3A%2F%2Fwww.example.com&output=json) returns the following JSON:

{"urlkey": "com,example)/", "timestamp": "20170820000102", "mime": "text/html", "digest": "B2LTWWPUOYAH7UIPQ7ZUPQ4VMBSVC36A", "filename": "crawl-data/CC-MAIN-2017-34/segments/1502886105955.66/robotstxt/CC-MAIN
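
For reference, the lookup described above can be sketched in Python. This is only a minimal sketch under stated assumptions: the index record is assumed to also carry `offset` and `length` fields (they are not visible in the truncated output above), the data host `https://data.commoncrawl.org/` is an assumption, and the helper names `index_query_url`, `byte_range_header`, and `fetch_record` are hypothetical. The idea is to request only the byte range of one record from the file named in `filename`, then gunzip it.

```python
import gzip
import urllib.request
from urllib.parse import urlencode

INDEX_BASE = "http://index.commoncrawl.org"
# Assumption: public HTTP host that serves the paths listed in "filename".
DATA_BASE = "https://data.commoncrawl.org/"


def index_query_url(crawl_id, page_url):
    """Build the index query URL for one crawl, e.g. CC-MAIN-2017-34."""
    query = urlencode({"url": page_url, "output": "json"})
    return f"{INDEX_BASE}/{crawl_id}-index?{query}"


def byte_range_header(offset, length):
    """HTTP Range header covering exactly one record (the end is inclusive)."""
    start = int(offset)
    end = start + int(length) - 1
    return {"Range": f"bytes={start}-{end}"}


def fetch_record(record):
    """Download and decompress a single WARC record.

    Assumes `record` is one parsed JSON line from the index and that it
    has "filename", "offset" and "length" keys.
    """
    request = urllib.request.Request(
        DATA_BASE + record["filename"],
        headers=byte_range_header(record["offset"], record["length"]),
    )
    with urllib.request.urlopen(request) as response:
        # Assumption: each record is stored as its own gzip member,
        # so the requested slice can be decompressed on its own.
        return gzip.decompress(response.read())
```

A call such as `index_query_url("CC-MAIN-2017-34", "https://www.example.com")` rebuilds the query URL from the question. Each line of the response would then be parsed with `json.loads` and passed to `fetch_record`.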