common-crawl

Common Crawl - getting WARC file

允我心安 submitted on 2021-02-18 08:20:10
Question: I would like to retrieve a web page using Common Crawl but am getting lost. I would like to get the WARC file for www.example.com. I see that this link (http://index.commoncrawl.org/CC-MAIN-2017-34-index?url=https%3A%2F%2Fwww.example.com&output=json) produces the following JSON. {"urlkey": "com,example)/", "timestamp": "20170820000102", "mime": "text/html", "digest": "B2LTWWPUOYAH7UIPQ7ZUPQ4VMBSVC36A", "filename": "crawl-data/CC-MAIN-2017-34/segments/1502886105955.66/robotstxt/CC-MAIN
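The excerpt is cut off before the end of the index record. Common Crawl index records normally also carry `offset` and `length` fields, which locate one gzip-compressed record inside the file named by `filename`. A minimal Python sketch of turning such a record into an HTTP range request follows. It assumes the current public download host `https://data.commoncrawl.org/`, and the sample record uses hypothetical placeholder values, not real crawl data.

```python
import gzip
import json
import urllib.request

# Assumption: current public download host for Common Crawl data.
DATA_HOST = "https://data.commoncrawl.org/"


def byte_range(record):
    """Return the HTTP Range header value covering one index record."""
    start = int(record["offset"])
    end = start + int(record["length"]) - 1  # the Range end is inclusive
    return f"bytes={start}-{end}"


def fetch_record(record):
    """Download and decompress the single record an index entry points at."""
    request = urllib.request.Request(
        DATA_HOST + record["filename"],
        headers={"Range": byte_range(record)},
    )
    with urllib.request.urlopen(request) as response:
        compressed = response.read()
    # Each record is its own gzip member, so the slice decompresses standalone.
    return gzip.decompress(compressed)


# Hypothetical index line: placeholder filename, offset, and length.
sample_line = (
    '{"filename": "crawl-data/EXAMPLE/example.warc.gz", '
    '"offset": "1000", "length": "500"}'
)
sample = json.loads(sample_line)
print(byte_range(sample))
```

Calling `fetch_record` with a real index record should return the WARC headers, the HTTP headers, and the page body as bytes. It is not called here because it needs network access.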

Unzipping a gz file in C#: System.IO.InvalidDataException: 'The archive entry was compressed using an unsupported compression method.'

爱⌒轻易说出口 submitted on 2021-02-11 17:01:58
Question: I have followed Microsoft's recommended way to unzip a .gz file: https://docs.microsoft.com/en-us/dotnet/api/system.io.compression.gzipstream?view=netcore-3.1 I am trying to download and parse files from Common Crawl. I can successfully download them and unzip them with 7-Zip. However, in C# I get: System.IO.InvalidDataException: 'The archive entry was compressed using an unsupported compression method.' public static void Decompress(FileInfo fileToDecompress) { using (FileStream
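One property of Common Crawl archives that may be worth checking here: each record is compressed as its own gzip member, and the members are concatenated into a single .gz file. The excerpt does not establish whether this causes the exception, so the following is only an illustration of the file layout. It is a Python sketch rather than C#, and it builds a tiny two-member archive in memory instead of reading a real crawl file.

```python
import gzip
import zlib

# Two independent gzip members concatenated, mimicking a per-record layout.
first = gzip.compress(b"record one\n")
second = gzip.compress(b"record two\n")
archive = first + second


def split_members(data):
    """Decompress each gzip member separately and return the payloads."""
    payloads = []
    while data:
        # 16 + MAX_WBITS tells zlib to expect a gzip header and trailer.
        decoder = zlib.decompressobj(16 + zlib.MAX_WBITS)
        payloads.append(decoder.decompress(data) + decoder.flush())
        data = decoder.unused_data  # bytes that follow this member
    return payloads


members = split_members(archive)
print(len(members))
```

A reader that handles only the first member would return just the first payload, while Python's `gzip.decompress` reads all members and returns their contents joined together.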