I\'m trying to extract WET files from the public CommonCrawl data hosted on S3 from my EMR cluster. To do this, CommonCrawl has a cc-pyspark repo where they provide examples and