I'm wondering whether PySpark supports S3 access using IAM roles. Specifically, I have a business constraint that requires me to assume an AWS role in order to access a given bucket. Th
After more research, I'm convinced this is not yet supported, as evidenced here.
Others have suggested a more manual approach (see this blog post): list the S3 keys using boto, then parallelize that list with Spark so that each object is read on a worker.
The problem here (and I don't yet see how they themselves get around it) is that the S3 objects returned from listing a bucket are not serializable/pickle-able. Remember that the suggestion is to hand these objects to the workers, which read them in independent processes via map or flatMap. Compounding the problem, the boto S3 client itself isn't serializable (which is reasonable, in my opinion).
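To make the serialization point concrete, here is a minimal sketch of the listing step that hands Spark only plain key strings rather than boto objects. It assumes boto3 (not the older boto) and its `list_objects_v2` paginator; `list_keys`, the bucket name, and the prefix are hypothetical names used for illustration, not something from the blog post.

```python
def list_keys(s3_client, bucket, prefix=""):
    """Return the object keys under `prefix` as plain strings.

    Plain strings pickle without trouble, unlike the client or the
    resource objects produced by a listing.
    """
    paginator = s3_client.get_paginator("list_objects_v2")
    keys = []
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        # "Contents" is absent from a page that holds no objects.
        for obj in page.get("Contents", []):
            keys.append(obj["Key"])
    return keys


# Usage sketch (needs boto3 and pyspark; all names are placeholders):
#   import boto3
#   keys = list_keys(boto3.client("s3"), "my-bucket", "some/prefix/")
#   rdd = sc.parallelize(keys)  # `sc` is an existing SparkContext
```

This only sidesteps pickling the listing results; the workers still need a client of their own to read each key.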
That leaves recreating the assumed-role S3 client per file as the only option, which is neither optimal nor feasible past a certain point.
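One variation I can think of, though I haven't verified it at scale, is to create the client once per partition instead of once per file by using `mapPartitions`. The sketch below assumes boto3 is installed on the workers and that the role can be assumed from there via STS; `make_assumed_role_client`, `read_partition`, the session name, and the role ARN are all hypothetical names for illustration.

```python
def make_assumed_role_client(role_arn, session_name="spark-s3-read"):
    """Assume `role_arn` via STS and return an S3 client using its credentials."""
    import boto3  # imported lazily so it is resolved on the worker

    creds = boto3.client("sts").assume_role(
        RoleArn=role_arn, RoleSessionName=session_name
    )["Credentials"]
    return boto3.client(
        "s3",
        aws_access_key_id=creds["AccessKeyId"],
        aws_secret_access_key=creds["SecretAccessKey"],
        aws_session_token=creds["SessionToken"],
    )


def read_partition(keys, bucket, client_factory):
    """Yield (key, bytes) pairs, building the client once for the whole partition."""
    client = client_factory()
    for key in keys:
        body = client.get_object(Bucket=bucket, Key=key)["Body"].read()
        yield key, body


# Usage sketch (placeholders throughout; `sc` is an existing SparkContext):
#   role_arn = "arn:aws:iam::123456789012:role/example-role"
#   rdd = sc.parallelize(keys, 16).mapPartitions(
#       lambda part: read_partition(
#           part, "my-bucket", lambda: make_assumed_role_client(role_arn)
#       )
#   )
```

With this shape, only the key strings and the factory function are shipped to the workers, and the number of `assume_role` calls scales with the number of partitions rather than the number of files. The temporary credentials still expire, so long-running partitions would need some refresh logic that is not shown here.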
If anyone sees any flaws in this reasoning or an alternative solution/approach, I'd love to hear it.