Whats the Efficient way to call http request and read inputstream in spark MapTask

浪尽此生 提交于 2019-12-04 18:55:01

Currently you are using

map function

which creates a url request for each row in the partition.

You can use

mapPartition

Which will make the code run faster as it creates connection to the server only once , that is only one connection per partition.

A big cost here is setting up TCP/HTTPS connections. This is exacerbated by the fact that Even if you only read the first (short) line of a large file, in an attempt to re-use HTTP/1.1 connections better, modern HTTP clients try to read() to the end of the file, so avoiding aborting the connection. This is a good strategy for small files, but not for those in MB.

There is a solution there: set the content-length on the read, so that only a smaller block is read in, reducing the cost of the close(); the connection recycling then reduces HTTPS setup costs. This is what the latest Hadoop/Spark S3A client does if you set fadvise=random on the connection: requests blocks rather than the entire multi-GB file. Be aware though: that design is actually really bad if you are going byte-by-byte through a file...

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!