Question
Please see the code sample below:
JavaRDD<String> mapRDD = filteredRecords
    .map(new Function<String, String>() {
        @Override
        public String call(String url) throws Exception {
            BufferedReader in = null;
            URL formatURL = new URL(url.replaceAll("\"", "").trim());
            try {
                HttpURLConnection con = (HttpURLConnection) formatURL.openConnection();
                in = new BufferedReader(new InputStreamReader(con.getInputStream()));
                return in.readLine();
            } finally {
                if (in != null) {
                    in.close();
                }
            }
        }
    });
Here url is an HTTP GET request, for example:
http://ip:port/cyb/test?event=movie&id=604568837&name=SID&timestamp_secs=1460494800&timestamp_millis=1461729600000&back_up_id=676700166
This piece of code is very slow. The IP and port are random and the load is distributed, so an IP can take about 20 different values with different ports, so I don't see a bottleneck there.
When I comment out
in = new BufferedReader(new InputStreamReader(con.getInputStream()));
return in.readLine();
the code runs very fast. NOTE: the input data to process is 10 GB, read from S3 using Spark.
Is there anything wrong with how I am using BufferedReader or InputStreamReader, or is there an alternative? I can't use foreach in Spark because I have to get the response back from the server and need to save the JavaRDD as a text file on HDFS.
If we use mapPartitions, the code looks something like this:
JavaRDD<String> mapRDD = filteredRecords.mapPartitions(new FlatMapFunction<Iterator<String>, String>() {
    @Override
    public Iterable<String> call(Iterator<String> tuple) throws Exception {
        final List<String> rddList = new ArrayList<String>();
        Iterable<String> iterable = new Iterable<String>() {
            @Override
            public Iterator<String> iterator() {
                return rddList.iterator();
            }
        };
        while (tuple.hasNext()) {
            URL formatURL = new URL(tuple.next().replaceAll("\"", "").trim());
            HttpURLConnection con = (HttpURLConnection) formatURL.openConnection();
            try (BufferedReader br = new BufferedReader(new InputStreamReader(con.getInputStream()))) {
                rddList.add(br.readLine());
            } catch (IOException ex) {
                return rddList;
            }
        }
        return iterable;
    }
});
But here we are still doing the same thing for each record, aren't we?
Answer 1:
Currently you are using the map function, which creates a URL request for each row in the partition. You can use mapPartitions instead, which will make the code run faster because the connection to the server is set up only once, that is, only one connection per partition.
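A minimal sketch of that idea, assuming a Java 11+ java.net.http.HttpClient is available on the executors and reusing filteredRecords from the question; the variable name responses and the choice of HTTP client are illustrative, not part of the original answer:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.FlatMapFunction;

JavaRDD<String> responses = filteredRecords
    .mapPartitions(new FlatMapFunction<Iterator<String>, String>() {
        @Override
        public Iterable<String> call(Iterator<String> urls) throws Exception {
            // One client (and one keep-alive connection pool) per partition,
            // instead of a fresh connection per record.
            HttpClient client = HttpClient.newHttpClient();
            List<String> results = new ArrayList<String>();
            while (urls.hasNext()) {
                String url = urls.next().replaceAll("\"", "").trim();
                HttpRequest request = HttpRequest.newBuilder(URI.create(url)).GET().build();
                // Reading the body fully lets the client return the connection to its
                // pool and reuse it for the next request to the same ip:port.
                String body = client.send(request, HttpResponse.BodyHandlers.ofString()).body();
                // Keep only the first line, as in the question's readLine().
                results.add(body.lines().findFirst().orElse(""));
            }
            // A List is an Iterable, matching the FlatMapFunction signature used in the question.
            return results;
        }
    });

The result can then be written with saveAsTextFile on HDFS exactly as with the original mapRDD.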
Answer 2:
A big cost here is setting up the TCP/HTTPS connections. This is exacerbated by the fact that, even if you only read the first (short) line of a large file, modern HTTP clients try to read() to the end of the file so as to reuse HTTP/1.1 connections and avoid aborting the connection. That is a good strategy for small files, but not for files that are megabytes in size.
There is a solution: limit the content length of the read so that only a smaller block is fetched, reducing the cost of the close(); the connection recycling then reduces the HTTPS setup cost. This is what the latest Hadoop/Spark S3A client does if you set fadvise=random on the connection: it requests blocks rather than the entire multi-GB file. Be aware, though, that this design is actually really bad if you are going byte-by-byte through a file.
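A hedged sketch of both points, assuming url holds one request string from the question and sc is the job's JavaSparkContext; the Range header and the fs.s3a.experimental.input.fadvise key (Hadoop 2.8+ S3A) are the assumed mechanisms for "read only a small block" and for fadvise=random respectively:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

// 1) Ask the server for only a small block when only the first line is needed;
//    a server that supports ranges answers 206 Partial Content, so close() has
//    little left to drain and the connection can be recycled cheaply.
HttpURLConnection con = (HttpURLConnection) new URL(url).openConnection();
con.setRequestProperty("Range", "bytes=0-1023");
try (BufferedReader in = new BufferedReader(new InputStreamReader(con.getInputStream()))) {
    String firstLine = in.readLine();
}

// 2) Tell the S3A client to fetch blocks rather than whole multi-GB objects when
//    Spark reads the input from S3 (can also be set as a spark.hadoop.* property).
sc.hadoopConfiguration().set("fs.s3a.experimental.input.fadvise", "random");

A server that ignores the Range header will still send the whole body, so treat the header as an optimization hint rather than a guarantee.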
Source: https://stackoverflow.com/questions/41544129/whats-the-efficient-way-to-call-http-request-and-read-inputstream-in-spark-mapta