Please see the code sample below:
JavaRDD<String> mapRDD = filteredRecords
        .map(new Function<String, String>() {
            public String call(String url) throws Exception {
                BufferedReader in = null;
                // Strip the double quotes wrapped around the URL
                URL formatURL = new URL(url.replaceAll("\"", ""));
                try {
                    HttpURLConnection con = (HttpURLConnection) formatURL.openConnection();
                    in = new BufferedReader(new InputStreamReader(con.getInputStream()));
                    // Only the first line of the response is kept
                    return in.readLine();
                } finally {
                    if (in != null) {
                        in.close();
                    }
                }
            }
        });
Here url is an HTTP GET request.
This piece of code is very slow. The IP and port are random and the load is distributed: the IP can take 20 different values, each with its own port, so I don't see a bottleneck on the server side.
When I comment out
    in = new BufferedReader(new InputStreamReader(con.getInputStream()));
    return in.readLine();
the code is very fast. NOTE: The input data to process is 10 GB. I am using Spark to read it from S3.
Is there anything wrong with how I am using BufferedReader or InputStreamReader? Is there an alternative? I can't use foreach in Spark because I have to get the response back from the server and save the JavaRDD as a text file on HDFS.
If we use mapPartitions, the code looks something like this:
JavaRDD<String> mapRDD = filteredRecords.mapPartitions(new FlatMapFunction<Iterator<String>, String>() {
    public Iterable<String> call(Iterator<String> tuple) throws Exception {
        final List<String> rddList = new ArrayList<String>();
        Iterable<String> iterable = new Iterable<String>() {
            public Iterator<String> iterator() {
                return rddList.iterator();
            }
        };
        while (tuple.hasNext()) {
            URL formatURL = new URL(tuple.next().replaceAll("\"", ""));
            HttpURLConnection con = (HttpURLConnection) formatURL.openConnection();
            try (BufferedReader br = new BufferedReader(new InputStreamReader(con
                    .getInputStream()))) {
                // Assumed body: keep the first response line, as in the map version
                rddList.add(br.readLine());
            } catch (IOException ex) {
                return rddList;
            }
        }
        return iterable;
    }
});
Here too we are opening a connection for each record, aren't we?
Currently you are using the map function, which creates a URL request for each row in the partition.
You can use mapPartitions instead, which will make the code run faster because it creates the connection to the server only once, that is, only one connection per partition.
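To make the per-partition idea concrete, here is a minimal sketch of a helper that handles a whole partition in one call. The class and method names (PartitionFetcher, cleanUrl, firstLine, fetchPartition), the 5 second timeouts, and the choice to emit an empty string for a failed request are all assumptions for illustration, not part of the original code. It relies on the JDK's HTTP keep-alive behaviour: HttpURLConnection can only reuse a socket for the next request to the same host if the previous response body was read to the end and the stream closed, so the sketch drains each response. That is reasonable for short responses but not for large ones.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

public class PartitionFetcher {

    // Strip the double quotes that wrap each URL in the input records.
    static String cleanUrl(String raw) {
        return raw.replaceAll("\"", "").trim();
    }

    // Return the first line of the response, then drain the rest of the body
    // so the JDK may put the socket back into its keep-alive cache.
    static String firstLine(String rawUrl) throws IOException {
        HttpURLConnection con =
                (HttpURLConnection) new URL(cleanUrl(rawUrl)).openConnection();
        con.setConnectTimeout(5000); // assumed timeout, tune as needed
        con.setReadTimeout(5000);    // assumed timeout, tune as needed
        try (BufferedReader br = new BufferedReader(
                new InputStreamReader(con.getInputStream(), StandardCharsets.UTF_8))) {
            String first = br.readLine();
            while (br.readLine() != null) {
                // discard remaining lines; only worthwhile for small responses
            }
            return first;
        }
    }

    // Handle every URL of one partition in a single call.
    static List<String> fetchPartition(Iterator<String> urls) {
        List<String> results = new ArrayList<>();
        while (urls.hasNext()) {
            String url = urls.next();
            try {
                String line = firstLine(url);
                results.add(line == null ? "" : line);
            } catch (IOException ex) {
                // Assumption: keep one output row per input row, empty on failure.
                results.add("");
            }
        }
        return results;
    }
}
```

Inside the mapPartitions call shown in the question, the body of call(...) could then simply return PartitionFetcher.fetchPartition(tuple). In Spark 2.x and later, FlatMapFunction.call returns an Iterator rather than an Iterable, so there it would return fetchPartition(tuple).iterator().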
A big cost here is setting up TCP/HTTPS connections. This is made worse by the fact that, even if you only read the first (short) line of a large file, modern HTTP clients try to read() to the end of the file in an attempt to re-use HTTP/1.1 connections better, so that they avoid aborting the connection. This is a good strategy for small files, but not for files that are megabytes in size.
There is a solution: limit the length of the read (for example with an HTTP Range header on the request), so that only a smaller block is read in, reducing the cost of the close(); the connection recycling then reduces HTTPS setup costs. This is what the latest Hadoop/Spark S3A client does if you set fadvise=random on the connection: it requests blocks rather than the entire multi-GB file. Be aware, though, that this design is actually really bad if you are going byte-by-byte through a file...
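As an illustration of reading only a bounded block, here is a small sketch using the standard HTTP Range request header with HttpURLConnection. The class and method names (RangedRead, rangeHeader, readBlock) are hypothetical, and this is not how the S3A client is implemented internally; it only shows the same idea at the plain HTTP level. It also assumes the server honours Range requests (it would answer 206 Partial Content); a server that ignores the header sends the whole body, and the sketch then just stops reading after the requested number of bytes.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.Arrays;

public class RangedRead {

    // Build a Range header value covering `length` bytes starting at `start`.
    // Byte ranges are inclusive, hence the "- 1" on the end offset.
    static String rangeHeader(long start, long length) {
        if (start < 0 || length <= 0) {
            throw new IllegalArgumentException("start must be >= 0 and length > 0");
        }
        return "bytes=" + start + "-" + (start + length - 1);
    }

    // Read at most `length` bytes starting at offset `start` of the resource.
    static byte[] readBlock(String url, long start, int length) throws IOException {
        HttpURLConnection con = (HttpURLConnection) new URL(url).openConnection();
        con.setRequestProperty("Range", rangeHeader(start, length));
        try (InputStream in = con.getInputStream()) {
            byte[] buf = new byte[length];
            int off = 0;
            while (off < length) {
                int n = in.read(buf, off, length - off);
                if (n < 0) {
                    break; // server sent fewer bytes than requested
                }
                off += n;
            }
            return Arrays.copyOf(buf, off);
        }
    }
}
```

For example, readBlock(url, 0, 1024) asks for only the first kilobyte, which is enough when only the first short line of a large file is needed, and leaves little or nothing to drain when the stream is closed.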