Question
I'm experiencing an issue with a large memory spike when I page through a dataset returned by an API. The API returns ~150k records; I'm requesting 10k records at a time and paging through 15 pages of data. The data is an array of hashes, each hash containing 25 keys with ~50-character string values. This process kills my 512MB Heroku dyno.
I have a method used for paging an API response dataset.
def all_pages value_key = 'values', &block
  response = {}
  values = []
  current_page = 1
  total_pages = 1
  offset = 0

  begin
    response = yield offset

    # The following seems to be the culprit
    values += response[value_key] if response.key? value_key

    offset = response['offset']
    total_pages = (response['totalResults'].to_f / response['limit'].to_f).ceil if response.key? 'totalResults'
  end while (current_page += 1) <= total_pages

  values
end
I call this method like so:
all_pages("items") do |current_page|
get "#{data_uri}/data", query: {offset: current_page, limit: 10000}
end
I know it's the concatenation of the arrays that is causing the issue, as removing that line allows the process to run with no memory issues. What am I doing wrong? The whole dataset is probably no larger than 20MB, so how is that consuming all the dyno memory? What can I do to improve the efficiency here?
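(For reference, and not part of the original method: on an Array, += is just sugar for reassignment with Array#+, which is why that line allocates a fresh combined array on every page.)

# Each pass builds a brand-new array holding everything accumulated so far
# plus the new page, and the previous combined array is left behind as
# garbage for the GC to reclaim.
values = values + response[value_key]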
Update
Response looks like this: {"totalResults":208904,"offset":0,"count":1,"hasMore":true,"limit":"10000","items":[...]}
Update 2
Running with memory reporting shows the following:
[HTTParty] [2014-08-13 13:11:22 -0700] 200 "GET 29259/data" -
Memory 171072KB
[HTTParty] [2014-08-13 13:11:26 -0700] 200 "GET 29259/data" -
Memory 211960KB
... removed for brevity ...
[HTTParty] [2014-08-13 13:12:28 -0700] 200 "GET 29259/data" -
Memory 875760KB
[HTTParty] [2014-08-13 13:12:33 -0700] 200 "GET 29259/data" -
Errno::ENOMEM: Cannot allocate memory - ps ax -o pid,rss | grep -E "^[[:space:]]*23137"
Update 3
I can recreate the issue with the basic script below. The script is hard-coded to pull only 100k records and already consumes over 512MB of memory on my local VM.
#! /usr/bin/ruby
require 'uri'
require 'net/http'
require 'json'

uri = URI.parse("https://someapi.com/data")
offset = 0
values = []

begin
  http = Net::HTTP.new(uri.host, uri.port)
  http.use_ssl = true
  http.set_debug_output($stdout)

  request = Net::HTTP::Get.new(uri.request_uri + "?limit=10000&offset=#{offset}")
  request.add_field("Content-Type", "application/json")
  request.add_field("Accept", "application/json")

  response = http.request(request)
  json_response = JSON.parse(response.body)

  values << json_response['items']
  offset += 10000
end while offset < 100_000
values
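For a rough sense of where the memory goes (a back-of-the-envelope estimate, not a measurement from the post): the raw JSON may be ~20MB, but once parsed every value becomes its own Ruby object with per-object overhead.

# Rough arithmetic, assuming MRI with 40-byte object slots and 25 string
# values of ~50 characters per record:
#   100_000 records * 25 values = 2_500_000 Strings
#   each String ~ 40 bytes of object slot + ~50 bytes of character data
#   => well over 200MB for the value Strings alone, before counting the
#      Hashes, the (un-symbolized) key Strings, and the raw response bodies.
require 'objspace'

record = Hash[(1..25).map { |i| ["key#{i}", "x" * 50] }]
puts ObjectSpace.memsize_of(record)  # size of the hash's own table, excluding the strings it references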
Update 4
I've made a couple of improvements that seem to help but don't completely alleviate the issue.
1) Parsing with symbolized keys turned out to consume less memory. This is because the keys of each hash are the same, and it's cheaper to symbolize them than to parse them as separate Strings.
2) Switching to yajl-ruby for JSON parsing consumes significantly less memory as well (see the sketch after the numbers below).
Memory consumption of processing 200k records:
- JSON.parse(response.body): 861080KB (before completely running out of memory)
- JSON.parse(response.body, symbolize_names: true): 573580KB
- Yajl::Parser.parse(response.body): 357236KB
- Yajl::Parser.parse(response.body, symbolize_keys: true): 264576KB
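As a follow-up to 2), here is a minimal sketch of how the repro script could feed the response body to Yajl chunk by chunk instead of parsing one big string. It assumes the yajl-ruby gem and the same hypothetical someapi.com endpoint as above, and is untested against the real API.

#! /usr/bin/ruby
# Rough sketch only: the body is streamed into Yajl as it is read off the
# socket, so the full 10k-record JSON string never has to sit in memory
# alongside the parsed result.
require 'uri'
require 'net/http'
require 'yajl'

uri = URI.parse("https://someapi.com/data")
offset = 0

begin
  http = Net::HTTP.new(uri.host, uri.port)
  http.use_ssl = true

  request = Net::HTTP::Get.new(uri.request_uri + "?limit=10000&offset=#{offset}")
  request.add_field("Accept", "application/json")

  page = nil
  parser = Yajl::Parser.new(symbolize_keys: true)
  parser.on_parse_complete = lambda { |document| page = document }

  http.request(request) do |response|
    response.read_body { |chunk| parser << chunk }  # stream chunks into the parser
  end

  # Process page[:items] here, then let the page go out of scope before
  # fetching the next one, instead of accumulating every page in one array.
  offset += 10000
end while offset < 100_000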
This is still an issue though.
- Why does a dataset that's no more than 20MB take that much memory to process?
- What is the "right way" to process large datasets like this?
- What does one do when the dataset becomes 10x larger? 100x larger?
I will buy a beer for anyone who can thoroughly answer these three questions!
Thanks a lot in advance.
Answer 1:
You've identified the problem as the use of += with your array, so the likely solution is to add the data to the existing array without creating a new one each time:

values.concat(response[value_key]) if response.key? value_key

Array#concat appends the page's elements to the array you already have, in place. (push or << would instead add the whole page array as a single nested element, which isn't what you want here.) You should only use += if you actually want a new array; it doesn't appear that you do, you just want all the elements accumulated in a single array.
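Applied to the paging method from the question, it's a one-line change; a minimal sketch under the same assumptions as the original code:

def all_pages value_key = 'values', &block
  values = []
  current_page = 1
  total_pages = 1
  offset = 0

  begin
    response = yield offset

    # concat appends the page's elements to the existing array in place,
    # so no new combined array is allocated on every iteration
    values.concat(response[value_key]) if response.key? value_key

    offset = response['offset']
    total_pages = (response['totalResults'].to_f / response['limit'].to_f).ceil if response.key? 'totalResults'
  end while (current_page += 1) <= total_pages

  values
end

If the records only need to be processed once, a further step would be to hand each page straight to the caller instead of accumulating values at all; memory then stays roughly flat no matter how many pages the API returns.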
Source: https://stackoverflow.com/questions/25278660/ruby-paging-over-api-response-dataset-causes-memory-spike