Question
I'm experiencing an issue with a large memory spike when I page through a dataset returned by an API. The API returns ~150k records; I'm requesting 10k records at a time and paging through 15 pages of data. The data is an array of hashes, each hash containing 25 keys with ~50-character string values. This process kills my 512MB Heroku dyno.
I have a method used for paging an API response dataset.
def all_pages value_key = 'values', &block
  response = {}
  values = []
  current_page = 1
  total_pages = 1
  offset = 0

  begin
    response = yield offset

    # The following seems to be the culprit
    values += response[value_key] if response.key? value_key

    offset = response['offset']
    total_pages = (response['totalResults'].to_f / response['limit'].to_f).ceil if response.key? 'totalResults'
  end while (current_page += 1) <= total_pages

  values
end
I call this method like so:
all_pages("items") do |current_page|
get "#{data_uri}/data", query: {offset: current_page, limit: 10000}
end
I know it's the concatenation of the arrays that is causing the issue, as removing that line allows the process to run with no memory issues. What am I doing wrong? The whole dataset is probably no larger than 20MB, so how is that consuming all the dyno memory? What can I do to improve the efficiency here?
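(For reference, and not part of the original method: on an Array, += is just sugar for reassignment with Array#+, which is why that line allocates a fresh combined array on every page.)

# Each pass builds a brand-new array holding everything accumulated so far
# plus the new page, and the previous combined array is left behind as
# garbage for the GC to reclaim.
values = values + response[value_key]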
Update
Response looks like this: {"totalResults":208904,"offset":0,"count":1,"hasMore":true,"limit":"10000","items":[...]}
Update 2
Running with memory reporting shows the following:
[HTTParty] [2014-08-13 13:11:22 -0700] 200 "GET 29259/data" -
Memory 171072KB
[HTTParty] [2014-08-13 13:11:26 -0700] 200 "GET 29259/data" -
Memory 211960KB
... removed for brevity ...
[HTTParty] [2014-08-13 13:12:28 -0700] 200 "GET 29259/data" -
Memory 875760KB
[HTTParty] [2014-08-13 13:12:33 -0700] 200 "GET 29259/data" -
Errno::ENOMEM: Cannot allocate memory - ps ax -o pid,rss | grep -E "^[[:space:]]*23137"
Update 3
I can recreate the issue with the basic script below. The script is hard-coded to pull only 100k records and already consumes over 512MB of memory on my local VM.
#! /usr/bin/ruby
require 'uri'
require 'net/http'
require 'json'

uri = URI.parse("https://someapi.com/data")
offset = 0
values = []

begin
  http = Net::HTTP.new(uri.host, uri.port)
  http.use_ssl = true
  http.set_debug_output($stdout)

  request = Net::HTTP::Get.new(uri.request_uri + "?limit=10000&offset=#{offset}")
  request.add_field("Content-Type", "application/json")
  request.add_field("Accept", "application/json")

  response = http.request(request)
  json_response = JSON.parse(response.body)

  values << json_response['items']
  offset += 10000
end while offset < 100_000
values
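For a rough sense of where the memory goes (a back-of-the-envelope estimate, not a measurement from the post): the raw JSON may be ~20MB, but once parsed every value becomes its own Ruby object with per-object overhead.

# Rough arithmetic, assuming MRI with 40-byte object slots and 25 string
# values of ~50 characters per record:
#   100_000 records * 25 values = 2_500_000 Strings
#   each String ~ 40 bytes of object slot + ~50 bytes of character data
#   => well over 200MB for the value Strings alone, before counting the
#      Hashes, the (un-symbolized) key Strings, and the raw response bodies.
require 'objspace'

record = Hash[(1..25).map { |i| ["key#{i}", "x" * 50] }]
puts ObjectSpace.memsize_of(record)  # size of the hash's own table, excluding the strings it references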
Update 4
I've made a couple of improvements that seem to help but don't completely alleviate the issue.
1) Parsing with symbolized keys turned out to consume less memory. This is because the keys of each hash are the same, and it's cheaper to symbolize them than to parse them as separate Strings.
2) Switching to yajl-ruby for JSON parsing consumes significantly less memory as well (see the sketch after the numbers below).
Memory consumption of processing 200k records:
- JSON.parse(response.body): 861080KB (before completely running out of memory)
- JSON.parse(response.body, symbolize_names: true): 573580KB
- Yajl::Parser.parse(response.body): 357236KB
- Yajl::Parser.parse(response.body, symbolize_keys: true): 264576KB
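As a follow-up to 2), here is a minimal sketch of how the repro script could feed the response body to Yajl chunk by chunk instead of parsing one big string. It assumes the yajl-ruby gem and the same hypothetical someapi.com endpoint as above, and is untested against the real API.

#! /usr/bin/ruby
# Rough sketch only: the body is streamed into Yajl as it is read off the
# socket, so the full 10k-record JSON string never has to sit in memory
# alongside the parsed result.
require 'uri'
require 'net/http'
require 'yajl'

uri = URI.parse("https://someapi.com/data")
offset = 0

begin
  http = Net::HTTP.new(uri.host, uri.port)
  http.use_ssl = true

  request = Net::HTTP::Get.new(uri.request_uri + "?limit=10000&offset=#{offset}")
  request.add_field("Accept", "application/json")

  page = nil
  parser = Yajl::Parser.new(symbolize_keys: true)
  parser.on_parse_complete = lambda { |document| page = document }

  http.request(request) do |response|
    response.read_body { |chunk| parser << chunk }  # stream chunks into the parser
  end

  # Process page[:items] here, then let the page go out of scope before
  # fetching the next one, instead of accumulating every page in one array.
  offset += 10000
end while offset < 100_000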
This is still an issue though.
- Why does a dataset that's no more than 20MB take that much memory to process?
- What is the "right way" to process large datasets like this?
- What does one do when the dataset becomes 10x larger? 100x larger?
I will buy a beer for anyone who can thoroughly answer these three questions!
Thanks a lot in advance.
Answer 1:
You've identified the problem as the use of += with your array, so the likely solution is to add the data to the existing array without creating a new one each time:

values.concat(response[value_key]) if response.key? value_key

Array#concat appends the page's elements to the array you already have, in place. (push or << would instead add the whole page array as a single nested element, which isn't what you want here.) You should only use += if you actually want a new array; it doesn't appear that you do, you just want all the elements accumulated in a single array.
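Applied to the paging method from the question, it's a one-line change; a minimal sketch under the same assumptions as the original code:

def all_pages value_key = 'values', &block
  values = []
  current_page = 1
  total_pages = 1
  offset = 0

  begin
    response = yield offset

    # concat appends the page's elements to the existing array in place,
    # so no new combined array is allocated on every iteration
    values.concat(response[value_key]) if response.key? value_key

    offset = response['offset']
    total_pages = (response['totalResults'].to_f / response['limit'].to_f).ceil if response.key? 'totalResults'
  end while (current_page += 1) <= total_pages

  values
end

If the records only need to be processed once, a further step would be to hand each page straight to the caller instead of accumulating values at all; memory then stays roughly flat no matter how many pages the API returns.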
Source: https://stackoverflow.com/questions/25278660/ruby-paging-over-api-response-dataset-causes-memory-spike