I have some CSV data I need to process, and I'm having trouble figuring out a way to match the duplicates.
The data looks a bit like this:
line  id   name  item_1  item_2  item_3  item_4
1     251  john  foo     foo     foo     foo
2     251  john  foo     bar     bar     bar
3     251  john  foo     bar     baz     baz
4     251  john  foo     bar     baz     pat
Lines 1-3 are duplicates in this case.
line  id   name  item_1  item_2  item_3  item_4
5     347  bill  foo     foo     foo     foo
6     347  bill  foo     bar     bar     bar
In this case, only line 5 is a duplicate.
line  id   name  item_1  item_2  item_3  item_4
7     251  mary  foo     foo     foo     foo
8     251  mary  foo     bar     bar     bar
9     251  mary  foo     bar     baz     baz
Here, lines 7 and 8 are the duplicates.
So basically, whenever a line adds a new "item", the previous line is a duplicate. I want to end up with a single line for each person, regardless of how many items they have.
I am using Ruby 1.9.3 like this:
require 'csv'

puts "loading data"
people = CSV.read('input-file.csv')

CSV.open("output-file", "wb") do |csv|
  # write the first row (header) to the output file
  csv << people[0]
  people.each do |p|
    ... logic to test for dupe ...
    csv << p.unique
  end
end
First, there's a slight bug in your code. Instead of:

csv << people[0]

you would need the following if you don't want to change your loop code:

csv << people.shift

shift removes and returns the header row, so the each loop below won't write the header to the output a second time.
Now, the following solution will add only the first occurrence of a person, discarding any subsequent duplicates as determined by id (as I am assuming ids are unique).
require 'csv'

puts "loading data"
people = CSV.read('input-file.csv')
ids = [] # or you could use a Set

CSV.open("output-file", "wb") do |csv|
  # write the first row (header) to the output file
  csv << people.shift
  people.each do |p|
    # If the id of the current record is in the ids array, we've already
    # seen this person
    next if ids.include?(p[0])
    # Add the new id to the front of the ids array: since in the example
    # you gave the duplicate records directly follow the originals, this
    # is slightly faster than appending to the end, but above we still
    # check the entire array to be safe
    ids.unshift p[0]
    csv << p
  end
end
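As the comment above notes, you could also use a Set, which gives constant-time membership checks instead of scanning the ids array on every row. A minimal sketch of that variant (same assumed file names) would be:

require 'csv'
require 'set'

puts "loading data"
people = CSV.read('input-file.csv')
ids = Set.new

CSV.open("output-file", "wb") do |csv|
  # write the first row (header) to the output file
  csv << people.shift
  people.each do |p|
    # Set#add? returns nil if the id was already present, so this only
    # writes the first row seen for each id
    csv << p if ids.add?(p[0])
  end
end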
Note that there is a more performant solution if your duplicate records always directly follow the original: you only need to keep the last original id and compare the current record's id against it, rather than checking for inclusion in an entire array. The difference may be negligible if your input file doesn't contain many records.
That would look like this:
require 'csv'

puts "loading data"
people = CSV.read('input-file.csv')
previous_id = nil

CSV.open("output-file", "wb") do |csv|
  # write the first row (header) to the output file
  csv << people.shift
  people.each do |p|
    next if p[0] == previous_id
    previous_id = p[0]
    csv << p
  end
end
It sounds like you're trying to get a list of unique items associated with each person, where a person is identified by an id and a name. If that's right, you can do something like this:
peoplehash = {}
maxitems = 0

# Accumulate every item column under the [id, name] key
people.each do |id, name, *items|
  (peoplehash[[id, name]] ||= []).concat(items)
end

# Dedupe and sort each person's items, tracking the longest list
peoplehash.keys.each do |k|
  peoplehash[k].uniq!
  peoplehash[k].sort!
  maxitems = [maxitems, peoplehash[k].size].max
end
This'll give you a structure like:
{
  [251, "john"] => ["bar", "baz", "foo", "pat"],
  [347, "bill"] => ["bar", "foo"]
}
and a maxitems that tells you how long the longest items array is, which you can then use for whatever you need.
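If you then want the single-line-per-person CSV from the original question, one minimal sketch (assuming the peoplehash and maxitems built above, plus hypothetical file and column names) would be:

require 'csv'

CSV.open("output-file.csv", "wb") do |csv|
  # hypothetical header: id, name, then item_1 .. item_<maxitems>
  csv << ["id", "name"] + (1..maxitems).map { |i| "item_#{i}" }
  peoplehash.each do |(id, name), items|
    # pad shorter item lists so every row has the same number of columns
    csv << [id, name] + items + [nil] * (maxitems - items.size)
  end
end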
You can use 'uniq'
irb(main):009:0> row= ['ruby', 'rails', 'gem', 'ruby']
irb(main):010:0> row.uniq
=> ["ruby", "rails", "gem"]
or
row.uniq!
=> ["ruby", "rails", "gem"]
irb(main):017:0> row
=> ["ruby", "rails", "gem"]
irb(main):018:0> row = [1, 251, 'john', 'foo', 'foo', 'foo', 'foo']
=> [1, 251, "john", "foo", "foo", "foo", "foo"]
irb(main):019:0> row.uniq
=> [1, 251, "john", "foo"]
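Applied inside a CSV loop like the one in the question (hypothetical file names), a minimal sketch would collapse the repeated item values within each row, though it won't merge separate rows for the same person:

require 'csv'

CSV.open("output-file.csv", "wb") do |csv|
  CSV.foreach("input-file.csv") do |row|
    # uniq drops repeated values within a single row; merging rows for
    # the same person still needs logic like the answers above
    csv << row.uniq
  end
end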
Source: https://stackoverflow.com/questions/9602334/ruby-csv-duplicate-row-parsing