I'm trying to parse a table but I don't know how to save the data from it. I want to save the data in each row to look like:
['Raw name 1', 2,094, 0,017,
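Your desired output is nonsense; Ruby can't parse it, because a number with a leading zero is read as octal and 094 contains an invalid octal digit: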
['Raw name 1', 2,094, 0,017, 0,098, 0,113, 0,452]
# ~> -:1: Invalid octal digit
# ~> ['Raw name 1', 2,094, 0,017, 0,098, 0,113, 0,452]
I'll assume you want quoted numbers.
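To see why the unquoted form fails, here is a minimal sketch using eval on literal strings. The variable names are only for illustration, and the exact error message may differ between Ruby versions:

```ruby
# A leading zero marks an octal literal in Ruby, so 017 is the integer 15.
valid = eval("[0,017]")

# 094 is not a valid octal literal, so the parser rejects the whole line.
rejected = begin
  eval("['Raw name 1', 2,094]")
  false
rescue SyntaxError
  true
end

# Quoting the values keeps them exactly as they appear in the table.
row = ['Raw name 1', '2,094', '0,017']
```

Even where the literal does parse, as with 0,017, the comma splits it into two integers rather than one decimal, which is why quoted strings are the safer target.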
After stripping the parts that keep the code from working and reducing the HTML to a more manageable example, I ran it:
require 'nokogiri'
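html = <<EOT
<table class="open">
  <tr>
    <th>Table name</th>
    <th>Column name 1</th>
    <th>Column name 2</th>
  </tr>
  <tr>
    <th>Raw name 1</th>
    <td>2,094</td>
    <td>0,017</td>
  </tr>
  <tr>
    <th>Raw name 5</th>
    <td>2,094</td>
    <td>0,017</td>
  </tr>
</table>
EOT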
doc = Nokogiri::HTML(html)
tables = doc.css('table.open')
tables_data = []
tables.each do |table|
  title = table.css('tr[1] > th').text # !> assigned but unused variable - title
  cell_data = table.css('tr > td').text
  raw_name = table.css('tr > th').text
  tables_data << [cell_data, raw_name]
end
Which results in:
tables_data
# => [["2,0940,0172,0940,017",
# "Table nameColumn name 1Column name 2Raw name 1Raw name 5"]]
The first thing to notice is you're not using title, though you assign to it. Possibly that happened when you were cleaning up your code as an example.
css, like search and xpath, returns a NodeSet, which is akin to an array of Nodes. When you use text or inner_text on a NodeSet it returns the text of each node concatenated into a single string:
Get the inner text of all contained Node objects.
This is its behavior:
require 'nokogiri'
doc = Nokogiri::HTML('<p>foo</p><p>bar</p>')
doc.css('p').text # => "foobar"
Instead, you should iterate over each node found, and extract its text individually. This is covered many times here on SO:
doc.css('p').map{ |node| node.text } # => ["foo", "bar"]
That can be reduced to:
doc.css('p').map(&:text) # => ["foo", "bar"]
See "How to avoid joining all text from Nodes when scraping" also.
The docs say this about content, text and inner_text when used with a Node:
Returns the content for this Node.
Instead, you need to go after the individual node's text:
require 'nokogiri'
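html = <<EOT
<table class="open">
  <tr>
    <th>Table name</th>
    <th>Column name 1</th>
    <th>Column name 2</th>
    <th>Column name 3</th>
    <th>Column name 4</th>
    <th>Column name 5</th>
  </tr>
  <tr>
    <th>Raw name 1</th>
    <td>2,094</td>
    <td>0,017</td>
    <td>0,098</td>
    <td>0,113</td>
    <td>0,452</td>
  </tr>
  <tr>
    <th>Raw name 5</th>
    <td>2,094</td>
    <td>0,017</td>
    <td>0,098</td>
    <td>0,113</td>
    <td>0,452</td>
  </tr>
</table>
EOT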
tables_data = []
doc = Nokogiri::HTML(html)
doc.css('table.open').each do |table|
  # find all rows in the current table, then iterate over the second all the way to the final one...
  table.css('tr')[1..-1].each do |tr|
    # collect the cell data and raw names from the remaining rows' cells...
    raw_name = tr.at('th').text
    cell_data = tr.css('td').map(&:text)
    # aggregate it...
    tables_data += [raw_name, cell_data]
  end
end
Which now results in:
tables_data
# => ["Raw name 1",
# ["2,094", "0,017", "0,098", "0,113", "0,452"],
# "Raw name 5",
# ["2,094", "0,017", "0,098", "0,113", "0,452"]]
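If you'd rather have one array per row, closer to the shape asked for in the question, the flat result can be regrouped afterwards. This is only a sketch: the data is copied from the output shown here, and rows is a name introduced for illustration:

```ruby
# Flat result, copied from the output shown in the answer.
tables_data = [
  "Raw name 1", ["2,094", "0,017", "0,098", "0,113", "0,452"],
  "Raw name 5", ["2,094", "0,017", "0,098", "0,113", "0,452"]
]

# Take the entries two at a time (name, cells) and splat the cells
# so each row becomes a single flat array starting with its name.
rows = tables_data.each_slice(2).map do |raw_name, cell_data|
  [raw_name, *cell_data]
end
```

The same shape could be built directly inside the loop by appending [raw_name, *cell_data] with << instead of concatenating with +=.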
You can figure out how to coerce the quoted numbers into decimals acceptable to Ruby, or manipulate the inner arrays however you want.
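One possible way to do that coercion, assuming the comma is always a decimal separator and never a thousands separator. The helper name decimal_comma_to_f is hypothetical and not part of Ruby or Nokogiri:

```ruby
# Hypothetical helper: convert a string like "2,094" into the Float 2.094.
# Assumes the comma is a decimal separator; Float() raises ArgumentError
# on anything that is not a clean number, instead of silently returning 0.0.
def decimal_comma_to_f(text)
  Float(text.strip.tr(',', '.'))
end

cells = ["2,094", "0,017", "0,098", "0,113", "0,452"]
numbers = cells.map { |cell| decimal_comma_to_f(cell) }
```

Float() is used here rather than to_f so that an unexpected cell, such as an empty string, fails loudly instead of becoming 0.0.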