How to parse an HTML table with Nokogiri?

耶瑟儿~ 2021-02-08 00:05

I'm trying to parse a table but I don't know how to save the data from it. I want the data saved from each row to look like:

['Raw name 1', 2,094, 0,017,


        
3 answers
  •  暗喜 (original poster)
    2021-02-08 00:54

    The key to the problem is that calling #text on multiple results returns the concatenation of the #text of each individual element.

    Let's examine what each step does:

    # Finds all <table>s with class open
    # I'm assuming you have only one <table>, so
    #  you don't actually have to loop through
    #  all tables; instead you can just operate
    #  on the first one. If that is not the case,
    #  you can use a loop the way you did
    tables = doc.css('table.open')

    # The text of all <th>s in <tr> one in the table
    title = table.css('tr[1] > th').text

    # The text of all <td>s in all <tr>s in the table
    # You obviously wanted just the <td>s in one <tr>
    cell_data = table.css('tr > td').text

    # The text of all <th>s in all <tr>s in the table
    # You obviously wanted just the <th>s in one <tr>
    raw_name = table.css('tr > th').text

    Now that we know what is wrong, here is a possible solution:

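    require 'nokogiri'

    html = <<~EOT
      <table class="open">
        <tr><th>Table name</th><th>Column name 1</th><th>Column name 2</th><th>Column name 3</th><th>Column name 4</th><th>Column name 5</th></tr>
        <tr><th>Raw name 1</th><td>1001</td><td>1002</td><td>1003</td><td>1004</td><td>1005</td></tr>
        <tr><th>Raw name 2</th><td>2001</td><td>2002</td><td>2003</td><td>2004</td><td>2005</td></tr>
        <tr><th>Raw name 3</th><td>3001</td><td>3002</td><td>3003</td><td>3004</td><td>3005</td></tr>
      </table>
    EOT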

    doc = Nokogiri::HTML(html, nil, 'UTF-8')
    
    # Fetches only the first <table>. If you have
    #  more than one, you can loop the way you
    #  originally did.
    table = doc.css('table.open').first
    
    # Fetches all rows (<tr>s)
    rows = table.css('tr')
    
    # The column names are the first row (shift returns
    #  the first element and removes it from the array).
    # On that row we get the text of each individual <th>
    # This will be Table name, Column name 1, Column name 2...
    column_names = rows.shift.css('th').map(&:text)

    # On each of the remaining rows
    text_all_rows = rows.map do |row|

      # We get the name (<th>)
      # On the first row this will be Raw name 1
      #  on the second - Raw name 2, etc.
      row_name = row.css('th').text

      # We get the text of each individual value (<td>)
      # On the first row this will be 1001, 1002, 1003...
      #  on the second - 2001, 2002, 2003... etc
      row_values = row.css('td').map(&:text)

      # We map the name, followed by all the values
      [row_name, *row_values]
    end

    p column_names
    # => ["Table name", "Column name 1", "Column name 2",
    #     "Column name 3", "Column name 4", "Column name 5"]

    p text_all_rows
    # => [["Raw name 1", "1001", "1002", "1003", "1004", "1005"],
    #     ["Raw name 2", "2001", "2002", "2003", "2004", "2005"],
    #     ["Raw name 3", "3001", "3002", "3003", "3004", "3005"]]

    # If you want to combine them
    text_all_rows.each do |row_as_text|
      p column_names.zip(row_as_text).to_h
    end
    # =>
    # {"Table name"=>"Raw name 1", "Column name 1"=>"1001", "Column name 2"=>"1002", "Column name 3"=>"1003", "Column name 4"=>"1004", "Column name 5"=>"1005"}
    # {"Table name"=>"Raw name 2", "Column name 1"=>"2001", "Column name 2"=>"2002", "Column name 3"=>"2003", "Column name 4"=>"2004", "Column name 5"=>"2005"}
    # {"Table name"=>"Raw name 3", "Column name 1"=>"3001", "Column name 2"=>"3002", "Column name 3"=>"3003", "Column name 4"=>"3004", "Column name 5"=>"3005"}
