Ruby: How can I detect/intelligently guess the delimiter used in a CSV file?

Submitted by 我的梦境 on 2019-12-04 08:17:36

It looks like the Python implementation just checks a few dialects: excel or excel_tab. So a simple implementation that just checks for "," or "\t" is:

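# Each candidate is the separator wrapped in the double quotes of quoted fields.
COMMON_DELIMITERS = ['","', "\"\t\""]

# Returns the candidate with the highest count in the first line,
# or nil if the file is empty.
def sniff(path)
  first_line = File.open(path, &:gets) # block form closes the file
  return nil unless first_line
  snif = {}
  COMMON_DELIMITERS.each { |delim| snif[delim] = first_line.count(delim) }
  snif = snif.sort { |a, b| b[1] <=> a[1] }
  snif.size > 0 ? snif[0][0] : nil
end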

Note: this returns the full delimiter it finds, quotes included (e.g. ","), so to get just , you could change snif[0][0] to snif[0][0][1].
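If only the bare character is wanted, the same idea can be sketched without the surrounding quotes. This is a hypothetical variant rather than the code above: the names CANDIDATES and sniff_char are made up for illustration, and it assumes that a candidate which never occurs in the first line should yield nil instead of being returned anyway.

```ruby
# Candidate separators as bare characters (an assumption; extend as needed).
CANDIDATES = [",", "\t"].freeze

# Returns the candidate that occurs most often in the first line,
# or nil if the file is empty or no candidate occurs at all.
def sniff_char(path)
  first_line = File.open(path, &:gets) # block form closes the file
  return nil unless first_line

  best = CANDIDATES.max_by { |c| first_line.count(c) }
  first_line.count(best).zero? ? nil : best
end
```

Calling sniff_char("data.tsv") on a tab-separated file would then return "\t" directly, with no indexing into the result.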

Also, I'm using count(delim) because it is a little faster. Keep in mind that String#count treats its argument as a set of characters rather than as a substring, so if you added a delimiter composed of two (or more) of the same character, like --, it would count each occurrence twice (or more) when weighing that type. In that case, it may be better to use scan(delim).length.
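The difference between the two calls is easy to see on a small string. The snippet below is only an illustration: the sample lines and the || delimiter are made up, and it is not part of the sniffer itself.

```ruby
quoted_line = %("a","b","c"\n)

# String#count takes a set of characters: every " and every , is counted.
char_hits = quoted_line.count('","')

# String#scan matches the literal three-character substring ",".
substring_hits = quoted_line.scan('","').length

double_bar = "a||b||c\n"

# "||" is the same character set as "|", so each single bar is counted.
bar_chars = double_bar.count("||")

# scan counts each "||" pair once.
bar_pairs = double_bar.scan("||").length

puts char_hits, substring_hits, bar_chars, bar_pairs
```

Since count tallies the quote characters equally for every candidate, the ranking between candidates still comes down to how often the separator character itself appears.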

Here is Gary S. Weaver's answer as we are using it in production. It is a good solution that works well.

class ColSepSniffer
  NoColumnSeparatorFound = Class.new(StandardError)
  EmptyFile = Class.new(StandardError)

  COMMON_DELIMITERS = [
    '","',
    '"|"',
    '";"'
  ].freeze

  def initialize(path:)
    @path = path
  end

  def self.find(path)
    new(path: path).find
  end

  def find
    fail EmptyFile unless first

    if valid?
      delimiters[0][0][1]
    else
      fail NoColumnSeparatorFound
    end
  end

  private

  def valid?
    !delimiters.collect(&:last).reduce(:+).zero?
  end

  # delimiters #=> [["\"|\"", 54], ["\",\"", 0], ["\";\"", 0]]
  # delimiters[0] #=> ["\"|\"", 54]
  # delimiters[0][0] #=> "\"|\""
  # delimiters[0][0][1] #=> "|"
  def delimiters
    @delimiters ||= COMMON_DELIMITERS.inject({}, &count).sort(&most_found)
  end

  def most_found
    ->(a, b) { b[1] <=> a[1] }
  end

  def count
    ->(hash, delimiter) { hash[delimiter] = first.count(delimiter); hash }
  end

  def first
    @first ||= file.first
  end

  def file
    @file ||= File.open(@path)
  end
end

Spec

require "spec_helper"

describe ColSepSniffer do
  describe ".find" do
    subject(:find) { described_class.find(path) }

    let(:path) { "./spec/fixtures/google/products.csv" }

    context "when , delimiter" do
      it "returns separator" do
        expect(find).to eq(',')
      end
    end

    context "when ; delimiter" do
      let(:path) { "./spec/fixtures/google/products_with_semi_colon_seperator.csv" }

      it "returns separator" do
        expect(find).to eq(';')
      end
    end

    context "when | delimiter" do
      let(:path) { "./spec/fixtures/google/products_with_bar_seperator.csv" }

      it "returns separator" do
        expect(find).to eq('|')
      end
    end

    context "when empty file" do
      it "raises error" do
        expect(File).to receive(:open) { [] }
        expect { find }.to raise_error(described_class::EmptyFile)
      end
    end

    context "when no column separator is found" do
      it "raises error" do
        expect(File).to receive(:open) { [''] }
        expect { find }.to raise_error(described_class::NoColumnSeparatorFound)
      end
    end
  end
end

I'm not aware of any sniffer implementation in the CSV library included in Ruby 1.9. It will try to auto-discover the row separator, but the column separator is assumed to be a comma by default.
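The default can be seen with a tiny example. The sample data here is invented for illustration; col_sep is the standard option of the CSV library for the column separator.

```ruby
require "csv"

data = "id;name\n1;Widget\n"

# With the default col_sep (","), each line is parsed as a single field.
default_rows = CSV.parse(data)

# Passing col_sep explicitly splits the lines into the expected columns.
semicolon_rows = CSV.parse(data, col_sep: ";")

p default_rows
p semicolon_rows
```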

One idea would be to try parsing a sample number of rows (5% of total maybe?) using each of the possible separators. Whichever separator results in the same number of columns most consistently is probably the correct separator.
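A minimal sketch of that idea follows. Everything in it is an assumption made for illustration: the names CANDIDATE_SEPARATORS and guess_col_sep, the candidate list, and the choice to sample a fixed number of leading lines instead of a percentage of the file. It needs Ruby 2.7 or later for Enumerable#tally, and a sample that cuts a multi-line quoted field in half will make every candidate fail to parse.

```ruby
require "csv"

# Hypothetical candidate list; adjust to the separators you expect.
CANDIDATE_SEPARATORS = [",", ";", "\t", "|"].freeze

# Parses the first sample_size lines with each candidate and returns the
# separator whose most common column count covers the largest share of rows.
# Returns nil if the file is empty or no candidate splits rows into 2+ columns.
def guess_col_sep(path, sample_size: 20)
  sample = File.foreach(path).first(sample_size).join
  return nil if sample.empty?

  scored = CANDIDATE_SEPARATORS.map do |sep|
    widths = begin
      CSV.parse(sample, col_sep: sep).map(&:size)
    rescue CSV::MalformedCSVError
      [] # this candidate cannot parse the sample at all
    end
    # Most common column count and the number of rows that have it.
    width, rows = widths.tally.max_by { |_, n| n }
    consistency = width.to_i > 1 ? rows.to_f / widths.size : 0.0
    [sep, consistency, width.to_i]
  end

  # Prefer the most consistent candidate; break ties by column count.
  best = scored.max_by { |_, consistency, width| [consistency, width] }
  best[1] > 0 ? best[0] : nil
end
```

Unlike counting characters in the first line, this also copes with a stray comma inside a semicolon-separated file, because the wrong candidate produces uneven or single-column rows.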
