We do a LOT of file processing, bringing in customer data for analytics, so we’re always looking for ways to make things faster. The CSV class built into Ruby is great, but we found that if we could live with a few fewer options, we could make it a good bit faster while staying in Ruby.

Assumptions:

  • CSV file has a header
  • You want a Hash emitted for each row (headers are converted to symbols)
  • You don’t need converters - all values are strings
  • Most rows in the CSV file do not have quotes

What does it do?

For each line in the file:

  • If the line contains the quote character, fall back to Ruby’s CSV
  • Otherwise split the line and convert blank values to nil (like CSV)
  • Zip the header with the values and yield the Hash
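The steps above can be sketched in plain Ruby (the header names are made up for illustration, and `empty?` stands in for ActiveSupport’s `presence`, which also treats whitespace-only strings as blank):

```ruby
# Fast path for one unquoted line: split, blank-to-nil, zip with header.
header = [:name, :nickname, :age]
line   = "alice,,30"

# split with limit -1 keeps trailing blank fields
values = line.split(",", -1).map { |v| v.empty? ? nil : v }
row    = header.zip(values).to_h
# row == { name: "alice", nickname: nil, age: "30" }
```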

The code

# Assumptions:
# File has a header - this is not an option at this time!
require 'csv'
require 'active_support/core_ext/object/blank' # for String#presence

module QuickCsv

  DEFAULT_OPTIONS = {
      col_sep: ",",
      quote_char: '"',
      file_encoding: 'utf-8'
  }

  class << self

    def foreach(input, options={})
      options = DEFAULT_OPTIONS.merge(options)
      csv_options = options.slice(:col_sep, :quote_char)
      io = input.respond_to?(:readline) ? input : File.new(input, "r:#{options[:file_encoding]}")
      header = nil

      io.each_line do |line|
        line.chomp!
        if header
          row = read_row(header, line, csv_options)
          yield row
        else
          header = read_header(line, csv_options)
        end
      end
    end

    private

    def read_header(line, options)
      fields = parse_line(line, options)
      fields.map(&:to_sym)
    end

    def read_row(header, line, options)
      fields = parse_line(line, options)
      Hash[header.zip(fields)]
    end

    def parse_line(line, options)
      if line.include?(options[:quote_char])
        CSV.parse_line(line, **options)
      else
        # Split with -1 keeps trailing blanks
        # map(&:presence) converts empty strings to nil
        line.split(options[:col_sep], -1).map(&:presence)
      end
    end

  end

end
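Why the quoted-line fallback matters: a naive split breaks any quoted field that contains the separator, so those lines have to go through the real parser. A quick illustration with made-up data:

```ruby
require 'csv'

quoted = 'alice,"smith, jr",30'

CSV.parse_line(quoted)  # => ["alice", "smith, jr", "30"]
quoted.split(",")       # => ["alice", "\"smith", " jr\"", "30"]
```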

It’s very little code, but it can save some real time, especially on large files. Below are the test results with a 58-column CSV file containing 500,000 records:

                           user     system      total        real
quickcsv              95.470000   4.020000  99.490000 (104.024935)
csv                  232.130000   4.010000 236.140000 (243.680124)
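The numbers above came from the author’s 58-column, 500,000-row file. A reduced, self-contained harness along these lines (stdlib only, comparing the raw split fast path against CSV.foreach directly, with sizes shrunk so it runs in seconds) shows the same shape of gap:

```ruby
require 'benchmark'
require 'csv'
require 'tempfile'

# Build a small sample file (sizes reduced from the original test).
file = Tempfile.new(['sample', '.csv'])
file.puts((1..10).map { |i| "col#{i}" }.join(','))
1_000.times { |n| file.puts((1..10).map { |i| "v#{n}_#{i}" }.join(',')) }
file.flush

Benchmark.bm(8) do |x|
  # the unquoted fast path: chomp + split
  x.report('split') do
    File.foreach(file.path) { |line| line.chomp.split(',', -1) }
  end
  # the full CSV parser
  x.report('csv') do
    CSV.foreach(file.path, headers: true) { |row| }
  end
end
```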