We do a lot of file processing, bringing in customer data for analytics, so we’re always looking for ways to make things faster. The CSV class built into Ruby is great, but we found that if we could live with fewer options, we could make parsing a good bit faster while staying in Ruby.
Assumptions:
- CSV file has a header
- You want a hash emitted for each row (headers are converted to symbols)
- You don’t need converters - all values are strings
- Most rows in the CSV file do not have quotes
What does it do?
For each line in the file:
- If the line contains any quote characters, parse it with Ruby’s CSV
- Otherwise split the line on the column separator and convert blank values to nil (as CSV does)
- Zip the header with the values and yield the Hash
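The steps above can be sketched on a couple of sample lines. This is only an illustration: `parse_fields` is a hypothetical helper (not part of QuickCsv), the sample data is made up, and plain Ruby is used for the blank-to-nil conversion so the snippet needs nothing beyond the standard `csv` library.

```ruby
require 'csv'

# Hypothetical helper for illustration: take the fast path when the line
# has no quote character, otherwise fall back to Ruby's CSV parser.
def parse_fields(line, col_sep: ",", quote_char: '"')
  if line.include?(quote_char)
    CSV.parse_line(line, col_sep: col_sep, quote_char: quote_char)
  else
    # A limit of -1 keeps trailing empty fields; empty strings become nil
    line.split(col_sep, -1).map { |field| field.empty? ? nil : field }
  end
end

header = parse_fields("id,name,notes").map(&:to_sym)

# Fast path: no quotes, trailing blank becomes nil
plain_row = Hash[header.zip(parse_fields("1,Ann,"))]

# Fallback: the quoted field contains the separator, so CSV handles it
quoted_row = Hash[header.zip(parse_fields('2,"Smith, Bob",vip'))]

# For comparison, a naive split breaks the quoted field into two pieces
naive = '2,"Smith, Bob",vip'.split(",", -1)

p plain_row
p quoted_row
p naive.size
```

The naive split yields four fields instead of three, which is why any line containing a quote character is handed to CSV rather than split directly.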
The code
# Assumptions:
# File has a header - this is not an option at this time!
require 'csv'
# Object#presence comes from ActiveSupport (Hash#slice is built in from Ruby 2.5)
require 'active_support/core_ext/object/blank'

module QuickCsv
  DEFAULT_OPTIONS = {
    col_sep: ",",
    quote_char: '"',
    file_encoding: 'utf-8'
  }.freeze

  class << self
    def foreach(input, options={})
      options = DEFAULT_OPTIONS.merge(options)
      csv_options = options.slice(:col_sep, :quote_char)
      # Accept an already-open IO, or a path to open with the given encoding
      io = input.respond_to?(:readline) ? input : File.new(input, "r:#{options[:file_encoding]}")
      header = nil
      # IO.foreach expects a file name, so iterate over the IO object itself
      io.each_line do |line|
        line.chomp!
        if header
          row = read_row(header, line, csv_options)
          yield row
        else
          header = read_header(line, csv_options)
        end
      end
    ensure
      # Only close files we opened ourselves
      io.close if io && !io.equal?(input)
    end

    private

    def read_header(line, options)
      fields = parse_line(line, options)
      fields.map(&:to_sym)
    end

    def read_row(header, line, options)
      fields = parse_line(line, options)
      Hash[header.zip(fields)]
    end

    def parse_line(line, options)
      if line.include?(options[:quote_char])
        # Slow path: let CSV handle quoting (options passed as keywords)
        CSV.parse_line(line, **options)
      else
        # Split with -1 keeps trailing blanks
        # map(&:presence) converts empty strings to nil
        line.split(options[:col_sep], -1).map(&:presence)
      end
    end
  end
end
It’s very little code, but it can save some real time, especially on large files. Below are the benchmark results (times in seconds) for a 58-column CSV file containing 500,000 records:
               user     system      total        real
quickcsv  95.470000   4.020000  99.490000 (104.024935)
csv      232.130000   4.010000 236.140000 (243.680124)
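The script behind these numbers isn't shown, so the following is only a rough sketch of how a similar comparison could be set up with Ruby's `Benchmark` module. It times the plain split against `CSV.parse_line` on synthetic, quote-free lines held in memory. The 58 columns match the file described above, but the row count and the generated values are assumptions chosen to keep the run short, and timings will vary by machine and Ruby version.

```ruby
require 'benchmark'
require 'csv'

columns = 58      # matches the file described above
rows    = 5_000   # assumption: far fewer rows, to keep the run short

# Synthetic quote-free lines such as "r0c0,r0c1,...".
lines = Array.new(rows) do |r|
  Array.new(columns) { |c| "r#{r}c#{c}" }.join(",")
end

split_rows = nil
csv_rows   = nil

Benchmark.bm(10) do |bm|
  # Fast path: split on the separator and turn empty strings into nil
  bm.report("split") do
    split_rows = lines.map do |line|
      line.split(",", -1).map { |field| field.empty? ? nil : field }
    end
  end

  # Baseline: Ruby's CSV parser on every line
  bm.report("csv") do
    csv_rows = lines.map { |line| CSV.parse_line(line) }
  end
end

# Both approaches should agree on quote-free input
puts "results match: #{split_rows == csv_rows}"
```

This only measures per-line parsing, not file I/O or hash construction, so it is not a reproduction of the figures above.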