Dealing with common and uncommon problems with parsing CSV with Ruby
Effectively parse CSV with Ruby
At first glance, parsing CSV files with Ruby seems really straightforward and non-problematic. The same for the CSV format itself, it’s just a set of data separated mainly by commas. However, the practice is usually more problematic and complex, especially when you have to parse CSV files uploaded by users.
You can assume that the file will be complete in terms of structure and appropriately encoded. The harsh truth is that files uploaded by users are often malformed, poorly formatted, or encoded. Some of them may even contain duplicated headers or empty rows. This article is a first-aid kit that you can use when you will spot any problems with parsing CSV.
All my notes are based on years of hands-on experience at iRonin.IT - a top software development company, where we provide custom software development and IT staff augmentation services for a wide array of technologies.
High memory usage
Yes, you can harm your application’s performance by parsing CSV files. This can happen when you will parse a very large file and load all contents into a memory to process rows. Every time you see code like this:
CSV.read("./users.csv").each do |row|
# …
end
or
CSV.parse(File.read("./users.csv")).each do |row|
# …
end
Keep in mind that if the file users.csv
is very big, this line of code can cause memory issues as all records will be loaded into memory. To avoid such a situation, use foreach
method:
CSV.foreach("./users.csv") do |row|
# …
end
Duplicated headers
Based on my experience, it does not happen that often as malformed encoding or missing information but it is still the case. Given we have the following CSV data with the duplicated first_name
header:
first_name,last_name,email,first_name,age\n
John,Doe,doe@gmail.com,Johnatan,30\n
Tom,Doe,tom@gmail.com,Thomas,40\n
If we are interested only in the first occurrence of the header’s values, we don’t have to do anything as the standard library will handle this automatically:
data = CSV.read("./duplications.csv", headers: true)
data.first['first_name'] # => “John”
If you are interested in accessing the second value of the duplicated header, you can convert the row into array and then to the hash. Thanks to this trick you will get a hash where the last of duplicated values is present:
data = CSV.read("./duplications.csv", headers: true)
transformed_data = Hash[data.first.to_a]
transformed_data['first_name'] # => “Johnatan”
Collecting values from duplicated headers
If you don’t want to ignore values for duplicated headers, you can easily collect all values with the inject
method that is available on the CSV row with the standard library:
rows = []
csv = "first_name,email,first_name\nTom,tom@gmail.com,Jim"
CSV.parse(csv, headers: true).each do |row|
attributes = row.inject({}) do |h, r|
h[r.first] = Array.wrap(h[r.first]) << r.last; h
end
rows << attributes.transform_values { |v| v.size == 1 ? v.first : v }
end
rows # => [{"first_name"=>["Tom", "Jim"], "email"=>"tom@gmail.com"}]
Above solution will work well if there are no duplications or there are more than 2 columns duplicated as well.
Encoding issues
Let’s consider the following case: we are processing the CSV file with the custom encoding where one of the values contains some special characters:
uid,first_name,last_name,age
1,John,Non spécifié,19
2,Tim,Doe,20
3,Marcel,Doe,30
John’s last name is not specified, and the value is in french. The file encoding is ISO 8859-1. Opening the file in a standard way will throw an error indicating that the file is malformed:
CSV.read("./users.csv") # => Invalid byte sequence in UTF-8 in line 2. (CSV::MalformedCSVError)
We can fix the code by defining the encoding explicitly:
CSV.read("./users.csv", encoding: "ISO-8859-1", headers: true).each do |row|
puts row['last_name']
end
# => Non spécifié
# => Doe
# => Doe
Non-standard separator
Usually, values in the CSV file are separated by commas or semicolons. By the default, the standard Ruby library expects a comma as the separator but you can overwrite this configuration with the col_sep
setting:
csv = "first_name|last_name\nJohn|Doe"
CSV.parse(csv, col_sep: "|", headers: true).first.to_h
# => {"first_name"=>"John", "last_name"=>"Doe"}
Without the explicit configuration, parsing won’t work:
csv = "first_name|last_name\nJohn|Doe"
CSV.parse(csv, headers: true).first.to_h
# => {"first_name|last_name"=>"John|Doe"}
Detecting separator
If you are unsure what is the separator, you can write a code that will attempt to detect it. The below code won’t load the whole file into memory, it will only fetch headers and based on this value it will try to guess which separator was used:
separators = %w[, ; |]
headers = File.open('./users.csv') { |file| file.gets }
stats = separators.inject({}) { |stat, sep| stat.merge(sep => headers.count(sep)) }
separator = stats.sort_by { |_,v| v }.last.first
CSV.foreach('./users.csv', col_sep: separator) do |row|
# ...
end
The good news is that if the CSV file was created by some software as a result of the data export, it should contain one of the standard separators.
Duplicated rows
The standard library allows duplicated rows so you should be aware of such behavior. There is no built-in option to remove duplicates so you have to implement the code from scratch.
In most cases, it’s enough to collect all rows in an array and then call uniq
on the array to get distinct values. Another approach is to remove duplicated lines from the file before you will parse the CSV.
Being aware that the rows might be duplicated is part of the success.
Empty rows
By default, the standard Ruby library for parsing CSV format does not care about the empty rows and return them while parsing so it’s your responsibility to catch them and handle them properly.
However, the good news is that you can ignore empty rows on the parsing level by passing the skip_blanks: true
flag:
CSV.read("./users.csv", headers: true, skip_blanks: true)
Formatting issues
By default, all values parsed by the CSV library will be returned as strings:
csv = "first_name,age,birth_date\nJohn,30,2000-01-15"
row = CSV.parse(csv, headers: true).first
row.to_h
# => {"first_name"=>"John", "age"=>"30", "birth_date"=>"2000-01-15"}
We can either implement our own code to transform values to the right format or take advantage of preprocessors. A preprocessor is an anonymous function that is invoked against each value in the CSV file.
Default preprocessor
The standard library provides some default preprocessors that we can use to make some additional parsing of the CSV data:
Integer
Float
Date
DateTime
If you would like to list all default preprocessors in the console, you can simply run CSV::Converters
. You can also play with each preprocessor invoking the call
method on it:
CSV::Converters[:integer].call("2") # => 2
By default, preprocessors are not used so you have to explicitly define which of them you would like to use:
csv = "first_name,age,birth_date\nJohn,30,2000-01-15"
row = CSV.parse(csv, headers: true, converters: [:integer, :date]).first
row.to_h
# => {"first_name"=>"John", "age"=>30, "birth_date"=>Sat, 15 Jan 2000}
Custom preprocessor
You can easily define your own preprocessor and apply it while parsing the contents of the CSV file. Simply define it as an anonymous function that accepts one argument, the string value of the CSV column:
money_converter = ->(value) {
if value.match?(/^[0-9]* (USD|EUR)$/)
captures = value.match(/^(?<amount>[0-9]*) (?<currency>USD|EUR)$/)
Money.from_amount(captures[:amount].to_f, captures[:currency])
else
value
end
}
The above function checks if the value is an amount with currency. If there is a match, the value is converted to the Money object (from Money gem). Otherwise, it returns the original value so other preprocessors can process it as well.
We can use our custom preprocessor almost the same way we are using the default ones:
csv = "first_name,age,salary\nJohn,30,100000 USD"
row = CSV.parse(csv, headers: true, converters: [money_converter]).first
row.to_h
# => {"first_name"=>"John", "age"=>"30", "salary"=>#<Money fractional:10000000 currency:USD>}
Of course, if you plan to use your custom converter in many places, you can add it to the standard library and reference it via symbol:
CSV::Converters[:money] = money_converter
CSV.parse(csv, headers: true, converters: [:money])
Didn't get enough of Ruby?
Check out our free books about Ruby to level up your skills and become a better software developer.