Why I Love Little Classes

If you’ve ever had to import data from end users, you’ve probably been amazed at the different ways people can find to mess up a simple csv file. For example, take a file that should have just a few columns, say name and url. I’ve seen uploaded files that have hundreds of thousands of blank rows at the end. I’ve seen files with 20 blank column headers at the end. And without a doubt, somebody will decide to rename the columns to bookmark name and bookmark URL.

While it may be tempting to just raise an exception on that kind of user error, it’s not particularly friendly. Especially if you’re dealing with less tech savvy users. Instead, it’s a good opportunity to compose a few little classes to help make your life easier.

When facing exactly this problem, my first goal was to create a way to clean up problems like blank rows and extra header columns. To do that, I created a very simple class using TDD. First, I gave it a CSV with extra rows and verified that it removed those rows. Next, I gave it blank columns in the header and made sure those were removed as well. The end result looked like this:

class CSVCleaner
  def clean(csv_string)
    strip_blank_header_fields(strip_blank_lines(csv_string))
  end

protected
  def strip_blank_lines(csv_string)
    csv_string.gsub(/^,*[\n\r]/m,"")
  end
                                  
  def strip_blank_header_fields(csv_string)
    csv_string.gsub(/^([^\n\r]+?),+([\n\r])/,'\1\2')
  end
end

That class is simple and focused. It also is easy to use. Where before we would need to pass in a string, we simply wrap that string in CSVCleaner.new.clean(string).

To tackle the problem of bad column names, I decided to create an object to try to guess the right column names. The first version of my bookmark importer expected to receive each row as a hash. I didn’t want to have to change my importer code, so I wanted whatever I built to look like a hash.

I ended up with this:

class CSVColumnNameGuesser
  def initialize(hash)
    @hash = Hash[hash.map {|k,v| [clean_key(k), v]}]
  end

  def [](key)
    exact_match(key) || close_enough(key)
  end

  def keys
    @hash.keys
  end
protected

  def exact_match(key)
    @hash[key.to_s]
  end

  def close_enough(key)
    @hash[@hash.keys.detect {|k| k =~ /#{key}/i}]
  end

  def clean_key(key)
    key.to_s.gsub(/_/,"")
  end
end 

Notice that I didn’t create a subclass of hash that performed the guessing. Instead, I wrapped hash in a simple class that has the same interface (from a duck type perspective) for my use. I made it clear what the public interface was and used well named methods to make it obvious what the code is doing. This is an example of favoring composition over inheritance.

While I like this code, it isn’t perfect. For example, it needlessly repeats the key matching. If I were to import a one million row spreadsheet with twenty columns, I would expect to spend a lot of time doing regexp processing. I could work around that by either adding a lookup cache or by creating a builder to build an optimized version of this object, but I haven’t had a need for that yet. While performance may not be optimal, the code is simple and easy to read and has really helped me handle unpredictable user input.