Thinking Functionally: Transforming Data (Part 2)

By now, I'm assuming that you've read Part 1. If you haven't, that's ok. Although this post follows thematically from Part 1, it can stand alone. Enough talk, let's get to it.

In Part 1 I talked a lot about this programming problem that I was asked to solve during a technical interview. Here's a rough reconstruction of that problem:

Use the BitTorrent encoding (Bencode) protocol to serialize the following data structures: String, Integer, List, Dictionary.

That's not the exact wording, but it's roughly equivalent. I suggest that you quickly review the Bencode protocol if you aren't familiar with it; just so that we'll be on the same page. I'll give a couple of examples from the Bencode wiki:

IN -> OUT
1: "Spam" -> 4:spam
2: 3 -> i3e
3: ["spam", "eggs"] -> l4:spam:4:eggse
4: {"cow" => "moo", "spam" => "eggs"} -> d3:cow3:moo4:spam4:eggse
5: {"spam" => ["a", "b"]} -> d4:spaml1:a1:bee
When this problem was presented to me, I approached it from the perspective of OO imperative programming. Let me tell you: that got really messy, really fast. 
The first thing working against me was that I tried to write the solution in PHP (which was the language I was most comfortable with at the time). No disrespect intended to the language, but since PHP treats Arrays and Hashes (or Associative Arrays, as they are also known in the PHP world) as essentially equivalent, there was added complexity when trying to distinguish a List from a Dictionary. My solution started simple; a Bencode class with some instance methods to take care of the data type encoding. Then came the conditionals. Which lead to a set of Interfaces to allow for type checking. Eventually I had to turn to recursion. That's the point where I should have known that I was coming at it from the wrong perspective. Like I said, it got messy.

In the end I had a very complex solution to a seemingly simple problem. That thought stuck with me for a long time. I expected that a simple problem should have a simple, or at least more elegant, solution. Especially for something that was being asked in a technical interview. I really started to understand functional programming, it dawned on me that Bencoding is just about transforming the supplied data types into a specially encoded string. The kind of problem that lends itself well to the functional paradigm. 

The String data type gets transformed into a new encoded string which consists of the length of the original String, followed by a colon, followed by the original String. That's pretty easy to accomplish. Let's see how that might look in Clojure:

(defn bencode-string [s] (str (count s) ":" s))

A function that accepts a String and returns the encoded version. Nice and simple, right? Transforming an Integer is just as simple:

(defn bencode-integer [i] (str "i" i "e"))

Integer goes in, the encoded version of the integer comes out. If you want to see these in action, you can head over to Try Clojure and paste these function definitions into the REPL. Once the functions are defined, you can call them with examples like: (bencode-string "spam") and (bencode-integer 3). The REPL will echo the return value of each function.

Things start to get a bit tricky when we look at the encoding for Lists. A List is just a collection of data types; it can contain Strings, Integers, other Lists, and/or Dictionaries. In order to Bencode a List we need to apply the appropriate bencoding function to each element in the List. Luckily, this is something that functional programming does really well. Assuming we have a generic bencode function that knows how to encode each data type, encoding a List might look like:

(defn bencode-list [l] (str "l" (reduce str (map bencode l)) "e"))

Ah, the classic Map/Reduce paradigm. Let's dig into this a little bit. We'll start with (map bencode l) since it's the first part of the function that will be processed. As I mentioned above, this makes use of a bencode function that we haven't defined yet, but that we can assume knows how to encode each of the four data types. Stick with me, we'll define bencode shortly. 

The map function takes two arguments: a function and a List. It then applies that function to each element of the List, returning a new List of the resulting values. It's roughly equivalent to the following imperative code:

new_list = []
l.each do |e|
  new_list.push(bencode(e))
end

After map has created a new List containing the bencoded versions of the elements in l, it then passes that new List to the reduce function. The reduce function takes two arguments: a function and a List. It is often used to reduce a List to a single value. A simple use case for the reduce function is summing a List, i.e. reducing a List to a single Integer value by using the addition function. In our example, (reduce str (map bencode l)) is taking the resulting List from the map function and reducing it to a single String using the str function. str is a function that can be used to concatenate Strings, e.g. (str "Hello" "World") will return "HelloWorld". reduce applies str to the first two elements of the List, takes the resulting String and applies str to it and the third element. It continues doing this until it has exhausted the List. Finally, we use str to prepend the "l" (lower case L) and append the "e" to the result of the reduce function (as per the Bencode specification). 

Given a List like ["spam", "eggs"], the map function will return a List that looks like ["4:spam", "4:eggs"], the reduce function will return a String that looks like "4:spam4:eggs" and the outermost str function will return "l4:spam4:eggse".

Before moving on to the Dictionary data type, let's take a quick look at the bencode function that bencode-list relys on. bencode is the main function will be used to bencode data. It needs to accept a data type, determine how to encode it, and return the encoded result. Based on what we've done so far, we might start with something like this:

(defn bencode [data]
  (cond 
    (string? data) (bencode-string data)
    (number? data) (bencode-integer data)
    (vector? data) (bencode-list data))) 

It's just a simple conditional structure that checks the type of the supplied data, then delegates to other functions that know how to do the actual encoding. Think back to how bencode-list uses bencode to encode each element in the List. Hopefully that is clear now! 

The only thing that our bencode function doesn't know how to do yet is encode Dictionaries. There are some really cool Clojure functions that are going to make the Dictionary encoding a snap, but those functions will need some special attention and explanation. I'm going to save that for Part 3. 

Stay tuned!
6 responses
Funny. I did this exact same problem in Clojure and pretty much every line is the same, except for the function names and the use of 'clojure.string/join' instead of 'reduce str'. Good post.
No joke, I originally started with clojure.string/join! Working through it, I realized that I could do the same thing with str. I'll post the full Gist with Part 3 and we can compare our Dictionary encoding functions. That's the fun one!
In order to improve your solution you can get rid of the hard coded cond and use multimethods or better protocols. Protocols in particular offer you a la carte polymorphism over any data type. Give them a try. You may define a bencodable protocol with a single bencode function and then extend it to any type you need.
Thanks Luca! I haven't gotten too deeply into multimethods or protocols yet. I will definitely give them a try and then probably revisit this idea in a future post
Another string join option in this case might be (apply str (map bencode l)) - although since you already have strings I think I prefer clojure.string/join.
1 visitor upvoted this post.