9 : map-reduce-filter

    One of the most joyful things about Kotlin is its support for map-reduce-filter pipelines to tranform data collections. This way of working with data is concise, elegant, and powerful.

    What is map-reduce-filter?

    Perhaps you have heard of MapReduce, that new paper coming out of Google? (Say word count one more time!) That’s not quite what we’re talking about here. Although it is worth noting that map-reduce(-filter) is a general enough approach to manipulating data that Google was able to build an entire distributed data processing system on top of it.

    What we mean when we talk about map-reduce-filter is a style of working with data that largely leverages three core operations:

    • map: map transforms each value in a list or map into some other value

    • reduce: reduce combines all the values into a collection into a single value

    • filter: filter creates a new collection using only items from the original collection that match some criteria

    It’s worth noting there that our map-reduce-filter pipelines will also take advantage of a lot of helper functions—even if they can themselves be implemented using map-reduce-filter.

    Let’s start with a simple example. Imagine we want to find the sum of the first 8 powers of two. Here’s a fairly standard way to do this using a for loop:

    But there are some things about this example that we don’t like. First, sum has to be mutable, and we’d like to avoid mutable variables. Second, the for loop introduces some syntactic overhead.

    In contrast, here’s how to do this using map-reduce-filter:

    Note how map-reduce-filter utilizes the function programming features we have seen earlier. map, reduce, and filter are higher-order functions, each of which takes a single argument—an anonymous function—which we can call without using parenthesis due to the trailing lambda syntax. The function passed to map and filter takes a single argument, allowing us to use the implicit it to make the mapping or filtering function simpler. The function passed to reduce takes two arguments, meaning that we need to name them in our anonymous function.

    Kotlin’s powerful type inference also makes the example much cleaner. Kotlin can determine that map returns a list of Int values, which can then be manipulated in reduce using the sum operator.

    At this point it may be hard to reason about why the second example is cleaner, but let’s observe a few things. First, it more clearly breaks our algorithm down into clear steps. We start with the list of integers from 0 to 7 (0 until 8). Next we compute two to the power of each of these integers using .map. Finally, we sum all the values together using .reduce.

    Speaking of those helper functions, we can utilize one here to make this example even cleaner:

    Let’s continue by examining each of our new functional collection processing tools one by one.

    map

    map is a good place to start since it is one of the easiest to understand. A mapping function takes each member of a collection and transforms it into something else. It can perform any operation on each member of the collection, as long as its yields some result for each value. The result of a map is a new collection with the items from the original collection replaced with the results of the mapping function.

    Here’s an example where we extract the ages of a few people that we know into a list:

    Note that people above has not been modified. Instead, mapping functions build a new list from an existing one. So we could repeat the operation to extract the names of the people that we know. This is a very common use for map when processing data:

    In Kotlin the mapping function receives a single argument. We can utilize the implicit it, or we can also name that argument if we want:

    Note that the mapping function, like another anonymous functions, does not use an unqualified return to return a value. The last expression of the mapping function will be used to build a new collection from the original data.

    Using a return statement in a mapping function will return from the enclosing function. This is usually not what you want:

    There is a way to specify that you want to return from the mapping function using the qualified return return@map, but it may be cleaner to rewrite your code to avoid this if possible:

    map on Maps

    map is normally used to modify the values in a collection. When applied to a list, it maintains the order of the original list:

    However, we can also map on a map either using mapValues to transform the values or mapKeys to transform the keys. Note that both mapValues and mapKeys receive a single argument that corresponds to an entry in the map, not to a value or a key.

    If we want to work with the value or key directly, we can also use destructing assignment to extract these properties from it:

    reduce

    reduce is probably the hardest of the map-reduce-filter functions to understand. Both map and filter examine individual values one at a time—whereas reduce combines values together, and therefor transmits state between each call. Let’s try to make use of reduce more clear using some examples.

    First, let’s identify the weakness of map. Say we want to sum all of the values in a list. We could do this with a map, but it requires declaring and modifying a variable outside of the processing pipeline:

    This works…​ but it requires a mutable variable and state outside our processing pipeline—both of which we’re like to avoid. Also consider what the result of the call to map is:

    While it’s nice to see the intermediate results, what we would like is a higher-order function allowing us to reduce a collection to a single value.

    Unsurprisingly, that function is called reduce. Unlike map, the reducing function receives two arguments. The second is an element of the collection—but the first is the result produced by the previous calls to reduce. Each result of the function passed to reduce is then passed as the first argument to the next call to the reducing function.

    Let’s look at an example to try and make this more clear:

    There are a few things to note here. First, 9 pairs of results are shown, even though the list has 10 elements. This is because the first time that reduce is run its first two arguments are the first two elements in the list. This result (0) is then passed to the next call to reduce, which displays 0 2. Since the result (0) never changes, we see it printed repeatedly, even as each subsequent call to reduce has the opportunity to incorporate a new value in the result.

    With this as our starting point, let’s look at how to implement array sum using reduce:

    Now the output is more interesting. We can see how the sum is being built one value at a time. First we start with the sum of the first two values in the array: 0 and 1. That result (1) is passed to the next call to reduce, which adds 2, producing 3. That result (3) is passed to the next call to reduce, which adds 3, producing 6. And so on.

    We can use a similar approach to reducing to the product:

    You might wonder what happens if you call reduce on a collection with a single element. Try it!

    That makes a certain amount of sense. What about on an empty list? (Note that we need to add the Int type parameter to our list so that Kotlin can determine the type of the arguments to reduce.)

    fold

    It’s important to note one important limitation of reduce. Because it is called the first time on the first pair of values in the collection, reduce can only reduce to a value that is the same type as the collection. This is because the first time that reduce is called result is passed the first value in the collection, which must be the same type as the values in the collection.

    This is an important limitation. For example, imagine that we want to combine all the numbers in our list into a string.

    We could work around this using casting, but that’s not the right approach. Instead, we can use a variant of reduce called fold. Unlike reduce, fold allows us to determine what is passed as the result to the first call to our folding function.

    Here’s how to do the sum from above using fold instead of reduce:

    The result is the same, but observe the differences. Our folding function was called 10 times, whereas our reducing function was called 9. The first call to fold receives the value passed to fold as its first argument and the first item from the collection as its second argument.

    After that point fold behaves identically to reduce. But how the process is bootstrapped has important effects on what we can do with fold. For example, here’s how to combine all of the numbers into a string:

    Passing the empty string "" to fold establishes the type of result as String, and also that we must return something of type String. Which we do easily using string interpolation. This extra capability of fold turns out to be pretty useful in certain situations.

    For the sake of completeness, folding an empty list turns out better than reducing an empty list:

    filter

    The last commonly-used pipeline operator is filter. Filtering is simple. If the filter function returns true, the item is retained in the collection. If it returns false, the item is removed.

    Let’s see this in practice:

    Other Collection Operations

    Working with data using map-reduce-filter pipelines becomes second-nature quite quickly. It helps that Kotlin includes a bunch of useful helper functions for performing common operations—even if they could be done using some combination of map, reduce, or filter. We’ll show both the helpers and an idea of how to implement them—usually using reduce—if possible.

    sum, min, max

    sum, min, and max work out of the box:

    We’ve already seen how to implement sum using reduce. How about min?

    distinct

    distinct eliminates duplicate items from a collection.

    Again, we can reimplement this using fold:

    sorted, reversed

    sorted sorts the items in the collection. Without an argument the natural sort order of the elements is used:

    If we want to adjust the sort order we can use sortedBy and provide a function that should return a sort key for each item.

    Finally, reversed will reverse the order of a collection, providing a superior reverse sort solution to the one shown above:

    groupBy

    groupBy is a fun and powerful one. Given a list, it will group the elements by a value into a map. Duplicate keys will be added to a list, meaning that no items are lost.

    For example, let’s say that we wanted to group all people by age:

    Now let’s do some processing and see what happens when we have some duplicate ages:

    joinToString

    Finally, it’s frequently useful to be able to reduce a collection to a String. Kotlin makes this easy and maybe even joyful:

    And Plenty More

    The examples above just scratch the surface of what Kotlin is capable of when working with collections. Many common operations are supported out of the box, and it’s rare that you find yourself stuck and needing to actually use a loop and mutable state.

    When you’re getting started with map-reduce-filter data processing, it can be fun to try and rewrite your existing data processing pipelines that probably use loops in this functional style. Give it a shot and see how you go.