Python's comprehension syntax for Java developers

Originally posted on JaggedVerge: http://www.jaggedverge.com/2015/10/python-comprehension-syntax-for-java-developers/ (please ask questions and leave comments over there)

Recently I've been working with both Java and Python. One of the jarring differences between the two is the relative brevity of Python code, which is enabled by various Python language features. Every language makes a decision about how much syntactic sugar to offer and how succinct to be, and in this regard Python and Java sit close to opposite extremes. Java developers who are new to Python frequently have trouble with Python's comprehension syntax, as there's no direct analogue of it in Java. Part of the difficulty of learning this syntax is that it's hard to search for, as there are no searchable keywords involved.

Here are a few examples of some of the syntax that you can see in Python source code:

[x**2 for x in range(5)] #list comprehension
{x**2 for x in range(5)} #set comprehension
(x**2 for x in range(5)) #generator expression

All of these generate a collection of items based on the same rule. While the types are different, they all share a common form of syntax which can be broken down into two constituent parts:

  1. A rule that generates a sequence of items
  2. The data type that the sequence of items is stored as

Part 1: sequence generation

The general structure for the comprehension is as follows:

function(x) for x in iterable

The equivalent of this in pseudocode is:

for x in iterable:
    function(x)

Essentially you loop over every value in the given iterable variable and you do something with it. (Note that you don't have to call a function with x as a parameter, more on that a bit later.)

For example say we had:

x**2 for x in range(10)

This is essentially shorthand for:

for x in range(10):
    x**2

As you can see this generates a sequence of values where we square each value.

Similar code in Java would be:

//Similar to the range construct in Python: begin is included, end is excluded
List<Integer> makeSequence(int begin, int end) {
    List<Integer> ret = new ArrayList<>(Math.max(0, end - begin));
    for (int i = begin; i < end; i++) {
        ret.add(i);
    }
    return ret;
}

//Calling code (needs java.util.List and java.util.ArrayList imported)
List<Integer> range = makeSequence(0, 10);
for (Integer item : range) {
    Math.pow(item, 2.0); //the result is computed and then discarded
}

Note that we haven't actually done anything with the result of squaring the numbers yet, which leads us nicely to part 2.

Part 2: data types

Compared to Java there's a lot of syntactic sugar involved in the idiomatic usage of Python. You can see an example of this when creating variables: while Python is strongly typed, you do not supply a type when creating a variable. Instead the type is determined by the object the variable refers to. With Java you must supply the type when declaring a variable, and these types are named explicitly in the code, such as the List<Integer> type we had in the examples before.
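As a small illustration of that difference, here is a minimal sketch. The names count and result are made up for this example. It shows a name picking up its type from whatever it refers to, and that Python still refuses to silently mix incompatible types:

```python
# Illustrative names only; they do not come from the examples above.
count = 10
print(type(count).__name__)   # int

count = "ten"                 # the same name can later refer to a str
print(type(count).__name__)   # str

# Strong typing: Python will not silently convert between str and int.
try:
    result = "1" + 1
except TypeError as exc:
    result = None
    print("TypeError:", exc)
```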

For example in Python creating a list from a comprehension looks like this:

[x**2 for x in range(10)]

Which, like earlier, is shorthand for the following (written as the body of a function):

ret = []
for x in range(10):
    ret.append(x**2)
return ret

Running this in the interpreter in interactive mode you can see what this does:

>>> a = [x**2 for x in range(10)]
>>> type(a)
<class 'list'>
>>> a
[0, 1, 4, 9, 16, 25, 36, 49, 64, 81]

Note that the variable a here is of the list type because we assigned a list to it. In Java a rough equivalent of this code is:

List<Integer> range = makeSequence(0, 10);
List<Integer> ret = new ArrayList<Integer>();
for (Integer item : range) {
    ret.add(item * item); //Math.pow returns a double, which can't be added to a List<Integer>
}
return ret;

In Python when you have the form [sequence_generation_expression] you get a list. If it were {sequence_generation_expression} you would get a set, and if it were (sequence_generation_expression) you would get a generator. More on this in the following examples.

Examples

Here are a few examples of how these work with various types, run through the interpreter in interactive mode. If you are trying out this code while reading, you might also want to try calling the type and help functions on the various variables you create in the process.

List comprehension

>>> a = [x**2 for x in range(10)]
>>> type(a)
<class 'list'>
>>> a
[0, 1, 4, 9, 16, 25, 36, 49, 64, 81]

This list comprehension creates a list like the one in the earlier example.

Set comprehension

>>> b = {x**2 for x in range(10)}
>>> type(b)
<class 'set'>
>>> b
{0, 1, 64, 4, 36, 9, 16, 49, 81, 25}

This creates a set with the same elements as those found by applying the rule. Note that the order in which the elements were created is not preserved in the set. This is because the set uses a hashing function to add and look up items. Overall this is similar to Java's HashSet data type.
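One consequence of a set comprehension that the squares example doesn't show is that duplicate values collapse into one element. A minimal sketch, using a made-up rule (x % 3) chosen only because it produces repeats:

```python
# The rule x % 3 is an illustrative choice; it repeats the values 0, 1 and 2.
remainders_list = [x % 3 for x in range(10)]
remainders_set = {x % 3 for x in range(10)}

print(remainders_list)              # [0, 1, 2, 0, 1, 2, 0, 1, 2, 0]
print(len(remainders_set))          # 3
print(remainders_set == {0, 1, 2})  # True
```

The list keeps every generated value, while the set keeps each distinct value once, much as adding the same values to a Java HashSet would.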

Dictionary comprehension

>>> c = {x: x**2 for x in range(10)}
>>> type(c)
<class 'dict'>
>>> c
{0: 0, 1: 1, 2: 4, 3: 9, 4: 16, 5: 25, 6: 36, 7: 49, 8: 64, 9: 81}

This creates a dictionary with the x's themselves as keys and the x squared as values. This is similar to Java's HashMap datatype.
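The key and value expressions don't have to be numeric. As a sketch, assuming a hypothetical list of words that is not part of the examples above, a dictionary comprehension can build a lookup table in one line:

```python
# "words" is a made-up input list for illustration.
words = ["list", "set", "dict"]

# Map each word to its length: the part before the colon is the key,
# the part after it is the value.
lengths = {word: len(word) for word in words}

print(lengths["set"])   # 3
print(lengths["dict"])  # 4
```

Looking up lengths["set"] here plays the same role as calling get("set") on a Java HashMap.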

Generator expression

>>> d = (x**2 for x in range(10))
>>> type(d)
<class 'generator'>
>>> d
<generator object <genexpr> at 0x7f78bea3eb40>

Essentially a generator lazily creates the same sequence of items that the list comprehension does, but it only computes the values as they are needed. This is a fairly substantial departure from what you would see in the core Java language, which is basically eager evaluation everywhere. You could create a class that emulated this, but it would take some effort. Note that it only generates the sequence once, so once you have consumed the values you need to create a new generator to get them again. This makes generator expressions useful in much the same way as pipelines are: you can run some processing without having to store the entire dataset in memory at once. We can feed a generator expression into the constructor of other types:

>>> d = (x**2 for x in range(10))
>>> list(d)
[0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
>>> list(d)
[]

Again note that the generator can only be used once.
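To make the lazy behaviour a little more concrete, here is a minimal sketch that pulls values out one at a time with the built-in next function, and then passes a fresh generator expression straight to sum so that no intermediate list is built. The variable names are illustrative:

```python
squares = (x**2 for x in range(10))

# Values are computed only when asked for.
first = next(squares)      # 0
second = next(squares)     # 1

# Whatever has not been consumed yet is still available.
remaining = list(squares)  # [4, 9, 16, 25, 36, 49, 64, 81]

# A new generator expression is needed to go over the values again.
# When it is the only argument to a call, the extra parentheses can be dropped.
total = sum(x**2 for x in range(10))

print(first, second, remaining, total)
```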

Creating a tuple

To create a tuple we have to call the built-in tuple constructor explicitly, because there's no comprehension syntax for tuples (parentheses on their own produce a generator expression):

tuple(x**2 for x in range(5))
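A short sketch of the difference, with illustrative variable names, showing that only the explicit tuple call gives a tuple:

```python
squares_tuple = tuple(x**2 for x in range(5))
not_a_tuple = (x**2 for x in range(5))

print(squares_tuple)               # (0, 1, 4, 9, 16)
print(type(squares_tuple).__name__)  # tuple
print(type(not_a_tuple).__name__)    # generator
```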

As mentioned earlier, the expression in a comprehension doesn't have to use the value from the iterable. For example, say we had a function foo() that we wanted to call 3 times; we could do this as follows:

[foo() for _ in range(3)]

This uses the Python convention of the underscore representing an unused variable.
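To try this out you need some definition of foo(). The one below is a hypothetical stand-in that simply returns an increasing counter, so that each call is visible in the result:

```python
import itertools

# Hypothetical stand-in for foo(): returns 1, 2, 3, ... on successive calls.
_counter = itertools.count(1)

def foo():
    return next(_counter)

# The loop variable is never used, so it is named _ by convention.
results = [foo() for _ in range(3)]
print(results)  # [1, 2, 3]
```

The list ends up holding one return value per call, which also confirms that foo() ran exactly three times.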
