(Status: this proposal will not go forward, but helped spark a discussion on a better idea--generator expressions.)
List displays (aka list comprehensions) have proven to be a useful and popular part of Python. However, there are limitations on what they can do, and they sometimes lead to inefficiencies: it is tempting to create a list of intermediate results, even when the whole list is not really needed. Consider finding the maximum termperature over the last day, where you have a function, temp, that takes an hour as argument and fetches the temperature. With list comprehensions this is:
max([temp(hour) for hour in range(24)])
This builds up the list of temperatures, only to discard
them. Not so bad in this case, but troublesome in cases when the list
is multiple megabytes.
In this note I propose a new syntax called accumulation
displays that combines the accumulating operator (in this case
max) with the iteration over a collection of values. Accumulation
displays are intended to be more efficient in some cases (because no
intermediate list is constructed), more extensible (you can do things
beyond max, min, and sum that you just can't do by operating on a
returned list), and very natural to read. An accumulation display for
the example above is:
[Max: temp(hour) for hour in range(24)]
I use "Max" rather than "max" because I want the accumulator to be
different from the builtin function (they could be made the same, but
I prefer not to for reasons we will soon see). Here are some more
examples:
[Sum: x*x for x in numbers]
[Product: Prob_spam(word) for word in email_msg]
[Min: temp(hour) for hour in range(24)]
[Mean: f(x) for x in data]
[Median: f(x) for x in data]
[Mode: f(x) for x in data]
[SortBy: abs(x) for x in (-2, -4, 3, 1)]
"[" expression ":" listmaker "]"
where listmaker is everything that can go into a list display.
(See http://www.python.org/doc/current/ref/lists.html.)
[Sum: x*x for x in numbers]
we want that to be the equivalent of
total = 0
for x in it:
total = (x*x + total)
return total
That seems simple enough: we initialize a value, update it in the
loop, and then return it at the end. But it is not always quite that
easy. For example, in
[Mean: f(x) for x in data]
we want
(total, n) = (0, 0)
for x in it:
(total, n) = (total+x, n+1)
return total / n
In this case we need some additional computation at the end, and the
value we are updating is more complex (a two-element tuple rather than
a single number). (And we have assumed true division.) We need to make
some choices: let us say that an accumulator (such as Sum or
Mean) is a class or other callable object which when called
with no arguments returns an object with proper initial state, and
that we then call the object's add method to accumulate
values in the loop, and its result method to get the final
result. Rather than showing the expansion of the accumulation display
code into a sequence of statements, we will show it as a function
call. That is,
[acc: exp for x in it]
is equivalent to
accumulation(acc, lambda x: exp, it)
where we have defined
def accumulation(acc, f, it):
a = acc()
for x in it:
a.add(f(x))
return a.result()
We can now define Mean as
class Mean:
def __init__(self):
self.total, self.n = 0, 0
def add(self, exp):
self.total, self.n = self.total+exp, self.n+1
def result(self):
return self.total / self.n
[Argmax: votes(c) for c in candidates]
(In math "argmax" (or "arg max") gives you the element of a set with
the maximal value on some function (whereas "max" gives you the maximal function value itself).)
Here we are asking for the element c with the maximal value of
votes(c). We can program this by adding the three characters ", x" to
the definition of accumulation:
def accumulation(acc, f, it):
a = acc()
for x in it:
a.add(f(x), x)
return a.result()
[Some: temp(hour) > 30 for hour in range(24)]
We want this to return a true value if there is some hour with temp(hour) >
30, and False if there is not. Obviously it can stop iterating when
it finds the first element that satisfies the condition. We will
handle this by having the accumulator's add method return a
true value if it wants to stop the loop. After changing accumulation
def accumulation(acc, f, it):
a = acc()
for x in it:
if a.add(f(x), x):
break
return a.result()
we can define Some as:
class Some:
def __init__(self):
self.value = False
def add(self, exp, _):
if exp:
self.value = True
return StopIteration
def result(self):
return self.value
Note I returned StopIteration (rather than, say,
True) just as a hint to the reader. (I considered having the
protocol raise StopIteration rather than just returning it, but I
think it is simpler just to return a value.) Note that we still call
the result method, even when we have broken out of the loop. If we
had the add method return the final result for the
accumulation it would work and be slightly simpler for Some,
but wouldn't work on an accumulator that needs to exit early and then
return None as its final result.
[Top(10): humor(joke) for joke in jokes]
Up to now, accumulators were passed no arguments when they were
initialized, implicitly, by the function accumulation. Here
we have an accumulator, Top, that takes a parameter, and all
we have to do is change accumulation so that it can accept
either an accumulator class or an instance:
def accumulation(acc, f, it):
if isinstance(acc, (types.ClassType, types.TypeType)):
a = acc()
else:
a = acc
for x in it:
if a.add(f(x), x):
break
return a.result()
Now we can define Top:
from heapq import *
class Top:
def __init__(self, k):
self.k = k
self.heap = []
def add(self, exp, _):
heap = self.heap
if len(heap) < self.k:
heappush(heap, exp)
elif exp < heap.biggest():
heapreplace(heap, exp)
def result(self):
return self.heap
Note that this can be much more efficient than alternatives such as:
jokes.sort(lambda a, b: humor(a) < humor(b))
jokes[0:10]
a = Top(10)
for j in new_jokes:
print j
a.add(float(raw_input("Rating?")), j)
return [a: humor(j) for j in machine_ratable_jokes]
Here we need a print statement, so we explicitly manage the
accumulator a, filling it with results derived by asking the
user for a rating, and then we use the same accumulator in a display
to add in some more results and return the final list of top 10 results.
Another way to look at it is that there are some common algorithms in programming, such as finding the best element(s) or value(s) from a candidate set, that appear again and again, and this allows us to abstract out these algorithms, rather than re-implement them every time.
One drawback of my proposal is that the [ ... ] syntax suggests that a list is returned, when that need not be the case. My defense is that it is iterating over a list (or at least, an iterable). But Guido pronounced, quite reasonably, that this was not a sufficient defense.
There have been past discussions of set displays and dict displays, using syntax something like:
words = { w.lower() for w in text.split() }
squares = { i:i*i for i in range(N) }
I'm not going to take a stand on whether these are good ideas, but I note
you can do this with accumulation displays easily:
words = [Set: w.lower() for w in text.split()]
squares = [Dict: i, i*i for i in range(N)]
(or of course you can do it without any changes to the language):
words = set([w.lower() for w in text.split()])
squares = dict([i, i*i for i in range(N)])
Discussion of my proposal led to the ressurection of PEP 289 for
generator
comprehensions, slightly updated and now called generator expressions. The idea is that
exp for x in it
would be roughly equivalent to
def gen():
for x in it:
yield exp
return gen()
You get short-circuit evaluation just by ceasing to call the
generator. In fact, all the accumulators that I wrote could be
re-written as functions that take an generator/iterator/sequence as
input (although if you want both the loop variable and the computed
value, you need to pass them both as a tuple). Compare:
| Accumulation Display | Generator Expression plus Function |
|---|---|
[Sum: x*x for x in numbers]
[Product: Prob_spam(word) for word in email_msg]
[Min: temp(hour) for hour in range(24)]
[Median: f(x) for x in data]
[SortBy: abs(x) for x in (-2, -4, 3, 1)]
[Top(10): humor(joke) for joke in jokes]
[Argmax: votes(c) for c in candidates]
|
sum(x*x for x in numbers)
product(Prob_spam(word) for word in email_msg)
min(temp(hour) for hour in range(24))
median(f(x) for x in data)
sortdecorated((abs(x),x) for x in (-2, -4, 3, 1))
top(10, (humor(joke) for joke in jokes))
argmax((c,votes(c)) for c in candidates)
|
With the accumulation display approach, you need a change to the language, and a new set of accumulators. With the generator expression approach you need a change to the language of similar complexity, and a combination of old funcxtions (sum, min) and new functions (product, top, argmax). The new functions are somewhat easier to write than the accumulators; for example a 6-line some function instead of an 8-line Some accumulator:
def some(items):
"If some element in items is true, return it; otherwise return False"
for item in items:
if item:
return item
return False
Philip J. Eby and Raymond Hettinger proposed the generator expression idea. I'm happy that my proposal came at the right time to help spur a discussion that appears to be on the way to a happy ending--I think the generator expression idea is more generally useful than my proposal.
Other points were raised in the discussion on python-dev. Ian Bicking made a suggestion that I interpreted to mean that accumulation should be implemented as something like:
def accumulation(acc, f, it):
a = acc.__accum__()
for x in it:
if a.add(f(x), x):
break
return a.result()
The advantage is that min, max, sum and perhaps some others could
be given a __accum__ method. I'm not sure of the details of how this would work out.
temp = [70, 70, 71, 74, 76, 76, 72, 76, 77, 77, 77, 78, 78, 79, 79, 79, 78, 80, 82, 83, 83, 81, 84, 83]
data = temp
votes = {'Gray': 45, 'Arnie': 48, 'Peter': 3, 'Cruz': 32, 'Tom': 13}
candidates = ['Gray', 'Arnie', 'Peter', 'Cruz', 'Tom']
[Max: temp[hour] for hour in range(24)]
==> 84
[Min: temp[hour] for hour in range(24)]
==> 70
[Sum: x*x for x in data]
==> 144999
[Mean: f(x) for x in data]
==> 155.25
[Median: f(x) for x in data]
==> 156.0
[Mode: f(x) for x in data]
==> 166
[Argmax: votes[c] for c in candidates]
==> Arnie
[Argmin: votes[c] for c in candidates]
==> Peter
[Some: temp[hour] > 75 for hour in range(24)]
==> True
[Every: temp[hour] > 75 for hour in range(24)]
==> False
[Top(10): temp[hour] for hour in range(24)]
==> [84, 83, 83, 83, 82, 81, 80, 79, 79, 79]
[Join(', '): votes[c] for c in candidates]
==> 45, 48, 3, 32, 13
[SortBy: abs(x) for x in (-2,-4, 3, 1)]
==> [1, -2, 3, -4]
[SortBy(reverse=True): abs(x) for x in (-2, -4, 3, 1)]
==> [-4, 3, -2, 1]
listmaker ::= expression ( list_for | ( "," expression )* [","] )we could change it to:
listmaker ::= expression [ ":" expression ] ( list_for | ( "," expression )* [","] )
[acc: exp for x in it1 if c1 for y in it2 if c2 ...]
to mean
from itertools import ifilter
accumulation(acc, lambda (x, y, ...): exp,
icross_product(ifilter(c1, it1), ifilter(c2, it2), ...))
where icross_product takes the cross product of all the
argument iterables, in left-to-right order. That is, it generates tuples
in the order
(c1[0], c2[0], ... cn[0]) (c1[0], c2[0], ... cn[1]) (c1[0], c2[0], ... cn[2]) ... (c1[0], c2[0], ... cn[-1]) (c1[0], c2[0], ... cn_1[1], cn[0]) (c1[0], c2[0], ... cn_1[1], cn[1]) ... (c1[-1], c2[-1], ... cn[-1])Perhaps it should be a standard part of itertools.
This document is in the public domain and may be used for any purpose.