How to Use LINQ SelectMany
Okay, everyone knows by now how simple LINQ queries with a where and
select (and orderby, and Take and Skip and Sum, etc) are translated
from a query comprehension into an equivalent expression for further
translation:
from p in products where p.Price > 100 select p.Name
becomes
products.Where(p => p.Price > 100).Select(p => p.Name)
All
blue syntax highlighting has gone; the compiler is happy with what
remains and takes it from there in a left-to-right fashion (so, it
depends on the signature of the found Where method whether or not we
take the route of anonymous methods or, in case of an
Expression<…> signature, the route of expression trees).
But let’s make things slightly more complicated and abstract:
from i in a
from j in b
where i > j
select i + j
It’s
more complicated because we have two from clauses; it’s more abstract
because we’re using names with no intrinsic meaning. Let’s assume a and
b are IEnumerable<int> sequences in what follows. Actually what
the above query means in abstract terms is:
(a X b).Where((i, j) => i > j).Select((i, j) => i + j)
where X is a hypothetical
Cartesian product operator, i.e. given a = { 1, 4, 7 } and b = { 2, 5,
8 }, it produces { (1,2), (1,5), (1,8), (4,2), (4,5), (4,8), (7,2),
(7,5), (7,8) }, or all the possible pairs with elements from the first
sequence combined with an element from the second sequence. For the
record, the generalized from of such a pair – having any number of
elements – would be a tuple. If we would have this capability, Where
would get a sequence of such tuples, and it could identify a
tuple in its lambda expression as a set of parameters (i, j).
Similarly, Select would do the same and everyone would be happy. You
can verify the result would be { 6, 9, 12 }.
Back to reality
now: we don’t have the direct equivalent of Cartesian product in a form
that produces tuples. In addition to this, the Where operator in LINQ
has a signature like this:
IEnumerable<T> Where<T>(this IEnumerable<T> source, Func<T, bool> predicate)
where
the predicate parameter is a function of one – and only one – argument.
The lambda (i, j) => i > j isn’t compatible with this since it
has two arguments. A similar remark holds for Select. So, how can we
get around this restriction? SelectMany is the answer.
Demystifying SelectMany
What’s the magic SelectMany all about? Where could we better start our investigation than by looking at one of its signatures?
IEnumerable<TResult> SelectMany<TSource, TCollection, TResult>(
this IEnumerable<TSource> source,
Func<TSource, IEnumerable<TCollection>> collectionSelector,
Func<TSource, TCollection, TResult> resultSelector)
Wow, might be a little overwhelming at first. What does it do? Given a sequence of elements (called source) of type TSource, it asks every such element (using collectionSelector)
for a sequence of – in some way related – elements of type TCollection.
Next, it combines the currently selected TSource element with all of
the TCollection elements in the returned sequence and feed it in to resultSelector to produce a TResult that’s returned. Still not clear? The implementation says it all and is barely three lines:
foreach (TSource item in source)
foreach (TCollection subItem in collectionSelector(item))
yield return resultSelector(item, subItem);
This already gives us a tremendous amount of power. Here’s a sample:
products.SelectMany(p => p.Categories, (p, c) => p.Name + “ has category “ + c.Name)
How
can we use this construct to translate multiple from clauses you might
wonder? Well, there’s no reason the function passed in as the first
argument (really the second after rewriting the extension method, i.e.
the collectionSelector) uses the TSource argument to determine the
IEnumerable<TCollection> result. For example:
products.SelectMany(p => new int[] { 1, 2, 3 }, (p, i) => p.Name + “ with irrelevant number “ + i)
will
produce a sequence of strings like “Chai with irrelevant number 1”,
“Chai with irrelevant number 2”, “Chai with irrelevant number 3”, and
similar for all subsequent products. This sample doesn’t make sense but
it illustrates that SelectMany can be used to form a Cartesian
product-like sequence. Let’s focus on our initial sample:
var a = new [] { 1, 4, 7 };
var b = new [] { 2, 5, 8 };from i in a
from j in b
select i + j;
I’ve
dropped the where clause for now to simplify things a bit. With our
knowledge of SelectMany above we can now translate the LINQ query into:
a.SelectMany(i => b, …)
This
means: for every i in a, “extract” the sequence b and feed it into ….
What’s the …’s signature? Something from a (i.e. an int) and something
from the result of the collectionSelector (i.e. an int from b), is
mapped onto some result. Well, in this case we can combine those two
values by summing them, therefore translating the select clause in one
go:
a.SelectMany(i => b, (i, j) => i + j)
What happens when we introduce a seemingly innocent where clause in between?
from i in a
from j in b
where i > j
select i + j;
The first two lines again look like:
a.SelectMany(i => b, …)
However,
going forward from there we’ll need to be able to reference i (from a)
and j (from b) in both the where and select clause that follow but both
the corresponding Where and Select methods only take in “single values”:
IEnumerable<TSource> Where<TSource>(this IEnumerable<TSource> source, Func<TSource, bool> predicate);
IEnumerable<TResult> Select<TSource, TResult>(this IEnumerable<TSource> source, Func<TSource, TResult> projection);
So what can we do to combine the value i and j into one single object? Right, use an anonymous type:
a.SelectMany(i => b, (i, j) => new { i = i, j = j })
This
produces a sequence of objects that have two public properties “i” and
“j” (since it’s anonymous we don’t care much about casing, and indeed
the type never bubbles up to the surface in the query above, because of
what follows:
a.SelectMany(i => b, (i, j) => new { i = i, j = j }).Where(anon => anon.i > anon.j).Select(anon => anon.i + anon.j)
In
other words, all references to i and j in the where and select clauses
in the original query expression have been replaced by references to
the corresponding properties in the anonymous type spawned by
SelectMany.
Lost in translation
This whole
translation of this little query above puts quite some work on the
shoulder of the compiler (assuming a and b are IEnumerable<int>
and nothing more, i.e. no IQueryable<int>):
- The lambda expression i => b captures variable b, hence a closure is needed.
- That same lambda expression acts as a parameter to SelectMany, so an anonymous method will be created inside the closure class.
- For new { i = i, j = j } an anonymous type needs to be generated.
- SelectMany’s
second argument, Where’s first argument and Select’s first argument are
all lambda expressions that generate anonymous methods as well.
As
a little hot summer evening exercise, I wrote all of this plumbing
manually to show how much code would be needed in C# 2.0 minus closures
and anonymous methods (more or less C# 1.0 plus generics). Here’s where
we start from:
class Q
{
IEnumerable<int> GetData(IEnumerable<int> a, IEnumerable<int> b)
{
return from i in a
from j in b
where i > j
select i + j;
}
}
This translates into:
class Q
{
IEnumerable<int> GetData(IEnumerable<int> a, IEnumerable<int> b)
{
Closure0 __closure = new Closure0();
__closure.b = b;return Enumerable.Select(
Enumerable.Where(
Enumerable.SelectMany(
a,
new Func<int, IEnumerable<int>>(__closure.__selectMany1),
new Func<int, int, Anon0<int, int>>(__selectMany2)
),
new Func<Anon0<int, int>, bool>(__where1)
),
new Func<Anon0<int, int>, int>(__select1)
);
}private class Closure0
{
public IEnumerable<int> b;public IEnumerable<int> __selectMany1(int i)
{
return b;
}
}private static Anon0<int, int> __selectMany2(int i, int j)
{
return new Anon0<int, int>(i, j);
}private static bool __where1(Anon0<int, int> anon)
{
return anon.i > anon.j;
}private static int __select1(Anon0<int, int> anon)
{
return anon.i + anon.j;
}
}private class Anon0<TI, TJ> //
generics allow reuse of type for all anonymous types with 2 properties,
hence the use of EqualityComparers in the implementation
{
private readonly TI _i;
private readonly TJ _j;public Anon0(TI i, TJ t2) { _i = i; _j = j; }
public TI i { get { return _i; } }
public TJ j { get { return _j; } }public override bool Equals(object o)
{
Anon0<TI, TJ> anonO = o as Anon0<TI, TJ>;
return anonO != null && EqualityComparer<TI>.Default.Equals(_i, anonO._i) && EqualityComparer<TJ>.Default.Equals(_j, anonO._j);
}public override int GetHashCode()
{
return EqualityComparer<TI>.Default.GetHashCode(_i) ^ EqualityComparer<TJ>.Default.GetHashCode(_j); // lame quick-and-dirty hash code
}public override string ToString()
{
return “( i = “ + i + “, j = ” + j + “ }”; // lame without StringBuilder
}
}
Just
a little thought… Would you like to go through this burden to write a
query? “Syntactical sugar” might have some bad connotation to some, but
it can be oh so sweet!
A word of caution
Now that you know how SelectMany works,
can you think of a possible implication when selecting from multiple
sources? Let me give you a tip: nested foreachs. This is an
uninteresting sentence that acts as a placeholder in the time space
while you’re thinking about the question. Got it? Indeed, order
matters. Writing the following two lines of code produces a different
query with a radically different execution pattern:
from i in a from j in b …
from j in b from i in a …
Those roughly correspond to:
foreach (var i in a)
foreach (var j in b)
…
versus
foreach (var j in b)
foreach (var i in a)
…
But
isn’t this much ado about nothing? No, not really. What if iterating
over b is much more costly than iterating over a? For example,
from p in localCollectionOfProducts
from c in sqlTableOfCategories
…
This
means that for every product iterated locally, we’ll reach out to the
database to iterate over the (retrieved) categories. If both were
local, there wouldn’t be a problem of course; if both were remote, the
(e.g.) SQL translation would take care of it to keep the heavy work on
the remote machine. If you want to see the difference yourself, you can
use the following simulation:
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;
using System.Threading;class Q
{
static void Main()
{
Stopwatch sw = new Stopwatch();Console.WriteLine("Slow first");
sw.Start();
foreach (var s in Perf<int,char>(Slow(), Fast()))
Console.WriteLine(s);
sw.Stop();
Console.WriteLine(sw.Elapsed);sw.Reset();
Console.WriteLine("Fast first");
sw.Start();
foreach (var s in Perf<char,int>(Fast(), Slow()))
Console.WriteLine(s);
sw.Stop();
Console.WriteLine(sw.Elapsed);
}static IEnumerable<string> Perf<S,T>(IEnumerable<S> a, IEnumerable<T> b)
{
return from i in a
from j in b
select i + "," + j;
}static IEnumerable<int> Slow()
{
Console.Write("Connecting… ");
Thread.Sleep(2000); // mimic query overhead (e.g. remote server)
Console.WriteLine("Done!");
yield return 1;
yield return 2;
yield return 3;
}static IEnumerable<char> Fast()
{
return new [] { 'a', 'b', 'c' };
}
}
This produces:
Obviously,
it might be the case you’re constructing a query that can only execute
by reaching out to the server multiple times, e.g. because order of the
result matters (see screenshot above for an illustration of the
ordering influence – but some local sorting operation might help too in
order to satisfy such a requirement) or because the second query source
depends on the first one (from i in a from j in b(i) …). There’s no
silver bullet for a solution but knowing what happens underneath the
covers certainly provides the necessary insights to come up with
scenario-specific solutions.
No related posts.
Related posts brought to you by Yet Another Related Posts Plugin.
