How To Query Data with Parallel LINQ

This post shows a simple way to write code that takes advantage of
multiple processors. You will see that LINQ queries can allow you to
side step the difficult tasks normally involved in writing
multi-threaded code. To get started, all you need is a little basic
knowledge of how to write simple LINQ queries.

The code shown
in this post uses a pre-release version of PLINQ called the Microsoft
Parallel Extensions to .NET Framework 3.5. When PLINQ finally ships, it
will run only on .NET 4.0 or later. The version I'm using that runs on
top of 3.5 is for evaluation purposes only. There will never be a
shipping version that runs on .NET 3.5.

This LINQ provider is
being created at Microsoft by the Parallel Computing team; it is not
the work of the C# team that created LINQ to Objects and LINQ to SQL.
Here is the website for the Parallel Computing team:

http://msdn.microsoft.com/en-us/concurrency/

At
the time of this writing, these extensions were available only in
pre-release form. You could download them either as Visual Studio 2008
compatible extensions to .NET 3.5, or as part of the pre-release
version of Visual Studio 2010. Since the download sites might change
over the coming months, I suggest that you find these resources by
going to the Parallel Computing site, or to the Visual Studio site:

http://msdn.microsoft.com/en-us/vs2008

Parallel
LINQ, or PLINQ, is only a small part of the Parallel Extensions to the
.NET Framework. It is, however, an important part. Since it is a simple
and natural extension of the LINQ syntax, I think developers familiar
with that technology will find it easy to use.

Consider this code:

var list = Enumerable.Range(1, 10000);
var q = from x in list.AsParallel()
where x < 3300
select x;
foreach (var x in q)
{
Console.WriteLine(x);
}

These lines look nearly identical to the code you have seen in many
simple LINQ samples. The only significant difference is the call to AsParallel
at the end of the first line. Though we have often used type inference
to hide the return type of a LINQ query, I'm going to pause and take a
second look at this instance. Rather than returning
IEnumerable<T>, this version of PLINQ returns IParallelEnumerable<int>:

IParallelEnumerable<int> q = from x in list.AsParallel() etc….

In the near future, PLINQ queries of this type will probably return ParallelQuery<int>.
Because this product is still evolving, it might be simplest to use
var, at least during the pre-release phase, and let the compiler choose
the type. That way you can save typing, avoid problems with anonymous
types, and you need not concern yourself about changes in the API as
the product develops. It is almost always appropriate to use var to designate the return type of a LINQ query, and there are only special circumstances when you would do otherwise.

Here are the results from this first PLINQ query:

2
1
3
4
6
512
5
7
513
8
12
514
9
13
515
10
14
516
11
15
517
16
72
518
17

The numbers shown here are in a relatively random order because they
are being returned from different threads. It is important to remember
that the sequence of values returned by LINQ is not always guaranteed
to be presented in a particular order. If Order is important in your
code, you can add a call to AsOrdered to the query after the call to AsParallel. Alternatively, you could insert a GroupBy
clause to establish the desired ordering. Otherwise developers should
assume that the ordering from a PLINQ query will be entirely random

Now that you understand the basics of Parallel LINQ, let’s
move on to look at a more interesting example. Improved performance is
the main reason to write code that can run in parallel. The program
shown in this post uses a timer to demonstrate how PLINQ can improve
performance in a program.

Performance improvements become more evident when our code has
access to more processors. The code I show here runs faster on a two
processor machine, but it really starts to come into its own on a four
processor machine. Moving up to even more processors yields more
powerful results. Here, for instance, are the results showing an
improvement of 1.33 times when using two processors, and almost two
times when using 4 processors:

2 Processors = 1.44 x improvement:
Linear: 00:00:13.15
Parallels: 00:00:09.10
4 Processors = 1.96 x improvement:
Linear: 00:00:15.00
Parallel: 00:00:07.68

These tests are being running against pre-release software, so these
numbers are almost certain to change before release, and of course
different machines will yield different results. Furthermore, the
degree of improvement that you see is likely to change depending on the
type of algorithm you run, the number of cores on your machine, the
architecture of the machine, how many caches there are and how they’re
laid out, etc. Though it is rare, some queries show superlinear
performance enhancements. In other words, there is a greater than 4x
speedup on a 4-core box. An improvement of 2 times, such as the one
shown, or even a 3 time improvement, is common.

This sample program is called FakeWeatherData, and it is available for download from the LINQ Farm on Code Gallery.
It features a simple LINQ to XML query run against a file with 10,000
records in it. The data I'm querying is not real, but consists of
random dates and temperatures generated by a simple algorithm included
in the FakeWeatherData program.

The XML file is structured like this:

<?xml version="1.0" encoding="utf-8" ?>
<Samples>
<Sample>
<Year>1973</Year>
<Month>May</Month>
<Day>15</Day>
<Temperature>10</Temperature>
</Sample>
<Sample>
<Year>1970</Year>
<Month>Feb</Month>
<Day>10</Day>
<Temperature>14</Temperature>
</Sample>
<Sample>
<Year>1970</Year>
<Month>Jan</Month>
<Day>15</Day>
<Temperature>11</Temperature>
</Sample>
  ... Many lines of code omitted here
</Samples>

There is also a simple C# class used by the program to encapsulate the data from the XML file:

class WeatherData
{
public string Year { get; set; }
public string Month { get; set; }
public string Day { get; set; }
public string Temperature { get; set; }
}

The parallel version of the query in the program looks like this:

for (int i = 0; i < NUM_REPS; i++)
{
var list = (from x in doc.Root.Elements("Sample").AsParallel()
where x.Element("Year").Value == "1973" &&
x.Element("Month").Value == "Apr" &&
x.Element("Day").Value == "15"
                select new WeatherData
                {
Day = x.Element("Day").Value,
Month = x.Element("Month").Value,
Temperature = x.Element("Temperature").Value,
Year = x.Element("Year").Value
}).ToList();
}

Accompanying this code is a similar LINQ query that does not use PLINQ

for (int i = 0; i < NUM_REPS; i++)
{
var list = (from x in doc.Root.Elements("Sample")
where x.Element("Year").Value == "1973" &&
x.Element("Month").Value == "Apr" &&
x.Element("Day").Value == "15"
                select new WeatherData
                {
Day = x.Element("Day").Value,
Month = x.Element("Month").Value,
Temperature = x.Element("Temperature").Value,
Year = x.Element("Year").Value
}).ToList();
}

The program queries the data in the XML file first using the
Parallel code, then using standard LINQ. By comparing the time it takes
each block of code to execute you can get a sense of the relative
improvement available through PLINQ. I'll show you how to make such
comparisons in just a moment. I will also discuss some tools that will
become available to help profile code of this type.

You can see that the PLINQ query contains a call to AsParallel,
while the other query does not. Other than that the two queries are
identical. The fact that the two queries look so much alike points to a
primary strength of PLINQ: very little specialized knowledge is
necessary in order to begin using it. This does not mean that the
subject is trivial, but only that the barrier to entry is low. This is
not the case with most concurrent programming models.

LINQ queries are designed to be read-only, working with
immutable data. This is a good model for parallelism, because it makes
it unlikely that data will mutate, thereby setting up the potential for
a race condition. You should note, however, that PLINQ does nothing to
prevent this from happening, it is simply that LINQ is designed to make
it unlikely.

Note also that the declarative LINQ programming style ensures
that developers specify what they want done, rather than how it should
be done. This leaves PLINQ free to ensure that concurrent LINQ queries
run in the safest manner possible. If LINQ had been defined more
strictly, such that it had to process each element in a certain order,
then the PLINQ team would have had a much more difficult task.

The code in both these queries pulls out only the records from
the XML file that have their date set to April 15, 1973. Because of
deferred execution, the query would not do anything if I did not call
ToList(). As a result, I added that call and converted the result into
a List<WeatherData>. Though hardly earthshaking in import, these
calls ensure that the code actually does something, and thus gives
PLINQ scope to take advantage of the multiple processers on your
system.

Simple timers are created to measure the difference between
the standard LINQ query and the PLINQ query. I've also used a method
used in many of Parallel LINQ team's samples for displaying the time
elapsed during a test run:

private static void RunTest()
{
XDocument doc = XDocument.Load("XMLFile1.xml");
Stopwatch sw = new Stopwatch();
sw.Start();
LinqOrdinarie(doc);
sw.Stop();
ShowElapsedTime("Ordinaire", sw.Elapsed);
sw.Reset();
sw.Start();
ParallelLinq(doc);
sw.Stop();
ShowElapsedTime("Parallels", sw.Elapsed);
}
private static TimeSpan ShowElapsedTime(string caption, TimeSpan ts)
{
string elapsedTime = String.Format("{0}: {1:00}:{2:00}:{3:00}.{4:00}",
caption, ts.Hours, ts.Minutes, ts.Seconds,
ts.Milliseconds / 10);
Console.WriteLine(elapsedTime, "RunTime");
return ts;
}

At least with the pre-release version of PLINQ that I've played
with, I've found it very useful to set up timers to confirm that PLINQ
is actually able to speed up an operation. My record at guessing which
code will benefit from running in parallel is not good, and so I find
that confirming the effectiveness of the code by explicitly measuring
it is worthwhile. You can either use the simple StopWatch class from
the System.Diagnostics namespace, as shown here, or else you can use a
profiler. Note that a thread aware profiler might ship with some
versions of Visual Studio 2010.

I've found that the advantages of concurrent LINQ become more
obvious the longer the operation I'm timing lasts. As a result, I've
placed the query inside a loop, and added a variable to the program
called NUM_REPS. By setting NUM_REPS to a large number, say 500, you
can clearly see the benefits that can be accrued when you run LINQ
queries in parallel on multiple processors. Note that the first time
PLINQ is used, its assembly will need to be loaded, the relevant types
will need to be JIT compiled, and new threads will need to be spun up,
etc. As a result, many developers see improved performance after they
get past the initial warm-up time.

Though it is very easy to get started with PLINQ, there are
still complexities inherent in the subject that you need to consider.
For instance, PLINQ will sometimes develop a different partitioning
scheme for your data depending on whether you are working with an
Enumerable or an Array. To learn more about this subject, see the
following post from the Parallel Programming team:

http://blogs.msdn.com/pfxteam/archive/2007/12/02/6558579.aspx

The simple PLINQ examples shown in this post should help you get
started with this powerful and interesting technology. Parallel LINQ is
still in its infancy, but already it provides means of greatly
simplifying tasks that are not normally easy to perform.

GD Star Rating
loading...
GD Star Rating
loading...
How To Query Data with Parallel LINQ, 5.0 out of 5 based on 1 rating
  • Share/Bookmark

No related posts.

Related posts brought to you by Yet Another Related Posts Plugin.

Leave a Reply