Euler Primes with LINQ Iterators

Thanks' to Steve Smith’s Project Euler with LINQ, I’ve recently begun playing with the Project Euler questions seeing how far I can push my algorithm skills along with LINQ and LINQPad. LINQPad makes it easy to slap together some code samples and output results, including performance metrics, so I don’t need to worry with that plumbing code and can focus on creating fast code.

While working on one of the problems I realized the solution would offer a good opportunity to demonstrate a nuance of .Net iterators. Understanding this nuance may help some of you to grok how they work inside LINQ to Objects. In this case, the problem is Euler 10: Find the sum of all primes less than 2,000,000. We can begin by putting this into a LINQ query which elegantly reflects the desired outcome:


Dim query = _
  Aggregate num In Enumerable.Range(2, 1999998) _
  Where IsPrime(num) _
  Into Sum(CLng(num))

So far, so good. We just need to identify which numbers are prime and we’re finished. Easier said than done. I started by brute forcing this, which is often not a good idea with Euler which requires that your results take less than a second to process. My first stab was to set-up a list of found prime numbers. Then as I was iterating over my source data, I would iterate over each of the previously found primes to see if this number is divisible by that number. If it is, I would return that it was not prime. Otherwise, I would add it to the prime list and return that it is a prime. Here’s some code:


Private FoundPrimes As New List(Of Long)
  Function IsPrime(ByVal num As Long) As Boolean
  For Each prime In FoundPrimes
    If num Mod prime = 0 Then
      Return False
    End If
  Next

  Return AddToPrimes(num)
End Function 

Function AddToPrimes(ByVal num) As Boolean
  FoundPrimes.Add(num)
  Return True
End Function

Ok, so this works, but is quite slow (several minutes for 2 million numbers). We can’t speed this up using the PLINQ here because we need to move through our list in order, otherwise we will be adding numbers to our prime list that aren’t actually primes.

We can easily modify this by limiting the set of primes that we test a given number against. We know from mathematical proofs that the highest number we need to test any given number to see if it is a prime, is the square root of that given number. Thus, we modify the IsPrime method as follows:


Function IsPrime(ByVal num As Long) As Boolean
    Dim top = math.sqrt(num)
    For Each prime In FoundPrimes
        If prime > top Then Return AddToPrimes(num)
        If num Mod prime = 0 Then
            Return False
        End If
    Next
    Return AddToPrimes(num)
End Function

Running this through LINQPad now returns a respectable performance of 2.003 seconds. This is pretty good, but still doesn’t fit Euler’s guideline of sub-second performance. Back to the grindstone to find a better way. The performance hog here is obviously repeated iterating over the prime list for each given number. Perhaps if we could somehow flag multiples of each prime in our list as we find a prime, we could eliminate this iteration process. Thus instead of using a list of primes, we’ll create an array of bits (boolean) the size of the range we want to test:


Const numCount As Integer = 2000000
Dim allNums(numCount - 1) As Boolean

So, how do we find the primes in this list. First, realize that the Where and Select LINQ operators have overloads that include not only the input value, but also the index for the current value. To use this, we will need to modify our query because we can’t access this index using query expressions (at least not in VB). We’ll have to change our query to something like the following:



Dim query = allNums.Where(Function(num, index) IsPrime(index)) _
                   .Select(Function(num, index) index) _
                   .Sum(Function(num) num)

This would work, but the index for our Select method is not the index of the underlying data source, but rather the index for the item returned in the Where clause. As a result, we’ll need to process this index through our Where function and expose that in a scope wide enough that we can call back into it in our Select method. Here’s our revised code to this point, including a delegate pointer to a PrimeSieve method that will do the grunt work:


Const numCount As Integer = 2000000
Dim allNums(numCount - 1) As Boolean
Dim foundPrime As Integer 

Sub Main()
    Dim query = allNums _
            .Where(AddressOf PrimeSieve) _
            .Select(Function(ignore) CLng(foundPrime)) _
            .Sum(Function(num) num)
    Console.WriteLine(query)
End Sub

One thing to point out here is that we could have eliminated the Select clause if we were dealing with smaller numbers, but we need to widen our results to a long because the sum will overflow an integer type otherwise.

Now, on to our revised Prime algorithm. In this one, we pass in the current bit which we ignore and the index. The definition of Where requires that this returns a Boolean, so we’ll return false if this number has already been marked as a prime multiple. Otherwise, we’ll mark all multiples of this number as no longer prime and return true:


Private Function PrimeSieve(ByVal num1 As Boolean, ByVal index As Integer) As Boolean
    If index < 1 Then Return False
    If allNums(index) Then Return False 

    foundPrime = index + 1
    For num = index To (numCount - 1) Step foundPrime
        allNums(num) = True
    Next 

    Return True
End Function

At this point, you may be questioning if the underlying query will work if we keep referring back to a single FoundPrime variable. Here is where understanding iterators becomes important. Let’s begin by considering the basic definition of Where and Select (We’ll have to use C# here because VB doesn’t have a syntax for iterators yet):


static IEnumerable<t> Where<t>(IEnumerable<t> source, Func<t , bool> predicate) {
    foreach (T element in source) {
        if (predicate(element)) yield return element;
    }
}

static IEnumerable<s> Select<t , S>(IEnumerable<t> source, Func<t , S> selector) {
    foreach (T element in source) {
        yield return selector(element);
    }
}

What these are basically saying is: as some outside method calls us to move through our list, return values that meet the appropriate results. We are NOT doing a full iteration over our list in the Where method and returning the results to the next method (Select). If we did, select would indeed be using the same foundPrime value. Instead, we start moving through the results getting the first value that meets our criteria and passing control of our application on to the Select method. Select operates over it and passes control on to the next level - Sum. Sum then aggregates the result and pulls the next value returning control to the Where clause to find the next matching result.

Let’s step through the process with the first several numbers. The operation in our query doesn’t start until we actually start walking through the results. Thus Sum causes us to fetch the first number in our list. Where passes index 0 to the predicate PrimeSieve which returns false because it is under 1 (0 and 1 are not considered primes). Where continues to the next bit (index 1).

Since 2 is a prime, we then mark all multiples of 2 (4, 6, 8, 10, etc) true and return true. Because the predicate evaluated true, we yield that on to the Select method which pulls the foundPrime value and passes that on to Sum.

Sum then asks for the next number (3). Now, we re-enter the Where clause after the yield (internally using goto) and continue the iteration. We now do a PrimeSieve for index 2. This bit is still false, so we mark all multiples (6,9,12) as true. Of course 1/2 of these values are already true. I suspect that checking the bits before setting them would only slow our operation down, so I just go ahead and set it. We now pass operation on to select which pulls the new foundPrime and passes that value on to Sum to aggregate.

In the next iteration, we find that allNums(3) (the fourth number) is already marked true, thus we return false and to where and continue on to index 4 which is not yet marked true because this is the prime value 5. Rinse and repeat and we can efficiently evaluate as many numbers as we need.

So after all of this, what’s the performance difference?

Test Range	Speed with brute force	Speed with Sieve	% Improvement
1000	.004	.002	200%
10000	.007	.005	140%
1,000,000	.055	.009	611%
10,000,000	.0829	.080	1036%
100,000,000	15.439	1.189	1298%
1,000,000,000	5699.4	13.078	43580%

A couple things to mention here:

When dealing with small sets of iterations the amount of improvement is relatively small. This points to the reminder that you shouldn’t over optimize code if you don’t need to. The benefits as we increase the iterations becomes dramatic.
The order in which the results are pulled is important. Thus, you can’t parallelize this algorithm (using PLINQ). In this example, the prime check has side effects of changing the underlying data source and setting an external property.
This version relies on the way iterators work. You would not be able to substitute an observer pattern (like in LINQ to Events and the Reactive Framework). That is a push model rather than a pull model. As a result, it could be possible that you are processing the number 4 before it has been marked as not prime and your results would thus be invalid.

Posted on 9/12/2009 4:26:00 PM - Comment
Categories: LINQ - VB Dev Center -

Thoughts about LINQ
served up with LINQ

Euler Primes with LINQ Iterators by ThinqLinq

Euler Primes with LINQ Iterators