joins in linq

Consider this query which implements a join.
var q = from c in Customers, o in Orders
where c.CustomerId == o.CustomerId
group o by new { o.CustomerId, c.CompanyName } into g
select new { g.Key.CustomerId, g.Key.CompanyName, NumOrders = g.Group.Count() }
As far as I can tell, this implements a nested-loops join, which has O(N^2) performance.
Is there a recommended way to do a more efficient join in LINQ?

[442 byte] By [damienmorton] at [2008-2-7]
# 1
A buddy and I were asking the same thing. The only thing we could come up with was that LINQ does't inherently know about indexes (to our knowledge). You'd need to create some sort of indexable data store, and probably override the Where() clause to rewrite its expression to make use of the indexer when it made sense.
My sense is that this would be a non-trivial task, but I think a valuable one.
KeithFarmer at 2007-9-9 > top of Msdn Tech,Visual Studio Orcas,LINQ Project General...
# 2
Well, the GroupBy function knows how to index incoming data, so its not impossible. Its just not directly supported by the LINQ syntax.
Heres a stab at a Join() function:


public static IEnumerable<Pair<T1,T2>> Join<T1,T2>(this IEnumerable<T1> source1, IEnumerable<T2> source2, Func<T1,K> selector1, Func<T2,K> selector2, IEqualityComparer<K> comparer)
{
Dictionary<K, List<T1>> dict = new Dictionary<K, List<T1>>(comparer);
foreach (T1 element in source1)
{
K key = selector1(element);
List<T1> list;
if (!dict.TryGetValue(key, out list))
dict[key] = list = new List<T1>();
list.Add(element);
}
foreach (T2 element2 in source2)
{
List<T1> list = dict[selector2(element2)];
if (list != null)
foreach (T1 element1 in list)
yield return new Pair<T1,T2>(element1, element2);
}
}


damienmorton at 2007-9-9 > top of Msdn Tech,Visual Studio Orcas,LINQ Project General...
# 3
called as:

IEnumerable<Customer> custsomers = GetCustomers();
IEnumerable<Orders> orders = GetOrders();

var join = customers.Join(orders, c => c.CustomerId, o => o.CustomerId, EqualityComparer<IdType>.Default);

.. right?

I like it. Now a Where() could inspect the expression trees and replace the == operation with Join(). Similar tricks would work for < and >, etc. The only overhead is in the Dictionary, which is the price to pay for indexing.

KeithFarmer at 2007-9-9 > top of Msdn Tech,Visual Studio Orcas,LINQ Project General...
# 4
You can use it directly in a from.where.select expression:


from co in customers.class
="txt4">Join(orders, c => c.CustomerId, o => o.CustomerId, EqualityComparer<IdType>.Default)
select new { co.CustomerId }


But yeah, ideally youd operate directly on expressions, re-arranging them as necessary, sorting and creating indices as required. One could imagine quite a sophisticated little in-memory database.
damienmorton at 2007-9-9 > top of Msdn Tech,Visual Studio Orcas,LINQ Project General...
# 5
Right. My point is that if you wrote a proxy to the build-in Where, you could include the indexed Join behavior in the query syntax (non-dotted).

Actually, that probably wouldn't be difficult to pull off at all; I suspect everything's on the Sequence class.

Have you been able to test relative performance?

KeithFarmer at 2007-9-9 > top of Msdn Tech,Visual Studio Orcas,LINQ Project General...
# 6

Okay.. looking at Sequence.Where.. It takes a Func, not an Expression. So it may be more difficult.

Unfortunately for me, LINQ doesn't work (yet) on 64-bit.

KeithFarmer at 2007-9-9 > top of Msdn Tech,Visual Studio Orcas,LINQ Project General...
# 7
Can you run in 32bit mode?
I dont think its too much of a problem to inject a Farmer.Sequence class somewhere. You have the source code for System.Sequence, just copy and paste, modify to suit, and bango - there you are.
Not sure how name resolution happens when two namespaces are imported that offer the same Where() function, or what happens if one takes a Func<> and the other takes an Expression<>.
I imagine the precedence comes from the order in which they are imported.
damienmorton at 2007-9-9 > top of Msdn Tech,Visual Studio Orcas,LINQ Project General...
# 8
64-bit Framework .. I don't think WOW would affect it.
KeithFarmer at 2007-9-9 > top of Msdn Tech,Visual Studio Orcas,LINQ Project General...
# 9
We are considering more explicit join operators like the one damien describes.
MattWarren at 2007-9-9 > top of Msdn Tech,Visual Studio Orcas,LINQ Project General...
# 10

Aside from speeding up joins, having operators than can make use of pre-existing indexes would be useful for Where operators.



class House
{
public string Address;

public string City;

public Color HouseColor;

public int OccupantCount;
}

List<House> houses = new List<House>();

// Create some indices.
// This can be expensive, so make it an explicit user operation.
// I can see "index on <property list>" operator for IEnumerable
// and for projection.
houses.CreateIndex("City");
houses.CreateIndex("OccupantCount");

// Adding a house will also maintain indices.
houses.Add(GetHouses());

// I don't how to deal with modifications to house
// without an interceptor or compiler magic.

var nonRedHouses =
from house in houses
where house.City == "El Dorado" // uses index
and house.Color != Color.Red // could use index, if index were created
and
(
house.OccupantCount == 2 // uses index
or house.OccupantCount > 5 // uses index
)
select house index on Color; // automatically populate a Color index on result

var blueHouses =
from house in nonRedHouses index on OccupantCount // create index
where house.OccupantCount <= 10 // use index
and house.Color == Color.Blue // use index
select house; // no indices created on result

KeithFarmer at 2007-9-9 > top of Msdn Tech,Visual Studio Orcas,LINQ Project General...
# 11

Can we consider more general Join - e.g.

from co in customers.Join(orders, (c,o) = > c.CustomerId==o.CustomerId)
select new {co.CustomerId}

Franji at 2007-9-9 > top of Msdn Tech,Visual Studio Orcas,LINQ Project General...

Visual Studio Orcas

Site Classified