joins in linq
Consider this query which implements a join.
var q = from c in Customers, o in Orders
where c.CustomerId == o.CustomerId
group o by new { o.CustomerId, c.CompanyName } into g
select new { g.Key.CustomerId, g.Key.CompanyName, NumOrders = g.Group.Count() }
As far as I can tell, this implements a nested-loops join, which has O(N^2) performance.
Is there a recommended way to do a more efficient join in LINQ?
A buddy and I were asking the same thing. The only thing we could come up with was that LINQ does't inherently know about indexes (to our knowledge). You'd need to create some sort of indexable data store, and probably override the Where() clause to rewrite its expression to make use of the indexer when it made sense.
My sense is that this would be a non-trivial task, but I think a valuable one.
Well, the GroupBy function knows how to index incoming data, so its not impossible. Its just not directly supported by the LINQ syntax.
Heres a stab at a Join() function:
| |
public static IEnumerable<Pair<T1,T2>> Join<T1,T2>(this IEnumerable<T1> source1, IEnumerable<T2> source2, Func<T1,K> selector1, Func<T2,K> selector2, IEqualityComparer<K> comparer) { Dictionary<K, List<T1>> dict = new Dictionary<K, List<T1>>(comparer); foreach (T1 element in source1) { K key = selector1(element); List<T1> list; if (!dict.TryGetValue(key, out list)) dict[key] = list = new List<T1>(); list.Add(element); } foreach (T2 element2 in source2) { List<T1> list = dict[selector2(element2)]; if (list != null) foreach (T1 element1 in list) yield return new Pair<T1,T2>(element1, element2); } } |
called as:
IEnumerable<Customer> custsomers = GetCustomers();
IEnumerable<Orders> orders = GetOrders();
var join = customers.Join(orders, c => c.CustomerId, o => o.CustomerId, EqualityComparer<IdType>.Default);
.. right?
I like it. Now a Where() could inspect the expression trees and replace the == operation with Join(). Similar tricks would work for < and >, etc. The only overhead is in the Dictionary, which is the price to pay for indexing.
Right. My point is that if you wrote a proxy to the build-in Where, you could include the indexed Join behavior in the query syntax (non-dotted).
Actually, that probably wouldn't be difficult to pull off at all; I suspect everything's on the Sequence class.
Have you been able to test relative performance?
Okay.. looking at Sequence.Where.. It takes a Func, not an Expression. So it may be more difficult.
Unfortunately for me, LINQ doesn't work (yet) on 64-bit.
Can you run in 32bit mode?
I dont think its too much of a problem to inject a Farmer.Sequence class somewhere. You have the source code for System.Sequence, just copy and paste, modify to suit, and bango - there you are.
Not sure how name resolution happens when two namespaces are imported that offer the same Where() function, or what happens if one takes a Func<> and the other takes an Expression<>.
I imagine the precedence comes from the order in which they are imported.
Can we consider more general Join - e.g.
from co in customers.Join(orders, (c,o) = > c.CustomerId==o.CustomerId)
select new {co.CustomerId}