VERY slow binary serialization of strings coming from SQL

Please help!

This is probably the weirdest issue I've ever encountered.

I'm trying, in C#, to binary serialize an array of strings that come from a SQL Server 2000 database. Problem is, this takes orders of magnitude longer than if the strings don't come from the database.

The EXACT same 35000-length array of strings takes 0.1 seconds to serialize if the strings don't come from the database, and 40(!) seconds to serialize if they do.

So for instance:

...
SqlConnection conn = new SqlConnection("...");

// In my database, this will return 34769 rows
SqlCommand comm = new SqlCommand(
    "SELECT 'This is just a test' as TEST"
    + " FROM Board_Layout_Point WHERE Board_LO_SID = 121", conn);

conn.Open();
SqlDataReader reader = comm.ExecuteReader();
string[] arr = new string[34769];
int index = 0;
while (reader.Read())
{
    string val = reader.GetString(0);
    arr[index] = val;
    index++;
}
conn.Close();
MemoryStream memStream = new MemoryStream();
BinaryFormatter frmt = new BinaryFormatter();
DateTime start = DateTime.Now;
frmt.Serialize(memStream, arr);
Console.WriteLine("That took " + (DateTime.Now - start).ToString());
...

That takes 40 seconds!

Now, if I keep everything EXACTLY the same, including all the database work, but instead of passing the value from the reader, I use the string "This is just a test" that I supply in the code (as a literal or constructed somewhere else), the serialization takes 0.1 seconds!

Does anyone have any idea of what could be going on?

-Daniel

[1661 byte] By [dksimon] at [2008-2-25]
# 1
The main difference between the two techniques that I see is how many strings you are using. In the case of directly assigning "This is just a test" repeatedly to the string[], you have 34769 references to one string on the managed heap. In the case of the database, you have 34769 references to 34769 strings. With a single string, I'm guessing that it would likely stay in your processor's cache during the serialization algorithm. With 34769 strings, you're going to be incurring a lot of memory overhead. Take a look at your application with the CLRProfiler to see the differences in your memory between the two test scenarios.
JamesKovacs at 2007-9-9 > top of Msdn Tech,.NET Development,.NET Remoting and Runtime Serialization...
# 2
Hmm...

It appears you're probably right. After your message, I tried to make sure I was creating new objects every time with:

StringBuilder sb = new StringBuilder();
sb.Append("This is just ");
sb.Append("a test");
arr[index] = sb.ToString();

... and that brought me back to the 40 second time frame.

Do you know of any way to improve performance for the serializer in this situation (i.e. when I'm working with many different references to strings of equal value)? I feel like 40 seconds is a giant amount of time for 35000 rows. Am I mistaken?

dksimon at 2007-9-9
# 3
And, in fact, I don't think that just the 35000 strings is the problem, because when I do this:

Random rnd = new Random();

...35000 times the following...
arr[index] = rnd.Next().ToString();
index++;

...

I'm back to 0.15 seconds again (which is really the performance I would expect), even though when I examine the array I can clearly see 35000 distinct strings.

Something weird is still going on here.

dksimon at 2007-9-9
# 4

So you got me thinking some more, and now it seems that the problem occurs specifically when I have this situation:

1) I'm serializing many strings
2) The strings are not equal by reference
3) The strings are equal by value

This third point is important, because I've now done some more tests and verified that when the strings have different values, the serialization is very, very fast.
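The three conditions above can be reproduced without a database. The following is a minimal sketch (the class and method names are made up for illustration): it builds the same text twice at run time, so the two strings are equal by value but are separate objects on the heap.

```csharp
using System;
using System.Text;

// Hypothetical demo class, for illustration only.
public static class StringIdentityDemo
{
    // Builds "This is just a test" at run time, so the result is a fresh
    // string instance rather than the shared compile-time literal.
    public static string BuildCopy()
    {
        StringBuilder sb = new StringBuilder("This is just ");
        sb.Append("a test");
        return sb.ToString();
    }

    public static void Main()
    {
        string a = BuildCopy();
        string b = BuildCopy();

        // Value comparison: the characters are identical.
        Console.WriteLine(a == b);

        // Reference comparison: two distinct objects.
        Console.WriteLine(object.ReferenceEquals(a, b));
    }
}
```

This prints True and then False, which is the "equal by value, not equal by reference" combination that triggers the slowdown described here.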

This situation may seem silly, but it's my real-world problem. I originally isolated this issue because I was returning large arrays of business objects over remoting calls, and though the objects are different, a lot of them will have the same value for some particular property (to give a similar situation, imagine retrieving 1000s of people with their Title property set to "Mr.").

When I remove the offending property, and all the rest of the objects' properties are different, the call takes 10 seconds, and when I add it back, the call takes 2 minutes.

Any ideas?

dksimon at 2007-9-9
# 5

I'm going to have to look at the problem a bit more, but if you explicitly intern the duplicate strings, the performance of binary serialization improves dramatically. It appears that interning can take a while, though, if the strings aren't duplicates.

string val = string.Intern(reader.GetString(0));

Interning basically looks up a string in a string pool and returns a common reference (or adds a new entry to the pool if one doesn't exist yet).
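As a small sketch of that lookup behaviour (the class and method names are hypothetical), two separately built copies of the same text come back from string.Intern as one shared reference:

```csharp
using System;
using System.Text;

// Hypothetical demo class, for illustration only.
public static class InternDemo
{
    // Returns a run-time-built string so it is not the interned literal.
    public static string BuildCopy(string head, string tail)
    {
        return new StringBuilder(head).Append(tail).ToString();
    }

    // Interns two separately built copies and reports whether the
    // pool handed back the same reference for both.
    public static bool InternedCopiesShareReference()
    {
        string a = string.Intern(BuildCopy("This is just ", "a test"));
        string b = string.Intern(BuildCopy("This is just ", "a test"));
        return object.ReferenceEquals(a, b);
    }

    public static void Main()
    {
        Console.WriteLine(InternedCopiesShareReference());
    }
}
```

This prints True: after interning, the array would hold many references to a single string instance instead of many equal copies.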

I'm not sure what the binary serializer is doing when the strings are identical which causes such a big perf difference. I'll let you know if I find anything.

JamesKovacs at 2007-9-9
# 6

The BinaryFormatter tracks references.

In order to do so, we need to do a lookup in an id table for each instance of a reference type -- which string is.

There is a whole host of things that can influence the performance of this process, including interning.

That said, why don't you just pass the datareader around -- rather than buffering the strings in the first place?

douglasp at 2007-9-9
# 7
I'm still wondering where the quirky behaviour from the BinaryFormatter in v1.1 is coming from. For instance, try running this code in v1.1:

string[] arr = new string[100000];
for (int index = 0; index < arr.Length; index++)
{
    // Use StringBuilder to ensure that the strings are identical, but aren't interned
    StringBuilder sb = new StringBuilder("This is just");
    sb.Append(" a test");
    //sb.Append(index);
    arr[index] = sb.ToString();
}
MemoryStream memStream = new MemoryStream();
BinaryFormatter frmt = new BinaryFormatter();
DateTime start = DateTime.Now;
frmt.Serialize(memStream, arr);
DateTime end = DateTime.Now;
memStream.Close();
Console.WriteLine("That took " + (end - start).ToString());

Time to complete on my computer: 1 min 37 sec

Now uncomment the sb.Append(index) line. The strings are a bit longer and now definitely unique. You would think it would take longer, if only by a fraction. But it takes 0.25 seconds. That's right - 1/4 of a second. Why the huge difference? Also, if you intern the identical strings, and hence serialize a single instance, the time is roughly 0.25 seconds. Lastly, if you run the same tests under Whidbey, the identical strings (interned or not) and the unique strings take roughly the same time to serialize: less than a second. So something was fixed. The question is what.

JamesKovacs at 2007-9-9
# 8

Passing the datareader around really won't work in my architecture. I'm not even using datareaders - I just used it here to illustrate the problem. I'm trying to return arrays of business objects over a remoting boundary, and decouple the client side from the database schema.

And frankly, if it weren't for this problem, it would work. What I don't understand in your explanation is that the string being a reference type does not explain why:

if the references are different and the values are different, or if the references are the same and the value is the same, serialization is very fast, but -

if the references are different and the values are the same, serialization is VERY slow.

dksimon at 2007-9-9
# 9
Yes, James, that's exactly the behavior I'm seeing. Once you append the index to the string, the values become different, and when the values are different, the problem seems to go away.

I think you're right - this is definitely a bug in the v1.1 BinaryFormatter. Microsoft seems to know where the trouble is - is there a process for trying to get a hotfix out on something like this?

In the meantime, I'm going to try to see how much of a performance hit interning will cause, but it must be faster than this serialization lag.

dksimon at 2007-9-9
# 10
I searched for a hotfix in the Microsoft Knowledge Base, but didn't turn up anything unfortunately.

I thought about the interning idea more and you might run into memory problems if you intern all strings. I don't see that interned strings ever get cleaned up. So you're basically keeping a pointer to every unique string you ever see. This could be problematic for a long-running process. I would consider implementing your own string cache (using System.WeakReference) or only interning those string fields that you know have a high probability of duplication. (Intern with Mr./Ms./Dr., but not for first/last name.)
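A rough sketch of what such a WeakReference-based string cache could look like follows. Everything here (the WeakStringCache class, the GetOrAdd method) is hypothetical and written in later C# with generics, not v1.1. The cache is keyed by hash code rather than by the string itself, because using the string as a dictionary key would hold a strong reference and defeat the purpose.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not part of the .NET libraries.
public sealed class WeakStringCache
{
    // Buckets keyed by hash code. Entries hold only weak references,
    // so the cache by itself never keeps a string alive.
    private readonly Dictionary<int, List<WeakReference>> buckets =
        new Dictionary<int, List<WeakReference>>();

    // Returns a previously seen live instance with the same value,
    // or remembers and returns the instance passed in.
    public string GetOrAdd(string value)
    {
        if (value == null) throw new ArgumentNullException("value");

        int hash = value.GetHashCode();
        List<WeakReference> bucket;
        if (!buckets.TryGetValue(hash, out bucket))
        {
            bucket = new List<WeakReference>();
            buckets[hash] = bucket;
        }

        for (int i = bucket.Count - 1; i >= 0; i--)
        {
            string cached = bucket[i].Target as string;
            if (cached == null)
                bucket.RemoveAt(i);      // target was collected; drop the dead entry
            else if (cached == value)
                return cached;           // equal value; reuse this instance
        }

        bucket.Add(new WeakReference(value));
        return value;
    }

    public static void Main()
    {
        WeakStringCache cache = new WeakStringCache();
        string a = cache.GetOrAdd(new string('x', 3));
        string b = cache.GetOrAdd(new string('x', 3));
        Console.WriteLine(object.ReferenceEquals(a, b));
    }
}
```

This prints True, since the second call returns the instance stored by the first. The sketch is not thread-safe and never removes empty buckets, so a long-running process would still need locking and an occasional sweep.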

Regarding my explanation, I was trying to point out that the sheer number of strings isn't the problematic factor. If the strings are unique, performance is great. If the strings are identical because the string variables all reference the same underlying string, again no problem. It's only when you have multiple copies of identical strings that the problems manifest. It's not an explanation, but an observation of the conditions under which this bug occurs.

JamesKovacs at 2007-9-9
# 11
Sorry, James - I was directing my confusion at douglasp's explanation. Sorry for the ambiguity. I appreciate the time you're putting into this.

That's a very good point about garbage collecting interned strings. That's a real shame, because interning all the strings was really a giant performance improvement. I thought I had a solution here. This is turning into a very thorny problem.

Is there nothing else I can do at this point? Can Microsoft provide a hotfix even if one does not currently exist?

I like your suggestion of creating my own string caching mechanism, because it would be difficult to identify the commonly repeated strings I could intern. So, some questions about that:

How could I perform my own string caching prior to serialization? What would be best for performance if dealing with tens of thousands of strings? Could I just do it using System.Collections.Specialized.StringCollection and IndexOf? How could I do it using WeakReference, as you recommend?

Thanks so much for all your help on this.

dksimon at 2007-9-9
# 12
Something else I should have mentioned is that the same problem manifests itself with the SoapFormatter too. The XmlSerializer does not have this problem.
JamesKovacs at 2007-9-9
# 13

Will this work for caching strings for this purpose? Assume here that the Cache parameter is a hashtable created at the start of processing a large number of strings, and that the reference to it is lost after the serialization. From a SingleCall remoting object hosted out of IIS, will this get properly cleaned up? The performance of this method seems to be good. Most importantly, does using String.GetHashCode() guarantee that equal codes mean equal strings? I know String overrides GetHashCode, I just don't know the particulars.


private static string GetCachedString(Hashtable Cache, string ToCache)
{
    int code = ToCache.GetHashCode();
    object cached = Cache[code];
    if (cached != null)
    {
        return (string)cached;
    }
    else
    {
        Cache[code] = ToCache;
        return ToCache;
    }
}


Does this approach seem workable?

dksimon at 2007-9-9
# 14
That sounds workable. As long as you're creating the Hashtable on each call and releasing it after processing completes (i.e. you're not squirreling it away in a static variable or singleton), you should be all right.

WRT GetHashCode() - it doesn't guarantee uniqueness; it just provides a hash bucket to drop the strings into, so two different strings can share the same hash code. I would modify the code like this:

private static string GetCachedString(Hashtable Cache, string toCache)
{
    object cached = Cache[toCache];
    if (cached != null)
    {
        return (string)cached;
    }
    else
    {
        Cache[toCache] = toCache;
        return toCache;
    }
}

Internally, Hashtable will call string.GetHashCode() to place it in the correct bucket.
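To show how a per-call cache like this might be applied to a whole array before it is handed to the serializer, here is a minimal sketch. The StringDeduper class and its Deduplicate method are hypothetical, and it uses the generic Dictionary from later framework versions in place of Hashtable.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, for illustration only.
public static class StringDeduper
{
    // Replaces each element with the first instance seen that has the
    // same value, so equal strings end up sharing a single reference.
    public static void Deduplicate(string[] values)
    {
        // Local cache: it becomes unreachable as soon as this method returns.
        Dictionary<string, string> cache = new Dictionary<string, string>();
        for (int i = 0; i < values.Length; i++)
        {
            string s = values[i];
            if (s == null) continue;

            string canonical;
            if (cache.TryGetValue(s, out canonical))
                values[i] = canonical;   // reuse the earlier instance
            else
                cache[s] = s;            // first occurrence becomes canonical
        }
    }

    public static void Main()
    {
        string[] arr = new string[3];
        for (int i = 0; i < arr.Length; i++)
            arr[i] = new string('a', 4); // three distinct "aaaa" instances

        Deduplicate(arr);
        Console.WriteLine(object.ReferenceEquals(arr[0], arr[2]));
    }
}
```

This prints True. Because the dictionary is a local variable, nothing outlives the call, which matches the "create it per call and release it afterwards" advice above.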

BTW - I took a look at interning and the string intern pool never gets cleaned up. So your memory footprint gets larger and larger. I'll have to write a blog entry about it because I've seen people recommending that you intern all strings?!? There's a reason it's not done by default.

JamesKovacs at 2007-9-9
