About the author

J Sawyer is a developer based in Houston, TX and loves to write code, especially ASP.NET and other web-related stuff. He is currently working on implementing Team Foundation Server at a large energy company in Houston and is loving that too.

He also loves to ride his Yamaha FZ1. And sometimes his Ninja 650.

But he doesn't code and ride at the same time. That would be bad.

Linq Performance Part II - Filtering

October 22, 2008 1:43 PM

Continuing on the previous topic of Linq Performance … I’m now doing something a bit more interesting than just a “Select From”. All of the key conditions (machines, specs, methodology, blah blah blah) remain the same; no changes at all there. However, I’ll be digging around in filtering this time, comparing filtering between ADO.NET, Linq to SQL and, just for giggles, Linq to Objects and Linq to ADO.NET. Based on the previous results, I’m not using constructors for the custom classes, but rather property binding. The performance of the full property binding (rather than fields) is good and, let’s be honest here, that’s how you should be doing it anyway.

First, an overview of the different types of filters that I’m going to be running:

Find By First Letter: This will do a search/filter for Persons by the first letter of their first name … a LIKE query. So that the database can't just get all optimized and serve up the same cached query results over and over, I select the first letter randomly from a cached copy of the table; the time spent picking the letter is not included in the results. Yes, the query plan will be cached, but that’s normal and a part of the overall performance system that we want to test anyway.

Find By Non Key: This does a search/filter for Persons by First Name and Last Name. This uses an equality operator and will (most likely, though I didn’t check) return a single row. As before, the First Name/Last Name combination comes from a cached copy of the table and the values are randomly selected. As with the previous test, the query plan is cached and, again, that’s a normal thing.

Find Key: The last test does a search for a row by the primary key value. This does return a single row in all cases. The key to search for is randomly selected from a cached copy of the table.

For all of the tests, actual, valid values were used; hence the random selection from a cached copy of the table. Originally, this was not the case, but I quickly found that the Linq tests that returned a single item, in particular, would throw an exception if nothing was found. This is most likely because I used the First() method on the query return (the exception said that the list was empty). This would not have been an issue if I hadn’t called this method and had, instead, enumerated over the returned collection of 0 or 1 items.
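To make the First() behavior concrete, here is a minimal sketch using Linq to Objects. The Person class and the FindByIdOrNull helper are hypothetical stand-ins, not the classes from the actual test project. It shows First() throwing on an empty result, FirstOrDefault() returning null instead, and plain enumeration simply doing nothing.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for the Person class used in the tests.
public class Person
{
    public int Id { get; set; }
    public string FirstName { get; set; }
    public string LastName { get; set; }
}

public static class FirstVersusFirstOrDefault
{
    // Returns the matching Person, or null when nothing matches.
    public static Person FindByIdOrNull(IEnumerable<Person> people, int id)
    {
        return people.Where(person => person.Id == id).FirstOrDefault();
    }

    public static void Main()
    {
        var people = new List<Person>
        {
            new Person { Id = 1, FirstName = "Ann", LastName = "Lee" },
            new Person { Id = 2, FirstName = "Bob", LastName = "Ray" }
        };

        // Nobody has Id 99, so this query yields an empty sequence.
        IEnumerable<Person> noMatch = people.Where(person => person.Id == 99);

        try
        {
            // First() throws InvalidOperationException on an empty sequence.
            Person missing = noMatch.First();
            Console.WriteLine(missing.FirstName);
        }
        catch (InvalidOperationException ex)
        {
            Console.WriteLine("First() threw: " + ex.Message);
        }

        // FirstOrDefault() hands back null (the default for a class) instead.
        Person maybe = FindByIdOrNull(people, 99);
        Console.WriteLine(maybe == null ? "No match found" : maybe.FirstName);

        // Enumerating the 0-or-1 result never throws; the body just doesn't run.
        foreach (Person found in noMatch)
        {
            Console.WriteLine(found.FirstName);
        }
    }
}
```

If a miss is a legitimate outcome in your code, FirstOrDefault() plus a null check is usually the less surprising choice; First() is better when an empty result really is a bug.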

For each of the test batches, five different methodologies were used.

Data View: This uses an ADO.NET DataView on an existing DataTable to do the filtering. The creation and filling of the table was not included in the test result. This is a method that you would use for cached data and tests the filtering capabilities of the DataView on its own.

DataSet Filter: This uses the Filter() method to retrieve a subset of the rows. As with the previous, the table that is used comes prefilled.

Linq Detached: Essentially, this is Linq to Objects. The results come from the database and are then detached from the database, putting the results into a generic List<> class. As with the previous, creating and filling the list is not included in the results.

Linq To ADO: For something different, this filters a DataTable using Linq. Again, this is something that you’d do with a cache. And, yet again (I’m beginning to feel like a broken record here), the filling of the DataTable that is used for this is not included in the results.

Linq To Sql: This uses pure Linq to Sql, retrieving the results from the database and then returning the results. In this case, the cost of actually hitting the database is included in the results. As you can, I’m sure, imagine, this is the only test where the query plan caching made any difference at all; the rest of the tests were working on data in memory.

I did not include results where a DataSet returns results directly from the database; the performance characteristics of this with respect to the Linq To Sql tests would be the same as in the previous selection tests.
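For readers who haven't used all of these, here is a rough sketch of what the in-memory approaches look like side by side for the first-letter filter. This is not the test code itself. The table layout, the Person class and the method names are made up for illustration, and I'm assuming that the "DataSet Filter" batch corresponds to the stock DataTable.Select() call and that a keyed lookup can be done with Rows.Find().

```csharp
using System;
using System.Collections.Generic;
using System.Data;   // AsEnumerable()/Field<T>() need System.Data.DataSetExtensions on .NET Framework
using System.Linq;

// Hypothetical stand-in for the Person class used in the tests.
public class Person
{
    public int Id { get; set; }
    public string FirstName { get; set; }
}

public static class InMemoryFilters
{
    // Builds a tiny stand-in for the prefilled, cached Person table.
    public static DataTable BuildTable()
    {
        var table = new DataTable("Person");
        table.Columns.Add("Id", typeof(int));
        table.Columns.Add("FirstName", typeof(string));
        table.PrimaryKey = new[] { table.Columns["Id"] };
        table.Rows.Add(1, "Ann");
        table.Rows.Add(2, "Adam");
        table.Rows.Add(3, "Bob");
        return table;
    }

    // Data View: filter through DataView.RowFilter.
    // (Building the expression by concatenation is fine for a single known
    // letter; arbitrary input would need its apostrophes escaped.)
    public static int CountWithDataView(DataTable table, string letter)
    {
        var view = new DataView(table);
        view.RowFilter = "FirstName LIKE '" + letter + "%'";
        return view.Count;
    }

    // DataSet Filter (assumed): DataTable.Select with the same expression.
    public static int CountWithSelect(DataTable table, string letter)
    {
        return table.Select("FirstName LIKE '" + letter + "%'").Length;
    }

    // Linq To ADO: Linq over the DataTable rows.
    public static int CountWithLinqToDataSet(DataTable table, string letter)
    {
        return table.AsEnumerable()
                    .Count(row => row.Field<string>("FirstName")
                                     .StartsWith(letter, StringComparison.Ordinal));
    }

    // Linq Detached: plain Linq to Objects over a List<Person>.
    public static int CountWithLinqToObjects(List<Person> people, string letter)
    {
        return people.Count(person => person.FirstName.StartsWith(letter, StringComparison.Ordinal));
    }

    // Keyed lookup against the DataTable's primary key.
    public static DataRow FindByKey(DataTable table, int id)
    {
        return table.Rows.Find(id);
    }

    public static void Main()
    {
        DataTable table = BuildTable();
        List<Person> people = table.AsEnumerable()
            .Select(row => new Person { Id = row.Field<int>("Id"), FirstName = row.Field<string>("FirstName") })
            .ToList();

        Console.WriteLine(CountWithDataView(table, "A"));
        Console.WriteLine(CountWithSelect(table, "A"));
        Console.WriteLine(CountWithLinqToDataSet(table, "A"));
        Console.WriteLine(CountWithLinqToObjects(people, "A"));
        Console.WriteLine(FindByKey(table, 3)["FirstName"]);
    }
}
```

One thing to watch when comparing these: the DataTable filter expressions are case-insensitive unless the table's CaseSensitive property is set, while the ordinal StartsWith calls are case-sensitive, so the approaches are not strictly doing identical work.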

So, without further ado, the results:

| Test Batch | Data View | DataSet Filter | Linq Detached | Linq to ADO | Linq to Sql |
|---|---|---|---|---|---|
| Find By First Letter | 25.687 | 23.679 | 49.844 | 36.979 | 28.084 |
| Find By Non Key | 34.516 | 138.782 | 9.066 | 27.020 | 12.787 |
| Find Key | 17.115 | 0.162 | 6.200 | 7.029 | 9.064 |
| Average | 25.773 | 54.208 | 21.703 | 23.676 | 16.645 |

[Image: bar chart of the results in the table above]

I have to say, I found the results quite interesting. There are some pretty wide variations in the methods, depending on what you are doing. I was also surprised to see that the Find By First Letter had the worst performance for Linq Detached … this was not what I was expecting and not something that I had seen in previous test runs on a different machine (but that was also testing against a Debug build rather than a Release build). The average time for the DataSet Filter was very highly impacted by the Find By Non Key batch … this is just really bad with DataSets. Find Key for the dataset was very fast though … so much so that you can’t even see the bar in the chart; this is due to the indexing of the primary key by the DataSet. Linq Detached was hurt by the Find By First Letter batch; my theory is that this is due to string operations, which have always been a little on the ugly side. Other than that, the find performance of Linq to Objects was quite good and finding by key and by non-key fields were little different – and this difference would, again, most likely be due to the string comparison vs. integer comparisons.



Linq | Performance

Austin Code Camp Stuff ...

May 24, 2008 10:50 PM

I promised that I'd make the materials from my talk at the Austin Code Camp available for download. I've finally gotten it compressed and uploaded. It's 111 MB, so be forewarned. Since I used WinRAR (and that's not as ubiquitous as zip formats), I've made it a self-extracting archive. You'll need Visual Studio 2008 Team Edition for Software Developers (at least) to read all of the performance results. But I do have an Excel spreadsheet with the pertinent data.



.NET Stuff | Linq | Performance | User Groups

More Notes on Performance Testing

May 14, 2008 11:06 PM

Well, I wanted to provide a little update on my previous discussion of my performance testing methodology; I've refined it a bit while getting ready for the Austin Code Camp.

Of course, GC.Collect() is still very important ... but I must correct what I said in the previous post. It's called before each test method run, not after. This ensures that the garbage collector is all cleaned up and collected before the test run even starts executing.

Now, on the calculations. I still do a normalized (or perhaps weighted, but we're getting into semantics here) average. But ... I've altered the equation a bit to subtract the overhead associated with the profiler probe. This overhead was, surprisingly, pretty different across the different test methods (in one set of tests, it ranged from 0.1 msec to 2.54 msec). With differences like that, it really is appropriate to discount the overhead from the overall results, since it does impact the overall numbers.

The final tweak was to make a call to each of the test methods before I went into the actual test. This was done in a separate Initialize method. This ensures that all of the classes being used (as was mentioned in the previous post) are loaded into memory and initialized. It also ensures that the methods themselves are JIT'd before the test runs begin as well; again, this is something that we need to take out of the final equation.
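Pulling those three tweaks together, a harness along these lines would do the job. To be clear, this is a sketch and not my actual test project: the real numbers came from the Visual Studio profiler rather than a Stopwatch, and the TimingHarness class, its method names and the simple "mean minus a fixed overhead" calculation (clamped at zero) are assumptions made for the example.

```csharp
using System;
using System.Diagnostics;

public static class TimingHarness
{
    // Runs 'test' the given number of times and returns the mean elapsed
    // milliseconds per run, less an assumed fixed per-call overhead.
    public static double Measure(Action test, int iterations, double overheadMs)
    {
        // Warm-up call, outside the timed loop: loads and initializes the
        // types involved and gets the method JIT-compiled.
        test();

        double totalMs = 0;
        for (int i = 0; i < iterations; i++)
        {
            // Collect before each run so that a pending collection is less
            // likely to land in the middle of the timed section.
            GC.Collect();
            GC.WaitForPendingFinalizers();
            GC.Collect();

            Stopwatch watch = Stopwatch.StartNew();
            test();
            watch.Stop();
            totalMs += watch.Elapsed.TotalMilliseconds;
        }

        return SubtractOverhead(totalMs / iterations, overheadMs);
    }

    // Discounts the measurement overhead; never reports a negative time.
    public static double SubtractOverhead(double meanMs, double overheadMs)
    {
        return Math.Max(0.0, meanMs - overheadMs);
    }

    public static void Main()
    {
        // Hypothetical workload standing in for one of the query tests.
        Action work = () =>
        {
            string text = new string('x', 10000);
            text.ToUpperInvariant();
        };

        double meanMs = Measure(work, 100, 0.0);
        Console.WriteLine("Mean ms per run: " + meanMs);
    }
}
```

The double GC.Collect() with WaitForPendingFinalizers() in between is a common belt-and-suspenders pattern; a single GC.Collect() is what the post describes.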



Performance | Visual Studio Tools

Notes on performance testing

May 7, 2008 12:38 PM

In performing the performance tests for Linq vs. ADO.NET, I spent quite a bit of time getting the methodology ironed out. Why? Well, I kept getting different results depending on the order in which the test methods were run. This struck me as somewhat odd and, honestly, even more frustrating. If the methodology was valid, one would certainly expect the results to be consistent regardless of the order in which the test methods were called.

Of course, the first thing that comes to mind is the connection pool. The first access to the database with a particular set of credentials would create the pool and take the hit for opening the connection to Sql Server. This would skew the results against the first called test run. This was an easy one and one that I had figured out before even running the tests. Creating and opening the connection before any of the tests were run was a no-brainer.

But something else was going on. The first method called on a particular run seemed to have a performance advantage. I even, at one time on previous tests, had case statements to alter the order ... but even then I'd get different results on different runs. This left me scratching my head a bit. Eventually, though, it occurred to me. There's a bunch of stuff that the Framework does for us and it's sometimes easy to forget about these things and how they impact performance. In this case, it was garbage collection. And it makes complete sense. Think about it ... the GC is non-deterministic. It happens pretty much when the runtime "feels" like it. So ... the GC would happen in various places and invariably skew the results somewhat. The impact didn't seem to be evenly distributed. Why the skewing? Because the GC, when it does a collection, halts all thread processing while it does its thing. Of course, when this occurred to me, it was a "DOH!" moment.

Once I added a call to GC.Collect() after every call to a test method, the results were, as I expected, remarkably similar across all of the test runs, regardless of the order in which they were called. Confirming, of course, my newly realized theory about the garbage collection and its impact on my performance tests.

For the final "numbers", I did toss out the low and the high values and re-average. Since Windows always has other things going on, some of those things may take a time slice or two of the processor from the test run. Or not take any. Still, doing this actually made very little difference to the results. As I think about it, though, I should also create an instance of every class that I use before the tests start, in order to make sure that the type is initialized in memory and the dll is loaded. Looking at the results, this really didn't appear to make much difference. Still, on future tests, I'll start doing that.
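The "toss out the low and the high" step is simple enough to show in a few lines. This is a sketch of the idea rather than the spreadsheet math I actually used; the TrimmedAverage class name is made up, and I'm assuming exactly one low and one high value are dropped.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class TrimmedAverage
{
    // Drops the single lowest and single highest sample, then averages the rest.
    public static double Compute(IEnumerable<double> samples)
    {
        List<double> sorted = samples.OrderBy(sample => sample).ToList();
        if (sorted.Count < 3)
        {
            throw new ArgumentException(
                "Need at least three samples to drop the low and the high.", "samples");
        }

        // Skip the lowest value, then take everything except the highest.
        return sorted.Skip(1).Take(sorted.Count - 2).Average();
    }

    public static void Main()
    {
        // Made-up timings in milliseconds; the 50 is the kind of outlier you
        // get when something else grabs the processor mid-run.
        double[] timings = { 10, 12, 11, 50, 9 };
        Console.WriteLine(Compute(timings));
    }
}
```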

Now, keep in mind that this applies only to artificial tests. And if you look at the Linq vs. ADO.NET tests, they were certainly quite artificial. Not what you would do in a real-world application. This was, of course, really only designed to test raw numbers for each of the methods that were being used at the time. When you are doing performance testing on your applications, this kind of testing methodology is invalid, to say the least. And calling GC.Collect() after every method call will, without question, hurt the overall performance of your application. So don't do it. For your individual applications, you need to take a holistic approach; test the application in the way it is expected to be used in the real world. Of course, this can only go so far because users will, invariably, do something that we didn't expect (why is that???) and telling them "Well, just don't do that" never seems to be an acceptable answer. For web applications, this needs to go a step further - in web apps, performance != scalability. They are related, to be sure, but not the same. I've seen web apps that perform pretty well ... but only with a few users, keeling over when they get 20 or more users. That's not good.



.NET Stuff | Performance

Thoughts on Linq vs ADO.NET - Simple Query

April 22, 2008 1:09 PM

I had a little discussion this morning with an old buddy of mine. I won't mention his name (didn't ask him for permission to) but those of you in Houston probably remember him ... he used to be a Microsoft guy and is probably one of the best developers in town. I have a world of respect for him and his opinion.

So ... it started with this ... he was surprised by the "do you think a user will notice 300 ms" question. Of course, that's a loaded question. They won't. But his point was this: 300 ms isn't a lot of time for a user, but under a heavy load, it can be a lot of time for the server. Yes, it can be ... if you have a heavy load. I won't give a blow-by-blow account of the conversation (I can't remember it line for line anyway), but it was certainly interesting.

One thing that we both agreed on that is important for web developers to understand is this: performance is not equal to scalability. They are related. But they are not the same. It is possible (and I've seen it) to create a web app that is really fast for a single user, but dies when you get a few users. Not only have I seen it, but (to be honest here), I've done it ... though, in my defense, it was my first ASP "Classic" application some 10 or 11 years ago; I was enamored with sessions at the time. Those were also the days when ADO "Classic" was new and RDO was the more commonly used API. And ... if you are a developer and haven't done something like that ... well, you're either really lucky or you're just not being honest.

With that out of the way ... I'd like to give my viewpoint on this:

Data Readers are still the fastest way to get data for a single pass. If it's one-time-use data that is just thrown away, it's still the way to go. No question. (At least, IMHO). But there's a lot of data out there that isn't a single pass and then toss ... it may be something that you keep around for a while as the user is working on it (which you often see in a Smart Client application) or is shared among multiple users (such as a lookup field that is consistent ... or pretty much consistent ... across all users). In both of these cases, you will need to have an object that can be held in memory and accessed multiple times. If you are doing a Smart Client application, it also needs to be scrollable. Data Readers don't provide this. So ... if you are doing these types of things, the extra 300 ms is actually well worth it.

In a web application, you'll scale a lot better (memory is a lot faster than a database query and it keeps load off the database server for little stuff) by caching common lookup lists in the global ASP.NET Cache. One thing that I find interesting ... the LinqDataSource in ASP.NET doesn't have an EnableCaching property like the SqlDataSource. It does, however, have a property StoreOriginalValuesInViewState. Hmmm ... curious. Storing this in ViewState can have its benefits ... it's a per-page, per-user quasi-cache ... but at the cost of additional data going over the wire (which might be somewhat painful over a 28.8 modem ... yes, some folks still use those). ViewState can be signed to prevent tampering, but keep in mind that it is serialized and Base64-encoded rather than compressed by default, so that wire hit is real. But ... the EnableCaching puts the resulting DataSet (it won't work in DataReader mode) into the global ASP.NET cache ... which, again, is good for things like lookups that really don't change very often, if at all.

For the Smart Client application ... well, DataReaders have limited use there anyway due to the respective natures of DataReaders and Smart Client apps. Granted, you can use a DataReader and then manually add the results to the control that you want it to display in ... but that can be a lot of code (yeah, ComboBoxes are pretty simple, but a DataGrid ... or a grid of any sort?).

One thing that struck me is the coding involved with master/child displays in Smart Client applications. There are two ways that you can do this in ADO.NET: You can get all the parents and children in one shot and load 'em into a DataSet (or object structure) -or- you can retrieve the children "on demand" (as the user requests the child). Each method has its benefits, but I'd typically lean to the on-demand access, especially if we are looking at a lot of data. This involves writing code to deal with the switching of the focus in the parent record and then filling the child. Not something that's all that difficult, but it is still more stuff to write and maintain. With Linq to Sql, this can be configured with the DeferredLoadingEnabled property of the DataContext and it will do it for you - depending on the value of this property (settable at runtime - you won't see it in the property sheet in the DataContext designer).

There was also some discussion about using Linq vs. rich data objects. This ... hmmm ... well, I'll just give my perspective. This is certainly possible with Linq, though certainly not with anonymous types (see http://blog.microsoft-j.net/2008/04/15/LinqAndAnonymousTypes.aspx for a discussion of them). But ... the Linq to Sql classes are generated as partial classes, so you can add to them to your heart's delight. You can also add methods that hit stored procs that aren't directly tied to a data class. Additionally, you can certainly use Linq to Sql with existing (or new) rich data classes that you create independently of your data access and then fill from the results of your query. As for the performance of these ... well, at the current moment, I don't have any numbers but I'd venture to guess that the performance would be comparable to anonymous types.

Performance aside, you also need to consider the other benefits that Linq brings to the table when looking to use it in your projects. Things like the ease of sorting and filtering the objects returned by Linq to Sql (or Linq to XML for that matter) using Linq to Objects. There is also the (way cool, IMHO) feature that lets you merge data from two different data sources (e.g. Linq to Sql and Linq to XML) into a single collection of objects or a single object hierarchy. Additional capabilities and functionality of one methodology over another are often overlooked when writing ASP.NET applications ... it's simply easier to look at the raw, single user, single page performance without thinking about the data in the holistic context of the overall application. This is, however, somewhat myopic; you need to keep the overall application context in mind when making technology and architecture decisions. This in mind ... hmmm ... off to do a bit more testing. Not sure if I'll do updates first or Linq sorting and filtering vs. DataViews.
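As a quick illustration of the merging feature, here is a small sketch that joins an in-memory list of objects with an XML document in a single query. The in-memory list stands in for results that would come from Linq to Sql (so that the example runs without a database), and the Person and PersonWithPhone classes and the XML layout are all invented for the example.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

// Hypothetical class standing in for a Linq to Sql entity.
public class Person
{
    public int Id { get; set; }
    public string FirstName { get; set; }
}

// Hypothetical merged result type.
public class PersonWithPhone
{
    public string FirstName { get; set; }
    public string Phone { get; set; }
}

public static class MergeSources
{
    // Joins the people to the <phone> elements on the person's Id.
    // This is an inner join: people without a phone element are left out.
    public static List<PersonWithPhone> Merge(IEnumerable<Person> people, XElement phones)
    {
        var query =
            from person in people
            join phone in phones.Elements("phone")
                on person.Id equals (int)phone.Attribute("personId")
            select new PersonWithPhone
            {
                FirstName = person.FirstName,
                Phone = (string)phone.Attribute("number")
            };

        return query.ToList();
    }

    public static void Main()
    {
        var people = new List<Person>
        {
            new Person { Id = 1, FirstName = "Ann" },
            new Person { Id = 2, FirstName = "Bob" },
            new Person { Id = 3, FirstName = "Cy" }
        };

        XElement phones = XElement.Parse(
            "<phones>" +
            "<phone personId=\"1\" number=\"555-0100\" />" +
            "<phone personId=\"2\" number=\"555-0101\" />" +
            "</phones>");

        foreach (PersonWithPhone item in Merge(people, phones))
        {
            Console.WriteLine(item.FirstName + ": " + item.Phone);
        }
    }
}
```

With a real Linq to Sql source you would typically pull the rows you need into memory first (ToList(), for instance) and then join against the XML, since the XML side can't be translated into SQL.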



.NET Stuff | Linq | Performance