Search The Web

Today's Headlines

Friday, November 6, 2009

Microsoft Access Tips & Tricks: Finding The Median

If you have worked with SQL, you know that one of the annoyances you might encounter is the need to find the median of a field of data in a table. The problem is that there is no standard aggregate function in SQL to find the median of a field, even though there are aggregate functions to find averages, standard deviations, minimums, maximums, etc. But the median is a very useful statistical measure of a field, and has several advantages over the mean including the ability not to be as influenced by outliers as the mean is prone to be.

I have found several SQL books that provide solutions to this problem. Most of these solutions are somewhat convoluted, and involve pretty expensive constructs like self-joins. Given below is an example of such a solution:

SELECT A.myField
FROM myTable AS A, myTable AS B
GROUP BY A.myField
HAVING (((Sum(IIf(A.myField<=B.myField,1,0)))>=(Count(*)/2))
And ((Sum(IIf(A.myField>=B.myField,1,0)))>=(Count(*)/2)));

This recipe is actually from page 311 of the book "Access Data Analysis Cookbook" by Ken Bluttman and Wayne S. Freeze. The ISBN's of the book are ISBN-10: 0-596-10122-8 or ISBN-13: 978-0-596-10122-0.

Essentially, it is trying to find the datapoint D in field myField such that at least half of the datapoints in that field are equal to or greater than D, AND such that at least half of the datapoints in that field are equal to or less than D.

If you have a table with a few hundred thousand or a few million records, this code will probably take so long to run that nobody will have the patience to wait for something like this to produce results for a report that they need on demand. The main problem is the cartesian product of the table with itself (expressed in the second line of the SQL above as "FROM myTable AS A, myTable AS B". Once you realize that, it is easy to see that even modest-sized tables would have enormous difficulties with this code. A table with a million records would produce a cartesian product that has a TRILLION rows! The database engine then has to group these trillion rows by the value of myField in the table and apply the HAVING clause on that.

Now, if the recipe actually worked, it may not be a big problem for people who want to do analysis primarily on small tables. If you have only a few hundred records in your table, you may not notice any delay in the execution of this SQL.

But, there is a actually a bigger problem with this code than just the time taken to execute. It sometimes produces incorrect results!

Consider a table with four records. Let the value of myField in these 4 records by 1, 2, 3 and 4. Mathematically, we know that the median of this data is actually 2.5. But it may be acceptable (for the sake of simplicity), if a given SQL code produced either 2 or 3 as the answer. If you actually run the SQL code above on this set of data though, you will find that the query does not produce one row as the answer, but 2! It gives you both 2 and 3 as the answers.

This is obviously unacceptable. I actually sent these concerns as an email to the publisher of the book (O'Reilly), but I never got a response to that email. In that email, I also suggested a couple of simple alternatives.

The first of these alternatives is:

SELECT MAX(myField)
FROM (SELECT TOP 50 PERCENT myField from myTable order by myField)

This is very easy to understand since I am simply taking advantage of the TOP keyword in Access SQL. Notice that TOP 50 PERCENT does not mean the highest 50% of values, but the lowest 50% (since the field is ordered in ascending order, and therefore, the TOP value will actually be the lowest value). Once the lowest 50% has been identified, I can take the maximum value out of that and that would be quite close to the true median.

For the simple table with 4 values, 1, 2, 3 and 4, the query above actually gives a median of 2, one of the acceptable values. If you expand the table to include the value 5 in the 5th record, it gives 3, which is the true median of the field.

Essentially, the query produces acceptable values for the median if the number of records is even and the true median is the average of two values in the table. The query produces mathematically correct values for the median when the number of records in the table is odd.

One can also modify the query as below to find 3 as the median instead of 2 if you have a slightly different definition of what is an acceptable median for the 4 values, 1, 2, 3 and 4.

SELECT MIN(myField)
FROM (SELECT TOP 50 PERCENT myField from myTable order by myField desc)

When the true median is actually the average of two data points in the dataset and is not a value in the dataset (for instance in datasets with even number of records where the middle two points are not equal), finding the true median actually requires an extra step: One has to find the average of the median found from one direction and the median found from the other direction. This is shown in the SQL below:

SELECT AVG(Field1) FROM
(select max(myField) as Field1 from (select top 50 percent myField from myTable order by myField)
UNION
select min(myField) from (select top 50 percent myField from myTable order by myField desc))

One will notice immediately that we are now taking the average of a set of values returned by a UNION query. Each of the queries in the UNION query returns one value (the median from either the top, or the bottom). The UNION query itself may return either one or two values (if the number of records in the dataset is odd, both queries in the UNION will arrive at the same answer, and by default the UNION query will remove duplicates, leaving only one row in the final output). The average of these one or two values then becomes the true median of the dataset. The only thing to keep in mind is that at least one of the individual queries in the UNION query has to provide an alias for the median field so that it can be averaged in the outer query.

Obviously, you can also calculate the median in a VBA function similar to the Domain Aggregate functions like DSum(), DAvg(), etc. There is no built-in DMedian() function in Access, but it would be easy to write one as below, and call it where appropriate. There is really no reason to do this since it is easy enough to calculate the median using just SQL, but if you are more comfortable in VBA than in SQL (especially if you are calculating the median of a set of records derived after heavy filtering, for instance), the function is there as a backup to the SQL.

This function would take the same arguments as the other domain aggregate functions for the sake of being consistent. The arguments are strings representing expression (the table field or calculated field for which you need the median), domain (the table or query name that constitutes the domain), and criteria (an optional string expression used to restrict the range of data on which the function is performed, usually written like the WHERE clause of a query without the word WHERE). If you are not sure how to use the function below, search Access help on using a function such as DSum(), look at the examples provided, and use the function below in exactly the same way.


Function DMedian(expr As String, domain As String, Optional criteria As String) As Double
Dim dbs As Database
Dim rst As Recordset
Dim middleIndex As Integer
dim numberOfRecords as Integer

Set dbs = CurrentDb
If Len(criteria) <> 0 Then
Set rst = dbs.OpenRecordset("select " & expr & " from " & domain & " where " & criteria & _
" order by " & expr)
Else
Set rst = dbs.OpenRecordset("select " & expr & " from " & domain & _
" order by " & expr)
'Make sure the spaces inside the quotes are preserved, otherwise your SQL will
'not be syntactically correct and Access will complain!
End If
If rst.BOF Then
numberOfRecords = 0
Else
rst.MoveLast
numberOfRecords = rst.RecordCount
'You need the Movelast to get the correct record count out of a recordset
End If

If numberOfRecords = 0 Then
DMedian = 0
'We assume that the median is 0 when the number of records is zero
ElseIf numberOfRecords = 1 Then
DMedian = rst(expr)
'If the number of records is 1, the value of the expression in that record is the median
Else
middleIndex = Int(0.5 * (numberofrecords - 1) + 1)
'MiddleIndex now points to the middle of the recordset if the recordset contains an odd number of records.
'If the recordset has an even number of records, it points just above the midpoint of the recordset.
rst.MoveFirst
rst.Move (middleIndex - 1)
DMedian = rst(expr)
'if the number of records is odd, we are done
If numberofrecords Mod 2 = 0 Then
'the number of records is an even number
rst.MoveNext
DMedian = (DMedian + rst(expr)) / 2#
'Take the average between the values in this record and the next record
'Note that the next record will point just below the midpoint of the recordset
End If
End If
rst.Close
Set rst = Nothing
dbs.Close
Set dbs = Nothing
'Cleanup everything before leaving the function
End Function


This code has been checked in Access 2003. It should work in any version of Access from Access 97 on up, but please do let me know through the comments if you have problems using the code. Good luck!

6 comments:

Holly E. said...

This is excellent - thanks! My only question -- I'd like to perform this calculation in the context of a crosstab (i.e., find the median for values in a category field), but I haven't been able to find information on how to use the D* functions within a crosstab syntax. Do you have any suggestions? Thanks!

Blogannath said...

Try wrapping the Dmedian() function in an avg(), min(), max() or some other built-in grouping function. So, you would say avg(dmedian(arguments)) instead of just dmedian(arguments). Only a built-in grouping function is allowed in a crosstab query as far as I know, and this trick should allow you to get the median into the crosstab by tricking Access into thinking you are just using a built-in function.

James said...

This is really helpful information. I have a massive data (millions of records) and I need to calculate median and other percentile values across several criteria. For example, median sell price across all the stores in a territory. All the data is in one table, and I'd like to be able to run a query and return the median sell value for each store. If I were calculating an average, I'd just group by store and access would do the rest. How can I use the information about to produce something similar?

Blogannath said...

Take a look at the blog post titled "Microsoft Access Tips & Tricks: Statistics with Grouping" in this blog. The link is: http://blogannath.blogspot.com/2010/03/microsoft-access-tips-tricks-statistics.html

Good luck!

Pierre-Alexandre Jamot said...

Hello ! This is a great article !

I am having trouble though using your last code which is just wonderful.

The systems says error 3061, Too few parameters, expected 7.

Do you have any idea?

best regards,

Anonymous said...

Fantastic - so much simpler than what I've come across on other sites.

Visitors Country Map

Free counters!

Content From TheFreeDictionary.com

In the News

Article of the Day

This Day in History

Today's Birthday

Quote of the Day

Word of the Day

Match Up
Match each word in the left column with its synonym on the right. When finished, click Answer to see the results. Good luck!

 

Hangman

Spelling Bee
difficulty level:
score: -
please wait...
 
spell the word:

Search The Web