Search The Web

Today's Headlines

Monday, March 22, 2010

Microsoft Access Tips & Tricks: Statistics With Grouping

My post on finding percentile values using Microsoft Access received a comment from an interested reader who had an interesting twist on the idea. Instead of finding percentile values for all the data in the table, the commenter wanted to find out percentile values for individual groups in the data. After some experimentation, I found a way to calculate what he/she wanted. But it led me to think about how one can calculate other aggregate statistics of one field like medians, modes, etc. while grouping by other fields in the table. This post is a result of that research.

If you are interested, you can find my earlier posts on finding the median, the mode, the geometric and harmonic means, ranking every row in a query, selecting random rows out of a table, calculating running sums and averages, calculating running differences, creating histograms, calculating probability masses out of given data, calculating cumulative distributions out of given data, finding percentile scores, percentile values, calculating distinct counts, full outer joins, parameter queries, crosstab queries, working with system objects, listing table fields, and finding unmatched rows.

Obviously, finding aggregate statistics like average, minimum, maximum, etc. while grouping by other fields is quite trivial in Access. These aggregate statistics have built-in aggregate functions, making their computation simply a matter of using the correct aggregate function in the query. Thus, to find the average of myField while grouping by groupField would involve a query like the one below:
select groupField, avg(myField) from myTable
group by groupField
But what happens when you want to find the median, mode or different percentile values? These don't have built-in aggregate functions, so I devoted entire posts to tricking Access to cough up these statistics using SQL constructs and tricks. It turns out that grouping while finding these kinds of aggregate statistics can be tricky at first. Once you figure out the limitations of Microsoft Access SQL, and learn how to live within them, it is not too bad.

I have provided some solutions to calculating aggregates while grouping below. I have kept these individual sections short so that I can accommodate as many of them as possible in this post. Given the links to the posts that deal with how to calculate these statistics without grouping, you should not have any problem seeing how the SQL there has been extended to accommodate grouping. Still, if you have questions or concerns, please don't hesitate to let me know with your comments.

Also note that even though I have grouped by only one field in all the queries below, you can actually group by as many fields as you want. You just have to add the additional fields to the main query in the GROUP BY clause, and add additional conditions to the subquery in the WHERE clause. In all the queries below, we are calculating aggregates on myField while grouping on groupField. Both of them are fields in myTable.

Median:
SELECT T.groupField, (select max(myField) from myTable where myTable.myField in
(select top 50 percent myField from myTable
where myTable.groupField = T.groupField  order by myField)) AS Median
FROM myTable AS T group by T.groupField
or
select T.groupField, 0.50*(select max(myField) from myTable where myTable.myField in
(select top 50 percent myField from myTable
where myTable.groupField = T.groupField order by myField)) +
0.50*(select min(myField) from myTable where myTable.myField in
(select top 50 percent myField from myTable
where myTable.groupField = T.groupField order by myField desc)) AS Median
from myTable AS T group by T.groupField

Mode:
SELECT groupField, (select max(myField) from myTable where myField in
(SELECT top 1 myField
FROM myTable where myTable.groupField = T.groupField
GROUP BY myField
ORDER BY count(*) DESC)) as Mode
from myTable as T group by T.groupField

Geometric Mean:
SELECT groupField, (select exp(avg(log(myField))) from myTable where myField in
(SELECT myField
FROM myTable where myTable.groupField = T.groupField)) as GeomMean
from myTable as T group by T.groupField

Harmonic Mean:
SELECT groupField, (select count(*)/sum(1.0/myField) from myTable where myField in
(SELECT myField
FROM myTable where myTable.groupField = T.groupField)) as GeomMean
from myTable as T group by T.groupField

Percentile Values:

Medians are just a special case of percentile values, the median being the 50th percentile value. Thus, the SQL for a percentile value while grouping is almost identical to that for the median. In the SQL below, PV represents the percentile value we are trying to find.
SELECT T.groupField, (select max(myField) from myTable where  myTable.myField in
(select top PV percent myField from myTable
where  myTable.groupField = T.groupField  order by myField)) AS PVPercentile
FROM  myTable AS T group by T.groupField
or
select  T.groupField, (1-PV/100)*(select max(myField) from myTable where  myTable.myField in
(select top PV percent myField from myTable
where myTable.groupField = T.groupField order by myField)) +
(PV/100)*(select min(myField) from myTable where myTable.myField in
(select top (100-PV) percent myField from myTable
where myTable.groupField = T.groupField order by myField desc)) AS PVPercentile
from myTable AS T group by T.groupField
Thus, the 25th percentile would be found as below:
SELECT T.groupField, (select max(myField) from myTable where  myTable.myField in
(select top 25 percent myField from myTable
where  myTable.groupField = T.groupField  order by myField)) AS PVPercentile
FROM  myTable AS T group by T.groupField
or
select  T.groupField, 0.75*(select max(myField) from myTable where  myTable.myField in
(select top 25 percent myField from myTable
where myTable.groupField = T.groupField order by myField)) +
0.25*(select min(myField) from myTable where myTable.myField in
(select top 75 percent myField from myTable
where myTable.groupField = T.groupField order by myField desc)) AS 25Percentile
from myTable AS T group by T.groupField
Notice the pattern in all the queries above: We select groupField first in the main query so that we know which group the aggregate statistic belongs to. The main query also sets an alias for myTable that can be used in the subquery. The main query also has a GROUP BY clause to group by groupField.

The second field selected in the main query is a calculated field. The calculated field is our aggregate statistic for myField. We use a correlated subquery to perform the calculation. The correlated subquery uses another subquery with an IN clause to calculate the value. The query inside the IN clause uses the value of groupField from the main query to subdivide the table into multiple temporary tables that contain only rows that have the same value of groupField as in the outer query.

The aggregate statistic we need is then calculated inside this subquery. Since the subquery has segregated the table to include only rows with the value of groupField equal to the value of groupField in the main query, we end up calculating the aggregate statistic for that group of rows.

Using this pattern, it should now be possible for you to calculate other aggregate statistics while grouping if you have the need to. If there are any that you have problems with, I would love to hear about it so that I can take it on as a challenge while helping you out at the same time. Good luck!

Hope this post has been helpful in solving any problems you might have had with aggregating with grouping in Access. If you have any problems or concerns with the SQL in this lesson, please feel free to let me know by posting a comment. If you have other questions on Access that you would like me to address in future lessons, please feel free to let me know through your comments too. Good luck!

19 comments:

James said...

Brilliant. I'm a relative novice to SQL, and I've shown your work to a few folks at my job that know more. They say the work is thoughtful and elegant. Thanks for the help. My one question: does T in the above represent table name? You refer to mytable explicitly in other places....

Blogannath said...

Thank you for your kind compliments.

In the above queries, T is an alias for myTable in the outer query. You will find the line "from myTable as T" at the end of each query. That is where T is defined. Essentially, for the grouping, we pick each grouped value from the table in the outer query, and then calculate statistics for it in the inner query using a WHERE clause to limit values chosen by the inner query to be corresponding to the grouped value in the outer query. And you need an alias for myTable in the inner query to do this selection. Hence the definition of T in the outer query, and then its use in the inner queries. Hope that makes sense.

Kevin said...

Hello!
First of all: Thank you very much for this blog and your posts about access, it helped me a lot in the last days.
Now I stuck with above solution of finding percentile in grouped data. When I test it with a regular table it works quite fine but when I use a query/view instead it says something like " " is not part of the aggregate function. " Any ideas on this?

Blogannath said...

Unfortunately, without looking at the query, it is impossible for me to figure out what is wrong with it. Cut the query down as much as possible to include only the absolute minimum necessary to generate the error. Just the process of doing this might give you a solution to the problem.

Mike said...

This page was very helpful. Could you provide an example using your second option for Median that would add a second grouping field (groupField2). I understand adding it in the GROUP BY clause in the main query but cannot get the WHERE clause to work with the added condition.
Thank you.

Blogannath said...

I haven't tested this, but try it and let me know if you encounter problems:

select T.groupField1, T.groupfield2, 0.50*(select max(myField) from myTable where myTable.myField in
(select top 50 percent myField from myTable
where myTable.groupField1 = T.groupField1 and myTable.groupfield2 = T.groupfield2 order by myField)) +
0.50*(select min(myField) from myTable where myTable.myField in
(select top 50 percent myField from myTable
where myTable.groupField1 = T.groupField1 and myTable.groupfield2 = T.groupfield2 order by myField desc)) AS Median
from myTable AS T group by T.groupField1, T.groupfield2

Mike said...

Your suggestion is similar what I had tried previously. I get the error message
[You tired to execute a query that does not include the specific expression " as part of an aggregate function.]

Blogannath said...

That error usually tells me that there is some spelling error or something like that where the variables in the group by clause are not the same as the variables in the select clause. Check them thoroughly (copy and paste them from one place to another to eliminate such errors), and see if the problem is still there. If the problem persists, post the query here and I will also take a look at it to see if I can help.

Mike said...

Pasted below is the query that I am working with. It works with grouping on one field (Interval) but I can't seem to make it work once I add the second group field (Timeframe). Thanks for your help. Here is the SQL:

select T.Interval, T.Timeframe, 0.50*(select max(Duration) from [Calc Query 1] where [Calc Query 1].Duration in
(select top 50 percent Duration from [Calc Query 1]
where [Calc Query 1].Interval = T.Interval and [Calc Query 1].Timeframe = T.Timeframe order by Duration)) +
0.50*(select min(Duration) from [Calc Query 1] where [Calc Query 1].Duration in
(select top 50 percent Duration from [Calc Query 1]
where [Calc Query 1].Interval = T.Interval and [Calc Query 1].Timeframe = T.Timeframe order by Duration desc)) AS Median
from [Calc Query 1] AS T group by T.Interval, T.Timeframe

Blogannath said...

Is [Calc Query 1] a table or another query? If it is another query, have you checked whether that query runs fine by itself?

I created a mock database with a table called "Calc Query 1", and created fields in it called "Interval", "Duration" and "Timeframe" (all doubles). I copied and pasted your query into the SQL view of a new query in this database, and it works perfectly fine (no syntax errors. I don't have any data in the table, so I have no idea what output it would produce, but logically, I think it would produce the median if there were data in the table).

Keith M said...

Do you have any recommendations on making this query run quicker (i.e., more efficiently)? I have a rather sizeable table (~165,000 records) that I need to calculate 95th percentile values grouped by a field (for which there are about 1,400 unique values). I tried indexing the table, on both groupField and myField (both with duplicates OK) and the query has been running for 30+ minutes and doesn't show signs of stopping. Eventually, I want to run this statistic on a much larger table (upwards of 25x larger). I've seen other ways of calculating percentile values using a custom function...any thoughts on whether that might be quicker? Any insight that you can give would be most helpful!

Keith M said...

A couple of other questions/comments:

1)In your text above for Percentil Values, the "or" section, shouldn't the second "select top" statement (the one associated with "select min(myField)") be "select top (100-PV) percent? Note that the same comment goes to your example of the 25th percentile value.

2)I'm using Access 2010, and trying to get a 95th percentile value. Access is returning the same record for the max of "SELECT TOP 95 PERCENT..." as it does for "SELECT TOP 5 PERCENT...order...DESC". I created a test table that only had one of my unique values, and thus only had 114 records. A "SELECT TOP 95 PERCENT..." query returns 109 records while a "SELECT TOP 5 PERCENT...order...DESC" returns 6 records. The 109th record of the ascending sort is the exact same record as the 6th record in the descending sort. I would have thought that the "SELECT TOP 95 PERCENT" query would only return 108 records (0.95*114 = 108.3).

Blogannath said...

1) The logic for how to weight the two values when calculating percentile values is explained in the post on percentile values: http://blogannath.blogspot.com/2010/01/microsoft-access-tips-tricks-percentile_25.html

2) Access rounds up the number of records it returns to the next higher integer. If it did not do that, you would never be able to find the median (or any percentile values) when the table contains a single record, for instance.

As far as speed is concerned, I have posted a VBA solution to the problem of finding percentile values in the post on percentile values: http://blogannath.blogspot.com/2010/01/microsoft-access-tips-tricks-percentile_25.html

You have to modify that VBA code to find percentiles with grouping. I think it may be faster than a query, but Access is not the best solution when you have a lot of records in a table. You might be better off using an actual enterprise RDBMS product like SQL server or a free solution like MySQL.

Blogannath said...

Actually, I have posted VBA in the medians post, not in the post on finding percentile values: http://blogannath.blogspot.com/2009/11/microsoft-access-tips-tricks-finding.html

It should be easy to modify that VBA to return any percentile value rather than just the 50th percentile value.

Keith M said...

1) Yes, I read the post on percentile values first, and the example that you give there says "select 0.90*(SELECT MAX(myField)
FROM (SELECT TOP 10 PERCENT myField from myTable order by myField)) +
0.10*(SELECT MIN(myField)
FROM (SELECT TOP 90 PERCENT myField from myTable order by myField desc))
from myTable". I note that the "MAX(myField)" is associated with the "SELECT top 10 PERCENT" and is multiplied by 0.90, while the "MIN(myField)" is associated with the "SELECT top 90 PERCENT" and is multiplied by 0.10.

However, in the code you give on this page, both of the "SELECT TOP" statements say "SELECT TOP PV PERCENT". My point was that, following the logic on the post on percentile values, the second one should read "SELECT TOP (100-PV) PERCENT".

2) If Access is rounding up the number of records, it will always return the same record for the "SELECT TOP PV PERCENT" as the "SELECT TOP (100-PV) PERCENT...ORDER...DESC", so there is no need to weight one by PV% and the other by (100-PV)%.

3) Thanks for pointing me to the VB code. I'll give that a read and post further comments if I have any questions.

THANK YOU SO MUCH for replying!

Blogannath said...

1) Ah, got it. I was looking for the mistake in the wrong place, and I misread your original comment. The text has been corrected, thanks for pointing out the error.

2)Not necessarily. In the case of 114 records, it just worked out that way. If there were 4 records in the table, and you were trying to find the median, one part of the query would pull up the second record and the other would pull up the 3rd record, and you have to average them to get the correct median. Similarly, if you have 100 records, and you are trying to find the 95th percentile, you will pull up the 95th and 96th records with the two queries and the results have to be weighted appropriately to get the correct value.

In general it is better to assume that two different records will be pulled up rather than rely on them being identical.

3) You are welcome, and I will be glad to help out as much as I can.

Keith M said...

1) Don't forget to also fix the example of the 25th percentile (the second "select top" should be 75 instead of 25).

2) The "TOP PV PERCENT" and "TOP (100-PV) PERCENT" selects only pull up two different records in certain specific instances: when the number of records is a multiple of either (100/PV) or (100/(100-PV)). (In your two examples, 4 is a multiple of 2 (100/50) and 100 is a multiple of 20 (100/5). When you're just looking for medians (PV=50), that's fine, because either the number your looking for is a record, or it's between two records. However, if PV is anything other than 50, the technique doesn't appear to give you an accurate percentile value. For example, I created myTable with two fields, myGroup and myField and created 100 records, with myField being 1 through 100, and myGroup being all "A". I then created a query as:

SELECT T.myGroup, (1-95/100)*(select max([myField]) from myTable where myTable.myField in (select top 95 percent myField from myTable where myTable.myGroup = T.myGroup order by myField))+(95/100)*(select min([myField]) from myTable where myTable.myField in (select top 5 percent myField from myTable where myTable.myGroup = T.myGroup order by myField desc)) AS Expr1 FROM myTable AS T GROUP BY T.myGroup;

For exactly 100 records, this gives a result of 95.95. However, if I add a 101th record (myField of 101), the same query gives the result of 96. Adding 102 give 97, 103 gives 98, and so on until 119 records results in a value of 114. However, adding the 120th record gives a value of 114.95. I'm not really sure this is the behavior that is expected.

3) I was able to successfully modify your DMedian code to return percentiles, but I decided to use the method of "linear interpolation between closest ranks" (see http://en.wikipedia.org/wiki/Percentile). I'll paste the code in the next comment. The query that I created is as follows:
SELECT myTable.myGroup, DPercentile(95,"myField","myTable","myGroup = '" & [myGroup] & "'") AS _95Percentile
FROM myTable
GROUP BY myTable.myGroup;

This works MUCH MUCH faster than using the nested SELECT TOP PV PERCENT technique, even with relatively small datasets.

Keith M said...

Here's the code. Again, I chose to use a linear interpolation method. If you want to weight based on PV and (100-PV) you could easily change that section of the code.

Function DPercentile(PV As Single, expr As String, domain As String, Optional criteria As String) As Double
'Uses linear interpolation method for percentile values.
Dim dbs As Database
Dim rst As Recordset
Dim lowIndex As Integer
Dim numberOfRecords As Integer

Set dbs = CurrentDb
If Len(criteria) <> 0 Then
Set rst = dbs.OpenRecordset("select " & expr & " from " & domain & " where " & criteria & _
" order by " & expr)
Else
Set rst = dbs.OpenRecordset("select " & expr & " from " & domain & _
" order by " & expr)
'Make sure the spaces inside the quotes are preserved, otherwise your SQL will
'not be syntactically correct and Access will complain!
End If
If rst.BOF Then
numberOfRecords = 0
Else
rst.MoveLast
numberOfRecords = rst.RecordCount
'You need the Movelast to get the correct record count out of a recordset
End If

If numberOfRecords = 0 Then
DPercentile = 0
'We assume that the percentile is 0 when the number of records is zero
ElseIf numberOfRecords = 1 Then
DPercentile = rst(expr)
'If the number of records is 1, the value of the expression in that record is the percentile
Else
trueIndex = (PV / 100 * numberOfRecords) + 0.5
lowIndex = Int(trueIndex)
'lowIndex now points to the position below (or at) the correct percentile value.
rst.MoveFirst
rst.Move (lowIndex - 1)
DPercentile = rst(expr)
'if the number of records is odd, we are done
If lowIndex <> trueIndex Then
'the percentile value doesn't fall exactly on a record
rst.MoveNext
DPercentile = DPercentile + (trueIndex - lowIndex) * (rst(expr) - DPercentile)
'Do the linear interpolation between the two values.
End If
End If
rst.Close
Set rst = Nothing
dbs.Close
Set dbs = Nothing
'Cleanup everything before leaving the function
End Function

Blogannath said...

Sorry about the delay in responding. Vacation, then computer problems, you know the drill.

1) Thanks for pointing that out. It is now fixed.

2) I am not disagreeing with you. But as long as at least one case exists where the two queries pull up different records, I need to take that possibility into account. I don't want to write a query that will work 95 or 99% of the time, and give the wrong results the rest of the time. The way things are written now, it will give you the correct answers 100% of the time.

3) Thank you for the code.

Visitors Country Map

Free counters!

Content From TheFreeDictionary.com

In the News

Article of the Day

This Day in History

Today's Birthday

Quote of the Day

Word of the Day

Match Up
Match each word in the left column with its synonym on the right. When finished, click Answer to see the results. Good luck!

 

Hangman

Spelling Bee
difficulty level:
score: -
please wait...
 
spell the word:

Search The Web