If you are interested, you can find my earlier posts on finding the median, the mode, the geometric and harmonic means, ranking every row in a query, selecting random rows out of a table, calculating running sums and averages, calculating running differences, creating histograms, calculating probability masses out of given data, calculating cumulative distributions out of given data, finding percentile scores, percentile values, calculating distinct counts, full outer joins, parameter queries, crosstab queries, working with system objects, listing table fields, and finding unmatched rows.

Obviously, finding aggregate statistics like average, minimum, maximum, etc. while grouping by other fields is quite trivial in Access. These aggregate statistics have built-in aggregate functions, making their computation simply a matter of using the correct aggregate function in the query. Thus, to find the average of myField while grouping by groupField would involve a query like the one below:

select groupField, avg(myField) from myTable group by groupFieldBut what happens when you want to find the median, mode or different percentile values? These don't have built-in aggregate functions, so I devoted entire posts to tricking Access to cough up these statistics using SQL constructs and tricks. It turns out that grouping while finding these kinds of aggregate statistics can be tricky at first. Once you figure out the limitations of Microsoft Access SQL, and learn how to live within them, it is not too bad.

I have provided some solutions to calculating aggregates while grouping below. I have kept these individual sections short so that I can accommodate as many of them as possible in this post. Given the links to the posts that deal with how to calculate these statistics without grouping, you should not have any problem seeing how the SQL there has been extended to accommodate grouping. Still, if you have questions or concerns, please don't hesitate to let me know with your comments.

Also note that even though I have grouped by only one field in all the queries below, you can actually group by as many fields as you want. You just have to add the additional fields to the main query in the GROUP BY clause, and add additional conditions to the subquery in the WHERE clause. In all the queries below, we are calculating aggregates on myField while grouping on groupField. Both of them are fields in myTable.

Median:

SELECT T.groupField, (select max(myField) from myTable where myTable.myField in (select top 50 percent myField from myTable where myTable.groupField = T.groupField order by myField)) AS Median FROM myTable AS T group by T.groupFieldor

select T.groupField, 0.50*(select max(myField) from myTable where myTable.myField in (select top 50 percent myField from myTable where myTable.groupField = T.groupField order by myField)) + 0.50*(select min(myField) from myTable where myTable.myField in (select top 50 percent myField from myTable where myTable.groupField = T.groupField order by myField desc)) AS Median from myTable AS T group by T.groupField

Mode:

SELECT groupField, (select max(myField) from myTable where myField in (SELECT top 1 myField FROM myTable where myTable.groupField = T.groupField GROUP BY myField ORDER BY count(*) DESC)) as Mode from myTable as T group by T.groupField

Geometric Mean:

SELECT groupField, (select exp(avg(log(myField))) from myTable where myField in (SELECT myField FROM myTable where myTable.groupField = T.groupField)) as GeomMean from myTable as T group by T.groupField

Harmonic Mean:

SELECT groupField, (select count(*)/sum(1.0/myField) from myTable where myField in (SELECT myField FROM myTable where myTable.groupField = T.groupField)) as GeomMean from myTable as T group by T.groupField

Percentile Values:

Medians are just a special case of percentile values, the median being the 50th percentile value. Thus, the SQL for a percentile value while grouping is almost identical to that for the median. In the SQL below, PV represents the percentile value we are trying to find.

SELECT T.groupField, (select max(myField) from myTable where myTable.myField in (select top PV percent myField from myTable where myTable.groupField = T.groupField order by myField)) AS PVPercentile FROM myTable AS T group by T.groupFieldor

select T.groupField, (1-PV/100)*(select max(myField) from myTable where myTable.myField in (select top PV percent myField from myTable where myTable.groupField = T.groupField order by myField)) + (PV/100)*(select min(myField) from myTable where myTable.myField in (select top (100-PV) percent myField from myTable where myTable.groupField = T.groupField order by myField desc)) AS PVPercentile from myTable AS T group by T.groupFieldThus, the 25th percentile would be found as below:

SELECT T.groupField, (select max(myField) from myTable where myTable.myField in (select top 25 percent myField from myTable where myTable.groupField = T.groupField order by myField)) AS PVPercentile FROM myTable AS T group by T.groupFieldor

select T.groupField, 0.75*(select max(myField) from myTable where myTable.myField in (select top 25 percent myField from myTable where myTable.groupField = T.groupField order by myField)) + 0.25*(select min(myField) from myTable where myTable.myField in (select top 75 percent myField from myTable where myTable.groupField = T.groupField order by myField desc)) AS 25Percentile from myTable AS T group by T.groupFieldNotice the pattern in all the queries above: We select groupField first in the main query so that we know which group the aggregate statistic belongs to. The main query also sets an alias for myTable that can be used in the subquery. The main query also has a GROUP BY clause to group by groupField.

The second field selected in the main query is a calculated field. The calculated field is our aggregate statistic for myField. We use a correlated subquery to perform the calculation. The correlated subquery uses another subquery with an IN clause to calculate the value. The query inside the IN clause uses the value of groupField from the main query to subdivide the table into multiple temporary tables that contain only rows that have the same value of groupField as in the outer query.

The aggregate statistic we need is then calculated inside this subquery. Since the subquery has segregated the table to include only rows with the value of groupField equal to the value of groupField in the main query, we end up calculating the aggregate statistic for that group of rows.

Using this pattern, it should now be possible for you to calculate other aggregate statistics while grouping if you have the need to. If there are any that you have problems with, I would love to hear about it so that I can take it on as a challenge while helping you out at the same time. Good luck!

Hope this post has been helpful in solving any problems you might have had with aggregating with grouping in Access. If you have any problems or concerns with the SQL in this lesson, please feel free to let me know by posting a comment. If you have other questions on Access that you would like me to address in future lessons, please feel free to let me know through your comments too. Good luck!

## 23 comments:

Brilliant. I'm a relative novice to SQL, and I've shown your work to a few folks at my job that know more. They say the work is thoughtful and elegant. Thanks for the help. My one question: does T in the above represent table name? You refer to mytable explicitly in other places....

Thank you for your kind compliments.

In the above queries, T is an alias for myTable in the outer query. You will find the line "from myTable as T" at the end of each query. That is where T is defined. Essentially, for the grouping, we pick each grouped value from the table in the outer query, and then calculate statistics for it in the inner query using a WHERE clause to limit values chosen by the inner query to be corresponding to the grouped value in the outer query. And you need an alias for myTable in the inner query to do this selection. Hence the definition of T in the outer query, and then its use in the inner queries. Hope that makes sense.

Hello!

First of all: Thank you very much for this blog and your posts about access, it helped me a lot in the last days.

Now I stuck with above solution of finding percentile in grouped data. When I test it with a regular table it works quite fine but when I use a query/view instead it says something like " " is not part of the aggregate function. " Any ideas on this?

Unfortunately, without looking at the query, it is impossible for me to figure out what is wrong with it. Cut the query down as much as possible to include only the absolute minimum necessary to generate the error. Just the process of doing this might give you a solution to the problem.

This page was very helpful. Could you provide an example using your second option for Median that would add a second grouping field (groupField2). I understand adding it in the GROUP BY clause in the main query but cannot get the WHERE clause to work with the added condition.

Thank you.

I haven't tested this, but try it and let me know if you encounter problems:

select T.groupField1, T.groupfield2, 0.50*(select max(myField) from myTable where myTable.myField in

(select top 50 percent myField from myTable

where myTable.groupField1 = T.groupField1 and myTable.groupfield2 = T.groupfield2 order by myField)) +

0.50*(select min(myField) from myTable where myTable.myField in

(select top 50 percent myField from myTable

where myTable.groupField1 = T.groupField1 and myTable.groupfield2 = T.groupfield2 order by myField desc)) AS Median

from myTable AS T group by T.groupField1, T.groupfield2

Your suggestion is similar what I had tried previously. I get the error message

[You tired to execute a query that does not include the specific expression " as part of an aggregate function.]

That error usually tells me that there is some spelling error or something like that where the variables in the group by clause are not the same as the variables in the select clause. Check them thoroughly (copy and paste them from one place to another to eliminate such errors), and see if the problem is still there. If the problem persists, post the query here and I will also take a look at it to see if I can help.

Pasted below is the query that I am working with. It works with grouping on one field (Interval) but I can't seem to make it work once I add the second group field (Timeframe). Thanks for your help. Here is the SQL:

select T.Interval, T.Timeframe, 0.50*(select max(Duration) from [Calc Query 1] where [Calc Query 1].Duration in

(select top 50 percent Duration from [Calc Query 1]

where [Calc Query 1].Interval = T.Interval and [Calc Query 1].Timeframe = T.Timeframe order by Duration)) +

0.50*(select min(Duration) from [Calc Query 1] where [Calc Query 1].Duration in

(select top 50 percent Duration from [Calc Query 1]

where [Calc Query 1].Interval = T.Interval and [Calc Query 1].Timeframe = T.Timeframe order by Duration desc)) AS Median

from [Calc Query 1] AS T group by T.Interval, T.Timeframe

Is [Calc Query 1] a table or another query? If it is another query, have you checked whether that query runs fine by itself?

I created a mock database with a table called "Calc Query 1", and created fields in it called "Interval", "Duration" and "Timeframe" (all doubles). I copied and pasted your query into the SQL view of a new query in this database, and it works perfectly fine (no syntax errors. I don't have any data in the table, so I have no idea what output it would produce, but logically, I think it would produce the median if there were data in the table).

Do you have any recommendations on making this query run quicker (i.e., more efficiently)? I have a rather sizeable table (~165,000 records) that I need to calculate 95th percentile values grouped by a field (for which there are about 1,400 unique values). I tried indexing the table, on both groupField and myField (both with duplicates OK) and the query has been running for 30+ minutes and doesn't show signs of stopping. Eventually, I want to run this statistic on a much larger table (upwards of 25x larger). I've seen other ways of calculating percentile values using a custom function...any thoughts on whether that might be quicker? Any insight that you can give would be most helpful!

A couple of other questions/comments:

1)In your text above for Percentil Values, the "or" section, shouldn't the second "select top" statement (the one associated with "select min(myField)") be "select top (100-PV) percent? Note that the same comment goes to your example of the 25th percentile value.

2)I'm using Access 2010, and trying to get a 95th percentile value. Access is returning the same record for the max of "SELECT TOP 95 PERCENT..." as it does for "SELECT TOP 5 PERCENT...order...DESC". I created a test table that only had one of my unique values, and thus only had 114 records. A "SELECT TOP 95 PERCENT..." query returns 109 records while a "SELECT TOP 5 PERCENT...order...DESC" returns 6 records. The 109th record of the ascending sort is the exact same record as the 6th record in the descending sort. I would have thought that the "SELECT TOP 95 PERCENT" query would only return 108 records (0.95*114 = 108.3).

1) The logic for how to weight the two values when calculating percentile values is explained in the post on percentile values: http://blogannath.blogspot.com/2010/01/microsoft-access-tips-tricks-percentile_25.html

2) Access rounds up the number of records it returns to the next higher integer. If it did not do that, you would never be able to find the median (or any percentile values) when the table contains a single record, for instance.

As far as speed is concerned, I have posted a VBA solution to the problem of finding percentile values in the post on percentile values: http://blogannath.blogspot.com/2010/01/microsoft-access-tips-tricks-percentile_25.html

You have to modify that VBA code to find percentiles with grouping. I think it may be faster than a query, but Access is not the best solution when you have a lot of records in a table. You might be better off using an actual enterprise RDBMS product like SQL server or a free solution like MySQL.

Actually, I have posted VBA in the medians post, not in the post on finding percentile values: http://blogannath.blogspot.com/2009/11/microsoft-access-tips-tricks-finding.html

It should be easy to modify that VBA to return any percentile value rather than just the 50th percentile value.

1) Yes, I read the post on percentile values first, and the example that you give there says "select 0.90*(SELECT MAX(myField)

FROM (SELECT TOP 10 PERCENT myField from myTable order by myField)) +

0.10*(SELECT MIN(myField)

FROM (SELECT TOP 90 PERCENT myField from myTable order by myField desc))

from myTable". I note that the "MAX(myField)" is associated with the "SELECT top 10 PERCENT" and is multiplied by 0.90, while the "MIN(myField)" is associated with the "SELECT top 90 PERCENT" and is multiplied by 0.10.

However, in the code you give on this page, both of the "SELECT TOP" statements say "SELECT TOP PV PERCENT". My point was that, following the logic on the post on percentile values, the second one should read "SELECT TOP (100-PV) PERCENT".

2) If Access is rounding up the number of records, it will always return the same record for the "SELECT TOP PV PERCENT" as the "SELECT TOP (100-PV) PERCENT...ORDER...DESC", so there is no need to weight one by PV% and the other by (100-PV)%.

3) Thanks for pointing me to the VB code. I'll give that a read and post further comments if I have any questions.

THANK YOU SO MUCH for replying!

1) Ah, got it. I was looking for the mistake in the wrong place, and I misread your original comment. The text has been corrected, thanks for pointing out the error.

2)Not necessarily. In the case of 114 records, it just worked out that way. If there were 4 records in the table, and you were trying to find the median, one part of the query would pull up the second record and the other would pull up the 3rd record, and you have to average them to get the correct median. Similarly, if you have 100 records, and you are trying to find the 95th percentile, you will pull up the 95th and 96th records with the two queries and the results have to be weighted appropriately to get the correct value.

In general it is better to assume that two different records will be pulled up rather than rely on them being identical.

3) You are welcome, and I will be glad to help out as much as I can.

1) Don't forget to also fix the example of the 25th percentile (the second "select top" should be 75 instead of 25).

2) The "TOP PV PERCENT" and "TOP (100-PV) PERCENT" selects only pull up two different records in certain specific instances: when the number of records is a multiple of either (100/PV) or (100/(100-PV)). (In your two examples, 4 is a multiple of 2 (100/50) and 100 is a multiple of 20 (100/5). When you're just looking for medians (PV=50), that's fine, because either the number your looking for is a record, or it's between two records. However, if PV is anything other than 50, the technique doesn't appear to give you an accurate percentile value. For example, I created myTable with two fields, myGroup and myField and created 100 records, with myField being 1 through 100, and myGroup being all "A". I then created a query as:

SELECT T.myGroup, (1-95/100)*(select max([myField]) from myTable where myTable.myField in (select top 95 percent myField from myTable where myTable.myGroup = T.myGroup order by myField))+(95/100)*(select min([myField]) from myTable where myTable.myField in (select top 5 percent myField from myTable where myTable.myGroup = T.myGroup order by myField desc)) AS Expr1 FROM myTable AS T GROUP BY T.myGroup;

For exactly 100 records, this gives a result of 95.95. However, if I add a 101th record (myField of 101), the same query gives the result of 96. Adding 102 give 97, 103 gives 98, and so on until 119 records results in a value of 114. However, adding the 120th record gives a value of 114.95. I'm not really sure this is the behavior that is expected.

3) I was able to successfully modify your DMedian code to return percentiles, but I decided to use the method of "linear interpolation between closest ranks" (see http://en.wikipedia.org/wiki/Percentile). I'll paste the code in the next comment. The query that I created is as follows:

SELECT myTable.myGroup, DPercentile(95,"myField","myTable","myGroup = '" & [myGroup] & "'") AS _95Percentile

FROM myTable

GROUP BY myTable.myGroup;

This works MUCH MUCH faster than using the nested SELECT TOP PV PERCENT technique, even with relatively small datasets.

Here's the code. Again, I chose to use a linear interpolation method. If you want to weight based on PV and (100-PV) you could easily change that section of the code.

Function DPercentile(PV As Single, expr As String, domain As String, Optional criteria As String) As Double

'Uses linear interpolation method for percentile values.

Dim dbs As Database

Dim rst As Recordset

Dim lowIndex As Integer

Dim numberOfRecords As Integer

Set dbs = CurrentDb

If Len(criteria) <> 0 Then

Set rst = dbs.OpenRecordset("select " & expr & " from " & domain & " where " & criteria & _

" order by " & expr)

Else

Set rst = dbs.OpenRecordset("select " & expr & " from " & domain & _

" order by " & expr)

'Make sure the spaces inside the quotes are preserved, otherwise your SQL will

'not be syntactically correct and Access will complain!

End If

If rst.BOF Then

numberOfRecords = 0

Else

rst.MoveLast

numberOfRecords = rst.RecordCount

'You need the Movelast to get the correct record count out of a recordset

End If

If numberOfRecords = 0 Then

DPercentile = 0

'We assume that the percentile is 0 when the number of records is zero

ElseIf numberOfRecords = 1 Then

DPercentile = rst(expr)

'If the number of records is 1, the value of the expression in that record is the percentile

Else

trueIndex = (PV / 100 * numberOfRecords) + 0.5

lowIndex = Int(trueIndex)

'lowIndex now points to the position below (or at) the correct percentile value.

rst.MoveFirst

rst.Move (lowIndex - 1)

DPercentile = rst(expr)

'if the number of records is odd, we are done

If lowIndex <> trueIndex Then

'the percentile value doesn't fall exactly on a record

rst.MoveNext

DPercentile = DPercentile + (trueIndex - lowIndex) * (rst(expr) - DPercentile)

'Do the linear interpolation between the two values.

End If

End If

rst.Close

Set rst = Nothing

dbs.Close

Set dbs = Nothing

'Cleanup everything before leaving the function

End Function

Sorry about the delay in responding. Vacation, then computer problems, you know the drill.

1) Thanks for pointing that out. It is now fixed.

2) I am not disagreeing with you. But as long as at least one case exists where the two queries pull up different records, I need to take that possibility into account. I don't want to write a query that will work 95 or 99% of the time, and give the wrong results the rest of the time. The way things are written now, it will give you the correct answers 100% of the time.

3) Thank you for the code.

Mr/Ms Blogannath, is it possible to combine partition function, with your median function?

Thx

I am not sure I understand. How do you want to combine the two? What is the input and what is the output you need?

I have a data set with 600k records. My groupField has range 0 to 50 in steps by 0.1 (500 values in total). I have finished analysis using partition function and average on myField like this:

SELECT Partition([groupField],0,50,1) AS Range, Avg(myTable.myField) AS AvgOfmyField

FROM myTable GROUP BY Partition([groupField],0,50,1);

Output was 50 rows with two fields, Range and AvgOfmyField, like this: 0:0 nnn, 1:1 nnn, itd.

Range 0:0 covers range from 0.0 to 0.5, Range 1:1 covers range from 0.6 to 1.4, 2:2 range 1.5 to 2.5, 3:3 range 2.6 to 3.4 itd. A little bit strange and uneven ranges but let say ok.

Similar result I have with pivot table with groupField in RawArea and myField in DataArea with Average. Here I have also rows 0 to 50 automatically grouped, but range for row 0 is from 0.0 to 0.9, row 1 range is from 1.0 to 1.9 itd.

These two results differ because of different ranges of bins, but both would be acceptable for me.

My main problem is how to calculate median (and percentile) instead of average on the same distribution (histogram).

I can apply your solution directly grouping on my groupField but this will produce 500 bins which is too fine resolution. Some of bins at the end of the scale will hold too few samples to be statistically confident, so I prefer to stick with grouping by 1.

I’m thinking of adding new calculated field in myTable by which I can group.

Databases are not my primary field, but problems like this I often encounter so I’m looking for simplest solution. I believe you will have some advice.

Thank you for your detailed explanation of the problem you are facing. I can think of a few solutions: 1. Have you looked at the article on histograms (http://blogannath.blogspot.com/2009/12/microsoft-access-tips-tricks-histograms.html)? You may be able to replace the count() in that post with a subquery that calculates medians. (2) You can create a calculated field in the table to group by and then use the concepts in this post to calculate medians. (3) You can use your query above as a subquery (it already has a calculated field called RANGE) in a median query and group by the RANGE field in the outer query or (4) you can create a temporary table using the results of your histogram query, and then calculate medians out of this temporary table.

I like all of the solutions better than (2) because including calculated fields in a database is very error-prone (the calculations have to be redone as soon as any data in the table changes, none of the calculations update automatically, etc.). But it may be acceptable if you are going to use the table once and then you have a new data feed the next time you need to do this or something like that.

Post a Comment