Search The Web

Today's Headlines

Monday, March 22, 2010

Microsoft Access Tips & Tricks: Statistics With Grouping

My post on finding percentile values using Microsoft Access received a comment from an interested reader who had an interesting twist on the idea. Instead of finding percentile values for all the data in the table, the commenter wanted to find out percentile values for individual groups in the data. After some experimentation, I found a way to calculate what he/she wanted. But it led me to think about how one can calculate other aggregate statistics of one field like medians, modes, etc. while grouping by other fields in the table. This post is a result of that research.

If you are interested, you can find my earlier posts on finding the median, the mode, the geometric and harmonic means, ranking every row in a query, selecting random rows out of a table, calculating running sums and averages, calculating running differences, creating histograms, calculating probability masses out of given data, calculating cumulative distributions out of given data, finding percentile scores, percentile values, calculating distinct counts, full outer joins, parameter queries, crosstab queries, working with system objects, listing table fields, and finding unmatched rows.

Obviously, finding aggregate statistics like average, minimum, maximum, etc. while grouping by other fields is quite trivial in Access. These aggregate statistics have built-in aggregate functions, making their computation simply a matter of using the correct aggregate function in the query. Thus, to find the average of myField while grouping by groupField would involve a query like the one below:

select groupField, avg(myField) from myTable
group by groupField
But what happens when you want to find the median, mode or different percentile values? These don't have built-in aggregate functions, so I devoted entire posts to tricking Access to cough up these statistics using SQL constructs and tricks. It turns out that grouping while finding these kinds of aggregate statistics can be tricky at first. Once you figure out the limitations of Microsoft Access SQL, and learn how to live within them, it is not too bad.

I have provided some solutions to calculating aggregates while grouping below. I have kept these individual sections short so that I can accommodate as many of them as possible in this post. Given the links to the posts that deal with how to calculate these statistics without grouping, you should not have any problem seeing how the SQL there has been extended to accommodate grouping. Still, if you have questions or concerns, please don't hesitate to let me know with your comments.

Also note that even though I have grouped by only one field in all the queries below, you can actually group by as many fields as you want. You just have to add the additional fields to the main query in the GROUP BY clause, and add additional conditions to the subquery in the WHERE clause. In all the queries below, we are calculating aggregates on myField while grouping on groupField. Both of them are fields in myTable.

Median:

SELECT T.groupField, (select max(myField) from myTable where myTable.myField in
(select top 50 percent myField from myTable
where myTable.groupField = T.groupField order by myField)) AS Median
FROM myTable AS T group by T.groupField
or

select T.groupField, 0.50*(select max(myField) from myTable where myTable.myField in
(select top 50 percent myField from myTable
where myTable.groupField = T.groupField order by myField)) +
0.50*(select min(myField) from myTable where myTable.myField in
(select top 50 percent myField from myTable
where myTable.groupField = T.groupField order by myField desc)) AS Median
from myTable AS T group by T.groupField

Mode:

SELECT groupField, (select max(myField) from myTable where myField in
(SELECT top 1 myField
FROM myTable where myTable.groupField = T.groupField
GROUP BY myField
ORDER BY count(*) DESC)) as Mode
from myTable as T group by T.groupField

Geometric Mean:

SELECT groupField, (select exp(avg(log(myField))) from myTable where myField in
(SELECT myField
FROM myTable where myTable.groupField = T.groupField)) as GeomMean
from myTable as T group by T.groupField

Harmonic Mean:

SELECT groupField, (select count(*)/sum(1.0/myField) from myTable where myField in
(SELECT myField
FROM myTable where myTable.groupField = T.groupField)) as GeomMean
from myTable as T group by T.groupField

Percentile Values:

Medians are just a special case of percentile values, the median being the 50th percentile value. Thus, the SQL for a percentile value while grouping is almost identical to that for the median. In the SQL below, PV represents the percentile value we are trying to find.

SELECT T.groupField, (select max(myField) from myTable where myTable.myField in
(select top PV percent myField from myTable
where myTable.groupField = T.groupField order by myField)) AS PVPercentile
FROM myTable AS T group by T.groupField
or

select T.groupField, (1-PV/100)*(select max(myField) from myTable where myTable.myField in
(select top PV percent myField from myTable
where myTable.groupField = T.groupField order by myField)) +
(PV/100)*(select min(myField) from myTable where myTable.myField in
(select top PV percent myField from myTable
where myTable.groupField = T.groupField order by myField desc)) AS PVPercentile
from myTable AS T group by T.groupField
Thus, the 25th percentile would be found as below:

SELECT T.groupField, (select max(myField) from myTable where myTable.myField in
(select top 25 percent myField from myTable
where myTable.groupField = T.groupField order by myField)) AS PVPercentile
FROM myTable AS T group by T.groupField
or

select T.groupField, 0.75*(select max(myField) from myTable where myTable.myField in
(select top 25 percent myField from myTable
where myTable.groupField = T.groupField order by myField)) +
0.25*(select min(myField) from myTable where myTable.myField in
(select top 25 percent myField from myTable
where myTable.groupField = T.groupField order by myField desc)) AS 25Percentile
from myTable AS T group by T.groupField
Notice the pattern in all the queries above: We select groupField first in the main query so that we know which group the aggregate statistic belongs to. The main query also sets an alias for myTable that can be used in the subquery. The main query also has a GROUP BY clause to group by groupField.

The second field selected in the main query is a calculated field. The calculated field is our aggregate statistic for myField. We use a correlated subquery to perform the calculation. The correlated subquery uses another subquery with an IN clause to calculate the value. The query inside the IN clause uses the value of groupField from the main query to subdivide the table into multiple temporary tables that contain only rows that have the same value of groupField as in the outer query.

The aggregate statistic we need is then calculated inside this subquery. Since the subquery has segregated the table to include only rows with the value of groupField equal to the value of groupField in the main query, we end up calculating the aggregate statistic for that group of rows.

Using this pattern, it should now be possible for you to calculate other aggregate statistics while grouping if you have the need to. If there are any that you have problems with, I would love to hear about it so that I can take it on as a challenge while helping you out at the same time. Good luck!

Hope this post has been helpful in solving any problems you might have had with aggregating with grouping in Access. If you have any problems or concerns with the SQL in this lesson, please feel free to let me know by posting a comment. If you have other questions on Access that you would like me to address in future lessons, please feel free to let me know through your comments too. Good luck!

2 comments:

James said...

Brilliant. I'm a relative novice to SQL, and I've shown your work to a few folks at my job that know more. They say the work is thoughtful and elegant. Thanks for the help. My one question: does T in the above represent table name? You refer to mytable explicitly in other places....

Blogannath said...

Thank you for your kind compliments.

In the above queries, T is an alias for myTable in the outer query. You will find the line "from myTable as T" at the end of each query. That is where T is defined. Essentially, for the grouping, we pick each grouped value from the table in the outer query, and then calculate statistics for it in the inner query using a WHERE clause to limit values chosen by the inner query to be corresponding to the grouped value in the outer query. And you need an alias for myTable in the inner query to do this selection. Hence the definition of T in the outer query, and then its use in the inner queries. Hope that makes sense.

Visitors Country Map

Free counters!

Blog Explosion - Drive More Traffic To Your Blog!

Content From TheFreeDictionary.com

In the News

Article of the Day

This Day in History

Today's Birthday

Quote of the Day

Word of the Day

Match Up
Match each word in the left column with its synonym on the right. When finished, click Answer to see the results. Good luck!

 

Hangman

Spelling Bee
difficulty level:
score: -
please wait...
 
spell the word:

Search The Web