Search The Web

Today's Headlines

Monday, January 25, 2010

Microsoft Access Tips & Tricks: Percentile Values

As explained in the previous post on Access, the distinction between percentile scores and percentile values is important to understand. Given a particular instance from a large sample contained in a column of a table, we can find out its percentile score, which denotes how close to the maximum or minimum value the particular instance falls. Percentile scores always fall between 0 and 100.

On the other hand, you may be given a percentile value (which can only fall between 0 and 100), and asked to find what instance from a large sample has that percentile score. This instance can have any value between the minimum and maximum of the given sample. The median is an example of a percentile value. Finding the median of a large sample involves finding that value in the sample that has a percentile score of 50. In this case 50 is the percentile score. The actual value of the median is the percentile value corresponding to the percentile score of 50.

If you are interested, you can find my earlier posts on finding the median, the mode, the geometric and harmonic means, ranking every row in a query, selecting random rows out of a table, calculating running sums and averages, calculating running differences, creating histograms, calculating probability masses out of given data, calculating cumulative distributions out of given data, and finding percentile scores.

In this lesson, we will be dealing with finding instances out of a sample, given the percentile value we want to find. In that sense, this post is going to be very similar to the post on finding medians. It also means that the answers in this post are all supposed to be single values, not tables of values. You are given a table, myTable, with a field, myField, that contains several instances. You are then asked to find the instance with a percentile value of x, where x is between 0 and 100.

We have already seen how to find the instance with a percentile value of 50. That is what finding the median was all about. We can easily see that it is possible to modify the SQL for finding the median to find other percentile values also. But even before we go there, consider the following basics:

The 0th percentile value of a dataset is always the minimum value in the dataset. Thus, the 0th percentile value can be found simply with the SQL below:
SELECT MIN(myField)FROM myTable
Similarly, the 100th percentile value of a dataset is always the maximum value in the dataset. Thus, the 100th percentile value can be found simply with the SQL below:
SELECT MAX(myField)FROM myTable
To find other percentile values, we can modify the SQL for median. For instance, to find the 10th percentile value, we can use the SQL below:

SELECT MAX(myField)
FROM (SELECT TOP 10 PERCENT myField from myTable order by myField)
Just as in the case of medians, different schools of thought define percentile values in different ways. So, in some other schools of thought, the SQL below will give us the 10th percentile value:

SELECT MIN(myField)
FROM (SELECT TOP 90 PERCENT myField from myTable order by myField desc)
Note that the TOP modifier is used with 100 - the percentile value we are interested in calculating, when we use this version of the SQL.

When the true 10th percentile value is an actual instance in the dataset, then both these queries will in fact give you that exact instance. However, if the true 10th percentile value falls between two values in the dataset, the two queries above will give you the two values which have to be combined in a particular ratio to get the true 10th percentile value.

In the case of medians, when the two queries for median give different answers, the true median is easily found by combining the two values in 50-50 ratio. In other words, we took 50% of one value and combined it with 50% of the other value. This is the same as taking the average of the two values.

In general, if the xth percentile value is calculated as two different numbers by the two queries above, then the true xth percentile value is (100 - x)% of the value from the first query combined with x% of the value from the second query. We were able to do this combination using the SQL aggregate function AVG() during the calculation of medians. No such SQL aggregate function exists to combine arbitrary percentages of two different values. So, we have to code it up ourselves.

One way to do this would be to code up the two queries above as subqueries of a larger query that then combines the results from these queries in the appropriate proportions to produce the final results. Such a query for the 10th percentile value might look like this:

select 0.90*(SELECT MAX(myField)
FROM (SELECT TOP 10 PERCENT myField from myTable order by myField)) +
0.10*(SELECT MIN(myField)
FROM (SELECT TOP 90 PERCENT myField from myTable order by myField desc))
from myTable
Note the symmetry of multiplying 0.90 by the SQL that uses TOP 10 percent and multiplying 0.10 by the SQL that uses TOP 90 percent. That will make the form of the SQL easy to remember.

Just as in the case of medians, one can write VBA code to calculate percentile values. The set of parameters passed to such a function can be modified to include the percentile value we are interested in finding. I will leave the actual writing of such a function to my readers though!

Hope this post has been helpful in solving any problems you might have had with calculating percentile values out of your data in Access. If you have any problems or concerns with the queries in this lesson, please feel free to let me know by posting a comment. If you have other questions on Access that you would like me to address in future lessons, please feel free to let me know through your comments too. Good luck!

20 comments:

Anonymous said...

I have one issue to use it and I don't know who resolve.

In my case my table is not just one sample, i have a list os sample ie.

service1 - value1
service1 - value2
service1 - value3
service2 - value1
service2 - value2
servide2 - value3
service3 - value4

I supose that to use your sql I need to group by service but I didn't get the right solution then I get the avg ok, but not the percentil.

Any hep?

Blogannath said...

Hi,

Thanks for your comment. I need a little more information to help you. I am not able to understand exactly what the problem you are facing is. Do you want to find the 10th percentile value of both service as well as value or do you want the 10th percentile value of value for each value of service? If you can give me a few lines of sample data and then the result you want to see out of that sample data, I might be able to help you better. Otherwise, I am just guessing what you want, and it may be completely different from what you really want.

Anonymous said...

Many thanks for your help.

Basically:

I have a table with next columns:

Service_Name, hour_of_execution, duration_seconds

For each service (I have 20) I have different samples per hour (from 9 to 13)

Here an esxtraction of the table:

Service_Name hour_of_execution duration_seconds
KC1 9 10
KC1 10 11
KC1 11 11
KC1 12 11
KC1 13 9
KC2 9 56
KC2 10 50
KC2 11 60
KC2 12 65
KC2 13 60
KC3 9 120
KC3 10 150
KC3 11 156
KC3 12 170
KC3 13 175

I am loking for the percentil 90 of each service

KC1 11
KC2 63
KC3 173

Blogannath said...
This comment has been removed by the author.
Blogannath said...

You have to find the 90th percentile for each service_name separately (use a WHERE clause in the subquery to limit results to each service_name individually), and then use a UNION to combine the results for multiple service_names together. Something like below:

select "KC1", 0.90*(SELECT MAX(myField)
FROM (SELECT TOP 10 PERCENT myField from myTable where service_name = "KC1" order by myField)) +
0.10*(SELECT MIN(myField)
FROM (SELECT TOP 90 PERCENT myField from myTable where service_name = "KC1" order by myField desc))
from myTable
UNION
select "KC2", 0.90*(SELECT MAX(myField)
FROM (SELECT TOP 10 PERCENT myField from myTable where service_name = "KC2" order by myField)) +
0.10*(SELECT MIN(myField)
FROM (SELECT TOP 90 PERCENT myField from myTable where service_name = "KC2" order by myField desc))
from myTable
...

This is obviously not a very elegant solution, so I will spend some time working on this to see if there is a better solution. In the meantime, I wanted to provide you with one solution, even if it is not the best, so that you can continue your work instead of waiting on me with nothing to do in the meantime.

Blogannath said...

OK, this was a lot more complicated than I thought it would be. But I cracked it finally. Try this:

SELECT T.Field1, (select max(field2) from table1 where table1.field2 in (select top 90 percent field2 from table1 where table1.field1 = T.field1 order by field2)) AS Expr1
FROM table1 AS T group by T.field1

Here field1 is the category and field2 contains the service times.

You can write the other half of the query and weight them accordingly if you so please. That would give you something like this:

SELECT T.Field1, 0.10*(select max(field2) from table1 where table1.field2 in (select top 90 percent field2 from table1 where table1.field1 = T.field1 order by field2)) + 0.90*(select min(field2) from table1 where table1.field2 in (select top 10 percent field2 from table1 where table1.field1 = T.field1 order by field2 desc)) AS Expr1
FROM table1 AS T group by T.field1;

The syntax below is what I thought of first:

SELECT T.Field1, (select max(field2) from (select top 90 percent field2 from table1 where table1.field1 = T.field1 order by field2)) AS Expr1
FROM table1 AS T group by T.field1

Unfortunately this does not work because the alias of table1 as T does not carry through all the way to the inner-most query, causing Access to throw up a parameter prompt for T.field1. Most of the delay in coming up with the solution was spent in trying to reformulate the query so that there are only two levels of queries rather than 3 levels! Rewriting the percentile value query without an explicit subquery, but instead using the IN clause solves the problem.

Hope it works for you too! Good luck!!

Anonymous said...

Hello! I was very excited to see your posts on how to get percentile values and CDF in Access. However, I am trying to computer these by group, as the poster does above. I was trying to understand the examples that you provided, and I am a little unclear on one thing: what is "T"? You have SELECT T.Field1 and AS T group by T.field1. I understand what field 1 is, but not T. It is probably pretty simple, but I appreciate your help! Thank you. - Kris Robbins

Blogannath said...

T is just an alias for a table name. So, I am doing a bunch of stuff based on selecting things from table1, which I am aliasing as T with the syntax "as T" right after the "from table1" clause of the SQL.

What I doing is using table1 twice in the query, once in the outer query and once in the inner query. I am doing a bunch of calculations in the inner query using a particular value of Field1 from the outer query. To carry this particular value of Field1 from the outer query to the inner query, you need an alias for the table (which is what T is) so that you can reference this value of Field1 unambiguously within the inner query.

Anonymous said...

Hi, I thought this was exactly what I was looking for to find the 90th percentile for a number of different services to enable benchmarking but I'm really struggling with
SELECT T.field1
in the outer query which is referencing the inner query and resulting in a parameter message box asking for the value in T.field1.

Were you intending this to be a query where you would enter each service_name seperately or have I missed something?

Many thanks

Blogannath said...

Did you read my previous comment explaining what T is? T is an alias for Table1 in the outer query as evidenced by the "as T" as part of the from clause of the outer query. If you have copied the query as is, and just changed the table and field names where appropriate, it should work without giving any parameter prompts.

If you are still having problems, post the entire query you are using, and I will try to take a look.

You might also want to take a look at another of my posts in the series entitled "Statistics with Grouping". The link is: http://blogannath.blogspot.com/2010/03/microsoft-access-tips-tricks-statistics.html

Anonymous said...

Hi, yes I did read the comment.

This is the query

SELECT Test.OrgCode, (select max(TimeInAEMins) from TimeInAESummary where TimeInAESummary.TimeInAEMins in (select top 90 percent TimeInAEMins from TimeInAESummary where TimeInAESummary.OrgCode = Test.OrgCode order by TimeInAEMins)) AS Expr1
FROM TimeInAESummary AS Test9 group by Test.OrgCode;

And it is asking for a parameter value for Test.OrgCode and then takes an absolute age to run

Blogannath said...

You have aliased the table TimeInAESummary as Test9, but are trying to refer to it as just Test in your query. That is why Access is asking for a parameter value (because it has no idea what Test is). If you change your alias to Test or change all references to Test to Test9, your problem should go away.

SELECT Test.OrgCode, (select max(TimeInAEMins) from TimeInAESummary where TimeInAESummary.TimeInAEMins in (select top 90 percent TimeInAEMins from TimeInAESummary where TimeInAESummary.OrgCode = Test.OrgCode order by TimeInAEMins)) AS Expr1
FROM TimeInAESummary AS Test group by Test.OrgCode;

Anonymous said...

Sorry, that must have been a mis-key while cutting and pasting. I have checked the original query and the superfluous 9 isn't in there.

It is also taking ages to run - well over 2 hours so far for a dataset with <7000 records so I don't know what the outcome of the query is, regardless of the parameter value

Blogannath said...

OK, if your query looks like what I posted in my comment, then it should work (and it does look like it is now working without giving you a parameter prompt). I have no idea why it is taking so long. It should not take anywhere near as long. I ran the following query on one of my databases with about 3000 rows in the table "Combined", and it ran through in about 5 to 10 seconds:

SELECT Test.Mode, (select max([Total Time]) from Combined where Combined.[Total Time] in (select top 90 percent [Total Time] from Combined where Combined.Mode = Test.Mode order by Combined.[Total Time])) AS Expr1
FROM Combined AS Test group by Test.Mode;

I would say check your code thoroughly, and make sure you have coded it up correctly. Also run the subquery separately by hardcoding a value for Test.OrgCode and make sure it is producing the right results.

Ultimately, coding in SQL is the same as coding in any programming language: testing and debugging takes twice as long as the original coding, and when you have a complicated query, it is best to break it down into individual pieces and unit-test each of the individual pieces.

Anonymous said...

Hi everyone,

I am trying to get the 5th and 95th percentile values from a table. My table contains 2 fields called 'Operating Costs' and 'Total Heated Floor Area (SF)'.

The percentile values that I am trying to obtain are not of either field, but a new calculated field which is 'Operating Costs'/'Total Heated Floor Area (SF)'.

While the formula that you have posted works if I try to get the percentile value of just one of the fields in my table, it seems that when using my calculated field, I keep getting a pop up that says Enter Parameter for both my 'Operating Costs' and my 'Total Heated Floor Area (SF)' field.

I've double checked if I wrote down my field names incorrectly but I don't think thats the case.

I have no idea why my query isn't obtaining the percentile value...Please help??

FYI: my calculated field is ([Operating Costs]/[Total Heated Floor Area (SF)])

Blogannath said...

Have you checked whether you have aliased the table names correctly and referred to the aliases correctly in all parts of your query. Please read the previous comments in this article for the various pitfalls in writing queries that involve multiple levels of subqueries. If you still can't figure it out, post your entire query here and I will be happy to take a look at it.

Anonymous said...

Thanks for the response,

I've checked my table names and I am sure I wrote them down correctly in my query... but i still can't seem to be getting the value that I want.

Heres the overview of my situation:

I have a table called Raw Data and it holds fields including 'operating costs' and 'property size'

The percentile value that I want is the 95th and 5th percentile value of 'operating costs/property size'

First step I did was to create a query called 'STEP 1:FACTOR VARIABLES' that calculates 'operating costs/property size'

Second step I made another query that would use my query 'STEP 1:FACTOR VARIABLES' and tried calculating my 5th percentile with the sql code

SELECT Max([Operating Costs/Property Size]) AS [5th percentile Operating Costs/Property]FROM (SELECT TOP 95 PERCENT [Operating Costs/Property Size] FROM [STEP 1: FACTOR VARIABLES])ORDER BY Max([Operating Costs/Property Size]);

I thought that this would get the value that I need but all it does is grab the largest 95% of values and then get the max of it, which is basically just the max value itself....

Blogannath said...

The ORDER BY clause of your query seems to be off: it should be ORDER BY [Operating Costs/Property Size], without the MAX.

deg said...

Hello, I have a related issue. I am trying to develop an elegant list of percentile values (10th, 50th and 90th) for a table, but it has on additional level of grouping than the scenario posed by anonymous with the KC service areas.
I have 'name' and a 'time period' and am trying to identify the percentile values grouped by name and then grouned by time period. Could you advise?

Blogannath said...

You would have to include both levels of grouping in the outer query and include both fields in the where clause of the inner query. Perhaps something like this will work:

SELECT T.Field1, T.Field2, (select max(field3) from table1 where table1.field3 in (select top 90 percent field3 from table1 where table1.field1 = T.field1 and table1.field2 = T.field2 order by field3)) AS Expr1
FROM table1 AS T group by T.field1, T.field2

Name and time period would be your field1 and field2, and whatever you are trying to find the percentile of would be field3.

Visitors Country Map

Free counters!

Content From TheFreeDictionary.com

In the News

Article of the Day

This Day in History

Today's Birthday

Quote of the Day

Word of the Day

Match Up
Match each word in the left column with its synonym on the right. When finished, click Answer to see the results. Good luck!

 

Hangman

Spelling Bee
difficulty level:
score: -
please wait...
 
spell the word:

Search The Web