On the other hand, you may be given a percentile value (which can only fall between 0 and 100), and asked to find what instance from a large sample has that percentile score. This instance can have any value between the minimum and maximum of the given sample. The median is an example of a percentile value. Finding the median of a large sample involves finding that value in the sample that has a percentile score of 50. In this case 50 is the percentile score. The actual value of the median is the percentile value corresponding to the percentile score of 50.
If you are interested, you can find my earlier posts on finding the median, the mode, the geometric and harmonic means, ranking every row in a query, selecting random rows out of a table, calculating running sums and averages, calculating running differences, creating histograms, calculating probability masses out of given data, calculating cumulative distributions out of given data, and finding percentile scores.
In this lesson, we will be dealing with finding instances out of a sample, given the percentile value we want to find. In that sense, this post is going to be very similar to the post on finding medians. It also means that the answers in this post are all supposed to be single values, not tables of values. You are given a table, myTable, with a field, myField, that contains several instances. You are then asked to find the instance with a percentile value of x, where x is between 0 and 100.
We have already seen how to find the instance with a percentile value of 50. That is what finding the median was all about. We can easily see that it is possible to modify the SQL for finding the median to find other percentile values also. But even before we go there, consider the following basics:
The 0th percentile value of a dataset is always the minimum value in the dataset. Thus, the 0th percentile value can be found simply with the SQL below:
SELECT MIN(myField)FROM myTableSimilarly, the 100th percentile value of a dataset is always the maximum value in the dataset. Thus, the 100th percentile value can be found simply with the SQL below:
SELECT MAX(myField)FROM myTableTo find other percentile values, we can modify the SQL for median. For instance, to find the 10th percentile value, we can use the SQL below:
Just as in the case of medians, different schools of thought define percentile values in different ways. So, in some other schools of thought, the SQL below will give us the 10th percentile value:
FROM (SELECT TOP 10 PERCENT myField from myTable order by myField)
Note that the TOP modifier is used with 100 - the percentile value we are interested in calculating, when we use this version of the SQL.
FROM (SELECT TOP 90 PERCENT myField from myTable order by myField desc)
When the true 10th percentile value is an actual instance in the dataset, then both these queries will in fact give you that exact instance. However, if the true 10th percentile value falls between two values in the dataset, the two queries above will give you the two values which have to be combined in a particular ratio to get the true 10th percentile value.
In the case of medians, when the two queries for median give different answers, the true median is easily found by combining the two values in 50-50 ratio. In other words, we took 50% of one value and combined it with 50% of the other value. This is the same as taking the average of the two values.
In general, if the xth percentile value is calculated as two different numbers by the two queries above, then the true xth percentile value is (100 - x)% of the value from the first query combined with x% of the value from the second query. We were able to do this combination using the SQL aggregate function AVG() during the calculation of medians. No such SQL aggregate function exists to combine arbitrary percentages of two different values. So, we have to code it up ourselves.
One way to do this would be to code up the two queries above as subqueries of a larger query that then combines the results from these queries in the appropriate proportions to produce the final results. Such a query for the 10th percentile value might look like this:
Note the symmetry of multiplying 0.90 by the SQL that uses TOP 10 percent and multiplying 0.10 by the SQL that uses TOP 90 percent. That will make the form of the SQL easy to remember.
select 0.90*(SELECT MAX(myField)
FROM (SELECT TOP 10 PERCENT myField from myTable order by myField)) +
FROM (SELECT TOP 90 PERCENT myField from myTable order by myField desc))
Just as in the case of medians, one can write VBA code to calculate percentile values. The set of parameters passed to such a function can be modified to include the percentile value we are interested in finding. I will leave the actual writing of such a function to my readers though!
Hope this post has been helpful in solving any problems you might have had with calculating percentile values out of your data in Access. If you have any problems or concerns with the queries in this lesson, please feel free to let me know by posting a comment. If you have other questions on Access that you would like me to address in future lessons, please feel free to let me know through your comments too. Good luck!