Note that the term percentile is subject to a lot of confusion. In this lesson, I deal with percentile scores, which are the percentiles associated with each value of a data item. There is also the concept of percentile value, which I am going to define as the value that has a certain percentile score. Thus, for instance, the median of a dataset (that is the 50th percentile value) has a percentile score of 50. In this lesson, I try to be consistent by using percentile score everywhere instead of the vaguer "percentile", but if you do find a standalone "percentile" somewhere down below, it is to be interpreted as a percentile score, not a percentile value.

Microsoft Access (or any other relational database product that supports SQL) can be used to produce a table of distinct values in a given column along with the percentile scores of each of these values using SQL.

If you are interested, you can find my earlier lessons on finding the median, the mode, the geometric and harmonic means, ranking every row in a query, selecting random rows out of a table, calculating running sums and averages, calculating running differences, creating histograms, calculating probability masses out of given data, and calculating cumulative distributions out of given data.

In some schools of thought, percentile scores are simply cumulative distributions multiplied by 100. That is to say, the percentile score is the simply the probability of a random value from the distribution being less than or equal to the given value expressed as a percentage. However, percentile scores have some other definitions also, hence my decision to devote a whole post to it, rather than mentioning it as a post-script in the CDF post. As such, this first definition makes the percentile scores equivalent to cumulative percentages.

In this post we will assume that we have a table myTable that contains a field myField with several different values.

We will start with a pretty basic definition of percentile scores as above. By this definition, the percentile score of the maximum value of myField will be exactly 100 and the percentile score of the minimum value of myField will be slightly above zero (depending on how many rows in myTable have values of myField equal to the minimum and how many total rows myTable has). Later we will see how to deal with a different definition, and also with fields that have NULL values that one may want to exclude from the calculations and/or output.

Given the definition above, it is easy to see that the SQL below will produce a table of percentile scores:

select T.myField, (select count(*) from myTable where myTable.myField <= T.myField)*100.0/There are two subqueries in the query above. The first subquery provides the numerator for the calculation of the percentile score by finding the number of rows in myTable with the value of myField less than or equal to the value of myField in the outer query and multiplying this by 100.0.

(select count(*) from myTable) from myTable as T group by myField order by myField

The second sub-query in the above SQL provides us the denominator of the percentile score by providing us the count of the total number of rows in myTable. The group by clause produces one row of output for each distinct value of myField, and the order by clause arranges these rows of output in ascending order of value of myField.

The order by clause can be modified to arrange the rows in the output by any other measure, and a table of percentile scores can be arranged in any order though it may make more intuitive sense for it to be arranged in increasing order of the value of myField.

You will notice that the SQL above, except for the multiplication by 100.0, is identical to the SQL used to calculate cumulative distributions. That is because the definition of percentile scores we took up first was quite identical to the definition of cumulative distribution.

In the second school of thought, the percentile score of a value is defined as 1 - the probability of a random value being greater than or equal to the given value of myField, multiplied by 100. By this definition, the percentile score of the minimum value of myField in myTable will be exactly 0, and the percentile score of the maximum value of myField will be slightly below 100 (depending on the total number of values in myTable and the number of rows in which myField has the maximum value).

The SQL below can be used to get percentile scores that are derived using the definition above:

select T.myField, ((select count(*) from myTable) -There are now three subqueries in this query. The first and third subqueries are identical and simply give us the total number of rows in myTable. The second subquery calculates the number of rows in myTable in which the value of myField is greater than or equal to the value of myField in the outer query. We subtract this value from the total number of rows in myTable, and then multiply the result by 100.0 to get the numerator for our percentile score.

(select count(*) from myTable where myTable.myField >= T.myField))*100.0/

(select count(*) from myTable) from myTable as T group by myField order by myField

What happens when there are NULL values in myField? The problems are similar to those we encountered during the calculation of cumulative distributions, but in different ways. When you use the first definition of percentile scores, the NULL values all get zero percentile scores and the maximum value does not get a percentile score of 100. When you use the second definition of percentile scores, the NULL value of myField gets a percentile score of 100. This is obviously because no value of myField (not even the NULL values) are greater than or equal to the NULL value of myField in the outer query.

And just as in the case of cumulative distributions, there are two ways to deal with this problem. One would be to remove NULL values from consideration entirely. Thus, we would neither calculate a percentile score for NULL values of myField nor would we include NULL values of myField in either the numerator or denominator of the percentile score calculations.

The SQL below shows how to do this for the first definition of percentiles:

select T.myField, (select count(myField) from myTable where myTable.myField <= T.myField)*100.0/The SQL below shows how to do this for the second definition of percentiles:

(select count(myField) from myTable) from myTable as T where myField is not NULL

group by myField order by myField

select T.myField, ((select count(myField) from myTable) -Please refer to the post on cumulative distributions for the explanation of why the SQL statements above work.

(select count(myField) from myTable where myTable.myField >= T.myField))*100.0/

(select count(myField) from myTable) from myTable as T where myField is not NULL

group by myField order by myField

The second way to deal with this problem is to assume that the NULL values are the minimum value myField can take (sort of like negative infinity), and use this assumption to include those values in the percentile score results.

This means that the NULL values of myField are included in the calculation of the numerator for all values of myField under the first definition (because they are less than or equal to all other values of myField). The SQL below illustrates how to implement this second method by including all NULL values of myField in the first subquery:

select T.myField, (select count(*) from myTable where myTable.myField <= T.myFieldFor the second definition of percentile scores, it is obvious that the NULL values of myField should get a percentile score of zero (all values of myField are greater than or equal to the NULL value). Moreover, the denominator of the percentile calculation must include all the NULL values while the numerator should never include NULL values (the NULL values are not greater than or equal to any other value of myField). This second condition is already satisfied since count(*) will include all the NULL values in the denominator, and will include none of them in the numerator (none of them are greater than or equal to any value of myField).

or myTable.myField is NULL)*100.0/

(select count(*) from myTable) from myTable as T group by myField order by myField

The first condition is a little harder to enforce directly. You can use nz() to convert the NULL values to a very low value in the query and enforce the first condition that way. But it will be easier to consider the first condition to be a special characteristic of the definition, and enforce it using iif(). The SQL below uses iif() to implement the above rules to give us the percentile scores according to this second definition of percentile scores. Refer to the Access online help files on how to use iif() and to understand how it has been used in this context.

select T.myField,The iif() above assigns a value of 0 to the percentile when the value of myField is NULL. It calculates the value of the percentile using the second definition for all other values of myField.

iif(T.myField is NULL, 0, ((select count(myField) from myTable) -

(select count(myField) from myTable where myTable.myField >= T.myField))*100.0/

(select count(myField) from myTable)) from myTable as T where myField is not NULL

group by myField order by myField

Hope this post has been helpful in solving any problems you might have had with calculating percentile scores out of your data in Access. If you have any problems or concerns with the queries in this lesson, please feel free to let me know by posting a comment. If you have other questions on Access that you would like me to address in future lessons, please feel free to let me know through your comments too. Good luck!

## No comments:

Post a Comment