Microsoft Access (or any other relational database product that supports SQL) can be used to produce a table of distinct values in a given column along with the probability density of each of these values using SQL.

If you are interested, you can find my earlier lessons on finding the median, the mode, the geometric and harmonic means, ranking every row in a query, selecting random rows out of a table, calculating running sums and averages, calculating running differences, and creating histograms out of given data.

Technically speaking, since all data in a database consists of discrete values, the measures we are calculating in this post should correctly be referred to as Probability Masses. A Probability Mass Function (pmf) is a function that gives the probability that a discrete random variable is exactly equal to some value. If you actually consider the data in your database to be samples from a truly continuous distribution and want a better measure of the probability of small subranges of it, you may be interested in creating a histogram out of the data for further analysis.

In this post we will assume that we have a table myTable that contains a field myField with several different values. The probability mass of any given value in myField is the number of times that given value occurs in myField divided by the total number of rows in myTable. The sum of all probability masses will always equal one.

This is a pretty basic definition, and as such, it is a good place to begin. Later we will see how to deal with fields that have NULL values that one may want to exclude from the calculations and/or output.

Given the definition above, it is easy to see that the SQL below will produce a table of probability masses:

select myField, count(*)/(select count(*) from myTable) from myTableThe sub-query in the above SQL provides us the denominator of the probability mass by providing us the count of the total number of rows in myTable. The count(*) in the outer query counts the number of occurrences of a given value of myField in the column. The group by clause produces one row of output for each distinct value of myField, and the order by clause arranges these rows of output in ascending order of value of myField.

group by myField order by myField

The order by clause can be modified to arrange the rows in the output by any other measure, such as the probability mass of each value of myField also.

What happens when there are NULL values in myField? You will see that your output will consist of one row for the NULL values along with a probability mass for it (it will be the top row if your order by clause is the same as the one used in the SQL provided here). If you do not want that row in your output, you have to explicitly exclude that row from the output in your SQL as below:

select myField, count(*)/(select count(*) from myTable) from myTableThe query above uses a where clause to filter out NULL values in myField. Now, in the output, you will notice that there is no row with a NULL value of myField. However, the probability masses of the values that are included in the output may not add up to one. The fraction that is missing (if any) is the fraction of NULL values that we have excluded from the output.

where myField is not NULL

group by myField order by myField

In fact, one can modify the where clause of the above SQL to exclude not just NULL's, but any other values or ranges we want from the output (or, we can explicitly list the values or ranges we want to include in the output too). Moreover, the where clause can be based on any columns in the table, not just myField. However, since the sub-query still includes these values in counting up the rows in the table, our denominator never changes, thus resulting in the probability masses not adding up to one.

To get the probability masses to add up to one, we need to adjust the denominator of our calculations to only include rows that are included in the outer query. This may be valid in some contexts (for instance, when we eliminate NULL values in myField), but may not be valid in some other contexts. So, one has to use caution and statistical judgment in determining whether the probability masses that are being calculated should actually add up to one or not.

To adjust the denominator so that it also includes only those rows that are considered by our outer query, there are two options. The first and easiest option is to include the outer query's where clause verbatim in the subquery. Thus our SQL could look something like this:

select myField, count(*)/(select count(*) from myTable where C) from myTableThe condition in the outer query is denoted here by C, and it is used again in the subquery so that the number of rows in the outer query (before the grouping occurs) is the same as the number of rows in the subquery. This will make the probability masses add up to one.

where C

group by myField order by myField

If you are only eliminating NULL values from myField in the calculations, there is a much more elegant way of removing the rows where myField is NULL from the count in the subquery:

select myField, count(*)/(select count(myField) from myTable) from myTableThe subquery in this case takes advantage of the fact that count() with the name of a field provided only produces the count of rows with non-NULL values in that field. Thus the use of count(myField) automatically eliminates the rows in which myField is NULL, which is the same restriction as that imposed by the where clause of the outer query!

where myField is not NULL

group by myField order by myField

Hope this post has been helpful in solving any problems you might have had with calculating probability masses out of your data in Access. If you have any problems or concerns with the queries in this lesson, please feel free to let me know by posting a comment. If you have other questions on Access that you would like me to address in future lessons, please feel free to let me know through your comments too. Good luck!

## No comments:

Post a Comment