Microsoft Access (or any other relational database product that supports SQL) can be used to produce a table of distinct values in a given column along with the cumulative probabilities of each of these values using SQL.

If you are interested, you can find my earlier lessons on finding the median, the mode, the geometric and harmonic means, ranking every row in a query, selecting random rows out of a table, calculating running sums and averages, calculating running differences, creating histograms, and calculating probability masses out of given data.

Note that cumulative distributions are always monotone non-decreasing and right-continuous. Moreover, they satisfy the following limits:

In this post we will assume that we have a table myTable that contains a field myField with several different values. The cumulative distribution of any given value in myField is the number of times a value less than or equal to that given value occurs in myField divided by the total number of rows in myTable. Note that for the maximum value of myField in myTable, the cumulative distribution will be 1 since the number of times a value less than or equal to that maximum value occurs in myField is the same as the number of rows in myTable.

This is a pretty basic definition, and as such, it is a good place to begin. Later we will see how to deal with fields that have NULL values that one may want to exclude from the calculations and/or output.

Given the definition above, it is easy to see that the SQL below will produce a table of cumulative distributions:

select T.myField, (select count(*) from myTable where myTable.myField <= T.myField)/There are two subqueries in the query above. The first subquery provides the numerator for the calculation of the cumulative distribution by finding the number of rows in myTable with the value of myField less than or equal to the value of myField in the outer query.

(select count(*) from myTable) from myTable as T group by myField order by myField

The second sub-query in the above SQL provides us the denominator of the cumulative distribution by providing us the count of the total number of rows in myTable. The group by clause produces one row of output for each distinct value of myField, and the order by clause arranges these rows of output in ascending order of value of myField.

The order by clause can be modified to arrange the rows in the output by any other measure, but in general, it makes more sense to look at a cumulative distribution with the values of the random variable in ascending order.

What happens when there are NULL values in myField? The output gets pretty interesting and can become quite hard to interpret. You will see that your output will consist of one row for the NULL values along with a zero cumulative density for it! Moreover, you will see that the maximum value of myField in myTable does not have a cumulative density of 1!! Mystifying indeed . . .

Actually, there is a simple explanation for the problem. You see, NULL values in Access are very special values. When you compare a NULL value with any other numerical value by using standard mathematical comparison operators like "=", "<", "<=", ">", ">=", "<>", etc., the result is ALWAYS FALSE. A NULL value is not even equal to another NULL value.

This is the reason the cumulative distribution of the NULL values comes out as zero. Essentially, the first SQL subquery that calculates the numerator of the cumulative distribution gets a count of zero for values of myField that are equal to the NULL value of myField from the outer query. This results in the zero cumulative distribution for the NULL value of myField.

Every other value of myField has a problem with its numerator also because the NULL values in myField are never counted as being less than or equal to any given value of myField from the outer query. That results in the cumulative distributions not coming out to 1 for the maximum value of myField.

There are two ways to deal with this problem. One would be to remove NULL values from consideration entirely. Thus, we would neither calculate a cumulative distribution for NULL values of myField nor would we include NULL values of myField in either the numerator or denominator of the cumulative distribution. The SQL below shows how to do this:

select T.myField, (select count(myField) from myTable where myTable.myField <= T.myField)/The query above uses a where clause to filter out NULL values in myField. Now, in the output, you will notice that there is no row with a NULL value of myField. Moreover, we have excluded the NULL values from the subqueries by using count(myField) instead of count(*). Thus, you will see that the cumulative distribution for the maximum value of myField is 1.

(select count(myField) from myTable) from myTable as T where myField is not NULL

group by myField order by myField

The second way to deal with this problem is to assume that the NULL values are the minimum value myField can take (sort of like negative infinity), and use this assumption to include those values in the cumulative distribution results. This means that the NULL values of myField are included in the calculation of the numerator for all values of myField (because they are less than or equal to all other values of myField). The SQL below illustrates how to implement this second method by including all NULL values of myField in the first subquery:

select T.myField, (select count(*) from myTable where myTable.myField <= T.myFieldHope this post has been helpful in solving any problems you might have had with calculating cumulative distributions out of your data in Access. If you have any problems or concerns with the queries in this lesson, please feel free to let me know by posting a comment. If you have other questions on Access that you would like me to address in future lessons, please feel free to let me know through your comments too. Good luck!

or myTable.myField is NULL)/

(select count(*) from myTable) from myTable as T group by myField order by myField

## 1 comment:

Post a Comment