Search The Web

Today's Headlines

Monday, December 21, 2009

Microsoft Access Tips & Tricks: Histograms

In this lesson, we will deal with the creation of histograms from a large quantity of data. If you are interested, you can find my earlier lessons on finding the median, the mode, the geometric and harmonic means, ranking every row in a query, selecting random rows out of a table, calculating running sums and averages, and calculating running differences.

The final result of a query that produces a histogram would be a set of ranges of values and the number of observations of a given field that fall in each range. If your data consists entirely of whole numbers, and you don't need histogram buckets of different sizes, Microsoft Access actually provides a built in function that can be very useful to create histograms very easily.

That function is the partition() function. You can read about partition() in the Access online help files. As you can see, you just pass it a number and tell the partition function where your buckets start, where they end and how large each bucket is to be. Access takes care of the rest, and slots your number in the appropriate bucket.

To use the function in a query that produces the required histogram, you can write SQL like the below:

select partition(myField, Start, Stop, Interval) as Bucket
count(myField) as Frequency from myTable
group by partition(myField, Start, Stop, Interval)
order by 1
Here, we are counting the number of elements that fall into each bucket by using the group by clause on the buckets, and the aggregate function count(), on the field we want to bucket. We also print out the bucket boundaries for reference, and order the entire query by the bucket boundaries so that we have an output that looks like a normal histogram (though this will be one where the buckets are arranged up and down rather than across).

Some things to keep in mind here:
  • You have to provide appropriate Start, Stop and Interval values for use with the partition function
  • If your data falls below the Start Value, as mentioned in the help files, partition will create a bucket called " : Start - 1" and put the count of such data in that bucket
  • Similarly, if your data falls above the Stop Value, partition will create a bucket called "Stop + 1: " and put the count of such data in that bucket
  • Make sure the arguments to the two partition() functions are exactly the same. Otherwise, the results will obviously be wrong.
Some other things that will become apparent after you start using the SQL above:
  • The result will only include only buckets into which at least one value of myField falls. Thus, the list of buckets will include gaps unless the data is well-distributed across all the buckets
  • The bucket values will always start from the Start value provided and end at the Stop value provided (apart from the two catch-all buckets that catch data values above and below the range provided), and this may result in the last bucket being smaller than the other buckets
  • If your data contains any non-integer values, they will be truncated, and the resulting integer value will be bucketed
  • Each bucket will include all data values that are greater than or equal to the lower boundary value of the bucket, and that are less than or equal to the upper boundary value of the bucket.
These are some of the reasons why the partition() function may not be appropriate for your data. Your data may consist of non-integer data that you do not want truncated to fit in buckets. You may want buckets that are not equal in size across the entire range of values. If you have needs that can not be satisfied by use of the partition() function, it is time to devise your own solution. You may also need something other than the partition() function if you use a version of Access like Access 97, which does not have a partition() function built-in!

To create histograms under such circumstances, you have two choices. One is quick and easy to implement, but gets cumbersome after a while. The other solution involves creating an extra table in your database, but over the long run it is a lot more flexible and maintainable. We will explore both these solutions below.

The first solution is implemented using SQL as below:
select sum(iif(myField < B1max, 1,0)) as B1,
sum(iif(myField < B2max and myField >= B1max, 1,0)) as B2,
sum(iif(myField < B3max and myField >= B2max, 1,0)) as B3
from myTable
Here, we are using iif() to generate a 1 or 0 for each value of myField depending on whether it falls into a particular bucket or not. You can read about the iif() function and how to use it in the Access online help files here. Iif() evaluates its first argument and converts it into a boolean value. If the boolean value is TRUE, it returns the second argument, otherwise, it returns the third argument.

In this case, we return a 1 whenever a value in myField falls within a bucket. We return a zero for that value of myField from all other buckets. These 1's and 0's are then summed up using the aggregate function sum() to give us the bucket populations.

The bucket boundaries are defined by B1max, B2max, etc. and each value of myField falls into exactly one bucket if the bucket boundaries are set up correctly (and the SQL is coded correctly, obviously). You have to make sure that there are no gaps between the buckets and that bucket boundaries increase monotonically.

The simple rule being used in the SQL above is that if the value of myField is less than the upper limit of a bucket, and is greater than or equal to the upper limit of the bucket below, then it falls into this bucket. As you can see, it is very easy to change the conditions (make sure you change them in such a way that each value of myField can fall unambiguously into one and only one bucket) and also alter the bucket boundaries (and with the boundaries, the sizes change too). You can also add any number of buckets by expanding the SQL above and adding more lines to it (even though only 3 buckets are being created in the example above). Thus, the SQL above is quite flexible.

The following points have to be noted about the results of the SQL above:
  • The query produces one row of results that just contains the population of each bucket. That way, it is similar to a traditional histogram that goes from left to right. However, the query results do not reflect the bucket boundaries (unless you name the calculated fields appropriately instead of just as B1, B2, etc.), so make sure you remember them and document them somewhere
  • All the buckets are represented in the output, even if the population of a bucket is zero
  • If there are gaps in the bucket definitions and data falls into those gaps, the population of such data is not reflected in the query results
  • If there are overlaps between the bucket definitions and data falls into these overlapping ranges, they will be counted in all buckets that include that overlapping range
  • All data values in myField that are below the upper limit of the first bucket are counted in that bucket. As such, the range of the first bucket is negative infinity to B1max. You can set up a lower limit for the first bucket also by changing the SQL above, but you have to note that all data that falls below such a lower limit will not be counted in any bucket
  • All data values in myField that are above the upper limit of the last bucket are not counted in any bucket. You can create a catch-all final bucket with a very large upper limit to catch such very high values
  • You can leave the bucket boundaries as parameters instead of filling values while creating the query. This will create a parameter query which will accept user input of the bucket boundaries at run-time to create histograms that can be changed from run to run. However, if you have named the calculated fields after the bucket boundaries (instead of B1, B2, etc.), they will not change as the boundaries change, so be careful!
  • However, the number of buckets can not be changed without editing the actual SQL query. The user can provide bucket boundaries at run-time without knowing anything about SQL, but they can not reduce or increase the number of buckets without changing the query itself
The second way of creating histograms using SQL is to create a table of bucket boundaries, and then use a join between that table and your actual data table to create the histogram. The SQL for such an implementation is given below:

Select Bucket, Minimum, Maximum, count(*) as NumberCount from
(Select Buckets.Bucket, Buckets.Minimum, Buckets.Maximum, myTable.myField from
myTable inner join Buckets on
(myTable.myField >= Buckets.Minimum and myTable.myField < Buckets.Maximum))
group by Bucket, Minimum, Maximum
In this case, bucket boundaries are stored in a table named "Buckets". It consists of a field named Bucket (you can use this to store a bucket number or a description of the bucket depending on whether you create it as a numerical or text field), and two numerical fields named Minimum and Maximum to store the bucket boundaries.

The SQL above joins each value of myField from myTable with the appropriate bucket in Buckets by checking whether the value of myField is greater than or equal to the Minimum specified for that bucket and less than the Maximum specified for that bucket. It then groups all the results by Bucket, Minimum and Maximum so that each bucket in Buckets gets its own row. The aggregate function count(*) is used to produce the population of each bucket.

Note the following details about the SQL as implemented above:
  • This produces one row of results for each bucket that includes the bucket number or name (the Bucket field in Buckets), the minimum and maximum for that bucket, and the population of that bucket. So, this is a histogram that goes up and down rather than left to right
  • A value falls into a bucket only if it is greater than or equal to the minimum and is strictly less than the maximum. If you need a different rule to slot your values, you can modify the SQL above as appropriate
  • If the buckets have gaps between them, and data values fall into those gaps, they will not be counted in any of the buckets
  • Conversely, if buckets have overlapping ranges and data values fall into these overlapping ranges, they will be counted in all the buckets that have the overlap
  • Any value that falls below the lowest value of Minimum or highest value of Maximum will not be counted in any of the buckets. To avoid this you can create catch-all buckets at the top and bottom of the Buckets table
  • Only buckets that have any values in them will be reported in the query results. This is because of the inner join being used in the query. If you want the other buckets to be part of the output even though they have a population of zero, change the join type to a right join
I consider this last implementation the most flexible because the actual query does not have to be edited as the number of buckets or their boundaries change. If the user is provided a form through which they can edit the Buckets table, then this query can be used to produce customized histograms with any number of buckets the user desires. And the buckets can have any boundaries the user desires.

Hope this post has been helpful in solving any problems you might have had with constructing histograms out of your data in Access. If you have any problems or concerns with the queries in this lesson, please feel free to let me know by posting a comment. If you have other questions on Access that you would like me to address in future lessons, please feel free to let me know through your comments too. Good luck!

1 comment:

Anonymous said...

This was a tremendous support. I found many descriptions using both, the partition function and fixed interval queries. None of both was sufficient. With the SQL code provided I found the solution for a problem I have been struggling with for days...

Thanks a lot,
Karsten

Visitors Country Map

Free counters!

Content From TheFreeDictionary.com

In the News

Article of the Day

This Day in History

Today's Birthday

Quote of the Day

Word of the Day

Match Up
Match each word in the left column with its synonym on the right. When finished, click Answer to see the results. Good luck!

 

Hangman

Spelling Bee
difficulty level:
score: -
please wait...
 
spell the word:

Search The Web