Search The Web

Today's Headlines

Friday, March 9, 2012

Microsoft Access Tips & Tricks: Identify Runs Of Data

In this post, I am going to develop an SQL query that will help you identify runs of data in your data tables. What do I mean by a "run of data"? Suppose you have a database in which you enter details of your child's little league soccer team. And each game has a game number, and an indication of whether your child's team won, tied or lost. A "run" in this case can be a set of consecutive wins, losses or ties.

Your data may contain multiple runs, and this SQL query will help you identify all those runs and any other details about those runs that you want. Then you can order them by their length, for instance, to identify the longest and shortest runs. Have you ever imagined how play by play commentators on TV are able to reel out statistics such as "this will be the 8th consecutive game in which player X has accomplished such and such against so and so, and his longest run of such accomplishments is 24 games", and so on? Identifying runs of data will give you the ability to pry such insights out of your data too!

If you are interested, you can find my earlier posts on finding the median, the mode, the geometric and harmonic means, ranking every row in a query, selecting random rows out of a table, calculating running sums and averages, calculating running differences, creating histograms, calculating probability masses out of given data, calculating cumulative distributions out of given data, finding percentile scores, percentile values, calculating distinct counts, full outer joins, parameter queries, crosstab queries, working with system objects, listing table fields,finding unmatched rows, calculating statistics with grouping, job-candidate matching, job-candidate matching with skill levels, great circle distances, great circle initial headings, using Excel functions in Access, using the windows file-picker, using the Access date-picker, setting tab indexes quickly and correctly, pre-filling forms based on previous entries, highlighting form controls, conditional formatting, performing date manipulations, counting letters, words, sentences and paragraphs, calculating age, propercasing text entries, flattening a table (using SQL), flattening a table (using VBA), cascading comboboxes, parsing file names, and opening files from within Access.

For the purposes of this post, I will assume that you have a table of data called MyTable, that contains weather data for your city. In particular, you have the observation date, the observed high temperature for that date, and the average high temperature for that date. Your task is to identify runs of dates that had temperatures higher than average.

To identify runs of data in VBA is relatively simple. You open a recordset with the data from your table, order it any way you want, then walk down the recordset figuring out whether each record is the beginning or end of a run that satisfies the given criteria. Then calculate the lengths of the runs using the identified beginning and end points of the runs. Doing it in SQL is more challenging, but also much more convenient when you want to do it quickly. Mucking around with VBA every time you want to change the criteria that define a run can get tedious and error-prone.

The SQL solution in this post is going to have three distinct parts: In the first part, we are going to identify the starting point of every run of data that satisfies the conditions we are looking for. In the second part, we are going to identify the ending point of every run of data. In the third part, we are going to combine the starting and ending points to define each run, so that you can identify interesting statistics such as the length of each run. Or you can order the runs to identify the longest runs, shortest runs, etc. Or you can group the statistics to find the average length of a run, median length of a run, etc.

Identifying the starting points of all runs of data: In this part, we are going to find out every row that starts a run of data that satisfies the condition we are looking for. In particular, referring back to our task for this post, we are going to identify every date on which the observed temperature is higher than the average temperature, and the observed temperature on the previous date was not higher than the average temperature.

The SQL to do this is conceptually quite simple. We self-join the table with itself and join each row with the previous row in the table. We can then use a where condition to find out which rows satisfy the criteria we are looking for while at the same time the previous row with which it is joined does not satisfy the criteria. The SQL for this is shown below:
SELECT T1.[WeatherDate] as RunStart
FROM MyTable T1 inner join MyTable T2
on (T1.[WeatherDate] = T2.[WeatherDate] +1
where
T1.[ObsTemp] > T1.[AvgTemp] AND NOT(T2.[ObsTemp] > T2.[AvgTemp])
I have used NOT for identifying dates on which the condition is not satisfied because negating an entire condition using NOT is less error-prone than using other methods, especially when the condition is complicated, and contains a lot of AND's and/or OR's. In this case, that was not necessary, but I did it to be consistent so that I will do it this way when the condition is complicated without having to remember to do it one way now and a different way then.

Now, there is a problem with the SQL above. And that problem becomes apparent when there are gaps in the data. Suppose your table had data for the 31st of March, but not the 30th of March (for some reason that is irrelevant to this post). If the condition is satisfied on the 31st of March, that row is not identified as a RunStart because that row is not joined with any other row in the table.

The same problem happens even if there is no gap in the table. It also happens when the very first row of the table satisfies the condition. That first row is not joined with any other row in the table (because there does not exist a row with the previous date), so it is not identified as the start of a run. To overcome these problems, we modify the above SQL as below:
SELECT T1.[WeatherDate] as RunStart
FROM MyTable T1 left join MyTable T2
on T1.[WeatherDate] = T2.[WeatherDate] + 1
where
T1.[ObsTemp] > T1.[AvgTemp] AND
(NOT(T2.[ObsTemp] > T2.[AvgTemp]) OR T2.[ObsTemp] is NULL OR T2.[AvgTemp] is NULL)
Now, the left join makes sure that all dates without a previous date in the table are still present in the output. The previous date data in those rows is NULL, and we have decided that NULL data cannot satisfy the condition we are looking for. So, we identify all such dates as starts of runs as long as those dates satisfy the conditions we are looking for.

Identifying the ending points of all runs of data: In this part, we are going to find out every row that ends a run of data that satisfies the condition we are looking for. In particular, referring back to our task for this post, we are going to identify every date on which the observed temperature is higher than the average temperature, and the observed temperature on the next date was not higher than the average temperature.

Given the SQL for identifying the starting points of all the runs, it must be easy to modify that to identify the ends of all the runs. The modified SQL is below:
SELECT T1.[WeatherDate] as RunEnd
FROM MyTable T1 left join MyTable T2
on T1.[WeatherDate] = T2.[WeatherDate] - 1
where
T1.[ObsTemp] > T1.[AvgTemp] AND
(NOT(T2.[ObsTemp] > T2.[AvgTemp]) OR T2.[ObsTemp] is NULL OR T2.[AvgTemp] is NULL)
Note that the join condition now says "T1.[WeatherDate] = T2.[WeatherDate] - 1". So, we are joining each row with the next row in the table, rather than the previous row. And the left join ensures that we identify ends of runs even when the next date for a particular date is not present in the table, or the last row of data in the table satisfies the condition.

Define all runs of data: Now we have come to the crucial third part of our task. We are going to use the queries above to uniquely identify each run of data and produce statistics about the run that include the starting date, the ending date and the length in days of each run. We are going to use some trickery to do this, though!

What we are going to do is use the SQL statements above as temporary tables. We are going to join the two temporary tables using the condition that the end of a run must always be greater than or equal to the start of a run (note that there can be single observation runs of data in which the start of the run is also the end of the run. That is why it is important to emphasize that the end of a run is greater than OR EQUAL TO the start of a run, not necessarily strictly greater than).

Obviously, this is going to result in every run start date being joined with every run end date that is greater than or equal to itself. To identify the true end date for a given start date, we take the minimum of all the run end dates that are joined with that run start date. So, we group by run start dates and pick the minimum of the run end dates. This grouping is the trickery that I refer to in the previous paragraph! Obviously, the minimum of the run end dates that is greater than or equal to a given run start date is the true end date for that run. Once you are convinced of that, the SQL below should not present you any problems.
select RunStart as Start, MIN(RunEnd) as End, MIN(RunEnd) - RunStart + 1 as Length from
(SELECT T1.[WeatherDate] as RunStart
FROM MyTable T1 left join MyTable T2
on T1.[WeatherDate] = T2.[WeatherDate] + 1
where
T1.[ObsTemp] > T1.[AvgTemp] AND
(NOT(T2.[ObsTemp] > T2.[AvgTemp]) OR T2.[ObsTemp] is NULL OR T2.[AvgTemp] is NULL)) as SQL1
inner join
(SELECT T1.[WeatherDate] as RunEnd
FROM MyTable T1 left join MyTable T2
on T1.[WeatherDate] = T2.[WeatherDate] - 1
where
T1.[ObsTemp] > T1.[AvgTemp] AND
(NOT(T2.[ObsTemp] > T2.[AvgTemp]) OR T2.[ObsTemp] is NULL OR T2.[AvgTemp] is NULL)) as SQL2
on SQL1.RunStart <= SQL2.RunEnd
group by SQL1.RunStart
You can now order the results from this query by any fields you want. In particular, ordering by the start dates in ascending order will give you the earliest runs first, while ordering by the start dates in descending order will give you the latest runs first. Ordering by the length in ascending or descending order will enable you to identify the shortest or longest runs of data in the table respectively.

You can add other WHERE conditions in both queries to further restrict the runs that are identified. For instance, you can add conditions to only identify runs of above-average temperatures in winter, or in a particular year, or only when the minimum temperatures are below the averages, or whatever you want. If you have the data for it, you can identify the runs in it. Now you can be a celebrity weatherman at your parties, reeling off weather statistics that others can only wonder where you are getting from!

Hope this post has been helpful in solving any problems you might have had with identifying runs of data in Microsoft Access. If you have any problems or concerns with the SQL in this post, please feel free to let me know by posting a comment. If you have other questions on Access that you would like me to address in future posts , please feel free to let me know through your comments too. Good luck!

No comments:

Visitors Country Map

Free counters!

Content From TheFreeDictionary.com

In the News

Article of the Day

This Day in History

Today's Birthday

Quote of the Day

Word of the Day

Match Up
Match each word in the left column with its synonym on the right. When finished, click Answer to see the results. Good luck!

 

Hangman

Spelling Bee
difficulty level:
score: -
please wait...
 
spell the word:

Search The Web