## Monday, December 7, 2009

### Microsoft Access Tips & Tricks: Running Sums & Averages

In previous lessons in this series, we have talked about the calculation of the median, the mode, and the geometric and harmonic means of a field. We have also learnt how to rank every row in a query, and how to pull out random rows from a table using SQL. In this lesson, we will deal with the calculation of a running sum or running average of a given field in a query.

Why would you want a running sum and/or running average of a field? Basically, it is just another statistic that may be interesting to you. For instance, if you have a travel diary in Access, and you want to calculate the progression of the total distance or travel time as more and more trips are added to your diary, you would want a running sum. If you want the progression of how the average distance or travel time has varied over time, you might want to associate a running average with each trip that tells you what the average was at the end of that trip.

We will assume that the data in the table makes every row unique in some fashion. Either there is a timestamp associated with each row that tells you how the rows are to be ordered or you have used an autonumber field as the primary key. We will first deal with the easier case of calculating running sums, then we will deal separately with calculating running averages.

You can easily calculate running sums with SQL that is very similar to how we calculated ranks for each row in an earlier lesson. There are actually two ways of calculating a running sum as shown below (equivalent to the two constructs we explored for ranking every row), and both the methods produce identical results. Let us first deal with the case where our query produces output with just one field from the table. Let that field be myField. We are assuming that each value of myField is unique and the ordering of myField in ascending order makes some logical sense. The two SQL constructs below will produce myField as well as a running sum of myField for each row in the query results. Both the constructs produce identical results, so you can choose to use whichever one catches your fancy!
`SELECT (SELECT sum(myField) from myTable where myField <= T.myField)as RunningSum,myField from myTable as T ORDER BY myFieldSELECT (SELECT sum(myField) + T.myField from myTable where myField < T.myField)as RunningSum,myField from myTable as T ORDER BY myField`

In the subquery inside each of these queries, we are adding the value of myField from the rows which have a value of myField less than (or less than or equal to) the value of myField we are dealing with in the outer query. We are then using that sum + the current value of myField, or the sum itself (if the comparison is less than or equal to since less than or equal to would include the current value of myField in the sum) as the running sum for the row.

Actually though, under most conditions, you would probably want to order by one field while displaying some other field and its running sum (for instance, in the case of travel diary, you might want to order by a date/time field that denotes when the trip took place, or an autonumber field that tells you in which order the trips took place, and actually display the distance and its running sum).

In the next example, we see two equivalent SQL queries that order rows by an autonumber field called myAutoNumberField (we assume that myAutoNumberField is not a random autonumber field since sorting by a random autonumber field will not make logical sense in most cases) while displaying myField and a running sum for myField:
`SELECT (SELECT sum(myField) from myTable where myAutoNumberField <= T.myAutoNumberField)as RunningSum,myField from myTable as T ORDER BY myAutoNumberFieldSELECT (SELECT sum(myField) + T.myField from myTable where myAutoNumberField < T.myAutoNumberField)as RunningSum,myField from myTable as T ORDER BY myAutoNumberField`
Notice that the autonumber field itself is not displayed in the results of the query at all. We can still use it in the subquery to calculate the running sum without any problems. If we do want to display the autonumber field also, we can add it to the outer query and it will not make a difference to the running sums.

What happens if we are pulling out a subset of rows in the table based on some filtering criteria? Suppose we use some condition C (which is a boolean construct that determines whether a given row is part of the result set or not) to select rows for inclusion in the results of the query. The SQL below shows how to calculate a running sum in such a query:
`SELECT (SELECT sum(myField) from myTable where (myField <= T.myField) AND (C))as RunningSum,myField from myTable as T where C ORDER BY myFieldSELECT (SELECT sum(myField) + T.myField from myTable where (myField < T.myField) AND (C))as RunningSum,myField from myTable as T where C ORDER BY myField`
Once again, we have assumed that ordering the query by myField makes logical sense. More often than not, you probably want to order by timestamp or autonumber field while calculating cumulative statistics for a different field. The SQL for doing that is shown below:
`SELECT (SELECT sum(myField) from myTablewhere (myAutoNumberField <= T.myAutoNumberField) AND (C))as RunningSum,myField from myTable as T where C ORDER BY myAutoNumberFieldSELECT (SELECT sum(myField) + T.myField from myTablewhere (myAutoNumberField < T.myAutoNumberField) AND (C))as RunningSum,myField from myTable as T where C ORDER BY myAutoNumberField`
Now, let us see how we tackle the case of calculating running sums for more than one field at the same time. For instance, we may need the running sums for both distance and time (perhaps, so that we can use them to calculate a running average speed) from a travel diary. We will assume that the rows in the query are going to be ordered by an autonumber field (ordering by distance or time in the case of a travel diary does not make much logical sense).

The main problem you encounter is that the way the original query is structured, the subquery can return only one field for each row pulled out by the query. So, if you modify the query as below, it will be syntactically incorrect, and Access will complain:
`SELECT (SELECT sum(myField1), sum(myField2) from myTablewhere myAutoNumberField <= T.myAutoNumberField),myField1, myField2 from myTable as T ORDER BY myAutoNumberField`

So, how exactly do we get around this limitation? It is actually quite simple, but it does make the SQL quite long, and makes it look complicated. But remember, even though it looks complicated, it actually is not. Essentially, what we are going to do is calculate the two running sums in two separate subqueries, as shown below:

`SELECT (SELECT sum(myField1) from myTablewhere myAutoNumberField <= T.myAutoNumberField)as RunningSumField1,(SELECT sum(myField2) from myTablewhere myAutoNumberField <= T.myAutoNumberField)as RunningSumField2,myField1, myField2 from myTable as T ORDER BY myAutoNumberField`

Now that we have tackled running sums, let us tackle the problem of running averages. This is where the previous lesson on ranking rows in a query comes in handy. Let us recall that the running average corresponding to a row in the query results is nothing but the running sum of that row divided by the rank of that row. Thus, calculating the running average turns out to be quite simple once we combine the SQL from both lessons.

First we will cover the case of a running average for one field. The SQL below should be self-explanatory given the definition in the previous paragraph. If you are not sure, you should review the lesson on ranking every row in a query also:
`SELECT((SELECT sum(myField) from myTable where myAutoNumberField <= T.myAutoNumberField)/(SELECT count(*) from myTable where myAutoNumberField <= T.myAutoNumberField))as RunningAverage,myField from myTable as T ORDER BY myAutoNumberField`

To extend this SQL for multiple fields, we follow the same basic technique we used for extending running sums to multiple fields: we create each running average as a separate subquery. The SQL below illustrates the technique:
`SELECT((SELECT sum(myField1) from myTable where myAutoNumberField <= T.myAutoNumberField)/(SELECT count(*) from myTable where myAutoNumberField <= T.myAutoNumberField))as RunningAverage1,((SELECT sum(myField2) from myTable where myAutoNumberField <= T.myAutoNumberField)/(SELECT count(*) from myTable where myAutoNumberField <= T.myAutoNumberField))as RunningAverage2,myField1, myField2 from myTable as T ORDER BY myAutoNumberField`
If you want both running sums and running averages, you have to expand the SQL above to include the subqueries that calculate each of them explicitly, as shown below:
`SELECT((SELECT sum(myField) from myTable where myAutoNumberField <= T.myAutoNumberField)/(SELECT count(*) from myTable where myAutoNumberField <= T.myAutoNumberField))as RunningAverage,(SELECT sum(myField) from myTable where myAutoNumberField <= T.myAutoNumberField)as RunningSum,myField from myTable as T ORDER BY myAutoNumberField`
It should be easy enough to extend this type of construct to include multiple fields if necessary. You can also explicitly write out the rank in addition to the running statistics by adding the subquery for it to the above query.

Hope this post has been helpful in solving any problems you might have had with calculating running statistics such as running sums and averages. If you have any problems or concerns with the queries in this lesson, please feel free to let me know by posting a comment. Good luck!

SHM said...

i want to create a program that calculates fuel mileage of a vehicle. odometer readings entered at every time may not be increasing every time as meter may get reset and starts with zero. sometimes odometer may not function properly and needs to be fixed. readings will change after fixing the odometer. how to take care of these two problems in my program. i need this to creat in microsoft access 2007. should i need VB program to write in this access querry. i am not familiar with VB. it will be useful if i can get the program for this.

Blogannath said...

First, what you need are running differences, not running sums. I have a later blog post about running differences that might get you started.

However, if your odometer readings are not accurate, nothing can help you. Garbage in, Garbage out is the first rule of computation. No program can figure out by itself whether and when the odometer is reset, out of order or repaired. Think about how a human could work around these problems (what specific pieces of information are required and how those pieces of information are going to be used), and then you can try implementing that procedure as a computer program.

Peter Harrison said...
This comment has been removed by the author.
Blogannath said...

What is the difference in calculation between the two, and what difference in output did you expect? I have no idea what your data looks like, and since you haven't mentioned what the difference is between how you calculate runningaverage1 and runningaverage2, I am not sure I understand what the problem is.

Peter Harrison said...

Sorry, I have just seen the errors I have made. and having changed it, my results are fine.

thanks
Pete

Blogannath said...

Excellent. Glad you were able to work it out.

Peter Harrison said...
This comment has been removed by the author.
Blogannath said...

The syntax of my example queries is explained at every step in the post, so I am not sure which part of it you are having trouble with. You are basically using subqueries to get the totals (the running sums) you are interested in, and outputting them as part of the overall query output.

I don't know whether you have read the post on ranking every row of your query output, which is what this post is based on. In either case, using joins within the subquery should be no problem syntactically.

It may be best to break the problem down into smaller steps and tackle them one at a time before putting them together for the final solution. Work on the joins first to make sure you are pulling the correct fields from the correct tables, then start adding the aggregate functions to see if the sums are working fine, then finally combine them into a working solution.

Peter Harrison said...

Having read through your previous post I have got it working although I have to include the transactionID (autonumber field) in the group by statement. This nullifies the aggregate.

Here is what I ended up with
SELECT
((SELECT Sum(IIf([vat],iif([weekendingdate] Between #04/01/2011# and #31/12/2099#, [takings]/1.2,iif([weekendingdate] BETWEEN #30/11/2008# and #01/01/2010#,[takings]/1.15,[takings]/1.175)),[takings])) AS Total
FROM departments INNER JOIN (([business units] INNER JOIN [n-weeklytransactions] ON [business units].buID = [n-weeklytransactions].businessUnit) INNER JOIN dates ON [n-weeklytransactions].ztotaldate = dates.Startdate) ON departments.deptid = [n-weeklytransactions].departmentId
WHERE transactionID<=t.transactionID)/
(SELECT Count(*) AS CountOftakings
FROM [n-weeklytransactions]
WHERE transactionID<=t.transactionID)) AS RunningAverage1, dates.WeekNo, t.weekendingdate
FROM [business units] INNER JOIN (departments INNER JOIN ([n-weeklytransactions] AS t INNER JOIN dates ON t.ztotaldate = dates.Startdate) ON departments.deptid = t.departmentId) ON [business units].buID = departments.bus_unit
GROUP BY dates.WeekNo, t.weekendingdate, t.transactionID
HAVING (((dates.WeekNo)<53) AND ((DatePart('yyyy',t![weekendingdate]))=2014))
ORDER BY dates.WeekNo;

Any ideas?

Many thanks
Pete

Blogannath said...

You should not need to include t.transactionID in the Group By clause because it is not included in the SELECT statement. I would check the parentheses (the Access SQL view always includes a ton of extra parentheses making it hard to read the SQL) to make sure that t.transactionID truly is not included as part of the SELECT statement.

Also, I notice that weeknumber and a calculated part of weekendingdate are included in the HAVING clause. This is incorrect. They are simple data fields, so they should be in the WHERE clause. Only aggregates should be in the HAVING clause. Maybe that is tripping up Access and causing it to complain about transactionID, causing you to include it as part of the GROUP BY clause.

Peter Harrison said...
This comment has been removed by the author.
Peter Harrison said...

I have spent hours trying to make this work in Access abut always have the same problem of the transactionID not being part of the aggregate function. However, I moved it to MySQL and it worked first time!

Once again however, I have a problem that the figures do not quite work.

I'm assuming that this is because I have syntax problems but I do not know where.

On the first row, the running average should be that same as the value it's using (value/1 = value)? But it doesn't. I think that the Order By and Group By must be effecting it but I cannot see where.

Here's my SQL
SELECT round(((SELECT Sum(If(departments.vat,if(weekendingdate Between '2011-01-04' and '2099-12-31', takings/1.2,if(weekendingdate BETWEEN '2008-11-30' and '2010-01-01',takings/1.15,takings/1.175)),takings)) AS Total
FROM business_units INNER JOIN (departments INNER JOIN weeklytransactions ON departments.deptid = weeklytransactions.departmentId) ON business_units.buID = departments.bus_unit
WHERE transactionID<=t.transactionID)/(SELECT Count(*) AS CountOftakings
FROM weeklytransactions
WHERE weeklytransactions.departmentId<=t.departmentId)),2) AS RunningAverage1, round(Sum(If(departments.vat,if(weekendingdate Between '2011-01-04' and '2099-12-31', takings/1.2,if(weekendingdate BETWEEN '2008-11-30' and '2010-01-01',takings/1.15,takings/1.175)),takings)),2) AS Total, t.weekendingdate, t.ztotaldate, week(weekendingdate,3) AS weekno, departments.department, business_units.business_unit
FROM business_units INNER JOIN (departments INNER JOIN weeklytransactions AS t ON departments.deptid = t.departmentId) ON business_units.buID = departments.bus_unit
WHERE (((year(t.weekendingdate))=2014))

And here's some sample output
174.16 195.95 2014-01-05 00:00:00 2013-12-30 00:00:00 1 Ambient/Other Foodhall
260.66 310.02 2014-01-05 00:00:00 2013-12-30 00:00:00 1 Butchery Foodhall
346.71 234.65 2014-01-05 00:00:00 2013-12-30 00:00:00 1 Deli Foodhall
518.15 198.88 2014-01-05 00:00:00 2013-12-30 00:00:00 1 Drinks Cafe
1035.70 507.67 2014-01-05 00:00:00 2013-12-30 00:00:00 1 Food Cafe
149.36 36.61 2014-01-05 00:00:00 2013-12-30 00:00:00 1 Veg Foodhall
208.74 47.71 2014-01-05 00:00:00 2013-12-30 00:00:00 1 Wine/VAT Foodhall

I have added the ztotaldate field to show each individual record and the problem with the average

Regards
Pete

Blogannath said...

I am not sure what your data represents and what you want to calculate. But, I see a couple of things in your query that may be causing your problems.

First, the subqueries that you are using for the numerator and denominator of the running average are not going against the same tables (in particular, the WHERE condition of the denominator seems suspect to me). Again, I am not sure what your table structures are, but if you notice in my example queries, the numerator and denominator differ only in that the sum is changed to a count in the denominator. The actual query itself, including the WHERE clauses of each of the subqueries is the same. That is how you make sure that the same set of records is chosen for the numerator and denominator. The running average makes sense only when the same records participate in both the numerator and denominator of the calculation.

At this point, I would suggest that you execute each subquery individually to make sure you are getting the correct numbers. In fact, remove the aggregate functions from the subqueries and execute them separately. Verify the output of each query by looking at the table contents manually and doing the calculations by hand. Correct any errors you find, and then try to put the final query back together.

## Content From TheFreeDictionary.com

In the News

Article of the Day

This Day in History

Today's Birthday

Quote of the Day

Word of the Day

Match Up
 Match each word in the left column with its synonym on the right. When finished, click Answer to see the results. Good luck!

Hangman

Spelling Bee
difficulty level:
score: -