Search The Web

Today's Headlines

Friday, January 21, 2011

Microsoft Access Tips & Tricks: Flattening A Table

OK, I admit, the title is a little cryptic and esoteric. What exactly does “flattening a table” really mean? In the context of this post, it is going to mean a process by which we are going to reduce the number of rows in the table while increasing the width. Hence the comparison to flattening something. The problem has been referred to using other, more wordier descriptions such as concatenating column values from multiple rows into a single column, concatenating values from within the same column, grouping with concatenation, aggregate function for concatenation, etc. I will give a concrete example of what I plan on doing in just a bit.

If you are interested, you can find my earlier posts on finding the median, the mode, the geometric and harmonic means, ranking every row in a query, selecting random rows out of a table, calculating running sums and averages, calculating running differences, creating histograms, calculating probability masses out of given data, calculating cumulative distributions out of given data, finding percentile scores, percentile values, calculating distinct counts, full outer joins, parameter queries, crosstab queries, working with system objects, listing table fields,finding unmatched rows, calculating statistics with grouping, job-candidate matching, job-candidate matching with skill levels, great circle distances, great circle initial headings, using Excel functions in Access, using the windows file-picker, using the Access date-picker, setting tab indexes quickly and correctly, pre-filling forms based on previous entries, highlighting form controls, conditional formatting, performing date manipulations, counting letters, words, sentences and paragraphs, calculating age, and propercasing text entries.

As promised, let me start with a concrete example of what I mean by “flattening a table”. Then we will look at how exactly to do it. The example I am going to pick up has to do with the the database I use for documenting and cataloging my large collection of electronic books. As you are aware, many books have multiple authors. I have an authors table that has three fields: a book ID field (numerical), an author field (text) and an author number field (numerical). Let us say this table has the following entries:

Book ID Author Name Author Number
1 A 1
1 B 2
2 C 1
2 D 2
2 E 3
2 F 4
3 G 1

What this table contains is details about three books. Their bookID’s are 1, 2 and 3. The book with bookID of 1 has 2 authors: A and B. A is the first author, and B is the second author. The book with bookID of 2 has 4 authors (C, D, E, and F). And the book with bookID of 3 has just one author (G).

Note that this table structure is considered the correct table structure for storing details such as this. It is normalized correctly rather than using repeating columns such as Author1, Author2, Author3, etc. If you have table structure like that, it is wrong and should be fixed. In addition to many other disadvantages, it makes it very difficult to deal with books that have more authors than the number of columns you have guessed at during table design as being adequate. Also, think about the difficulty of finding out all books with a particular author when you have a denormalized table structure!

Anyways, even though the table structure above is correct, normalized and all that good stuff, we still have a minor problem: if we want to see all the authors of a book on one line for any reason (maybe we want to show such a list in a report or form), then it becomes a little tricky. In essence, we need a query to produce output like the one below, if possible:

Book ID Author Names
1 A, B
2 C, D, E, F
3 G

This is what we mean by “flattening a table”. We have taken a table with 7 rows, and created a table with 3 rows out of it. And we have made the second column of the new table a multi-value column (once again, note that this is not a normalized table structure. You may create something like this for the express purpose of feeding a form or report, but you should not store data permanently in your database with multi-valued columns. Every field in a database table should contain atomic data). You can also call this “denormalizing a table”.

The question is how to go from the first table to the second table. Most people will tell you that this can not be done using SQL. They will insist that this has to be done only using VBA. The truth of the matter is, SQL is a very versatile language, and it is difficult to think of something that can not be done using SQL if you apply your mind to it hard enough! And no, that is not an open invitation to challenge me to create a linux kernel using SQL!! So, in this post, I am going to talk about a purely SQL solution to this problem. I will then talk about the disadvantages of the pure SQL approach to the problem. These disadvantages may make it better to use a little VBA to tackle the problem. I will explain the VBA solution to the problem in my next post. But for now, let us focus on the SQL solution.

Note that the second table is tantalizingly similar to the output from a Group By query. We group by Book ID, and instead of using a normal aggregate function like sum(), avg(), etc., we use a concatenate() function. Unfortunately, for some reason known only to the designers and implementers of SQL, the language does not have an aggregate function that performs concatenation. In fact, if such a function existed, the problem would be trivial, and I wouldn’t be writing about it here!

So, in the absence of such an aggregate function, how are we to write SQL to create the second table from the first? The answer lies in a series of self-joins. Self-joins are just like joins between two tables, only the two tables are the same. A table is joined with itself to create a self-join. Let us see how to use a series of self-joins to create a flattened table.

Consider the SQL below:

select myTable.BookID, myTable.AuthorName & ", " & mytable1.AuthorName as Authors
From myTable inner join myTable as myTable1 on myTable.BookID = myTable1.BookID
This is a self-join, because myTable is joined with itself (it is given an alias during the join so that we can refer to the fields in the two tables unambiguously). This gives us the output below:

BookIDAuthors
1A, A
1A, B
1B, A
1B, B
2C, C
2C, D
2C, E
2C, F
2D, C
2D, D
2D, E
2 D, F
2 E, C
2 E, D
2 E, E,
2 E, F
2 F, C
2 F, D
2 F, E
2 F, F
3 G, G

Obviously, not exactly what we want, but we are making progress. The first task is to eliminate duplicates like (A, A) from the Authors column. In fact, we should also eliminate multiple permutations of the same combination such (A, B) and (B, A). The simplest way to do this is to change the join condition to eliminate such rows from the join. Take a look at the SQL below:

select myTable.BookID, myTable.AuthorName & ", " & mytable1.AuthorName as Authors
From myTable inner join myTable as myTable1 on myTable.BookID = myTable1.BookID
AND myTable.AuthorName < myTable1.AuthorName
Here we use the AuthorName field to eliminate duplicates, and also to order the resulting combination in increasing order of AuthorName (thus an alphabetical listing of authors). We can create a list of authors by AuthorNumber (which might make more sense in this application), but it will be a little more involved. Why it will be more involved will become apparent as you read through the rest of this post.

Let us look at the output from the above query.

BookID Authors
1 A, B
2 C, D
2 C, E
2 C, F
2 D, E
2 D, F
2 E, F

This is a lot of progress! But we lost the row corresponding to BookID 3 because that book has only one author. The self-join fails for that BookID because it can not find another row in the table with the same BookID and an author name greater than G. The simplest way to fix this is to make the inner joins left joins. Take a look at the SQL below:

select myTable.BookID, myTable.AuthorName & ", " & mytable1.AuthorName as Authors
From myTable LEFT join myTable as myTable1 on myTable.BookID = myTable1.BookID
AND myTable.AuthorName < myTable1.AuthorName
The output from the query above is shown in the table below:

BookID Authors
1 A, B
1 B,
2 C, D
2 C, E
2 C, F
2 D, E
2 D, F
2 E, F
2 F,
3 G,

Because of the left join, in addition to the combinations, we also get the last author of each book on a separate row since the left join of that row in the original table fails to find a suitable row in the table to join with, resulting in that row being produced by itself. We are still making progress whether you believe it or not!

Actually, the results of the query above contain the results we are looking for (in addition to extraneous rows which we need to get rid of). Obviously, our SQL can only create combinations of at most two authors for each book because we have only one self-join in our SQL. But first, let us eliminate the extra rows in the above query, and then see how to get all the authors for books like Book 2, which have 4 authors.

Eliminating the unwanted rows is simply a matter of wrapping the query that generated the results above in an outer query that groups on BookID and takes the minimum of the Authors field (because we used a “<” comparison in our join condition, we take the minimum. If we had used a “>” comparison in our join condition, we would take the maximum). And using authorNumber instead of authorName in the inner query can create complications at this step because you can not tell in advance whether to take the minimum or maximum of authors in the outer query. That is why I have made a compromise and satisfied myself with getting an alphabetical listing of authors. The SQL below does exactly that:

select BookID, min(authors) From
(select myTable.BookID, myTable.AuthorName & ", " & mytable1.AuthorName as Authors
From myTable left join myTable as myTable1 on myTable.BookID = myTable1.BookID
AND myTable.AuthorName < myTable1.AuthorName)
group by BookID
Now, we get the output below:

BookID Authors
1 A, B
2 C, D
3 G,

Now, it is just a matter of getting the full list of authors for each book, rather than just the first two. Here is where the disadvantages of the pure SQL approach start to show themselves. You have to decide up front the maximum number of authors you are going to show for any book on your list. If you are satisfied with showing 5 authors, and think that the loss of information for books with more than 5 authors (which are extremely rare) is tolerable, you would just do 4 self-joins instead of the 1 self-join we have done so far. Remember that the number of self-joins will be one less than the maximum number of authors you can get in your output.

So, the final SQL for flattening the given table, and giving you up to 5 authors per book in the output is presented below:
select BookID, min(authors) from
(select myTable.BookID, myTable.AuthorName & ", " & mytable1.AuthorName & ", " &
myTable2.AuthorName & ", " & myTable3.AuthorName & ", " & myTable4.AuthorName as Authors
From ((((myTable left join myTable as myTable1 on myTable.BookID = myTable1.BookID
AND myTable.AuthorName < myTable1.AuthorName)
left join myTable as myTable2 on myTable1.BookID = myTable2.BookID AND myTable1.AuthorName < myTable2.AuthorName)
left join myTable as myTable3 on myTable2.BookID = myTable3.BookID AND myTable2.AuthorName < myTable3.AuthorName)
left join myTable as myTable4 on myTable3.BookID = myTable4.BookID AND myTable3.AuthorName < myTable4.AuthorName))
group by BookID
The output from the above SQL is presented below:

BookID Authors
1 A, B, , ,
2 C, D, E, F,
3 G, , , ,

Notice that there are extra commas in the output because of the way we concatenated the author names. If we omitted the commas between the names in the concatenation, we would still end up with extra spaces, but they would be less noticeable in the output.

So, you can flatten a table using pure SQL. But there are several problems with the approach that might preclude its use in your application:
  • You may not be able to tell in advance how many values you want to concatenate in the final output for any given row
  • You may not be able to live with any loss of information (in the example above, we stopped at 4 left joins with the full knowledge that if a book had more than 5 authors, we would see only the first 5 in the output)
  • If you use a separator except spaces during the concatenation, extra separators will appear in the output for rows which have fewer values to concatenate
  • If your concatenation is of one column, but you want to use a different column in the table to determine the order in which the values need to be concatenated, it gets complicated (that is why I did not tackle the problem of ordering the authors by author number in this post)
  • If your table is very large, the self-joins create more and more rows exponentially. You could run out of memory and other resources when you run the subquery or when you run the outer query to group by BookID
So, take this solution as a proof of concept rather than something that will work for you all the time. This proof of concept proves that SQL is a lot more powerful than many naysayers would have you believe! However, many production databases contain millions of rows, and a series of self-joins like this can make your database administrator very unhappy and/or angry with you! In the next post in this series, I will go into solution of this problem using a few lines of VBA code.

Hope this post has been helpful in solving any problems you might have had with flattening a table in Access. If you have any problems or concerns, please feel free to let me know by posting a comment. If you have other questions on Access that you would like me to address in future posts, please feel free to let me know through your comments too. Good luck!

1 comment:

jose said...

The separator issue can be fixed by using Access' IIF function. For example, line 2 of the final SQL would be changed to:

(select myTable.BookID, myTable.AuthorName & Iif(mytable1.AuthorName Is Not Null, ", " & mytable1.AuthorName, Null) & Iif(mytable2.AuthorName Is Not Null, ", " &

Visitors Country Map

Free counters!

Content From TheFreeDictionary.com

In the News

Article of the Day

This Day in History

Today's Birthday

Quote of the Day

Word of the Day

Match Up
Match each word in the left column with its synonym on the right. When finished, click Answer to see the results. Good luck!

 

Hangman

Spelling Bee
difficulty level:
score: -
please wait...
 
spell the word:

Search The Web