Chapter 4 Data

The name of the technique, market basket analysis, suggests a narrow field of application. But the principles of analyzing association rules, with the core concepts of support, confidence and lift, are widely applicable, as is evident from the Titanic data video (see 2) and from our example below.

As an example of market basket analysis we will use a small data set with characteristics of 10 people who have shown criminal behavior.

Psychologists have interviewed these individuals and coded the information from the conversations using keywords (such as "drugs" for a drug addiction, "divorced", and so on).

The question now is whether these characteristics (items) are related to each other. For example, are drug users relatively often divorced; or vice versa, do criminals who are divorced more often use drugs?
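The two questions above correspond to two rules, {drugs} → {divorced} and {divorced} → {drugs}. As a minimal sketch, assuming each interview is stored as a set of keywords, support, confidence and lift could be computed as follows. The records are hypothetical (they are not the data set of this chapter), and the function names are our own.

```python
# Hypothetical interview records, invented for illustration only.
# Each record is the set of keywords coded for one person.
interviews = [
    {"divorced parents", "shoplifting"},
    {"drugs", "divorced"},
    {"unemployed", "divorced", "drugs", "violence", "debts"},
    {"drugs", "unemployed"},
    {"divorced", "debts"},
    {"drugs", "divorced", "violence"},
    {"unemployed", "debts"},
    {"divorced parents", "drugs"},
    {"unemployed", "violence"},
    {"drugs", "divorced", "debts"},
]


def support(itemset, records):
    """Share of records that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= record for record in records) / len(records)


def confidence(lhs, rhs, records):
    """Share of records containing lhs that also contain rhs."""
    return support(set(lhs) | set(rhs), records) / support(lhs, records)


def lift(lhs, rhs, records):
    """Confidence of lhs -> rhs relative to the overall share of rhs."""
    return confidence(lhs, rhs, records) / support(rhs, records)


for lhs, rhs in [({"drugs"}, {"divorced"}), ({"divorced"}, {"drugs"})]:
    print(
        sorted(lhs), "->", sorted(rhs),
        f"support={support(lhs | rhs, interviews):.2f}",
        f"confidence={confidence(lhs, rhs, interviews):.2f}",
        f"lift={lift(lhs, rhs, interviews):.2f}",
    )
```

In this made-up data the two rules share the same support (0.40) and lift (about 1.33), but the confidence depends on the direction: about 0.67 for drugs → divorced and 0.80 for divorced → drugs.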

Note that this technique is geared toward finding associations. It is unsupervised, as we do not have a target variable to explain or predict. Nor do we have a theory that uses logical reasoning to link the left-hand side (LHS) of a rule to its right-hand side (RHS).

That said, there often is a logic to the associations found. Consumers who buy bread may also buy peanut butter and chocolate sprinkles to use on the bread. But the point is that we do not start analyzing the data to test a priori theories. We will keep associations even if we do not fully understand their logic, as in the beer and diapers example in the video!

We can record the information from the interviews in a database.

Such a data file (in Excel) can look like this:

In the first interview, for example, it turns out that the interviewee has divorced parents and has been guilty of shoplifting.

Shoplifting does not occur in any of the other interviews.

One conclusion is that shoplifting is always associated with divorced parents (although the evidence for this conclusion, with only one case of shoplifting, is pretty thin).

Conversely, there are several interviewees with divorced parents who have not been guilty of shoplifting.

Note that due to the lack of structure in the data set, it is not so easy to identify these types of relationships even in this small data set!

A number of things stand out in the database.

  • The structure is different from what we are used to.

Columns normally have a fixed meaning: each contains a variable with an unambiguous meaning or label (e.g. age or gender). The columns in this file, by contrast, hold the "loose" comments noted down by the psychologist during the interview. The drug comment, for instance, appears as the first comment for person 2 but as the third comment for person 3. It is not relevant to this example, but in principle the interviewer can code the same comment two or more times in one and the same interview.

  • The number of filled columns differs from one person to the next. While person 3 has 5 comments, person 1 has only two.

This way of storing data has great advantages.

Think of the thousands of products offered by supermarkets, of which only a very small proportion appears in any one transaction.

If every column were to represent a product (item), then every record in the data file would consist mainly of zeros or empty cells! An online store would have to create a record with as many columns as there are items in the assortment, even though each transaction includes only one or a few items.
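The contrast between the two storage formats can be sketched in a few lines. The example below assumes a handful of hypothetical baskets (invented for illustration) and converts them from the compact "one list per transaction" format into the wide "one column per item" format, to show how many cells end up as zeros. The helper name to_wide is our own.

```python
# Hypothetical baskets, invented for illustration only.
# Compact format: one list per transaction, items in any order.
baskets = [
    ["bread", "peanut butter"],
    ["beer", "diapers", "bread"],
    ["chocolate sprinkles"],
    ["bread", "chocolate sprinkles", "peanut butter", "milk"],
]


def to_wide(baskets):
    """Return (items, rows): one 0/1 column per distinct item."""
    items = sorted({item for basket in baskets for item in basket})
    rows = []
    for basket in baskets:
        present = set(basket)  # a set also collapses items coded twice
        rows.append([1 if item in present else 0 for item in items])
    return items, rows


items, rows = to_wide(baskets)
filled = sum(sum(row) for row in rows)
cells = len(rows) * len(items)

print(items)
for row in rows:
    print(row)
print(f"{filled} of {cells} cells are filled; "
      f"{1 - filled / cells:.0%} are zeros")
```

Even with only six distinct items, more than half of the cells in the wide table are zeros. With an assortment of thousands of items and a few items per transaction, almost every cell would be zero.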

In its quest to be all things to all people, Amazon has built an unbelievable catalog of more than 12 million products, books, media, wine, and services. If you include Amazon Marketplace sellers as well, the number rises to more than 350 million products.

In our small example, too, this method of data storage has advantages: the interviewer can note down the comments during the interview, without having to worry about the sequence and the number of possible comments.

In the same vein, the cashier in the supermarket also scans the products in any "random" order in which they are placed on the belt!

Self test: try to see for yourself if you can detect some pattern in the data. Although there are only 10 people in the file and the number of comments has a maximum of 5, it is not an easy task! Algorithms help!
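To give an impression of how an algorithm approaches the self test, the sketch below simply counts how often each pair of keywords occurs together. This is only the counting step on which association rule algorithms build, not a complete algorithm. The records are hypothetical and are not the data set of this chapter.

```python
from collections import Counter
from itertools import combinations

# Hypothetical interview records, invented for illustration only.
# Keywords appear in any order, as in the file described above.
interviews = [
    ["divorced parents", "shoplifting"],
    ["drugs", "divorced"],
    ["unemployed", "divorced", "drugs", "violence", "debts"],
    ["drugs", "unemployed"],
    ["divorced", "debts"],
    ["drugs", "divorced", "violence"],
    ["unemployed", "debts"],
    ["divorced parents", "drugs"],
    ["unemployed", "violence"],
    ["drugs", "divorced", "debts"],
]

pair_counts = Counter()
for record in interviews:
    # Sort and deduplicate so that ("a", "b") and ("b", "a") count as one pair
    # and a keyword coded twice in one interview is counted once.
    for pair in combinations(sorted(set(record)), 2):
        pair_counts[pair] += 1

# Show the pairs that co-occur most often.
for pair, count in pair_counts.most_common(3):
    print(pair, count)
```

In this made-up data the pair ("divorced", "drugs") occurs most often, in 4 of the 10 records. Pairs with a high count are natural candidates for rules, which can then be judged on support, confidence and lift.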