Chapter 3 Association Rules

3.1 Item Sets

The relationship between items that we study is their joint occurrence in the same shopping basket.

In the jargon of market basket analysis, we speak of item sets, and of association rules.

An item set is simply a collection of one or more items. We display these collections with curly brackets {}.

An example of an item set in a customer's shopping basket in the supermarket is the combination of bread and peanut butter, shown as:

{bread, peanut butter}

But also

{bread}

and

{bread, peanut butter, toilet cleaner}

are examples of item sets.

The number of possible item sets quickly takes on enormous proportions.

Suppose a very small store sells 10 types of products (or items). A customer either buys or does not buy each item, which results in:

\(2^{10} = 1024\)

possible item sets (counting the empty basket)!

For a slightly larger supermarket with 100 different items, the number of possible item sets is

\(2^{100} \approx 1.27 \times 10^{30}\)

(that is, roughly 127 followed by 28 zeros).
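The growth in the number of item sets can be checked with a few lines of Python. This is a minimal sketch, not part of any market basket library: the helper names `all_item_sets` and `count_item_sets` and the three-item catalogue are made up for illustration, and the count includes the empty basket.

```python
from itertools import combinations


def all_item_sets(items):
    """Return every subset of `items` as a frozenset, including the empty set."""
    items = sorted(items)
    return [
        frozenset(combo)
        for size in range(len(items) + 1)
        for combo in combinations(items, size)
    ]


def count_item_sets(n_items):
    """Each item is either in the basket or not, so there are 2**n subsets."""
    return 2 ** n_items


# Hypothetical three-item catalogue: small enough to list every item set.
small_store = ["bread", "peanut butter", "toilet cleaner"]
subsets = all_item_sets(small_store)
print(len(subsets))            # 8, which equals 2**3

# For larger catalogues we only count; listing them is no longer practical.
print(count_item_sets(10))     # 1024
print(count_item_sets(100))    # 1267650600228229401496703205376
```

Enumerating all subsets works for three items but is already hopeless for 100, which is why the count is computed directly rather than by building the list.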

In those situations it is not feasible to evaluate all possible combinations. We have to look for a smart and efficient algorithm to generate rules that connect item sets.

An example of such a rule is:

{bread} → {peanut butter, sprinkles}

In this rule, a link is made between an item or item set to the left of the arrow (commonly referred to as the LHS, or left-hand side) and an item or item set to the right of the arrow (the RHS, or right-hand side).

The left side sets the condition, and the right side states what that condition leads to.

In words, this rule says that customers who buy bread also tend to buy peanut butter and sprinkles.
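To make the LHS and RHS concrete, a rule can be represented as a pair of item sets and checked against a list of baskets. The sketch below is illustrative only: the `Rule` type, the `baskets_matching` helper, and the four baskets are hypothetical and do not come from any real data set or library.

```python
from typing import FrozenSet, NamedTuple


class Rule(NamedTuple):
    """A rule LHS -> RHS, where both sides are item sets."""
    lhs: FrozenSet[str]
    rhs: FrozenSet[str]


def baskets_matching(rule, baskets):
    """Return the baskets containing the LHS, and those containing LHS and RHS."""
    with_lhs = [b for b in baskets if rule.lhs <= b]      # condition is met
    with_both = [b for b in with_lhs if rule.rhs <= b]    # rule also holds
    return with_lhs, with_both


# Hypothetical shopping baskets, invented for this example.
baskets = [
    {"bread", "peanut butter", "sprinkles"},
    {"bread", "peanut butter"},
    {"bread", "peanut butter", "sprinkles", "toilet cleaner"},
    {"toilet cleaner"},
]

rule = Rule(lhs=frozenset({"bread"}),
            rhs=frozenset({"peanut butter", "sprinkles"}))

with_lhs, with_both = baskets_matching(rule, baskets)
print(len(with_lhs), len(with_both))   # 3 2
```

In this made-up data, three baskets contain bread and two of those also contain both peanut butter and sprinkles, so the rule holds for some but not all bread buyers.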

The market basket analysis algorithm is a smart method to detect “interesting” connection rules between items.

What is interesting depends on the type of application.

In a large supermarket with a thousand products, many thousands of customers, and millions of transactions per year, even combinations that make up a small part of the whole can carry important information. Different criteria will apply in, say, medical applications or smaller data sets.