# Association Rules (3)

2022-11-24 21:33:11

## Association Rules

### 1. Algorithm Background

#### 1.1 The Story of Beer and Diapers

Walmart's "beer and diapers" is a classic case in the marketing world. It was officially published in the 1998 Harvard Business Review, which should be regarded as the most authoritative report found so far. In American Walmart supermarkets in the 1990s, store managers analyzing sales data discovered a puzzling phenomenon: under certain circumstances, beer and diapers, two seemingly unrelated items, would often appear in the same shopping basket. This peculiar sales pattern caught the managers' attention, and a follow-up investigation found that it was driven by young fathers. In American families with infants, the mother usually stays home to look after the baby while the young father goes to the supermarket to buy diapers. While buying diapers, the father often picks up some beer for himself, which is how beer and diapers, two seemingly unrelated products, frequently ended up in the same shopping basket.

#### 1.2 Thoughts Inspired by the Story

"Beer and diapers" was a relationship discovered through manual observation and analysis. Faced with so many supermarket purchase records, how can we discover the correlations between items efficiently? In other words, can we formulate such an algorithm mathematically so that it can run on a computer and find such rules efficiently? In 1993, the American scholar Agrawal proposed an association algorithm that analyzes the sets of items in shopping baskets to find the relationships between products, and uses those relationships to characterize customers' purchasing behavior. From the perspective of mathematics and computer algorithms, Agrawal proposed a method for computing item associations: the Apriori algorithm.

### 2. Algorithm Principles

#### 2.1 A Guiding Example

Consider the following simple transaction list for a grocery store:

| TID | Items      |
|-----|------------|
| 1   | A, B, C, D |
| 2   | A, B, C, E |
| 3   | B, D, E, F |
| 4   | B, C, D, E |
| 5   | A, C, D, F |
| 6   | A, B, C    |
| 7   | A, B, E    |

#### 2.2 Basic Concepts


**Support**: $Sup(X)= \dfrac{\text{number of transactions in which itemset } X \text{ appears}}{\text{total number of transactions}}$

(Concept 13. Association rule: an expression of the form X => Y, where X and Y are disjoint itemsets. For example, with X = {A} and Y = {B, C}, the association rule {A} => {B, C} says that a customer who buys A is very likely to also buy B and C.)

**Confidence**: $Con(X \Rightarrow Y)= \dfrac{Sup(XY)}{Sup(X)} = \dfrac{4/7}{5/7}= \dfrac{4}{5} = 0.8$ (here, for example, taking $X=\{A\}$ and $Y=\{C\}$ from the transaction list above)


**Lift**: $Lift(X \Rightarrow Y) = \dfrac{Con(X \Rightarrow Y)}{Sup(Y)}=\dfrac{0.8}{5/7} = 1.12$. Since the lift is greater than 1, buying X promotes buying Y.

**Frequent itemset**: $Sup(X)= 4/7 = 0.57 \ge 0.5$, so $X = \{A, C\}$ is a frequent 2-itemset (here the minimum support is taken as 0.5).
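As a quick check, the support, confidence, and lift above can be computed directly from the seven transactions in Section 2.1. A minimal sketch (the helper names `sup`, `con`, and `lift` are my own, not from the original):

```python
# The seven transactions from Section 2.1
transactions = [{'A', 'B', 'C', 'D'}, {'A', 'B', 'C', 'E'}, {'B', 'D', 'E', 'F'},
                {'B', 'C', 'D', 'E'}, {'A', 'C', 'D', 'F'}, {'A', 'B', 'C'},
                {'A', 'B', 'E'}]

def sup(x):
    """Support: fraction of transactions containing every item of x."""
    return sum(x <= t for t in transactions) / len(transactions)

def con(x, y):
    """Confidence of X => Y: Sup(X ∪ Y) / Sup(X)."""
    return sup(x | y) / sup(x)

def lift(x, y):
    """Lift of X => Y: Con(X => Y) / Sup(Y)."""
    return con(x, y) / sup(y)

print(sup({'A', 'C'}))      # 4/7 ≈ 0.57
print(con({'A'}, {'C'}))    # (4/7) / (5/7) = 0.8
print(lift({'A'}, {'C'}))   # 0.8 / (5/7) = 1.12
```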

###### 1. Join Step

Based on the frequent 2-itemsets $L_2$, generate the candidate 3-itemsets $C_3$. In general, the candidate set $C_{k+1}$ is generated from $L_k$ by a merge (join) operation:

$C_{k+1} = L_k \bowtie L_k = \{\, X \cup Y \mid X, Y \in L_k,\ |X \cap Y| = k-1 \,\}$.

Intuitively, one can fix an ordering, such as the lexicographic order of item names or the support count of each item, and sort the k items within every itemset in $L_k$. If two frequent itemsets agree on their first k-1 items, they can be merged into one candidate itemset of length k+1. Doing so guarantees the completeness of the candidate set and avoids duplicates.

Sort the items of every itemset $l_i$ in $L_{k-1}$ so that $l_i[1] < l_i[2] < \dots < l_i[k-1]$,

then take all join results as the candidate k-itemsets $C_k$.


1. Take a candidate k-itemset $l_i$ from $C_k$ and find all of its (k-1)-item subsets.

2. Check whether all of the (k-1)-item subsets of $l_i$ appear in $L_{k-1}$. If any (k-1)-item subset is not in $L_{k-1}$, delete $l_i$ from $C_k$; if they are all present, keep $l_i$.

3. Repeat this check for every candidate k-itemset in $C_k$.

###### 2. Prune Step

This step quickly reduces the number of itemsets contained in $C_k$.

$C_k$ is a superset of $L_k$. After compressing $C_k$ with the subset test above, the database is scanned to determine the count of each remaining candidate in $C_k$, which determines $L_k$.
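Under the assumption that itemsets are stored as sorted tuples, the join and prune steps can be sketched as follows (a rough sketch, not the author's code; this join merges any two frequent k-itemsets whose union has k+1 items rather than matching k-1-item prefixes, and the subset-based prune then yields the same candidates):

```python
from itertools import combinations

def join(L_k):
    """Join step: merge two frequent k-itemsets sharing k-1 items into a
    candidate (k+1)-itemset. Itemsets are represented as sorted tuples."""
    candidates = set()
    for x in L_k:
        for y in L_k:
            union = tuple(sorted(set(x) | set(y)))
            if len(union) == len(x) + 1:      # x and y overlap in k-1 items
                candidates.add(union)
    return candidates

def prune(C, L_k):
    """Prune step: drop any candidate with a k-subset missing from L_k."""
    L_k = set(L_k)
    return {c for c in C
            if all(sub in L_k for sub in combinations(c, len(c) - 1))}

# Example frequent 2-itemsets (those found later in this section)
L2 = [('A', 'B'), ('A', 'C'), ('B', 'C'), ('B', 'D'), ('B', 'E'), ('C', 'D')]
C3 = prune(join(L2), L2)
print(sorted(C3))   # [('A', 'B', 'C'), ('B', 'C', 'D')]
```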

#### 2.3 Algorithm Workflow

Step 1: Set K = 1, compute the support of every single item, and filter out the frequent 1-itemsets (for example, with minimum support 0.3 = 2.1/7).

Step 2: (starting from K = 2) Generate the candidate K-itemsets from the frequent (K-1)-itemsets and pre-prune them.

Step 3: Generate the frequent K-itemsets from the candidate K-itemsets (filter out the k-itemsets that satisfy the minimum support).

Step 4: Generate association rules from the frequent itemsets. Take each frequent k-itemset $l$ (k = 1, 2, …, K) in turn. For every $l$, generate $2^k - 2$ association rules of the form $s \Rightarrow l - s$, where $s$ is a proper nonempty subset of the frequent itemset $l$ and $l - s$ is the itemset $l$ with $s$ removed. Compute the confidence of each rule and keep the strong association rules that meet the requirement. (End of algorithm.)
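Steps 1-3 above can be sketched as a single loop. This is a minimal illustrative implementation under the stated thresholds, not the efficient-apriori library used later in this post:

```python
from itertools import combinations

def apriori_frequent(transactions, min_support):
    """Steps 1-3: iteratively build frequent k-itemsets until none remain.
    Returns a dict {frozenset itemset: support}."""
    n = len(transactions)
    transactions = [frozenset(t) for t in transactions]

    def support(s):
        return sum(s <= t for t in transactions) / n

    # Step 1: frequent 1-itemsets
    items = {i for t in transactions for i in t}
    L = {frozenset([i]) for i in items if support(frozenset([i])) >= min_support}
    frequent, k = {}, 1
    while L:
        frequent.update({s: support(s) for s in L})
        # Step 2: join (unions of size k+1) then prune by the subset test
        C = {x | y for x in L for y in L if len(x | y) == k + 1}
        C = {c for c in C
             if all(frozenset(sub) in L for sub in combinations(c, k))}
        # Step 3: keep candidates meeting the minimum support
        L = {c for c in C if support(c) >= min_support}
        k += 1
    return frequent

data = [('A', 'B', 'C', 'D'), ('A', 'B', 'C', 'E'), ('B', 'D', 'E', 'F'),
        ('B', 'C', 'D', 'E'), ('A', 'C', 'D', 'F'), ('A', 'B', 'C'),
        ('A', 'B', 'E')]
freq = apriori_frequent(data, min_support=0.3)
print(sorted(''.join(sorted(s)) for s in freq))
```

Running this on the seven transactions reproduces the hand computation of the next section: the frequent 1-itemsets A, B, C, D, E, the six frequent 2-itemsets, and the single frequent 3-itemset {A, B, C}.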

#### 2.4 A Worked Example

(Set the minimum support to 0.3 = 2.1/7 and the minimum confidence to 0.75.)

(Using the transaction list from Section 2.1.)

Step 1: Set K = 1, compute the support of every single item, and filter out the frequent 1-itemsets (minimum support 0.3 = 2.1/7).

The supports of {A}, {B}, {C}, {D}, {E}, {F} are 0.71, 0.86, 0.71, 0.57, 0.57, and 0.29 respectively, so the frequent 1-itemsets are {A}, {B}, {C}, {D}, {E}.

Step 2: (K = 2) Generate the candidate K-itemsets from the frequent (K-1)-itemsets and pre-prune. The frequent 1-itemsets are {A}, {B}, {C}, {D}, {E}.

1. Generate the candidate K-itemsets

1.1 Join

Sort the items of every itemset $l_i$ in $L_{k-1}$ so that $l_i[1] < l_i[2] < \dots < l_i[k-1]$,

then take all join results as the candidate K-itemsets $C_k$.


1.2 Prune

Pruning:

1. Take a candidate k-itemset $l_i$ from $C_k$ and find all of its (k-1)-item subsets.

2. Check whether all of the (k-1)-item subsets of $l_i$ appear in $L_{k-1}$. If any (k-1)-item subset is not in $L_{k-1}$, delete $l_i$ from $C_k$; if they are all present, keep $l_i$.

3. Repeat this check for every candidate k-itemset in $C_k$.



Step 3: Generate the frequent K-itemsets from the candidate K-itemsets (filter out the k-itemsets that satisfy the minimum support).

1. Compute the support of each candidate k-itemset in $C_k$; if it is at least the minimum support, add it to $L_k$.

| Candidate 2-itemset | Support |
|---------------------|---------|
| {A,B}               | 4/7     |
| {A,C}               | 4/7     |
| {A,D}               | 2/7     |
| {A,E}               | 2/7     |
| {B,C}               | 4/7     |
| {B,D}               | 3/7     |
| {B,E}               | 4/7     |
| {C,D}               | 3/7     |
| {C,E}               | 2/7     |
| {D,E}               | 2/7     |

This yields the frequent 2-itemsets: {A,B}, {A,C}, {B,C}, {B,D}, {B,E}, {C,D}.

Step 2: (K = 3) Generate the candidate K-itemsets from the frequent (K-1)-itemsets and pre-prune (minimum support 0.3).

1. Generate the candidate K-itemsets

1.1 Join: {A,B,C} {B,C,D} {B,C,E} {B,D,E}

1.2 Prune: {A,B,C} {B,C,D}

Step 3: Generate the frequent K-itemsets from the candidate K-itemsets (filter out the k-itemsets that satisfy the minimum support).

1. Compute the support of each candidate k-itemset in $C_k$; if it is at least the minimum support, add it to $L_k$.

Sup(ABC) = 3/7 ≥ 0.3, Sup(BCD) = 2/7 < 0.3

So {A,B,C} is the frequent 3-itemset.


Step 4: Generate association rules from the frequent itemsets. Take each frequent k-itemset $l$ (k = 1, 2, …, K) in turn. For every $l$, generate $2^k - 2$ association rules of the form $s \Rightarrow l - s$, where $s$ is a proper nonempty subset of the frequent itemset $l$ and $l - s$ is the itemset $l$ with $s$ removed. Compute the confidence of each rule and keep the strong association rules that meet the requirement. (End of algorithm.)

{A} -> {B,C}, {B} -> {A,C}, {C} -> {A,B}, {A,B} -> {C}, {A,C} -> {B}, {B,C} -> {A}

The strong association rules finally obtained from the frequent 3-itemset are:

{B, C} -> {A}, {A, C} -> {B}, {A, B} -> {C}

(Note that the confidence of {A} -> {B,C} is 0.6, so {B,C} -> {A} and {A} -> {B,C} are different association rules.)
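Step 4 applied to the frequent itemset {A, B, C} can be sketched as follows (a minimal sketch with my own helper names; note that confidence only needs transaction counts, since the 1/7 factors cancel):

```python
from itertools import combinations

# Transactions from Section 2.1
transactions = [{'A', 'B', 'C', 'D'}, {'A', 'B', 'C', 'E'}, {'B', 'D', 'E', 'F'},
                {'B', 'C', 'D', 'E'}, {'A', 'C', 'D', 'F'}, {'A', 'B', 'C'},
                {'A', 'B', 'E'}]

def count(s):
    """Number of transactions containing every item of s."""
    return sum(s <= t for t in transactions)

def rules_from(l, min_confidence):
    """Step 4 for one frequent itemset l: emit each rule s -> l - s whose
    confidence count(l) / count(s) meets min_confidence."""
    l = set(l)
    strong = []
    for r in range(1, len(l)):                      # proper nonempty subsets s
        for s in map(set, combinations(sorted(l), r)):
            confidence = count(l) / count(s)        # Sup(l)/Sup(s); the 1/n cancels
            if confidence >= min_confidence:
                strong.append((sorted(s), sorted(l - s), confidence))
    return strong

for lhs, rhs, c in rules_from({'A', 'B', 'C'}, min_confidence=0.75):
    print(lhs, '->', rhs, c)   # the three strong rules, each with confidence 0.75
```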

### 3. Code

1. Install the package:

```shell
pip install efficient-apriori   # requires Python 3.6 or later
```

```python
from efficient_apriori import apriori  # import the package

# Transaction data set
data = [('A', 'B', 'C', 'D'),
        ('A', 'B', 'C', 'E'),
        ('B', 'D', 'E', 'F'),
        ('B', 'C', 'D', 'E'),
        ('A', 'C', 'D', 'F'),
        ('A', 'B', 'C'),
        ('A', 'B', 'E')]

# Mine the frequent itemsets and the strong association rules
# min_support: minimum support
# min_confidence: minimum confidence
# itemsets: frequent itemsets
# rules: strong association rules
itemsets, rules = apriori(data, min_support=0.3, min_confidence=0.75)
print(itemsets)
print(rules)
```

```python
from efficient_apriori import apriori

# Transaction data set
data = [('milk', 'bread', 'diapers'),
        ('cola', 'bread', 'diapers', 'beer'),
        ('milk', 'diapers', 'beer', 'eggs'),
        ('bread', 'milk', 'diapers', 'beer'),
        ('bread', 'milk', 'diapers', 'cola')]

# Mine the frequent itemsets and the strong association rules
itemsets, rules = apriori(data, min_support=0.5, min_confidence=1)
print(itemsets)
print(rules)
```






https://chowdera.com/2022/328/202211242127010050.html