
Association Rules (3)

2022-11-24 21:33:11 · Yuhan Yuchen

Association Rules

1. Algorithm Background

1.1 The Story of Beer and Diapers

Walmart's "beer and diapers" is a classic case in the marketing world. The case was officially published in the 1998 Harvard Business Review, which should be regarded as the most authoritative account found so far. In American Walmart supermarkets in the 1990s, store managers analyzing sales data discovered a puzzling phenomenon: under certain circumstances, "beer" and "diapers", two seemingly unrelated items, would often appear in the same shopping basket. This unusual sales pattern caught the managers' attention, and a follow-up investigation found that it originated with young fathers. In American families with infants, the mother usually stays at home to look after the baby, and the young father goes to the supermarket to buy diapers. While buying diapers, the father often picks up some beer for himself along the way, and this is how beer and diapers, two seemingly unrelated products, came to appear together in the same shopping basket.

1.2 Thoughts Inspired by the Story

"Beer and diapers" is an association that was discovered through manual observation and analysis. Faced with such a huge number of supermarket transaction records, how can we discover correlations between items efficiently? In other words, can we formulate such an algorithm mathematically, so that it can run on a computer and find such rules efficiently? In 1993, the American researcher Agrawal proposed an association algorithm that analyzes the sets of items in shopping baskets to find relationships between products, and uses those relationships to characterize customers' purchasing behavior. From the perspective of mathematics and computer algorithms, Agrawal proposed a method for computing associations between products: the Apriori algorithm.

2. Algorithm Principles

2.1 A Guiding Example

Consider the following simple transaction list for a grocery store:

| TID | Items      |
|-----|------------|
| T1  | A, B, C, D |
| T2  | A, B, C, E |
| T3  | B, D, E, F |
| T4  | B, C, D, E |
| T5  | A, C, D, F |
| T6  | A, B, C    |
| T7  | A, B, E    |

There are 7 transaction records in total; A, B, C, and so on represent different commodities.

2.2 Basic Concepts

Concept 1 Transaction set: as in the table above, {ABCD, ABCE, BDEF, …}, the set formed by all of the transaction records.

Concept 2 Record (transaction): ABCD is called one record (transaction).

Concept 3 Item: A, B, C, … are each called an item.

Concept 4 Itemset: a set composed of items; e.g. {A,B,E,F} and {A,B,C} are itemsets.

Concept 5 K-itemset: an itemset with K elements; e.g. {A,B,E,F} is a 4-itemset.


Concept 6 Support:

$Sup(X)= \frac{\text{number of transactions in which itemset } X \text{ appears}}{\text{total number of transactions in the transaction set}}$

Simply put, it is a probability (frequency).

E.g. for the itemset X = {A,C}, Sup(X) = 4/7 ≈ 0.57.
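The support computation can be sketched in a few lines of Python. This is only an illustration, not the article's code; the `support` helper is a hypothetical name, and the 7 transactions are the ones from the table above (also used in the code section later).

```python
# Minimal sketch: support of an itemset over the article's 7 transactions.
transactions = [
    {'A', 'B', 'C', 'D'}, {'A', 'B', 'C', 'E'}, {'B', 'D', 'E', 'F'},
    {'B', 'C', 'D', 'E'}, {'A', 'C', 'D', 'F'}, {'A', 'B', 'C'}, {'A', 'B', 'E'},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item of `itemset`."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

print(round(support({'A', 'C'}, transactions), 2))  # 0.57
```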

(Concept 13 Association rule: an expression of the form X => Y, where X and Y are disjoint itemsets. For example, with X = {A} and Y = {B,C}, the association rule {A} => {B,C} indicates that a customer who buys A is very likely to also buy B and C.)

Concept 7 Confidence:

$Con(X=>Y)= \frac{Sup(XY)}{Sup(X)}$

Simply put, it is a conditional probability.

E.g. for X = {A}, Y = {C}:

$Con(X=>Y)= \frac{Sup(XY)}{Sup(X)} = \frac{4/7}{5/7} = 4/5 = 0.8$

Concept 8 Minimum support (min_support): a support threshold specified manually.

Concept 9 Minimum confidence (min_confidence): a confidence threshold specified manually.


Concept 10 Lift:

$Lift(X=>Y) = \frac{Con(X=>Y)}{Sup(Y)}$

Understood as the ratio of the probability that Y occurs given that X has occurred to the probability that Y occurs on its own.

E.g. for X = {A}, Y = {C}:

$Lift(X=>Y) = \frac{Con(X=>Y)}{Sup(Y)} = \frac{0.8}{5/7} = 1.12$
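Confidence and lift can be sketched the same way. Again an illustration rather than the article's code; `support`, `confidence`, and `lift` are hypothetical helper names, and the data is the article's 7-transaction set.

```python
# Sketch: confidence and lift for X = {A}, Y = {C} on the article's dataset.
transactions = [
    {'A', 'B', 'C', 'D'}, {'A', 'B', 'C', 'E'}, {'B', 'D', 'E', 'F'},
    {'B', 'C', 'D', 'E'}, {'A', 'C', 'D', 'F'}, {'A', 'B', 'C'}, {'A', 'B', 'E'},
]

def support(itemset):
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(X, Y):
    # Con(X => Y) = Sup(XY) / Sup(X), i.e. a conditional probability
    return support(X | Y) / support(X)

def lift(X, Y):
    # Lift(X => Y) = Con(X => Y) / Sup(Y)
    return confidence(X, Y) / support(Y)

print(round(confidence({'A'}, {'C'}), 2))  # 0.8
print(round(lift({'A'}, {'C'}), 2))        # 1.12
```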

Concept 11 Frequent K-itemset: a K-itemset whose support meets (is greater than or equal to) the minimum support.

For example, set the minimum support to 0.5. Then, since

$Sup(X)= 4/7 = 0.57 \ge 0.5$, the itemset X = {A,C} is a frequent 2-itemset.

Concept 12 Candidate K-itemset: a K-itemset used to generate the frequent K-itemsets. (Not all K-itemsets are candidates.)

Concept 13 Association rule: an expression of the form X => Y, where X and Y are disjoint itemsets. For example, with X = {A} and Y = {B,C}, the association rule {A} => {B,C} indicates that there is an association between the sales of A and the sales of B and C.

Concept 14 Strong association rule: an association rule that satisfies both the minimum support and the minimum confidence.

(Join step: on the basis of the frequent 2-itemsets $L_2$, generate the candidate 3-itemsets $C_3$. In general, a merge operation produces $C_{k+1}$ from $L_k$:

$C_{k+1} = L_k \times L_k = \{\, X \cup Y \mid X, Y \in L_k,\ |X \cap Y| = k-1 \,\}$.

Intuitively, the k-itemsets in $L_k$ can be ordered according to some fixed order, such as the lexicographic order of the item names or the support count of each item; if two frequent itemsets agree on their first k-1 items, they can be merged into one candidate itemset of length k+1. Doing so ensures the completeness of the candidate itemsets and avoids duplicates.)

Join step:

Sort the items within each itemset $l_i$ of $L_{k-1}$, so that $l_i[1] < l_i[2] < ... < l_i[k-1]$.

Perform the join $L_{k-1} \times L_{k-1}$, i.e. the itemsets in $L_{k-1}$ are joined with one another:

If the first k-2 items of $l_i$ and $l_j$ are the same, the two are joined; the result of the join is {l_i[1], l_i[2], ..., l_i[k-1], l_j[k-1]}.

All the join results together form the candidate K-itemset set $C_k$.
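The join step above can be sketched as follows. The `join` helper is a hypothetical name introduced for illustration; it assumes each frequent (k-1)-itemset is stored as a sorted tuple, and the sample $L_2$ is the set of frequent 2-itemsets from the worked example later in the article.

```python
# Sketch of the join step: two sorted (k-1)-itemsets are joined when their
# first k-2 items agree; the union forms a candidate k-itemset.
def join(L_prev):
    candidates = set()
    for i in range(len(L_prev)):
        for j in range(i + 1, len(L_prev)):
            a, b = L_prev[i], L_prev[j]
            if a[:-1] == b[:-1]:  # first k-2 items identical
                candidates.add(tuple(sorted(a + (b[-1],))))
    return sorted(candidates)

L2 = [('A', 'B'), ('A', 'C'), ('B', 'C'), ('B', 'D'), ('B', 'E'), ('C', 'D')]
print(join(L2))  # [('A', 'B', 'C'), ('B', 'C', 'D'), ('B', 'C', 'E'), ('B', 'D', 'E')]
```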

Prune step:

1. Take a candidate k-itemset $l_i$ from $C_k$ and find all of its (k-1)-item subsets.

2. Check whether all of the (k-1)-item subsets of $l_i$ are in $L_{k-1}$. If any (k-1)-item subset is not in $L_{k-1}$, delete $l_i$ from $C_k$; if they all are, keep $l_i$.

3. Loop to check each candidate k-itemset in $C_k$.

Finally, compute the support of each candidate k-itemset in $C_k$; if it is at least the minimum support, add it to $L_k$.
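The prune loop can likewise be sketched. `prune` is a hypothetical helper name; candidates and the members of $L_{k-1}$ are assumed to be sorted tuples, matching the join sketch.

```python
from itertools import combinations

# Sketch of the prune step: a candidate k-itemset survives only if all of
# its (k-1)-item subsets appear in L_{k-1}.
def prune(C_k, L_prev):
    prev = set(L_prev)
    return [c for c in C_k
            if all(s in prev for s in combinations(c, len(c) - 1))]

L2 = [('A', 'B'), ('A', 'C'), ('B', 'C'), ('B', 'D'), ('B', 'E'), ('C', 'D')]
C3 = [('A', 'B', 'C'), ('B', 'C', 'D'), ('B', 'C', 'E'), ('B', 'D', 'E')]
print(prune(C3, L2))  # [('A', 'B', 'C'), ('B', 'C', 'D')]
```

{B,C,E} and {B,D,E} are dropped because their subsets {C,E} and {D,E} are not frequent.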

1. Join step: generating the set of candidate k-itemsets

To compute $L_k$, by the Apriori property we select all joinable pairs from $L_{k-1}$ and join them to generate the set of candidate k-itemsets, denoted $C_k$. Assume the items within each itemset are sorted lexicographically; a joinable pair is then two frequent itemsets that differ only in their last item. For example, if elements $l_1$ and $l_2$ of $L_{k-1}$ are joinable, the two (k-1)-itemsets $l_1$ and $l_2$ differ only in their last item; this condition simply ensures that no duplicates are generated.

2. Prune step

This step quickly reduces the number of itemsets contained in $C_k$.

By the Apriori property, no infrequent (k-1)-itemset can be a subset of a frequent k-itemset. Therefore, if any (k-1)-item subset of a candidate k-itemset in $C_k$ is not in $L_{k-1}$, that candidate k-itemset cannot be frequent and can be deleted from $C_k$.

$C_k$ is a superset of $L_k$. After $C_k$ has been compressed by this subset test, the database is scanned to determine the count of each remaining candidate in $C_k$, and thereby determine $L_k$.

2.3 Two Theorems

Theorem 1: If X is a frequent K-itemset, then all of its subsets must also be frequent.

Theorem 2: If any (K-1)-item subset of X is not frequent, then X cannot be a frequent K-itemset.

2.4 Algorithm Flow

Step 1: Let K = 1; compute the support of each individual commodity (item) and filter out the frequent 1-itemsets. (For example, a minimum support of 0.3 corresponds to a count of 2.1 out of 7 transactions.)

Step 2: (starting from K = 2) Generate the candidate K-itemsets from the frequent (K-1)-itemsets, and pre-prune them.

Step 3: Generate the frequent K-itemsets from the candidate K-itemsets (filter out the k-itemsets that meet the minimum support).

Repeat steps 2 and 3 until no more sets meeting the minimum support can be produced. (End of the first phase.)

Step 4: Generate association rules from the frequent itemsets. Take each of the frequent k-itemsets obtained, k = {1, 2, …, K}, in turn. For each frequent itemset l, generate $2^k - 2$ association rules of the form s -> {l - s}, where s is a proper nonempty subset of l and l - s is l with s removed; compute each rule's confidence and report the strong association rules that meet the requirements. (End of the algorithm.)
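The four steps can be put together in a compact end-to-end sketch. This is an illustration of the flow under stated assumptions (set-based transactions, tuple itemsets, a hypothetical `apriori_sketch` function), not a production implementation and not the efficient-apriori library used later.

```python
from itertools import combinations

def apriori_sketch(transactions, min_support, min_confidence):
    n = len(transactions)
    count = lambda items: sum(1 for t in transactions if set(items) <= t)

    # Step 1: frequent 1-itemsets
    items = sorted({i for t in transactions for i in t})
    L = [(i,) for i in items if count((i,)) / n >= min_support]
    frequent = list(L)

    # Steps 2-3: join, prune, filter by support; repeat until L is empty
    while L:
        prev = set(L)
        C = sorted({tuple(sorted(a + (b[-1],)))
                    for a in L for b in L
                    if a[:-1] == b[:-1] and a[-1] < b[-1]})           # join
        C = [c for c in C
             if all(s in prev for s in combinations(c, len(c) - 1))]  # prune
        L = [c for c in C if count(c) / n >= min_support]
        frequent += L

    # Step 4: rules s -> (l - s) that meet the minimum confidence
    rules = []
    for l in frequent:
        for r in range(1, len(l)):
            for s in combinations(l, r):
                if count(l) / count(s) >= min_confidence:
                    rules.append((s, tuple(x for x in l if x not in s)))
    return frequent, rules

transactions = [{'A','B','C','D'}, {'A','B','C','E'}, {'B','D','E','F'},
                {'B','C','D','E'}, {'A','C','D','F'}, {'A','B','C'}, {'A','B','E'}]
frequent, rules = apriori_sketch(transactions, 0.3, 0.75)
print(max(frequent, key=len))  # ('A', 'B', 'C')
```

Note that rules are mined from every frequent itemset of size ≥ 2, not only from the largest one.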

2.5 Worked Example

(Set the minimum support to 0.3, i.e. a count of 2.1 out of 7 transactions, and the minimum confidence to 0.75.)


Step 1: Let K = 1; compute the support of each individual commodity (item) and filter out the frequent 1-itemsets. (The minimum support is 0.3 = 2.1/7.)

The supports of {A}, {B}, {C}, {D}, {E}, {F} are respectively 0.71, 0.86, 0.71, 0.57, 0.57, 0.29. So the frequent 1-itemsets are {A}, {B}, {C}, {D}, {E}.

Step 2: (K = 2) Generate the candidate K-itemsets from the frequent (K-1)-itemsets and pre-prune them. The frequent 1-itemsets are {A}, {B}, {C}, {D}, {E}.

How do we generate the candidate K-itemsets from the frequent (K-1)-itemsets? (The minimum support is 0.3 = 2.1/7.)

1. Generate the candidate K-itemsets

1.1 Join

Sort the items within each itemset $l_i$ of $L_{k-1}$, so that $l_i[1] < l_i[2] < ... < l_i[k-1]$.

Perform the join $L_{k-1} \times L_{k-1}$, i.e. the itemsets in $L_{k-1}$ are joined with one another:

If the first k-2 items of $l_i$ and $l_j$ are the same, the two are joined; the result of the join is {l_i[1], l_i[2], ..., l_i[k-1], l_j[k-1]}.

All the join results together form the candidate K-itemset set $C_k$.

This generates {A,B}, {A,C}, {A,D}, {A,E}, {B,C}, {B,D}, {B,E}, {C,D}, {C,E}, {D,E}.

1.2 Prune

Prune:

1. Take a candidate k-itemset $l_i$ from $C_k$ and find all of its (k-1)-item subsets.

2. Check whether all of the (k-1)-item subsets of $l_i$ are in $L_{k-1}$. If any (k-1)-item subset is not in $L_{k-1}$, delete $l_i$ from $C_k$; if they all are, keep $l_i$.

3. Loop to check each candidate k-itemset in $C_k$.

This yields {A,B}, {A,C}, {A,D}, {A,E}, {B,C}, {B,D}, {B,E}, {C,D}, {C,E}, {D,E} (every 1-item subset is frequent, so no candidate is pruned at this stage).

Step 3: Generate the frequent K-itemsets from the candidate K-itemsets (filter out the k-itemsets that meet the minimum support).

1. Compute the support of each candidate k-itemset in $C_k$; if it is ≥ the minimum support, add it to $L_k$.

| Candidate | Support      |
|-----------|--------------|
| {A,B}     | 4/7 ≈ 0.57   |
| {A,C}     | 4/7 ≈ 0.57   |
| {A,D}     | 2/7 ≈ 0.29   |
| {A,E}     | 2/7 ≈ 0.29   |
| {B,C}     | 4/7 ≈ 0.57   |
| {B,D}     | 3/7 ≈ 0.43   |
| {B,E}     | 4/7 ≈ 0.57   |
| {C,D}     | 3/7 ≈ 0.43   |
| {C,E}     | 2/7 ≈ 0.29   |
| {D,E}     | 2/7 ≈ 0.29   |

The frequent 2-itemsets obtained are {A,B}, {A,C}, {B,C}, {B,D}, {B,E}, {C,D}.
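These support figures can be checked with a short script (illustrative only; the variable names are hypothetical):

```python
from itertools import combinations

# Support of each candidate 2-itemset over the article's 7 transactions.
transactions = [{'A','B','C','D'}, {'A','B','C','E'}, {'B','D','E','F'},
                {'B','C','D','E'}, {'A','C','D','F'}, {'A','B','C'}, {'A','B','E'}]

candidates = list(combinations('ABCDE', 2))
supports = {c: sum(1 for t in transactions if set(c) <= t) / 7 for c in candidates}
frequent2 = [c for c in candidates if supports[c] >= 0.3]
print(frequent2)  # AB, AC, BC, BD, BE, CD
```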

Step 2: (K = 3) Generate the candidate K-itemsets from the frequent (K-1)-itemsets and pre-prune them. (The minimum support is 0.3.)

1. Generate the candidate K-itemsets

1.1 Join: {A,B,C}, {B,C,D}, {B,C,E}, {B,D,E}

1.2 Prune: {A,B,C}, {B,C,D}

Step 3: Generate the frequent K-itemsets from the candidate K-itemsets (filter out the k-itemsets that meet the minimum support).

1. Compute the support of each candidate k-itemset in $C_k$; if it is ≥ the minimum support, add it to $L_k$.

Sup(ABC) = 3/7, Sup(BCD) = 2/7

So {A,B,C} is the frequent 3-itemset obtained.
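A quick check of the two counts (illustrative; `count` is a hypothetical helper):

```python
# Count how many of the 7 transactions contain each candidate 3-itemset.
transactions = [{'A','B','C','D'}, {'A','B','C','E'}, {'B','D','E','F'},
                {'B','C','D','E'}, {'A','C','D','F'}, {'A','B','C'}, {'A','B','E'}]
count = lambda items: sum(1 for t in transactions if items <= t)
print(count({'A', 'B', 'C'}), count({'B', 'C', 'D'}))  # 3 2
```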


Step 4: Generate association rules from the frequent itemsets. Take each of the frequent k-itemsets obtained, k = {1, 2, …, K}, in turn. For each frequent itemset l, generate $2^k - 2$ association rules of the form s -> {l - s}, where s is a proper nonempty subset of l and l - s is l with s removed; compute each rule's confidence and report the strong association rules that meet the requirements. (End of the algorithm.)

When k = 3, we take the final frequent 3-itemset: {A,B,C}. (There is only one here; with a large amount of data there may be many.) Taking them in turn, the first is {A,B,C}; its proper nonempty subsets are {A}, {B}, {C}, {A,C}, {A,B}, {B,C}.

Form association rules in the manner s -> {l - s}:

{A} -> {B,C}, {B} -> {A,C}, {C} -> {A,B}, {A,B} -> {C}, {A,C} -> {B}, {B,C} -> {A}

Compute the confidence of each; association rules whose confidence is greater than or equal to the given minimum confidence are strong association rules. Taking {A} -> {B,C} as an example: Con({A} -> {B,C}) = Sup(A,B,C) / Sup(A) = 3/5 = 0.6.

Therefore {A} -> {B,C} is not a strong association rule.

Compute the confidence again, taking {B,C} -> {A} as an example: Con({B,C} -> {A}) = Sup(A,B,C) / Sup(B,C) = 3/4 = 0.75.

Therefore {B,C} -> {A} is a strong association rule.

Similarly, compute the confidence of the other association rules, and finally output the rules that meet the conditions.
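All six confidences can be computed in one loop (a sketch, not the article's code; `count` and `strong` are hypothetical names):

```python
from itertools import combinations

# Confidence of every rule s -> (l - s) for the frequent 3-itemset l = {A,B,C}.
transactions = [{'A','B','C','D'}, {'A','B','C','E'}, {'B','D','E','F'},
                {'B','C','D','E'}, {'A','C','D','F'}, {'A','B','C'}, {'A','B','E'}]
count = lambda items: sum(1 for t in transactions if set(items) <= t)

l = ('A', 'B', 'C')
strong = []
for r in (1, 2):
    for s in combinations(l, r):
        conf = count(l) / count(s)  # Con(s -> l-s) = Sup(l) / Sup(s)
        if conf >= 0.75:
            strong.append((s, tuple(x for x in l if x not in s)))
print(strong)  # the three strong rules: AB->C, AC->B, BC->A
```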

In the end, the strong association rules obtained from the frequent 3-itemset are:

{B, C} -> {A}, {A, C} -> {B}, {A, B} -> {C}

(Note that the confidence of {A} -> {B,C} is 0.6, so {B,C} -> {A} and {A} -> {B,C} are two different association rules.)

3. Code

1. Install the required package

pip install efficient-apriori (requires Python 3.6 or later)

```python
from efficient_apriori import apriori  # import the package

# Set up the dataset
data = [('A', 'B', 'C', 'D'),
        ('A', 'B', 'C', 'E'),
        ('B', 'D', 'E', 'F'),
        ('B', 'C', 'D', 'E'),
        ('A', 'C', 'D', 'F'),
        ('A', 'B', 'C'),
        ('A', 'B', 'E')]

# Mine frequent itemsets and strong rules
# min_support:    minimum support
# min_confidence: minimum confidence
# itemsets:       frequent itemsets
# rules:          strong association rules
itemsets, rules = apriori(data, min_support=0.3, min_confidence=0.75)
print(itemsets)
print(rules)
```
```python
from efficient_apriori import apriori

# Set up the dataset
data = [('milk', 'bread', 'diapers'),
        ('cola', 'bread', 'diapers', 'beer'),
        ('milk', 'diapers', 'beer', 'eggs'),
        ('bread', 'milk', 'diapers', 'beer'),
        ('bread', 'milk', 'diapers', 'cola')]

# Mine frequent itemsets and strong rules
itemsets, rules = apriori(data, min_support=0.5, min_confidence=1)
print(itemsets)
print(rules)
```



Copyright notice

This article was created by [Yuhan Yuchen]. Please include the original link when reposting. Thanks.
https://chowdera.com/2022/328/202211242127010050.html
