tranID で group-by した時グループに属するitemが1個のもの、を除外する

ファイルの準備

dat_singleをコピってitemが1つしかないtranIDの行を作成する。

axjack:dat $ diff dat_single.csv dat_single.v2.csv
35a36,41
> 11,ビール
> 12,ビール
> 13,ビール
> 14,ワイン
> 15,ワイン
> 16,ワイン

前処理

アソシエーション分析用のファイルはCSVから読み込むと良い、とのことなのでファイル読み込み→加工→ファイル書き出し→再度読み込みします。

# load library ####
library(dplyr)
library(readr)

# single形式のCSVファイルを読み込む
mytemp <- read_csv(file="dat/dat_single.v2.csv"
                ,col_types = c("cc")
                ,col_names = c("tranID","item")
                ,quote = '\"'
                #,skip = 1 #ヘッダがある場合は1行目をスキップする
                )

# transactionの長さ1ではないものを抽出し
mytemp %>% unique() %>% group_by(tranID) %>% filter( n() != 1 ) -> mytemp.f

# ファイルに書き出す
write_csv(mytemp.f,path="dat/dat_single.f.csv", col_names = FALSE)

末尾の６行が消えたか確認すると、

> tail(mytemp,   n=6)
# A tibble: 6 x 2
  tranID item  
  <chr>  <chr> 
1 11     ビール
2 12     ビール
3 13     ビール
4 14     ワイン
5 15     ワイン
6 16     ワイン
> tail(mytemp.f, n=6)
# A tibble: 6 x 2
# Groups:   tranID [3]
  tranID item      
  <chr>  <chr>     
1 8      チーズ    
2 9      ワイン    
3 9      牛肉      
4 9      じゃがいも
5 10     ワイン    
6 10     こんにゃく

消えました。ではCSVファイルをread.transactionsで読み込みます。

データの確認

###########################
# アソシエーション分析 ####
###########################
# load library #### 
library(arules)

# load CSV ####
# (1)元データ
ds <- read.transactions(
  file = "dat/dat_single.csv"
  , format = "single"
  , sep = ","
  , cols = c(1,2)
  , rm.duplicates = T
#  , skip = 1
)

# (2)レコード追加したデータ
dsv2 <- read.transactions(
  file = "dat/dat_single.v2.csv"
  , format = "single"
  , sep = ","
  , cols = c(1,2)
  , rm.duplicates = T
  #  , skip = 1
)

# (3) "(2)"からlength=1のtransactionを削除したデータ
dsf <- read.transactions(
  file = "dat/dat_single.f.csv"
  , format = "single"
  , sep = ","
  , cols = c(1,2)
  , rm.duplicates = T
  #  , skip = 1
)

トランザクションの行数列数を確認すると

> ds
transactions in sparse format with
 10 transactions (rows) and
 19 items (columns)
> dsv2
transactions in sparse format with
 16 transactions (rows) and
 19 items (columns)
> dsf
transactions in sparse format with
 10 transactions (rows) and
 19 items (columns)

たしかにdsv2は6トランザクション消すことに成功しています。summaryをみて見ると、

> summary(ds)
transactions as itemMatrix in sparse format with
 10 rows (elements/itemsets/transactions) and
 19 columns (items) and a density of 0.1842105 

most frequent items:
ソーセージ     ビール     ワイン       牛肉     生野菜    (Other) 
         3          3          3          3          3         20 

element (itemset/transaction) length distribution:
sizes
2 3 4 5 
1 4 4 1 

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
    2.0     3.0     3.5     3.5     4.0     5.0

> summary(dsv2)
transactions as itemMatrix in sparse format with
 16 rows (elements/itemsets/transactions) and
 19 columns (items) and a density of 0.1348684 

most frequent items:
    ビール     ワイン ソーセージ       牛肉     生野菜    (Other) 
         6          6          3          3          3         20 

element (itemset/transaction) length distribution:
sizes
1 2 3 4 5 
6 1 4 4 1 

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  1.000   1.000   3.000   2.562   4.000   5.000

よい感じです。あとはaprioriに作ったtransactionを食わせればアソシエーション分析の完成かと。ここから先は前回と同じなので省略。

axjack's blog

### axjack is said to be an abbreviation for An eXistent JApanese Cool Klutz ###

とりあえずアソシエーション分析する〜その２〜

tranID で group-by した時グループに属するitemが1個のもの、を除外する

ファイルの準備

前処理

データの確認

参考