axjack's blog

axjack is said to be an abbreviation for An eXistent JApanese Cool Klutz.

Predicting the next era name with R (preprocessing part)

Introduction

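Heisei (平成) ends in less than three months. Lots of people are predicting the next era name, so I want to try it myself. This post is the preprocessing part: for now I just collect the era names from Taika (大化) through Heisei. "Collect" here means web scraping, and for the scraping I use R's {rvest}.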

Code

Setting up rvest

Install the package and load it with library().

install.packages("rvest")
library(rvest)
library(dplyr) # used briefly for the test below

Test whether {rvest} works.

> read_html("http://example.com/") %>% html_nodes("p") %>% html_text
[1] "This domain is established to be used for illustrative examples in documents. You may use this\n    domain in examples without prior coordination or asking for permission."
[2] "More information..."   
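
The same check can also be run without network access. The following is a minimal sketch, assuming that read_html() accepts a literal HTML string; the inline page `html_src` is a hypothetical stand-in and not part of the original post.

```r
library(rvest)

# Hypothetical inline page, used only so the check needs no network access
html_src <- "<html><body><p>first paragraph</p><p>second paragraph</p></body></html>"

# Parse the string, select every <p> node, and extract its text
page <- read_html(html_src)
paragraphs <- html_text(html_nodes(page, "p"))
print(paragraphs)
```

If this prints the two paragraph strings, parsing and CSS selection both work, and any failure in the live test is more likely a connection problem than a package problem.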

Looks fine, so let's move on.

Web scraping with rvest → text processing (splitting into single characters)

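The era names come from the National Archives of Japan Digital Archive (www.digital.archives.go.jp). That page lists Japanese, Chinese, and Korean era names together, so later on I extract only the Japanese ones by brute force.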

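The code below goes in one pass from fetching the Japanese era names from the Digital Archive site through the text processing (splitting into single characters).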

# grab everything under table td in one go
src_url <- 'https://www.digital.archives.go.jp/DAS/meta/era#1'
gengou_html <- read_html(src_url)
gengou_html_nd <- html_nodes(gengou_html,"table td")

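# keep only the era names: every 5th cell, starting from the first
# (the mask TRUE,FALSE,FALSE,FALSE,FALSE is recycled; this assumes 5 cells per table row)
gengou_ <- html_text(gengou_html_nd)[1:5 %% 5 == 1]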

# find the index of 大化 (Taika), the oldest Japanese era name
which(gengou_ == "大化")
#[1] 247


# keep only the Japanese era names (247 is the index of 大化 found above)
gengou_j <- gengou_[1:247]

  
# check
head(gengou_j)
#[1] "平成" "昭和" "大正" "明治" "慶応" "元治"

tail(gengou_j)
#[1] "和銅" "慶雲" "大宝" "朱鳥" "白雉" "大化"

# split each name into single characters,
gengou_jParse <- strsplit(gengou_j,"")

# but the split returns a list,
head(gengou_jParse)[1]
#[[1]]
#[1] "平" "成"

# so convert the list to a vector
gengou_jParse2 <- unlist(gengou_jParse)
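
The two index tricks above (the recycled mask and the hand-typed 247) can be written more explicitly. Here is a rough sketch in base R, using a hypothetical vector `cells` in place of the scraped cell texts. The placeholder cells and the names `era_names`, `last_japanese`, and `era_names_j` are assumptions for illustration, and the sketch relies on the same layout the original code assumes: 5 cells per row, with the Japanese names first and 大化 last among them.

```r
# Hypothetical flattened cell texts: 5 cells per table row, era name first.
# The non-name cells are placeholders, not the real page content.
cells <- c(
  "平成", "a", "b", "c", "d",
  "昭和", "a", "b", "c", "d",
  "大化", "a", "b", "c", "d",
  "建元", "a", "b", "c", "d"   # stands in for a non-Japanese era name
)

# Explicit equivalent of the recycled mask `1:5 %% 5 == 1`
era_names <- cells[seq(1, length(cells), by = 5)]

# Use the position of 大化 instead of typing the index by hand
last_japanese <- which(era_names == "大化")
stopifnot(length(last_japanese) == 1)  # fail loudly if the name is missing or repeated

era_names_j <- era_names[seq_len(last_japanese)]
print(era_names_j)
```

Storing the result of which() means the cutoff follows the data if the page ever gains or loses a row, while the stopifnot() guards against the case where 大化 shows up more than once.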

Checking the result

The era names came out nicely.

> gengou_jParse2
  [1] "平" "成" "昭" "和" "大" "正" "明" "治" "慶" "応" "元" "治" "文" "久" "万"
 [16] "延" "安" "政" "嘉" "永" "弘" "化" "天" "保" "文" "政" "文" "化" "享" "和"
 [31] "寛" "政" "天" "明" "安" "永" "明" "和" "宝" "暦" "寛" "延" "延" "享" "寛"
 [46] "保" "元" "文" "享" "保" "正" "徳" "宝" "永" "元" "禄" "貞" "享" "天" "和"
 [61] "延" "宝" "寛" "文" "万" "治" "明" "暦" "承" "応" "慶" "安" "正" "保" "寛"
 [76] "永" "元" "和" "慶" "長" "文" "禄" "天" "正" "元" "亀" "永" "禄" "弘" "治"
 [91] "天" "文" "享" "禄" "大" "永" "永" "正" "文" "亀" "明" "応" "延" "徳" "長"
[106] "享" "文" "明" "応" "仁" "文" "正" "寛" "正" "長" "禄" "康" "正" "享" "徳"
[121] "宝" "徳" "文" "安" "嘉" "吉" "永" "享" "正" "長" "応" "永" "明" "徳" "康"
[136] "応" "嘉" "慶" "至" "徳" "永" "徳" "康" "暦" "永" "和" "応" "安" "貞" "治"
[151] "康" "安" "延" "文" "文" "和" "観" "応" "貞" "和" "康" "永" "暦" "応" "元"
[166] "中" "弘" "和" "天" "授" "文" "中" "建" "徳" "正" "平" "興" "国" "延" "元"
[181] "建" "武" "正" "慶" "元" "弘" "元" "徳" "嘉" "暦" "正" "中" "元" "亨" "元"
[196] "応" "文" "保" "正" "和" "応" "長" "延" "慶" "徳" "治" "嘉" "元" "乾" "元"
[211] "正" "安" "永" "仁" "正" "応" "弘" "安" "建" "治" "文" "永" "弘" "長" "文"
[226] "応" "正" "元" "正" "嘉" "康" "元" "建" "長" "宝" "治" "寛" "元" "仁" "治"
[241] "延" "応" "暦" "仁" "嘉" "禎" "文" "暦" "天" "福" "貞" "永" "寛" "喜" "安"
[256] "定" "嘉" "禄" "元" "仁" "貞" "応" "承" "久" "建" "保" "建" "暦" "承" "元"
[271] "建" "永" "元" "久" "建" "仁" "正" "治" "建" "久" "文" "治" "元" "暦" "寿"
[286] "永" "養" "和" "治" "承" "安" "元" "承" "安" "嘉" "応" "仁" "安" "永" "万"
[301] "長" "寛" "応" "保" "永" "暦" "平" "治" "保" "元" "久" "寿" "仁" "平" "久"
[316] "安" "天" "養" "康" "治" "永" "治" "保" "延" "長" "承" "天" "承" "大" "治"
[331] "天" "治" "保" "安" "元" "永" "永" "久" "天" "永" "天" "仁" "嘉" "承" "長"
[346] "治" "康" "和" "承" "徳" "永" "長" "嘉" "保" "寛" "治" "応" "徳" "永" "保"
[361] "承" "暦" "承" "保" "延" "久" "治" "暦" "康" "平" "天" "喜" "永" "承" "寛"
[376] "徳" "長" "久" "長" "暦" "長" "元" "万" "寿" "治" "安" "寛" "仁" "長" "和"
[391] "寛" "弘" "長" "保" "長" "徳" "正" "暦" "永" "祚" "永" "延" "寛" "和" "永"
[406] "観" "天" "元" "貞" "元" "天" "延" "天" "禄" "安" "和" "康" "保" "応" "和"
[421] "天" "徳" "天" "暦" "天" "慶" "承" "平" "延" "長" "延" "喜" "昌" "泰" "寛"
[436] "平" "仁" "和" "元" "慶" "貞" "観" "天" "安" "斉" "衡" "仁" "寿" "嘉" "祥"
[451] "承" "和" "天" "長" "弘" "仁" "大" "同" "延" "暦" "天" "応" "宝" "亀" "神"
[466] "護" "景" "雲" "天" "平" "神" "護" "天" "平" "宝" "字" "天" "平" "勝" "宝"
[481] "天" "平" "感" "宝" "天" "平" "神" "亀" "養" "老" "霊" "亀" "和" "銅" "慶"
[496] "雲" "大" "宝" "朱" "鳥" "白" "雉" "大" "化"
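
One cheap sanity check on a result like this is that the number of characters equals the summed length of the names. A small sketch in base R on a three-name sample (the variable names `sample_names` and `sample_chars` are hypothetical, and the sample is not the full list):

```r
# Small sample of era names, including one four-character name
sample_names <- c("平成", "昭和", "天平神護")

# Same split-then-flatten step as in the post
sample_chars <- unlist(strsplit(sample_names, ""))

# Every character of every name should appear exactly once in the flat vector
stopifnot(length(sample_chars) == sum(nchar(sample_names)))
print(length(sample_chars))
```

For the full output above, 504 characters from 247 names is consistent with 242 two-character names plus the five four-character names (天平感宝, 天平勝宝, 天平宝字, 天平神護, 神護景雲).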

Closing

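I don't know whether there will be a follow-up post, or when, but I'd like to use this data to come up with some kind of prediction before the new era name is announced on April 1.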
