言語処理100本ノック 2020 (NLP 100 Exercise 2020): Chapter 4, second half

nlp100.github.io

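This post walks through the second half (exercises 35-39) of Chapter 4, morphological analysis.
Making graphs with matplotlib is quite fun.
However, using Japanese text in matplotlib on a MacBook garbles the characters badly,
and I had a hard time working around that... The first half of Chapter 4 is covered in a separate post.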

Preparing matplotlib so that Japanese text is not garbled (MacBook)

On a MacBook, any Japanese text in a matplotlib plot gets thoroughly garbled and shows up as □□□.
If matplotlib is not installed yet, install it with pip install matplotlib.

My environment is macOS Catalina 10.15.4 with Python 3.8.2.

1. Downloading the IPAex Gothic font

First, download the IPAex Gothic font from the site below,
and copy ipaexg.ttf into ~/Library/Fonts.

moji.or.jp

2. Checking the matplotlib settings

Start python from the command line and enter the following to check your own matplotlib settings.

$ python
Python 3.8.2 (default, Apr 12 2020, 23:40:15) 
[Clang 11.0.0 (clang-1100.0.33.16)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import matplotlib
>>> print(matplotlib.matplotlib_fname())
/Users/saturn/.pyenv/versions/3.8.2/lib/python3.8/site-packages/matplotlib/mpl-data/matplotlibrc
>>> print(matplotlib.rcParams['font.family']) 
['sans-serif']
>>> print(matplotlib.get_configdir())
/Users/saturn/.matplotlib
>>> print(matplotlib.get_cachedir())
/Users/saturn/.matplotlib

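From this, I learned the following:

matplotlib.matplotlib_fname() gives the location of matplotlibrc, the matplotlib configuration file
matplotlib.rcParams['font.family'] shows that matplotlib is using the sans-serif font family
- No wonder the text gets garbled: the default sans-serif font has no Japanese glyphs...
matplotlib.get_configdir() and matplotlib.get_cachedir() give the locations of the
configuration directory and the cache directory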

3. Creating matplotlibrc

Create matplotlibrc in ~/.matplotlib/ with the contents below.
This customization makes IPAexGothic the default font.

font.family : IPAexGothic # default Font

4. Deleting all matplotlib cache files

Old settings can linger in the cache, so delete all the cache files.

rm -rf ~/.matplotlib/fontlist*.cache

5. Deleting fontlist.json

ls /Users/saturn/.matplotlib/
fontlist-v310.json matplotlibrc       tex.cache

In my case, the garbling went away once I deleted fontlist-v310.json.
The name of this fontlist.json file may differ depending on your environment, so check what is actually there.
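
To double-check that matplotlib actually picked up the font after these steps, something like the following could be run. This is only a sketch: it assumes matplotlib is installed and that the font registers under the family name IPAexGothic, the same name used in matplotlibrc in step 3. The variable name available is just for illustration.

```python
from matplotlib import font_manager

# Collect the family names of every font matplotlib has registered
available = {font.name for font in font_manager.fontManager.ttflist}

# True if the font copied into ~/Library/Fonts was found (assumed family name)
print('IPAexGothic' in available)
```

If this prints False, the font cache is probably stale, which is what steps 4 and 5 are meant to clear.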

Reference

tarao-mendo.blogspot.com

35. Word frequency

The trusty collections.Counter does the heavy lifting. Yet another thing I learned in competitive programming that came in handy.
I excluded punctuation marks, corner brackets, and the like up front.

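Key points

Use a list comprehension to drop every value that matches a condition from a list (the result is created as a new object)
collections.Counter(list) tallies the contents of a list easily
most_common(num) returns the top num entries of the tally, most frequent first
- If num is omitted, all entries are returned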
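
As a small illustration of these three points, here is a sketch on a toy list. The tokens list is a made-up stand-in for the surface forms read from neko.txt.mecab, and the names excluded, top1, and everything are only for this example.

```python
import collections

# Hypothetical token list standing in for the surface forms from neko.txt.mecab
tokens = ['吾輩', 'は', '猫', 'で', 'ある', '。', '名前', 'は', 'まだ', '無い', '。']

# Drop punctuation with a list comprehension; this builds a new list
excluded = {'、', '。', '\u3000', '「', '」'}
words = [t for t in tokens if t not in excluded]

counter = collections.Counter(words)
top1 = counter.most_common(1)       # only the most frequent entry
everything = counter.most_common()  # every entry, most frequent first
print(top1)  # [('は', 2)]
```

Using a set for the excluded symbols is just one way to write the condition; chaining != comparisons, as in the solution, behaves the same.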

Solution

# coding:utf-8
import collections


def neko_load():
    path = './neko.txt.mecab'
    neko = []
    sentence = []

    with open(path, encoding='utf-8') as f:
        for line in f:
            # Split each line into [surface form, everything else]
            tmp_line = line.rstrip('\n').split('\t')
            # An EOS line marks the end of a sentence; skip it if sentence is still empty
            if len(tmp_line) == 1 and len(sentence) != 0:
                neko.append(sentence)
                sentence = []
            elif len(tmp_line) == 2:
                # Split the part from the part of speech onward on commas
                tmp = tmp_line[1].split(',')
                morpheme = {'surface': tmp_line[0],
                            'base': tmp[6],
                            'pos': tmp[0],
                            'pos1': tmp[1]}
                sentence.append(morpheme)
    return neko


def main():
    cat = neko_load()
    # print(cat)
    cat_word = []

    for line in cat:
        for word in line:
            cat_word.append(word['surface'])

    # Remove punctuation, full-width spaces, and corner brackets
    cat_word = [item for item in cat_word if item != '、' and item != '。'
                and item != '\u3000' and item != '「' and item != '」']
    # print(cat_word)
    count_cat = collections.Counter(cat_word)
    print(count_cat.most_common())


if __name__ == "__main__":
    main()

The output is as follows. Only the first 10 entries are shown.

[('の', 9194), ('て', 6868), ('は', 6420), ('に', 6243), ('を', 6071), ('と', 5508), ('が', 5337), ('た', 3988), ('で', 3806), ('も', 2479)

Every one of them is a function word: mostly particles (た is an auxiliary verb).

Reference article

note.nkmk.me

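36. Top 10 most frequent words

Time for matplotlib. I struggled more with matplotlib's garbled text than with the exercise itself...
Only the top 10 are needed, so most_common(10) extracts just those.
The result is a list of tuples, so I build separate lists for the words and for the counts,
and pass those to matplotlib to draw the graph.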
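
The split into two lists can be sketched as follows, using the first three entries of the exercise 35 output as sample data. The name top3 is only for this example.

```python
# First three (word, count) tuples from the exercise 35 output
top3 = [('の', 9194), ('て', 6868), ('は', 6420)]

# One list per column: words for the x axis, counts for the bar heights
words = [word for word, _ in top3]
counts = [count for _, count in top3]
print(words)   # ['の', 'て', 'は']
print(counts)  # [9194, 6868, 6420]
```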

Key points

Use matplotlib.pyplot.bar() to draw a bar chart with matplotlib

pythondatascience.plavox.info

Solution script

# coding:utf-8
import collections
import matplotlib.pyplot as plt


def neko_load():
    path = './neko.txt.mecab'
    neko = []
    sentence = []

    with open(path, encoding='utf-8') as f:
        for line in f:
            # Split each line into [surface form, everything else]
            tmp_line = line.rstrip('\n').split('\t')
            # An EOS line marks the end of a sentence; skip it if sentence is still empty
            if len(tmp_line) == 1 and len(sentence) != 0:
                neko.append(sentence)
                sentence = []
            elif len(tmp_line) == 2:
                # Split the part from the part of speech onward on commas
                tmp = tmp_line[1].split(',')
                morpheme = {'surface': tmp_line[0],
                            'base': tmp[6],
                            'pos': tmp[0],
                            'pos1': tmp[1]}
                sentence.append(morpheme)
    return neko


def main():
    cat = neko_load()
    # print(cat)
    cat_word = []

    for line in cat:
        for word in line:
            cat_word.append(word['surface'])

    # Remove punctuation, full-width spaces, and corner brackets
    cat_word = [item for item in cat_word if item != '、' and item != '。'
                and item != '\u3000' and item != '「' and item != '」']
    # print(cat_word)
    count_cat = collections.Counter(cat_word)
    top10 = count_cat.most_common(10)
    # print(top10)
    top10_word = [item[0] for item in top10]
    top10_count = [item[1] for item in top10]
    print(top10_word)
    print(top10_count)
    plt.bar(top10_word, top10_count)
    plt.title('頻度上位10語')
    plt.show()


if __name__ == "__main__":
    main()

The output is as follows.

[Figure: bar chart titled 「頻度上位10語」 (top 10 most frequent words)]

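37. Top 10 words co-occurring most often with 「猫」 (cat)

Judging from the previous exercise, the result would surely be nothing but particles, which is no fun,
so I restricted the implementation to nouns and verbs that co-occur with 猫.
猫 itself also appears in large numbers, so I excluded it.

Here, "co-occurrence" means a word appearing together with 猫 within the same sentence.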
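
To make this sentence-level definition concrete, here is a minimal sketch on made-up, already-tokenized sentences. The sentences list and the names target and cooccur are hypothetical, and unlike the solution script this sketch does not filter by part of speech.

```python
import collections

# Hypothetical, already-tokenized sentences (surface forms only)
sentences = [
    ['吾輩', 'は', '猫', 'で', 'ある'],
    ['名前', 'は', 'まだ', '無い'],
    ['猫', 'は', '人間', 'を', '見る'],
]

target = '猫'
cooccur = collections.Counter()
for sentence in sentences:
    if target in sentence:
        # Count every other word in a sentence that contains the target
        cooccur.update(w for w in sentence if w != target)

print(cooccur.most_common(1))  # [('は', 2)]
```

The second sentence contains no 猫, so none of its words are counted.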

Solution script

# coding:utf-8
import collections
import matplotlib.pyplot as plt


def neko_load():
    path = './neko.txt.mecab'
    neko = []
    sentence = []

    with open(path, encoding='utf-8') as f:
        for line in f:
            # Split each line into [surface form, everything else]
            tmp_line = line.rstrip('\n').split('\t')
            # An EOS line marks the end of a sentence; skip it if sentence is still empty
            if len(tmp_line) == 1 and len(sentence) != 0:
                neko.append(sentence)
                sentence = []
            elif len(tmp_line) == 2:
                # Split the part from the part of speech onward on commas
                tmp = tmp_line[1].split(',')
                morpheme = {'surface': tmp_line[0],
                            'base': tmp[6],
                            'pos': tmp[0],
                            'pos1': tmp[1]}
                sentence.append(morpheme)
    return neko


def main():
    cat = neko_load()
    cat_word = []

    # Extract words: only nouns and verbs from sentences that contain '猫'
    for line in cat:
        if any(word['surface'] == '猫' for word in line):
            for word in line:
                if word['pos'] == '名詞' or word['pos'] == '動詞':
                    cat_word.append(word['surface'])

    # Remove punctuation, full-width spaces, corner brackets, and '猫' itself
    cat_word = [item for item in cat_word if item != '、' and item != '。'
                and item != '\u3000' and item != '「' and item != '」' and item != '猫']
    print(cat_word)

    count_cat = collections.Counter(cat_word)
    top10 = count_cat.most_common(10)

    top10_word = [item[0] for item in top10]
    top10_count = [item[1] for item in top10]

    plt.bar(top10_word, top10_count)
    plt.title('猫と共起する単語上位10語')
    plt.show()


if __name__ == "__main__":
    main()

The result is as follows. When it comes to 猫, it just has to be 吾輩, right?

[Figure: bar chart titled 「猫と共起する単語上位10語」 (top 10 words co-occurring with 猫)]

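38. Histogram

Since I was at it, I played around with matplotlib some more.
A first naive plot showed that nearly all words occur about 20 times at most,
so I decided to narrow the displayed range.

Unpacking with * turned out to be quietly handy for pulling out elements.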

note.nkmk.me

Using argument unpacking (*)

# From the solution script
cat_list = list(zip(*cat_freq))[1]

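cat_freq is a list of tuples pairing each word with its count,
such as [('の', 9194), ('て', 6868)].
Only the counts are needed here, so the list is unpacked with * and passed to zip,
which regroups the words and the counts into separate tuples; index 1 then picks out the counts.
The result is a single tuple holding all the counts.

More generally, * unpacking expands the elements of a list so they are passed as separate arguments.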
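
Here is a tiny sketch of what the unpacking does, using the two-entry list from the explanation above.

```python
# Two entries of the (word, count) list from the explanation above
cat_freq = [('の', 9194), ('て', 6868)]

# zip(*cat_freq) is the same call as zip(('の', 9194), ('て', 6868))
words, counts = zip(*cat_freq)
print(words)   # ('の', 'て')
print(counts)  # (9194, 6868)
```

Taking index 1 of list(zip(*cat_freq)), as the solution script does, yields the same counts tuple.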

Key points

Use matplotlib.pyplot.hist() to draw a histogram
To show the border of each bar, pass ec='colorname' to hist
To narrow the axis range, use xlim(xmin=xx, xmax=xx) and ylim(ymin=xx, ymax=xx)
To change where the ticks are shown, pass a list to xticks or yticks

 plt.xticks([0,5,10,15,20])
 # Put ticks on the x axis from 0 to 20 in steps of 5
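
What hist() tallies in this exercise can also be checked by hand: for each frequency, how many distinct words occur that often. The sketch below uses made-up counts (word_counts is a hypothetical stand-in for the Counter built from the novel) and only the standard library.

```python
import collections

# Hypothetical per-word counts standing in for Counter(cat_word)
word_counts = collections.Counter(
    {'の': 5, 'は': 3, '猫': 3, '吾輩': 1, '名前': 1, '人間': 1})

# Histogram data: frequency -> number of distinct words with that frequency
freq_of_freq = collections.Counter(word_counts.values())
for freq in sorted(freq_of_freq):
    print(freq, freq_of_freq[freq])
```

Here three words occur once, two occur three times, and one occurs five times, which is the shape the bars of the histogram would show.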

Solution script

# coding:utf-8
import collections
import matplotlib.pyplot as plt


def neko_load():
    path = './neko.txt.mecab'
    neko = []
    sentence = []

    with open(path, encoding='utf-8') as f:
        for line in f:
            # Split each line into [surface form, everything else]
            tmp_line = line.rstrip('\n').split('\t')
            # An EOS line marks the end of a sentence; skip it if sentence is still empty
            if len(tmp_line) == 1 and len(sentence) != 0:
                neko.append(sentence)
                sentence = []
            elif len(tmp_line) == 2:
                # Split the part from the part of speech onward on commas
                tmp = tmp_line[1].split(',')
                morpheme = {'surface': tmp_line[0],
                            'base': tmp[6],
                            'pos': tmp[0],
                            'pos1': tmp[1]}
                sentence.append(morpheme)
    return neko


def main():
    cat = neko_load()
    cat_word = []

    # Extract the words
    for line in cat:
        for word in line:
            cat_word.append(word['surface'])

    # Remove punctuation, full-width spaces, and corner brackets
    cat_word = [item for item in cat_word if item != '、' and item != '。'
                and item != '\u3000' and item != '「' and item != '」']
    # print(cat_word)

    count_cat = collections.Counter(cat_word)
    cat_freq = count_cat.most_common()
    # print(cat_freq)
    # Keep only the counts
    cat_list = list(zip(*cat_freq))[1]
    print(cat_list)

    plt.hist(cat_list, bins=20, range=(1, 20), color='salmon', ec='darkred')
    plt.title('吾輩は猫である　単語出現頻度')
    plt.xlim(xmin=1, xmax=20)
    plt.xticks([1, 5, 10, 15, 20])
    plt.xlabel('出現頻度')
    plt.ylabel('単語の種類数')
    plt.show()


if __name__ == "__main__":
    main()

[Figure: histogram titled 「吾輩は猫である　単語出現頻度」 (word frequency in 吾輩は猫である)]

I tidied up the look a little. Satisfied.

Reference

https://matplotlib.org/examples/color/named_colors.html

pythondatascience.plavox.info

39. Zipf's law

I looked at various options, and a scatter plot with logarithmic x and y axes seems to fit best.
(The Chapter 4 icon on the top page of 言語処理100本ノック 2020 is exactly that.)

What is Zipf's law?

mjin.doshisha.ac.jp

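After a bit of searching, I arrived at the site above.
When text data is analyzed, the frequency of the n-th most frequent element comes out roughly proportional to 1/n;
that is, the n-th ranked word occurs about 1/n as often as the top-ranked word.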
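
This is also why log axes suit the plot: if the frequency is C / n, then log(frequency) = log(C) - log(n), a straight line with slope -1. The sketch below checks this on ideal, made-up data (the constant C and the 100 ranks are arbitrary assumptions, not values from the novel) with a plain least-squares fit.

```python
import math

# Ideal Zipfian frequencies: the n-th ranked item occurs C / n times
C = 9000.0
ranks = list(range(1, 101))
freqs = [C / n for n in ranks]

# Least-squares slope of log(frequency) against log(rank)
xs = [math.log(n) for n in ranks]
ys = [math.log(f) for f in freqs]
mean_x = sum(xs) / len(xs)
mean_y = sum(ys) / len(ys)
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
print(round(slope, 6))  # -1.0
```

Real text only follows the law approximately, so the actual plot bends away from a perfect line.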

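Key points

Use matplotlib.pyplot.scatter() to draw a scatter plot
To make the x and y axes logarithmic, pass 'log' to xscale and yscale

Solution script

I heavily tweaked the look again this time. As a result, the plot markers became ★.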

# coding:utf-8
import collections
import matplotlib.pyplot as plt


def neko_load():
    path = './neko.txt.mecab'
    neko = []
    sentence = []

    with open(path, encoding='utf-8') as f:
        for line in f:
            # Split each line into [surface form, everything else]
            tmp_line = line.rstrip('\n').split('\t')
            # An EOS line marks the end of a sentence; skip it if sentence is still empty
            if len(tmp_line) == 1 and len(sentence) != 0:
                neko.append(sentence)
                sentence = []
            elif len(tmp_line) == 2:
                # Split the part from the part of speech onward on commas
                tmp = tmp_line[1].split(',')
                morpheme = {'surface': tmp_line[0],
                            'base': tmp[6],
                            'pos': tmp[0],
                            'pos1': tmp[1]}
                sentence.append(morpheme)
    return neko


def main():
    cat = neko_load()
    cat_word = []

    # Extract the words
    for line in cat:
        for word in line:
            cat_word.append(word['surface'])

    # Remove punctuation, full-width spaces, and corner brackets
    cat_word = [item for item in cat_word if item != '、' and item != '。'
                and item != '\u3000' and item != '「' and item != '」']
    # print(cat_word)

    count_cat = collections.Counter(cat_word)
    cat_freq = count_cat.most_common()
    # print(cat_freq)
    # Keep only the counts (already sorted from most to least frequent)
    cat_list = list(zip(*cat_freq))[1]
    # print(cat_list)
    # Ranks 1, 2, 3, ... in the same order as cat_list
    rank = list(range(1, len(cat_list) + 1))

    plt.scatter(rank, cat_list, c='salmon', s=30, marker='*', alpha=0.5,
                edgecolors='darkred', linewidths=1)
    plt.title('吾輩は猫である　Zipfの法則')
    plt.xscale('log')
    plt.yscale('log')
    plt.xlim(xmin=1, xmax=len(cat_list) + 1)
    plt.ylim(ymin=1, ymax=cat_list[0])
    plt.xlabel('単語の出現頻度順位')
    plt.ylabel('出現頻度')
    plt.grid(axis='both')
    plt.show()


if __name__ == "__main__":
    main()

[Figure: log-log scatter plot titled 「吾輩は猫である　Zipfの法則」 (Zipf's law in 吾輩は猫である)]

Reference

pythondatascience.plavox.info

Port 53

Tech notes for tomorrow
