Trying Out Review Classification with Natural Language Processing! Part 2 【図解速習DeepLearning】#013

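Hello! This is Kotaro.

Today we continue with part 2 of classifying reviews from a movie information site.

Yesterday's post covered a model built from an Embedding layer and an MLP.

Trying Out Review Classification with Natural Language Processing! Part 1 【図解速習DeepLearning】#012 - 福岡の社会人データサイエンティストの部屋

This time we use a pretrained embedding model published on TensorFlow Hub (TF-Hub) and apply transfer learning.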

1. Importing the required libraries

import tensorflow as tf
import tensorflow_hub as hub
import matplotlib.pyplot as plt
import numpy as np
import os
import pandas as pd
import re
import seaborn as sns

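2. Preparing the dataset

This time we work with the IMDB movie reviews, each labeled with a rating on a 10-point scale that indicates how positive the review is.

The dataset we load is already cleaned and formatted, so unlike last time no preprocessing is needed.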

def load_directory_data(directory):
  # Read every review file in the directory into a DataFrame.
  data = {}
  data["sentence"] = []
  data["sentiment"] = []
  for file_path in os.listdir(directory):
    with tf.io.gfile.GFile(os.path.join(directory, file_path), "r") as f:
      data["sentence"].append(f.read())
      # File names look like "<id>_<rating>.txt"; keep the rating part.
      data["sentiment"].append(re.match(r"\d+_(\d+)\.txt", file_path).group(1))
  return pd.DataFrame.from_dict(data)

def load_dataset(directory):
  pos_df = load_directory_data(os.path.join(directory, "pos"))
  neg_df = load_directory_data(os.path.join(directory, "neg"))
  pos_df["polarity"] = 1
  neg_df["polarity"] = 0
  return pd.concat([pos_df, neg_df]).sample(frac=1).reset_index(drop=True)

def download_and_load_datasets(force_download=False):
  dataset = tf.keras.utils.get_file(
      fname="aclImdb.tar.gz", 
      origin="http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz", 
      extract=True)
  
  train_df = load_dataset(os.path.join(os.path.dirname(dataset), 
                                       "aclImdb", "train"))
  test_df = load_dataset(os.path.join(os.path.dirname(dataset), 
                                      "aclImdb", "test"))
  
  return train_df, test_df

tf.compat.v1.logging.set_verbosity(tf.compat.v1.logging.ERROR)

train_df, test_df = download_and_load_datasets()
train_df.head()

[Figure: output of train_df.head()]
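
The loader above relies on the file naming convention of the aclImdb archive, where each review is stored as `<id>_<rating>.txt`. As a minimal sketch of just that parsing step (the helper name `parse_rating` and the sample file names are made up for illustration):

```python
import re

# Hypothetical helper: extract the rating from a file name such as "123_7.txt".
FILENAME_PATTERN = re.compile(r"\d+_(\d+)\.txt")

def parse_rating(file_name):
    """Return the rating encoded in the file name as an int, or None if the name does not match."""
    match = FILENAME_PATTERN.match(file_name)
    return int(match.group(1)) if match else None

# Made-up file names, not taken from the dataset.
for name in ["123_7.txt", "45_10.txt", "README"]:
    print(name, "->", parse_rating(name))
```

Returning None for non-matching names is a choice made for this sketch; the original loader assumes every file in the directory follows the convention.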

3. Building the model

【Input function】

We use TensorFlow's Estimator framework.
It provides an input function that wraps a Pandas DataFrame, so the DataFrame can be fed in as-is, which is very convenient.

train_input_fn = tf.compat.v1.estimator.inputs.pandas_input_fn(
    train_df, train_df["polarity"], num_epochs=None, shuffle=True)

predict_train_input_fn = tf.compat.v1.estimator.inputs.pandas_input_fn(
    train_df, train_df["polarity"], shuffle=False)

predict_test_input_fn = tf.compat.v1.estimator.inputs.pandas_input_fn(
    test_df, test_df["polarity"], shuffle=False)
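
Conceptually, an input function hands the estimator batches of features and labels. The following is only a rough stand-in written in plain Python to show what `num_epochs` and `shuffle` control; it is not how `pandas_input_fn` is implemented, and `iterate_batches`, its parameters, and the sample data are hypothetical.

```python
import random

def iterate_batches(rows, labels, batch_size, num_epochs=1, shuffle=False, seed=0):
    """Yield (rows, labels) batches; num_epochs=None repeats the data forever."""
    rng = random.Random(seed)
    indices = list(range(len(rows)))
    epoch = 0
    while num_epochs is None or epoch < num_epochs:
        if shuffle:
            rng.shuffle(indices)
        for start in range(0, len(indices), batch_size):
            chunk = indices[start:start + batch_size]
            yield [rows[i] for i in chunk], [labels[i] for i in chunk]
        epoch += 1

# Made-up sample data for illustration.
sentences = ["great film", "boring", "loved it", "awful", "fine"]
polarity = [1, 0, 1, 0, 1]

# One ordered pass over the data, similar in spirit to the evaluation input functions.
batches = list(iterate_batches(sentences, polarity, batch_size=2))
print(len(batches))  # 3 batches of sizes 2, 2 and 1
```

With `num_epochs=None` the generator never ends on its own, which is why the training call has to be bounded by a number of steps.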

【Feature column】

TF-Hub provides a feature column that applies a module to the given text and passes the module's output on to the model.
Here we use the nnlm-en-dim128 module.

embedded_text_feature_column = hub.text_embedding_column(
    key="sentence", 
    module_spec="https://tfhub.dev/google/nnlm-en-dim128/1")
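
To make the idea of a text embedding column concrete: the module maps each token to a vector and combines the token vectors into one fixed-length vector per sentence. The sketch below imitates that with a tiny made-up vocabulary of 3-dimensional vectors and a sum scaled by 1/sqrt(n). The vocabulary, the vector values, the handling of unknown tokens, and the choice of combiner are all assumptions for illustration; the real module uses learned 128-dimensional embeddings.

```python
import math

# Made-up toy vocabulary; the real module has its own vocabulary and 128 dimensions.
TOY_EMBEDDINGS = {
    "good": [1.0, 0.0, 2.0],
    "bad": [-1.0, 0.0, -2.0],
    "movie": [0.0, 3.0, 0.0],
}
UNKNOWN = [0.0, 0.0, 0.0]  # assumption: unknown tokens map to a zero vector here

def embed_sentence(sentence):
    """Split on whitespace, look up each token, and combine with sum / sqrt(n)."""
    tokens = sentence.lower().split()
    if not tokens:
        return list(UNKNOWN)
    vectors = [TOY_EMBEDDINGS.get(token, UNKNOWN) for token in tokens]
    summed = [sum(component) for component in zip(*vectors)]
    scale = math.sqrt(len(tokens))
    return [value / scale for value in summed]

print(embed_sentence("good movie"))
print(embed_sentence("bad movie"))
```

Whatever the length of the review, the classifier always receives a vector of the same size, which is what lets a plain feed-forward network sit on top of the embedding.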

【Estimator】

For the classifier, we use DNNClassifier.

estimator = tf.compat.v1.estimator.DNNClassifier(
    hidden_units=[500, 100],
    feature_columns=[embedded_text_feature_column],
    n_classes=2,
    optimizer=tf.compat.v1.train.AdagradOptimizer(learning_rate=0.003))
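
The optimizer passed here is Adagrad, which scales each parameter's step by the accumulated squared gradients. As a rough sketch of that update rule on a single scalar parameter (the function name, the toy loss, and the initial accumulator value of 0.1 are assumptions for illustration, not a reproduction of TensorFlow's implementation):

```python
import math

def adagrad_step(param, grad, accumulator, learning_rate=0.003):
    """One Adagrad update: accumulate grad^2, then scale the step by 1/sqrt(accumulator)."""
    accumulator = accumulator + grad * grad
    param = param - learning_rate * grad / math.sqrt(accumulator)
    return param, accumulator

# Toy loss f(w) = (w - 1)^2 with gradient 2 * (w - 1).
w, acc = 0.0, 0.1
for _ in range(1000):
    w, acc = adagrad_step(w, 2.0 * (w - 1.0), acc)
print(w)
```

Because the accumulator only grows, the effective step size shrinks over time, so parameters that have already seen large gradients move more cautiously.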

4. Training the model

estimator.train(input_fn=train_input_fn, steps=1000);

5. Evaluating the model

We run evaluation on both the training set and the test set.

train_eval_result = estimator.evaluate(input_fn=predict_train_input_fn)
test_eval_result = estimator.evaluate(input_fn=predict_test_input_fn)

print("Training set accuracy: {accuracy}".format(**train_eval_result))
print("Test set accuracy: {accuracy}".format(**test_eval_result))

[Figure: training set and test set accuracy output]

Let's look at how the misclassifications are distributed to understand the model better.
We check this with a confusion matrix.

def get_predictions(estimator, input_fn):
  return [x["class_ids"][0] for x in estimator.predict(input_fn=input_fn)]

LABELS = [ "negative", "positive"]

with tf.Graph().as_default():
  cm = tf.math.confusion_matrix(train_df["polarity"], 
                           get_predictions(estimator, predict_train_input_fn))
  with tf.compat.v1.Session() as session:
    cm_out = session.run(cm)

cm_out = cm_out.astype(float) / cm_out.sum(axis=1)[:, np.newaxis]

sns.heatmap(cm_out, annot=True, xticklabels=LABELS, yticklabels=LABELS);
plt.xlabel("Predicted");
plt.ylabel("True");

[Figure: heatmap of the row-normalized confusion matrix]
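
The normalization step divides each row of the confusion matrix by its row sum, so every row shows how the examples of one true class were distributed over the predicted classes. A plain-Python sketch with made-up labels (the helper names and the sample data are illustrative only):

```python
def confusion_matrix(true_labels, predicted_labels, num_classes=2):
    """Count matrix with rows = true class, columns = predicted class."""
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for truth, prediction in zip(true_labels, predicted_labels):
        matrix[truth][prediction] += 1
    return matrix

def normalize_rows(matrix):
    """Divide each row by its sum (rows with no examples stay all zero)."""
    normalized = []
    for row in matrix:
        total = sum(row)
        normalized.append([count / total if total else 0.0 for count in row])
    return normalized

# Made-up labels, not results from the model above.
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1, 0, 0]

cm = confusion_matrix(y_true, y_pred)
print(cm)                  # [[3, 1], [2, 2]]
print(normalize_rows(cm))  # [[0.75, 0.25], [0.5, 0.5]]
```

The diagonal of the normalized matrix is the fraction of each true class that was predicted correctly, which is the number read off the heatmap.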

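The above shows the results for binary classification.

Comparing predictions with the true labels, negative reviews are identified correctly in over 80% of cases and positive reviews in over 70%.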

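6. Going further: transfer learning

Because the data used here is labeled with ratings on a 10-point scale, switching from classification to regression would let us make use of that rating scale instead of just two classes.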
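
For regression, the `sentiment` column (stored as strings by the loader) would first have to be converted to numbers. A minimal sketch of that conversion, using a made-up list in place of the DataFrame column; the helper names and the mean-squared-error check are illustrative additions, not part of the original walkthrough:

```python
def to_regression_targets(sentiments):
    """Convert rating strings such as "7" into floats usable as regression labels."""
    return [float(value) for value in sentiments]

def mean_squared_error(targets, predictions):
    """Average squared difference between targets and predictions."""
    return sum((t - p) ** 2 for t, p in zip(targets, predictions)) / len(targets)

# Made-up ratings and predictions for illustration.
targets = to_regression_targets(["1", "4", "7", "10"])
predictions = [2.0, 4.0, 8.0, 9.0]
print(targets)
print(mean_squared_error(targets, predictions))  # 0.75
```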

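nnlm-en-dim128: a pretrained text embedding module
random-nnlm-en-dim128: a text embedding module with the same vocabulary and network as nnlm-en-dim128, but with randomly initialized weights that were never trained on real data

For transfer learning, we train each of these two modules in two modes.

Case 1: train only the classifier (the module stays frozen)

Case 2: train the module and the classifier together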

def train_and_evaluate_with_module(hub_module, train_module=False):
  embedded_text_feature_column = hub.text_embedding_column(
      key="sentence", module_spec=hub_module, trainable=train_module)

  estimator = tf.compat.v1.estimator.DNNClassifier(
      hidden_units=[500, 100],
      feature_columns=[embedded_text_feature_column],
      n_classes=2,
      optimizer=tf.compat.v1.train.AdagradOptimizer(learning_rate=0.003))

  estimator.train(input_fn=train_input_fn, steps=1000)

  train_eval_result = estimator.evaluate(input_fn=predict_train_input_fn)
  test_eval_result = estimator.evaluate(input_fn=predict_test_input_fn)

  training_set_accuracy = train_eval_result["accuracy"]
  test_set_accuracy = test_eval_result["accuracy"]

  return {
      "Training accuracy": training_set_accuracy,
      "Test accuracy": test_set_accuracy
  }


results = {}
results["nnlm-en-dim128"] = train_and_evaluate_with_module(
    "https://tfhub.dev/google/nnlm-en-dim128/1")
results["nnlm-en-dim128-with-module-training"] = train_and_evaluate_with_module(
    "https://tfhub.dev/google/nnlm-en-dim128/1", True)
results["random-nnlm-en-dim128"] = train_and_evaluate_with_module(
    "https://tfhub.dev/google/random-nnlm-en-dim128/1")
results["random-nnlm-en-dim128-with-module-training"] = train_and_evaluate_with_module(
    "https://tfhub.dev/google/random-nnlm-en-dim128/1", True)
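
The two cases differ only in whether the module's weights receive gradient updates. The toy sketch below shows that difference on a model with one "embedding" weight and one "classifier" weight; the model, the single training example, and all names are invented for illustration and say nothing about the accuracy of the real experiment.

```python
def train_toy_model(embedding, classifier, trainable_embedding, steps=100, learning_rate=0.05):
    """Fit prediction = classifier * embedding * x to one example by gradient descent."""
    x, target = 1.0, 2.0  # a single made-up training example
    for _ in range(steps):
        error = classifier * embedding * x - target
        # Gradients of the squared error with respect to each weight.
        grad_classifier = 2.0 * error * embedding * x
        grad_embedding = 2.0 * error * classifier * x
        classifier -= learning_rate * grad_classifier
        if trainable_embedding:
            # Only updated when the "module" is trainable (case 2).
            embedding -= learning_rate * grad_embedding
    return embedding, classifier

frozen = train_toy_model(embedding=0.5, classifier=1.0, trainable_embedding=False)
fine_tuned = train_toy_model(embedding=0.5, classifier=1.0, trainable_embedding=True)
print("frozen:    ", frozen)
print("fine-tuned:", fine_tuned)
```

In the frozen run the embedding weight comes back unchanged and the classifier has to do all the fitting, while in the fine-tuned run both weights move.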

Let's compare the results.

pd.DataFrame.from_dict(results, orient="index")

[Figure: table of training and test accuracy for the four configurations]

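Judging from these results, training both the module and the classifier appears to generalize better, that is, it gives higher test accuracy.

This may hold for other models as well: in transfer learning, also updating the weights of the source model seems to be more effective than keeping them frozen.

The earlier model behaved the same way.

That's all for today.

See you next time!

Bye for now.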

