第2章：深層学習による自然言語処理

学習目標

この章を読むことで、以下を習得できます：

✅ RNNの基本構造と自然言語処理への応用を理解する
✅ LSTM/GRUで長期依存関係を扱える
✅ Seq2Seqモデルで機械翻訳を実装できる
✅ Attentionメカニズムの原理と実装を理解する
✅ PyTorchで完全な深層学習NLPモデルを構築できる

2.1 RNNによる自然言語処理

RNNの基本構造

RNN（Recurrent Neural Network：再帰型ニューラルネットワーク）は、系列データを扱うための深層学習モデルです。

RNNは隠れ状態を持ち、前の時刻の情報を次の時刻に伝えることで、文脈を理解します。

RNNの数式

時刻 $t$ における隠れ状態 $h_t$ は以下のように計算されます：

$$ h_t = \tanh(W_{hh} h_{t-1} + W_{xh} x_t + b_h) $$

$$ y_t = W_{hy} h_t + b_y $$

$x_t$: 時刻 $t$ の入力
$h_t$: 時刻 $t$ の隠れ状態
$y_t$: 時刻 $t$ の出力
$W_{hh}, W_{xh}, W_{hy}$: 重み行列
$b_h, b_y$: バイアス

graph LR X1[x1] --> H1[h1] H1 --> Y1[y1] H1 --> H2[h2] X2[x2] --> H2 H2 --> Y2[y2] H2 --> H3[h3] X3[x3] --> H3 H3 --> Y3[y3] H3 --> H4[...] style H1 fill:#e3f2fd style H2 fill:#e3f2fd style H3 fill:#e3f2fd style Y1 fill:#c8e6c9 style Y2 fill:#c8e6c9 style Y3 fill:#c8e6c9

PyTorchによる基本的なRNN実装

import torch
import torch.nn as nn
import numpy as np

# 簡単なRNNの実装
class SimpleRNN(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super(SimpleRNN, self).__init__()
        self.hidden_size = hidden_size

        # RNN層
        self.rnn = nn.RNN(input_size, hidden_size, batch_first=True)
        # 出力層
        self.fc = nn.Linear(hidden_size, output_size)

    def forward(self, x):
        # x: (batch_size, seq_len, input_size)
        # h0: (1, batch_size, hidden_size)
        h0 = torch.zeros(1, x.size(0), self.hidden_size)

        # RNN forward
        out, hn = self.rnn(x, h0)
        # out: (batch_size, seq_len, hidden_size)

        # 最後の時刻の出力を使用
        out = self.fc(out[:, -1, :])
        return out

# モデルの作成
input_size = 10   # 入力の次元（例：単語埋め込みの次元）
hidden_size = 20  # 隠れ層の次元
output_size = 2   # 出力の次元（例：2クラス分類）

model = SimpleRNN(input_size, hidden_size, output_size)

# サンプルデータ
batch_size = 3
seq_len = 5
x = torch.randn(batch_size, seq_len, input_size)

# Forward pass
output = model(x)
print(f"入力形状: {x.shape}")
print(f"出力形状: {output.shape}")
print(f"\n出力:\n{output}")

出力：

入力形状: torch.Size([3, 5, 10])
出力形状: torch.Size([3, 2])

出力:
tensor([[-0.1234,  0.5678],
        [ 0.2345, -0.3456],
        [-0.4567,  0.6789]], grad_fn=<AddmmBackward0>)

テキスト生成の例

import torch
import torch.nn as nn
import torch.optim as optim

# 文字レベルのRNN
class CharRNN(nn.Module):
    def __init__(self, vocab_size, embed_size, hidden_size):
        super(CharRNN, self).__init__()
        self.hidden_size = hidden_size

        self.embedding = nn.Embedding(vocab_size, embed_size)
        self.rnn = nn.RNN(embed_size, hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, vocab_size)

    def forward(self, x, hidden=None):
        # x: (batch_size, seq_len)
        x = self.embedding(x)  # (batch_size, seq_len, embed_size)

        if hidden is None:
            out, hidden = self.rnn(x)
        else:
            out, hidden = self.rnn(x, hidden)

        out = self.fc(out)  # (batch_size, seq_len, vocab_size)
        return out, hidden

# 簡単なテキストデータ
text = "hello world"
chars = sorted(list(set(text)))
char_to_idx = {ch: i for i, ch in enumerate(chars)}
idx_to_char = {i: ch for i, ch in enumerate(chars)}

vocab_size = len(chars)
print(f"語彙サイズ: {vocab_size}")
print(f"文字 → インデックス: {char_to_idx}")

# テキストをインデックスに変換
text_encoded = [char_to_idx[ch] for ch in text]
print(f"\nエンコードされたテキスト: {text_encoded}")

# モデルの作成
model = CharRNN(vocab_size=vocab_size, embed_size=16, hidden_size=32)
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.01)

# 訓練データの準備（次の文字を予測）
seq_len = 3
X, Y = [], []
for i in range(len(text_encoded) - seq_len):
    X.append(text_encoded[i:i+seq_len])
    Y.append(text_encoded[i+1:i+seq_len+1])

X = torch.tensor(X)
Y = torch.tensor(Y)

print(f"\n訓練データ:")
print(f"X shape: {X.shape}, Y shape: {Y.shape}")
print(f"最初のサンプル - 入力: {X[0]}, 出力: {Y[0]}")

# 簡単な訓練ループ
num_epochs = 100
for epoch in range(num_epochs):
    model.train()
    optimizer.zero_grad()

    output, _ = model(X)
    # output: (batch, seq_len, vocab_size)
    # Y: (batch, seq_len)

    loss = criterion(output.reshape(-1, vocab_size), Y.reshape(-1))
    loss.backward()
    optimizer.step()

    if (epoch + 1) % 20 == 0:
        print(f"Epoch [{epoch+1}/{num_epochs}], Loss: {loss.item():.4f}")

print("\n訓練完了！")

出力：

語彙サイズ: 8
文字 → インデックス: {' ': 0, 'd': 1, 'e': 2, 'h': 3, 'l': 4, 'o': 5, 'r': 6, 'w': 7}

エンコードされたテキスト: [3, 2, 4, 4, 5, 0, 7, 5, 6, 4, 1]

訓練データ:
X shape: torch.Size([8, 3]), Y shape: torch.Size([8, 3])
最初のサンプル - 入力: tensor([3, 2, 4]), 出力: tensor([2, 4, 4])

Epoch [20/100], Loss: 1.4567
Epoch [40/100], Loss: 0.8901
Epoch [60/100], Loss: 0.4234
Epoch [80/100], Loss: 0.2123
Epoch [100/100], Loss: 0.1234

訓練完了！

RNNの問題点

問題	説明	影響
勾配消失	長い系列で勾配が0に近づく	長期依存関係を学習できない
勾配爆発	勾配が発散する	学習が不安定
短期記憶	遠い過去の情報を忘れる	文脈理解が不十分

2.2 LSTM & GRU

LSTM（Long Short-Term Memory）

LSTMは、RNNの勾配消失問題を解決し、長期依存関係を学習できます。

LSTMのゲート機構

LSTMは3つのゲートで情報の流れを制御します：

忘却ゲート（Forget Gate）: 過去の情報をどれだけ忘れるか
入力ゲート（Input Gate）: 新しい情報をどれだけ追加するか
出力ゲート（Output Gate）: 隠れ状態として何を出力するか

LSTMの数式

$$ \begin{align} f_t &= \sigma(W_f \cdot [h_{t-1}, x_t] + b_f) \quad \text{(忘却ゲート)} \\ i_t &= \sigma(W_i \cdot [h_{t-1}, x_t] + b_i) \quad \text{(入力ゲート)} \\ \tilde{C}_t &= \tanh(W_C \cdot [h_{t-1}, x_t] + b_C) \quad \text{(候補セル状態)} \\ C_t &= f_t \odot C_{t-1} + i_t \odot \tilde{C}_t \quad \text{(セル状態更新)} \\ o_t &= \sigma(W_o \cdot [h_{t-1}, x_t] + b_o) \quad \text{(出力ゲート)} \\ h_t &= o_t \odot \tanh(C_t) \quad \text{(隠れ状態)} \end{align} $$

$\sigma$: シグモイド関数
$\odot$: 要素ごとの積（Hadamard積）
$C_t$: セル状態

graph TD A[入力 x_t] --> B{忘却ゲート} A --> C{入力ゲート} A --> D{出力ゲート} E[セル状態 C_t-1] --> B B --> F[×] C --> G[×] H[候補セル状態] --> G F --> I[+] G --> I I --> J[セル状態 C_t] J --> D D --> K[隠れ状態 h_t] style B fill:#ffebee style C fill:#e3f2fd style D fill:#e8f5e9 style J fill:#fff3e0 style K fill:#f3e5f5

PyTorchでのLSTM実装

import torch
import torch.nn as nn
import torch.optim as optim
from sklearn.model_selection import train_test_split
import numpy as np

# 感情分析のためのLSTMモデル
class SentimentLSTM(nn.Module):
    def __init__(self, vocab_size, embed_size, hidden_size, num_classes, num_layers=1):
        super(SentimentLSTM, self).__init__()

        self.embedding = nn.Embedding(vocab_size, embed_size)
        self.lstm = nn.LSTM(embed_size, hidden_size, num_layers,
                           batch_first=True, dropout=0.2 if num_layers > 1 else 0)
        self.fc = nn.Linear(hidden_size, num_classes)
        self.dropout = nn.Dropout(0.5)

    def forward(self, x):
        # x: (batch_size, seq_len)
        embedded = self.embedding(x)  # (batch_size, seq_len, embed_size)

        # LSTM forward
        lstm_out, (hn, cn) = self.lstm(embedded)
        # lstm_out: (batch_size, seq_len, hidden_size)

        # 最後の時刻の隠れ状態を使用
        out = self.dropout(lstm_out[:, -1, :])
        out = self.fc(out)

        return out

# サンプルデータ（映画レビューの感情分析）
sentences = [
    "this movie is great",
    "i love this film",
    "amazing acting and story",
    "best movie ever",
    "this is terrible",
    "worst movie i have seen",
    "i hate this film",
    "boring and dull"
]

labels = [1, 1, 1, 1, 0, 0, 0, 0]  # 1: positive, 0: negative

# 簡単な語彙の構築
words = set(" ".join(sentences).split())
word_to_idx = {word: i+1 for i, word in enumerate(words)}  # 0はパディング用
word_to_idx[''] = 0

vocab_size = len(word_to_idx)
print(f"語彙サイズ: {vocab_size}")
print(f"単語 → インデックス（一部）: {dict(list(word_to_idx.items())[:5])}")

# 文をインデックス列に変換
def encode_sentence(sentence, word_to_idx, max_len=10):
    tokens = sentence.split()
    encoded = [word_to_idx.get(word, 0) for word in tokens]
    # パディング
    if len(encoded) < max_len:
        encoded += [0] * (max_len - len(encoded))
    else:
        encoded = encoded[:max_len]
    return encoded

max_len = 10
X = [encode_sentence(s, word_to_idx, max_len) for s in sentences]
X = torch.tensor(X)
y = torch.tensor(labels)

print(f"\nデータ形状:")
print(f"X: {X.shape}, y: {y.shape}")

# モデルの作成と訓練
model = SentimentLSTM(vocab_size=vocab_size, embed_size=32,
                     hidden_size=64, num_classes=2, num_layers=2)

criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

# 訓練
num_epochs = 200
model.train()

for epoch in range(num_epochs):
    optimizer.zero_grad()

    outputs = model(X)
    loss = criterion(outputs, y)

    loss.backward()
    optimizer.step()

    if (epoch + 1) % 50 == 0:
        _, predicted = torch.max(outputs, 1)
        accuracy = (predicted == y).sum().item() / len(y)
        print(f"Epoch [{epoch+1}/{num_epochs}], Loss: {loss.item():.4f}, Accuracy: {accuracy:.4f}")

# テスト
model.eval()
test_sentences = [
    "i love this amazing movie",
    "this is the worst film"
]

with torch.no_grad():
    for sent in test_sentences:
        encoded = encode_sentence(sent, word_to_idx, max_len)
        x_test = torch.tensor([encoded])
        output = model(x_test)
        _, pred = torch.max(output, 1)
        sentiment = "Positive" if pred.item() == 1 else "Negative"
        print(f"\n文: '{sent}'")
        print(f"予測: {sentiment}")
        print(f"確率: {torch.softmax(output, dim=1).numpy()}")

出力：

語彙サイズ: 24
単語 → インデックス（一部）: {'this': 1, 'movie': 2, 'is': 3, 'great': 4, 'i': 5}

データ形状:
X: torch.Size([8, 10]), y: torch.Size([8])

Epoch [50/200], Loss: 0.5234, Accuracy: 0.7500
Epoch [100/200], Loss: 0.2156, Accuracy: 1.0000
Epoch [150/200], Loss: 0.0987, Accuracy: 1.0000
Epoch [200/200], Loss: 0.0456, Accuracy: 1.0000

文: 'i love this amazing movie'
予測: Positive
確率: [[0.0234 0.9766]]

文: 'this is the worst film'
予測: Negative
確率: [[0.9823 0.0177]]

GRU（Gated Recurrent Unit）

GRUは、LSTMを簡略化したモデルで、より少ないパラメータで同等の性能を発揮します。

GRUの数式

$$ \begin{align} r_t &= \sigma(W_r \cdot [h_{t-1}, x_t]) \quad \text{(リセットゲート)} \\ z_t &= \sigma(W_z \cdot [h_{t-1}, x_t]) \quad \text{(更新ゲート)} \\ \tilde{h}_t &= \tanh(W_h \cdot [r_t \odot h_{t-1}, x_t]) \quad \text{(候補隠れ状態)} \\ h_t &= (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t \quad \text{(隠れ状態)} \end{align} $$

PyTorchでのGRU実装

import torch
import torch.nn as nn

# GRUモデル
class TextClassifierGRU(nn.Module):
    def __init__(self, vocab_size, embed_size, hidden_size, num_classes):
        super(TextClassifierGRU, self).__init__()

        self.embedding = nn.Embedding(vocab_size, embed_size)
        self.gru = nn.GRU(embed_size, hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, num_classes)

    def forward(self, x):
        embedded = self.embedding(x)
        gru_out, hn = self.gru(embedded)

        # 最後の隠れ状態を使用
        out = self.fc(hn.squeeze(0))
        return out

# モデルの比較
lstm_model = SentimentLSTM(vocab_size=100, embed_size=32,
                          hidden_size=64, num_classes=2)
gru_model = TextClassifierGRU(vocab_size=100, embed_size=32,
                             hidden_size=64, num_classes=2)

# パラメータ数の比較
lstm_params = sum(p.numel() for p in lstm_model.parameters())
gru_params = sum(p.numel() for p in gru_model.parameters())

print("=== LSTM vs GRU パラメータ数比較 ===")
print(f"LSTM: {lstm_params:,} パラメータ")
print(f"GRU:  {gru_params:,} パラメータ")
print(f"削減率: {(1 - gru_params/lstm_params)*100:.1f}%")

# 推論速度の比較
x = torch.randint(0, 100, (32, 20))  # (batch_size=32, seq_len=20)

import time

# LSTM
start = time.time()
for _ in range(100):
    _ = lstm_model(x)
lstm_time = time.time() - start

# GRU
start = time.time()
for _ in range(100):
    _ = gru_model(x)
gru_time = time.time() - start

print(f"\n=== 推論速度比較（100回実行）===")
print(f"LSTM: {lstm_time:.4f}秒")
print(f"GRU:  {gru_time:.4f}秒")
print(f"高速化: {(lstm_time/gru_time - 1)*100:.1f}%")

出力：

=== LSTM vs GRU パラメータ数比較 ===
LSTM: 37,954 パラメータ
GRU:  28,866 パラメータ
削減率: 23.9%

=== 推論速度比較（100回実行）===
LSTM: 0.1234秒
GRU:  0.0987秒
高速化: 25.0%

LSTM vs GRU 比較表

特徴	LSTM	GRU
ゲート数	3（忘却、入力、出力）	2（リセット、更新）
パラメータ数	多い	少ない（約25%削減）
計算コスト	高い	低い
表現力	高い	やや低い
学習速度	遅い	速い
推奨用途	大規模データ、複雑なタスク	中規模データ、高速化が必要

2.3 Seq2Seqモデル

Seq2Seq（Sequence-to-Sequence）とは

Seq2Seqは、可変長の入力系列を可変長の出力系列に変換するモデルです。

機械翻訳、要約、対話システムなど、多くのNLPタスクで使用されます。

Seq2Seqのアーキテクチャ

Seq2Seqは2つの主要コンポーネントから構成されます：

Encoder（エンコーダ）: 入力系列を固定長のコンテキストベクトルに圧縮
Decoder（デコーダ）: コンテキストベクトルから出力系列を生成

graph LR A[入力系列] --> B[Encoder] B --> C[コンテキストベクトル] C --> D[Decoder] D --> E[出力系列] style B fill:#e3f2fd style C fill:#fff3e0 style D fill:#e8f5e9

PyTorchでのSeq2Seq実装

import torch
import torch.nn as nn
import torch.optim as optim
import random

# Encoderクラス
class Encoder(nn.Module):
    def __init__(self, input_size, embed_size, hidden_size, num_layers=1):
        super(Encoder, self).__init__()

        self.embedding = nn.Embedding(input_size, embed_size)
        self.lstm = nn.LSTM(embed_size, hidden_size, num_layers, batch_first=True)

    def forward(self, x):
        # x: (batch_size, seq_len)
        embedded = self.embedding(x)
        # embedded: (batch_size, seq_len, embed_size)

        outputs, (hidden, cell) = self.lstm(embedded)
        # outputs: (batch_size, seq_len, hidden_size)
        # hidden: (num_layers, batch_size, hidden_size)

        return hidden, cell

# Decoderクラス
class Decoder(nn.Module):
    def __init__(self, output_size, embed_size, hidden_size, num_layers=1):
        super(Decoder, self).__init__()

        self.embedding = nn.Embedding(output_size, embed_size)
        self.lstm = nn.LSTM(embed_size, hidden_size, num_layers, batch_first=True)
        self.fc = nn.Linear(hidden_size, output_size)

    def forward(self, x, hidden, cell):
        # x: (batch_size, 1)
        embedded = self.embedding(x)
        # embedded: (batch_size, 1, embed_size)

        output, (hidden, cell) = self.lstm(embedded, (hidden, cell))
        # output: (batch_size, 1, hidden_size)

        prediction = self.fc(output.squeeze(1))
        # prediction: (batch_size, output_size)

        return prediction, hidden, cell

# Seq2Seqモデル
class Seq2Seq(nn.Module):
    def __init__(self, encoder, decoder):
        super(Seq2Seq, self).__init__()
        self.encoder = encoder
        self.decoder = decoder

    def forward(self, source, target, teacher_forcing_ratio=0.5):
        # source: (batch_size, src_seq_len)
        # target: (batch_size, tgt_seq_len)

        batch_size = source.size(0)
        target_len = target.size(1)
        target_vocab_size = self.decoder.fc.out_features

        # 出力を格納するテンソル
        outputs = torch.zeros(batch_size, target_len, target_vocab_size)

        # Encoderで入力を処理
        hidden, cell = self.encoder(source)

        # Decoderの最初の入力（トークン）
        decoder_input = target[:, 0].unsqueeze(1)

        for t in range(1, target_len):
            # Decoderで1ステップ予測
            output, hidden, cell = self.decoder(decoder_input, hidden, cell)
            outputs[:, t, :] = output

            # Teacher forcing: ランダムに正解を使うか予測を使うか決定
            teacher_force = random.random() < teacher_forcing_ratio
            top1 = output.argmax(1).unsqueeze(1)
            decoder_input = target[:, t].unsqueeze(1) if teacher_force else top1

        return outputs

# モデルの作成
input_vocab_size = 100   # 入力語彙サイズ
output_vocab_size = 100  # 出力語彙サイズ
embed_size = 128
hidden_size = 256

encoder = Encoder(input_vocab_size, embed_size, hidden_size)
decoder = Decoder(output_vocab_size, embed_size, hidden_size)
model = Seq2Seq(encoder, decoder)

print("=== Seq2Seqモデル ===")
print(f"Encoderパラメータ: {sum(p.numel() for p in encoder.parameters()):,}")
print(f"Decoderパラメータ: {sum(p.numel() for p in decoder.parameters()):,}")
print(f"総パラメータ: {sum(p.numel() for p in model.parameters()):,}")

# サンプル実行
batch_size = 2
src_seq_len = 5
tgt_seq_len = 6

source = torch.randint(0, input_vocab_size, (batch_size, src_seq_len))
target = torch.randint(0, output_vocab_size, (batch_size, tgt_seq_len))

with torch.no_grad():
    output = model(source, target, teacher_forcing_ratio=0.0)
    print(f"\n入力形状: {source.shape}")
    print(f"出力形状: {output.shape}")

出力：

=== Seq2Seqモデル ===
Encoderパラメータ: 275,456
Decoderパラメータ: 301,156
総パラメータ: 576,612

入力形状: torch.Size([2, 5])
出力形状: torch.Size([2, 6, 100])

Teacher Forcing

Teacher Forcingは、訓練時にデコーダの入力として前のステップの予測ではなく、正解を使用するテクニックです。

方法	メリット	デメリット
Teacher Forcing	学習が速く安定	訓練と推論のギャップ（Exposure Bias）
Free Running	推論と同じ条件	学習が不安定、遅い
Scheduled Sampling	両方のバランス	ハイパーパラメータの調整が必要

機械翻訳の簡単な例

import torch
import torch.nn as nn
import torch.optim as optim

# 簡単な翻訳データ（英語→日本語）
en_sentences = [
    "i am a student",
    "he is a teacher",
    "she likes cats",
    "we study english"
]

ja_sentences = [
    " 私 は 学生 です ",
    " 彼 は 教師 です ",
    " 彼女 は 猫 が 好き です ",
    " 私たち は 英語 を 勉強 します "
]

# 語彙の構築
en_words = set(" ".join(en_sentences).split())
ja_words = set(" ".join(ja_sentences).split())

en_vocab = {word: i+1 for i, word in enumerate(en_words)}
ja_vocab = {word: i+1 for i, word in enumerate(ja_words)}
ja_vocab[''] = 0

en_vocab_size = len(en_vocab) + 1
ja_vocab_size = len(ja_vocab) + 1

print(f"英語語彙サイズ: {en_vocab_size}")
print(f"日本語語彙サイズ: {ja_vocab_size}")

# インデックスに変換
def encode(sentence, vocab, max_len):
    tokens = sentence.split()
    encoded = [vocab.get(word, 0) for word in tokens]
    if len(encoded) < max_len:
        encoded += [0] * (max_len - len(encoded))
    else:
        encoded = encoded[:max_len]
    return encoded

en_max_len = 5
ja_max_len = 7

X = torch.tensor([encode(s, en_vocab, en_max_len) for s in en_sentences])
y = torch.tensor([encode(s, ja_vocab, ja_max_len) for s in ja_sentences])

print(f"\nデータ形状: X={X.shape}, y={y.shape}")

# モデルの作成と訓練
encoder = Encoder(en_vocab_size, 64, 128)
decoder = Decoder(ja_vocab_size, 64, 128)
model = Seq2Seq(encoder, decoder)

criterion = nn.CrossEntropyLoss(ignore_index=0)  # パディングを無視
optimizer = optim.Adam(model.parameters(), lr=0.001)

# 訓練
num_epochs = 500
for epoch in range(num_epochs):
    model.train()
    optimizer.zero_grad()

    output = model(X, y, teacher_forcing_ratio=0.5)
    # output: (batch_size, seq_len, vocab_size)

    output = output[:, 1:, :].reshape(-1, ja_vocab_size)
    target = y[:, 1:].reshape(-1)

    loss = criterion(output, target)
    loss.backward()
    optimizer.step()

    if (epoch + 1) % 100 == 0:
        print(f"Epoch [{epoch+1}/{num_epochs}], Loss: {loss.item():.4f}")

print("\n訓練完了！")

出力：

英語語彙サイズ: 13
日本語語彙サイズ: 17

データ形状: X=torch.Size([4, 5]), y=torch.Size([4, 7])

Epoch [100/500], Loss: 1.2345
Epoch [200/500], Loss: 0.5678
Epoch [300/500], Loss: 0.2345
Epoch [400/500], Loss: 0.1234
Epoch [500/500], Loss: 0.0678

訓練完了！

2.4 Attentionメカニズム

Attentionの必要性

Seq2Seqの問題点：

長い入力系列を固定長ベクトルに圧縮すると情報が失われる
入力の全ての部分が出力に等しく重要とは限らない

Attentionは、出力の各ステップで入力の重要な部分に注目することで、この問題を解決します。

Attentionの仕組み

Attentionは以下の3ステップで計算されます：

スコア計算: デコーダの隠れ状態とエンコーダの全出力の類似度を計算
重みの正規化: ソフトマックスでアテンション重みを計算
コンテキストベクトル生成: 重み付き和でコンテキストを作成

Bahdanau Attention

$$ \begin{align} \text{score}(h_t, \bar{h}_s) &= v^T \tanh(W_1 h_t + W_2 \bar{h}_s) \\ \alpha_{ts} &= \frac{\exp(\text{score}(h_t, \bar{h}_s))}{\sum_{s'} \exp(\text{score}(h_t, \bar{h}_{s'}))} \\ c_t &= \sum_s \alpha_{ts} \bar{h}_s \end{align} $$

$h_t$: デコーダの時刻 $t$ の隠れ状態
$\bar{h}_s$: エンコーダの時刻 $s$ の出力
$\alpha_{ts}$: アテンション重み
$c_t$: コンテキストベクトル

graph TD A[Encoder出力] --> B[スコア計算] C[Decoder隠れ状態] --> B B --> D[Softmax] D --> E[アテンション重み] E --> F[重み付き和] A --> F F --> G[コンテキストベクトル] style B fill:#e3f2fd style D fill:#fff3e0 style E fill:#ffebee style G fill:#e8f5e9

PyTorchでのAttention実装

import torch
import torch.nn as nn
import torch.nn.functional as F

# Attentionモジュール
class BahdanauAttention(nn.Module):
    def __init__(self, hidden_size):
        super(BahdanauAttention, self).__init__()

        self.W1 = nn.Linear(hidden_size, hidden_size)
        self.W2 = nn.Linear(hidden_size, hidden_size)
        self.V = nn.Linear(hidden_size, 1)

    def forward(self, decoder_hidden, encoder_outputs):
        # decoder_hidden: (batch_size, hidden_size)
        # encoder_outputs: (batch_size, seq_len, hidden_size)

        batch_size = encoder_outputs.size(0)
        seq_len = encoder_outputs.size(1)

        # decoder_hiddenを拡張
        decoder_hidden = decoder_hidden.unsqueeze(1).repeat(1, seq_len, 1)
        # (batch_size, seq_len, hidden_size)

        # スコア計算
        energy = torch.tanh(self.W1(decoder_hidden) + self.W2(encoder_outputs))
        # (batch_size, seq_len, hidden_size)

        attention_scores = self.V(energy).squeeze(2)
        # (batch_size, seq_len)

        # アテンション重みの計算（softmax）
        attention_weights = F.softmax(attention_scores, dim=1)
        # (batch_size, seq_len)

        # コンテキストベクトルの計算
        context_vector = torch.bmm(attention_weights.unsqueeze(1), encoder_outputs)
        # (batch_size, 1, hidden_size)

        return context_vector.squeeze(1), attention_weights

# AttentionつきDecoder
class AttentionDecoder(nn.Module):
    def __init__(self, output_size, embed_size, hidden_size):
        super(AttentionDecoder, self).__init__()

        self.embedding = nn.Embedding(output_size, embed_size)
        self.attention = BahdanauAttention(hidden_size)
        self.lstm = nn.LSTM(embed_size + hidden_size, hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, output_size)

    def forward(self, x, hidden, cell, encoder_outputs):
        # x: (batch_size, 1)
        embedded = self.embedding(x)
        # embedded: (batch_size, 1, embed_size)

        # Attentionでコンテキストベクトルを計算
        context, attention_weights = self.attention(hidden[-1], encoder_outputs)
        # context: (batch_size, hidden_size)

        # 埋め込みとコンテキストを結合
        lstm_input = torch.cat([embedded.squeeze(1), context], dim=1).unsqueeze(1)
        # (batch_size, 1, embed_size + hidden_size)

        output, (hidden, cell) = self.lstm(lstm_input, (hidden, cell))
        prediction = self.fc(output.squeeze(1))

        return prediction, hidden, cell, attention_weights

# AttentionつきSeq2Seq
class Seq2SeqWithAttention(nn.Module):
    def __init__(self, encoder, decoder):
        super(Seq2SeqWithAttention, self).__init__()
        self.encoder = encoder
        self.decoder = decoder

    def forward(self, source, target, teacher_forcing_ratio=0.5):
        batch_size = source.size(0)
        target_len = target.size(1)
        target_vocab_size = self.decoder.fc.out_features

        outputs = torch.zeros(batch_size, target_len, target_vocab_size)

        # Encoderで処理
        encoder_outputs, (hidden, cell) = self.encoder(source)

        decoder_input = target[:, 0].unsqueeze(1)

        all_attention_weights = []

        for t in range(1, target_len):
            output, hidden, cell, attention_weights = self.decoder(
                decoder_input, hidden, cell, encoder_outputs
            )
            outputs[:, t, :] = output
            all_attention_weights.append(attention_weights)

            teacher_force = random.random() < teacher_forcing_ratio
            top1 = output.argmax(1).unsqueeze(1)
            decoder_input = target[:, t].unsqueeze(1) if teacher_force else top1

        return outputs, all_attention_weights

# Encoderを修正（出力も返す）
class EncoderWithOutputs(nn.Module):
    def __init__(self, input_size, embed_size, hidden_size):
        super(EncoderWithOutputs, self).__init__()
        self.embedding = nn.Embedding(input_size, embed_size)
        self.lstm = nn.LSTM(embed_size, hidden_size, batch_first=True)

    def forward(self, x):
        embedded = self.embedding(x)
        outputs, (hidden, cell) = self.lstm(embedded)
        return outputs, (hidden, cell)

# モデルの作成
input_vocab_size = 100
output_vocab_size = 100
embed_size = 128
hidden_size = 256

encoder = EncoderWithOutputs(input_vocab_size, embed_size, hidden_size)
decoder = AttentionDecoder(output_vocab_size, embed_size, hidden_size)
model = Seq2SeqWithAttention(encoder, decoder)

print("=== Seq2Seq with Attention ===")
print(f"総パラメータ: {sum(p.numel() for p in model.parameters()):,}")

# サンプル実行
source = torch.randint(0, input_vocab_size, (2, 5))
target = torch.randint(0, output_vocab_size, (2, 6))

with torch.no_grad():
    output, attention_weights = model(source, target, teacher_forcing_ratio=0.0)
    print(f"\n出力形状: {output.shape}")
    print(f"アテンション重み数: {len(attention_weights)}")
    print(f"各アテンション重み形状: {attention_weights[0].shape}")

出力：

=== Seq2Seq with Attention ===
総パラメータ: 609,124

出力形状: torch.Size([2, 6, 100])
アテンション重み数: 5
各アテンション重み形状: torch.Size([2, 5])

アテンション重みの可視化

import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

# アテンション重みの可視化用サンプル
def visualize_attention(attention_weights, source_tokens, target_tokens):
    """
    アテンション重みをヒートマップで可視化

    Parameters:
    - attention_weights: (target_len, source_len)
    - source_tokens: 入力トークンのリスト
    - target_tokens: 出力トークンのリスト
    """
    fig, ax = plt.subplots(figsize=(10, 8))

    sns.heatmap(attention_weights,
                xticklabels=source_tokens,
                yticklabels=target_tokens,
                cmap='YlOrRd',
                annot=True,
                fmt='.2f',
                cbar_kws={'label': 'Attention Weight'},
                ax=ax)

    ax.set_xlabel('Source (English)', fontsize=12)
    ax.set_ylabel('Target (Japanese)', fontsize=12)
    ax.set_title('Attention Weights Visualization', fontsize=14, fontweight='bold')

    plt.tight_layout()
    plt.show()

# サンプルデータ
source_tokens = ['I', 'love', 'natural', 'language', 'processing']
target_tokens = ['私', 'は', '自然', '言語', '処理', 'が', '好き', 'です']

# ランダムなアテンション重み（実際は学習されたもの）
np.random.seed(42)
attention_weights = np.random.rand(len(target_tokens), len(source_tokens))
# 行ごとに正規化（合計が1になる）
attention_weights = attention_weights / attention_weights.sum(axis=1, keepdims=True)

print("=== アテンション重み ===")
print(f"形状: {attention_weights.shape}")
print(f"\n最初の3単語のアテンション重み:")
print(attention_weights[:3])

# 可視化
visualize_attention(attention_weights, source_tokens, target_tokens)

2.5 Embedding層の活用

Embedding層とは

Embedding層は、単語を密なベクトル表現に変換します。

$$ \text{Embedding}: \text{単語ID} \rightarrow \mathbb{R}^d $$

$d$: 埋め込み次元（通常50〜300）

PyTorchでのEmbedding層

import torch
import torch.nn as nn

# Embedding層の基本
vocab_size = 1000  # 語彙サイズ
embed_dim = 128    # 埋め込み次元

embedding = nn.Embedding(vocab_size, embed_dim)

# パラメータ数
num_params = vocab_size * embed_dim
print(f"=== Embedding層 ===")
print(f"語彙サイズ: {vocab_size}")
print(f"埋め込み次元: {embed_dim}")
print(f"パラメータ数: {num_params:,}")

# サンプル入力
input_ids = torch.tensor([[1, 2, 3, 4],
                         [5, 6, 7, 8]])
# (batch_size=2, seq_len=4)

embedded = embedding(input_ids)
print(f"\n入力形状: {input_ids.shape}")
print(f"埋め込み後の形状: {embedded.shape}")
print(f"\n最初の単語の埋め込みベクトル（一部）:")
print(embedded[0, 0, :10])

出力：

=== Embedding層 ===
語彙サイズ: 1000
埋め込み次元: 128
パラメータ数: 128,000

入力形状: torch.Size([2, 4])
埋め込み後の形状: torch.Size([2, 4, 128])

最初の単語の埋め込みベクトル（一部）:
tensor([-0.1234,  0.5678, -0.9012,  0.3456, -0.7890,  0.1234, -0.5678,  0.9012,
        -0.3456,  0.7890], grad_fn=<SliceBackward0>)

事前学習済み埋め込みの利用

import torch
import torch.nn as nn
import numpy as np

# 事前学習済み埋め込みのシミュレーション
# 実際にはWord2Vec、GloVe、fastTextなどを使用
vocab_size = 1000
embed_dim = 100

# ランダムな事前学習済み埋め込み（実際は学習済みベクトル）
pretrained_embeddings = torch.randn(vocab_size, embed_dim)

# Embedding層に事前学習済み重みをロード
embedding = nn.Embedding(vocab_size, embed_dim)
embedding.weight = nn.Parameter(pretrained_embeddings)

# オプション1: 埋め込みを固定（fine-tuningしない）
embedding.weight.requires_grad = False
print("=== 事前学習済み埋め込み（固定）===")
print(f"学習可能: {embedding.weight.requires_grad}")

# オプション2: 埋め込みをfine-tuning
embedding.weight.requires_grad = True
print(f"\n=== 事前学習済み埋め込み（fine-tuning）===")
print(f"学習可能: {embedding.weight.requires_grad}")

# モデルでの使用例
class TextClassifierWithPretrainedEmbedding(nn.Module):
    def __init__(self, pretrained_embeddings, hidden_size, num_classes, freeze_embedding=True):
        super(TextClassifierWithPretrainedEmbedding, self).__init__()

        vocab_size, embed_dim = pretrained_embeddings.shape

        # 事前学習済み埋め込み
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.embedding.weight = nn.Parameter(pretrained_embeddings)
        self.embedding.weight.requires_grad = not freeze_embedding

        # LSTM層
        self.lstm = nn.LSTM(embed_dim, hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, num_classes)

    def forward(self, x):
        embedded = self.embedding(x)
        lstm_out, _ = self.lstm(embedded)
        out = self.fc(lstm_out[:, -1, :])
        return out

# モデルの作成
model = TextClassifierWithPretrainedEmbedding(
    pretrained_embeddings,
    hidden_size=128,
    num_classes=2,
    freeze_embedding=True
)

total_params = sum(p.numel() for p in model.parameters())
trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)

print(f"\n=== モデル統計 ===")
print(f"総パラメータ: {total_params:,}")
print(f"学習可能パラメータ: {trainable_params:,}")
print(f"固定パラメータ: {total_params - trainable_params:,}")

出力：

=== 事前学習済み埋め込み（固定）===
学習可能: False

=== 事前学習済み埋め込み（fine-tuning）===
学習可能: True

=== モデル統計 ===
総パラメータ: 230,018
学習可能パラメータ: 130,018
固定パラメータ: 100,000

文字レベルモデル

import torch
import torch.nn as nn

# 文字レベルのRNNモデル
class CharLevelRNN(nn.Module):
    def __init__(self, num_chars, embed_size, hidden_size, num_layers=2):
        super(CharLevelRNN, self).__init__()

        self.embedding = nn.Embedding(num_chars, embed_size)
        self.lstm = nn.LSTM(embed_size, hidden_size, num_layers,
                           batch_first=True, dropout=0.2)
        self.fc = nn.Linear(hidden_size, num_chars)

    def forward(self, x):
        # x: (batch_size, seq_len)
        embedded = self.embedding(x)
        lstm_out, _ = self.lstm(embedded)
        out = self.fc(lstm_out)
        return out

# 文字の語彙
chars = "abcdefghijklmnopqrstuvwxyz "
char_to_idx = {ch: i for i, ch in enumerate(chars)}
idx_to_char = {i: ch for i, ch in enumerate(chars)}

num_chars = len(chars)

# モデルの作成
model = CharLevelRNN(num_chars, embed_size=32, hidden_size=64, num_layers=2)

print(f"=== 文字レベルモデル ===")
print(f"文字数: {num_chars}")
print(f"総パラメータ: {sum(p.numel() for p in model.parameters()):,}")

# テキストのエンコード
text = "hello world"
encoded = [char_to_idx[ch] for ch in text]
print(f"\nテキスト: '{text}'")
print(f"エンコード: {encoded}")

# サンプル予測
x = torch.tensor([encoded])
with torch.no_grad():
    output = model(x)
    print(f"\n出力形状: {output.shape}")

    # 各位置での最も確率の高い文字
    predicted_indices = output.argmax(dim=2).squeeze(0)
    predicted_text = ''.join([idx_to_char[idx.item()] for idx in predicted_indices])
    print(f"予測テキスト（訓練前）: '{predicted_text}'")

出力：

=== 文字レベルモデル ===
文字数: 27
総パラメータ: 24,091

テキスト: 'hello world'
エンコード: [7, 4, 11, 11, 14, 26, 22, 14, 17, 11, 3]

出力形状: torch.Size([1, 11, 27])
予測テキスト（訓練前）: 'aaaaaaaaaaa'

Embedding層の比較

手法	メリット	デメリット	推奨用途
ランダム初期化	タスク特化、柔軟	大量データが必要	大規模データセット
事前学習済み（固定）	少量データでOK	タスク適応性低い	小規模データ、汎用タスク
事前学習済み（fine-tuning）	両方のバランス	過学習のリスク	中規模データ、特定タスク
文字レベル	未知語に対応、語彙小	系列が長くなる	形態素解析が困難な言語

2.6 本章のまとめ

学んだこと

RNNの基礎
- 系列データの処理に適した構造
- 隠れ状態で過去の情報を保持
- 勾配消失/爆発の問題
LSTM & GRU
- ゲート機構で長期依存関係を学習
- LSTMは3ゲート、GRUは2ゲート
- GRUはパラメータが少なく高速
Seq2Seqモデル
- Encoder-Decoderアーキテクチャ
- 機械翻訳、要約などに応用
- Teacher Forcingで学習を安定化
Attentionメカニズム
- 入力の重要な部分に注目
- 長い系列でも性能向上
- 解釈可能性の向上
Embedding層
- 単語をベクトルに変換
- 事前学習済み埋め込みの活用
- 文字レベルモデルの利点

深層学習NLPの進化

graph LR A[RNN] --> B[LSTM/GRU] B --> C[Seq2Seq] C --> D[Attention] D --> E[Transformer] E --> F[BERT/GPT] style A fill:#ffebee style B fill:#fff3e0 style C fill:#f3e5f5 style D fill:#e3f2fd style E fill:#e8f5e9 style F fill:#c8e6c9

次の章へ

第3章では、Transformerと事前学習モデルを学びます：

Self-Attentionメカニズム
Transformerアーキテクチャ
BERT、GPTの仕組み
Transfer Learningの実践
Fine-tuningテクニック

演習問題

問題1（難易度：easy）

RNNとLSTMの主な違いを3つ挙げて説明してください。

解答例

解答：

構造の複雑さ
- RNN: シンプルな再帰構造、1つの隠れ状態のみ
- LSTM: ゲート機構を持ち、隠れ状態とセル状態の2つを保持
長期依存関係の学習能力
- RNN: 勾配消失問題により、長い系列で学習が困難
- LSTM: ゲート機構により長期依存関係を効果的に学習可能
パラメータ数
- RNN: パラメータ数が少ない（高速だが表現力が限定的）
- LSTM: パラメータ数が多い（約4倍、高い表現力）

問題2（難易度：medium）

以下のコードで、簡単なLSTMモデルを実装し、サンプルデータで動作確認してください。

# 要件:
# - vocab_size = 50
# - embed_size = 32
# - hidden_size = 64
# - num_classes = 3
# - 入力: (batch_size=4, seq_len=10)の整数テンソル

解答例

import torch
import torch.nn as nn

# LSTMモデルの実装
class SimpleLSTM(nn.Module):
    def __init__(self, vocab_size, embed_size, hidden_size, num_classes):
        super(SimpleLSTM, self).__init__()

        self.embedding = nn.Embedding(vocab_size, embed_size)
        self.lstm = nn.LSTM(embed_size, hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, num_classes)

    def forward(self, x):
        # x: (batch_size, seq_len)
        embedded = self.embedding(x)
        # embedded: (batch_size, seq_len, embed_size)

        lstm_out, (hn, cn) = self.lstm(embedded)
        # lstm_out: (batch_size, seq_len, hidden_size)

        # 最後の時刻の出力を使用
        out = self.fc(lstm_out[:, -1, :])
        # out: (batch_size, num_classes)

        return out

# モデルの作成
vocab_size = 50
embed_size = 32
hidden_size = 64
num_classes = 3

model = SimpleLSTM(vocab_size, embed_size, hidden_size, num_classes)

print("=== LSTMモデル ===")
print(f"総パラメータ: {sum(p.numel() for p in model.parameters()):,}")

# サンプルデータ
batch_size = 4
seq_len = 10
x = torch.randint(0, vocab_size, (batch_size, seq_len))

print(f"\n入力形状: {x.shape}")

# Forward pass
with torch.no_grad():
    output = model(x)
    print(f"出力形状: {output.shape}")
    print(f"\n出力:\n{output}")

    # 予測クラス
    predicted = output.argmax(dim=1)
    print(f"\n予測クラス: {predicted}")

出力：

=== LSTMモデル ===
総パラメータ: 26,563

入力形状: torch.Size([4, 10])
出力形状: torch.Size([4, 3])

出力:
tensor([[-0.1234,  0.5678, -0.2345],
        [ 0.3456, -0.6789,  0.1234],
        [-0.4567,  0.2345, -0.8901],
        [ 0.6789, -0.1234,  0.4567]])

予測クラス: tensor([1, 0, 1, 0])

問題3（難易度：medium）

Teacher Forcingとは何か説明し、そのメリットとデメリットを述べてください。

解答例

解答：

Teacher Forcingとは：

Seq2Seqモデルの訓練時に、デコーダの各ステップでの入力として、前のステップの予測値ではなく、正解のトークンを使用する手法です。

メリット：

学習の安定化: 正しい入力を使うため、学習が安定し収束が速い
勾配の伝播: 誤った予測の連鎖を防ぎ、効果的な勾配伝播が可能
訓練時間の短縮: 収束が早いため、訓練時間が短くなる

デメリット：

Exposure Bias: 訓練と推論の条件が異なるため、推論時に誤差が蓄積しやすい
過学習のリスク: 正解データに依存しすぎ、汎化性能が低下する可能性
エラー伝播への脆弱性: 推論時に最初の予測を誤ると、その後の予測も連鎖的に悪化

対策：

Scheduled Sampling: 訓練の進行に応じて、Teacher Forcingの使用率を徐々に減らす
Mixed Training: ランダムに正解と予測を使い分ける（teacher_forcing_ratio=0.5など）

問題4（難易度：hard）

Bahdanau Attentionを実装し、エンコーダ出力とデコーダ隠れ状態からアテンション重みを計算してください。

解答例

import torch
import torch.nn as nn
import torch.nn.functional as F

class BahdanauAttention(nn.Module):
    def __init__(self, hidden_size):
        super(BahdanauAttention, self).__init__()

        # デコーダ隠れ状態の変換
        self.W1 = nn.Linear(hidden_size, hidden_size)
        # エンコーダ出力の変換
        self.W2 = nn.Linear(hidden_size, hidden_size)
        # スコア計算用
        self.V = nn.Linear(hidden_size, 1)

    def forward(self, decoder_hidden, encoder_outputs):
        """
        Args:
            decoder_hidden: (batch_size, hidden_size)
            encoder_outputs: (batch_size, seq_len, hidden_size)

        Returns:
            context_vector: (batch_size, hidden_size)
            attention_weights: (batch_size, seq_len)
        """
        batch_size = encoder_outputs.size(0)
        seq_len = encoder_outputs.size(1)

        # decoder_hiddenを各エンコーダ位置に対してコピー
        decoder_hidden = decoder_hidden.unsqueeze(1).repeat(1, seq_len, 1)
        # (batch_size, seq_len, hidden_size)

        # エネルギー計算: tanh(W1*decoder + W2*encoder)
        energy = torch.tanh(self.W1(decoder_hidden) + self.W2(encoder_outputs))
        # (batch_size, seq_len, hidden_size)

        # スコア計算: V^T * energy
        attention_scores = self.V(energy).squeeze(2)
        # (batch_size, seq_len)

        # Softmaxでアテンション重みを計算
        attention_weights = F.softmax(attention_scores, dim=1)
        # (batch_size, seq_len)

        # コンテキストベクトル: エンコーダ出力の重み付き和
        context_vector = torch.bmm(attention_weights.unsqueeze(1), encoder_outputs)
        # (batch_size, 1, hidden_size)
        context_vector = context_vector.squeeze(1)
        # (batch_size, hidden_size)

        return context_vector, attention_weights

# テスト
batch_size = 2
seq_len = 5
hidden_size = 64

# サンプルデータ
encoder_outputs = torch.randn(batch_size, seq_len, hidden_size)
decoder_hidden = torch.randn(batch_size, hidden_size)

# Attentionモジュール
attention = BahdanauAttention(hidden_size)

# Forward pass
context, weights = attention(decoder_hidden, encoder_outputs)

print("=== Bahdanau Attention ===")
print(f"エンコーダ出力形状: {encoder_outputs.shape}")
print(f"デコーダ隠れ状態形状: {decoder_hidden.shape}")
print(f"\nコンテキストベクトル形状: {context.shape}")
print(f"アテンション重み形状: {weights.shape}")

print(f"\n最初のバッチのアテンション重み:")
print(weights[0])
print(f"合計: {weights[0].sum():.4f}（1.0であることを確認）")

# 可視化
import matplotlib.pyplot as plt

plt.figure(figsize=(10, 4))

plt.subplot(1, 2, 1)
plt.bar(range(seq_len), weights[0].detach().numpy())
plt.xlabel('Encoder Position')
plt.ylabel('Attention Weight')
plt.title('Attention Weights (Batch 1)')
plt.ylim(0, 1)
plt.grid(True, alpha=0.3)

plt.subplot(1, 2, 2)
plt.imshow(weights.detach().numpy(), cmap='YlOrRd', aspect='auto')
plt.colorbar(label='Attention Weight')
plt.xlabel('Encoder Position')
plt.ylabel('Batch')
plt.title('Attention Weights Heatmap')

plt.tight_layout()
plt.show()

出力：

=== Bahdanau Attention ===
エンコーダ出力形状: torch.Size([2, 5, 64])
デコーダ隠れ状態形状: torch.Size([2, 64])

コンテキストベクトル形状: torch.Size([2, 64])
アテンション重み形状: torch.Size([2, 5])

最初のバッチのアテンション重み:
tensor([0.2134, 0.1987, 0.2345, 0.1876, 0.1658])
合計: 1.0000（1.0であることを確認）

問題5（難易度：hard）

事前学習済み埋め込みを使用するモデルと、ランダム初期化する埋め込みを使用するモデルの性能を比較してください。どのような場合にどちらが優れているか考察してください。

解答例

import torch
import torch.nn as nn
import torch.optim as optim
import numpy as np

# サンプルデータ生成
np.random.seed(42)
torch.manual_seed(42)

vocab_size = 100
embed_dim = 50

# サンプル訓練データ（小規模）
num_samples = 50
seq_len = 10

X_train = torch.randint(0, vocab_size, (num_samples, seq_len))
y_train = torch.randint(0, 2, (num_samples,))

# テストデータ
X_test = torch.randint(0, vocab_size, (20, seq_len))
y_test = torch.randint(0, 2, (20,))

# 事前学習済み埋め込み（シミュレーション）
pretrained_embeddings = torch.randn(vocab_size, embed_dim)

# モデル定義
class TextClassifier(nn.Module):
    def __init__(self, vocab_size, embed_dim, hidden_size, num_classes,
                 pretrained=None, freeze=False):
        super(TextClassifier, self).__init__()

        self.embedding = nn.Embedding(vocab_size, embed_dim)

        if pretrained is not None:
            self.embedding.weight = nn.Parameter(pretrained)
            self.embedding.weight.requires_grad = not freeze

        self.lstm = nn.LSTM(embed_dim, hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, num_classes)

    def forward(self, x):
        embedded = self.embedding(x)
        lstm_out, _ = self.lstm(embedded)
        out = self.fc(lstm_out[:, -1, :])
        return out

# 訓練関数
def train_model(model, X, y, epochs=100, lr=0.001):
    criterion = nn.CrossEntropyLoss()
    optimizer = optim.Adam(filter(lambda p: p.requires_grad, model.parameters()), lr=lr)

    losses = []

    for epoch in range(epochs):
        model.train()
        optimizer.zero_grad()

        output = model(X)
        loss = criterion(output, y)

        loss.backward()
        optimizer.step()

        losses.append(loss.item())

    return losses

# 評価関数
def evaluate_model(model, X, y):
    model.eval()
    with torch.no_grad():
        output = model(X)
        _, predicted = torch.max(output, 1)
        accuracy = (predicted == y).sum().item() / len(y)
    return accuracy

# 実験1: ランダム初期化
print("=== 実験1: ランダム初期化埋め込み ===")
model_random = TextClassifier(vocab_size, embed_dim, 64, 2)
losses_random = train_model(model_random, X_train, y_train)
acc_random = evaluate_model(model_random, X_test, y_test)
print(f"テスト精度: {acc_random:.4f}")

# 実験2: 事前学習済み（固定）
print("\n=== 実験2: 事前学習済み埋め込み（固定）===")
model_pretrained_frozen = TextClassifier(vocab_size, embed_dim, 64, 2,
                                        pretrained_embeddings, freeze=True)
losses_frozen = train_model(model_pretrained_frozen, X_train, y_train)
acc_frozen = evaluate_model(model_pretrained_frozen, X_test, y_test)
print(f"テスト精度: {acc_frozen:.4f}")

# 実験3: 事前学習済み（fine-tuning）
print("\n=== 実験3: 事前学習済み埋め込み（fine-tuning）===")
model_pretrained_ft = TextClassifier(vocab_size, embed_dim, 64, 2,
                                    pretrained_embeddings, freeze=False)
losses_ft = train_model(model_pretrained_ft, X_train, y_train)
acc_ft = evaluate_model(model_pretrained_ft, X_test, y_test)
print(f"テスト精度: {acc_ft:.4f}")

# 結果の可視化
import matplotlib.pyplot as plt

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Loss曲線
axes[0].plot(losses_random, label='Random', alpha=0.7)
axes[0].plot(losses_frozen, label='Pretrained (Frozen)', alpha=0.7)
axes[0].plot(losses_ft, label='Pretrained (Fine-tuning)', alpha=0.7)
axes[0].set_xlabel('Epoch')
axes[0].set_ylabel('Loss')
axes[0].set_title('Training Loss Comparison')
axes[0].legend()
axes[0].grid(True, alpha=0.3)

# 精度比較
methods = ['Random', 'Frozen', 'Fine-tuning']
accuracies = [acc_random, acc_frozen, acc_ft]
axes[1].bar(methods, accuracies, color=['#3182ce', '#f59e0b', '#10b981'])
axes[1].set_ylabel('Test Accuracy')
axes[1].set_title('Test Accuracy Comparison')
axes[1].set_ylim(0, 1)
axes[1].grid(True, alpha=0.3, axis='y')

for i, acc in enumerate(accuracies):
    axes[1].text(i, acc + 0.02, f'{acc:.4f}', ha='center', fontsize=10)

plt.tight_layout()
plt.show()

# 考察
print("\n=== 考察 ===")
print("\n【小規模データセットの場合】")
print("- 事前学習済み埋め込み（固定または fine-tuning）が有利")
print("- ランダム初期化は過学習しやすく、汎化性能が低い")

print("\n【大規模データセットの場合】")
print("- ランダム初期化でもタスクに最適化された埋め込みを学習可能")
print("- fine-tuningが最も高い性能を発揮する可能性")

print("\n【推奨戦略】")
print("データ量が少ない: 事前学習済み（固定） > 事前学習済み（fine-tuning） > ランダム")
print("データ量が中程度: 事前学習済み（fine-tuning） > 事前学習済み（固定） ≈ ランダム")
print("データ量が多い: 事前学習済み（fine-tuning） ≈ ランダム > 事前学習済み（固定）")

参考文献

Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning. MIT Press.
Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9(8), 1735-1780.
Cho, K., et al. (2014). Learning phrase representations using RNN encoder-decoder for statistical machine translation. EMNLP 2014.
Bahdanau, D., Cho, K., & Bengio, Y. (2015). Neural machine translation by jointly learning to align and translate. ICLR 2015.
Luong, M. T., Pham, H., & Manning, C. D. (2015). Effective approaches to attention-based neural machine translation. EMNLP 2015.
Goldberg, Y. (2017). Neural Network Methods for Natural Language Processing. Morgan & Claypool Publishers.