Experiment - Bag of Tricks

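I. Experiment Overview

This experiment applies several of the techniques introduced in the paper "Bag of Tricks for Image Classification with Convolutional Neural Networks", measures the resulting change in performance, and summarizes the results.

The experiments build on the ResNet50 (Base Model) and ResNet152 (Teacher Model) models from https://github.com/bentrevett/pytorch-image-classification.
Four techniques, Cosine Learning Rate Decay, Label Smoothing, Knowledge Distillation, and Mixup, are compared individually and in various combinations.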

II. Experiment Environment

Platform: Google Colab
GPU: NVIDIA A100-SXM4-40GB
CUDA: 11.3
PyTorch: 1.12.1
Torchvision: 0.13

III. Experiment Method

Dataset Preparation

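The experiments use the CUB200 (2011) dataset, which contains 11,788 images across 200 bird classes. The data is split 8:2 into a Train Set and a Test Set, and 10% of the Train Set is held out as a Valid Set, giving an overall ratio of 7.2:0.8:2. Images are resized to 224x224 and normalized to match the Pretrained Model. The training data is additionally augmented with random rotation between -5 and 5 degrees, horizontal flipping, and cropping.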
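As a rough check on the 7.2:0.8:2 ratio, the split sizes can be worked out from the image count. The sketch below shuffles plain indices and cuts them into three lists; the function name `split_indices`, the seed, and rounding down with `int()` are assumptions made for illustration, and the repository's own splitting code may differ (for example by splitting per class).

```python
import random

TOTAL_IMAGES = 11788   # image count stated in the text
TRAIN_RATIO = 0.8      # Train Set : Test Set = 8 : 2
VALID_RATIO = 0.1      # share of the Train Set held out as the Valid Set


def split_indices(n_total, train_ratio, valid_ratio, seed=0):
    """Shuffle indices 0..n_total-1 and cut them into train/valid/test lists."""
    indices = list(range(n_total))
    random.Random(seed).shuffle(indices)

    n_trainval = int(n_total * train_ratio)   # rounding down is an assumption
    n_valid = int(n_trainval * valid_ratio)
    n_train = n_trainval - n_valid

    train = indices[:n_train]
    valid = indices[n_train:n_trainval]
    test = indices[n_trainval:]
    return train, valid, test


train_idx, valid_idx, test_idx = split_indices(TOTAL_IMAGES, TRAIN_RATIO, VALID_RATIO)
print(len(train_idx), len(valid_idx), len(test_idx))  # 8487 943 2358
```

Under these assumptions the three sets hold 8,487, 943, and 2,358 images, which is about 72%, 8%, and 20% of the data.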
Model Selection

ResNet50, with 23,917,832 trainable parameters, serves as the Base Model. ResNet152, with 58,553,608 trainable parameters, was selected as the Teacher Model for applying Knowledge Distillation to the Base Model. By default, both models are trained with BATCH_SIZE 64 and EPOCH 10, using the Adam Optimizer and a OneCycle Learning Rate Scheduler.
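To give a feel for the shape of a OneCycle schedule, the following is a minimal sketch of a two-phase schedule with cosine interpolation. The default values mirror those documented for `torch.optim.lr_scheduler.OneCycleLR` (`pct_start=0.3`, `div_factor=25`, `final_div_factor=1e4`), but the report does not state which settings were used, so these values, the function name `one_cycle_lr`, and the 100-step run are all assumptions.

```python
import math


def one_cycle_lr(step, total_steps, max_lr, pct_start=0.3,
                 div_factor=25.0, final_div_factor=1e4):
    """Learning rate at `step` (0-based) of a two-phase one-cycle schedule."""
    initial_lr = max_lr / div_factor
    min_lr = initial_lr / final_div_factor
    warmup_end = pct_start * total_steps - 1   # last step of the warm-up phase
    last_step = total_steps - 1

    def cos_anneal(start, end, pct):
        # Cosine interpolation from `start` (pct=0) to `end` (pct=1).
        return end + (start - end) / 2.0 * (math.cos(math.pi * pct) + 1.0)

    if step <= warmup_end:
        return cos_anneal(initial_lr, max_lr, step / warmup_end)
    pct = (step - warmup_end) / (last_step - warmup_end)
    return cos_anneal(max_lr, min_lr, pct)


# Hypothetical 100-step run with a peak rate of 1e-3.
schedule = [one_cycle_lr(s, total_steps=100, max_lr=1e-3) for s in range(100)]
print(schedule[0], max(schedule), schedule[-1])
```

The rate starts well below the peak, rises to `max_lr` during the first 30% of the steps, and then decays to a value far smaller than where it started.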

Table 1. Comparison of the Base Model and the Teacher Model

(a) OneCycle Learning Rate Schedule

Figures (b) to (d) compare the two models. ResNet152, the deeper and higher-capacity model, shows higher Accuracy than ResNet50 over the entire run. It is used as the Teacher Model for the Knowledge Distillation technique introduced later, and the Base Model curve (red) serves as the baseline for comparison in each technique.

(b) Train Loss

(d) Top-5 Validation Accuracy

Applying the Techniques

The experiment uses four techniques: Cosine Learning Rate Decay, Label Smoothing, Knowledge Distillation, and Mixup. They are applied individually and in the combinations listed below.

Cosine Learning Rate Decay

(a) Cosine Annealing Learning Rate Schedule

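The Optimizer and Scheduler below are defined so that the Learning Rate decays to 0 along a cosine curve. For the first parameter group (conv1, whose rate is FOUND_LR / 10), this means a decay from 0.0001 to 0.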

import torch.optim as optim
import torch.optim.lr_scheduler as lr_scheduler

FOUND_LR = 1e-3

# Per-layer learning rates: earlier layers get smaller rates than the head.
# `model` is the ResNet50 Base Model built earlier.
params = [
    {'params': model.conv1.parameters(), 'lr': FOUND_LR / 10},
    {'params': model.bn1.parameters(), 'lr': FOUND_LR / 10},
    {'params': model.layer1.parameters(), 'lr': FOUND_LR / 8},
    {'params': model.layer2.parameters(), 'lr': FOUND_LR / 6},
    {'params': model.layer3.parameters(), 'lr': FOUND_LR / 4},
    {'params': model.layer4.parameters(), 'lr': FOUND_LR / 2},
    {'params': model.fc.parameters()},  # uses the optimizer default, FOUND_LR
]
optimizer = optim.Adam(params, lr=FOUND_LR)

# CosineDecay is the experiment flag; TOTAL_STEPS is the number of scheduler steps.
if CosineDecay:
    scheduler = lr_scheduler.CosineAnnealingLR(optimizer, T_max=TOTAL_STEPS, eta_min=0, last_epoch=-1)
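The cosine curve itself can be written in closed form as eta_t = eta_min + 0.5 * (eta_0 - eta_min) * (1 + cos(pi * t / T_max)). The sketch below evaluates that formula for a few of the per-group rates defined above; the function name `cosine_annealing_lr` and the value `T_MAX = 10` are hypothetical and chosen only for illustration.

```python
import math


def cosine_annealing_lr(step, t_max, base_lr, eta_min=0.0):
    """Closed-form cosine annealing: base_lr at step 0, eta_min at step t_max."""
    return eta_min + 0.5 * (base_lr - eta_min) * (1.0 + math.cos(math.pi * step / t_max))


FOUND_LR = 1e-3
T_MAX = 10  # hypothetical number of scheduler steps, for illustration only

# Per-group base rates taken from the optimizer definition above.
base_lrs = {"conv1": FOUND_LR / 10, "layer4": FOUND_LR / 2, "fc": FOUND_LR}

for step in (0, T_MAX // 2, T_MAX):
    rates = {name: cosine_annealing_lr(step, T_MAX, lr) for name, lr in base_lrs.items()}
    print(step, rates)
```

Every group follows the same curve scaled by its own starting rate: each is at half its initial value at the midpoint and reaches 0 at the final step.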

Label Smoothing

To smooth the One-Hot labels into Soft Labels between 0 and 1 and compute CrossEntropy against the predictions, the Loss Function below is defined. The Smoothing parameter is set to 0.1.

import torch.nn as nn
import torch.nn.functional as F


class LabelSmoothingCrossEntropy(nn.Module):
    def __init__(self):
        super().__init__()

    def forward(self, x, target, smoothing=0.1):
        confidence = 1. - smoothing
        logprobs = F.log_softmax(x, dim=-1)
        # Negative log-likelihood of the true class for each sample.
        nll_loss = -logprobs.gather(dim=-1, index=target.unsqueeze(1))
        nll_loss = nll_loss.squeeze(1)
        # Cross entropy against a uniform distribution over all classes.
        smooth_loss = -logprobs.mean(dim=-1)
        loss = confidence * nll_loss + smoothing * smooth_loss
        return loss.mean()


if LabelSmoothing:
    criterion = LabelSmoothingCrossEntropy()
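The loss above never builds the Soft Label explicitly, so it may help to see that it matches a cross entropy against the target (1 - smoothing) * one_hot + smoothing / K, where K is the number of classes. The pure-Python sketch below checks this for a single sample; the logits are made up and the helper names are hypothetical.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_sum for v in logits]


def smoothed_targets(target, num_classes, smoothing=0.1):
    """Soft label: smoothing / K on every class plus (1 - smoothing) on the true class."""
    uniform = smoothing / num_classes
    return [uniform + (1.0 - smoothing if k == target else 0.0)
            for k in range(num_classes)]


def loss_from_soft_targets(logits, target, smoothing=0.1):
    """Cross entropy between the soft label and the predicted distribution."""
    logp = log_softmax(logits)
    q = smoothed_targets(target, len(logits), smoothing)
    return -sum(qk * lk for qk, lk in zip(q, logp))


def loss_as_in_report(logits, target, smoothing=0.1):
    """Same arithmetic as the forward() above, for a single sample."""
    logp = log_softmax(logits)
    nll_loss = -logp[target]
    smooth_loss = -sum(logp) / len(logp)
    return (1.0 - smoothing) * nll_loss + smoothing * smooth_loss


logits = [2.0, 0.5, -1.0, 0.1]   # made-up logits for a 4-class example
print(smoothed_targets(0, 4))
print(loss_from_soft_targets(logits, 0), loss_as_in_report(logits, 0))
```

Both functions return the same value, which is why the module can work directly on log-probabilities without materializing a 200-dimensional target vector per sample.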

Knowledge Distillation

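To obtain a Total Loss that combines, as a weighted sum, the KLDivergence Loss against the Teacher Model's predictions with the Base Model's own Loss, the Loss Function below is defined. The parameters α and T are set to 0.1 and 10, respectively. When Knowledge Distillation and Label Smoothing are applied together, the Label Smoothing Loss function cannot simply be applied to the Student Model, so a Teacher Model trained with Label Smoothing is used instead.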

import torch
import torch.nn as nn
import torch.nn.functional as F


class knowledge_distillation_loss(nn.Module):
    def __init__(self):
        super().__init__()
        self.alpha = 0.1  # weight of the student's own cross-entropy loss
        self.T = 10       # softmax temperature

    def forward(self, pred, labels, teacher_pred):
        student_loss = F.cross_entropy(input=pred, target=labels)
        # KL divergence between the temperature-softened teacher and student
        # distributions, scaled by T^2.
        distillation_loss = nn.KLDivLoss(reduction='batchmean')(
            F.log_softmax(pred / self.T, dim=1),
            F.softmax(teacher_pred / self.T, dim=1),
        ) * (self.T * self.T)
        total_loss = self.alpha * student_loss + (1 - self.alpha) * distillation_loss
        return total_loss


if KnowledgeDistillation:
    criterion = knowledge_distillation_loss()
    teacher_model = ResNet(resnet152_config, OUTPUT_DIM)
    teacher_model.load_state_dict(torch.load('tut5-teacher-model.pt'))
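The effect of the temperature T can be seen on a single sample without any framework. The sketch below reproduces the same arithmetic in pure Python (cross entropy on the hard label plus T^2-scaled KL divergence between the softened teacher and student distributions); the logits and helper names are made up for illustration.

```python
import math


def softmax(logits, temperature=1.0):
    """Softmax of logits divided by a temperature."""
    scaled = [v / temperature for v in logits]
    m = max(scaled)
    exps = [math.exp(v - m) for v in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions with q > 0 everywhere."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def distillation_loss(student_logits, teacher_logits, label, alpha=0.1, T=10.0):
    """alpha * CE(student, label) + (1 - alpha) * T^2 * KL(teacher_T || student_T)."""
    student_loss = -math.log(softmax(student_logits)[label])
    soft_teacher = softmax(teacher_logits, T)
    soft_student = softmax(student_logits, T)
    kd = kl_divergence(soft_teacher, soft_student) * T * T
    return alpha * student_loss + (1.0 - alpha) * kd


teacher_logits = [5.0, 2.0, 0.0]   # made-up values for a 3-class example
student_logits = [1.0, 1.5, 0.2]

print(softmax(teacher_logits, 1.0))    # sharply peaked on class 0
print(softmax(teacher_logits, 10.0))   # much flatter at T = 10
print(distillation_loss(student_logits, teacher_logits, label=0))
```

At T = 10 the teacher's distribution is far flatter than at T = 1, so the relative ordering of the non-target classes contributes more to the loss. With α = 0.1, most of the total weight sits on this distillation term.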

Mixup (α=1, 0.2, 0.1, 0.01)

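Two batches are drawn at random from the Train set, and a new input x and label y are formed from them using a value between 0 and 1 sampled from a beta distribution. Reference [3] reports that training results vary with α (used as α=β in the beta distribution), so this experiment compares four cases: α = 1, 0.2, 0.1, and 0.01.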

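import numpy as np
import torch
import torch.utils.data as data

train_iterator = data.DataLoader(train_data,
                                 shuffle=True,
                                 batch_size=BATCH_SIZE)
if Mixup:
    # A second, independently shuffled loader over the same Train set
    # supplies the batch to mix with.
    mixup_iterator = data.DataLoader(train_data,
                                     shuffle=True,
                                     batch_size=BATCH_SIZE)


def train(model, iterator, optimizer, criterion, scheduler, device):
    if Mixup:
        for (ox, oy), (mx, my) in zip(iterator, mixup_iterator):
            # alpha is the Mixup α chosen for the run; lam ~ Beta(alpha, alpha).
            lam = np.random.beta(alpha, alpha)
            x = lam * ox + (1. - lam) * mx
            # oy and my hold class indices, so this interpolates the indices
            # themselves and truncates the result back to int64.
            y = lam * oy + (1. - lam) * my
            y = y.to(torch.int64)
            # Forward pass, loss, backward pass, and optimizer step follow.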
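The snippet above blends the label tensors directly. A common formulation of Mixup instead blends one-hot label vectors, which for cross entropy is equivalent to weighting the two per-label losses by lam and 1 - lam. The pure-Python sketch below illustrates that variant and also shows how α shapes the sampled mixing weight; it is not the code used in this experiment, and the helper names, seed, and sample count are assumptions.

```python
import random


def one_hot(label, num_classes):
    return [1.0 if k == label else 0.0 for k in range(num_classes)]


def mixup_pair(x_a, y_a, x_b, y_b, lam, num_classes):
    """Blend two samples; labels are blended as one-hot vectors, not as indices."""
    x = [lam * a + (1.0 - lam) * b for a, b in zip(x_a, x_b)]
    y = [lam * a + (1.0 - lam) * b
         for a, b in zip(one_hot(y_a, num_classes), one_hot(y_b, num_classes))]
    return x, y


rng = random.Random(0)
extreme_share = {}
for alpha in (1.0, 0.2, 0.1, 0.01):
    lams = [rng.betavariate(alpha, alpha) for _ in range(10000)]
    # Share of draws where one sample dominates the mix (lambda < 0.1 or > 0.9).
    extreme_share[alpha] = sum(1 for lam in lams if lam < 0.1 or lam > 0.9) / len(lams)
    print(alpha, round(extreme_share[alpha], 3))

x, y = mixup_pair([0.0, 1.0], 2, [1.0, 0.0], 0, lam=0.7, num_classes=3)
print(x, y)
```

With α = 1 the mixing weight is uniform on [0, 1], while smaller α pushes it toward 0 or 1, so at α = 0.01 almost every mixed sample is nearly identical to one of its two sources.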

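Methods e through i below are combinations of techniques a through d (Cosine Learning Rate Decay, Label Smoothing, Knowledge Distillation, and Mixup, in that order). When Label Smoothing and Knowledge Distillation are used together, the approach described under c (Knowledge Distillation) is applied.

e. Cosine Decay + Label Smoothing
f. Cosine Decay + Knowledge Distillation
g. Label Smoothing + Knowledge Distillation
h. Cosine Decay + Label Smoothing + Knowledge Distillation
i. Cosine Decay + Label Smoothing + Knowledge Distillation + Mixup (α=0.2, 0.01)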

Result Output and Performance Comparison

TensorBoard is used to visualize the comparison, with result data saved at every Epoch. Train Loss, Top-1 Validation Accuracy, Top-5 Validation Accuracy, and Learning Rate are plotted for the Train and Evaluation passes. The graph Smoothing and Outlier removal options in the TensorBoard UI are turned off.

import datetime

import tensorflow as tf

current_time = datetime.datetime.now().strftime("%Y%m%d-%H%M%S")
train_log_dir = 'logs/gradient_tape/' + current_time + '/train'
test_log_dir = 'logs/gradient_tape/' + current_time + '/test'
lr_log_dir = 'logs/gradient_tape/' + current_time + '/lr'
train_summary_writer = tf.summary.create_file_writer(train_log_dir)
test_summary_writer = tf.summary.create_file_writer(test_log_dir)
lr_summary_writer = tf.summary.create_file_writer(lr_log_dir)

for epoch in range(EPOCHS):
    # After training and evaluating for this epoch
    with train_summary_writer.as_default():
        tf.summary.scalar('loss', train_loss, step=epoch)
        tf.summary.scalar('accuracy', train_acc_1, step=epoch)
        tf.summary.scalar('accuracy5', train_acc_5, step=epoch)
    with test_summary_writer.as_default():
        tf.summary.scalar('loss', valid_loss, step=epoch)
        tf.summary.scalar('accuracy', valid_acc_1, step=epoch)
        tf.summary.scalar('accuracy5', valid_acc_5, step=epoch)
    with lr_summary_writer.as_default():
        # get_last_lr()[0] is the rate of the first parameter group.
        tf.summary.scalar('LearningRate', scheduler.get_last_lr()[0], step=epoch)
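For reference, the Top-1 and Top-5 accuracies logged here count a prediction as correct when the true label is among the k highest-scoring classes. A minimal pure-Python sketch of that metric follows; the function name `top_k_accuracy` and the toy scores are made up, and k = 2 stands in for k = 5 to keep the example small.

```python
def top_k_accuracy(scores, labels, k):
    """Fraction of samples whose true label is among the k highest-scoring classes."""
    correct = 0
    for row, label in zip(scores, labels):
        top_k = sorted(range(len(row)), key=lambda c: row[c], reverse=True)[:k]
        correct += label in top_k
    return correct / len(labels)


scores = [
    [0.1, 0.6, 0.3],   # true class 1: correct at k=1
    [0.5, 0.2, 0.3],   # true class 2: only within the top 2
    [0.2, 0.7, 0.1],   # true class 2: lowest score, missed even at k=2
]
labels = [1, 2, 2]

print(top_k_accuracy(scores, labels, 1))   # 1 of 3 samples correct
print(top_k_accuracy(scores, labels, 2))   # 2 of 3 samples correct
```

Top-k accuracy can only stay the same or rise as k grows, which is why the Top-5 curves sit above the Top-1 curves throughout.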

IV. Experiment Results

Cosine Learning Rate Decay

Table 2 shows the results of applying Cosine Decay to the Base Model; they are plotted in yellow in (a) to (d).

Table 2. Cosine Decay

(a) Learning Rate Schedule

The Learning Rate at each Epoch is shown in (a). Compared with the Base Model, Cosine Decay gives better Loss and Accuracy. Notably, although its initial Learning Rate is higher than with OneCycle, the initial Train Loss is lower, as shown in (b).

(b) Train Loss

(d) Top-5 Validation Accuracy

Label Smoothing

Table 3 shows the results of applying Label Smoothing to the Base Model and the Teacher Model. In (a) to (c), the Teacher Model trained with Label Smoothing is plotted in black and the Base Model trained with Label Smoothing in light green.

Table 3. Label Smoothing

(a) Train Loss

With Label Smoothing applied, both Top-1 and Top-5 accuracy are better than those of the plain Base Model over the entire run, whereas the Train Loss decreases more slowly than the Base Model's.

(b) Top-1 Validation Accuracy

One notable point is that for the Teacher Model, the Top-5 metric instead decreased slightly.