[Super Resolution] Transforming Image Super-Resolution: A ConvFormer-based Efficient Approach

Posted Jan 17, 2024 Updated Jan 18, 2024

By hjinnkim 5 min read

Transforming Image Super-Resolution: A ConvFormer-based Efficient Approach

The paper I read today is the arXiv paper “Transforming Image Super-Resolution: A ConvFormer-based Efficient Approach”. (https://arxiv.org/abs/2401.05633v1?utm_source=tldrai)

This work targets Single Image Super Resolution in efficiency-constrained settings: it reduces the amount of computation while improving performance.

Abstract

Single Image Super Resolution (SISR) methods show excellent performance, but their computational cost makes them hard to adopt in resource-restricted environments. In particular, the well-performing Transformer-based models incur a large computational cost in self-attention.
To address this, the authors propose the Convolutional Transformer layer (ConvFormer), which uses large kernel convolution as the feature mixer in place of self-attention and models long-range dependencies efficiently.
They also propose the Edge-preserving Feed-forward Network (EFN), which aggregates local features while preserving high-frequency information.
Combining these two modules, they introduce ConvFormer-based Super-Resolution (CFSR).

Introduction

The paper proposes a new module, the Convolutional Transformer Layer (ConvFormer), that balances computational cost and performance, and introduces the ConvFormer-based Super-Resolution network (CFSR), which uses it for the super-resolution task.
ConvFormer adopts large kernel convolution in place of self-attention. Removing self-attention substantially reduces both computation and memory usage.
The authors also propose the Edge-preserving Feed-forward Network (EFN) to improve edge extraction. EFN combines the 3x3 depth-wise convolution commonly used in vision tasks with an image gradient prior to preserve high-frequency information, and through re-parameterization it improves the performance of lightweight models without increasing computation or parameters.

Backgrounds

SISR has successfully improved performance by increasing parameter counts, but lightweight SISR for deploying models on resource-limited devices has also been studied extensively.
Recently, SISR work using large kernel convolution has shown good performance (ShuffleMixer).
Transformer-based SISR has shown good performance (SwinIR and others), but because of self-attention it requires more computational resources than CNN-based methods.

Pipeline

The first 3x3 convolution is the shallow feature extractor: it maps the LR image from HxWx3 to HxWxC, that is, into a latent space at the same resolution.
The two RCFB blocks that follow form the deep feature extractor. They contain several ConvFormer layers, and a 3x3 convolution follows the RCFBs for local feature aggregation.
The final reconstruction stage takes the sum of the shallow and deep features and produces the HR image. It includes a 3x3 convolution and a pixel-shuffle operation.
The loss function is the L1 pixel loss.
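To make the pixel-shuffle step in the reconstruction stage concrete, here is a minimal pure-Python sketch. It assumes the common channel ordering in which an input of shape (C*r*r, H, W) is rearranged into (C, H*r, W*r). The function name `pixel_shuffle` and the toy input are my own illustration, not code from the paper.

```python
def pixel_shuffle(x, r):
    """Rearrange a nested list of shape (C*r*r, H, W) into (C, H*r, W*r)."""
    channels_in = len(x)
    height, width = len(x[0]), len(x[0][0])
    if channels_in % (r * r) != 0:
        raise ValueError("channel count must be divisible by r*r")
    channels_out = channels_in // (r * r)
    out = [
        [[0] * (width * r) for _ in range(height * r)]
        for _ in range(channels_out)
    ]
    for c in range(channels_out):
        for h in range(height):
            for w in range(width):
                for i in range(r):
                    for j in range(r):
                        # Each group of r*r input channels fills one r x r output patch.
                        src_channel = c * r * r + i * r + j
                        out[c][h * r + i][w * r + j] = x[src_channel][h][w]
    return out


# Toy input (assumed values): 4 channels of a 1x2 feature map, upscale factor 2.
x = [[[0, 1]], [[10, 11]], [[20, 21]], [[30, 31]]]
y = pixel_shuffle(x, 2)
print(y)  # one channel of size 2x4
```

The point is that upsampling itself has no learned weights: the preceding 3x3 convolution produces r*r times more channels, and the shuffle only moves them into spatial positions.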

ConvFormer Layer

ConvFormer is the key module that replaces self-attention.

It consists of two main modules.

Large Kernel Mixer

Here, DWConv denotes depth-wise convolution. The kernel size K is reported to be 9. Since K ≪ C in typical settings, it is said to be more efficient than Multi-Head Self-Attention.
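To get a feel for the efficiency argument, the following rough sketch counts multiply-accumulates (MACs). It uses the standard textbook cost of a depth-wise convolution and the usual estimates for global and window-based multi-head self-attention. The feature-map size, channel count, and window size are assumed values for illustration, and the depth-wise count covers only the KxK spatial mixing, not any point-wise convolutions around it, so this is not the paper's own complexity analysis.

```python
def depthwise_conv_macs(height, width, channels, kernel_size):
    """MACs of a stride-1 depth-wise KxK convolution over an HxWxC feature map."""
    return height * width * channels * kernel_size * kernel_size


def global_msa_macs(height, width, channels):
    """Usual estimate for global self-attention: 4*N*C^2 + 2*N^2*C, with N = H*W."""
    tokens = height * width
    return 4 * tokens * channels ** 2 + 2 * tokens ** 2 * channels


def window_msa_macs(height, width, channels, window):
    """Usual estimate for window attention: 4*N*C^2 + 2*M^2*N*C, with window size M."""
    tokens = height * width
    return 4 * tokens * channels ** 2 + 2 * window ** 2 * tokens * channels


# Assumed sizes for illustration only.
H, W, C, K, M = 64, 64, 64, 9, 8
dw = depthwise_conv_macs(H, W, C, K)
global_msa = global_msa_macs(H, W, C)
window_msa = window_msa_macs(H, W, C, M)
print(dw, window_msa, global_msa)
```

The depth-wise cost grows linearly in both the number of pixels and the number of channels, while attention carries a term quadratic in the channel count and, in the global case, another quadratic in the number of pixels.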

Edge-preserving Feed-forward Network (EFN)

The EFN operation is as follows.

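Previous papers have shown that taking edge information into account in intermediate features improves performance. The edge-preserving property comes from the EDC in the middle of the EFN, which is depicted as follows.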

In short, the EDC is a combination, weighted by (learnable) alphas, of a depth-wise convolution and convolutions whose depth-wise filters are Sobel and Laplacian filters, which detect first- and second-order derivatives (gradients).

The formula is as follows.
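Based only on the description above, the EDC can be sketched as the sum below. The notation is my own: $W_{dw}$ is the learned 3x3 depth-wise kernel, $K_{sx}$, $K_{sy}$, and $K_{lap}$ are the fixed Sobel and Laplacian kernels in their standard form, and $\alpha_1, \alpha_2, \alpha_3$ are the learnable scales. Splitting Sobel into a horizontal and a vertical branch is an assumption, so the paper's exact form may differ.

```latex
\mathrm{EDC}(X) = W_{dw} \ast X
  + \alpha_1 \,(K_{sx} \ast X)
  + \alpha_2 \,(K_{sy} \ast X)
  + \alpha_3 \,(K_{lap} \ast X)

K_{sx} = \begin{bmatrix} -1 & 0 & 1 \\ -2 & 0 & 2 \\ -1 & 0 & 1 \end{bmatrix},
\quad
K_{sy} = \begin{bmatrix} -1 & -2 & -1 \\ 0 & 0 & 0 \\ 1 & 2 & 1 \end{bmatrix},
\quad
K_{lap} = \begin{bmatrix} 0 & 1 & 0 \\ 1 & -4 & 1 \\ 0 & 1 & 0 \end{bmatrix}
```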

Running each of these depth-wise convolutions separately would increase computation, so they are computed in a single pass via re-parameterization. (Merged EDC by re-parameterization at inference)
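Why the merge is possible can be checked with a small pure-Python sketch: convolution is linear, so a weighted sum of several 3x3 convolutions equals one 3x3 convolution with the weighted sum of their kernels. The code works on a single channel, which corresponds to one channel of a depth-wise convolution. The function names, the Sobel-x / Sobel-y / Laplacian branch layout, and all numeric values are assumptions for illustration, not the paper's implementation.

```python
# Fixed 3x3 edge kernels (standard definitions).
SOBEL_X = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
SOBEL_Y = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]
LAPLACIAN = [[0, 1, 0], [1, -4, 1], [0, 1, 0]]
FIXED_KERNELS = (SOBEL_X, SOBEL_Y, LAPLACIAN)


def conv3x3(image, kernel):
    """Zero-padded, stride-1 3x3 cross-correlation on a single channel."""
    height, width = len(image), len(image[0])
    out = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            for dy in range(3):
                for dx in range(3):
                    yy, xx = y + dy - 1, x + dx - 1
                    if 0 <= yy < height and 0 <= xx < width:
                        out[y][x] += kernel[dy][dx] * image[yy][xx]
    return out


def edc_multi_branch(image, learned, alphas):
    """Multi-branch form: one convolution per branch, then a weighted sum."""
    total = conv3x3(image, learned)
    for alpha, kernel in zip(alphas, FIXED_KERNELS):
        branch = conv3x3(image, kernel)
        for y in range(len(image)):
            for x in range(len(image[0])):
                total[y][x] += alpha * branch[y][x]
    return total


def merge_kernels(learned, alphas):
    """Fold the alpha-scaled fixed kernels into the learned kernel."""
    return [
        [
            learned[i][j] + sum(a * k[i][j] for a, k in zip(alphas, FIXED_KERNELS))
            for j in range(3)
        ]
        for i in range(3)
    ]


def edc_merged(image, learned, alphas):
    """Merged form: a single convolution with the re-parameterized kernel."""
    return conv3x3(image, merge_kernels(learned, alphas))


# Toy values (assumed) to compare the two forms.
image = [[1, 2, 0, 3], [4, 1, 2, 2], [0, 3, 1, 5], [2, 2, 4, 1]]
learned = [[0.5, -1.0, 0.25], [0.0, 2.0, -0.5], [1.0, 0.25, -0.75]]
alphas = (0.5, -0.25, 2.0)

multi = edc_multi_branch(image, learned, alphas)
merged = edc_merged(image, learned, alphas)
max_diff = max(abs(a - b) for ra, rb in zip(multi, merged) for a, b in zip(ra, rb))
print(max_diff)
```

The multi-branch form runs four convolutions and the merged form runs one, yet the outputs agree up to floating-point rounding. That is why the edge prior adds no inference cost.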

Performance

With fewer parameters and FLOPs, it achieves performance close to that of Transformer-based models.

Closing thoughts

I learned that depth-wise convolution is a workhorse in SISR. It was also new to me that large kernel convolution is effective at modeling long-range dependencies. However, to take every spatial position into account the way self-attention does, the input has to pass through several ConvFormer blocks, so it seems hard to adopt in diffusion model architectures.
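The "several blocks" point can be made concrete with the standard receptive-field rule for stacked stride-1 convolutions: n layers of KxK kernels see n*(K-1)+1 pixels along each axis. The sketch below counts only the large kernel convolutions and ignores any other layers, and the 64-pixel extent is an assumed example size.

```python
import math


def receptive_field(num_layers, kernel_size=9):
    """Receptive field (per axis) of stacked stride-1 KxK convolutions."""
    return num_layers * (kernel_size - 1) + 1


def layers_to_cover(extent, kernel_size=9):
    """Smallest number of stacked layers whose receptive field reaches `extent` pixels."""
    return math.ceil((extent - 1) / (kernel_size - 1))


needed = layers_to_cover(64)  # assumed 64-pixel-wide feature map
print(needed, receptive_field(needed))
```

A single self-attention layer over the same map relates every position to every other position in one step, which is the gap I have in mind here.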

Even so, reaching performance close to lightweight Transformer-based methods with a purely CNN-based method looks like an encouraging result.

Super Resolution, Light Weight

super resolution

This post is licensed under CC BY 4.0 by the author.