MewwSikk

This post was written for study purposes as a close transcription of the blog post at https://kuklife.tistory.com/121.

paper Link: https://arxiv.org/pdf/1802.02611.pdf

###########################

 

The DeepLab V3+ paper was written at Google around August 2018.

 

There are many approaches to semantic segmentation, and the DeepLab series consistently ranks among the top-performing segmentation models.

This post looks at DeepLab V3+, the most recent and best-performing model in the series.

Throughout the series, DeepLab proposes making heavy use of atrous convolution to solve semantic segmentation well. To see the overall flow, let's briefly review what changed in each version.

 

- DeepLab V1: First applied atrous convolution.

- DeepLab V2: Proposed Atrous Spatial Pyramid Pooling (ASPP) to capture multi-scale context.

- DeepLab V3: Proposed applying atrous convolution to the existing ResNet structure to obtain denser feature maps.

- DeepLab V3+: Proposed atrous separable convolution, which combines depthwise separable convolution with atrous convolution.

 

As related work, the DeepLab V3+ paper discusses the atrous convolution introduced in V1, the ASPP introduced in V2, and finally depthwise separable convolution, so this post covers them as well before moving on to the main content.

 

Related Works

1) Atrous Convolution

atrous Convolution

The "trous" in "atrous" means hole. Accordingly, unlike a standard convolution, an atrous convolution operates with empty gaps inside the filter.

In the figure above, the parameter r (the rate) determines how much empty space is left: when r is 1 the operation is identical to a standard convolution, and the gaps widen as r grows.
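As a rough illustration of the rate r, here is a minimal 1D sketch in plain Python. The function name atrous_conv1d and the toy signal are made up for this post and do not come from the paper; real implementations operate on 2D feature maps with learned weights.

```python
def atrous_conv1d(signal, kernel, rate=1):
    """Valid (no padding) 1D atrous convolution, written as cross-correlation."""
    span = (len(kernel) - 1) * rate  # distance between the first and last tap
    out = []
    for start in range(len(signal) - span):
        # Taps are spaced `rate` samples apart; rate=1 is a standard convolution.
        out.append(sum(w * signal[start + i * rate] for i, w in enumerate(kernel)))
    return out


signal = [1, 2, 3, 4, 5, 6, 7]
kernel = [1, 0, -1]

print(atrous_conv1d(signal, kernel, rate=1))  # [-2, -2, -2, -2, -2]
print(atrous_conv1d(signal, kernel, rate=2))  # [-4, -4, -4]
```

With rate=2 the same three weights compare samples that are four positions apart instead of two, which is the "holes in the filter" idea in its simplest form.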

 

The claimed benefit of atrous convolution is that it enlarges the field of view (the region a single pixel can see) while keeping the same number of parameters and the same amount of computation as a standard convolution.

In other words, it partly prevents the loss of detailed information and the progressive abstraction of features that occur across repeated convolution and pooling, which is why the DeepLab series uses it so heavily.
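To make the "same parameters, larger field of view" point concrete: a k*k filter at rate r covers k + (k-1)(r-1) pixels per side. The helper names below are hypothetical and only illustrate this arithmetic.

```python
def effective_kernel_size(kernel_size, rate):
    """Side length of the input region covered by one atrous filter."""
    return kernel_size + (kernel_size - 1) * (rate - 1)


def num_weights(kernel_size):
    """Number of learned weights in a single-channel k x k filter."""
    return kernel_size * kernel_size


for rate in (1, 2, 4):
    side = effective_kernel_size(3, rate)
    print(f"rate={rate}: covers {side}x{side} pixels with {num_weights(3)} weights")
```

The weight count stays at 9 for every rate, while the covered region grows from 3x3 to 5x5 to 9x9.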

 

(Additional note on field of view)

In semantic segmentation, high performance generally depends on the size of the receptive field, which determines how large a region of the input a single pixel at the end of the CNN can cover.
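As a small sketch of how the receptive field accumulates over stacked layers, the function below applies the usual recurrence: each layer adds (effective kernel size - 1) times the current spacing between output pixels. The layer lists are toy examples chosen for this post, not configurations from the paper.

```python
def receptive_field(layers):
    """layers: list of (kernel_size, stride, rate) tuples, ordered input to output."""
    rf, jump = 1, 1  # jump = spacing of output pixels measured in input pixels
    for kernel_size, stride, rate in layers:
        k_eff = kernel_size + (kernel_size - 1) * (rate - 1)
        rf += (k_eff - 1) * jump
        jump *= stride
    return rf


plain = [(3, 1, 1), (3, 1, 1), (3, 1, 1)]   # three ordinary 3x3 convolutions
atrous = [(3, 1, 1), (3, 1, 2), (3, 1, 4)]  # same layers with rates 1, 2, 4

print(receptive_field(plain))   # 7
print(receptive_field(atrous))  # 15
```

Both stacks have the same number of weights and keep the same resolution, but the atrous stack sees a region roughly twice as wide.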

See the 31*31 large-kernel paper: https://openaccess.thecvf.com/content/CVPR2022/papers/Ding_Scaling_Up_Your_Kernels_to_31x31_Revisiting_Large_Kernel_Design_CVPR_2022_paper.pdf

 

2) Atrous Spatial Pyramid Pooling (ASPP)

Atrous Spatial Pyramid Pooling(ASPP)

Spatial pyramid pooling is one of the techniques frequently used to improve semantic segmentation performance.

DeepLab V2 proposed ASPP, which applies atrous convolutions with different rates to a feature map in parallel and then merges the results. PSPNet, published around the same time, does not use atrous convolution but applies a similar pyramid pooling technique.

These methods build multi-scale context into the model structure, which helps produce more accurate semantic segmentation, and from DeepLab V3 onward ASPP has been used as a standard module.
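The parallel-branches idea behind ASPP can be sketched in 1D with plain Python. This is only an illustration under simplifying assumptions: the function names are hypothetical, one shared kernel is reused for every rate (a real ASPP learns separate filters per branch), and the extra branches and fusing convolution of the actual module are omitted.

```python
def atrous_conv1d_same(signal, kernel, rate):
    """1D atrous convolution, zero padded so output length equals input length."""
    half = (len(kernel) // 2) * rate
    padded = [0] * half + list(signal) + [0] * half
    return [
        sum(w * padded[pos + i * rate] for i, w in enumerate(kernel))
        for pos in range(len(signal))
    ]


def aspp_1d(signal, kernel, rates=(1, 2, 4)):
    """Run the same-sized kernel at several rates in parallel, then stack."""
    branches = [atrous_conv1d_same(signal, kernel, r) for r in rates]
    # "Concatenate" along a channel axis: one value per rate at each position.
    return [list(channels) for channels in zip(*branches)]


signal = [0, 0, 0, 0, 1, 0, 0, 0, 0]  # a single spike in the middle
features = aspp_1d(signal, kernel=[1, 1, 1])

print(features[4])  # [1, 1, 1]  every branch sees the spike at its center
print(features[0])  # [0, 0, 1]  only the rate-4 branch reaches it from position 0
```

Each position ends up with one response per rate, so nearby and distant context are available side by side.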

 

3) Depthwise Separable Convolution

Figure of a commonly used convolution

The figure above shows a standard convolution.

When the input image is 8*8*3 (H*W*C) and the convolution filter is 3*3 (F*F), a single filter has 3*3*3 (F*F*C) parameters. With 4 filters, the convolution has 3*3*3*4 (F*F*C*N) parameters in total.

 

Figure typically used to describe depthwise convolution

Instead of having each filter process the whole channel axis at once, the input is split along the channel axis as in the figure above, and the filter is replaced by several convolution filters whose channel depth is always 1. This operation is depthwise convolution; following it with a 1*1 (pointwise) convolution that mixes the channels gives depthwise separable convolution.

The reason for this more elaborate operation is that it achieves performance similar to a standard convolution while sharply reducing the number of parameters and the amount of computation.

 

 

For example, with an 8*8*3 input, 3*3 filters, and 16 output filters, the parameter counts are:

- Convolution: 3*3*3*16 = 432

- Depthwise Separable Convolution: 3*3*3 + 3*16 = 27 + 48 = 75

So the depthwise separable version needs roughly one sixth of the parameters.
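The counts above can be reproduced with two small helper functions. The function names are made up for this post, and biases are ignored, as in the arithmetic above.

```python
def conv_params(f, c_in, n_out):
    """Standard convolution: every filter spans all input channels."""
    return f * f * c_in * n_out


def depthwise_separable_params(f, c_in, n_out):
    """Depthwise (one f x f filter per channel) plus pointwise (1x1) weights."""
    depthwise = f * f * c_in
    pointwise = c_in * n_out
    return depthwise + pointwise


print(conv_params(3, 3, 16))                 # 432
print(depthwise_separable_params(3, 3, 16))  # 75
```

Note that the 8*8 spatial size of the input does not appear in either formula: it affects the amount of computation, not the number of parameters.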

 

Depthwise convolution is easier to understand if you think of each filter as operating on exactly one channel.

Depthwise separable convolution can be seen as taking the spatial dimension and the channel dimension, which a standard convolution filter processes together, and processing them separately.

Even though the two axes are handled separately, the final result still reflects both of them, so the operation can adequately take over the role of a standard convolution filter.
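The two-step structure can be written out directly in plain Python. This is a minimal sketch with hypothetical function names, nested lists in channel-first order, no padding, and constant weights chosen only so the output is easy to check by hand.

```python
def depthwise_conv(image, filters):
    """image[c][y][x]; filters[c][i][j]: one f x f filter per input channel."""
    f = len(filters[0])
    height, width = len(image[0]), len(image[0][0])
    out = []
    for channel, filt in zip(image, filters):
        out.append([
            [
                sum(filt[i][j] * channel[y + i][x + j]
                    for i in range(f) for j in range(f))
                for x in range(width - f + 1)
            ]
            for y in range(height - f + 1)
        ])
    return out


def pointwise_conv(image, weights):
    """weights[n][c]: a 1x1 convolution mixing C input channels into N outputs."""
    height, width = len(image[0]), len(image[0][0])
    return [
        [
            [sum(w * image[c][y][x] for c, w in enumerate(row)) for x in range(width)]
            for y in range(height)
        ]
        for row in weights
    ]


# 3 channels of 8x8 ones, 3x3 all-ones depthwise filters, 16 pointwise outputs.
image = [[[1.0] * 8 for _ in range(8)] for _ in range(3)]
dw_filters = [[[1.0] * 3 for _ in range(3)] for _ in range(3)]
pw_weights = [[1.0] * 3 for _ in range(16)]

spatial = depthwise_conv(image, dw_filters)   # 3 x 6 x 6, spatial step only
output = pointwise_conv(spatial, pw_weights)  # 16 x 6 x 6, channel step only
print(len(output), len(output[0]), len(output[0][0]), output[0][0][0])  # 16 6 6 27.0
```

The depthwise step uses 3*3*3 = 27 weights and the pointwise step 3*16 = 48, which is the 75 from the count above.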

 

Depthwise convolution: https://gaussian37.github.io/dl-concept-dwsconv/

The post linked above, on gaussian37's blog, explains the depthwise separable convolution operation in detail, so the link is included here.

 

4) Encoder-Decoder

Finally, DeepLab V3+ arranges the modules described above into an encoder-decoder structure.

U-net architecture

Like U-Net, it applies an encoder-decoder structure with intermediate connections, which preserves more spatial detail and makes it possible to predict object boundaries.

 

Method

DeepLab V3+ uses DeepLab V3 as the encoder, and for the decoder, instead of plain bilinear upsampling, it concatenates features in a way similar to U-Net.

Let's first look briefly at the structures of DeepLab V3 and DeepLab V3+ in figures.

DeepLab V3 architecture

DeepLab V3 basically uses ResNet as its backbone, and its structure is as follows.

- Encoder: ResNet with atrous convolution

- ASPP

- Decoder: Bilinear upsampling

 

DeepLab V3+ architecture

The structure of DeepLab V3+ is as follows.

- Encoder: the ResNet with atrous convolution is replaced by Xception

- ASPP is replaced by ASSPP (Atrous Separable Spatial Pyramid Pooling)

- Decoder: bilinear upsampling is replaced by a simplified U-Net style decoder

In more detail:

1) Encoder-Decoder with Atrous Convolution

- Encoder

Atrous convolution lets the DCNN extract feature maps at an arbitrary resolution.

This is where the notion of output stride comes in. It can be thought of as "the ratio of the input image resolution to the final output resolution". That is, if the final feature map is 32 times smaller than the input image, the output stride is 32.

For semantic segmentation, to retain more detailed information, the striding in the last one or two blocks is removed and atrous convolution is applied instead, which reduces the output stride to 16 or 8.
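The stride-to-rate trade can be sketched as a small planning function. This is a simplified per-layer view under assumptions of my own: the list of strides is a toy backbone, the function names are hypothetical, and the convention used (the layer that loses its stride keeps the current rate, later layers are dilated by the dropped stride) is one common way to do it rather than a statement about any specific codebase.

```python
def output_stride(strides):
    """Product of all strides = input resolution / output resolution."""
    total = 1
    for s in strides:
        total *= s
    return total


def limit_output_stride(strides, target_os):
    """Return a (stride, rate) pair per layer so the output stride stops at target_os."""
    current, rate = 1, 1
    plan = []
    for s in strides:
        if current >= target_os:
            # Keep the resolution: use stride 1 and dilate the layers that follow.
            plan.append((1, rate))
            rate *= s
        else:
            plan.append((s, 1))
            current *= s
    return plan


strides = [2, 2, 2, 2, 2, 1, 1]  # toy backbone with output stride 32
print(output_stride(strides))            # 32
print(limit_output_stride(strides, 16))  # last stride removed, later layers use rate 2
print(limit_output_stride(strides, 8))   # last two strides removed, rates grow to 2 and 4
```

Dilating the later layers by the dropped stride keeps their filters looking at the same extent of the input as before, while the feature map itself stays larger.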

 

As shown in the figure below, ASPP (Atrous Spatial Pyramid Pooling), which uses atrous convolutions with several rates, is applied to capture objects at various scales.

- Decoder

In DeepLab V3 the decoder was simply bilinear upsampling. V3+ adds a step in which the encoder's final output is reduced in channels with a 1*1 convolution, bilinearly upsampled, and then concatenated with low-level features from the backbone.
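To follow the decoder step by step, it can help to track only the tensor shapes. The sketch below does that with plain arithmetic. The 512*512 input size is an arbitrary choice, and the channel counts (256 for the encoder output, 48 for the reduced low-level features, 21 classes for PASCAL VOC) are what I recall from the paper's settings, so treat them as assumptions; the function name is hypothetical.

```python
def decoder_shapes(input_size=512, encoder_os=16, low_level_os=4,
                   encoder_channels=256, low_level_channels=48,
                   refine_channels=256, num_classes=21):
    """Return (step, channels, height/width) for each decoder step."""
    enc_size = input_size // encoder_os
    low_size = input_size // low_level_os
    return [
        ("encoder output", encoder_channels, enc_size),
        ("bilinear upsample", encoder_channels, low_size),
        ("low-level features after 1x1 conv", low_level_channels, low_size),
        ("concat", encoder_channels + low_level_channels, low_size),
        ("3x3 conv refinement", refine_channels, low_size),
        ("1x1 conv to class scores", num_classes, low_size),
        ("bilinear upsample to input size", num_classes, input_size),
    ]


shapes = decoder_shapes()
for step, channels, size in shapes:
    print(f"{step:36s} {channels:4d} x {size} x {size}")
```

The key line is the concat: upsampled encoder features and low-level features must share the same spatial size (here 128*128) before they can be stacked along the channel axis.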

 

2) Modified Aligned Xception

Before getting to the main point: Xception applies depthwise separable convolution to the Inception module.

(Additional note on Xception)

The paper does not explain what an Inception module is, so let's start there.

Inception Module, naive version

A convolutional network usually shrinks W and H while increasing C, and the structure above lets Inception obtain a dimensionality-reduction effect.

- W and H are reduced by max pooling.

- C is set by the number of convolution filters. (Ordinary convolutions usually increase C, which adjusts the width of the model.)

- Here, the 1*1 operation uses a 1*1 convolution filter not to increase C but to reduce it.

Such a 1x1 convolution works similarly to a fully connected layer (FCL), so it is also referred to as Network in Network (NIN). This is because a 1x1 convolution relates the information in different channels to one another, much as an FCL does. Unlike an FCL, however, a 1x1 convolution has the advantage of preserving spatial information.
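The "fully connected layer applied at every pixel" view of a 1x1 convolution can be shown in a few lines of plain Python. The function name, the tiny 2*2 feature map, and the hand-picked weights are all made up for illustration.

```python
def conv1x1(feature_map, weights):
    """feature_map[y][x] is a list of C values; weights[n] is a list of C values."""
    return [
        [[sum(w * v for w, v in zip(row, pixel)) for row in weights] for pixel in line]
        for line in feature_map
    ]


# A 2x2 feature map with 4 channels per pixel.
feature_map = [[[1, 2, 3, 4], [0, 1, 0, 1]],
               [[2, 2, 2, 2], [4, 3, 2, 1]]]

# Two output channels: one copies channel 0, the other averages all channels.
weights = [[1, 0, 0, 0],
           [0.25, 0.25, 0.25, 0.25]]

reduced = conv1x1(feature_map, weights)
print(reduced[0][0])  # [1, 2.5]
print(reduced[1][1])  # [4, 2.5]
```

The channel count drops from 4 to 2, but the 2*2 spatial layout is untouched, which is exactly the difference from flattening everything into one fully connected layer.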

 

Returning to DeepLab V3+: it uses Xception as the backbone, but makes three changes relative to MSRA's Aligned Xception.

Left: MSRA's Xception model. Right: the modified Xception model used in the paper.

The modified Xception model used in the paper differs from the original model as follows.

- For fast computation and memory efficiency, the entry flow structure is left unmodified.

- To apply atrous separable convolution, all max-pooling layers are replaced by depthwise separable convolutions with striding.

- Batch normalization and a ReLU activation are added after each 3*3 depthwise convolution.

 

3) Experiment

train OS: the output stride during training, eval OS: the output stride during evaluation

Experiments were run over a variety of parameters and settings. The first set of results measures performance with ResNet-101 as the encoder.

Replacing bilinear upsampling in the decoder with the simplified U-Net style structure improves mIoU by 1.64% over the baseline.
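For readers unfamiliar with the metric, here is a minimal sketch of how mean intersection over union (mIoU) can be computed from flat lists of predicted and ground-truth labels. The function name and the toy labels are hypothetical, and benchmark evaluations typically accumulate the counts over the whole dataset rather than a single image.

```python
def mean_iou(pred, target, num_classes):
    """Average of per-class intersection / union over classes that occur."""
    ious = []
    for cls in range(num_classes):
        inter = sum(1 for p, t in zip(pred, target) if p == cls and t == cls)
        union = sum(1 for p, t in zip(pred, target) if p == cls or t == cls)
        if union:  # skip classes absent from both prediction and ground truth
            ious.append(inter / union)
    return sum(ious) / len(ious)


pred = [0, 0, 1, 1, 2, 2]
target = [0, 1, 1, 1, 2, 0]

# Per-class IoU: class 0 = 1/3, class 1 = 2/3, class 2 = 1/2, so the mean is about 0.5.
print(mean_iou(pred, target, num_classes=3))
```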

 

Inference strategy on the PASCAL VOC 2012 val set when using modified Xception
Decoder Effect

Qualitative effect of employing the proposed decoder module compared with the naive bilinear upsampling (denoted as BU). In the examples, we adopt Xception as feature extractor and train output stride = eval output stride = 16.

Next, replacing the encoder with Xception brought a further improvement of about 2%.

When all convolutions in the ASPP and decoder were replaced by separable convolutions, accuracy stayed almost the same as with standard convolutions, while the amount of computation used by the model dropped substantially.
