
[ComputerVision] 12. Image Segmentation

hwijin97 2022. 6. 2. 22:24

Goals of segmentation

- Group together similar-looking pixels for efficiency of further processing

  > simplify and/or change the representation of an image into something that is more meaningful and easier to analyze

 

Major processes for segmentation

- Separate the image into coherent 'objects'

  > Bottom-up : group tokens with similar features

  > Top-down : group tokens that likely belong to the same object

  > Supervised or unsupervised

 

Segmentation using clustering

- K-means

- Mean-shift

 

Feature Space

- Cluster similar pixels (features) together

 

Segmentation as clustering

- Cluster together tokens with high similarity (small distance in feature space)

- Questions

  > How many clusters?

  > Which data point belongs to which group?

 

K-means algorithm

- $\text{argmin}_{S,\mu_i, i=1\cdots K} \sum_{i=1}^K \sum_{x\in S_i} ||x - \mu_i||^2$

- Partition the data into K sets $S = \{S_1, S_2, \cdots, S_K\}$ with corresponding centers $\mu_i$

- Partition such that the variance in each partition is as low as possible

----

- Assign each of the N points, $x_j$, to clusters by nearest $\mu_i$

- Re-compute mean $\mu_i$ of each cluster from its member points

- if no mean has changed more than some $\epsilon$, stop

- This solves the optimization problem $\text{argmin}_S \sum_{i=1}^K \sum_{x_j \in S_i} ||x_j - \mu_i||^2$, where each inner sum is the per-cluster error $e(\mu_i)$

- Every iteration is a step of gradient descent: setting $\frac{\partial e}{\partial \mu_i} = 0$ gives $\mu_i^{(t+1)} = \frac{1}{|S_i^{(t)}|}\sum_{x_j \in S_i^{(t)}} x_j$

----

- an iterative technique that is used to partition an image into K clusters

  > Pick K cluster centers, either randomly or based on some heuristic

  > Assign each pixel in the image to the cluster that minimizes the distance between the pixel and the cluster center

  > Re-compute the cluster centers by averaging all of the pixels in the cluster

  > Repeat steps 2 and 3 until convergence is attained (i.e. no pixels change clusters)

- distance : the squared or absolute difference between a pixel and a cluster center, e.g. pixel color, intensity, texture and location, or a weighted combination of these factors

- K can be selected manually, randomly, or by a heuristic

- The quality of the solution depends on the initial set of clusters and the value of K
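A minimal NumPy sketch of the steps above, assuming an RGB image array `img` of shape (H, W, 3); the function name and parameters are illustrative, not from the lecture:

```python
import numpy as np

def kmeans_segment(img, K=4, n_iter=20, eps=1e-4, seed=0):
    """Cluster pixels by color with plain K-means (Lloyd-style updates)."""
    X = img.reshape(-1, 3).astype(np.float64)          # N x 3 color features
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), size=K, replace=False)]  # step 1: pick K centers

    for _ in range(n_iter):
        # step 2: assign each pixel to its nearest center
        # (the N x K distance table below is why K-means is memory intensive)
        d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # step 3: recompute each center as the mean of its member pixels
        new_mu = np.array([X[labels == k].mean(axis=0) if np.any(labels == k)
                           else mu[k] for k in range(K)])
        if np.linalg.norm(new_mu - mu) < eps:          # step 4: convergence
            mu = new_mu
            break
        mu = new_mu

    return labels.reshape(img.shape[:2]), mu           # label map and centers
```

Re-running with a different `seed` or `K` changes the result, matching the sensitivity to initialization and to K noted above.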

 

Common similarity / distance measures

- P-norms

- Mahalanobis : $d(\vec{x}, \vec{y}) = \sqrt{ \sum_{i=1}^N \frac{(x_i - y_i)^2}{\sigma_i^2} }$

  > Scaled Euclidean

- Cosine distance : similarity $= \cos \theta = \frac{A \cdot B}{||A||\,||B||}$
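For concreteness, the three measures written out in NumPy (a small sketch; `sigma` holds assumed per-dimension standard deviations for the scaled-Euclidean form):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 0.0, 3.5])
sigma = np.array([0.5, 1.0, 2.0])      # assumed per-dimension std. deviations

p = 3
p_norm = np.sum(np.abs(x - y) ** p) ** (1.0 / p)              # general p-norm
mahalanobis = np.sqrt(np.sum((x - y) ** 2 / sigma ** 2))      # scaled Euclidean
cos_sim = x @ y / (np.linalg.norm(x) * np.linalg.norm(y))     # cos(theta)
cos_dist = 1.0 - cos_sim               # turn the similarity into a distance
```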

 

Segmentation as clustering

- K-means clustering based on intensity or color is essentially vector quantization of the image attributes.

  > Clusters don't have to be spatially coherent.

- Cluster similar pixels (features) together

  > Clustering based on (r, g, b, x, y) values enforces more spatial coherence.

 

K-means for segmentation

- Pros

  > Simple and fast 

  > Converges to a local minimum of the error function

  > Easy to implement

- Cons

  > Memory intensive

  > Need to choose K

  > Sensitive to initialization

  > Sensitive to outliers

- Usage

  > Rarely used for pixel segmentation

 

Mean shift segmentation

- Versatile technique for clustering-based segmentation

- Non-parametric feature-space analysis technique, a so-called mode-seeking algorithm (the mode is the value that appears most often in a set of data values)

- Mean shift algorithm seeks modes or local maxima of density in the feature space.

- a procedure for locating the maxima (mode) of a density function given discrete data sampled from that function

- an iterative method, and we start with an initial estimate $x$.

- Let a kernel function $K(x_i - x)$ be given. This function determines the weight of nearby points for re-estimation of the mean. Typically a Gaussian kernel on the distance to the current estimate is used, $K(x_i - x) = e^{-c||x_i - x||^2}$

- The weighted mean of the density in the window determined by $K$ is $m(x) = \frac{\sum_{x_i\in N(x)} K(x_i-x)\,x_i }{\sum_{x_i \in N(x)} K(x_i - x) }$, where $N(x)$ is the neighborhood of $x$.

- The mean-shift algorithm now sets $x \leftarrow m(x)$, and repeats the estimation until $m(x)$ converges.
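A direct NumPy transcription of this iteration, assuming a sample matrix `X` (one row per point) and the Gaussian kernel $e^{-c||x_i - x||^2}$; since that kernel has unbounded support, the neighborhood $N(x)$ here is the whole sample set:

```python
import numpy as np

def seek_mode(X, x0, c=1.0, n_iter=100, eps=1e-5):
    """Iterate x <- m(x) until the weighted mean m(x) stops moving (a mode)."""
    x = x0.astype(np.float64)
    for _ in range(n_iter):
        w = np.exp(-c * np.sum((X - x) ** 2, axis=1))   # K(x_i - x) for all i
        m = (w[:, None] * X).sum(axis=0) / w.sum()      # weighted mean m(x)
        if np.linalg.norm(m - x) < eps:                 # m(x) has converged
            return m
        x = m                                           # x <- m(x)
    return x
```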

 

Kernel density estimation

- Histogram vs Generic KDE

- General non-parametric density estimate: $p(x) \approx \frac{k}{NV}$ ($V$ : volume of the region $R$ surrounding $x$, $N$ : total number of samples, $k$ : number of samples inside $R$)

- Two approaches to kernel density estimation

  > KDE (Kernel Density Estimation)

  • Fix the volume $V$ and let the number of samples $k$ inside it vary to estimate the density function
  • The Parzen window method (also called the Parzen-Rosenblatt window) is the representative example: specify the volume $V_N$ as a function of $N$, e.g. $V_N = \frac{1}{\sqrt{N}}$, shrinking the region to estimate the optimal density

  > k-NNR (k-Nearest Neighbor Rule) estimation

  • Fix the number of samples $k$ and vary the volume $V$ instead
  • Shrink the volume around a sample point until it contains $k_N = \sqrt{N}$ samples, estimating the optimal density along the way

- Both KDE and k-NNR converge to the true probability density as $N$ grows

- $V$ and $k$ are set adaptively as functions of $N$.
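A 1-D sketch of the k-NN variant of $p(x) \approx \frac{k}{NV}$, with $k$ fixed at $\sqrt{N}$ and the volume grown to reach the $k$-th nearest sample (function and variable names are illustrative):

```python
import numpy as np

def knn_density(x, samples, k=None):
    """Estimate p(x) = k / (N * V), where V is the 1-D 'volume' (interval
    length) just reaching the k-th nearest sample."""
    N = len(samples)
    k = k or max(1, int(np.sqrt(N)))         # common choice: k_N = sqrt(N)
    r = np.sort(np.abs(samples - x))[k - 1]  # distance to k-th nearest sample
    return k / (N * 2.0 * r)                 # V = 2r in one dimension

samples = np.random.default_rng(0).normal(size=1000)
print(knn_density(0.0, samples))  # near the standard normal peak, about 0.4
```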

 

 

Parzen window based density estimation

- General non-parametric density estimate with $V$ fixed and $k$ varying: $p(x) \approx \frac{k}{NV}$

- Let the region $R$ be a hypercube centered at the point $x$ with side length $h$, containing $k$ samples -> volume: $V = h^D$

- Parzen window : the unit hypercube centered on the query point

- Window function deciding how many samples fall inside the range: $K(u) = \left\{\begin{matrix} 1, & |u_j| < \frac{1}{2} \text{ for all } j \\ 0, & \text{otherwise} \end{matrix}\right.$

- Total number of points inside the hypercube: $k = \sum_{n=1}^N K \left(\frac{x - x^{(n)}}{h}\right)$

- Generalizing the Parzen-Rosenblatt window density estimate

  > The count for the region is 1 only when the point $x^{(n)}$ falls inside the range of length $h$ centered at $x$, and 0 otherwise

  > The density is estimated from the total number of samples contained in the volume $V$

  > General kernel density estimator: $P_{KDE}(x) = \frac{1}{Nh^D}\sum_{n=1}^N K \left(\frac{x-x^{(n)}}{h}\right)$
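The hypercube estimator above, written out in NumPy for D-dimensional samples (a sketch; the kernel contributes 1 only when every coordinate of the scaled difference lies inside the unit hypercube):

```python
import numpy as np

def parzen_density(x, samples, h):
    """P_KDE(x) = 1/(N h^D) * sum_n K((x - x_n) / h) with a hypercube kernel."""
    N, D = samples.shape
    u = (x - samples) / h                     # scaled differences, N x D
    inside = np.all(np.abs(u) < 0.5, axis=1)  # K(u) = 1 iff |u_j| < 1/2 for all j
    k = inside.sum()                          # samples inside the hypercube
    return k / (N * h ** D)                   # k / (N V), with V = h^D

samples = np.random.default_rng(0).normal(size=(2000, 2))  # 2-D Gaussian data
print(parzen_density(np.zeros(2), samples, h=0.5))  # true value is ~0.159
```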

 

- Role of the kernel function

  > Expected value of $P_{KDE}(x)$ -> $E[P_{KDE}(x)] = \frac{1}{Nh^D} \sum_{n=1}^N E\left[K\left(\frac{x-x^{(n)}}{h}\right)\right] = \frac{1}{h^D}E\left[K\left(\frac{x-x'}{h}\right)\right] = \frac{1}{h^D}\int K\left(\frac{x-x'}{h}\right)p(x')\,dx'$

  > Expected value of the estimated density $P_{KDE}(x)$ : the convolution of the kernel function with the true density

  > The kernel width $h$ acts as a smoothing parameter

  • The wider the kernel function, the smoother the estimate
  • $h \rightarrow 0$ : the kernel width -> 0 and its value -> infinity, i.e. as $h \rightarrow 0$ the kernel function converges to a delta function

- Role of the bandwidth $h$ in the kernel function

  > $p_n(x) = \frac{1}{Nh^D} \sum_{n=1}^N K\left(\frac{X-X^{(n)}}{h}\right)$

  > If we write the kernel as the following delta-like function, $\delta_n(X) = \frac{1}{V_n}K\left(\frac{X}{h_n}\right)$, then $p_n(X) = \frac{1}{n}\sum_{j=1}^n \delta_n(X-X^{(j)})$

  • As $h$ grows
    1. The height of the delta-like function decreases
    2. More sample data are covered at once
    3. The result is a broad, smooth density function
  • As $h$ shrinks
    1. The height of the delta-like function increases
    2. Fewer sample data fall inside the bandwidth
    3. The density function approaches a set of sharp peaks
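A tiny demo of this trade-off, evaluating a Gaussian-kernel KDE on bimodal data with three bandwidths (all names and numbers here are illustrative):

```python
import numpy as np

def gaussian_kde(x, samples, h):
    """1-D KDE with a Gaussian kernel; h controls the smoothing."""
    u = (x - samples) / h
    return (np.exp(-0.5 * u ** 2) / np.sqrt(2 * np.pi)).sum() / (len(samples) * h)

rng = np.random.default_rng(0)
samples = np.concatenate([rng.normal(-2, 0.5, 200), rng.normal(2, 0.5, 200)])
for h in (0.05, 0.3, 2.0):
    # small h: spiky, near-delta kernels; large h: broad, over-smoothed estimate
    print(h, gaussian_kde(0.0, samples, h))
```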

 

- Start with $y = y_0$ and iterate $y_0 \rightarrow y_1 \rightarrow y_2 \cdots$ until convergence

  > $y_{t+1} = \frac{\sum_{i=1}^n x_i\, k\left(\frac{x_i - y_t}{h}\right)}{\sum_{i=1}^n k\left(\frac{x_i - y_t}{h}\right)}$

  > $y_{t+1} = y_t + m(y_t)$ -> $m(y_t)$ is the mean-shift vector

 

Mean shift clustering

- Attraction basin : the region for which all trajectories lead to the same mode

- Cluster : all data points in the attraction basin of a mode

 

Mean shift clustering/segmentation

- Find features (color, gradients, texture, etc.)

- Initialize windows at individual feature points

- Perform mean shift for each window until convergence

- Merge windows that end up near the same "peak" or mode
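These four steps are essentially what scikit-learn's `MeanShift` performs on a feature matrix; a hedged usage sketch on (r, g, b) pixel features (appending (x, y) coordinates would enforce the spatial coherence mentioned earlier):

```python
import numpy as np
from sklearn.cluster import MeanShift, estimate_bandwidth

def meanshift_segment(img, quantile=0.1):
    """Segment an RGB image by mean-shift clustering of its color features."""
    X = img.reshape(-1, 3).astype(np.float64)        # one feature point per pixel
    bw = estimate_bandwidth(X, quantile=quantile, n_samples=500)
    ms = MeanShift(bandwidth=bw, bin_seeding=True)   # bin_seeding thins the seeds
    labels = ms.fit_predict(X)                       # windows converge and merge
    return labels.reshape(img.shape[:2]), ms.cluster_centers_
```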

 

 

Mean shift clustering

- The mean shift algorithm seeks modes of the given set of points

  > Choose kernel and bandwidth

  > For each point:

  1. Center a window on the point
  2. Compute the mean of the data in the search window
  3. Center the search window at the new mean location
  4. Repeat steps 2 and 3 until convergence

- Pros

  > Good general-purpose segmentation

  > Flexible in number and shape of regions

  > Robust to outliers

  > General mode-finding algorithm (useful for other problems such as finding most common surface normals)

- Cons

  > Have to choose kernel size in advance

  > Not suitable for high-dimensional features

- When to use it

  > Over-segmentation

  > Multiple segmentations

  > Tracking, clustering, filtering applications

 

 

Motion

Uses of motion in computer vision

- 3D shape reconstruction

- object segmentation

- Learning and tracking of dynamical models

- Event and activity recognition

- Self-supervised and predictive learning

 

Motion field

- The motion field is the projection of the 3D scene motion into the image

 

Optical flow

- Definition : The apparent motion of brightness patterns in the image

- Ideally, optical flow would be the same as the motion field.

- Have to be careful : apparent motion can be caused by lighting changes without any actual motion

  > Think of a uniform sphere rotating under fixed lighting vs. a stationary sphere under moving illumination

 

From images to videos

- A video is a sequence of frames captured over time

- Now our image data is a function of space (x, y) and time (t)

 

Motion estimation techniques

- Optical flow

  > Recover image motion at each pixel from spatio-temporal image brightness variations (optical flow)

- Feature-tracking

  > Extract visual features (corners, textured areas) and "track" them over multiple frames

 

 

Optical flow

- apparent motion of brightness patterns in the images

- Note : apparent motion can be caused by lighting changes without any actual motion

- Goal : recover image motion at each pixel from optical flow

 

Estimating Optical flow

- Given two subsequent frames, estimate the apparent motion field u(x,y) and v(x,y) between them

- Key assumptions

  > Brightness constancy : projection of the same point looks the same in every frame

  > Small motion : points do not move very far

  > Spatial coherence : points move like their neighbors

 

Brightness constancy constraint

- Brightness constancy equation : $I(x,y,t-1) = I(x + u(x,y), y+v(x,y), t)$

- Linearizing the right side using Taylor expansion :

  > $I(x+u, y+v, t) \approx I(x,y,t-1) + I_x \cdot u(x,y) + I_y \cdot v(x,y) + I_t$

  > $I(x+u, y+v, t) - I(x,y,t-1) = I_x \cdot u(x,y) + I_y \cdot v(x,y) + I_t$

  > $I_x \cdot u + I_y \cdot v + I_t \approx 0 \rightarrow \nabla I \cdot [u\ v]^T + I_t = 0$
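A sketch of how $I_x$, $I_y$, $I_t$ are commonly obtained from two grayscale frames so this constraint can be formed at every pixel; the choice of derivative operator is an assumption here, not prescribed by the notes:

```python
import numpy as np

def flow_derivatives(I0, I1):
    """Spatial and temporal derivatives for the brightness constancy constraint.
    I0, I1: two consecutive grayscale frames as float arrays."""
    Ix = np.gradient(I0, axis=1)   # horizontal spatial derivative
    Iy = np.gradient(I0, axis=0)   # vertical spatial derivative
    It = I1 - I0                   # temporal derivative between the frames
    return Ix, Iy, It

# each pixel then contributes one equation Ix*u + Iy*v + It = 0
# in the two unknowns (u, v) -- the underconstrained system discussed next
```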

 

Principle of optical flow estimation

- Using a Taylor series

  > $f(y+dy, x+dx, t+dt) = f(y,x,t) + \frac{\partial f}{\partial y}dy + \frac{\partial f}{\partial x}dx + \frac{\partial f}{\partial t}dt + \text{terms of order 2 and higher}$

- Assuming the time difference $dt$ between the two images is small, the terms of order 2 and higher are ignored

- Substituting the brightness constancy assumption $f(y+\Delta y, x+\Delta x, t+\Delta t) = f(y,x,t)$ yields $\frac{\partial f}{\partial y}\frac{\partial y}{\partial t} + \frac{\partial f}{\partial x} \frac{\partial x}{\partial t} + \frac{\partial f}{\partial t} = 0$ => the optical flow constraint (gradient constraint) equation

- $\nabla f(y,x) = (\frac{\partial f}{\partial y}, \frac{\partial f}{\partial x}) = (d_y, d_x) = (f(y+1, x) - f(y-1,x),\ f(y, x+1) - f(y, x-1))$

  > $\frac{\partial y}{\partial t} = v, \frac{\partial x}{\partial t} = u$ : the displacement in the y and x directions during $dt$; these correspond to the motion vector

 

- Can we use this equation to recover image motion (u, v) at each pixel? $\nabla I \cdot [u\ v]^T + I_t = 0$

- How many equations and unknowns per pixel?

  > One equation (this is a scalar equation!), two unknowns (u, v)

 

The aperture problem

Actual motion $\neq$ perceived motion

Overcoming the aperture problem

Optical flow estimation

- Lucas-Kanade

  > Assumes that the optical flow is the same over the window region $N(y,x)$ centered at pixel $(y,x)$

  > Every pixel $(y_i, x_i)$ in the region $N(y,x)$ (e.g. 3x3) has the same motion vector $v=(v,u)$ : $\frac{\partial f(y_i, x_i)}{\partial y}v + \frac{\partial f(y_i, x_i)}{\partial x}u + \frac{\partial f(y_i, x_i)}{\partial t} = 0$, $(y_i, x_i) \in N(y,x)$

- Two unknowns and $n$ equations -> least-squares method

- Expressed with a matrix $A$ : $Av^T = b$

  > $v^T = (A^TA)^{-1}A^T b$

- Weight pixels closer to the center more heavily (e.g. Gaussian)

  > $A^TWAv^T = A^TWb$ -> $v^T = (A^TWA)^{-1}A^TWb$

 

- Lucas-Kanade

  > A local algorithm that only looks at a neighborhood

  • Window size matters : a larger window can estimate larger motions, but the smoothing effect lowers the accuracy of the motion vectors
  • Image pyramids are used as a remedy

  > Zero vectors arise inside objects with little intensity variation
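A per-pixel sketch of the weighted least-squares solve above, reusing the derivatives from the earlier snippet; the window half-size and the Gaussian weighting are the assumptions flagged in the notes:

```python
import numpy as np

def lucas_kanade_at(Ix, Iy, It, y, x, half=1, sigma=1.0):
    """Solve (A^T W A) v = A^T W b for the flow (u, v) in one small window."""
    ys, xs = np.mgrid[y - half:y + half + 1, x - half:x + half + 1]
    A = np.stack([Ix[ys, xs].ravel(), Iy[ys, xs].ravel()], axis=1)  # n x 2
    b = -It[ys, xs].ravel()
    # Gaussian weights: pixels near the window center count more
    w = np.exp(-((ys - y) ** 2 + (xs - x) ** 2) / (2 * sigma ** 2)).ravel()
    AW = A * w[:, None]                       # this is W A with W = diag(w)
    # fails where A^T W A is singular -- exactly the flat / single-gradient
    # regions where the aperture problem leaves the flow undetermined
    u, v = np.linalg.solve(AW.T @ A, AW.T @ b)
    return u, v
```

For the pyramid remedy mentioned above, OpenCV's `cv2.calcOpticalFlowPyrLK` provides a pyramidal Lucas-Kanade tracker.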

 

 

 
