\begin{table*}[ht]
\begin{center}
\setlength\tabcolsep{1pt}
\begin{tabular}{|c|c|c|c|c|c|c|c|c|c|c|}
\hline
Dataset & Metrics & OSVOS~\cite{OSVOS} & $\text{OSVOS}^\text{S}$~\cite{OSVOS-S} & BVS~\cite{BVS} & PML~\cite{PML} & Lucid~\cite{LucidTracker} & CTN~\cite{CTN} & MaskRNN~\cite{MaskRNN} & OSMN~\cite{OSMN} & DyeNet~\cite{DyeNet} \\
\hline
\multirow{2}{*}{DAVIS2016} & $\J$ Mean & 79.8 & 85.6 & 60.0 & 86.1 & \textbf{91.2} & 73.5 & 80.4 & 74.0 & 86.2 \\
\cline{2-11}
 & $\F$ Mean & 80.6 & 86.4 & 58.8 & -- & \textbf{94.2} & 69.3 & 82.3 & -- & -- \\
\hline
\multirow{2}{*}{DAVIS2017} & $\J$ Mean & 55.1 & -- & -- & -- & 65.1 & -- & 60.5 & 52.5 & \textbf{67.3} \\
\cline{2-11}
 & $\F$ Mean & 62.1 & -- & -- & -- & 70.6 & -- & -- & 57.1 & \textbf{71.0} \\
\hline
\end{tabular}
\end{center}
\caption{Results of semi-supervised methods on the DAVIS datasets.}
\label{table:semisupervised_all_dataset}
\end{table*}
\begin{table*}[ht]
\begin{center}
\setlength\tabcolsep{3pt}
\begin{tabular}{|c|c|c|c|c|c|c|}
\hline
Dataset & Metrics & FSEG~\cite{Jain2017FusionSeg} & LVO~\cite{Tokmakov2017Learning} & LMP~\cite{Tokmakov2017Learning} & POS~\cite{Koh2017Primary} & IET~\cite{li2018instance} \\
\hline
\multirow{2}{*}{DAVIS2016} & $\J$ Mean & 71.5 & 75.9 & 69.7 & 76.3 & 78.5 \\
\cline{2-7}
 & $\F$ Mean & -- & 72.1 & 66.3 & 71.1 & 75.5 \\
\hline
\end{tabular}
\end{center}
\caption{Results of unsupervised methods on the DAVIS datasets.}
\label{table:unsupervised_all_dataset}
\end{table*}
\section{Discussion}
\subsection{Evaluation Metrics}
In a supervised evaluation framework, given a ground-truth mask $G$ on a particular frame and an output segmentation $M$,
any evaluation measure ultimately has to answer the question of how well $M$ fits $G$. As justified in \cite{pont2016supervised},
for images one can use two complementary points of view: region-based and contour-based measures. As video extends the
dimensionality of still images to time, the temporal stability of the results must also be considered.
\subsubsection{Accuracy}
\paragraph{Region Similarity $\J$}
To measure the region-based segmentation similarity, i.e.\ the number of mislabeled pixels,
one employs the Jaccard index $\J$, defined as the intersection-over-union of the estimated segmentation and the ground-truth mask.
The Jaccard index has been widely adopted since its first appearance in PASCAL VOC2008~\cite{martin2004learning},
as it provides intuitive, scale-invariant information on the number of mislabeled pixels. Given an output segmentation $M$ and
the corresponding ground-truth mask $G$, it is defined as $\J = \frac{|M \cap G|}{|M \cup G|}$.
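For illustration only, a minimal NumPy sketch of $\J$ on binary masks could look as follows; this is not the official benchmark code, and the empty-union convention is an assumption:
\begin{verbatim}
import numpy as np

def jaccard(mask, gt):
    # Region similarity J: intersection-over-union of the
    # estimated segmentation and the ground-truth mask.
    mask, gt = mask.astype(bool), gt.astype(bool)
    union = np.logical_or(mask, gt).sum()
    if union == 0:
        return 1.0  # convention: two empty masks match perfectly
    return np.logical_and(mask, gt).sum() / union
\end{verbatim}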
\paragraph{Contour Accuracy $\F$}
From a contour-based perspective, one can interpret $M$ as a set of closed contours $c(M)$
delimiting the spatial extent of the mask. One can therefore
compute the contour-based precision $P_c$ and recall
$R_c$ between the contour points of $c(M)$ and $c(G)$, via a bipartite graph matching in order to be robust to small inaccuracies,
as proposed in \cite{martin2004learning}.
The F-measure $\F$ is a good trade-off between the two, defined as $\F = \frac{2 P_c R_c}{P_c + R_c}$.
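In practice, the bipartite matching is often approximated by morphological dilation: a boundary pixel counts as matched if it lies within a small tolerance of the other boundary. The hypothetical sketch below follows that approximation; the tolerance radius \texttt{tol} is an assumed parameter:
\begin{verbatim}
import numpy as np
from scipy import ndimage

def boundary(mask):
    # One-pixel-wide boundary of a binary mask via erosion.
    mask = mask.astype(bool)
    return mask & ~ndimage.binary_erosion(mask)

def f_measure(mask, gt, tol=2):
    # Contour accuracy F: harmonic mean of boundary precision
    # P_c and recall R_c, matched within `tol` pixels.
    bm, bg = boundary(mask), boundary(gt)
    s = ndimage.generate_binary_structure(2, 2)  # 8-connected
    gt_zone = ndimage.binary_dilation(bg, s, iterations=tol)
    m_zone = ndimage.binary_dilation(bm, s, iterations=tol)
    p = (bm & gt_zone).sum() / max(bm.sum(), 1)  # precision P_c
    r = (bg & m_zone).sum() / max(bg.sum(), 1)   # recall R_c
    return 0.0 if p + r == 0 else 2 * p * r / (p + r)
\end{verbatim}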
\paragraph{Temporal Stability $\T$}
Intuitively, $\J$ measures how well the pixels of the two masks match, while $\F$ measures the
accuracy of the contours. However, the temporal stability of the results is also a relevant aspect in video object segmentation, since the evolution of object shapes is an important cue for
recognition, and jittery, unstable boundaries are unacceptable in video editing applications. The measure $\T$ quantifies this frame-to-frame instability of the estimated masks.
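The official $\T$ measure matches contour polygons across frames using shape context descriptors. As a crude, hypothetical proxy that only flags frame-to-frame jitter (and, unlike the official measure, also penalizes genuine object motion), one could average the drop in $\J$ between consecutive masks, reusing the \texttt{jaccard} helper sketched above:
\begin{verbatim}
import numpy as np

def temporal_jitter(masks):
    # Crude proxy for temporal (in)stability: mean (1 - J)
    # between consecutive masks; lower means more stable.
    drops = [1.0 - jaccard(a, b)
             for a, b in zip(masks, masks[1:])]
    return float(np.mean(drops)) if drops else 0.0
\end{verbatim}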
\subsection{Results}
\subsubsection{Semi-Supervised Setting}
Table~\ref{table:semisupervised_all_dataset} presents an evaluation on DAVIS-2016 and DAVIS-2017 using the metrics described in Sec.~5.1. Two measures are reported: region similarity in terms of intersection-over-union ($\J$) and contour accuracy ($\F$); higher is better for both.
LucidTracker~\cite{LucidTracker} is the state-of-the-art method on DAVIS-2016; it benefits from ``Lucid Data Dreaming'' augmentation, which improves the robustness of temporal propagation. On the other hand, DyeNet achieves state-of-the-art results on DAVIS-2017, benefiting from its combination of temporal propagation and re-identification functionalities.
\subsubsection{Unsupervised Setting}
As shown in Table~\ref{table:unsupervised_all_dataset}, we can compare the performance of the different algorithms. In general, methods based on deep learning outperform traditional
methods. Combining spatio-temporal information greatly helps performance in \cite{Tokmakov2017Learning}: optical flow normally captures temporal information, but it is still not powerful enough on its own, and the visual memory module applied on the temporal side brings an improvement of $4.4\%$ in $\J$. Traditional methods such as
\cite{Koh2017Primary,li2018instance} can also achieve impressive performance on the DAVIS dataset, because they design specific features to model sequence data. Li~\emph{et al.}~\cite{li2018instance} use an instance embedding method
that forces the algorithm to learn instance concepts, which yields excellent performance in the instance segmentation setting.