SegTrack Survey

1. Detect to Track and Track to Detect_ICCV2017_Christoph Feichtenhofer et al.
(code available) project page: http://www.robots.ox.ac.uk/~vgg/research/detect-track/

1> Summary: (1) The input to the network consists of multiple frames, which are first passed through a ConvNet trunk to produce convolutional features shared between the detection and tracking tasks. (2) A convolutional cross-correlation between the feature responses of adjacent frames estimates the local displacement at different feature scales. (3) On top of these features, an RoI-pooling layer classifies and regresses box proposals, while an RoI-tracking layer regresses box transformations (translation, scale, aspect ratio) across frames.
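
The cross-correlation step in (2) can be sketched in NumPy as below — a minimal, unoptimized illustration (the neighbourhood radius `d` and the per-channel normalization are assumptions, not the paper's exact settings):

```python
import numpy as np

def local_correlation(feat_t, feat_tp1, d=2):
    """Correlate each position of feat_t with a (2d+1)x(2d+1)
    neighbourhood in feat_tp1 (both shaped C x H x W).
    Returns a (2d+1)^2 x H x W correlation volume."""
    C, H, W = feat_t.shape
    padded = np.pad(feat_tp1, ((0, 0), (d, d), (d, d)))
    out = np.zeros(((2 * d + 1) ** 2, H, W))
    k = 0
    for dy in range(2 * d + 1):
        for dx in range(2 * d + 1):
            shifted = padded[:, dy:dy + H, dx:dx + W]
            # channel-wise dot product at every spatial position
            out[k] = (feat_t * shifted).sum(axis=0) / C
            k += 1
    return out
```

Each output channel corresponds to one candidate displacement; the displacement with the highest correlation at a position is the estimated local motion fed to the RoI-tracking regressor.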

2> Network structure:
3> Some details:
(1) Detection scores are reweighted across each detection tube before the tubes are used to build the final tracking results.
(2) Non-maximum suppression with bounding-box voting is performed before the tracklet linking step.
(3) The CNN backbone is R-FCN.
(4) Detection performance: 79.8% (vs. 76.2% for the ILSVRC 2016 winner).
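
Detail (2), NMS with bounding-box voting, can be sketched as a single-class toy version (the IoU threshold and the score-weighted averaging are common choices assumed here, not necessarily the paper's exact settings):

```python
import numpy as np

def iou(a, b):
    """IoU of two boxes given as [x1, y1, x2, y2]."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def nms_with_voting(boxes, scores, thresh=0.5):
    """Greedy NMS, but each kept box is replaced by the score-weighted
    average of every box it suppressed (bounding-box voting)."""
    order = np.argsort(scores)[::-1]
    suppressed = np.zeros(len(boxes), dtype=bool)
    kept = []
    for i in order:
        if suppressed[i]:
            continue
        group = [j for j in order
                 if not suppressed[j] and iou(boxes[i], boxes[j]) >= thresh]
        for j in group:
            suppressed[j] = True
        w = scores[group]
        voted = (boxes[group] * w[:, None]).sum(axis=0) / w.sum()
        kept.append((voted, scores[i]))
    return kept
```

The voting step averages out localization noise among near-duplicate detections instead of simply discarding them.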

4> Disadvantages:
(1) The score-reweighting step is error-prone under occlusion (e.g., underwater scenes); moreover, most test classes are moving animals (curiously, no person class).
(2) No tracking metrics are reported, and the test objects are relatively well separated from each other.

======================================================================

2. Weakly Supervised Semantic Segmentation using Web-Crawled Videos_CVPR2017_Seunghoon Hong et al.
(no code) paper link: https://arxiv.org/abs/1701.00352

1> Summary:
(1) Train a weakly supervised network on image-level labeled images and obtain per-class attention maps (encoder?).
(2) Retrieve videos from YouTube (discarding unrelated sequences based on attention scores), then combine attention, motion and color cues, spatially and temporally at the superpixel level, in a CRF to segment the videos (decoder?).
(3) Use the segmented videos as supervision to train semantic segmentation on still images.
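
The video filtering in step (2) can be sketched as a toy relevance score — the top-k peak-attention scoring and the threshold below are illustrative assumptions, not the paper's exact criterion:

```python
import numpy as np

def video_relevance(frames_attention, top_k=3):
    """frames_attention: (T, H, W) attention maps of the query class.
    Score each frame by its peak activation, and the video by the mean
    of its top-k frame scores, so a few confident frames suffice."""
    frame_scores = frames_attention.reshape(len(frames_attention), -1).max(axis=1)
    top = np.sort(frame_scores)[-top_k:]
    return top.mean()

def filter_videos(videos, threshold=0.5, top_k=3):
    """Keep the indices of videos relevant to the query class."""
    return [i for i, v in enumerate(videos)
            if video_relevance(v, top_k) >= threshold]
```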

2> Network structure:

3> Results:
State-of-the-art weakly supervised segmentation: mIoU 58.1% on the VOC 2012 validation set and 58.7% on the VOC 2012 test set (trained with web-crawled YouTube videos).

4> Disadvantage:
(1) Not instance-level segmentation.

=====================================================================
3. Multiple Object Recognition with Visual Attention_ICLR 2015_ Jimmy Lei Ba.
(no code) paper link: https://arxiv.org/abs/1412.7755

1> Summary:
An attention-based model for recognizing multiple objects in images. The model is a deep recurrent neural network trained with reinforcement learning to attend to the most relevant regions of the input image. It learns to both localize and recognize multiple objects despite being given only class labels during training.
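
The glimpse extraction that drives the recurrent attention loop can be sketched as a single-scale crop (the paper's glimpse network additionally combines the patch with its location; the zero-padding at borders and the single scale are simplifying assumptions):

```python
import numpy as np

def take_glimpse(image, center, size):
    """Crop a size x size patch (size odd) around `center` (row, col),
    zero-padding at the borders, as input to the glimpse network."""
    half = size // 2
    padded = np.pad(image, half)           # zero-pad all four sides
    r, c = center[0] + half, center[1] + half
    return padded[r - half:r + half + 1, c - half:c + half + 1]
```

At each step the emission network outputs the next `center`, so the model sequentially visits the image regions most useful for classification.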

2> Network structure:


3> Results & Advantages:
(1) Beats a CNN baseline at transcribing house-number sequences from Google Street View images.
(2) Fewer parameters and less computation.
(3) Well-designed architecture with a 'complex' loss, built from four modules:
      Glimpse network
      Recurrent network
      Emission network
      Classification network

4> Disadvantages:
    Does not handle occlusion.

=====================================================================
4. Convolutional Gated Recurrent Networks for Video Segmentation_Mennatullah Siam et al_arXiv2016
(no code) https://arxiv.org/abs/1611.05435

1> Summary:
The method relies on a fully convolutional network embedded into a gated recurrent architecture. The design receives a sequence of consecutive video frames and outputs the segmentation of the last frame. Convolutional gated recurrent units are used in the recurrent part to preserve spatial connectivity in the image.
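
A single-channel ConvGRU update can be sketched as follows — the gates are computed with convolutions rather than dense products, so the hidden state keeps its spatial layout (3x3 kernels, no biases, one channel: all simplifying assumptions):

```python
import numpy as np

def conv2d(x, w):
    """'Same' 3x3 convolution of a single-channel map (sketch, stride 1)."""
    H, W = x.shape
    p = np.pad(x, 1)
    out = np.zeros((H, W))
    for i in range(3):
        for j in range(3):
            out += w[i, j] * p[i:i + H, j:j + W]
    return out

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def conv_gru_step(x, h, params):
    """One ConvGRU update on input frame features x and hidden state h."""
    wz_x, wz_h, wr_x, wr_h, wh_x, wh_h = params
    z = sigmoid(conv2d(x, wz_x) + conv2d(h, wz_h))        # update gate
    r = sigmoid(conv2d(x, wr_x) + conv2d(h, wr_h))        # reset gate
    h_tilde = np.tanh(conv2d(x, wh_x) + conv2d(r * h, wh_h))
    return (1 - z) * h + z * h_tilde
```

Feeding the FCN features of each frame through this step and reading the final hidden state yields the segmentation of the last frame, as described above.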

2> Network structure:

3> Results & Details:
(1) The CNN backbone is FCN-8s.
(2) Temporal information is incorporated through the recurrent unit.
(3) 3%~5% improvements on Cityscapes, SegTrack V2, DAVIS, etc.

4> Disadvantages:
 (1) Only binary and semantic segmentation.
 (2) Mainly handles moving objects (pedestrians, cars).
 (3) Shape information could be exploited in the same way motion information is.
======================================================================
5. Dynamic Routing Between Capsules_NIPS2017_Sara Sabour et al
(code available) paper link: https://arxiv.org/abs/1710.09829

1> Summary:
A capsule is a group of neurons whose activity vector represents the instantiation parameters of a specific type of entity, such as an object or an object part. The length of the activity vector represents the probability that the entity exists, and its orientation represents the instantiation parameters. Active capsules at one level make predictions, via transformation matrices, for the instantiation parameters of higher-level capsules. When multiple predictions agree, a higher-level capsule becomes active. A discriminatively trained, multi-layer capsule system achieves state-of-the-art performance on MNIST and is considerably better than a convolutional net at recognizing highly overlapping digits. These results rely on an iterative routing-by-agreement mechanism: a lower-level capsule prefers to send its output to higher-level capsules whose activity vectors have a big scalar product with the prediction coming from the lower-level capsule.
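
The routing-by-agreement procedure can be sketched in NumPy — shapes and the number of iterations follow common CapsNet implementations; this is an illustrative sketch, not the authors' code:

```python
import numpy as np

def squash(s, axis=-1, eps=1e-8):
    """Squashing non-linearity: output length in (0, 1) encodes existence
    probability while preserving the vector's orientation."""
    norm2 = (s ** 2).sum(axis=axis, keepdims=True)
    return (norm2 / (1 + norm2)) * s / np.sqrt(norm2 + eps)

def dynamic_routing(u_hat, iterations=3):
    """u_hat: predictions of shape (num_lower, num_upper, dim).
    Iteratively send lower-capsule outputs to the upper capsules whose
    current outputs agree with the predictions."""
    n_low, n_up, dim = u_hat.shape
    b = np.zeros((n_low, n_up))                               # routing logits
    for _ in range(iterations):
        c = np.exp(b) / np.exp(b).sum(axis=1, keepdims=True)  # softmax over upper capsules
        s = (c[..., None] * u_hat).sum(axis=0)                # weighted sum per upper capsule
        v = squash(s)                                         # (num_upper, dim)
        b += (u_hat * v[None]).sum(axis=-1)                   # agreement update
    return v
```

The scalar product in the last line is exactly the "agreement" described above: predictions aligned with a capsule's current output strengthen the route to it.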

2> Network structure:
3> Results: (heavy occlusion)

4> Disadvantage:
     Cannot deal with occlusion between two instances of the same digit.

=============================Other related papers ==========================
6. Online Multi-Object Tracking Using CNN-based Single Object Tracker with
Spatial-Temporal Attention Mechanism_ICCV2017_Qi Chu et al.
paper link: https://arxiv.org/abs/1708.02843

7.T-CNN: Tubelets with Convolutional Neural Networks for Object Detection from Videos
paper link: https://arxiv.org/abs/1604.02532

