VideoRec'07: International workshop on
Video Processing and Recognition
May 28-30, 2007, Montreal, Canada
Marriott Chateau Champlain
www.computer-vision.org/VideoRec07

Workshop Abstracts

The VideoRec'07 workshop will be held as a Special Session of the CRV'07 conference
on the second day of the conference, Tuesday, 13:00 - 18:30

Oral presentations
Tuesday 15:30 - 18:30 (Maisonneuve - 36th Floor)

15:30
Vision-Based Motorcycle Detection and Tracking System with Occlusion Segmentation
Chung-Cheng Chiu, Min-Yu Ku, Hung-Tsung Chen
15:50
Real-time Commercial Recognition Using Color Moments and Hashing
Abhishek Shivadas, John M. Gauch
16:10
Constructing Face Image Logs that are Both Complete and Concise
Adam Fourney and Robert Laganière

Abstract: This paper describes a construct that we call a face image log. Face image logs are collections of time stamped images representing faces detected in surveillance videos. The techniques demonstrated in this paper strive to construct face image logs that are complete and concise in the sense that the logs contain only the best images available for each individual observed. We begin by describing how to assess and compare the quality of face images. We then illustrate a robust method for selecting high quality images. This selection process takes into consideration the limitations inherent in existing face detection and person tracking techniques. Experimental results demonstrate that face logs constructed in this manner generally contain fewer than 5% of all detected faces, yet these faces are of high quality, and they represent all individuals detected in the video sequence.
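As an illustration of the selection idea only, the sketch below keeps the highest-scoring face images for each tracked individual. It assumes each detection already carries a track identifier and a scalar quality score; the function name, the tuple layout, and the `per_person` parameter are hypothetical and are not taken from the paper, whose quality measure and selection method are not reproduced here.

```python
import heapq
from collections import defaultdict


def build_face_log(detections, per_person=1):
    """Keep the `per_person` highest-quality detections for each track id.

    `detections` is an iterable of (track_id, timestamp, quality) tuples.
    The quality score is assumed to be computed elsewhere (hypothetical).
    """
    by_track = defaultdict(list)
    for track_id, timestamp, quality in detections:
        by_track[track_id].append((quality, timestamp))

    log = {}
    for track_id, faces in by_track.items():
        # nlargest compares (quality, timestamp) tuples, so quality decides first.
        best = heapq.nlargest(per_person, faces)
        # Store entries as (timestamp, quality), ordered by time.
        log[track_id] = sorted((t, q) for q, t in best)
    return log
```

Every tracked individual keeps at least one entry, so the log stays complete, while `per_person` bounds its size.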
16:30
Real-time eye blink detection with GPU-based SIFT tracking
Marc Lalonde, David Byrns, Langis Gagnon, Denis Laurendeau

Abstract: This paper reports on the implementation of a GPU-based, real-time eye blink detector on very low contrast images acquired under near-infrared illumination. This detector is part of a multi-sensor data acquisition and analysis system for driver performance assessment and training.
Eye blinks are detected inside regions of interest that are aligned with the subject’s eyes at initialization. Alignment is maintained through time by tracking SIFT feature points that are used to estimate the affine transformation between the initial face pose and the pose in subsequent frames. The GPU implementation of the SIFT feature point extraction algorithm ensures real-time processing. An eye blink detection rate of 97% is obtained on a video dataset of 33,000 frames showing 237 blinks from 22 subjects.
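The affine alignment step can be illustrated with a plain least-squares fit over matched point pairs. This is a minimal CPU sketch, assuming the feature matches are already available as coordinate pairs; it is not the GPU implementation described in the abstract, and all function names are hypothetical.

```python
def solve3(a, b):
    """Solve a 3x3 linear system by Gauss-Jordan elimination with partial pivoting."""
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(3):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [x - f * y for x, y in zip(m[r], m[col])]
    return [m[i][3] / m[i][i] for i in range(3)]


def fit_affine(src, dst):
    """Least-squares affine map (x, y) -> (a*x + b*y + c, d*x + e*y + f).

    Needs at least three non-collinear point pairs; degenerate input is not handled.
    """
    # Normal equations (P^T P) w = P^T q, where each row of P is (x, y, 1).
    ptp = [[0.0] * 3 for _ in range(3)]
    ptu = [0.0] * 3
    ptv = [0.0] * 3
    for (x, y), (u, v) in zip(src, dst):
        row = (x, y, 1.0)
        for i in range(3):
            for j in range(3):
                ptp[i][j] += row[i] * row[j]
            ptu[i] += row[i] * u
            ptv[i] += row[i] * v
    return solve3(ptp, ptu), solve3(ptp, ptv)


def apply_affine(params, point):
    """Map a point with the parameters returned by fit_affine."""
    (a, b, c), (d, e, f) = params
    x, y = point
    return (a * x + b * y + c, d * x + e * y + f)
```

Applying the fitted map to the corners of the initial eye regions would give their location in a later frame. A practical tracker would also need to reject mismatched points, which this sketch omits.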

16:50
A Robust Video Foreground Segmentation by Using Generalized Gaussian Mixture Modelling
Mohand Saïd Allili, Nizar Bouguila and Djemel Ziou
17:10
A Framework for 3D Hand Tracking and Gesture Recognition
Ayman El-Sawah, Chris Joslin, Nicolas D. Georganas, Emil M. Petriu

Abstract: In this paper we present a framework for 3D hand tracking and dynamic gesture recognition using a single camera. Hand tracking is performed in a two-step process: we first generate 3D hand posture hypotheses using geometric and kinematic inverse transformations, and then validate the hypotheses by projecting the postures onto the image plane and comparing the projected model with the ground truth using a probabilistic observation model. Dynamic gesture recognition is performed using a Dynamic Bayesian Network model. The framework utilizes elements of soft computing to resolve the ambiguity inherent in vision-based tracking: the hand tracking module produces a fuzzy hand posture output, and the gesture recognition module feeds back potential posture hypotheses.
17:30
Adaptive Appearance Model for Object Contour Tracking in Videos
Mohand Saïd Allili and Djemel Ziou
17:50
Automatic Annotation of Humans in Surveillance Video
T.B. Moeslund, D.M. Hansen, P.Y. Duizer

Abstract: In this paper we present a system for automatic annotation of humans passing a surveillance camera. Each human has 4 associated annotations: the primary color of the clothing, the height, and the focus of attention. The annotation occurs after robust background subtraction based on a Codebook representation. The primary colors of the clothing are estimated by grouping similar pixels according to a body model. The height is estimated based on a 3D mapping using the head and feet. Lastly, the focus of attention is defined as the overall direction of the head, which is estimated using changes in intensity at four different positions. Results show successful detection and hence successful annotation for most test sequences.

[Video] [Head direction dataset]
18:10
Registration of IR and Video Sequences Based on Frame Difference
Zheng Liu and Robert Laganière

Abstract: Multi-modal imaging sensors have been employed in advanced surveillance systems in recent years. The performance of surveillance systems can be enhanced by using information beyond the visible spectrum, for example, infrared imaging. To ensure correctness of low- or high-level processing, multi-modal imagers must be fully calibrated or registered. In this paper, an algorithm is proposed to register the video sequences acquired by an infrared and an electro-optical (CCD) camera. The registration method is based on the silhouette extracted by differencing adjacent frames. This difference is found by an image structural similarity measurement. Initial registration is implemented by tracing the top head points in consecutive frames. Finally, an optimization procedure to maximize mutual information is employed to refine the registration results.
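To make the final refinement step concrete, the sketch below computes mutual information from a joint gray-level histogram and picks the integer horizontal shift that maximizes it. It assumes 8-bit images stored as lists of rows and a translation-only search; the paper's actual transformation model and optimizer are not specified in the abstract, and the function names are hypothetical.

```python
import math
from collections import Counter


def mutual_information(img_a, img_b, levels=256, bins=16):
    """Mutual information (in bits) between two equally sized grayscale images."""
    step = levels // bins
    pairs = [
        (pa // step, pb // step)
        for row_a, row_b in zip(img_a, img_b)
        for pa, pb in zip(row_a, row_b)
    ]
    n = len(pairs)
    joint = Counter(pairs)
    marg_a = Counter(a for a, _ in pairs)
    marg_b = Counter(b for _, b in pairs)
    mi = 0.0
    for (a, b), count in joint.items():
        # p_ab * log2(p_ab / (p_a * p_b)), written with raw counts.
        mi += (count / n) * math.log2(count * n / (marg_a[a] * marg_b[b]))
    return mi


def best_horizontal_shift(fixed, moving, max_shift=3, **kw):
    """Integer dx maximizing MI between fixed[y][x] and moving[y][x - dx].

    max_shift is assumed to be smaller than the image width.
    """
    width = len(fixed[0])
    best = None
    for dx in range(-max_shift, max_shift + 1):
        lo, hi = max(0, dx), min(width, width + dx)
        a = [row[lo:hi] for row in fixed]            # overlapping columns of fixed
        b = [row[lo - dx:hi - dx] for row in moving]  # matching columns of moving
        score = mutual_information(a, b, **kw)
        if best is None or score > best[1]:
            best = (dx, score)
    return best[0]
```

Mutual information suits infrared and visible pairs because it rewards statistical dependence between intensities rather than equality. With small images, a coarse `bins` value keeps the histogram estimate from being dominated by sampling noise.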

Posters & Demos
Held together with CRV'07 Poster Session 2
Tuesday 13:00 - 15:00 (Maisonneuve - 36th Floor)

Detecting, Tracking and Classifying Animals in Underwater Observatory Video
Duane R. Edgington, Danelle E. Cline, Jerome Mariette, Ishbel Kerkez

Abstract: For oceanographic research, remotely operated underwater vehicles (ROVs) and underwater observatories routinely record several hours of video material every day. Manual processing of such large amounts of video has become a major bottleneck for scientific research based on this data. We have developed an automated system that detects, tracks, and classifies objects that are of potential interest for human video annotators. By pre-selecting salient targets for track initiation using a selective attention algorithm, we reduce the complexity of multi-target tracking. Then, if an object is tracked for several frames, a visual event is created and passed to a Bayesian classifier utilizing a Gaussian mixture model to determine the object class of the detected event.

[Paper] [Poster]
GET-based map icon Identification for Interaction with Map and Kiosks
Huiqiong Chen, Derek Reilly

Abstract: This paper presents a GET (Generic Edge Token) based approach to detecting and recognizing objects by their shapes, and applies it to improve our ongoing work that considers ways of interacting with paper maps using a handheld. In our work, the GET-based technique aims to help users better locate points of interest on a map by recognizing map icons in images/videos captured by a handheld camera. In this method, video/image content can be described using a set of perceptual shape features called GETs. Perceptible objects can be extracted from a GET map and then compared against pre-defined icon models based on GET shape features for recognition. This method provides a simple, efficient way to locate points of interest on the map and, when combined with an RFID (sensor-based) technique, to determine the handheld's location and orientation. The tests show that the GET-based object identification can be executed in reasonable time for a real-time interaction system. Moreover, the detection and recognition are robust under different lighting conditions, camera focus, camera rotation, and distance from the map.

[Paper] [Poster]
Face Recognition in Video Using Modular ARTMAP Neural Networks
M. Barry and E. Granger

Abstract: In video-based face recognition applications, the What-and-Where Fusion Neural Network (WWFNN) has been shown to reduce the generalization error by accumulating a classifier’s predictions over time, according to each individual in the environment. In this paper, three ARTMAP variants (fuzzy ARTMAP, ART-EMAP (Stage 1) and ARTMAP-IC) are compared for the classification of faces detected in the WWFNN. ART-EMAP (Stage 1) and ARTMAP-IC expand on the well-known fuzzy ARTMAP by using distributed activation of category neurons, and by biasing distributed predictions according to the number of times these neurons are activated by training set patterns. The average performance of the WWFNNs with each ARTMAP network is compared to the WWFNN with a reference k-NN classifier in terms of generalization error, convergence time and compression, using a data set of real-world video sequences. Simulation results indicate that when ARTMAP-IC is used inside the WWFNN, it can achieve a generalization error that is significantly higher (about 20% on average) than if fuzzy ARTMAP or ART-EMAP is used. Indeed, ARTMAP-IC is less discriminant than the two other ARTMAP networks in cases with complex decision boundaries, when the training data is limited and unbalanced, as found in complex video data. However, ARTMAP-IC can outperform the others when classes are designed with a larger number of training patterns.

[Paper] [Poster]
A Simple Inter- and Intrascale Statistical Model for Video Denoising in 3-D Complex Wavelet Domain Using a Local Laplace Distribution
Hossein Rabbani, Mansur Vafadust, Saeed Gazor

Abstract: This paper presents a new video denoising algorithm based on the modeling of wavelet coefficients in each subband with a Laplacian probability density function (pdf) with local variance. Since the Laplacian pdf is leptokurtic, it is able to model the sparsity of wavelet coefficients. We estimate the local variance of this pdf employing adjacent coefficients at the same scale and the parent scale. This local variance models interscale dependency between adjacent scales and intrascale dependency between spatially adjacent coefficients. Within this framework, we design a maximum a posteriori (MAP) estimator for video denoising, which relies on the proposed local pdf. Because separable 3-D transforms, such as ordinary 3-D wavelet transforms, have artifacts that degrade denoising performance, we implement our algorithm in the 3-D complex wavelet transform domain. This nonseparable and oriented transform gives a motion-based multiscale decomposition for video that isolates in its subbands motion along different directions. In addition, we use our denoising algorithm in the 2-D complex wavelet transform domain, where the 2-D transform is applied to each frame individually. Despite the simplicity of our method in implementation, it achieves better performance than several denoising methods both visually and in terms of peak signal-to-noise ratio (PSNR).
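For a single coefficient, a MAP estimator under a zero-mean Laplacian prior and additive Gaussian noise reduces to the standard soft-threshold rule. The sketch below shows that textbook form together with a simple local variance estimate. It assumes the noise standard deviation is known and that the neighbourhood (same-scale and parent-scale coefficients) is supplied by the caller; the function names are hypothetical, and the paper's exact estimator may differ.

```python
import math


def local_signal_std(neighbours, noise_std):
    """Estimate the local signal std from noisy neighbouring coefficients.

    The mean of squares estimates signal variance plus noise variance,
    so the noise variance is subtracted and the result clipped at zero.
    """
    mean_sq = sum(c * c for c in neighbours) / len(neighbours)
    return math.sqrt(max(mean_sq - noise_std ** 2, 0.0))


def map_shrink(y, signal_std, noise_std):
    """MAP estimate of w from y = w + n (Laplacian prior on w, Gaussian n)."""
    if signal_std == 0.0:
        return 0.0  # no signal energy locally: treat the coefficient as pure noise
    threshold = math.sqrt(2.0) * noise_std ** 2 / signal_std
    return math.copysign(max(abs(y) - threshold, 0.0), y)
```

Because the threshold is inversely proportional to the local signal standard deviation, coefficients in active neighbourhoods are shrunk only slightly, while those in flat regions are pushed to zero.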

[Paper] [Video1, Video2]
Automatic extraction of semantic object in image using local brightness variances
Chee Sun Won

Abstract: This paper deals with the problem of segmenting a semantic object in an image. A fully automatic solution to this problem is not possible; human intervention is needed to outline the rough boundary of the semantic object to be segmented. Our goal is to make the object extraction automatic after the first semiautomatic segmentation. To achieve this, we manipulate the contrast between the object and the background so that any contrast-based object segmentation method can extract the object automatically.

[Paper]
Zoom on the evidence with ACE Surveillance (+demo)
Dmitry O. Gorodnichy, Mohammad A. Ali, Elan Dubrofsky, Kris Woodbeck

Abstract: Despite the population's growing awareness of the need to use surveillance systems for better security in private and business settings, such systems still have not become commonplace. The main reason for this is the amount of time and resources an average user has to dedicate to collecting video data and then digging through it in search of evidence when using traditional DVR-based surveillance systems. Here we present ACE-Surveillance - an automated surveillance technology based on real-time Annotation of Critical Evidence that provides an efficient and low-cost solution to the problem. We describe the main features of this technology as related to its two components: ACE-Capture and ACE-Browser. The first component deals with detection and archival of annotated evidence, which is normally performed on a client's desktop computer. The latter deals with browsing and displaying archived video evidence and can be performed either locally on the client's computer or remotely via a dedicated server. A new Zoom-on-the-Evidence browsing technique featured by ACE Surveillance is introduced. Demonstrations of running the technology on several real-life long-term monitoring assignments, including its deployment by the NRC commissionaires, are shown.

[Paper] [Poster] [Link]
Working with computer hands-free using Nouse Perceptual Vision Interface (+demo)
Dmitry O. Gorodnichy, Elan Dubrofsky, Mohammad A. Ali

Abstract: Normal work with a computer implies being able to perform the following three computer control tasks: 1) pointing, 2) clicking, and 3) typing. Many attempts have been made to make it possible to perform these tasks hands-free using a video image of the user as input. Nevertheless, as reported by assistive technology practitioners, no real vision-based hands-free computer control solution that can be used by disabled users has been produced as of yet. Here we present the Nouse Perceptual Vision Interface (Nouse-PVI), which we hope will finally offer such a solution. Evolved from the original Nouse "Nose as Mouse" concept and currently under testing with SCO Health Service, Nouse-PVI has several unique features that make it preferable to other hands-free computer input alternatives. These include i) using the nose tip, which has been confirmed to be very convenient for disabled users, for whom the nose literally becomes a new "finger" that can be used to control the computer; ii) a feedback-providing mechanism called Perceptual Cursor (Nousor) that creates an invisible link between the computer and the user and which is very important for control, as it allows the user to adjust his/her head motion so that the computer can better interpret it; and iii) a number of design solutions specifically tailored for vision-based data entry using small-range head motion, such as motion-based code entry (NouseCode and NouseType), a motion-based virtual keyboard (NouseBoard and NousePad) and a word-by-word letter drawing tool (NouseChalk). While demonstrating these innovative tools, we also address the issue of the user's ability and readiness to work with a computer in a brand new way, i.e. hands-free. The problem is that a user has to understand that it is not entirely the responsibility of the computer to understand what one wants; it is also the responsibility of the user. Just as a conventional computer user cannot move the cursor on the screen without first putting the hand on the mouse, so a perceptual interface user cannot work with a computer until s/he "connects" to it. That is, the computer and user must work as a team for the best control results to be achieved. This presentation is therefore designed to serve both as a guide to those developing vision-based input devices and as a tutorial for those who will be using them.

[Paper] [Poster] [Link]

Last updated: 20.V.2007.
