Cedras, C. & M. Shah (1995), Motion-Based Recognition: A Survey, IVC,
13(2):129-155.
@Article{CedrShah95,
author = "C\'edras, Claudette and Shah, Mubarak Ali",
year = "1995",
month = mar,
title = "Motion Based Recognition: {A} Survey.",
journal = "Image and Vision Computing",
volume = "13",
number = "2",
pages = "129--155",
}
Author of the summary: Jim Davies and Brad Singletary, 2000,
jim@jimdavies.org, bas@cc.gatech.edu
Cite this paper for:
- Structure From Motion (SFM)
- Optical flow consists of the computation of the displacement of each
pixel between frames. This yields a vector map of flows in the image.
- Motion correspondence matches interesting image
features that can be tracked through time. This gives you an idea of
which image features vary together... which can be used to recognize
higher level behaviours.
- Trajectory is
a set of points recorded from motion perceived over space. It's the
coordinates of where something was all the way to where it is now.
- Motion events may be enough to
discern between high-level tasks. A motion event is simply something like
the presence of motion, the absence of motion, a change in direction, a
change in an nth-order derivative, or any subjective division of motion
space you can easily (hackishly) detect that gives you better recognition rates.
- Just getting the structure from
motion does not constitute recognition, since that structure must be
analyzed too to categorize it.
- motion-based recognition: motion is used to recognize directly, without having to
recreate the structure first.
- motion recognition: Recognition of a kind of motion. Distinct
from motion-based recognition.
- Region-based features: "features derived from the use of an extended
region or from a whole image"
- Problems with optical flow
- boundary oversmoothing: smoothing the flow across object boundaries
makes background pixels appear to move
- correspondence problem: How to unambiguously map a point in a frame to
a point in the next frame. Combinatorially explosive.
- proximal uniformity: most objects follow smooth paths, cover
small distances, and do so in a short amount of time.
- Simple trajectories are often not enough for recognition.
- In some domains like walking, the relative motion of objects (legs,
arms) is more important than their absolute motion.
- Common motion: movement relative to observer
- Trajectory Primal Sketch (TPS): Gould and Shah's representation of
significant changes in motion. Can distinguish translation, rotation,
projectile and cycloid.
- motion boundaries: smooth starts, smooth stops, pauses, impulse
starts, impulse stops
- Spatio-temporal cube: A cube of images stacked. (x, y, time) are the
axes.
- ST-curves: curves in a Spatio-temporal cube
- eigenvectors, eigenlips.
- Motion verb recognition: "the association of natural language verbs
with the motion performed by a moving object in a sequence of images."
- labelling: identifying a human's parts in a movie.
- generalized cylinders
- Adjunct relations: relations of cylinders to the higher level
cylinders in a hierarchy
- apars: ribbons or antiparallel lines. They enclose regions.
- aperture problem: without enough feature-data, motion within
your limited field of view can be explained in multiple ways.
- flexion: two body segments change relative position, decreasing the angle between them
- abduction/adduction: movement of body away/toward main body
- Scenario hierarchy: combined components (things) and assemblies
(combinations of things).
So this paper is a whole
vision course (and 3 times over again in its references) if you
please. Instead of addressing specific methods, we're going to take a
high level approach to the use of motion for computer vision.
OK, to understand motion-based recognition we've first got to better
understand the semantics of motion in general. In terms of the real
world, there are things, and spaces that things can occupy. Perceived
motion is the record of a stepwise displacement of a thing through
occupiable space. The size of the step can vary, the size of the size of
the step can vary, ad infinitum... Remember from physics that we have
position, velocity, acceleration, jerk, snap, crackle, and pop for
the zero through sixth order differences (cameras probably don't make
very good approximations of higher order differences), each just
being the derivative of the last.
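To make that last point concrete, here is a minimal numpy sketch with a
made-up position signal, using successive finite differences as stand-ins
for velocity, acceleration, and jerk:

    import numpy as np

    # Hypothetical 1d position samples, one per frame.
    position = np.array([0.0, 1.0, 4.0, 9.0, 16.0, 25.0])

    velocity = np.diff(position)          # first-order difference ~ velocity
    acceleration = np.diff(velocity)      # second-order ~ acceleration
    jerk = np.diff(acceleration)          # third-order ~ jerk
    # Each difference amplifies sensor noise, which is why camera data
    # rarely supports useful estimates of the higher-order terms.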
At any rate, we have some model we believe in that we use to describe
how things move in the world (usually Newtonian motion, but dynamics
are better). It is useful to note that each constraint we add limits
the type of motion that can occur in the system. This model is
inherently a feature space for later use. If we know how an object is
supposed to move we can limit the feature space considerably
(i.e. make it non-infinite). Also, we can use this model to generate
filters that help combat errors in our sensors.
We live in 3 dimensions. In terms of a camera sensor there is an n x m
(2-dimensional) grid of pixels that represents emitted/reflected light
from the environment incident on the CCD. Recording a 2d
representation of a 3d world is a dimensionality reduction. So it is
crucial to see that some motions in R3 may not be directly perceivable
by a single (stationary) camera head, i.e. one dimension roughly
appears as scale differences only.
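A tiny sketch of that reduction, assuming a pinhole projection with a
made-up focal length and points:

    def project(X, Y, Z, f=1.0):
        # Pinhole projection: a 3d point maps to 2d image coordinates.
        return (f * X / Z, f * Y / Z)

    print(project(1.0, 1.0, 2.0))   # (0.5, 0.5)
    print(project(2.0, 2.0, 4.0))   # (0.5, 0.5) -- twice as big, twice as
                                    # far, same image: depth mostly
                                    # survives as scale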
So one subgoal here is to use cameras to recover parameters of the
third dimension (and other parameters like where the camera is with
respect to the environment) we can't or can't easily get from just one
image. Another goal is to see what the hell is actually happening over
a time period... to build a high-level representation of motion we can
use to understand observed image sequences. Note that tracking something
implies recognition.
The paper spends lots of time describing prior work in psychology on
motion. Vision research that uses models of human behavior is, in
effect, applied psychophysics, so this makes sense. Humans use motion
information very well to notice things, measure relative distances,
measure absolute distances, and predict future motion. If humans can
do it, why can't machines? As it turns out, motion is a very powerful,
but sometimes computationally expensive feature space for vision
research. One interesting thing to note is that humans have very
specialized hardware for motion detection/recognition. For instance,
humans don't recognize motion very well if it is upside down.
So how do we see what an environment is shaped like given motion? One
decent technique, Structure From Motion (SFM), recovers important facts
about the structure of viewed 3d systems from point motions in a sequence
of frames. Think: how could I recover an OpenGL-style model of the world,
plus my projection/rotation/translation matrix, from just watching things
move about an environment? Well, first you'd want to have a concept of
internal camera parameters, followed by a nonlinear optimization of your
reckoning of objects tracked in the world.
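For flavor, here is a minimal sketch of one classical SFM idea
(Tomasi-Kanade style orthographic factorization), assuming points have
already been tracked across frames; this is not the survey's method,
just an illustration of recovering structure from tracked point motion:

    import numpy as np

    def factorize(tracks):
        # tracks: (F, P, 2) array -- P image points tracked over F frames.
        F, P, _ = tracks.shape
        W = np.concatenate([tracks[:, :, 0], tracks[:, :, 1]], axis=0)  # 2F x P
        W = W - W.mean(axis=1, keepdims=True)    # remove per-frame translation
        # Under orthographic projection the registered matrix W has rank 3:
        # W = M S, with M the camera (motion) rows and S the 3d structure.
        U, s, Vt = np.linalg.svd(W, full_matrices=False)
        M = U[:, :3] * np.sqrt(s[:3])            # 2F x 3 motion matrix
        S = np.sqrt(s[:3])[:, None] * Vt[:3]     # 3 x P structure, up to an
        return M, S                              # affine ambiguity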
So how do we see what happened in an environment given motion? 3d
structure is not enough in practice for robust and accurate
recognition. Other features are generally required. We can look at
calculating global information and discrete information as features
of the video stream.
One type of global information detailed in the paper is optical flow.
Optical flow consists of the computation of the displacement of each
pixel between frames. This yields a vector map of flows in the image.
Individual or regional flows may be tracked/analyzed. Another type is
motion correspondence. Motion correspondence matches interesting image
features that can be tracked through time. This gives you an idea of
which image features vary together... which can be used to recognize
higher level behaviours.
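As a concrete sketch of the optical flow idea (assuming OpenCV is
available; the frame filenames are made up), dense flow between two
frames might be computed like this:

    import cv2
    import numpy as np

    prev = cv2.imread("frame_000.png", cv2.IMREAD_GRAYSCALE)  # hypothetical frames
    curr = cv2.imread("frame_001.png", cv2.IMREAD_GRAYSCALE)

    # flow[y, x] = (dx, dy): per-pixel displacement from prev to curr.
    flow = cv2.calcOpticalFlowFarneback(prev, curr, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    speed = np.linalg.norm(flow, axis=2)   # magnitude of each pixel's motion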
Low-level features give us the ability to apply more general techniques.
Let's talk about a specific motion feature: trajectory. A trajectory is
a set of points recorded from motion perceived over space. It's the
coordinates of where something was all the way to where it is now.
It's nice because all motion has a trajectory. Choosing the window size for
analysis is important. Note that a trajectory will usually be aliased when
initially recorded from a camera. Since whole regions can have trajectories
as well as flows, trajectory doesn't have to be totally granular.
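A minimal sketch of a trajectory as data, with made-up tracked
positions, plus the first differences you would derive from it:

    import numpy as np

    # Hypothetical (x, y) image positions of one tracked feature, per frame.
    tracked_positions = np.array([[10.0, 5.0], [11.2, 5.1], [12.6, 5.4],
                                  [14.1, 6.0], [15.5, 6.9]])

    velocity = np.diff(tracked_positions, axis=0)            # (dx, dy) per frame
    speed = np.linalg.norm(velocity, axis=1)                 # s
    direction = np.arctan2(velocity[:, 1], velocity[:, 0])   # phi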
So what if all this motion is overkill? Motion events may be enough to
discern between high-level tasks. A motion event is simply something like
the presence of motion, the absence of motion, a change in direction, a
change in an nth-order derivative, or any subjective division of motion
space you can easily (hackishly) detect that gives you better recognition rates.
However, certain classes of motion can be recognized using an important
technique called the Trajectory Primal Sketch: translation, rotation,
projectile, and cycloid motions can all be discriminated.
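A hacky sketch of turning the speed/direction signals above into motion
events (the thresholds are arbitrary, not from the paper):

    import numpy as np

    def motion_events(speed, direction, move_thresh=0.5, turn_thresh=np.pi / 6):
        events = []
        for t in range(1, len(speed)):
            if speed[t - 1] < move_thresh <= speed[t]:
                events.append((t, "motion starts"))
            elif speed[t - 1] >= move_thresh > speed[t]:
                events.append((t, "motion stops"))
            else:
                # wrap the angle difference into (-pi, pi] before thresholding
                dphi = np.angle(np.exp(1j * (direction[t] - direction[t - 1])))
                if abs(dphi) > turn_thresh:
                    events.append((t, "direction change"))
        return events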
MODELS AND COMPARISON
MLD: moving light displays - simulate corresponding point motion
(e.g. attaching lights to joints). People can detect walking and
gender from these displays. One theory to explain how people can do
this is SFM: structure from motion. That is, people use the motion to
re-create a 3d structure representation and recognize that. The other
theory (motion-based recognition) is that motion is used to recognize
directly, without having to recreate the structure first.
SFM: theoretical solutions of SFM. Just getting the structure from
motion does not constitute recognition, since that structure must be
analyzed too to categorize it.
Motion based recognition has two steps:
- Find the appropriate representation for the objects
modeled. This is taken from the raw data. Representations can be low or
high level (trajectories or motion verbs (Koller, Tsotsos)).
- Match the input with a model.
Methods for extracting 2d motion:
- motion correspondence
- optical flow
Region-based features: "features derived from the use of an extended
region or from a whole image"
Figure 1:
- Sequence of images (f1, f2, f3, ... , fn)
- Trajectory-Based Features (motion trajectories,
spatiotemporal curves, reference curves)
- Velocity (Vx, Vy) -- motion events
- Speed and Direction (s, phi) -- motion events
- Joint Angles -- relative motion, motion events
- Optical Flow
- Normal Flow Statistics
- Correlation
- Average Flow of a Region
- Region-based features
- binary image features -- mesh feature cookbook
- Graylevel Image Features -- model views, Eigen
Images
Optical flow methods:
- differential methods. "Compute the velocity from
spatio-temporal derivatives of image intensity" (a least-squares
sketch of this appears after the list)
- region-based matching. "... velocity is defined as the shift
yielding the best fit between image regions, according to some
similarity or distance measure."
- energy-based (frequency-based) techniques. "... compute optical
flow using the output from the energy of velocity-tuned filters
in the Fourier domain."
- phase-based techniques. velocity is defined in terms "of the
phase behaviour of band-pass filter outputs" (e.g. zero-crossing
techniques)
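Here is a minimal sketch of the differential family: Lucas-Kanade-style
least squares on the constraint Ix*u + Iy*v + It = 0 over a small
window. The window size and names are illustrative:

    import numpy as np

    def flow_at(Ix, Iy, It, y, x, half=7):
        # Ix, Iy, It: spatial and temporal image derivatives at one time step.
        win = np.s_[y - half:y + half + 1, x - half:x + half + 1]
        A = np.stack([Ix[win].ravel(), Iy[win].ravel()], axis=1)   # N x 2
        b = -It[win].ravel()
        # Least squares for (u, v); it breaks down when A'A is near-singular,
        # e.g. along a straight edge -- the aperture problem in action.
        uv, *_ = np.linalg.lstsq(A, b, rcond=None)
        return uv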
Problems with optical flow:
- aperture problem: without enough feature-data, motion within
your limited field of view can be explained in multiple ways.
- boundary oversmoothing: smoothing the flow across object boundaries
makes background pixels appear to move
- segmentation of multiple moving objects
correspondence problem: How to unambiguously map a point in a frame to
a point in the next frame. Combinatorially explosive.
Constraints can be applied to cut down on it (a matching sketch using a
couple of these follows this list):
- proximal uniformity: most objects follow smooth paths, cover
small distances, and do so in a short amount of time.
- maximum velocity
- small velocity change (smoothness of motion)
- common motion
- consistent match
- rigidity
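A toy sketch of correspondence with the maximum-velocity constraint
applied (greedy nearest-neighbour matching; real systems layer on the
smoothness, common-motion, and rigidity constraints too):

    import numpy as np

    def match_points(pts_t, pts_t1, max_disp=20.0):
        # pts_t, pts_t1: (N, 2) and (M, 2) arrays of feature positions
        # in consecutive frames.
        matches, taken = {}, set()
        for i, p in enumerate(pts_t):
            dists = np.linalg.norm(pts_t1 - p, axis=1)
            for j in np.argsort(dists):
                if dists[j] > max_disp:
                    break                      # maximum velocity constraint
                if int(j) not in taken:
                    matches[i] = int(j)
                    taken.add(int(j))
                    break
        return matches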
Simple trajectories are often not enough for recognition.
Relative Motion
In some domains like walking, the relative motion of objects (legs,
arms) is more important than their absolute motion.
Common motion: movement relative to observer
Trajectory Primal Sketch (TPS): Gould and Shah's representation of
significant changes in motion. Can distinguish translation, rotation,
projectile and cycloid.
Motion boundaries: smooth starts, smooth stops, pauses, impulse
starts, impulse stops
Goddard used changes in rotational velocity and direction change of
body segments as motion events, which were input to a connectionist
system.
Region-based features
large or whole-image region processing.
Polana & Nelson:
- mean flow magnitude divided by standard deviation
- positive and negative curl and divergence estimates
- non-uniformity of flow direction
- directional difference statistics in four directions
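A numpy sketch of statistics in the spirit of that feature list,
computed from a dense flow field with finite differences (the exact
formulas here are guesses, not Polana & Nelson's definitions):

    import numpy as np

    def flow_stats(flow):
        # flow: (H, W, 2) array of (u, v) displacements.
        u, v = flow[..., 0], flow[..., 1]
        mag = np.hypot(u, v)
        du_dy, du_dx = np.gradient(u)
        dv_dy, dv_dx = np.gradient(v)
        ang = np.arctan2(v, u)
        return {
            "mean_over_std": mag.mean() / (mag.std() + 1e-8),
            "divergence": (du_dx + dv_dy).mean(),
            "curl": (dv_dx - du_dy).mean(),
            # near 0 when all flow points one way, near 1 when directions spread
            "direction_nonuniformity": 1.0 - np.abs(np.exp(1j * ang).mean()),
        }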
Spatio-temporal cube: A cube of images stacked. (x, y, time) are the
axes. You can use detection of 3d patterns in the cube to do motion
detection.
Matching and classification
Most classification algorithms classify an unknown by the model it is
closest to according to some distance measure. Thus the summaries
below focus on representation, not process.
Ways of representing the unknown and models (a nearest-model matching
sketch follows this list):
- scale space of trajectories (Rangarajan, Allen, Shah) ???
- vectors
- connectionist network (Goddard)
- sequences of symbols from mesh frames (Yamato et al) ???. Apply
HMMs to them
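A minimal sketch of the matching step: classify an unknown feature
vector by its nearest stored model under Euclidean distance (the models
and vectors here are placeholders):

    import numpy as np

    def classify(unknown, models):
        # models: dict mapping class label -> prototype feature vector.
        return min(models, key=lambda label: np.linalg.norm(unknown - models[label]))

    models = {"walk": np.array([1.0, 0.2]), "run": np.array([2.5, 0.9])}
    print(classify(np.array([2.3, 0.8]), models))   # -> "run"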
Motion recognition
Cyclical motion
Two cycles are needed for detection. You can do it by noticing ST-curves.
Lipreading
Petajan: find where the mouth is by locating the nostril. Change the
image so that it is just a black blob where the mouth opening
is. Cluster training images into 255 groups. Pick representatives for
each group, put them all in a code book in order of black
area. Associate them with letters, input is classified as nearest match.
Finn & Montgomery: distances between distinct points on the mouth.
Mase & Pentland: Used optic flow to find movement-- elongation of
mouth and mouth opening.
Martin & Shah: uses a sequence of dense optical flow fields arount the
mouth. Spatially and temporally warped to control for how long the
speaker takes to say it.
Kirby et al: uses a set of eigenvectors (these are like the factors in
factor analysis). Eigenlips.
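The eigenvector idea, sketched with plain numpy PCA over a stack of
made-up vectorized mouth-region images:

    import numpy as np

    images = np.random.rand(100, 32 * 32)     # 100 flattened mouth-region images
    mean = images.mean(axis=0)
    U, s, Vt = np.linalg.svd(images - mean, full_matrices=False)
    eigenlips = Vt[:10]                       # top 10 principal directions
    codes = (images - mean) @ eigenlips.T     # low-dimensional code per image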
Gesture interpretation
Darnell & Pentland: "uses an automatic view-based approach to build
the set of view models from which gesture models will be created." ???
Works in real time.
Davis & Shah: track trajectory of each finger. Works in real time.
Motion verb recognition
Motion verb recognition: "the association of natural language verbs
with the motion performed by a moving object in a sequence of images."
Koller, Heinze, Nagel: 119 German verbs in 4 categories: (1) verbs
describing the action of the vehicle only, and verbs which make reference
(2) to the road, (3) to other objects, or (4) to other locations. They
were also broken up according to whether they described the beginning,
middle, or end of events. There are 13 attributes (computed from the
sequence). Attributes had preconditions, monotonicity conditions, and
postconditions. These indicate how acceptable an event is for the
beginning, middle, and end of an attribute.
Tsotsos: system: ALVEN. detected abnormal heart behavior. Used
semantic nets, frames, type hierarchies, inheritance. Markers were put
on the heart. With this rich knowledge base, it was able to describe
in detail the motion of the heart's LV wall.
Temporal textures classification
Nelson & Polana: ripples on water, wind in leaves, cloth waving in
wind. Features based on optical flow fields. Used vector analysis.
Human motion tracking and recognition
labelling: identifying a human's parts in a movie.
tracking: finding part trajectories.
Human motion tracking using motion models
Modelling of the human body
3d tracking uses volumetric models and stick figures, labelling uses
2d models. Stick figures can be described with few parameters. The
volumetric models use generalized cylinders.
Marr & Nishihara: hierarchy of cylinders. First the whole body is one
cylinder, which has components of other cylinders.
Adjunct relations are relations of cylinders to the higher level
cylinders. Advantage: can be refined as needed.
O'Rourke & Badler: segments and joints. Flesh is represented as
spheres fixed on the segments. (600 of them). Joint angle constraints,
collision detection.
Hogg: This system was designed for the law enforcement team at Hazard
county. Builds on Marr & Nishihara. Uses elliptical cylinders.
Rohr: builds on Marr & Nishihara. Has clothing. He argues that's how
we usually see people.
Leung & Yang:
apars: ribbons or antiparallel lines. They enclose regions.
modelling of human motion
Uses joint angles (a joint-angle computation sketch follows this list).
- flexion: two body segments change relative position, decreasing the angle between them
- abduction/adduction: movement of body away/toward main body
- Rotation
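A small sketch of computing one joint-angle feature from three 2d joint
positions (the coordinates below are made up):

    import numpy as np

    def joint_angle(a, b, c):
        # Angle at joint b formed by segments b->a and b->c, in degrees.
        v1 = np.asarray(a, float) - np.asarray(b, float)
        v2 = np.asarray(c, float) - np.asarray(b, float)
        cosang = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
        return np.degrees(np.arccos(np.clip(cosang, -1.0, 1.0)))

    print(joint_angle((0, 0), (0, 1), (1, 1)))   # hip, knee, ankle -> 90.0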
Hogg: left hip, knee, shoulder, elbow. One person used for
data. Symmetry assumed.
Rohr: hip, knee, shoulder, elbow. 60 men used for data.
Chen & Lee: used the following constraints: both arms or legs can not
be in front at the same time. the arm/leg pair cannot swing forward or
backward at the same time. sholders and elbows move cooperatively
(same for hip and knee). arm and leg swing trajectory is paralled to
direction of motion. at most 1 knee has a flexion angle, when one
does, the other is perpendicular to the ground.
Akita: key frame sequence of stick figures to model movements. Hogg
and Rohr both tried this early on and abandoned it.
Recognizing body parts
Akita: like above, uses cones. More sound.
Leung & Yang: annotated sequence of images. uses apars.
recognition of human movements
Johansson: showed that walking can be detected in 200 ms (half a
cycle) with MLDs. Abstracts low level features into higher level
ones.
Scenario hierarchy: combined components (things) and assemblies
(combinations of things).
Yamato: used HMMs to analyze tennis strokes. Insensitive to noise.
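A toy sketch of the HMM part: scoring a symbol sequence against a
discrete HMM with the forward algorithm (all parameters below are made
up; Yamato et al. learn theirs from mesh-feature symbol sequences).
Classification would pick the stroke class whose HMM scores highest:

    import numpy as np

    def forward_likelihood(obs, pi, A, B):
        # pi: (N,) start probs, A: (N, N) transitions, B: (N, M) symbol emissions.
        alpha = pi * B[:, obs[0]]
        for o in obs[1:]:
            alpha = (alpha @ A) * B[:, o]
        return alpha.sum()          # P(observation sequence | model)

    pi = np.array([0.6, 0.4])
    A = np.array([[0.7, 0.3], [0.4, 0.6]])
    B = np.array([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])
    print(forward_likelihood([0, 1, 2, 2], pi, A, B))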
Discrimination between humans from their motion
Rangarajan: Same shape, different motions vs same motion, different
shape. Trajectories of joints.
Tsai et al.: Curvature of trajectory, computed frequency. Cycles must
be aligned (as of 1995).
Conclusion
future work:
- multiple moving objects
- clothing
Summary author's notes:
- Thanks to Gabriel Brostow for his help in understanding some of
the concepts.