How Mocap Works: Trajectorization and Labeling

So far in the series, we’ve started in the middle at reconstruction.  Then we took a step back and talked about reflectivity and markers.  Now, we’re going to move forward again, into the steps after reconstruction.

This article will be a little different from the previous ones, in that it’s more theoretical than practical.  That is to say, it’s the theory of how these kinds of things are done, not necessarily how it’s done in Arena or in Vicon’s IQ.  Both systems are really closed boxes when it comes to a lot of this.  I can say that the theory explained here is the basis for a series of operators in Kinearx, my "in development" mocap software.  And most of the theory is used in some form or another in Arena and IQ as well.  It just may not work exactly as I’m describing it.  Also, it’s entirely possible I’m overlooking some other techniques.  It would be good if this post spurred some discussion of alternate techniques.

So, to review, the mocap system has triangulated the markers in 3d space for each frame.  However, it has no idea which marker is which.  They are not strung together in time.  Each frame simply contains a bunch of 3d points that are separate from the 3d points in the previous and next frames. I’ll term this "raw point cloud data."

Simple Distance Based Trajectorization

Theory:  Each point in a given frame can be compared to each point in the previous frame.  If the current point is closer than a given distance to a point in the previous frame, there’s a good chance it’s the same marker, just moved a little.

Caveats:  The initial temptation here will be to turn up the threshold, so that when the marker is moving, it still registers as being close enough.  The problem is that the distance one would expect between markers on a small to medium object is close to the distance they would be expected to travel between frames if the object were moved at a medium speed.  It’s the same order of magnitude.  Therefore, there’s a good chance that it will make mistakes.

Recommendation:  This can be a useful heuristic.  However, the threshold must be kept low.  The result will be trajectorization of markers that are moving slowly or are mostly still.  Any faster movement will very quickly pass over the threshold and keep moving markers from being trajectorized.  This technique could be useful for creating a baseline or starting point.  However, it should probably be ignored if another more reliable heuristic disagrees with it.
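Here’s a minimal sketch of the idea in Python.  It is not how Arena or IQ do it; the function name, the greedy closest-first pairing and the millimeter units are all assumptions made for illustration.

```python
import math

def match_by_distance(prev_points, curr_points, max_dist):
    """Greedily pair each current point with the nearest unclaimed previous point.

    Returns a dict mapping index in curr_points -> index in prev_points.
    Points farther than max_dist from every previous point stay unmatched.
    """
    # Collect every candidate pair within the threshold, closest first.
    candidates = []
    for ci, c in enumerate(curr_points):
        for pi, p in enumerate(prev_points):
            d = math.dist(c, p)
            if d <= max_dist:
                candidates.append((d, ci, pi))
    candidates.sort()

    matches = {}
    used_prev = set()
    for d, ci, pi in candidates:
        if ci in matches or pi in used_prev:
            continue  # one of the two points is already claimed
        matches[ci] = pi
        used_prev.add(pi)
    return matches

# Two markers: one nearly still, one moving fast (hypothetical units: millimeters).
prev_frame = [(0.0, 0.0, 0.0), (100.0, 0.0, 0.0)]
curr_frame = [(0.5, 0.0, 0.0), (140.0, 0.0, 0.0)]
matches = match_by_distance(prev_frame, curr_frame, max_dist=5.0)
print(matches)
```

With the low threshold, the still marker is matched and the fast one is left unmatched, which is exactly the behavior described above.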

Trajectorization Based on Velocity

Theory:  When looking at an already trajectorized frame, one can use the velocity of a trajectory to predict the location of a point in the next frame.  Comparing every point in the new frame against the predicted location, with a small distance threshold should yield a good match.  Since we are tracking real world objects that actually have real world momentum, this should be a valid assumption.  This technique can also be run in reverse.  This technique can be augmented further by measuring acceleration and using it to modify the prediction.

Caveats:  Since there is often a lot of noise involved in raw mocap data, a simple two frame velocity calculation could be WAY off.  A more robust velocity calculation taking multiple samples into consideration can help, but increases the likelihood that the data samples are from too far back in time to be relevant to the current velocity and acceleration of the marker (by now, maybe the muscle has engaged and is pushing the marker in a different direction entirely).  An elastic collision will totally throw this algorithm off.  Since the orientation of the surfaces that are colliding is unknown to the system, it’s not realistic for it to be able to predict direction.  And since most collisions are only partially elastic, the distance cannot be predicted.  Therefore, an elastic collision will almost always result in a break of the trajectory.

Recommendation:  This heuristic is way more trustworthy than the simple distance calculation.  The threshold can be left much lower and should be an order of magnitude smaller than the distance a moving marker travels per frame.  It can also be run multiple times with different velocity calculations and thresholds.  The results should be biased appropriately, but in general, confidence in this technique should be high.
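A rough sketch of the simplest version, in Python.  It assumes a constant velocity taken from the last two samples only (the noisy case described in the caveats), and the function names and units are hypothetical.

```python
import math

def predict_next(trajectory):
    """Constant-velocity prediction from the last two samples of a trajectory."""
    (x0, y0, z0), (x1, y1, z1) = trajectory[-2], trajectory[-1]
    # next = last + (last - previous)
    return (2 * x1 - x0, 2 * y1 - y0, 2 * z1 - z0)

def extend_by_velocity(trajectory, curr_points, max_dist):
    """Return the index of the point closest to the prediction, or None."""
    predicted = predict_next(trajectory)
    best_index, best_dist = None, max_dist
    for i, p in enumerate(curr_points):
        d = math.dist(p, predicted)
        if d <= best_dist:
            best_index, best_dist = i, d
    return best_index

# A marker moving 40 units per frame along x (hypothetical units).
trajectory = [(0.0, 0.0, 0.0), (40.0, 0.0, 0.0)]
new_frame = [(0.5, 0.0, 0.0), (81.0, 0.0, 0.0)]
hit = extend_by_velocity(trajectory, new_frame, max_dist=2.0)

# After an abrupt bounce, nothing lands near the prediction and the trajectory breaks.
after_bounce = [(10.0, 0.0, 0.0)]
miss = extend_by_velocity(trajectory, after_bounce, max_dist=2.0)
print(hit, miss)
```

Note the threshold here is 2 units while the marker travels 40 per frame, so the match is made on the prediction, not on raw proximity.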

Manual Trajectorization

Theory:  You, the human, can do the work yourself.  You are trustworthy.  And it’s your own fault if you’re not.

Caveats:  Who has time to click click click every point in every frame to do this?

Recommendation:  Manual trajectorization should be reserved for extremely difficult small sections of mocap, and for sparse seeding of the data with factual information.  Confidence in a manual trajectory should be extremely high however.

Labeling enforces Trajectorization

Theory:  If the labeling of two points says they’re the same label, then they should be part of the same trajectory.

Caveats:  Better hope that labeling is right.

Recommendation:  We’re about to get into labeling in a bit.  So you might think of this as a bit of a circular argument.  The points are not labeled yet.  And they’re trajectorized before we get to labeling.  So it’s too late, right?  Or too early?  Not necessarily.  I can only really speak for Kinearx here, not Arena or IQ.  However, Kinearx will approach the labeling and trajectorization problems in parallel.  So in a robust pipeline, there will be labeling data and trajectorization data available.  The deeper into the pipeline, the more data will be available.  So, assuming you limit a trajectorization decision to labeling data that is highly trusted, this technique can also be highly trusted.

Trajectorization enforces Labeling

Theory:  If a string of points in time is trajectorized, and one of those points is labeled, all the points in the trajectory can be labeled the same.

Caveats: Better hope that trajectorization is right.

Recommendation:  Similar to the previous technique, this one is based on execution order.  IQ uses this very clearly.  You can see it operate when you start manually labeling trajectories.  The degree to which Arena uses it is unknown, but I suspect it’s in there.  Kinearx will make this part of its parallel solving system.  It will also likely split trajectories based on labeling, if conflicting labels exist on a single trajectory.  I prefer to rely on this quite a bit.  I prefer to spot label the data with highly trusted labeling techniques, erring on the side of not labeling if I’m not sure, and have this technique fill in the blanks.
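As a small illustration in Python, filling in the blanks along one trajectory could look like the sketch below.  The function name, the marker names and the choice to return None on a conflict (rather than actually splitting the trajectory) are assumptions for the example, not a description of IQ or Kinearx.

```python
def fill_labels(samples):
    """samples: per-frame labels along one trajectory, None where unlabeled.

    Returns the filled list if the known labels agree, or None if they
    conflict (a hint that the trajectory should probably be split).
    """
    known = {label for label in samples if label is not None}
    if len(known) > 1:
        return None            # conflicting labels: do not guess
    if not known:
        return list(samples)   # nothing to propagate
    label = next(iter(known))
    return [label] * len(samples)

# One trusted spot label is enough to label the whole trajectory.
filled = fill_labels([None, None, "LELB", None])
# Two different labels on one trajectory means something upstream is wrong.
conflict = fill_labels(["LELB", None, "RELB"])
print(filled, conflict)
```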

Manual Labeling

Theory:  You, the human, can do the work yourself.  You are trustworthy.  And it’s your own fault if you’re not.

Caveats:  Who has time to click click click every point in every frame to do this?

Recommendation:  Manual labeling should be reserved for extremely difficult sections of mocap, and for sparse seeding of the data with factual information.  Confidence in a manual label should be extremely high however.  When I use IQ, I take an iterative approach to the process and have the system do an automatic labeling pass, to see where it’s having trouble on its own.  I then step back to before the automatic labeling pass and seed the trouble areas with some manual labeling.  Then I save and set off the automatic labeling again.  Iterating this process, adding more manual labeling data, eventually results in a mostly correct solve.  Kinearx will make sure to allow a similar workflow, as I’ve found it to be the most reliable to date.

Simple Rigid Body Distance Based Labeling

Theory:  If you know a certain number of markers to move together because they are attached to the same object, you can inform the system of that fact.  It can measure their distances from one another (calibrate the rigid body) and then use that information to identify them on subsequent frames.

Caveats:  Isosceles triangles and equilateral triangles cause issues here.  There is a lot of inaccuracy and noise involved in optical mocap and therefore, the distances between markers will vary to a point.  When it comes to the human body, there is a lot of give and stretch.  Even though you might want to treat the forearm as a single rigid body, the fact is, it twists along its length and markers spread out over the forearm will move relative to one another.

Recommendation:  This is still the single best hope for automatic marker recognition.  When putting markers on objects, it’s important to play to the strengths and weaknesses of this technique.  So, make sure you vary the distances between markers.  Avoid making equilateral and isosceles triangles with your markers.  Always look for a scalene triangle setup.  When markering similar or identical objects, make sure to vary the marker locations so they can be individually identified by the system (this includes left and right sides of the human body).  If this is difficult, consider adding an additional superfluous marker on the objects in a different location on each, simply for identification purposes.  On deforming objects (such as the human body), try to keep the markers in an area with less deformation (closer to bone and farther from flesh).  Make good use of slack factors to forgive deformation and inaccuracy.  Know the resolution of your volume.  Don’t place markers so close that your volume resolution will get in the way of an accurate identification.
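To make the triangle advice concrete, here is a brute-force Python sketch.  It calibrates a rigid body as a table of pairwise distances and then tries every assignment of unlabeled points to labels.  The names, the slack value and the try-everything search are assumptions for illustration; a real system would be much smarter about the search.

```python
import itertools
import math

def calibrate(labeled_points):
    """Store the distance between every pair of labeled markers."""
    return {
        (a, b): math.dist(labeled_points[a], labeled_points[b])
        for a, b in itertools.combinations(sorted(labeled_points), 2)
    }

def label_rigid(distances, points, slack):
    """Return every assignment label -> point index whose distances fit within slack."""
    labels = sorted({name for pair in distances for name in pair})
    fits = []
    for combo in itertools.permutations(range(len(points)), len(labels)):
        assignment = dict(zip(labels, combo))
        if all(
            abs(math.dist(points[assignment[a]], points[assignment[b]]) - d) <= slack
            for (a, b), d in distances.items()
        ):
            fits.append(assignment)
    return fits

# Scalene layout (sides 30, 40, 50): exactly one assignment fits.
scalene = calibrate({"A": (0, 0, 0), "B": (30, 0, 0), "C": (0, 40, 0)})
moved = [(10, 45, 5), (10, 5, 5), (40, 5, 5)]  # same body, translated and shuffled
scalene_fits = label_rigid(scalene, moved, slack=1.0)

# Isosceles layout: A and B can be swapped, so the labeling is ambiguous.
iso_points = {"A": (0, 0, 0), "B": (30, 0, 0), "C": (15, 40, 0)}
isosceles_fits = label_rigid(calibrate(iso_points), list(iso_points.values()), slack=1.0)
print(len(scalene_fits), len(isosceles_fits))
```

The scalene body yields a single answer, while the isosceles body yields two equally good ones, which is why the marker layout matters so much.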

Articulated Rigid Body Distance and Range of Motion Based Labeling

Theory:  This is an expansion of the previous technique, to include the concept of connected, jointed or articulated rigid body systems.  If two rigids are connected by a joint (humerus to radius in a human arm for example) the joint location can be considered an extra temporary marker for distance based identification on either rigid.  Therefore, if one rigid is labeled enough to find the location of the joint, the joint can be used to help label the other rigid.  Furthermore, information regarding the range of motion of the joint can help cull misidentifications.

Caveats:  It’s possible that the limits on a joint’s rotation could be too restrictive compared with the reality of the subject, and cull valid labels.

Recommendation:  This is perhaps the most powerful technique of all.  It’s nonlinear and therefore somewhat recursive in nature.  However, most importantly, it has a concept of structure and pose and therefore can be a lot more intelligent about what it’s doing than other more generic methods.  It won’t help you track a bunch of marbles or a swarm of ants, but anything that can be abstracted to an articulated jointed system (most things you’d want to mocap) is greatly assisted by this technique.  You can also go so far as to check the pose of the system from previous frames against the current solution to throw out labeling that would create too much discontinuity from frame to frame.

Conclusion

These techniques get you what you need to trajectorize and label your data.  However, there are plenty of places to go from here.  These steps serve multiple purposes.  They’ll be executed for realtime feedback.  They’ll be the first steps in a cleanup process.  They may be used and their results exported to a 3rd party app such as MotionBuilder.  Later steps may include:

  • more cleanup
  • export
  • tracking of skeletons and rigids
  • retargeting
  • motion editing

IQ, Arena, Blade and Kinearx may or may not support all of those paths.  For example, currently, Arena will allow more cleanup.  It will track skeletons and rigids.  It will stream data into MotionBuilder.  It will export data to MotionBuilder.  It will not retarget.  It will not get into motion editing.  MotionBuilder can retarget and motion edit, and it also has some cleanup functionality.  IQ will allow more cleanup, export and tracking.  It does not perform retargeting or motion editing.  Blade supports all of this.  Kinearx will likely support some retargeting but will stay clear of too much motion editing in favor of a separate product that will be integrated into an animator’s favorite 3d package (Maya or XSI for example).

The next topic will likely be tracking of skeletons and rigids.  You might notice that we’ve kind of gotten into this a bit with the labeling of articulated rigid systems.  And you’d be correct in making that identification.  A lot of code would be shared between the labeler and the tracker.  However, what’s best for labeling may not be best for tracking.  So the implementation is usually different at a higher level because the goals are different.

How Mocap Works: Markers and Retroreflectivity

The NaturalPoint cameras as well as your typical Vicon and Motion Analysis systems are what are known as Optical Motion Capture Systems.  More specifically, in their more common configuration, they’re Retroreflective Optical Motion Capture Systems.  They can also be configured as active marker systems.  It’s just less common.

Diffuse Bounce, Reflectivity and Retroreflectivity

Wikipedia has a page on these different types of reflected light (doesn’t it always?).  However, it’s a bit dense.  I’ll summarize and provide context.

There are plenty of potential light sources in your mocap space.  Light can come through a window.  It can come from light bulbs.  It can come from the LED ring around the lens of your cameras.  When light hits the surface of an object, you tend to think about it as a whole bunch of individual rays, generally coming from the same direction and at the same angle if they come from a single light source.  Anyhow, when the light strikes the surface, lots of different things happen to it.  For example, some of the light can be absorbed.  The resulting energy needs to go somewhere and can become heat, light, electricity etc.  This is how most pigments work.  Most of the light is usually not absorbed however.  It’s either reflected or refracted.  A simplified explanation of refracted light is that it passes through the object, like say, glass.  Reflected light however, is what we’re more concerned with.

Simple reflection, or specular reflection, is what you find in a mirror.  The light ray bounces off a surface as per the law of reflection: the angle at which it leaves the surface equals the angle at which it arrived.  More important than any one ray following the law of reflection, in a material that has high specularity, most if not all the rays follow the law and end up having a similar angle after being reflected.  Hence an image as seen in a highly specular material maintains its general appearance.  It doesn’t blur or distort beyond recognition.  This is true of a mirror as an extreme example.  It’s also true of say, car paint.  You can see things reflected in car paint and as such, it can be said that a significant number of light rays hitting car paint exhibit a tight specular reflection.  Or you could say car paint has high specularity (not as high as a mirror).

Diffuse bounce light is another form of reflection.  Diffuse bounce light is the light that you see when looking at a matte object, such as say, concrete or paper.  In the case of diffuse light, the incoming rays still respect the law of reflection.  However, the material is rough enough that it’s highly faceted at a microscopic level.  That is to say, at any given point on the surface, its orientation or surface normal is somewhat random.  So while individual rays reflect, as a whole, they scatter all over the place because the material doesn’t exhibit a single smooth uniform surface for all the rays to bounce the same direction off of.  The appearance and general characteristics of such a surface can generally be predicted through Lambert’s cosine law.  Hence, why in 3d animation, we’re often applying "Lambert" shaders to objects for their diffuse component.  Diffuse bounce light makes up the majority of light you see when looking at objects in our world.  Anything that’s sorta matte finish is putting out a lot more diffuse bounced light than other types of light.

Retroreflected light is light that manages to reflect directly back at the light source.  Retroreflection doesn’t usually happen naturally all that much.  However, it is incredibly useful for optical motion capture and safety.  "Reflective" paint on the road at night and road signs are examples of man made retroreflective materials used for safety.  Also, those strips of "reflective" material you put on Halloween costumes are good examples.  Notice these materials are marketed as "reflective" when in reality it’s not their simple reflective characteristics that make them desirable.  It’s their retroreflective characteristics, a subset of reflectivity, that make them work.  Marketing often isn’t concerned with being precise.  Technically, a roll of masking tape is reflective tape.  It’s just mostly diffuse reflection is all.  And it probably won’t alert anyone driving a car as to its presence.
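The three behaviors above can be compared with a few lines of vector math.  This Python sketch uses the standard mirror formula r = d - 2(d·n)n for specular reflection and the cosine term from Lambert’s law for diffuse brightness; the function names are made up for the example, and the retroreflector is idealized as sending every ray straight back.

```python
import math

def reflect(direction, normal):
    """Specular: mirror an incoming direction about a unit surface normal."""
    dot = sum(d * n for d, n in zip(direction, normal))
    return tuple(d - 2 * dot * n for d, n in zip(direction, normal))

def lambert(to_light, normal):
    """Diffuse: relative brightness is the cosine between the unit vector
    pointing toward the light and the unit surface normal (never negative)."""
    return max(0.0, sum(l * n for l, n in zip(to_light, normal)))

def retroreflect(direction):
    """Idealized retroreflector: the ray goes straight back where it came from."""
    return tuple(-d for d in direction)

up = (0.0, 1.0, 0.0)
incoming = (1.0, -1.0, 0.0)              # heading down and to the right
mirror_out = reflect(incoming, up)        # bounces up and to the right
retro_out = retroreflect(incoming)        # heads back toward the source

# Light arriving 60 degrees off the normal gives half the head-on brightness.
tilted = (math.sin(math.radians(60)), math.cos(math.radians(60)), 0.0)
brightness = lambert(tilted, up)
print(mirror_out, retro_out, brightness)
```

The mirror sends the light away from the source unless you happen to be lined up with it, while the retroreflector returns it no matter what angle it arrived from.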

What does this have to do with Mocap?

So, how do we use this knowledge to get our mocap cameras to see markers and nothing else, and hence make the task of tracking those markers easier?  Well, it’s generally a matter of contrast.  If you can make your markers brighter than anything else in the frame, you can adjust your exposure and threshold the image to knock everything else out of contention, leaving you with a mostly black image, with little gray and white dots that are your markers.
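Thresholding itself is about as simple as image processing gets.  Here is a toy Python version on a hypothetical 8-bit grayscale frame stored as nested lists; real cameras do this in hardware or firmware, and the pixel values and threshold level are invented for the example.

```python
def threshold(image, level):
    """Set every pixel darker than level to 0 and leave brighter pixels alone."""
    return [[p if p >= level else 0 for p in row] for row in image]

# Dim background clutter plus one bright retroreflective marker in the middle.
frame = [
    [12, 30, 18, 25],
    [20, 240, 255, 22],
    [15, 235, 250, 40],
    [10, 28, 35, 14],
]
clean = threshold(frame, level=200)
for row in clean:
    print(row)
```

The bigger the brightness gap between marker and background, the more room you have to place that level without losing markers or letting clutter through.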

It’s probably worth noting that this is not the only way to accomplish the task of tracking markers.  Another approach would be pattern recognition.  A system based on pattern recognition would probably count as an optical mocap system but doesn’t fall into the historical category of an optical system as used in the entertainment industry.

Anyhow, back to contrast: the task of making your markers brighter than everything else.  Simple specular reflectivity makes some pretty bright highlights.  You could theoretically conceive of a scenario where you know where your light source is and if you catch a reflection in a marker in a camera, you could solve for the marker.  In reality though, this isn’t useful.  It’s rare that you’ll catch a reflection of a light source in a camera.  You’d need way too many light sources to make it common enough to use.  It’s possible you could take this to an extreme and set up a colored dome and then use the color of the dome reflected in a marker to track the ray back to its source location, but again, this is speculative and the kind of setup you’d need is expensive and quite disruptive to the shooting environment.  Remember, one of the goals of viable mocap systems is to be able to be used in parallel with principal photography on a movie set.

Diffuse light is potentially useful.  However, fact of the matter is, most things are fairly diffuse.  Things that are white or light gray are highly diffuse.  A diffuse object can only put out as much light as it takes in.  It’s not possible to be SO much more efficient than a white piece of paper.  So instead, approaches to using diffuse light to generate contrast go the other direction.  You try to make everything in your environment matte black (full absorption, no diffuse bounce).  That way, your markers show up bright by contrast.  Again, this solution isn’t ideal.  The room, the cameras, the people, everything but the markers must be matte black to get contrast this way.

As you might imagine, the solution here is retroreflection.  Again, retroreflection is light that reflects back at the light source.  So it’s super bright like specular reflection, but unlike specular reflection, it’s easy to pick up.  You know exactly where it’s going, right back to the source.  All you need to do is make sure your light source is also your camera lens (or close enough).  This is of course why NaturalPoint cameras, and optical mocap cameras in general, tend to have LED rings around the lens.  NP camera LEDs show up a dull pink when they’re active but don’t let this fool you.  They are actually putting out a ton of light.  It’s just infrared, about 850 nanometers in wavelength.  According to Jim Richardson, the CMOS sensors in the cameras are actually more responsive to visible light than IR.  However, IR light is usually used in mocap because a) we can’t see it, so it doesn’t distract us, and b) motion picture film and video cameras already filter it out because they are mimicking our own visual response.  This way, the mocap system’s lighting doesn’t interfere with human vision based imaging.

Markers

If you’ve got your light source and camera all set up to pick up retroreflective light, then all that’s left to do is make sure your marker actually is retroreflective.  There are typically two ways this is done by contemporary humans.

Firstly, we can use "corner reflectors."  An example of a corner reflector is a bicycle reflector.  Corner reflectors are made by butting three mirrors together at right angles.  A bicycle reflector often has hundreds of little mirrors set up in triplets in this manner.  Believe it or not, this does actually work.  I have to cover up my bicycle all the time when I use cameras in my apartment.  I have looked into getting a bunch of small 1" bicycle reflectors to use as markers and in some situations, they may actually be useful.  Though, there are better solutions.

The second retroreflective material is what’s known as 3M Scotchlite.  Pretty much any retroreflective material you can think of besides corner reflectors comes back to 3M and Scotchlite.  Even those reflective paints on the road are made with materials bought from 3M.  I have a can of "reflective" spray paint from Rust-Oleum.  They bought their materials from 3M.  Scotchlite is based on glass beads and can be bought in many forms, from raw beads (sand like) to textiles to tapes to paints.  Scotchlite comes in different grades and colors.  Generally though, the best retroreflectivity comes from Scotchlite products in which the beads have been bonded to a material by 3M, rather than bonding done by other parties.  So, buying 3M tape or textile is your best bet for mocap.  The material that NaturalPoint sells in their own store is actually the highest quality material I’ve come across.  Markers built from that material perform better than some of the "hard" markers in their store, that clearly had the material sprayed on by a 3rd party.

Emissive Markers

You may have noticed that to this point, we’ve been talking about generating contrast on materials that are bouncing light from a separate light source.  However, it’s possible that a marker could emit its own light.  Generally, these types of markers are known as active markers.  I have actually constructed active markers in the past and will probably do so again within the year.  NaturalPoint actually sells wide throw 850nm LEDs in their store for this kind of application.  Mocap systems by PhaseSpace also work off of active LED markers.  Active markers have benefits and drawbacks.  They often put out a lot more light than a retroreflective marker will and therefore are really easy to track.  They are however expensive, and they do require mounting electronics on your mocap talent.  This can be problematic in some cases.  In some cases, they heat up quite a bit, though this problem can be designed away.

Hopefully some of this has helped give an understanding of what is going on in your mocap volume.  You can use this information to help get better quality captures.  Throwing your cameras into grayscale mode and looking at the environment as the camera sees it will let you see these concepts in action.  It should also give you a better idea of how to go about optimizing your mocap environment and exposure settings for capture.

How Mocap Works: Reconstruction

I’ll start this series by jumping to the middle.  Makes sense right?  Believe me, this is the easy part. All the other parts will make more sense when they are looked at relative to this part.

What is reconstruction?

Different mocap systems define this differently.  I’m going to define it as the task of taking the 2d data from a calibrated multi-camera system and making it 3d.  This is pretty analogous to what the OptiTrack point cloud SDK does.  I’m going to skip calibration for this blog entry.  You can assume that we have a set of cameras and we know where they are in 3d space.  We know their orientation and we know their lensing.  I’m also going to skip the 2d aspects of the process.  You can assume that each camera is providing us with multiple objects (I call them blips) and most importantly, their position on the camera’s backplane (where they are in the camera’s view in 2d).  There is one other thing you must assume.  You must assume that there is error in all those values.  Not just the blips, but also the camera positions, orientations and lens data.  We hope they’re all close, but they’re not perfect.  Reconstruction is the task of taking that data, and turning it into 3d points in space.

Rays

Well, the simplest way to think about it is as follows.  Each blip can be projected out into space from the camera’s nodal point (center of the virtual lens) based on where it is in 2d on that camera’s backplane (virtual imager).  Shooting a point out is generally referred to as shooting a ray or vector.  Any two rays from two separate cameras that intersect are likely to be intersecting at a point in 3d space where there is an actual marker.  This is the reverse process of the cameras seeing the real marker in the first place.  Rays of light bounce from the marker through the lens nodal point and onto the backplane imager where they are encoded and sent to the computer (that’s a little oversimplified but you get the idea).  If a third ray intersects as well, it’s FAR more likely to be a marker than a coincidence (you’d be surprised how often you end up with coincidences running at 100fps+).  So, while you can reconstruct a 3d point from as few as two rays, if you have enough cameras to spend on verification, you’ll get fewer fake markers by requiring that 3 or more rays agree.

This is often referred to as Triangulation

It’s probably worth noting that this is not the typical triangulation you’ll use when, say, calculating the epicenter of an earthquake by knowing its distance from known points.  It’s a different type of triangulation, or should I say, a different subset of triangulation operations.

Residuals 

Sorry, did I say those rays intersected?  That was a mistake.  The rays never intersect.  They just pass closely together.  See, that error I was talking about gets in the way.  So what your typical mocap app will do is deal with residuals to basically say "it’s close enough."  When you give a NaturalPoint system a residual for its point cloud reconstruction, you are telling it that rays that pass within a distance below the residual should be considered as having intersected where the distance between them is the lowest.  A high residual could suck discrete markers together into one larger marker if they are close enough.  A low residual could keep rays from intersecting and result in low marker counts per frame.  You’ll want to balance your residual against the overall accuracy in your volume.  You can get an idea of the accuracy of your volume by looking at the residuals that it gives you at the end of the calibration process.  Or, you can just mess around with it.  You’ll also want to pay attention to the units.  Different systems measure residuals in different units.  Meters, centimeters, millimeters etc.
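For two rays, the "where they pass closest" calculation is a standard bit of geometry.  The Python sketch below finds the closest points on each ray, reports the gap between them as the residual, and uses the midpoint as the reconstructed position.  The function name, the choice of the midpoint and the 5 mm setting are assumptions for illustration, not how any particular mocap product defines its residual.

```python
import math

def closest_approach(o1, d1, o2, d2):
    """Return (midpoint, residual) for two rays, or None if they cannot be paired.

    o1, o2 are ray origins (camera nodal points); d1, d2 are ray directions.
    """
    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))

    w = tuple(a - b for a, b in zip(o1, o2))
    a, b, c = dot(d1, d1), dot(d1, d2), dot(d2, d2)
    d, e = dot(d1, w), dot(d2, w)
    denom = a * c - b * b
    if abs(denom) < 1e-12:
        return None  # rays are (nearly) parallel
    s = (b * e - c * d) / denom  # parameter along ray 1
    t = (a * e - b * d) / denom  # parameter along ray 2
    if s < 0 or t < 0:
        return None  # closest approach is behind one of the cameras
    p1 = tuple(o + s * k for o, k in zip(o1, d1))
    p2 = tuple(o + t * k for o, k in zip(o2, d2))
    midpoint = tuple((u + v) / 2 for u, v in zip(p1, p2))
    return midpoint, math.dist(p1, p2)

# Two cameras two meters apart; the second ray is off by 2 mm in y.
midpoint, residual = closest_approach(
    (-1.0, 0.0, 0.0), (1.0, 0.0, 1.0),
    (1.0, 0.002, 0.0), (-1.0, 0.0, 1.0),
)
max_residual = 0.005  # hypothetical setting, in meters
accepted = residual <= max_residual
print(midpoint, residual, accepted)
```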

Angles

There are other factors that play into the accuracy of a reconstruction.  If two rays have a similar angle (they are more parallel than perpendicular) the accuracy of their reconstruction goes down significantly.  It’s harder to determine accurately at what distance they intersect, as a little inaccuracy in the angles translates to a potentially long distance.  Most of the inaccuracy plays into the depth axis.  If you have rays that are more perpendicular, their inaccuracy is spread evenly along all three axes of potential movement, rather than the one depth axis.  Therefore, most NaturalPoint reconstruction parameters include a threshold for the minimum angle between rays.  Rays that intersect but are closer than the minimum angle are ignored.  The units are important here as well.  I believe they tend to be in radians rather than degrees.
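A minimum angle test might be sketched in Python like this.  The function names and the 10 degree threshold are hypothetical, and folding near-opposite rays onto near-parallel ones with abs() is a choice made for this sketch rather than something any vendor documents.

```python
import math

def angle_between_rays(d1, d2):
    """Angle in radians between the lines carrying two rays, in [0, pi/2]."""
    dot = sum(a * b for a, b in zip(d1, d2))
    norm = math.hypot(*d1) * math.hypot(*d2)
    # abs() treats rays pointing in nearly opposite directions as nearly
    # parallel too (an assumption of this sketch).
    cosine = min(1.0, abs(dot) / norm)
    return math.acos(cosine)

def passes_min_angle(d1, d2, min_angle):
    """True if the pair of rays is spread widely enough to trust."""
    return angle_between_rays(d1, d2) >= min_angle

min_angle = math.radians(10)  # hypothetical threshold, converted to radians
wide = passes_min_angle((1.0, 0.0, 1.0), (-1.0, 0.0, 1.0), min_angle)
narrow = passes_min_angle((0.0, 0.0, 1.0), (0.05, 0.0, 1.0), min_angle)
print(wide, narrow)
```

Doing the conversion explicitly with math.radians is one way to avoid getting bitten by the units question.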

Min and Max Distance 

These are simple.  After a point has been reconstructed, it can be tossed based on its distance from the cameras whose rays produced it.  One really good reason for this is that the light sources on the cameras can flare out objects that are too close, generating WAY too many phantom blips.  Ignoring points that reconstruct that close is a safe bet.  Likewise, throwing out markers that reconstruct far into the distance is also safe, though often not needed.
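In code this is a one-line filter.  A Python sketch, with a hypothetical function name and made-up distances in meters:

```python
import math

def within_range(point, camera_positions, min_dist, max_dist):
    """Keep a reconstructed point only if every contributing camera
    sees it at a distance between min_dist and max_dist."""
    return all(
        min_dist <= math.dist(point, cam) <= max_dist
        for cam in camera_positions
    )

cameras = [(-2.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
good = within_range((0.0, 0.0, 3.0), cameras, min_dist=0.5, max_dist=10.0)
too_close = within_range((-1.9, 0.0, 0.1), cameras, min_dist=0.5, max_dist=10.0)
print(good, too_close)
```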

Hopefully, these basic concepts help explain what is going on inside the black box.  This should give you plenty of concepts with which to start working with camera placement to get better performance out of an optical mocap system.  An obvious freebie would be:  don’t place cameras that are meant to cover the same space too close together.  The angle between rays seeing the same object will be too low to get an accurate reconstruction.

How Mocap Works Series

I’m going to write a series of blog entries about optical motion capture and how it works.  Knowing what’s going on inside a mocap system can help an operator better utilize it.  The series will focus on NaturalPoint’s OptiTrack cameras and systems, with references to other mocap systems and ideas.  It will also occasionally diverge into descriptions of Kinearx because my mind is a bit jumbled.  Sorry.