Monday, August 20, 2012

Kinect SDK Development


At the end of last year I did a bit of work with the Kinect SDK.  I built a Kinect plugin to control Microsoft PowerPoint presentations.  I wrote up my notes at the time, but I've just realised that I never posted a blog about it.

Although this was written months ago, and I know that the Kinect SDK has been updated since then, most of this information is still relevant.


The Kinect unit can provide 4 different streams of data:
  1. Skeletal Tracking.
  2. “Depth” data stream.
  3. “Depth and Player” data stream.
  4. Colour video.

You can subscribe to events which are fired on every refresh (approx. 30 FPS) to handle changes to each of these data streams.
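
As a rough sketch of what that setup looks like under the Kinect for Windows SDK v1 (the API has moved on a little since the beta I originally used, so this is illustrative rather than the exact plugin code):

    using System;
    using System.Linq;
    using Microsoft.Kinect;   // Kinect for Windows SDK v1

    class KinectStreams
    {
        private KinectSensor sensor;

        public void Start()
        {
            // Grab the first connected sensor (assumes a single Kinect is plugged in).
            sensor = KinectSensor.KinectSensors
                .FirstOrDefault(s => s.Status == KinectStatus.Connected);
            if (sensor == null) throw new InvalidOperationException("No Kinect sensor found.");

            // Enable the streams described above.
            sensor.SkeletonStream.Enable();
            sensor.DepthStream.Enable(DepthImageFormat.Resolution320x240Fps30);
            sensor.ColorStream.Enable(ColorImageFormat.RgbResolution640x480Fps30);

            // Each event fires roughly 30 times per second as new frames arrive.
            sensor.SkeletonFrameReady += (s, e) => { /* skeletal tracking */ };
            sensor.DepthFrameReady    += (s, e) => { /* depth and player data */ };
            sensor.ColorFrameReady    += (s, e) => { /* colour video */ };

            sensor.Start();
        }
    }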

Skeletal Tracking

This data stream allows us to track the 3D position of 20 skeletal joints of up to 2 people standing in front of the Kinect unit.

The joints tracked include:
  • Head
  • Shoulders
  • Elbows
  • Wrists
  • Hands
  • Spine centre
  • Hips
  • Knees
  • Ankles
  • Feet

It is a fairly simple task to interrogate these joint positions.  
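
As an illustration (again targeting SDK v1 rather than the beta I used at the time), reading a joint position out of a skeleton frame looks something like this:

    using System;
    using Microsoft.Kinect;

    class SkeletonReader
    {
        public void OnSkeletonFrameReady(object sender, SkeletonFrameReadyEventArgs e)
        {
            using (SkeletonFrame frame = e.OpenSkeletonFrame())
            {
                if (frame == null) return;

                var skeletons = new Skeleton[frame.SkeletonArrayLength];
                frame.CopySkeletonDataTo(skeletons);

                foreach (Skeleton skeleton in skeletons)
                {
                    if (skeleton.TrackingState != SkeletonTrackingState.Tracked) continue;

                    // Each joint position is a 3D point, in metres, relative to the sensor.
                    SkeletonPoint rightHand = skeleton.Joints[JointType.HandRight].Position;
                    Console.WriteLine("Right hand at ({0:F2}, {1:F2}, {2:F2})",
                                      rightHand.X, rightHand.Y, rightHand.Z);
                }
            }
        }
    }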

“Depth” data stream

This is a 320x240 data stream that we can display as video showing depth information, i.e. how far away from the Kinect camera each pixel in the view is.  It is fairly easy to create a greyscale video with dark pixels in the background and light pixels in the foreground.
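
A minimal sketch of that greyscale conversion, assuming SDK v1 and the default depth range of roughly 0.8m to 4m:

    using System;
    using Microsoft.Kinect;

    class DepthRenderer
    {
        public byte[] ToGreyscale(DepthImageFrame frame)
        {
            var rawPixels = new short[frame.PixelDataLength];
            frame.CopyPixelDataTo(rawPixels);

            // One grey byte per pixel: near = light, far = dark.
            var grey = new byte[frame.Width * frame.Height];
            for (int i = 0; i < rawPixels.Length; i++)
            {
                // The top 13 bits of each value hold the depth in millimetres.
                int depthMm = rawPixels[i] >> DepthImageFrame.PlayerIndexBitmaskWidth;
                if (depthMm <= 0) continue;                       // unknown depth stays black

                // Map roughly 0.8m..4m onto 255..0.
                int clamped = Math.Min(Math.Max(depthMm, 800), 4000);
                grey[i] = (byte)(255 - (clamped - 800) * 255 / 3200);
            }

            // 'grey' can now be copied into a WriteableBitmap (Gray8 format) for display.
            return grey;
        }
    }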

“Depth and Player” data stream

This is a copy of the Depth video stream, however the low-order bits of each depth value are used to mark which Player each pixel belongs to.  This means that we could, for instance, render a video showing greyscale depth information (as before) but with the pixels that relate to different players shown in different colours, or we could just render the depth image that relates to the current player, ignoring the background.
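
A sketch of how those bits can be separated out and used to tint the player pixels; the bitmask constants are from SDK v1, and the green tint is just an arbitrary choice for illustration:

    using System;
    using Microsoft.Kinect;

    static class PlayerDepthRenderer
    {
        // The low bits of each raw value identify the player (0 = no player);
        // the remaining bits are the depth in millimetres.
        static void SplitDepthPixel(short rawPixel, out int playerIndex, out int depthMm)
        {
            playerIndex = rawPixel & DepthImageFrame.PlayerIndexBitmask;
            depthMm     = rawPixel >> DepthImageFrame.PlayerIndexBitmaskWidth;
        }

        // Build a BGRA image: player pixels tinted green, everything else greyscale depth.
        public static byte[] Render(short[] rawPixels, int width, int height)
        {
            var bgra = new byte[width * height * 4];
            for (int i = 0; i < rawPixels.Length; i++)
            {
                int player, depthMm;
                SplitDepthPixel(rawPixels[i], out player, out depthMm);

                byte intensity = depthMm <= 0
                    ? (byte)0
                    : (byte)(255 - (Math.Min(Math.Max(depthMm, 800), 4000) - 800) * 255 / 3200);

                int o = i * 4;                                // BGRA layout
                if (player > 0)
                    bgra[o + 1] = intensity;                  // green channel only for players
                else
                    bgra[o] = bgra[o + 1] = bgra[o + 2] = intensity;
                bgra[o + 3] = 255;                            // fully opaque
            }
            return bgra;
        }
    }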

Colour video

This is another straightforward video stream, showing the full colour video from the Kinect unit at 640x480 pixel resolution.



For our PowerPoint controller application I built a WPF application with a standard MVVM structure, and with the Kinect code in a separate code library.

I used the data streams mentioned above to perform the various forms of gesture recognition.  For the most part I used the Skeletal Tracking system, but we also attempted some recognition with the Depth and Player data stream, which I’ll mention later.

Skeletal Feedback

In our WPF application I draw the current skeleton layout on a canvas by simply updating line elements (and an ellipse for the head) with the current Joint positions, translated from 3D coordinates into screen coordinates.
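
The projection from skeleton space to canvas coordinates can be as simple as the sketch below.  The -1..+1 metre range is an assumption that works well enough for someone standing in front of the sensor; the SDK also provides its own coordinate-mapping helpers.

    using System.Windows;
    using System.Windows.Controls;
    using System.Windows.Shapes;

    static class SkeletonCanvas
    {
        // Naive mapping from skeleton space (metres, origin at the sensor)
        // to canvas coordinates, assuming X and Y stay within -1..+1.
        public static Point ToCanvas(float x, float y, Canvas canvas)
        {
            return new Point(
                (x + 1.0) / 2.0 * canvas.ActualWidth,
                (1.0 - (y + 1.0) / 2.0) * canvas.ActualHeight);   // flip Y: screen grows downwards
        }

        // Update one pre-created Line element to join two joints, e.g. shoulder to elbow.
        public static void UpdateBone(Line bone, Point from, Point to)
        {
            bone.X1 = from.X; bone.Y1 = from.Y;
            bone.X2 = to.X;   bone.Y2 = to.Y;
        }
    }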

Movement Gestures

For the main gesture recognition I simply stored the last 20 recorded positions of all of the skeletal joints.  This allows us to make and compare gestures just under a second long.

To record a gesture I stored the recorded positions of one joint (e.g. the right hand) relative to the centre of the shoulders.  I recorded multiple gestures this way, for instance:

  • BreastStroke (Hand Left / Hand Right): A wave of your left or right hand extended directly in front of yourself at shoulder level (swimming).
  • Salute (Hand Right): Bring your arm up to the right of your head at about a 45 degree angle with your elbow slightly forward.
  • WaxOn/WaxOff (Hand Left / Hand Right): Draw a half circle in front of your body (Karate Kid).
  • Flap Up/Down (Hand Left / Hand Right): Extend your arm out to the side and wave up or down (flap your wings).
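
Each gesture is essentially a short rolling history of one joint's position relative to the shoulder centre.  A minimal sketch of that buffer (class and member names here are illustrative, not the actual plugin code):

    using System.Collections.Generic;
    using Microsoft.Kinect;

    class JointHistory
    {
        private const int FrameCount = 20;                 // 20 frames of history at ~30 FPS
        private readonly Queue<SkeletonPoint> positions = new Queue<SkeletonPoint>();

        // Called once per skeleton frame. The joint is stored relative to the
        // shoulder centre so the gesture doesn't depend on where the player stands.
        public void Record(SkeletonPoint joint, SkeletonPoint shoulderCentre)
        {
            positions.Enqueue(new SkeletonPoint
            {
                X = joint.X - shoulderCentre.X,
                Y = joint.Y - shoulderCentre.Y,
                Z = joint.Z - shoulderCentre.Z
            });
            if (positions.Count > FrameCount) positions.Dequeue();
        }

        public SkeletonPoint[] Snapshot() { return positions.ToArray(); }
    }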

To check if a gesture has been triggered, I just do a comparison of the recorded gesture positions with the most recent positions of each relevant Joint.  We calculate the 3D direction and distance between the centre of the shoulders and the Joint for each of the 20 historical positions and compare them with the recorded positions, allowing a small margin of error.

If the last 20 positions match up fairly closely to a recorded gesture then we fire an event to say that the gesture has been observed.

We also don’t allow the same gesture to occur more than once a second because the difference between the first 19 recorded positions and the last 19 positions is often negligible.
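
Putting those pieces together, the matching step might look something like the sketch below; the per-axis tolerance is an illustrative value rather than the exact margin I used.

    using System;
    using Microsoft.Kinect;

    class GestureMatcher
    {
        private readonly SkeletonPoint[] recorded;   // relative positions saved when recording the gesture
        private readonly float tolerance;            // allowed error per axis (illustrative: ~0.15 metres)
        private DateTime lastFired = DateTime.MinValue;

        public event EventHandler GestureRecognised;

        public GestureMatcher(SkeletonPoint[] recordedPositions, float tolerance)
        {
            recorded = recordedPositions;
            this.tolerance = tolerance;
        }

        // 'recent' is the rolling history of relative positions for the relevant joint.
        public void Check(SkeletonPoint[] recent)
        {
            if (recent.Length < recorded.Length) return;

            // Compare frame by frame, allowing a small margin of error on each axis.
            for (int i = 0; i < recorded.Length; i++)
            {
                if (Math.Abs(recent[i].X - recorded[i].X) > tolerance ||
                    Math.Abs(recent[i].Y - recorded[i].Y) > tolerance ||
                    Math.Abs(recent[i].Z - recorded[i].Z) > tolerance)
                    return;
            }

            // Suppress repeats: the same gesture shouldn't fire more than once a second.
            if ((DateTime.UtcNow - lastFired).TotalSeconds < 1) return;
            lastFired = DateTime.UtcNow;

            var handler = GestureRecognised;
            if (handler != null) handler(this, EventArgs.Empty);
        }
    }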

Simple Positional Movement

Here we simply convert a Joint position into screen coordinates.  An event is fired every frame updating the position of the Joint.  This can be used, for example, to move the mouse pointer with the right hand.
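
Moving the actual Windows cursor can be done with the Win32 SetCursorPos call; the -1..+1 metre mapping below is the same simplifying assumption as in the canvas sketch above.

    using System.Runtime.InteropServices;
    using System.Windows;

    static class MouseMover
    {
        [DllImport("user32.dll")]
        private static extern bool SetCursorPos(int x, int y);

        // Map a joint's skeleton-space X/Y (assumed -1..+1 metres) onto the
        // primary screen and move the Windows cursor there.
        public static void MoveTo(float jointX, float jointY)
        {
            int screenX = (int)((jointX + 1.0) / 2.0 * SystemParameters.PrimaryScreenWidth);
            int screenY = (int)((1.0 - (jointY + 1.0) / 2.0) * SystemParameters.PrimaryScreenHeight);
            SetCursorPos(screenX, screenY);
        }
    }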

Positional Gestures

The second type of gesture that we record is the Joint proximity trigger.  This is basically recording when two joints move within a certain distance of each other.  For instance, we can have a trigger that fires when we touch our head.

  • Head + Hand Right (within 0.25): Touch your Head with your Right Hand.
  • Shoulder Left + Hand Right (within 0.25): Touch your Left Shoulder with your Right Hand.
  • Hip Left + Hand Left (within 0.25): Put your Left Hand on your Hip.
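
A proximity trigger of this kind is just a distance check with a state change.  Something like this sketch (illustrative names; the distance is assumed to be in metres, which is what the SDK's skeleton space uses):

    using System;
    using Microsoft.Kinect;

    class ProximityTrigger
    {
        private readonly double distance;   // trigger distance in metres (skeleton space)
        private bool active;

        public event EventHandler Enabled;
        public event EventHandler Disabled;

        public ProximityTrigger(double distanceMetres) { distance = distanceMetres; }

        // Called every frame with the two joints being watched (e.g. head and right hand).
        public void Update(SkeletonPoint a, SkeletonPoint b)
        {
            double d = Math.Sqrt((a.X - b.X) * (a.X - b.X) +
                                 (a.Y - b.Y) * (a.Y - b.Y) +
                                 (a.Z - b.Z) * (a.Z - b.Z));

            bool nowActive = d < distance;
            if (nowActive == active) return;              // no change in state, no event

            active = nowActive;
            var handler = active ? Enabled : Disabled;
            if (handler != null) handler(this, EventArgs.Empty);
        }
    }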

Events are fired when one of these gestures is enabled, and again when it is disabled.  This allows us to control the mouse using the Simple Positional Movement described above, with a hand on the hip triggering mouse down/up: while the hand is on the hip the mouse button is held down, and when the hand is removed the mouse button is released.
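
For the mouse button itself, the Win32 mouse_event call is one way to do it; wiring it up to the hip trigger might look like this sketch (not necessarily how the plugin does it):

    using System;
    using System.Runtime.InteropServices;

    static class MouseButton
    {
        private const uint MOUSEEVENTF_LEFTDOWN = 0x0002;
        private const uint MOUSEEVENTF_LEFTUP   = 0x0004;

        [DllImport("user32.dll")]
        private static extern void mouse_event(uint flags, uint dx, uint dy, uint data, UIntPtr extraInfo);

        public static void LeftDown() { mouse_event(MOUSEEVENTF_LEFTDOWN, 0, 0, 0, UIntPtr.Zero); }
        public static void LeftUp()   { mouse_event(MOUSEEVENTF_LEFTUP,   0, 0, 0, UIntPtr.Zero); }
    }

    // Wiring it up to the hand-on-hip trigger from the table above:
    //   hipTrigger.Enabled  += (s, e) => MouseButton.LeftDown();
    //   hipTrigger.Disabled += (s, e) => MouseButton.LeftUp();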

Open/Closed Fist Tracking

I tried using the depth and player data stream, as well as the skeletal joint positions, to draw a view of one of the hands, and attempted to determine from this whether the player’s hand was open or closed.  The plan was to use this in conjunction with Simple Positional Movement to move the mouse around and open and close the fist to toggle the mouse button.

Unfortunately, due to the low resolution of the depth information, as well as it being extremely difficult to visualise the fingers at all angles, this wasn’t very successful.  I have left in the code that draws the hand movement for others to see.



I also had a couple of ideas for other gesture tracking methods:

Multiple Joint Gestures

Another method which I haven’t built, but which would be fairly simple, would be to extend the Movement Gestures to track multiple Joints.  This would allow us to record a clapping gesture for instance, or a salute that actually required the elbow to be in a certain position, or a short dance involving the whole body.

Posture Gestures

This method would be to record the relative positions of some or all joints at a point in time.  This would allow us to record the “Menu” position seen in various games: standing stock straight with the left arm raised 45 degrees to the side.
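
If I were to sketch that idea out (purely illustrative, since this was never built), it might look something like this:

    using System;
    using System.Collections.Generic;
    using Microsoft.Kinect;

    class PostureGesture
    {
        private readonly Dictionary<JointType, SkeletonPoint> snapshot;   // relative positions at recording time
        private readonly float tolerance;                                 // allowed error per axis, in metres

        public PostureGesture(Dictionary<JointType, SkeletonPoint> relativeJointPositions, float tolerance)
        {
            snapshot = relativeJointPositions;
            this.tolerance = tolerance;
        }

        // True when every recorded joint is close to its snapshot position.
        public bool Matches(Dictionary<JointType, SkeletonPoint> current)
        {
            foreach (var pair in snapshot)
            {
                SkeletonPoint live;
                if (!current.TryGetValue(pair.Key, out live)) return false;

                if (Math.Abs(live.X - pair.Value.X) > tolerance ||
                    Math.Abs(live.Y - pair.Value.Y) > tolerance ||
                    Math.Abs(live.Z - pair.Value.Z) > tolerance)
                    return false;
            }
            return true;
        }
    }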