Before we can get down to the nitty-gritty of our gesture recognition algorithm, we need to make sure that we can access the Kinect sensor and display a stream of depth frames in a simple GUI.
Accessing Microsoft Kinect from within OpenCV is not much different from accessing a computer's webcam or camera device. The easiest way to integrate a Kinect sensor with OpenCV is by using an OpenKinect module called freenect. For installation instructions, take a look at the preceding information box. The following code snippet grants access to the sensor using cv2.VideoCapture:
import cv2
import freenect

device = cv2.cv.CV_CAP_OPENNI
capture = cv2.VideoCapture(device)
On some platforms, the first call to cv2.VideoCapture fails to open a capture channel. In this case, we provide a workaround by opening the channel ourselves:
if not(capture.isOpened()):
    capture.open(device)
If you want to connect to your Asus Xtion, the device variable should be assigned the cv2.cv.CV_CAP_OPENNI_ASUS value instead.
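For example (assuming the same setup as before), switching to the Xtion only changes the device constant:

device = cv2.cv.CV_CAP_OPENNI_ASUS
capture = cv2.VideoCapture(device)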
In order to give our app a fair chance of running in real time, we will limit the frame size to 640 x 480 pixels:
capture.set(cv2.cv.CV_CAP_PROP_FRAME_WIDTH, 640)
capture.set(cv2.cv.CV_CAP_PROP_FRAME_HEIGHT, 480)
The read() method of cv2.VideoCapture is inappropriate when we need to synchronize a set of cameras or a multihead camera, such as a Kinect. In this case, we should use the grab() and retrieve() methods instead. An even easier approach when working with OpenKinect is to use the sync_get_depth() and sync_get_video() methods.
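As a rough sketch (not part of the chapter's code), synchronizing two hypothetical cv2.VideoCapture objects with grab() and retrieve() could look like this: all frames are grabbed first, as close together in time as possible, and only decoded afterwards:

import cv2

# two hypothetical capture devices, e.g. two webcams
capture_a = cv2.VideoCapture(0)
capture_b = cv2.VideoCapture(1)

if capture_a.grab() and capture_b.grab():
    # grab() only latches the frames; retrieve() decodes them afterwards
    success_a, frame_a = capture_a.retrieve()
    success_b, frame_b = capture_b.retrieve()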
For the purpose of this chapter, we will need only the Kinect's depth map, which is a single-channel (grayscale) image in which each pixel value is the estimated distance from the camera to a particular surface in the visual scene. The latest frame can be grabbed via this code:
depth, timestamp = freenect.sync_get_depth()
The preceding code returns both the depth map and a timestamp. We will ignore the latter for now. By default, the map is in 11-bit format, which cannot be displayed with cv2.imshow right away. Thus, it is a good idea to convert the image to 8-bit precision first.
In order to reduce the range of depth values in the frame, we will clip the maximal distance to a value of 1,023 (or 2**10-1). This will get rid of values that correspond either to noise or to distances that are far too large to be of interest to us:
np.clip(depth, 0, 2**10 - 1, depth)
depth >>= 2
Then, we will convert the image into 8-bit format and display it:
depth = depth.astype(np.uint8)
cv2.imshow("depth", depth)
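Putting these snippets together, a minimal stand-alone viewer loop could look like the following sketch. It is not part of chapter2.py, and it assumes that freenect is installed and a Kinect is connected, but it is a quick way to verify that the depth stream works:

import numpy as np
import cv2
import freenect

while True:
    depth, _ = freenect.sync_get_depth()  # 11-bit depth map; timestamp ignored
    np.clip(depth, 0, 2**10 - 1, depth)   # discard noise and far-away surfaces
    depth >>= 2                           # scale 10-bit values down to 8 bits
    cv2.imshow("depth", depth.astype(np.uint8))
    if cv2.waitKey(10) == 27:             # quit on Esc
        break
cv2.destroyAllWindows()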
In order to run our app, we need a main routine that accesses the Kinect, generates the GUI, and executes the main loop of the app. This is done by the main function of chapter2.py:
import numpy as np
import wx
import cv2
import freenect

from gui import BaseLayout
from gestures import HandGestureRecognition


def main():
    device = cv2.cv.CV_CAP_OPENNI
    capture = cv2.VideoCapture()
    if not(capture.isOpened()):
        capture.open(device)

    capture.set(cv2.cv.CV_CAP_PROP_FRAME_WIDTH, 640)
    capture.set(cv2.cv.CV_CAP_PROP_FRAME_HEIGHT, 480)
As in the last chapter, we will design a suitable layout (KinectLayout) for the current project:
    # start graphical user interface
    app = wx.App()
    layout = KinectLayout(None, -1, 'Kinect Hand Gesture Recognition', capture)
    layout.Show(True)
    app.MainLoop()
The layout chosen for the current project (KinectLayout) is as plain as it gets. It should simply display the live stream of the Kinect depth sensor at a comfortable frame rate of 10 frames per second. Therefore, there is no need to further customize BaseLayout:
class KinectLayout(BaseLayout):
    def _create_custom_layout(self):
        pass
The only parameter that needs to be initialized this time is the recognition class. This will be useful in just a moment:
    def _init_custom_layout(self):
        self.hand_gestures = HandGestureRecognition()
Instead of reading a regular camera frame, we need to acquire a depth frame via the freenect method sync_get_depth(). This can be achieved by overriding the following method:
    def _acquire_frame(self):
As mentioned earlier, by default this function returns a single-channel depth image with 11-bit precision and a timestamp. However, we are not interested in the timestamp, and we simply pass on the frame if the acquisition is successful:
        frame, _ = freenect.sync_get_depth()
        # return success only if a valid frame was acquired
        if frame is not None:
            return (True, frame)
        else:
            return (False, frame)
The rest of the visualization pipeline is handled by the BaseLayout class. We only need to make sure that we provide a _process_frame method. This method accepts a depth image with 11-bit precision, processes it, and returns an annotated 8-bit RGB color image. Conversion to a regular grayscale image is the same as mentioned in the previous subsection:
    def _process_frame(self, frame):
        # clip max depth to 1023, convert to 8-bit grayscale
        np.clip(frame, 0, 2**10 - 1, frame)
        frame >>= 2
        frame = frame.astype(np.uint8)
The resulting grayscale image can then be passed to the hand gesture recognizer, which will return the estimated number of extended fingers (num_fingers) and the annotated RGB color image mentioned earlier (img_draw):
        num_fingers, img_draw = self.hand_gestures.recognize(frame)
In order to simplify the segmentation task of the HandGestureRecognition class, we will instruct the user to place their hand in the center of the screen. To provide a visual aid for this, let's draw a rectangle around the image center and highlight the center pixel of the image in orange:
        height, width = frame.shape[:2]
        cv2.circle(img_draw, (width/2, height/2), 3, [255, 102, 0], 2)
        cv2.rectangle(img_draw, (width/3, height/3), (width*2/3, height*2/3),
                      [255, 102, 0], 2)
In addition, we will print num_fingers on the screen:
        cv2.putText(img_draw, str(num_fingers), (30, 30),
                    cv2.FONT_HERSHEY_SIMPLEX, 1, (255, 255, 255))
        return img_draw
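Presumably, chapter2.py then ends with the usual entry-point guard so that main() runs when the script is executed directly (an assumption about how the file is organized rather than code shown above):

if __name__ == '__main__':
    main()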