Architecture of the application

Before jumping into the code, let's see what the architecture of the application will look like:

First, we read the video frames at a certain rate, say 30 frames per second. Then we feed each frame to the YOLO model, which gives us bounding-box predictions for each object. Each predicted box's center is relative to the grid cell that contains it, and its width and height are relative to the frame; knowing the grid size and the frame size, we can therefore draw the bounding boxes precisely onto the frame, as shown in the preceding diagram, for each of the objects. Finally, we simply show the modified frame with the bounding boxes to the user.
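The conversion from grid-relative predictions to pixel coordinates can be sketched as follows. This is a minimal illustration, not the book's actual code: the function name and parameters are hypothetical, and it assumes the common YOLO convention that the box center is an offset within its grid cell while the width and height are fractions of the whole frame.

```python
def to_pixel_box(cx_rel, cy_rel, w_rel, h_rel, row, col, grid_size, frame_w, frame_h):
    """Convert one YOLO prediction to pixel corner coordinates (x1, y1, x2, y2).

    cx_rel, cy_rel: box center, relative to the (row, col) grid cell that owns it
    w_rel, h_rel:   box size, relative to the full frame
    """
    cell_w = frame_w / grid_size
    cell_h = frame_h / grid_size
    # Shift the cell-relative center by the cell's position to get frame pixels.
    center_x = (col + cx_rel) * cell_w
    center_y = (row + cy_rel) * cell_h
    box_w = w_rel * frame_w
    box_h = h_rel * frame_h
    # Corners are half a box away from the center in each direction.
    x1 = int(center_x - box_w / 2)
    y1 = int(center_y - box_h / 2)
    x2 = int(center_x + box_w / 2)
    y2 = int(center_y + box_h / 2)
    return x1, y1, x2, y2
```

With a 13×13 grid on a 416×416 frame, a box centered in cell (6, 6) with `cx_rel = cy_rel = 0.5` lands exactly in the middle of the frame, at pixel (208, 208).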

Now, if we have enough processing power, this architecture works just fine, but on a CPU it doesn't scale well. Even though YOLO is optimized and really fast, it doesn't run quickly on a low-power CPU: a single prediction takes anywhere from 300 milliseconds to 1.5 seconds, depending on the frame resolution and the grid size we choose. At the slow end, the user sees only one frame every 1.5 seconds, which doesn't look good—we'll have a slow-motion video. This is the case only on a CPU; on a GPU, YOLO does a great job and provides real-time predictions, so this architecture works just fine.
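The arithmetic behind this bottleneck is worth making explicit. In the sequential pipeline above, each displayed frame must wait for one full inference, so the display rate is simply the reciprocal of the inference time (the helper name here is illustrative, not from the book):

```python
def effective_fps(inference_seconds):
    # In a strictly sequential read -> predict -> draw -> show loop,
    # one inference gates one displayed frame.
    return 1.0 / inference_seconds

source_fps = 30
for t in (0.3, 1.5):
    shown = effective_fps(t)
    print(f"{t:.1f} s per inference -> {shown:.2f} FPS shown "
          f"(source produces {source_fps} FPS)")
```

At 300 ms per inference we show about 3.3 frames per second; at 1.5 seconds we show only about 0.67, while the camera keeps producing 30, which is exactly the slow-motion effect described above.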
