The single gesture supported by the HoloLens out of the box provides capabilities roughly equivalent to a mouse with a single button. Consider that traditional computers are normally used with mice that have 2–4 buttons and a wheel, not to mention a keyboard, and you can see why a single-button mouse equivalent falls a bit short by today’s standards.
But the issue isn’t just that the device is limited to a single gesture: while the gesture itself is very simple in theory, it has proven quite difficult for first-time users to perform in practice. I experienced the lack of gestures first-hand (pun not intended) while playing around with the HoloLens in FutuLabs, our internal arm dedicated to the exploration of emerging technologies, and saw several ideas die at the concept stage because of this limitation.
The Leap Motion Controller in action
At the time we were also testing the Leap Motion Controller (LMC), a specialised USB peripheral that provides very accurate hand tracking data in real time. So I wondered whether it would be possible to combine the LMC with the HoloLens to enable the creation of custom gestures. All of this happened to play out while I was looking for a topic for my Master’s thesis, and this seemed like the perfect thesis project. After discussing it with my professor and my supervisor, I set out to build the system.
Several months of work later I managed to complete an initial version. The result is a system where hand tracking data is continuously streamed to the HoloLens, allowing for the implementation of custom gestures. And just recently we decided to open source the whole project, hopefully opening the door for other developers and researchers to explore a wider variety of interactions in mixed reality. This post provides an insight into the workings of the developed system, the final result, and thoughts about the future.
The data must flow
The above picture shows the setup I used during development. The LMC is mounted on top of the HoloLens with a small angled block in-between, so that the LMC has a better view of the area where hands are generally used.
As can be seen, the LMC is not directly connected to the HoloLens, but rather connected to a laptop. This is because the LMC has to be physically connected with a USB cable in order to be powered and ensure a good enough transfer speed. A separate computer was needed as the HoloLens only has a single Micro-USB port that cannot be used with peripherals. The only thing I needed to do was to set up a connection between the computer and the HoloLens, to stream the sensor data generated by the LMC.
In the end I decided to set up two parallel connections, one for streaming data and one for control messages. The former uses UDP to keep the latency to a minimum. The risk associated with the use of UDP is that there is no guarantee that the packets will arrive in the order they are sent, or even arrive at all.
Fortunately, the LMC runs at a high frequency (roughly 100 fps) and each message has a unique, incrementing ID, making it relatively easy to account for these risks: a lost or late frame can simply be skipped, because a newer one follows about 10 ms later. For the control messages I used TCP, since speed isn’t critical but successful delivery of the messages is.
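As a rough illustration of how the incrementing ID helps, a receiver can remember the newest ID it has accepted and discard anything older or duplicated. The sketch below assumes a hypothetical packet layout (a 4-byte big-endian frame ID followed by the payload); the wire format actually used in the project may well differ, and all names here are made up for the example.

```python
import struct

# Hypothetical header: a 4-byte unsigned, big-endian frame ID.
HEADER = struct.Struct("!I")


def make_packet(frame_id: int, payload: bytes) -> bytes:
    """Prefix the payload with its frame ID (illustrative format only)."""
    return HEADER.pack(frame_id) + payload


class LatestFrameFilter:
    """Accepts only frames that are newer than the last accepted one."""

    def __init__(self):
        self.last_id = -1
        self.dropped = 0

    def accept(self, packet: bytes):
        """Return the payload if the frame is new, otherwise None."""
        if len(packet) < HEADER.size:
            self.dropped += 1  # too short to contain a header
            return None
        (frame_id,) = HEADER.unpack_from(packet)
        if frame_id <= self.last_id:
            self.dropped += 1  # duplicate or out-of-order frame
            return None
        self.last_id = frame_id
        return packet[HEADER.size:]
```

Each datagram read from the UDP socket would be passed through `accept`, and only non-`None` payloads handed on to the gesture code. The sketch ignores ID wraparound and makes no attempt to reorder frames, on the assumption that a stale frame is not worth waiting for.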
A change of perspective
Apart from the most basic gestures where the hand’s position and orientation aren’t taken into account, it is not enough to directly use the data provided by the LMC. All the positions and rotations given are from the LMC’s point of view, but we need to know what they are from the point of view of the HoloLens. I wanted to create a method for calibrating the two devices (i.e. determining their position and rotation in relation to each other) without needing any additional or special equipment (e.g. custom markers or a printed design), so that it would be easy for anyone to start using the system. This also removes the risk of the equipment failing or being left behind at the most inopportune moment.
The Perspective-n-Point problem. Source: opencv.org
The starting point for determining the relationship between the two devices was to find some common features that both of them are able to recognise. Because of the specialised nature of the LMC, it made sense to choose some part of the hands. Going through the data provided by the LMC, one feature in particular stood out: the fingertips.
On the side of the HoloLens, the only sensor directly accessible to developers is the camera mounted on the front of the device. Fortunately, plenty of research exists on how to find fingertips in images.
This combination of data – 3D points from the LMC and 2D points in an image from the HoloLens – lines up perfectly with a well known problem in computer vision and augmented reality known as the Perspective-n-Point (PnP) problem. In short, if you take a picture of some 3D points and are able to determine which 2D point in the image corresponds to which 3D point, then you are able to determine the position and rotation of the camera.
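For readers who prefer notation, the usual pinhole-camera statement of PnP is given below. The symbols are the standard textbook ones rather than anything taken from the project: K is the camera's intrinsic matrix (assumed known), X_i are the 3D points, u_i their 2D image positions, and R and t the unknown rotation and translation.

```latex
% Projection of a 3D point X_i = (X_i, Y_i, Z_i) to a pixel (u_i, v_i),
% up to a per-point scale factor s_i:
s_i \begin{bmatrix} u_i \\ v_i \\ 1 \end{bmatrix}
  = K \, \bigl[\, R \mid t \,\bigr]
    \begin{bmatrix} X_i \\ Y_i \\ Z_i \\ 1 \end{bmatrix},
\qquad i = 1, \dots, n

% PnP: given the n correspondences and K, find the pose that
% minimises the reprojection error, where \pi denotes the
% projection defined above:
(R^{*}, t^{*}) = \arg\min_{R,\, t} \sum_{i=1}^{n}
  \bigl\lVert (u_i, v_i) - \pi\bigl(K, R, t, X_i\bigr) \bigr\rVert^{2}
```

In this setting the X_i would be the fingertip positions reported by the LMC and the (u_i, v_i) the fingertip pixels found in the HoloLens camera image, so the recovered R and t map LMC coordinates into the camera's frame.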
Example of image used for calibration
The calibration process I ended up with goes as follows. You begin by taking one or more images of your outstretched hands as shown in the above image. The more images you take, the better the calibration result should be, but the more tedious the process becomes. The only requirement here is that the hands need to be held in the shown position. This allows the system to later determine which fingertip in the image belongs to which finger and pair up that point with the correct fingertip from the LMC. At the same moment as the image is taken, the 3D data from the LMC is also saved for later use.
Left: A hand detected from an image. Right: The fingertips and center of mass detected
The images are all sent to the computer which the LMC is connected to. In principle, processing could also be done on the HoloLens, but since I already had to use a separate computer for the LMC, I decided I might as well make use of it.
Once all the images have been sent, I have to find the hands in the images before I can find the fingertips. This proved to be the technically most difficult task, and it required me to develop my own system for detecting skin based on colour, combining research from several different papers. Once the hands have been found, fingertips are then located using the process in the work of Prasertsakul and Kondo (2014).
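The detector used in the project is not reproduced here. To give a feel for what colour-based skin detection involves, the sketch below uses a widely cited fixed RGB threshold rule from the skin-detection literature, swapped in purely for illustration. The function names and the list-of-tuples image format are assumptions made for this example.

```python
def is_skin_rgb(r: int, g: int, b: int) -> bool:
    """Classify one 8-bit RGB pixel with a fixed threshold rule.

    This is a generic daylight rule, not the detector built for the
    project described in this post.
    """
    return (
        r > 95 and g > 40 and b > 20
        and max(r, g, b) - min(r, g, b) > 15  # reject near-grey pixels
        and abs(r - g) > 15                   # red clearly separated from green
        and r > g and r > b                   # red is the dominant channel
    )


def skin_mask(image):
    """Return a boolean mask for an image given as rows of (r, g, b) tuples."""
    return [[is_skin_rgb(*pixel) for pixel in row] for row in image]
```

A light skin tone such as (220, 170, 140) passes the rule, but so does a wood-like brown such as (180, 130, 90), which hints at how easily a background can be mistaken for a hand when colour is the only cue.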
With all the pieces together, all that remains is to actually solve PnP. Fortunately, this is such a well known problem with so many ways of solving it that OpenCV, an open source computer vision library, contains a function for it. The result of running the function is the transformation needed for moving the LMC data to the point of view of the HoloLens. The transformation is then sent to the HoloLens where it is used to transform all data received from the LMC.
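OpenCV's `cv2.solvePnP` returns the pose as a rotation vector (axis-angle form) and a translation vector. How the project packages the transformation before sending it to the HoloLens is not stated, so the following is only a sketch of the general idea: convert the rotation vector to a matrix with Rodrigues' formula and apply it to each LMC point. The function names are hypothetical, and any conversion between the coordinate conventions of the two devices is ignored.

```python
import math


def rodrigues(rvec):
    """Convert an axis-angle rotation vector into a 3x3 rotation matrix."""
    theta = math.sqrt(sum(x * x for x in rvec))
    if theta < 1e-12:
        # No rotation: return the identity matrix.
        return [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    kx, ky, kz = (x / theta for x in rvec)  # unit rotation axis
    c, s = math.cos(theta), math.sin(theta)
    v = 1.0 - c
    return [
        [c + kx * kx * v, kx * ky * v - kz * s, kx * kz * v + ky * s],
        [ky * kx * v + kz * s, c + ky * ky * v, ky * kz * v - kx * s],
        [kz * kx * v - ky * s, kz * ky * v + kx * s, c + kz * kz * v],
    ]


def lmc_to_hololens(point, rotation, tvec):
    """Apply p' = R * p + t to a single 3D point from the LMC."""
    return tuple(
        sum(rotation[i][j] * point[j] for j in range(3)) + tvec[i]
        for i in range(3)
    )
```

Since the rotation matrix only changes when the devices are recalibrated, it would be computed once and then reused for every incoming frame.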
What was accomplished and looking forward
Results of calibration, captured through Microsoft’s Mixed Reality Capture
The above image shows the results of running the calibration, with the red spheres showing the positions of the fingertips after applying the calculated transformation to the LMC data. As can be seen, the results are decent but not perfect. They are also very consistent.
It is difficult to pinpoint what the exact source of error is, though a couple of possibilities are that the physical and virtual (the one used for determining where to draw objects) cameras of the HoloLens are not perfectly aligned, or that the points identified as fingertips are not the exact same points the LMC considers to be the fingertips. But the accuracy and consistency together should be enough to enable the development of custom gestures.
There are two things I would primarily like to improve, the first one being the way the LMC is mounted. A better mount that attaches to the HoloLens would make it possible to place the LMC in almost exactly the same position each time, which in turn would make it possible to reuse calibration results even if the LMC is removed and reattached between uses.
The second development area is hand detection. Colour-based detection is too unreliable: there are simply too many skin-like colours in most backgrounds to reliably extract only the hands. Looking at the current research, a neural-network-based approach seems to be the most promising general-purpose solution.
By open sourcing this project I really hope it enables others to look deeper into how we could interact in an environment where the real and virtual are truly mixed together. When talking about devices such as the HoloLens, the first deficiency most people want to point out is the technological one.
It is certainly true that the hardware still has a way to go, e.g. in terms of improving the field of view and processing power, but I feel most people overestimate our current understanding of how to design for this new environment. It is in no way obvious that the lessons learned in a traditional 2D context will translate to a mixed reality one. Conversely, you cannot rely on taking cues from physical design either, since the objects you interact with are still only virtual.
It is not even known whether gestures are the best option in the first place. When developing, you will notice quite quickly how tired your arms can get when you have a lot of interactions. With so many open questions, I hope this system can contribute one piece of the puzzle, so we can answer these questions in time.