Journal of Computer Science and Cybernetics, V.32, N.3 (2016), 243–258
DOI 10.15625/1813-9663/32/3/7689
REAL-TIME TABLE PLANE DETECTION USING
ACCELEROMETER INFORMATION AND ORGANIZED POINT
CLOUD DATA FROM KINECT SENSOR
VAN-HUNG LE1,2, MICHIEL VLAMINCK4, HAI VU1, THI-THUY NGUYEN3, THI-LAN LE1,
THANH-HAI TRAN1, QUANG-HIEP LUONG4, PETER VEELAERT4, WILFRIED PHILIPS4
1International Research Institute MICA, HUST - CNRS/UMI-2954 - GRENOBLE INP,
Vietnam
2Tan Trao University, Vietnam; lehung231187@gmail.com
3Faculty of Information Technology, Vietnam National University of Agriculture, Vietnam
4Ghent University/iMinds - Image Processing and Interpretation, Belgium
Abstract. Table plane detection in the scene is a prerequisite step in developing object-finding-
aided systems for visually impaired people. In order to determine the table plane in the scene, we
have to detect planes in the scene first and then define the table from these detected planes based
on their specific characteristics. Although a number of approaches have been proposed for plane segmentation, a proper table plane detection method is still lacking. In this paper, the authors propose a table plane detection method using information coming from a Microsoft Kinect sensor. The contribution of the paper is three-fold. First, for the plane detection step, dedicated down-sampling algorithms are applied to the original point cloud, representing it as an organized point cloud structure, in order to obtain real-time computation. Second, the acceleration information provided by the Kinect sensor is
employed to detect the table plane among all detected planes. Finally, three different measures for
the evaluation of the table plane detector are defined. The proposed method has been evaluated using
a dataset of 10 scenes, covering common contexts in the daily activities of visually impaired people, and a published RGB-D dataset. The proposed method outperforms a state-of-the-art method based on PROSAC and obtains comparable results to a method based on organized point clouds while running at a six times higher frame rate.
Keywords. Table plane detection, acceleration vector, organized point cloud, plane segmentation.
1. INTRODUCTION
Plane detection in 3-D point clouds is a critical task for many robotics and computer vision
applications. In order to help visually impaired/blind people find and grasp interesting objects (e.g.,
coffee cup, bottle, bowl) on the table, one has to find the table planes in the captured scenes. From
the extracted table plane, relevant features such as its normal vector and the center point of the table in the current scene can then be calculated. These features will help to determine the object positions in the current scene. As a prerequisite step, the table plane extraction should be robust, highly accurate, and computationally cheap. However, 3-D point clouds
obtained by low-cost sensors (e.g. from the Microsoft Kinect sensor [13], other depth cameras) are
generally noisy and redundant with a huge number of points. Therefore, in common approaches,
the plane extraction either produces false positive results or requires huge computational costs. By
exploiting associated data provided by the sensors, the table plane features can be extracted from the 3-D point cloud in a robust way. This paper is motivated by such an adaptation, in which accelerometer data provided by the Kinect sensor is used to prune the extraction results. The proposed algorithm achieves both real-time performance and a high detection rate of table planes. In the experimental evaluations, the proposed method is examined in different contexts to confirm its robustness.
The paper is organized as follows. Section 2 presents related works on plane segmentation.
Section 3 describes the proposed method with two main topics: plane segmentation and table plane
extraction. Section 4 shows experimental results. Section 5 concludes the paper and gives some ideas for future work.
2. RELATED WORK
Plane extraction/segmentation in a complex scene is usually solved by two main approaches.
The first approach uses robust estimation algorithms such as RANSAC [8] and its variants [3], Least Squares [1], or the Hough Transform [2] for estimating the planes in the scene. For example, Yang et
al. [17] proposed to combine the RANSAC algorithm and ‘minimum description length’ for plane estimation from point cloud data with complex structures. The point cloud is divided into blocks, and the RANSAC algorithm is performed on each block, which is assumed to contain from zero to three planes. This combination yields an algorithm that avoids detecting wrong planes due to the complex geometry of the 3-D data. Usually, the point cloud of a scene has millions of points, so running an estimation algorithm on this data requires a high computational time. On top of that, the RANSAC-based approaches strongly depend on a heuristically chosen threshold to eliminate outliers.
The second approach is based on the local surface normal in the scene [5, 6, 7]. Deschaud et
al. [5] proposed an approach for plane detection on unorganized point cloud data. In that paper,
the authors implemented the following three steps. The first step estimates the normal vector at
each point. The second step computes the score of local planarity at each point; after that, the best
seed point that represents a good seed plane is selected. The third step is growing this seed plane
by adding all points close to the plane. The works in [6, 7] use the normal vectors of points in organized point cloud data. Organized point cloud data is structured so that points are arranged in a grid (similar to image pixels in a matrix structure). Holz et al. [6] proposed an approach for organized point cloud segmentation. This approach performs segmentation in two steps. First, the segmentation step is performed on the normal of each point. After that, points having similar local surface normal orientations are examined and segmented in distance space to generate regions. This approach is able to process video with frame rates of up to 30 Hz. Feng et al. [7] proposed an approach using hierarchical clustering on cloud data based on the normal vectors. The clustering is performed similarly to the approach in [6], but the authors represent the data in a graph whose nodes and edges represent groups of points and their neighborhood relations respectively, and the clustering is performed on this graph. This approach can process video with a frame rate of more than 35 Hz for 640 × 480 images. Although all these approaches implement plane segmentation in the scene, they do not address table plane detection in particular. We have previously proposed an approach for table plane detection [10] in complex scenes.
We used the PROSAC [4] algorithm for plane segmentation and some geometrical constraints for table plane extraction in the complex scene. However, this approach requires prior knowledge about the scene, such as constraints between the wall and the table, the wall and the floor, and the size of the table.
Figure 1. Object-finding aid for visually impaired people
Figure 2. The proposed framework for table plane detection
3. TABLE PLANE DETECTION
3.1. Overview
Our research context aims to develop object-finding and grasping-aided services for visually impaired people (see Fig. 1). To this end, the queried objects on the table need to be located. In order to develop such a service, this paper first deals with detecting the planes in the current scene and then determining the table plane from the detected planes. Therefore, we approach this objective as a problem of table plane detection and extraction from a real scene.
In our work, a Microsoft Kinect, a low-cost depth and RGB sensor, is utilized. This sensor has become increasingly popular in computer vision applications. An object captured by the Kinect sensor is represented in 3-D space by the coordinates (x, y, z) [13]. This data is generated from both color and depth images in order to form the point cloud data. The proposed framework, as shown in Fig. 2, consists of four steps: down-sampling, organized point cloud representation, plane segmentation, and table plane classification. To achieve low computational costs, the data is reduced in the first step, targeting a lower sampling rate. However, the sampling rate cannot be arbitrarily low because it can significantly affect the subsequent steps and lower the overall detection quality. For plane segmentation, RANSAC or one of its variants can be used, as discussed in the related work section. However, this step requires highly accurate and real-time plane extraction. Therefore, we propose to represent the information captured from the Microsoft Kinect as an organized point cloud and to perform plane segmentation using normal vector information. To select the table plane among the extracted planes, the acceleration data from the Kinect sensor is used in the last step. The main constraint is that a table should stand on the floor; therefore, the table plane should be parallel with the floor plane. The accelerometer of the Kinect sensor provides the normal vector of the ground (floor) plane and of the other planes that are parallel with the table plane. The planes which
Figure 3. (a) Computing the depth value of the center pixel based on its neighborhoods
(within a 3× 3 pixels window); (b) down sampling of the depth image
do not meet this criterion are eliminated. A table plane is identified among the remaining planes if it is high enough. The proposed method has to run in real time while achieving state-of-the-art detection rates. In the following sections, the four steps of the method are described in detail.
3.2. Point cloud representation
In the scenario of developing an object-finding system that helps visually impaired people retrieve an object based on their query, separating the table plane from the current observation is a pre-processing step that makes the objects of interest lying on the table easier to detect and localize. This pre-processing procedure should therefore have a high detection rate and low computational costs. However, the collected point cloud data of a certain scene consists of many 3-D data points (each point has the coordinates (x, y, z)). This type of data always requires a high computational time and contains much noise. To deal with these issues, we adopt down-sampling and smoothing techniques. Many down-sampling techniques, such as [16, 12, 15, 9], could be applied. Because our work utilizes only the depth feature, a simple and effective method for down-sampling and smoothing the depth data is described below.
Given a sliding window (of size n × n pixels), the depth value of the center pixel D(xc, yc) is computed from Eq. 1:

D(xc, yc) = (1/N) Σ_{i=1}^{N} D(xi, yi), (1)

where D(xi, yi) is the depth value of the i-th neighboring pixel of the center pixel (xc, yc), and N is the number of pixels in the n × n neighborhood (N = n × n − 1). An illustration of the down-sampling procedure
is given in Fig. 3a. As shown, if the input depth image has the size of 640×480 pixels and the sliding
window has the size of 3× 3 pixels, then the output depth image is reduced to 320× 240 pixels. If
the size of the sliding window is increased, then the size of the output depth image is reduced.
Note that this technique does not change the coordinates of pixels or the coordinates of objects in 3-D space; it only reduces the number of points representing an object. By averaging the depth values in each window, only the center pixels (xc, yc) of the sliding windows are retained, and the other pixels (as shown in Fig. 3b) are removed. Therefore, this step also smooths the collected data.
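To make this step concrete, a minimal C++ sketch of the window averaging in Eq. 1 is given below. This is our illustration rather than the paper's implementation: the function name downsampleDepth and the row-major uint16_t depth buffer are assumptions, and the stride s is kept separate from the window size n, since the reported 640 × 480 to 320 × 240 reduction with a 3 × 3 window suggests overlapping windows with a stride of 2.

#include <cstdint>
#include <vector>

// Sketch of the down-sampling of Eq. 1: each output pixel is the average of
// the N = n*n - 1 neighbors of a window center (n is assumed odd).
std::vector<uint16_t> downsampleDepth(const std::vector<uint16_t>& depth,
                                      int width, int height, int n, int s) {
    const int outW = (width - n) / s + 1, outH = (height - n) / s + 1;
    std::vector<uint16_t> out(static_cast<size_t>(outW) * outH);
    for (int r = 0; r < outH; ++r)
        for (int c = 0; c < outW; ++c) {
            long sum = 0;
            for (int dy = 0; dy < n; ++dy)
                for (int dx = 0; dx < n; ++dx)
                    if (dy != n / 2 || dx != n / 2)       // skip the center pixel
                        sum += depth[(r * s + dy) * width + (c * s + dx)];
            out[static_cast<size_t>(r) * outW + c] =
                static_cast<uint16_t>(sum / (n * n - 1)); // Eq. 1
        }
    return out;
}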
After down-sampling, the image data is converted into organized point cloud data. Each data
point has a 3-D coordinate (x, y, z) and color values (r, g, b). Using the RGB camera intrinsic
parameters [13], each pixel (xp, yp) in the RGB image has a color value C(rp, gp, bp) and a depth
value D(xp, yp) in the corresponding depth image, which is projected into the metric 3-D space using
the following Eq. 2:

x = z(xp − cx)/fx; y = z(yp − cy)/fy; z = D(xp, yp); (r, g, b) = C(rp, gp, bp), (2)
with (fx, fy) and (cx, cy) being the focal length and principal point, respectively.
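As an illustration of Eq. 2, the following sketch back-projects a single pixel; the struct and function names are ours, and the intrinsics must come from the camera calibration (for a Kinect v1 at 640 × 480, values around fx ≈ fy ≈ 525 and (cx, cy) ≈ (319.5, 239.5) are commonly used approximations).

#include <cstdint>

struct Point3D { float x, y, z; uint8_t r, g, b; };

// Back-project one pixel into metric 3-D space following Eq. 2.
// depthMm is the raw depth value D(xp, yp) in millimeters.
Point3D backProject(int xp, int yp, uint16_t depthMm,
                    float fx, float fy, float cx, float cy,
                    uint8_t r, uint8_t g, uint8_t b) {
    Point3D p;
    p.z = depthMm * 0.001f;        // z = D(xp, yp), converted to meters
    p.x = p.z * (xp - cx) / fx;    // x = z (xp - cx) / fx
    p.y = p.z * (yp - cy) / fy;    // y = z (yp - cy) / fy
    p.r = r; p.g = g; p.b = b;     // (r, g, b) = C(rp, gp, bp)
    return p;
}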
The organized point cloud data follows the structure of a matrix, as in an image. Each point has a 2-D index (i, j), in which i and j are the row and column indices of the matrix respectively. They are limited by the size of the collected image. For example, an image obtained from the Microsoft Kinect sensor has 640 × 480 pixels, so i = 1, ..., row; j = 1, ..., col with [row, col] = [480, 640].
Matrix P represents the organized point cloud data of a scene as below:
P =
p1,1 p1,2 p1,3 p1,4 . . . p1,col
p2,1 p2,2 p2,3 p2,4 . . . p2,col
p3,1 p3,2 p3,3 p3,4 . . . p3,col
p4,1 p4,2 p4,3 p4,4 . . . p4,col
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
prow,1 prow,2 prow,3 prow,4 . . . prow,col
; pi,j = (xi,j , yi,j , zi,j) (3)
where (xi,j , yi,j , zi,j) are values of 3-D coordinate as defined in Eq. 2.
Note that the number of points in the cloud is reduced after applying the down-sampling technique. The indices of the down-sampled data, via the center pixels of the sliding windows, are preserved. In this way, it is easier to find the corresponding pixel in the original color image.
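For readers using PCL (the library on which the implementations in this paper are built, see Section 4.3), an organized cloud is simply one whose width and height match the image grid. The following sketch, with our helper name makeOrganizedCloud, shows the correspondence between a cloud index (i, j) of Eq. 3 and an image pixel.

#include <pcl/point_cloud.h>
#include <pcl/point_types.h>

// Sketch: an organized PCL cloud preserves the image's row/column layout.
pcl::PointCloud<pcl::PointXYZRGB>::Ptr makeOrganizedCloud(int rows, int cols) {
    pcl::PointCloud<pcl::PointXYZRGB>::Ptr cloud(
        new pcl::PointCloud<pcl::PointXYZRGB>);
    cloud->width = cols;       // height > 1 marks the cloud as organized
    cloud->height = rows;
    cloud->is_dense = false;   // depth holes yield invalid (NaN) points
    cloud->points.resize(static_cast<size_t>(rows) * cols);
    return cloud;
}
// A point p_{i,j} of matrix P in Eq. 3 is then cloud->at(j, i),
// since PCL's at() takes the column index first.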
3.3. Plane segmentation
The third step in the proposed framework (as shown in Fig. 2) is plane estimation based on normal vectors extracted from organized point clouds. The plane extraction procedure consists of the following steps: estimating surface normals, and segmenting and merging co-planar points to generate the planes. The detailed process of the plane segmentation is given in Algorithm 1. First, for estimating the surface normal vectors, the approach of Holz et al. [6] is used. Illustrations of this technique are shown in Fig. 4. A plane at point pi is estimated based on pi itself and two of its neighbors, as shown in Fig. 4(b). To estimate the normal vector at a point pi, the k-nearest neighbors of pi are determined within a radius r. The curvature value σ (Eq. 5) is estimated by analyzing the eigenvectors of the covariance matrix C given in Eq. 4. Points within the organized point cloud P having similar curvature values σ and normal vectors are grouped and clustered to select co-planar points. These co-planar points are the starting points of the region growing algorithm for merging and generating planes.
C = (1/k) Σ_{i=1}^{k} (pi − pav)(pi − pav)^T, C vj = λj vj, j ∈ {0, 1, 2}; (4)

σ = λ0 / (λ0 + λ1 + λ2). (5)
Figure 4. Illustration of estimating the normal vector of a set of points in 3-D space: (a) a set of points; (b) estimation of the normal vector of a black point; (c) selection of two points for estimating a plane; (d) the normal vector of a black point
Figure 5. Example of plane segmentation (a) color image of the scene; (b) plane segmentation
result with PROSAC [10]; (c) plane segmentation result with the organized point cloud
where pav represents the 3-D centroid of the nearest neighbors; λj is the j-th eigenvalue of the covariance matrix; and vj is the j-th eigenvector.
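A compact sketch of Eqs. 4 and 5 using the Eigen library (on which PCL itself relies) is given below; the function name and the assumption that the k nearest neighbors have already been gathered are ours.

#include <vector>
#include <Eigen/Dense>

// Surface normal and curvature sigma of a point from its k nearest
// neighbors, following Eqs. 4 and 5.
void normalAndCurvature(const std::vector<Eigen::Vector3f>& neighbors,
                        Eigen::Vector3f& normal, float& sigma) {
    Eigen::Vector3f pav = Eigen::Vector3f::Zero();   // 3-D centroid p_av
    for (const auto& p : neighbors) pav += p;
    pav /= static_cast<float>(neighbors.size());

    Eigen::Matrix3f C = Eigen::Matrix3f::Zero();     // covariance of Eq. 4
    for (const auto& p : neighbors) {
        const Eigen::Vector3f d = p - pav;
        C += d * d.transpose();
    }
    C /= static_cast<float>(neighbors.size());

    // Eigenvalues are returned in increasing order, so lambda_0 is the
    // smallest and its eigenvector is the surface normal.
    Eigen::SelfAdjointEigenSolver<Eigen::Matrix3f> es(C);
    normal = es.eigenvectors().col(0);
    sigma = es.eigenvalues()(0) / es.eigenvalues().sum();  // Eq. 5
}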
Based on the detected co-planar points, we follow a seed-based region growing scheme. A three-dimensional voxel grid is established and the local surface normals are mapped to the corresponding grid cells. Each cell (n^x_{r,c}, n^y_{r,c}, n^z_{r,c})^T of the 3-D voxel grid (r, c being the indices of the cell) [13] aggregates some points along with their normals. This procedure discretizes the normals over the cells of the voxel grid. To handle this discretization, the average surface normal orientations of two neighboring grid cells are compared; if their difference is less than the clustering threshold, the cells are added to the existing group. Consequently, the co-planar regions are merged together.
Thanks to these pre-processing procedures, utilizing organized point clouds allows for an accelerated computational time. Moreover, the original technique in [6], which calculates an ‘integral image’, is applied to further reduce the computational time. In the experimental evaluations, the table plane detection results are compared in the context of two plane segmentation approaches: one inspired by conventional RANSAC algorithms and the other being the proposed technique based on organized point clouds, in order to confirm the robustness of the method.
3.4. Table plane detection
Besides color and depth features, a Microsoft Kinect sensor provides acceleration information [13]. This is a vector whose direction points downwards. For each collected frame, there is an acceleration vector, as shown in Fig. 6. Complex scenes usually contain different planes, such
Algorithm 1: Plane segmentation
Input: Organized point cloud: P = {p1, p2, ..., pk}
Output: Plane candidates: Planes = {Planesi(Ai, Bi, Ci, Di)}
1 (n^x, n^y, n^z)^T = ComputeNormal(pi); . Compute the normal vector of each point pi
2 (n^x_i, n^y_i, n^z_i)^T = FindNeighBor((n^x, n^y, n^z)^T); . Find the k-nearest neighbors of each point pi
3 (n^x_i, n^y_i, n^z_i)^T = SortRegionNeighBor((n^x_i, n^y_i, n^z_i)^T); . Sort the points in each region of k-nearest neighbors based on their curvature values
4 Rn{rn1, rn2, ..., rnn} = Clustering((n^x_i, n^y_i, n^z_i)^T); . Cluster regions in normal space
5 Rd{rd1, rd2, ..., rdm} = Clustering(Rn), (m ≤ n); . Cluster regions in distance space
6 RN(j) = FindNeighbor(Rd), (j = 1, 2, ..., m); . Find the neighboring regions of region rdj
7 Ang(j) = ComputeAngle(rdj, RN(j)); . Compute the angle between the normal of region rdj and the RN(j) regions
8 If (Ang(j) ≤ t) then . t is the angle threshold to merge regions
9 {
10 Ri = MergeRegion(rdj, RN(j)); . Each region has a fitted plane
11 Push Ri → Planes; . Add region Ri to Planes
12 }
13 Return Planesi(Ai, Bi, Ci, Di);
as the table plane, the floor plane, object planes, and wall planes. In order to determine the table plane, the non-table planes have to be eliminated based on criteria derived from scenario constraints. As a visually impaired person with a Microsoft Kinect mounted on the chest moves around the table, the acceleration vector is perpendicular to the table plane and the floor plane.
To do this, the constraint that the table stands on the floor is utilized. Therefore, the table plane is parallel with the floor plane in the current scene. The acceleration vector gives us the normal vector of the ground/floor plane (and of the other planes that are parallel with the table plane). The non-table planes that do not meet this criterion are eliminated. Based on this scheme, the angle (ai) between the acceleration vector and the normal vector of each detected plane pli obtained from the plane segmentation procedure is computed. If the angle is larger than a threshold t, the detected plane is eliminated. Otherwise, we consider it a table plane candidate and move to the second step. The results of the first step are planes that are perpendicular to the acceleration vector. After rotating the y axis such that it is parallel with the acceleration vector, the directions of the acceleration vector and the y axis are the same, as in Fig. 6. Therefore, the table plane is the highest plane in the scene, that is, the plane with the minimum y-value. Since the Microsoft Kinect is mounted on the person's chest (as shown in Fig. 1), the table plane is usually the closest plane to the Kinect among the detected planes. Therefore, we choose the plane with the minimum y-value as the detected table plane. Moreover, this plane must have a sufficient number of points (‘mininliers’).
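Putting the two criteria together, the table plane selection can be sketched as follows. The DetectedPlane struct and the function name are our illustrative assumptions, and the coordinate frame is assumed to have already been rotated so that the y axis is aligned with the acceleration vector, as described above.

#include <algorithm>
#include <cmath>
#include <limits>
#include <vector>
#include <Eigen/Dense>

struct DetectedPlane {
    Eigen::Vector3f normal;    // unit normal from the segmentation step
    Eigen::Vector3f centroid;  // 3-D centroid of the plane's inliers
    int numInliers;            // number of supporting points
};

// Keep planes whose normal is (nearly) parallel to the acceleration vector,
// then pick the highest remaining plane (minimum y). Returns -1 if none.
int selectTablePlane(const std::vector<DetectedPlane>& planes,
                     const Eigen::Vector3f& accel,   // gravity direction
                     float angleThreshDeg, int minInliers) {
    const Eigen::Vector3f g = accel.normalized();
    int best = -1;
    float bestY = std::numeric_limits<float>::max();
    for (int i = 0; i < static_cast<int>(planes.size()); ++i) {
        if (planes[i].numInliers < minInliers) continue;  // 'mininliers' check
        const float c = std::fabs(planes[i].normal.normalized().dot(g));
        const float deg = std::acos(std::min(c, 1.0f)) * 180.0f / 3.14159265f;
        if (deg > angleThreshDeg) continue;   // normal not parallel to gravity
        if (planes[i].centroid.y() < bestY) { // highest plane = minimum y
            bestY = planes[i].centroid.y();
            best = i;
        }
    }
    return best;
}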
Figure 6. Illustration of the acceleration vector provided by a Microsoft Kinect sensor
Figure 7. Examples of 10 scenes captured in our dataset
4. EXPERIMENTAL RESULTS AND DISCUSSION
4.1. Setup and dataset
To evaluate the performance of the proposed method, our own dataset and the dataset in [14] are used. Concerning our dataset, the following experiment has been set up: a Microsoft Kinect version 1 is mounted on the person's chest, and the person then moves around one table in the room. The distance between the Kinect and the center of the table is about 1.5 m. The height of the Kinect above the table plane is about 0.6 m, and the height of the table plane is about 60–80 cm. We captured data of 10 different scenes, including a cafeteria, a showroom, a kitchen, and so on. These scenes cover common contexts in the daily activities of visually impaired people. Some examples of the captured images of the 10 scenes are shown in Fig. 7. For each scene, the subject is asked to move around a table; therefore, different viewpoints are considered in our experiments. The numbers of captured images (at a frame rate of 5 fps) are given in Table 1. The size of the images is 640 × 480 pixels. Each frame has a corresponding acceleration vector. The color and depth images are calibrated using Microsoft Kinect SDK calibration functions.
The second dataset is introduced in [14]. This dataset contains calibrated RGB-D data of 111
scenes. Each scene has a table plane. The size of the image is 640 × 480 pixels. Some examples of
this dataset are illustrated in Fig. 8. Since this dataset does not provide acceleration information,
we assume the table plane is the largest plane in a scene.
Table 1. The number of frames of each scene
Scene 1 2 3 4 5 6 7 8 9 10
#frame 950 253 771 292 891 797 411 1717 254 350
Figure 8. Examples of scenes in the dataset [14]
4.2. Table plane detection evaluation method
In order to evaluate the proposed technique, the ground-truth of the table planes was prepared for both datasets and three different evaluation measures were defined. Concerning the ground-truth, the table region in each color image is cropped manually. Such a cropped region gives a mask of the table plane, as shown in Fig. 9b. After that, the corresponding region in the depth image is taken and then represented as point cloud data. These are referred to as the ground-truth point clouds. Concerning the evaluation measures, the table plane detection result can be affected by different properties of the detected planes. In this paper, the accuracy of the detected table plane is considered in terms of both the plane's parameters and its size or area. In order to evaluate the parameters, the normal vector derived from the parameters (e.g., A, B, C) of the detected table plane is compared with the one extracted from the ground-truth data. However, if only the parameters of the plane are used, the size of the detected table is ignored; in practical experiments, the size of the detected table can be much smaller than the actual table in the captured scene when projecting the detected plane into the color image. Conversely, if the evaluation is based only on the size/area of the detected table, the detected plane can still be skewed. Therefore, three evaluation measures are needed; they are defined as below.
Evaluation measure 1 (EM1): This measure evaluates the difference between the normal vector extracted from the detected table plane and the normal vector extracted from the ground-truth data. First, the 3-D centroid of the ground-truth point cloud data is determined and the normal vector at this point is calculated. The 3-D center point and its normal vector are also extracted from each detected table plane, as illustrated in Fig. 9. After that, the angle (α) between the normal vector of the detected table plane and a vector T that connects the 3-D centroid of the detected table plane to the 3-D centroid of the ground-truth is calculated, as shown in Fig. 10(a); (α) is expected to be close to 90 degrees. To evaluate a true detection, lower and upper thresholds for this angle are set: in this paper, 85 degrees for the lower and 95 degrees for the upper threshold. A limitation of this evaluation is that the planes can contain many noisy points, and the evaluation results can be affected by this noise because the 3-D centroid is determined from all points (including noisy ones) belonging to the plane. Therefore, a second evaluation measure is proposed to overcome this issue.
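As a minimal sketch (with Eigen types and the function name angleAlphaDeg being our own choices), the EM1 angle can be computed as follows.

#include <algorithm>
#include <cmath>
#include <Eigen/Dense>

// EM1 sketch: angle alpha between the detected plane's normal and the vector
// T joining the detected plane's 3-D centroid to the ground-truth centroid;
// a true detection should give alpha between 85 and 95 degrees.
float angleAlphaDeg(const Eigen::Vector3f& detectedNormal,
                    const Eigen::Vector3f& detectedCentroid,
                    const Eigen::Vector3f& gtCentroid) {
    const Eigen::Vector3f T = (gtCentroid - detectedCentroid).normalized();
    float c = detectedNormal.normalized().dot(T);
    c = std::max(-1.0f, std::min(c, 1.0f));      // clamp for acos
    return std::acos(c) * 180.0f / 3.14159265f;
}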
Evaluation measure 2 (EM2): By using EM1, only one point was used (center point of the
Figure 9. (a) Color and depth image of the scene; (b) mask data of the table plane; (c) cropped region; (d) point cloud corresponding to the cropped region; the green point is the 3-D centroid of the region
ground-truth) to estimate the angle. To reduce the influence of noise, more points are used to determine the normal vector of the ground truth. For EM2, three points (p1, p2, p3) are randomly selected from the ground-truth point cloud. After that, the normal vector n of the plane constituted by two vectors u and v is computed as follows:

n = [u, v] (6)

where n is generated from the cross product of the two vectors; u is the vector from p1 to p2; v is the vector from p1 to p3. The angle (β) between the normal vector of the detected table plane and the vector n is then computed. Since the two vectors should be parallel, the angle β should ideally be 0 degrees. However, in order to make the evaluation measure tolerant to noise, an upper threshold for (β) is set. If (β) is lower than this threshold, the detection is counted as a true detection. This threshold is set to 5 degrees in our experimental evaluations (see Fig. 10b).
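Correspondingly, a sketch of EM2 with the cross product of Eq. 6 (function name ours; the sign of the normal is ignored, since the two normals may point in opposite directions):

#include <algorithm>
#include <cmath>
#include <Eigen/Dense>

// EM2 sketch: n = u x v with u = p2 - p1 and v = p3 - p1 (Eq. 6), then the
// angle beta between n and the detected plane's normal.
float angleBetaDeg(const Eigen::Vector3f& p1, const Eigen::Vector3f& p2,
                   const Eigen::Vector3f& p3,
                   const Eigen::Vector3f& detectedNormal) {
    const Eigen::Vector3f n = (p2 - p1).cross(p3 - p1).normalized();
    const float c = std::fabs(n.dot(detectedNormal.normalized()));
    return std::acos(std::min(c, 1.0f)) * 180.0f / 3.14159265f;
}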
Evaluation measure 3 (EM3): The two evaluation measures presented above do not take into account the area of the detected table plane. Therefore, EM3, inspired by the Jaccard index for object detection [11], is proposed. First, the detected table plane is projected from the point cloud into the RGB image space, giving a region named Rd; the area of the table manually annotated as ground truth on the image is named Rg. Then, the ratio r between the intersection and the union of the detected and ground-truth regions is computed as follows:
r = (Rd ∩ Rg) / (Rd ∪ Rg). (7)
If r is greater than a threshold, we define it as a true detection, otherwise a false detection. In [7], the authors used only the overlapping region for determining a true detection. This may give a wrong evaluation if the detected table plane covers the whole image. By utilizing EM3, the results of the table plane detection can be evaluated more accurately. For example, a detected table plane could be counted as true when its data points satisfy the constraints of EM1 and EM2; but if its size does not satisfy the EM3 measure, it is a false detection. While EM1 and EM2 are the traditional evaluations for plane detection techniques, in this paper three independent measures are utilized, besides the computational time, to confirm the robustness of the proposed
Figure 10. (a) Illustration of the angle between the normal vector of the detected table plane and T; (b) illustration of the angle between the normal vector ne of the detected table plane and ng; (c) illustration of the overlap and union between the detected and ground-truth regions
technique. Moreover, the recall and missing rate of the table plane detection are also computed. Recall is defined as the ratio between the number of true detections and the number of table planes in the dataset, while the missing rate is defined as the ratio between the number of missed table planes and the number of table planes.
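EM3 then reduces to counting pixels on two binary masks, as in this sketch (the buffer layout and names are our assumptions):

#include <cstddef>
#include <cstdint>
#include <vector>

// EM3 sketch: Jaccard ratio r = (Rd ∩ Rg) / (Rd ∪ Rg) of Eq. 7 between the
// detected table mask Rd and the ground-truth mask Rg, stored as equally
// sized binary images (non-zero = table pixel).
double jaccardRatio(const std::vector<uint8_t>& detected,
                    const std::vector<uint8_t>& groundTruth) {
    long inter = 0, uni = 0;
    for (std::size_t i = 0; i < detected.size(); ++i) {
        const bool d = detected[i] != 0;
        const bool g = groundTruth[i] != 0;
        inter += (d && g);
        uni += (d || g);
    }
    return uni > 0 ? static_cast<double>(inter) / uni : 0.0;
}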
4.3. Results and discussion
In order to evaluate the proposed method, we compare it with two other methods. The three methods mainly differ from one another in terms of the implementation techniques used for plane segmentation. The first baseline method deploys PROSAC (see details in [10]), whereas the second one utilizes the techniques proposed in [6, 7]. The proposed method, as described in Sec. 3, utilizes the down-sampling and smoothing techniques to reduce the computational time.
Some parameters are set for the first method as follows: the minimum number of points in a plane is 100; the threshold to determine whether a point belongs to a plane in the PROSAC algorithm [10] is t = 0.1 (that means 10 cm); the threshold h to remove the floor plane is 120 cm. Regarding the second method, the parameters are defined as follows: the minimum number of points in a region (mininliers) is 1000; the thresholds for clustering and merging two regions are an angle threshold of 5 degrees and a distance threshold of 0.05 (that means 5 cm); the threshold h to remove the floor plane is 120 cm. For the proposed method, we set mininliers to 300 points and down-sample the image with a (3 × 3) window. The other parameters are identical to those of the second method.
These methods are compared on the experimental datasets described in Section 4.1. All methods are written in C++ using the PCL 1.7 and OpenCV 2.4.9 libraries, on a PC with a Core i5 processor and 8 GB of RAM. The comparative results of the three evaluation measures on the two datasets are shown in Table 2 and Table 3 respectively, while the detailed results for each scene of our dataset are illustrated in Fig. 11.
The results on our dataset show that the methods based on organized point clouds (the second method and the proposed method) obtain not only good results in terms of precision and a low missing rate but also computational efficiency. The proposed method obtains similar results compared to the second
Table 2. The average result of detected table planes on our own dataset (%)

Approach            EM1    EM2    EM3    Average   Missing rate   Frames per second
First Method [10]   87.43  87.26  71.77  82.15     1.2            0.2
Second Method [6]   98.29  98.25  96.02  97.52     0.63           0.83
Proposed Method     96.65  96.78  97.73  97.0      0.81           5
Figure 11. Detailed results for each scene of the three plane detection methods on our
dataset: (a) using the first evaluation measure; (b) the second evaluation measure and (c)
the third evaluation measure
one (the average recalls are 97% and 97.52% respectively). However, our method processes 5 frames per second while the second method can only process 0.83 frames per second. The first method has the lowest recall (82.15%) and the lowest frame rate (0.2 frames/second) because this method is based
on the PROSAC algorithm and may generate some wrong planes. In terms of missing rate, our method and the second one have a lower missing rate in comparison with the first method. Some examples of table detection results of the proposed method are illustrated in Fig. 12. It is interesting to see that the detection results are coherent across the three evaluation measures. Since the third measure uses a strict constraint on the detected area, the recall of the three methods decreases under it, especially for the first detection method based on PROSAC, which can generate many planes with a small area. An illustration of this problem is shown in Fig. 13 (top row). In this example, the detections of the first method are correct when using the first and second evaluation measures, since the angle (α) and the angle (β) defined in these measures are 85.68 degrees and 4.17 degrees respectively. However, the ratio r of the third measure is 21.3%, which does not satisfy the defined criteria.
The results on the dataset [14] (see Table 3) show that our method obtains similar accuracy to the methods in [6], [10] on EM1 and EM2, and outperforms these methods on EM3. However, this dataset has more noise than ours; therefore, the obtained accuracy is lower than that on our dataset for the EM1 and EM2 measures. Concerning the EM3 measure, since this dataset is less complex than ours in terms of the number of planes appearing in the scene (one scene in this dataset normally has a table plane, a wall plane, and a ground plane), the accuracy with EM3 is quite high.
The table detection results on the two testing datasets show that the proposed method achieves good performance (greater than 97% for EM3) with an acceptable frame rate for a working application (5 frames per second).
Table 3. The average result of detected table planes on the dataset [14] (%)

Approach            EM1    EM2    EM3    Average   Missing rate   Frames per second
First Method [10]   87.39  68.47  98.19  84.68     0.0            1.19
Second Method [6]   87.39  68.47  95.49  83.78     0.0            0.98
Proposed Method     87.39  68.47  99.09  84.99     0.0            5.43
Since our method applies down-sampling, the table detection results are also compared for different down-sampling factors: (3×3), (5×5) and (7×7). The results are listed in Table 4. The table plane detection accuracy is inversely proportional to the down-sampling factor, while the processing speed increases with it. The choice of the down-sampling factor depends on the system requirements. At 33 frames per second, an average rate of 84.13% is obtained. In some cases, this result can be acceptable, because when the system cannot detect the table, the visually impaired person can move slightly in the scene until the Microsoft Kinect captures the table better.
However, in some cases, all methods fail to detect the table plane because the number of points belonging to the table is smaller than the threshold mininliers. Such a case is illustrated in Fig. 13 (bottom row). In this failure case, none of the planes could be detected in the plane segmentation step.
5. CONCLUSIONS AND FUTURE WORKS
In this paper, a method for table plane detection using down-sampling, accelerometer data, and an organized point cloud structure obtained from the color and depth images of the Kinect sensor has been proposed.
Figure 12. Results of table detection on our dataset (first two rows) and the dataset in [14] (bottom two rows). The table plane is delimited by the red boundary in the image and by the green points in the point cloud. The red arrow is the normal vector of the detected table
Table 4. The average result of detected table planes of our method with different down-sampling factors on our dataset

Down sampling   Average recall (%)   Frames per second
(3×3)           97.00                5
(5×5)           92.21                14
(7×7)           84.13                33
The method outperforms the baseline methods in terms of both precision and computational time. A table plane dataset with ground-truth has been built in the context of the object-finding aid system. In order to evaluate the proposed method, the authors have performed a quantitative comparison of the method with two other methods in the literature. Moreover, to confirm the robustness of the proposed method, three different evaluation measures are utilized. The method obtained a precision rate of 97% at a frame processing rate of 5 Hz on the captured dataset. In this research context,
Figure 13. Top row: an example detection that is counted as a true detection using the first two evaluation measures and as a false detection using the third evaluation measure: (a) color image; (b) point cloud of the scene; (c) the overlap area between the 2-D contour of the detected table plane and the table plane ground-truth. Bottom row: an example of a missing case with our method: (a) color image, (b) point cloud of the scene. After down-sampling, the number of points belonging to the table is 276, which is lower than our threshold
a dataset consisting of 10 different scenes was collected, with the frames of each scene captured from different perspectives. The dataset was collected in common environments; therefore, it can satisfy the requirements of an aided system for object finding. Based on the table plane detection results, we will continue to perform object detection and localization. Finally, the aided system will be completed with a communication module that sends the object information to visually impaired people.
ACKNOWLEDGMENT
This research is funded by Vietnam National Foundation for Science and Technology Development
(NAFOSTED) under grant number FWO.102.2013.08.
REFERENCES
[1] H. Badino, “Least Squares Estimation of a Plane Surface in Disparity Image Space,” Technical
report in Carnegie Mellon University Pittsburgh, 2011.
[2] D. Borrmann, J. Elseberg, K. Lingemann, and A. Nüchter, “The 3D Hough Transform for Plane Detection in Point Clouds: A Review and a New Accumulator Design,” 3D Research, vol. 2, no. 2, 2011.
[3] S. Choi, T. Kim, and W. Yu, “Performance Evaluation of RANSAC Family,” in Proceedings of the British Machine Vision Conference, 2009, pp. 1–11.
[4] O. Chum and J. Matas, “Matching with PROSAC - progressive sample consensus,” in Proceedings
of the Computer Vision and Pattern Recognition, 2005, pp. 220–226.
[5] J. E. Deschaud and F. Goulette, “A Fast and Accurate Plane Detection Algorithm for Large
Noisy Point Clouds Using Filtered Normals and Voxel Growing,” in Proceedings of the 5th
International Symposium on 3D Data Processing (3DPVT), 2010.
[6] D. Holz, S. Holzer, R. B. Rusu, and S. Behnke, “Real-Time Plane Segmentation Using RGB-D Cameras,” in LNCS (7416): RoboCup 2011 - Robot Soccer World Cup XV, 2011, pp. 306–317.
[7] C. Feng, Y. Taguchi, and V. Kamat, “Fast Plane Extraction in Organized Point Clouds Using Agglomerative Hierarchical Clustering,” in Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), 2014, pp. 6218–6225.
[8] M. A. Fischler and R. C. Bolles, “Random Sample Consensus: A Paradigm for Model Fitting with Applications to Image Analysis and Automated Cartography,” Communications of the ACM, vol. 24, no. 6, pp. 381–395, 1981.
[9] T. Huang, G. Yang, and G. Tang, “A fast two-dimensional median filtering algorithm,” IEEE
Trans. Acoust., Speech, Signal Processing, vol. 27, no. 1, pp. 13–18, 1979.
[10] L. V. Hung, V. Hai, N. T. Thuy, L. T. Lan, and T. T. T. Hai, “Table plane detection using geometrical constraints on depth image,” in Proceedings of the National Conference on Fundamental and Applied IT Research (FAIR), 2015, pp. 647–657.
[11] P. Jaccard, “The distribution of the flora in the alpine zone,” New Phytologist, vol. 11, no. 2,
pp. 37–50, 1912.
[12] N. Jagadeesan and R. Parvathi, “An Efficient Image Downsampling Technique Using Genetic Algorithm and Discrete Wavelet Transform,” Journal of Theoretical and Applied Information Technology, vol. 61, no. 3, pp. 506–514, 2014.
[13] J. Kramer, N. Burrus, F. Echtler, H. C. Daniel, and M. Parker, Hacking the Kinect. Apress, 2012.
[14] A. Richtsfeld, T. Mrwald, J. Prankl, M. Zillich, and M. Vincze, “Segmentation of unknown
objects in indoor environments,” in 2012 IEEE/RSJ International Conference on Intelligent
Robots and Systems, Oct 2012, pp. 4791–4796.
[15] J. S. Simonoff, Smoothing Methods in Statistics. Springer, 1998.
[16] M. Trentacoste, R. Mantiuk, and W. Heidrich, “Blur-Aware Image Downsampling,” EUROGRAPHICS, vol. 30, no. 2, 2011.
[17] M. Y. Yang and W. Förstner, “Plane Detection in Point Cloud Data,” Technical Report Nr. 1, Department of Photogrammetry, Institute of Geodesy and Geoinformation, University of Bonn, 2010.
Received January 12 - 2016
Revised January 23 - 2017