Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

Advances in computer vision and machine learning have led to development in 2D and 3D human pose estimation.
Human pose estimation from images is affected by occlusion and lighting.
Radar and LiDAR technologies need specialized hardware that is expensive and power-intensive.
Recent research has explored the use of WiFi antennas for body segmentation and key-point body detection.
This paper further expands on the use of WiFi signal and deep learning to estimate dense human pose correspondence.
Results of the study show that the model can estimate the dense pose of multiple subjects with comparable performance to image-based approaches.
This paves the way for low-cost, broadly accessible, and privacy-preserving algorithms for human sensing.

Paper Content

Introduction

Progress has been made in human pose estimation using 2D and 3D sensors
LiDAR and radar sensors are expensive and power-consuming
RGB cameras have narrow field of view and poor lighting conditions
Privacy concerns prevent use of cameras in non-public places
WiFi signals can be used as a substitute for RGB images in certain instances

Computer vision research has focused on dense pose estimation from images and video
Güler et al. mapped human images to dense correspondences of a human mesh model using deep networks
DensePose is based on instance segmentation architectures and predicts body-wise UV maps for each pixel
Shapovalov et al.’s work focused on lifting dense pose surface maps to 3D human models without 3D supervision
Radar-based systems and WiFi are different
mmMesh generates 3D human mesh from commercially portable millimeter-wave devices
WiFi-based person localization using the traditional time-of-flight (ToF) method is limited
Fei Wang et.al. demonstrated that it is possible to detect 17 2D body joints and perform 2D semantic body segmentation mask using only WiFi signals
Our research aims to conduct dense pose estimation via WiFi

Methods

Raw CSI signals are cleaned by amplitude and phase sanitization
Two-branch encoder-decoder network performs domain translation from sanitized CSI samples to 2D feature maps
2D features are fed to a modified DensePose-RCNN architecture to estimate the UV map
Transfer learning is used to minimize differences between feature maps produced by images and WiFi signals
Raw CSI data is sampled in 100Hz with 30 subcarrier frequencies
Inputs of network contain 5 consecutive CSI samples under 30 frequencies
Outputs include 17 keypoint heatmaps and 25 UV maps

Phase sanitization

Raw CSI samples are noisy and have random phase drift and flip.
Most WiFi-based solutions rely only on the amplitude of CSI signals.
Discarding phase information has a negative impact on performance.
Sanitization is used to obtain stable phase values and enable full use of CSI information.

Modality translation network

Modality Translation Network is used to transform network inputs from the CSI domain to the spatial domain
Two encoders are used to extract CSI latent space features from the amplitude and phase tensors
Each element of the tensors captures a unique 1D summary of the entire scene
Fused 1D feature is reshaped into a 24x24 2D feature map
Two convolution blocks are used to extract spatial information and obtain a 6x6 feature map
Four deconvolution layers are used to upsample the encoded feature map to 3x720x1280

Wifi-densepose rcnn

Obtain 3x720x1280 scene representation in image domain
Utilize image-based methods to predict UV maps of human bodies
Pose estimation algorithms are two-stage
Not possible to extract signals corresponding to single person from group
Adopt network structure similar to DensePose-RCNN
Extract spatial features from 3x720x1280 imagelike feature map
Network contains two branches: DensePose head and Keypoint head
Estimate keypoint locations and densely predict human part labels and surface coordinates
Output 17x56x56 keypoint mask and 25x112x112 IUV map

Transfer learning

Training the Modality Translation Network and WiFi-DensePose RCNN network takes 80 hours
Transfer learning is used to improve training efficiency
Transfer learning involves a teacher and student network
Mean squared error is used to calculate transfer learning loss
Transfer learning results in higher performance and fewer iterations to converge

Losses

Total loss is computed using 5 components
Classification and box regression losses are standard RCNN losses
DensePose loss consists of several sub-components
Cross-entropy loss for coarse segmentation and body part classification
Smooth L1 loss for UV coordinate regression

Experiments

Experiment conducted to validate WiFi-based DensePose
Results of experiment presented in this section

Dataset

Dataset 1 contains CSI samples and videos taken at 100Hz and 20 FPS respectively.
Dataset 1 was gathered in 16 spatial layouts with 1-5 subjects in each layout.
Pre-trained models were used to produce pseudo ground truth.
Pseudo ground truth includes person bounding boxes, segmentation masks, UV maps, and keypoint coordinates.

Training/testing protocols and metrics

Two protocols are used to evaluate the performance of the algorithm: Same Layout and Different Layout
The Same Layout protocol randomly selects 80% of the samples for training and the rest for testing
The Different Layout protocol trains on 15 spatial layouts and tests on 1 unseen spatial layout
The performance of the algorithm is evaluated in two aspects: the ability to detect humans and accuracy of the dense pose estimation
Standard Average Precision (AP) is calculated for person bounding boxes at multiple IOU thresholds
Geodesic Point Similarity (GPS) and Masked Geodesic Point Similarity (GPSm) are used to measure the performance of DensePose detection

Implementation details

Implemented approach in PyTorch
Training batch size of 16 on 4 GPU server
Empirically set = 0.6, = 0.3, = 0.1
Warmup multi-step learning rate scheduler
Initial learning rate of 1 − 5
Learning rate increases to 1 − 3 for first 2000 iterations
Learning rate decreases to 1 10 of its value every 48000 iterations
Trained for 145,000 iterations for final model

Wifi-based densepose under same layout

WiFi-based DensePose under the Same Layout protocol had a high AP@50 (87.2) and a low AP@75 (35.6).
DensePose estimations had high dpAP • GPS@50 and dpAP • GPSm@50, but low dpAP • GPS@75 and dpAP • GPSm@75.

Comparison with image-based densepose

WiFi dataset does not have manual annotations, making it difficult to compare performance of WiFi-based DensePose with its Image-based counterpart
Image-based DensePose baseline was used to compare performance
Estimations from WiFi-based model are reasonably well compared to Image-based DensePose
WiFi-based model has lower absolute metrics, but small difference between AP-m and AP-l values

Ablation study

Incorporating phase information slightly improves performance
Adding a keypoint branch increases performance significantly
Transfer learning reduces training time without sacrificing performance

Performance in different layouts

Results obtained using same layout for training and testing
Performance decreases significantly when tested on unseen domain
Image-based model suffers from same domain generalization problem

Failure cases

WiFi-based model produces wrong body parts when body poses rarely seen in training set
Difficult to extract detailed information for each individual when 3+ subjects in one capture

Conclusion and future work

Demonstrated dense pose estimation from WiFi signals
Hardware setup illustrated in Figure 1
CSI tensor of 3x3 dimensions
5 consecutive complex-valued CSI samples used
Modality Translation Network used to extract features from amplitude and phase
WiFi-DensePose RCNN used to extract person-wise features
Transfer learning from image-based teacher network
Performance limited by public training data
Future work to collect multi-layout data and extend to 3D human body shapes
Qualitative comparison of image-based and WiFi-based DensePose in Figures 9 and 10
Ablation study of human detection and DensePose estimation
Metrics used to measure performance

Link to paper#

Abstract#

Paper Content#

Introduction#

Related work#

Methods#

Phase sanitization#

Modality translation network#

Wifi-densepose rcnn#

Transfer learning#

Losses#

Experiments#

Dataset#

Training/testing protocols and metrics#

Implementation details#

Wifi-based densepose under same layout#

Comparison with image-based densepose#

Ablation study#

Performance in different layouts#

Failure cases#

Conclusion and future work#

Link to paper

Abstract

Paper Content

Introduction

Related work

Methods

Phase sanitization

Modality translation network

Wifi-densepose rcnn

Transfer learning

Losses

Experiments

Dataset

Training/testing protocols and metrics

Implementation details

Wifi-based densepose under same layout

Comparison with image-based densepose

Ablation study

Performance in different layouts

Failure cases

Conclusion and future work