The research presents a comprehensive framework for consolidating multimodal sensor data collected under naturalistic conditions, grounded in the Carnegie Mellon University Multi-Modal Activity Database (CMU-MMAC). Focusing on the Subject 07-Brownie recording, the study examines the entire processing pipeline, from data alignment and transformation to fusion method evaluation, interpretability, and modality contribution. A unified preprocessing pipeline is developed to temporally align heterogeneous video and audio data, applying resampling, grayscale conversion, segmentation, and feature standardization prior to fusion. Semantic richness is confirmed via heatmaps, spectrograms, and luminance time series, while frame-aligned waveform overlays demonstrate temporal consistency. Results indicate that late fusion yields the highest validation accuracy, followed by hybrid fusion, with early fusion performing lowest. To assess the interpretability and discriminative power of audio and video in fused activity recognition, PCA and t-SNE are used to visualize feature coherence over time. Classification results show limited performance for audio alone, moderate performance for video, and a marked improvement with multimodal fusion, underscoring the complementary strengths of the combined modalities. Incorporating RFID data, which captures sparse interactions asynchronously, further improves recognition accuracy by over 50% and raises the macro-averaged ROC-AUC. The framework demonstrates the potential to transform raw, asynchronous sensor data into aligned, semantically meaningful representations, providing a reproducible approach to multimodal data integration and interpretation in intelligent systems designed to perceive complex human activities.
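
As an illustration of the alignment and fusion steps summarized above, the following minimal sketch resamples audio features onto the video frame timeline, standardizes each modality, and compares early fusion (feature concatenation) with late fusion (averaging per-modality class probabilities). All array shapes, frame rates, label arrays, and the logistic-regression classifiers are illustrative assumptions, not the study's actual implementation.

```python
# Minimal sketch (not the authors' code): align per-frame video features with
# audio features resampled to the video frame rate, standardize both, and
# compare early vs. late fusion using simple logistic-regression classifiers.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-ins for extracted features (real features would come from
# grayscale video frames and audio spectrogram segments of the recordings).
n_frames = 1000                                # video frames
video_feat = rng.normal(size=(n_frames, 64))   # per-frame video descriptors
audio_feat = rng.normal(size=(3000, 32))       # audio windows at a higher rate
labels = rng.integers(0, 5, size=n_frames)     # per-frame activity labels

# Temporal alignment: map audio windows onto the video frame timeline by
# nearest-neighbor indexing (linear interpolation would also work).
audio_idx = np.linspace(0, len(audio_feat) - 1, n_frames).round().astype(int)
audio_aligned = audio_feat[audio_idx]

# Feature standardization per modality.
video_std = StandardScaler().fit_transform(video_feat)
audio_std = StandardScaler().fit_transform(audio_aligned)

X_tr_v, X_te_v, X_tr_a, X_te_a, y_tr, y_te = train_test_split(
    video_std, audio_std, labels, test_size=0.3, random_state=0)

# Early fusion: concatenate standardized features before classification.
early_clf = LogisticRegression(max_iter=1000).fit(
    np.hstack([X_tr_v, X_tr_a]), y_tr)
early_acc = early_clf.score(np.hstack([X_te_v, X_te_a]), y_te)

# Late fusion: one classifier per modality, averaging predicted class
# probabilities at decision time.
clf_v = LogisticRegression(max_iter=1000).fit(X_tr_v, y_tr)
clf_a = LogisticRegression(max_iter=1000).fit(X_tr_a, y_tr)
late_proba = (clf_v.predict_proba(X_te_v) + clf_a.predict_proba(X_te_a)) / 2
late_acc = (late_proba.argmax(axis=1) == y_te).mean()

print(f"early fusion accuracy: {early_acc:.3f}")
print(f"late fusion accuracy:  {late_acc:.3f}")
```

On synthetic noise the two scores are near chance; the sketch is intended only to make the alignment-then-fuse structure concrete. An RFID stream, being sparse and asynchronous, would be incorporated the same way: resampled or event-indexed onto the shared frame timeline before entering either fusion scheme.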