Traces and logs are essential for observability and fault diagnosis in modern distributed systems. However, their ever-growing volume introduces substantial storage overhead and complicates troubleshooting. Existing approaches typically adopt a sample-before-analysis paradigm: even when guided by data heuristics, they inevitably discard failure-related information and reduce transparency when diagnosing system behavior. To address this, we introduce UniSage, the first unified framework to sample both traces and logs using a post-analysis-aware paradigm. Instead of discarding data upfront, UniSage first performs lightweight, multi-modal anomaly detection and root cause analysis (RCA) on the complete data stream. This process yields fine-grained, service-level diagnostic insights that guide a dual-pillar sampling strategy for handling both normal and anomalous scenarios: an analysis-guided sampler prioritizes data implicated by RCA, while an edge-case-based sampler ensures that rare but critical behaviors are captured. Together, these pillars provide comprehensive coverage of critical signals without excessive redundancy. Extensive experiments demonstrate that UniSage significantly outperforms state-of-the-art baselines. At a 2.5% sampling rate, it captures 56.5% of critical traces and 96.25% of relevant logs, while improving the accuracy (AC@1) of downstream RCA by 42.45%. Furthermore, its efficient pipeline processes 10 minutes of telemetry data in under 5 seconds, demonstrating its practicality for production environments.