Sequential scientific data span many resolutions and domains, and unifying them into a common representation is a key step toward developing foundation models for the sciences. Astronomical spectra exemplify this challenge: massive surveys have collected millions of spectra across a wide range of wavelengths and resolutions, yet analyses remain fragmented across spectral domains (e.g., optical vs. infrared) and object types (e.g., stars vs. galaxies), limiting the ability to pool information across datasets. We present a deep learning model that jointly learns from heterogeneous spectra in a self-supervised manner. Our universal spectral tokenizer processes spectra from a variety of object types and resolutions directly on their native wavelength grids, producing intrinsically aligned, homogeneous, and physically meaningful representations that can be efficiently adapted to achieve competitive performance across a range of downstream tasks. For the first time, we demonstrate that a single model can unify spectral data across resolutions and domains, suggesting that our model can serve as a powerful building block for foundation models in astronomy, and potentially extend to other scientific domains with heterogeneous sequential data, such as climate and healthcare.