Compressed file formats are the cornerstone of efficient data storage and transmission, yet their potential for representation learning remains largely underexplored. We introduce TEMPEST (TransformErs froM comPressed rEpreSenTations), a method that exploits the inherent byte-stream structure of compressed files to design an effective tokenization and encoding strategy. By leveraging this compact encoding, a standard transformer can directly learn semantic representations from compressed data streams, bypassing the need for raw byte-level processing or full media decoding. Our proposal substantially reduces the number of tokens required for semantic classification, thereby lowering both computational complexity and memory usage. Through extensive experiments across diverse datasets, coding schemes, and modalities, we show that TEMPEST achieves accuracy competitive with the state of the art while delivering efficiency gains in memory and compute.