Mojo is an emerging programming language built on MLIR (Multi-Level Intermediate Representation) that supports JIT (Just-in-Time) compilation. It enables transparent hardware-specific optimizations (e.g., for CPUs and GPUs) while allowing users to express their logic in a user-friendly, Python-like syntax. Mojo has demonstrated strong performance on tensor operations; however, its capabilities for the relational operations common in data science workflows (e.g., filtering, join, and group-by aggregation) remain unexplored. To date, no dataframe implementation exists in the Mojo ecosystem. In this paper, we introduce MojoFrame, the first Mojo-native dataframe library, which supports core relational operations and user-defined functions (UDFs). MojoFrame is built on top of Mojo's tensor type to achieve fast operations on numeric columns, and it uses a cardinality-aware approach to integrate non-numeric columns for flexible data representation. To achieve high efficiency, MojoFrame takes approaches that differ significantly from those of existing libraries. We show that MojoFrame supports all operations required by the TPC-H queries and by a selection of TPC-DS queries with promising performance, achieving up to a 4.60x speedup over existing dataframe libraries in other programming languages. Nevertheless, optimization opportunities remain for MojoFrame (and for the Mojo language), particularly in in-memory data representation and dictionary operations.
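To make the idea of cardinality-aware handling of non-numeric columns more concrete, the following is a minimal sketch in Python rather than Mojo. It is not MojoFrame's implementation: the function names `encode_column` and `filter_equals` and the `max_cardinality_ratio` threshold are hypothetical, chosen only for illustration. The sketch assumes that a string column with few distinct values is replaced by integer codes (which could then sit alongside numeric columns in a tensor), while a column with many distinct values is kept as raw strings.

```python
from typing import Dict, List, Tuple


def encode_column(
    values: List[str], max_cardinality_ratio: float = 0.5
) -> Tuple[str, list, List[str]]:
    """Dictionary-encode a string column when its cardinality is low.

    Returns ("dict", codes, dictionary) when the ratio of distinct values
    to rows is at most max_cardinality_ratio (a hypothetical threshold),
    otherwise ("raw", values, []).
    """
    code_of: Dict[str, int] = {}
    codes: List[int] = []
    for v in values:
        # Assign the next unused integer code to each new distinct value.
        codes.append(code_of.setdefault(v, len(code_of)))

    if values and len(code_of) / len(values) <= max_cardinality_ratio:
        # Dicts preserve insertion order, so position i holds the value with code i.
        return "dict", codes, list(code_of)
    return "raw", list(values), []


def filter_equals(codes: List[int], dictionary: List[str], target: str) -> List[int]:
    """Return the row indices whose encoded value equals target."""
    if target not in dictionary:
        return []
    code = dictionary.index(target)
    # The comparison runs over integers only, never over the strings themselves.
    return [i for i, c in enumerate(codes) if c == code]


kind, codes, dictionary = encode_column(["A", "B", "A", "A"])
rows = filter_equals(codes, dictionary, "A")
```

Under these assumptions, a filter on an encoded column reduces to an integer comparison, which is the kind of operation a tensor-backed layout handles well; the high-cardinality fallback avoids building a dictionary nearly as large as the column itself.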