Voxify3D: Pixel Art Meets Volumetric Rendering

Voxel art is a distinctive stylization widely used in games and digital media, yet automated generation from 3D meshes remains challenging due to conflicting requirements of geometric abstraction, semantic preservation, and discrete color coherence. Existing methods either over-simplify geometry or fail to achieve the pixel-precise, palette-constrained aesthetics of voxel art. We introduce Voxify3D, a differentiable two-stage framework bridging 3D mesh optimization with 2D pixel art supervision. Our core innovation lies in the synergistic integration of three components: (1) orthographic pixel art supervision that eliminates perspective distortion for precise voxel-pixel alignment; (2) patch-based CLIP alignment that preserves semantics across discretization levels; (3) palette-constrained Gumbel-Softmax quantization enabling differentiable optimization over discrete color spaces with controllable palette strategies. This integration addresses fundamental challenges: semantic preservation under extreme discretization, pixel-art aesthetics through volumetric rendering, and end-to-end discrete optimization. Experiments show superior performance (37.12 CLIP-IQA, 77.90% user preference) across diverse characters and controllable abstraction (2-8 colors, 20x-50x resolutions). Project page: https://yichuanh.github.io/Voxify-3D/
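
To make component (3) concrete, the following is a minimal sketch of the forward pass of a palette-constrained Gumbel-Softmax for a single voxel. It is an illustration under stated assumptions, not the authors' implementation: the function names, the four-color palette, the logits, and the temperature `tau` are all hypothetical, and the sketch uses only the Python standard library, so it shows the sampling and blending step but not the gradient computation that an autodiff framework would provide.

```python
import math
import random


def gumbel_softmax(logits, tau, rng):
    """Return relaxed one-hot weights over palette entries for one voxel."""
    noisy = []
    for logit in logits:
        # Gumbel(0, 1) noise is -log(-log(U)); clamp U away from 0 and 1.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        noisy.append((logit - math.log(-math.log(u))) / tau)
    # Numerically stable softmax over the perturbed, temperature-scaled logits.
    peak = max(noisy)
    exps = [math.exp(v - peak) for v in noisy]
    total = sum(exps)
    return [e / total for e in exps]


def soft_color(weights, palette):
    """Blend palette colors by the relaxed weights (the differentiable path)."""
    return tuple(
        sum(w * color[ch] for w, color in zip(weights, palette))
        for ch in range(3)
    )


def hard_color(weights, palette):
    """Snap to the single highest-weight palette color (the discrete result)."""
    best = max(range(len(weights)), key=weights.__getitem__)
    return palette[best]


# Hypothetical 4-color RGB palette and per-voxel logits (assumed values).
palette = [(0.9, 0.1, 0.1), (0.1, 0.8, 0.2), (0.1, 0.2, 0.9), (1.0, 1.0, 1.0)]
logits = [2.0, 0.5, -1.0, 0.0]

rng = random.Random(0)
weights = gumbel_softmax(logits, tau=0.5, rng=rng)
print("soft:", soft_color(weights, palette))
print("hard:", hard_color(weights, palette))
```

In this sketch, lowering `tau` pushes the weights toward a one-hot vector, so the blended color approaches a single palette entry; a higher `tau` keeps the mixture smoother, which is the usual trade-off when annealing a Gumbel-Softmax relaxation.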
