Beyond the Hype: Embeddings vs. Prompting for Multiclass Classification Tasks

Are traditional classification approaches irrelevant in this era of AI hype? We show that there are multiclass classification problems where predictive models holistically outperform LLM prompt-based frameworks. Given text and images from home-service project descriptions provided by Thumbtack customers, we build embeddings-based softmax models that predict the professional category (e.g., handyman, bathroom remodeling) associated with each problem description. We then compare against prompts that ask state-of-the-art LLMs to solve the same problem. We find that the embeddings approach outperforms the best LLM prompts in terms of accuracy, calibration, latency, and financial cost. In particular, the embeddings approach has 49.5% higher accuracy than the prompting approach, and its superiority is consistent across text-only, image-only, and text-image problem descriptions. Furthermore, it yields well-calibrated probabilities, which we later use as confidence signals to provide a contextualized user experience during deployment. In contrast, the scores produced by prompting are largely uninformative. Finally, the embeddings approach is 14 and 81 times faster than prompting in processing images and text, respectively, and under realistic deployment assumptions it can be up to 10 times cheaper. Based on these results, we deployed a variation of the embeddings approach, and through A/B testing we observed performance consistent with our offline analysis. Our study shows that for multiclass classification problems that can leverage proprietary datasets, an embeddings-based approach may yield unequivocally better results. Hence, scientists, practitioners, engineers, and business leaders can use our study to go beyond the hype and consider appropriate predictive models for their classification use cases.

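To make the idea of an "embeddings-based softmax model" concrete, the following is a minimal sketch, not the authors' implementation. It assumes that each problem description has already been turned into a fixed-length embedding vector by some encoder, which is replaced here by synthetic clustered vectors. The function names (`fit_softmax`, `predict_proba`), the learning rate, the number of epochs, and the data are all hypothetical and chosen only for illustration.

```python
import numpy as np

def softmax(logits):
    # Subtract the row-wise max for numerical stability before exponentiating.
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def fit_softmax(embeddings, labels, num_classes, lr=1.0, epochs=500):
    """Multinomial logistic regression trained with batch gradient descent (hypothetical helper)."""
    n, d = embeddings.shape
    weights = np.zeros((d, num_classes))
    bias = np.zeros(num_classes)
    one_hot = np.eye(num_classes)[labels]
    for _ in range(epochs):
        probs = softmax(embeddings @ weights + bias)
        grad = (probs - one_hot) / n          # gradient of mean cross-entropy w.r.t. logits
        weights -= lr * embeddings.T @ grad
        bias -= lr * grad.sum(axis=0)
    return weights, bias

def predict_proba(embeddings, weights, bias):
    # Class probabilities; the row-wise max can serve as a confidence signal.
    return softmax(embeddings @ weights + bias)

# Synthetic stand-in for real embeddings (assumption: one cluster per category).
rng = np.random.default_rng(0)
num_classes, dim, per_class = 3, 8, 100
centers = rng.normal(scale=3.0, size=(num_classes, dim))
labels = np.repeat(np.arange(num_classes), per_class)
embeddings = centers[labels] + rng.normal(size=(labels.size, dim))
# L2-normalize, a common preprocessing step for embedding vectors.
embeddings /= np.linalg.norm(embeddings, axis=1, keepdims=True)

weights, bias = fit_softmax(embeddings, labels, num_classes)
probs = predict_proba(embeddings, weights, bias)
predicted = probs.argmax(axis=1)
confidence = probs.max(axis=1)
accuracy = (predicted == labels).mean()
print(f"training accuracy: {accuracy:.3f}, mean confidence: {confidence.mean():.3f}")
```

In a setting like the one described, the predicted category would be `predicted` and the per-example `confidence` could gate the user experience, for instance by asking a follow-up question when confidence is low. Whether such probabilities are actually well calibrated has to be checked on held-out data; this sketch does not do that.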