Advanced Fine-Tuning Techniques for Multi-Agent Orchestration: Patterns from Amazon at Scale

Summary

This post examines the results Amazon has achieved with advanced fine-tuning techniques: a 33% reduction in dangerous medication errors (Amazon Pharmacy), an 80% reduction in human effort (Amazon Global Engineering Services), and an improvement in content-quality assessment accuracy from 77% to 96% (Amazon A+). It then explores the methods behind these outcomes, including foundational approaches such as Supervised Fine-Tuning (SFT), Proximal Policy Optimization (PPO), and Direct Preference Optimization (DPO) for human alignment, as well as cutting-edge reasoning optimizations designed for agentic systems: Group Relative Policy Optimization (GRPO), Direct Advantage Policy Optimization (DAPO), and Group Sequence Policy Optimization (GSPO). Beyond their impact inside Amazon, these techniques offer practical guidance for research and applications across the AI field.

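Among the techniques named above, the group-based methods (GRPO and its descendants) share one core idea: instead of training a separate value critic, each sampled completion's reward is normalized against the other completions drawn for the same prompt. As a minimal illustrative sketch (not code from the post; the function name and reward values are invented for the example):

```python
import statistics

def group_relative_advantages(rewards):
    """GRPO-style advantage: normalize each sampled completion's reward
    by the mean and standard deviation of its own group, so no learned
    value critic is needed."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against zero variance
    return [(r - mean) / std for r in rewards]

# Hypothetical example: four sampled answers to one prompt,
# each scored by a reward model.
print(group_relative_advantages([1.0, 0.0, 0.5, 0.5]))
```

Completions scoring above the group mean get positive advantages and are reinforced; those below are suppressed. DAPO and GSPO refine how these per-group signals are clipped and aggregated (per token vs. per sequence), but build on the same critic-free normalization.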