2026年5月22日 (周五)
代理堆栈获得更多的生产形状:为团队提供沙箱运行时间,降低硬件屏障的更大但高效的MOE模型,以及针对吞吐量,隐私合规性,评价可靠性的研究. 如果您是航运代理商,则不同的是牵引装置(许可、隔离、日志和测试),而不仅仅是底模。
代理堆栈获得更多的生产形状:为团队提供沙箱运行时间,降低硬件屏障的更大但高效的MOE模型,以及针对吞吐量,隐私合规性,评价可靠性的研究. 如果您是航运代理商,则不同的是牵引装置(许可、隔离、日志和测试),而不仅仅是底模。
运行时间( YC P26) 将沙箱编码代理作为团队原始
运行时间将推出一个产品, 设定为“一个团队中的每个人的散装箱编码代理”,
编码代理以高影响方式失效,例如删除文件,泄露秘密,或进行意想不到的重播全局变化. Sandboxing将默认从信任转向遏制,这往往是有用工具与事件发生器的区别.
- 01 Agentic coding should be designed around containment first, not just prompt quality.
- 02 Team adoption depends on predictable environments: reproducible sandboxes, pinned dependencies, and clear boundaries on what an agent can touch.
- 03 Auditability becomes a product feature, because ‘why did it change this file?’ is the first question after any agent mistake.
Treat agent execution like CI: run in ephemeral sandboxes, mount only the needed repo paths, block outbound network by default, and require explicit approval for steps that write, delete, or open PRs. Keep a durable run log (inputs, tool calls, diffs) so reviews are fast when something goes wrong.
Cohere 命令 A+ 突出显示“ 盗版模型, 更少的 GPU 方向用于代理堆栈
Cohere发布Command A+,被描述为218B稀疏的Mixture-of-Experts模型从以前的变体整合,定位为代理工作流程,并报告以W4A4量化方式运行的H100最多.
Sparse MoE和积极的量化旨在扩大对强模型的获取,而不需要最大的集群. 对于代理构建者来说,更便宜的推论可以转化为更长的视野(更多的工具调用,更多的重试),但是,如果护栏没有用步数来缩放,也会增加错误的爆炸半径.
- 01 Lower inference cost tends to increase agent step counts, so safety controls must be step-aware (rate limits, budgets, and ‘stop conditions’).
- 02 Consolidating variants can simplify deployment and reduce ‘which model do we use?’ churn for product teams.
- 03 Multimodal capability is increasingly table stakes for agents operating in real workspaces (screenshots, PDFs, or mixed inputs).
If you adopt cheaper / higher-throughput models, add hard budgets: max tool calls, max write operations, and timeouts. Track per-task cost and failure modes (timeouts, loops, unsafe suggestions) and use those metrics as release gates, not after-the-fact dashboards.
研究推动硬性部分:平行溪流、隐私政策合规和耐污染评价
一套新文件侧重于缩放剂的可靠性:多结构有限责任公司探索分离提示、 " 思考 " 和I/O;POLAR-Bench评价了与敌对第三方互动的代理人的隐私-实用性权衡;关于耐污染基准的工作认为,目前的领导板越来越脆弱。
在生产方面,最昂贵的失败并不是小的事实错误。 它们是隐私的泄露,不安全的工具使用,以及那些在静态基准上看起来不错,但在真实的工作流程下破裂的系统. 这些文件表明,评价和架构,而不仅仅是模型大小,是下一个瓶颈。
- 01 If you cannot reliably separate ‘internal reasoning’ from ‘external outputs’, you will keep shipping agents that over-share or mis-handle private context.
- 02 Privacy-policy compliance is adversarial: third-party systems can actively prompt an agent to reveal disallowed data.
- 03 Benchmark contamination means you should measure robustness and real workflow success, not just benchmark deltas.
Add an agent test suite to CI that includes: (1) policy red-team prompts (must-not-share data), (2) tool-call misuse checks (reading forbidden paths, over-calling tools), and (3) multi-step recovery (safe abort, rollback, or escalation). Release-block on failures, and keep the tests private to reduce leakage.
Multi-Stream LLMs
Paper on separating or parallelizing model streams for prompts, reasoning, and I/O.
POLAR-Bench: A Diagnostic Benchmark for Privacy-Utility Trade-offs in LLM Agents
Benchmark for evaluating whether agents respect privacy policies under adversarial interaction.
LLM Benchmark Datasets Should Be Contamination-Resistant
Argument for ‘unlearnable’ benchmark designs to resist pretraining contamination.
Spotify 扩展 AI 音频工具与 11Labs 驱动音频书创建
Spotify正在推出由11Labs提供动力的音频书创作工具,这表示持续投资于创造者-造型AI工作流程,而不是纯粹的消费者聊天体验.
Spotify和UMG宣布AI生成的重混和封面为付费功能
Spotify与UMG的许可交易引入了即时驱动的重混和封面作为Premium加法,由艺术家选择退出和特许使用费设定,为消费者AI的创建增加了一个显著的权利和同意层.