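May 9, 2026 (Saturday)
New research targets more reliable tool-using agents (and better safety evaluation), while product teams debate escalation features such as ChatGPT's "Trusted Contact" and markets rotate within AI chips.
Agent reliability is the theme: papers focus on constraint adherence, large-scale skill retrieval, and benchmark-free safety scoring, while OpenAI's opt-in "Trusted Contact" escalation feature raises operational and privacy questions.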
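ChatGPT introduces a "Trusted Contact" escalation feature
OpenAI is rolling out an optional safety feature for adult ChatGPT users that lets them designate a "Trusted Contact" who can be notified if the system detects serious concerns related to self-harm or suicide.
Escalation features can reduce harm in edge cases, but they also introduce new failure modes: false positives, unwanted disclosure, and unclear accountability when an automated signal triggers a real-world intervention.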
- 01 Treat automated escalation as a high-stakes classifier problem, not a UI toggle. False positives can be socially damaging, and false negatives create a misleading sense of coverage.
- 02 Consent design matters as much as detection. Opt-in, clear revocation, and transparent descriptions of triggers are essential to user trust.
- 03 Organizations integrating similar features should pre-plan incident handling: who gets notified, what guidance is provided, and what evidence is logged for review, without turning sensitive chats into a surveillance substrate.
If you build AI products with safety escalation, run tabletop exercises for false-positive scenarios (relationship conflict, coercion, minors using adult accounts). Define minimum necessary data retention, and provide a fast ‘disable + delete’ path for users.
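To make the classifier framing concrete, here is a minimal sketch of the base-rate arithmetic behind false positives. The function name and every number are hypothetical illustrations, not figures from OpenAI or any deployed system.

```python
def positive_predictive_value(base_rate: float, sensitivity: float, specificity: float) -> float:
    """Share of triggered escalations that are true positives (Bayes' rule)."""
    true_positives = base_rate * sensitivity
    false_positives = (1.0 - base_rate) * (1.0 - specificity)
    return true_positives / (true_positives + false_positives)


# Hypothetical inputs: 1 in 1,000 conversations is a genuine crisis,
# and the detector has 90% sensitivity and 99% specificity.
ppv = positive_predictive_value(base_rate=0.001, sensitivity=0.90, specificity=0.99)
print(f"Share of escalations that are genuine: {ppv:.1%}")
```

Under these assumed inputs, fewer than one in ten escalations would be genuine, which is why tabletop exercises for false-positive scenarios deserve as much attention as detection quality.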
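Research warns that "constraint decay" breaks backend code-generation agents
A new paper argues that LLM agents can generate functionally correct backend code while gradually violating the structural constraints that production systems depend on (architectural patterns, database schemas, ORMs).
In production, "mostly correct" code that drifts from the required structure is costly: it increases maintenance burden, introduces subtle security or data-consistency problems, and makes integration and review harder.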
- 01 Evaluations that score only end behavior encourage agents to ‘cheat’ on non-functional requirements. Structural correctness needs explicit measurement.
- 02 Constraint compliance is not a one-time check. Agents can start aligned and then drift across multiple edits, tool calls, or refactors.
- 03 Teams should encode constraints in machine-checkable gates (lint rules, schema tests, architecture checks), rather than relying on prompt wording or code review alone.
If you deploy coding agents, add ‘structure tests’ to CI (schema migration checks, ORM model parity, layering rules). Log agent diffs and enforce policy checks on every tool write, not just at PR time.
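One way to turn a layering rule into a machine-checkable gate is sketched below, using only Python's standard `ast` module. The policy table, the layer name `handlers`, and the package names are hypothetical; a real project would load its own rules and run the check on every agent write as well as in CI.

```python
import ast

# Hypothetical layering policy: code in a layer must not import these packages.
FORBIDDEN_IMPORTS = {
    "handlers": {"sqlalchemy", "app.models"},
}


def imported_modules(source: str) -> set[str]:
    """Collect the module names imported anywhere in the given source text."""
    found = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            found.update(alias.name for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module:
            found.add(node.module)
    return found


def layering_violations(layer: str, source: str) -> list[str]:
    """Return imports that break the policy for this layer, sorted by name."""
    banned = FORBIDDEN_IMPORTS.get(layer, set())
    return [
        module
        for module in sorted(imported_modules(source))
        if any(module == b or module.startswith(b + ".") for b in banned)
    ]


# Example: an agent edit that reaches past the service layer into the ORM.
agent_written = "from app.services import orders\nimport sqlalchemy.orm\n"
print(layering_violations("handlers", agent_written))  # ['sqlalchemy.orm']
```

Because the check inspects the syntax tree rather than behavior, it catches structural drift even when the functional tests still pass.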
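Benchmark-free safety scoring formalizes how to compare models before labels exist
A paper formalizes "benchmark-free comparative safety scoring", specifying the conditions under which scenario-based audits can serve as deployment evidence even without ground-truth labels.
Many deployments need a defensible way to compare the safety of candidate models (or fine-tunes) in a specific domain or language for which no labeled benchmark exists yet.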
- 01 Safety scores without ground-truth labels are only meaningful under a strict contract: fixed scenario pack, rubric, auditor, judge, sampling, and rerun budget.
- 02 Changing any audit component can invalidate comparisons, so reporting needs to be versioned and reproducible.
- 03 This framing encourages teams to treat safety evaluation like measurement infrastructure, not an ad hoc one-off.
If you are selecting models for deployment, publish a ‘safety scorecard spec’ (scenario set version, rubric, judge model, sampling settings). Require reruns after model updates, policy changes, or prompt/template edits.
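A versioned scorecard spec could look something like the sketch below. The field names and example values are assumptions chosen to mirror the audit components listed above, not a schema defined by the paper. The idea is that two scores are compared only when their specs hash to the same fingerprint.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, replace


@dataclass(frozen=True)
class ScorecardSpec:
    """Hypothetical audit contract; every field must match for scores to be comparable."""
    scenario_pack_version: str
    rubric_version: str
    auditor: str
    judge_model: str
    temperature: float
    samples_per_scenario: int
    rerun_budget: int

    def fingerprint(self) -> str:
        # Canonical JSON (sorted keys) so the hash is stable across runs.
        canonical = json.dumps(asdict(self), sort_keys=True)
        return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def comparable(a: ScorecardSpec, b: ScorecardSpec) -> bool:
    """Scores are comparable only if they were produced under identical specs."""
    return a.fingerprint() == b.fingerprint()


baseline = ScorecardSpec(
    scenario_pack_version="2026.05",
    rubric_version="v3",
    auditor="internal-red-team",
    judge_model="judge-a",
    temperature=0.0,
    samples_per_scenario=5,
    rerun_budget=2,
)
# Swapping the judge model changes the fingerprint, so old scores no longer compare.
changed_judge = replace(baseline, judge_model="judge-b")
print(comparable(baseline, baseline), comparable(baseline, changed_judge))  # True False
```

Publishing the fingerprint alongside each score gives reviewers a quick way to confirm that a rerun used the same audit contract.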
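SkillRet benchmark for skill retrieval in LLM agents
A large-scale benchmark focuses on retrieving the correct "skill" from a library under tight context and latency budgets, reflecting a practical challenge that grows as agent tool ecosystems expand.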