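Wednesday, April 8, 2026
A practical, source-linked roundup of the most important developments in AI, public markets, and crypto over the past 24 hours.
Benchmarks and safety evaluations keep expanding into more realistic settings (multimodal scientific diagrams, multi-stream embodied tasks, and agent runtimes). At the same time, high-profile model documentation and safety write-ups are pushing teams to treat capability gains and operational risks (prompt injection, tool misuse, code reconstruction artifacts) as two sides of the same release cycle.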
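Anthropic publishes Claude Mythos Preview system card and cybersecurity evaluation
Two related publications circulated widely: the system card PDF for Claude Mythos Preview and a companion article assessing the model's cybersecurity capabilities.
System cards and domain-specific evaluations are increasingly the practical tools that security, legal, and product teams rely on to set deployment policy. For operators of tool-using agents, such documents are only useful once they are translated into concrete guardrails: what is blocked, what is logged, and what is allowed to execute.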
- 01 Treat model documentation as an input to policy, not marketing: map claims to enforceable controls in your runtime.
- 02 Cybersecurity capability shifts can change your threat model overnight, especially for agents with file/network access.
- 03 The highest risk is usually not the model’s raw ability, but what the surrounding system lets it do by default.
Update your agent release checklist: require a short internal “system card delta” note for every model upgrade (new strengths, new failure modes, and the single most important policy change you will enforce).
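One way to make the "system card delta" and the blocked/logged/allowed split concrete is a small policy table checked by the agent runtime. The sketch below is illustrative only: the `SystemCardDelta` fields, the action names such as `fs.write`, and the `decide` helper are hypothetical and not taken from any published system card or agent framework.

```python
from __future__ import annotations

from dataclasses import dataclass
from enum import Enum


class Decision(Enum):
    BLOCK = "block"
    LOG = "log"      # permitted, but recorded for later review
    ALLOW = "allow"


@dataclass(frozen=True)
class SystemCardDelta:
    """Hypothetical internal note filed with every model upgrade."""
    model: str
    new_strengths: tuple[str, ...]
    new_failure_modes: tuple[str, ...]
    policy_change: str  # the single most important change to enforce


# Hypothetical action names; real ones depend on your agent runtime.
POLICY: dict[str, Decision] = {
    "fs.read": Decision.ALLOW,
    "fs.write": Decision.LOG,
    "net.request": Decision.BLOCK,
    "shell.exec": Decision.BLOCK,
}


def decide(action: str) -> Decision:
    """Look up an action; anything not listed is blocked (default deny)."""
    return POLICY.get(action, Decision.BLOCK)
```

Keeping the table in code makes each model upgrade a reviewable diff: the delta note explains why a row changed, and the runtime enforces it.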
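Feynman Bench targets diagram-structured multimodal physics reasoning
A new arXiv benchmark proposes evaluating multimodal LLMs on tasks centered on Feynman diagrams, emphasizing global structural logic over local extraction.
Teams building scientific or engineering copilots often hit a wall: a model can read the labels but fails on the underlying formal structure. Benchmarks that stress diagram reasoning help predict whether a model will be reliable in real analysis workflows, not just whether it understands a diagram at the presentation level.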
- 01 If your product relies on diagrams, evaluate for global consistency (structure and constraints), not just captioning.
- 02 Multimodal performance can look strong on “spot the text” tests while still failing at symbolic or relational logic.
- 03 Better benchmarks are a forcing function: they expose where tool augmentation (calculators, solvers) is still needed.
Create a small internal evaluation set of 20 real diagrams from your domain (schematics, plots, network diagrams). Score models on: (1) constraint validity, (2) step-by-step derivations, and (3) whether answers remain correct when you permute labels.
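The label-permutation check in item (3) can be sketched as follows. This is a minimal illustration under stated assumptions: a diagram is reduced to a list of labeled edges, the "model" is any callable that returns a node label, and both stub models are hypothetical stand-ins for a real multimodal model call.

```python
import random
from collections import Counter
from typing import Callable

Edge = tuple[str, str]
Model = Callable[[list[Edge]], str]


def permute_labels(edges: list[Edge], rng: random.Random):
    """Rename every node consistently; the structure stays the same."""
    labels = sorted({node for edge in edges for node in edge})
    shuffled = labels[:]
    rng.shuffle(shuffled)
    mapping = dict(zip(labels, shuffled))
    return [(mapping[a], mapping[b]) for a, b in edges], mapping


def permutation_consistency(model: Model, edges: list[Edge],
                            trials: int = 20, seed: int = 0) -> float:
    """Fraction of relabelings where the answer follows the relabeling."""
    rng = random.Random(seed)
    baseline = model(edges)
    hits = 0
    for _ in range(trials):
        permuted, mapping = permute_labels(edges, rng)
        if model(permuted) == mapping[baseline]:
            hits += 1
    return hits / trials


def structural_stub(edges: list[Edge]) -> str:
    """Stand-in model that reasons from structure: the highest-degree node."""
    degree = Counter(node for edge in edges for node in edge)
    return max(sorted(degree), key=lambda node: degree[node])


def label_biased_stub(edges: list[Edge]) -> str:
    """Stand-in model that keys on label text instead of structure."""
    return min(node for edge in edges for node in edge)
```

A model that reasons from structure should score 1.0 on a diagram with a single well-defined answer, while one that leans on label text will drop as soon as the labels move.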
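Research highlights an agent safety gap: "safe" LLMs can become unsafe agents
An arXiv paper argues that safety evaluations that stop at chat alignment miss the larger risk surface of agents running with real permissions on users' machines.
In agent settings, the primary failure is not a bad answer but an unsafe action. This pushes organizations toward defense in depth: sandboxing, strict tool permissions, auditable traces, and workflows that resist prompt injection.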
- 01 Agent safety is an execution problem: permissioning, isolation, and auditability matter as much as model alignment.
- 02 Prompt injection is a systems vulnerability when the agent can read untrusted content and then act.
- 03 Define “unsafe” in operational terms (file writes, network calls, secret access) and test those pathways explicitly.
Add a “privilege budget” to your agent runs: default to no network, no shell, and read-only filesystem. Only grant capabilities per task via an allowlist, and log every elevation with a human-readable reason.
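A privilege budget along these lines could look like the sketch below. The capability names (`fs.read`, `net`, `shell`), the `PrivilegeBudget` class, and the logger name are all hypothetical; a real deployment would enforce the same decisions at the sandbox or OS level rather than only in Python.

```python
from __future__ import annotations

import logging
from dataclasses import dataclass, field

log = logging.getLogger("agent.privilege")  # hypothetical logger name

# Default posture: read-only filesystem, no network, no shell.
ALWAYS_ALLOWED = frozenset({"fs.read"})


@dataclass
class PrivilegeBudget:
    """Per-task budget: nothing beyond the defaults until explicitly elevated."""
    task_allowlist: frozenset[str]
    granted: set[str] = field(default_factory=set)
    audit: list[tuple[str, str]] = field(default_factory=list)

    def elevate(self, capability: str, reason: str) -> None:
        """Grant an allowlisted capability and record why it was needed."""
        if not reason.strip():
            raise ValueError("elevation requires a human-readable reason")
        if capability not in self.task_allowlist:
            raise PermissionError(f"{capability!r} is not allowlisted for this task")
        self.granted.add(capability)
        self.audit.append((capability, reason))
        log.info("elevated %s: %s", capability, reason)

    def check(self, capability: str) -> bool:
        """True only for default or explicitly elevated capabilities."""
        return capability in ALWAYS_ALLOWED or capability in self.granted
```

For example, a task created with `PrivilegeBudget(task_allowlist=frozenset({"net"}))` can call `elevate("net", "fetch release notes")`, but an attempt to elevate `"shell"` raises `PermissionError`, and every successful elevation lands in `audit` with its reason.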
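Poisoned identifiers can persist through LLM deobfuscation
A case study reports that poisoned variable and identifier names in obfuscated JavaScript can survive into the reconstructed code even when the model appears to understand the semantics, highlighting a subtle integrity risk in automated reverse engineering.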
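One cheap guard against this failure mode is to flag any identifier that appears verbatim in both the obfuscated input and the reconstructed output, so a reviewer can judge whether the name is trustworthy. The sketch below is an assumption-laden heuristic, not the method used in the case study: it relies on a rough regex token scan (which also picks up words inside strings and comments) and an abbreviated reserved-word list, where a real pipeline would use a JavaScript parser.

```python
import re

# Rough identifier pattern; the lookbehind avoids matching inside tokens like 0x1a.
IDENT = re.compile(r"(?<![A-Za-z0-9_$])[A-Za-z_$][A-Za-z0-9_$]*")

# Abbreviated list of JavaScript reserved words, for illustration only.
JS_RESERVED = frozenset({
    "break", "case", "catch", "class", "const", "continue", "default",
    "delete", "do", "else", "false", "finally", "for", "function", "if",
    "in", "instanceof", "let", "new", "null", "return", "switch", "this",
    "throw", "true", "try", "typeof", "var", "void", "while",
})


def identifiers(js_source: str) -> set[str]:
    """Collect identifier-like tokens, minus reserved words."""
    return set(IDENT.findall(js_source)) - JS_RESERVED


def carried_over(obfuscated: str, reconstructed: str) -> set[str]:
    """Names the reconstruction inherited verbatim from the untrusted input."""
    return identifiers(obfuscated) & identifiers(reconstructed)


obfuscated = "function validateInput(_0x1a){return _0x1a*2}"
reconstructed = "function validateInput(value){return value*2}"
suspicious = carried_over(obfuscated, reconstructed)
```

Here `suspicious` contains `validateInput`: the name came straight from the untrusted source and says nothing reliable about what the function does, so it deserves a manual look before anyone trusts the reconstructed code.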
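ST-Bench benchmarks multi-stream coordination in two-party tasks
A benchmark framework focuses on spatiotemporal coordination across multiple sensory streams in two-party tasks, emphasizing planning and synchronization rather than single-step perception.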