2026年4月14日 (周二)
对最重要的AI,公共市场和密码 进行实际的,与源相连的综述 在过去的24小时内。
今天的AI饲料将治理风险和计量分开:一份报告说,官员们可能正在推动银行测试Anthropic模型,而新的论文和社区项目试图使LLM评价更加现实,从能感推断基准到模型能否在真正的代码库中找到真正的缺陷. 实用信息:将模型选择视为风险决定,并将基准视为不完整,直到可以在自己的环境中复制.
报告:官员可能鼓励银行测试Anthropic的神话模式
TechCrunch报道称,特朗普政府官员可能鼓励银行试行名为Mythos的Anthropic模型,尽管近期政府担心Anthropic是供应链风险.
如果准确,它显示AI供应商的选择可以被政策信号塑造,而不仅仅是模型质量. 对受监管的公司来说,这增加了业务风险:飞行员可以在一夜之间变得具有政治敏感性,而供应商集中化的速度比内部控制能够跟上的速度要快。
- 01 Model adoption in regulated industries is becoming a governance exercise (security, compliance, regulators, and public scrutiny), not a simple product decision.
- 02 A ‘preferred vendor’ narrative can flip quickly, so portability (prompts, evals, and audit trails) matters as much as raw capability.
- 03 Treat early pilots as evidence-gathering, with clear exit criteria, so you can switch providers without restarting from zero.
Create a portable model-evaluation packet for every AI feature: your test prompts, success metrics, red-team cases, and privacy requirements. Re-run the same packet on every candidate model and keep the artifacts ready for audit.
Watt伯爵为LLM推论提出了一个能感基准
一份新的arXiv论文介绍了Watt Counts,这是一套数据集和基准,重点是衡量不同GPU设置的LLM推论的能耗。
推论成本不仅仅是每个象征性的美元,只有动力和冷却限制才能限制吞吐量。 如果你在规模上运行模型, 能量感知剖面可以改变哪个模型,量化, 和硬件组合是实际可行的。
- 01 Energy, latency, and throughput trade off differently across GPUs, so ‘fastest’ is not necessarily ‘most efficient’ for your workload.
- 02 Benchmarks that include energy measurements help operators avoid surprises when scaling from a demo to production.
- 03 Sustainable inference is increasingly a competitive lever for providers and an internal constraint for teams running on-prem or at the edge.
Add power and cost-per-1K-tokens to your internal eval dashboard. If you cannot measure it directly, start by comparing GPU utilization, latency percentiles, and batch size sensitivity for your real traffic.
N- Day- Bench 询问 LLMS 在真实代码库中是否能找到真正的弱点
一个名为N-Day-Bench的社区项目收集现实世界的脆弱性案例,并评价LLMS是否能够在原始代码库中识别它们.
安全评价常常因为任务是合成的而失败. 实事求是的bug搜索测试帮助您理解一个代理是否对分解和审查有用,或者它是否主要产生自信的噪音.
- 01 Real-code evaluation surfaces failure modes that toy benchmarks hide: dependency context, build systems, and ambiguous intent.
- 02 Vulnerability-finding is high-risk because false positives waste time and false negatives create a dangerous sense of coverage.
- 03 The most valuable outcome may be process improvements (better checklists and review workflows), not just model scores.
If you use LLMs for security review, run them in a constrained workflow: require citations to specific files and lines, force a minimal reproducer or proof sketch, and gate any automated patching behind human review.
对照LLMS的卡片:设定幽默比对标准
研究人员在“反人类卡”式设置上测试前沿模型,以衡量与人类基线相比的幽默偏好。
Bench复制者:评估社会和行为科学中的可复制性
在数据提供不一致的情况下,一个衡量LLM代理是否能够支持复制工作的基准.
NVIDIA PhysicsNeMo 教程:达西流,FNOs,PINNs,代建模
Colab上的PhysicsNeMo逐步走过,为物理知情的ML和基准推论构建了工作流程.