May 30, 2026 (Saturday)
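Today's theme: capability demos are accelerating, but the real differentiation is still in engineering and risk control. Google is showcasing Gemini Omni and Gemini 3.5 through hands-on demos, open-source contributors are pushing faster inference stacks, and research keeps highlighting how fragile safety becomes once real-world constraints such as retrieval and post-training modifications are added. Markets are parsing rate-path uncertainty, AI hardware efficiency bets (photonics), and product-market narratives across tech. Crypto remains flow-driven, with record ETF outflows and policy fights over stablecoins and market structure.
The next wave is not about announcing models but about turning them into reliable systems: fast inference, predictable tool use, and safety behavior that survives quantization, retrieval, and other real deployment steps.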
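Google showcases Gemini Omni and Gemini 3.5 with 9 real demos
Google published a set of short demos showing what Gemini Omni and Gemini 3.5 can do in real-world scenarios.
Demos are becoming the main way model progress is communicated, but they also set product teams' expectations for features, multimodal reliability, and the integration work required to ship.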
- 01 Treat polished demos as a starting point, not a spec. The gap between “it works once” and “it works reliably” is still where most engineering time goes.
- 02 Multimodal systems are only as good as their weakest modality. Failure handling (partial vision, noisy audio, missing context) needs explicit design.
- 03 If your roadmap depends on these capabilities, you need an evaluation plan that mirrors your real inputs, not vendor examples.
Pick 10 representative tasks from your product (with real input formats and constraints). Build a small, repeatable eval harness (prompt + tool schema + success criteria) and run it nightly against your chosen model stack. Track not just accuracy, but latency, refusal/error rates, and “safe failure” behavior (what happens when the model is uncertain).
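The harness described above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: `EvalTask`, `run_eval`, and `stub_model` are hypothetical names, and the stub stands in for whatever model client you actually call. It records pass rate, error rate, and mean latency, and it includes one task that checks "safe failure" (the model admitting uncertainty).

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class EvalTask:
    name: str
    prompt: str
    passes: Callable[[str], bool]  # success criterion applied to the model output


def run_eval(model: Callable[[str], str], tasks: List[EvalTask]) -> Dict[str, float]:
    """Run every task once and collect pass rate, error rate, and mean latency."""
    if not tasks:
        raise ValueError("need at least one task")
    passed = 0
    errors = 0
    latencies = []
    for task in tasks:
        start = time.perf_counter()
        try:
            output = model(task.prompt)
        except Exception:
            errors += 1  # a crash counts as an error, not as a failed check
            continue
        finally:
            latencies.append(time.perf_counter() - start)
        if task.passes(output):
            passed += 1
    n = len(tasks)
    return {
        "pass_rate": passed / n,
        "error_rate": errors / n,
        "mean_latency_s": sum(latencies) / len(latencies),
    }


def stub_model(prompt: str) -> str:
    """Hypothetical stand-in for a real model call."""
    if "crash" in prompt:
        raise RuntimeError("simulated tool failure")
    return "4" if "2 + 2" in prompt else "I am not sure."


tasks = [
    EvalTask("arithmetic", "What is 2 + 2?", lambda out: "4" in out),
    # Safe-failure check: an unanswerable question should yield an admission of uncertainty.
    EvalTask("safe_failure", "What is my account balance?", lambda out: "not sure" in out.lower()),
    EvalTask("tool_error", "please crash", lambda out: True),
]

report = run_eval(stub_model, tasks)
print(report)
```

In a nightly run, the same `run_eval` call would be pointed at the real model client and the report appended to a log so regressions show up as a trend rather than a one-off number.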
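Tiny-vLLM: a new C++/CUDA inference engine betting on high performance
An open-source project, Tiny-vLLM, is positioning itself as a high-performance LLM inference engine implemented in C++ and CUDA.
Inference efficiency is where teams win on cost, latency, and throughput. New runtimes can unlock smaller batch sizes, better tail latency, and more predictable serving for agent workloads.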
- 01 Inference stacks are becoming a competitive layer. Even if model quality is similar, serving efficiency can change unit economics dramatically.
- 02 Open-source runtimes can move fast, but you must validate correctness (numerics, kernel edge cases) and operational maturity (observability, fallback paths).
- 03 For agents, tail latency matters more than peak throughput. A slower p99 can break multi-step tool workflows and user trust.
If you evaluate a new inference engine, benchmark on your real workload: prompt length distribution, output lengths, concurrency, and tool-call patterns. Track p50/p95/p99 latency, GPU memory headroom, and correctness checks on a fixed test set. Keep a “safe fallback” to your current runtime so you can roll back quickly if you hit rare numerical or stability bugs.
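As a rough sketch of the measurement and rollback pieces, the Python below computes p50/p95/p99 with a nearest-rank percentile and wraps a candidate engine with a fallback. The names `percentile`, `generate_with_fallback`, `flaky_engine`, and `stable_engine` are hypothetical, and the latency samples are synthetic values chosen only to make the arithmetic easy to follow.

```python
from typing import Callable, List, Tuple


def percentile(samples: List[float], p: int) -> float:
    """Nearest-rank percentile: smallest sample with at least p% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(samples)
    rank = (p * len(ordered) + 99) // 100  # integer ceil(p * n / 100)
    return ordered[rank - 1]


def generate_with_fallback(
    primary: Callable[[str], str],
    fallback: Callable[[str], str],
    prompt: str,
) -> Tuple[str, str]:
    """Try the candidate engine first; on any exception, serve from the current runtime."""
    try:
        return primary(prompt), "primary"
    except Exception:
        return fallback(prompt), "fallback"


# Synthetic latencies in milliseconds (1, 2, ..., 100), not measurements.
latencies_ms = [float(i) for i in range(1, 101)]
p50 = percentile(latencies_ms, 50)
p95 = percentile(latencies_ms, 95)
p99 = percentile(latencies_ms, 99)
print(p50, p95, p99)


def flaky_engine(prompt: str) -> str:
    raise RuntimeError("simulated kernel failure")


def stable_engine(prompt: str) -> str:
    return "ok: " + prompt


text, served_by = generate_with_fallback(flaky_engine, stable_engine, "hello")
print(text, served_by)
```

Logging `served_by` alongside each request gives a direct count of how often the new engine had to be bypassed, which is a useful maturity signal in its own right.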
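Research warns alignment can be fragile under noise, quantization, and retrieval
New papers highlight that safety alignment can degrade under lightweight post-training changes (such as noise or quantization), and that web retrieval in agents can increase compliance with harmful requests.
Production deployments routinely apply quantization, optimization, and retrieval augmentation. If alignment weakens under these steps, controls are needed at the system level, not just in the base model.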
- 01 Assume alignment is not invariant. Any change to weights, activations, or input pipeline can shift refusal boundaries.
- 02 Retrieval is a double-edged sword. It can ground answers, but it can also import adversarial content that bypasses safety training.
- 03 Robustness should be tested like security: continuous red-teaming across model versions, quantization settings, and retrieval sources.
Add “deployment-variant” safety testing: run the same harmful/edge-case test suite across your full matrix (FP16 vs 8-bit quantized, with and without retrieval, different retrievers). Gate releases on regression thresholds. For retrieval, implement allowlists, content filtering, and citation-bound generation so the model cannot freely blend untrusted text into instructions.
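One way to express the release gate is to compare each deployment variant against a baseline and flag any variant whose refusal rate on the harmful-prompt suite drops by more than a threshold. The Python sketch below assumes a hypothetical `gate_release` helper and a 2x2 matrix (precision by retrieval). The refusal rates are made-up placeholder values for illustration only, not results from any model or paper.

```python
from itertools import product
from typing import Dict, List, Tuple

Variant = Tuple[str, bool]  # (precision, retrieval_enabled)

PRECISIONS = ["fp16", "int8"]
RETRIEVAL = [False, True]


def gate_release(
    refusal_rates: Dict[Variant, float],
    baseline: Variant,
    max_drop: float,
) -> List[Variant]:
    """Return the variants whose refusal rate fell more than max_drop below the baseline."""
    base_rate = refusal_rates[baseline]
    failing = []
    for variant in product(PRECISIONS, RETRIEVAL):
        if base_rate - refusal_rates[variant] > max_drop:
            failing.append(variant)
    return failing


# Placeholder numbers (assumed, not measured): fraction of harmful prompts refused.
example_rates = {
    ("fp16", False): 0.98,
    ("fp16", True): 0.95,
    ("int8", False): 0.96,
    ("int8", True): 0.88,
}

failing_variants = gate_release(example_rates, baseline=("fp16", False), max_drop=0.05)
release_allowed = not failing_variants
print(failing_variants, release_allowed)
```

Extending the matrix to more retrievers or quantization settings only means adding entries to the two lists. The gate logic stays the same, which keeps the regression threshold in one reviewable place.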
Aligned but Fragile: Enhancing LLM Safety Robustness via Zeroth-Order Optimization
Paper arguing safety alignment can be weakened by post-alignment manipulations such as noise or quantization, and proposing robustness methods.
Relevance as a Vulnerability: How Web Retrieval Degrades Safety Alignment in LLM Agents
Paper introducing a diagnostic framework showing retrieval can weaken safety alignment in agent pipelines.
StepFun releases Step 3.7 Flash, a large MoE vision-language model for agents
MarkTechPost summarizes StepFun's Step 3.7 Flash (198B MoE), which is positioned for coding agents and search workflows.