Ask HN: Is this possible? Gemini's Long Context MoE Architecture (Hypothesized)
Gemini's Long Context MoE Architecture (Hypothesized):
Sharing how I think (hypothesis) Gemini models achieve their 1-10 million token long context window, along with the clues that support it.
An Ensemble of Experts (EoE) or Mesh of Experts (MeoE) with a common/shared long (1-10M token) context window.
Gemini's 1M+ token MoE likely uses "instances" (active expert sets / TPU shards) that share a common distributed context; individual active expert groups then use the relevant "parts" of this vast context for generation. This allows concurrent, independent requests to be served via distinct system "partitions."
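To make that concrete, here is a purely illustrative Python sketch of the mapping I have in mind. Every name in it (ContextShard, Instance, assign_instance) is invented for illustration, not taken from any Google system; it just shows a request being assigned an expert set plus only the shards of the shared context it needs.

    # Hypothetical sketch only; names and numbers are made up for illustration.
    from dataclasses import dataclass

    @dataclass
    class ContextShard:
        shard_id: int        # slice of the shared 1M-10M token context
        tpu_group: int       # which group of chips in the pod holds it
        start: int           # first token index covered by this shard
        end: int             # one past the last token index covered

    @dataclass
    class Instance:
        request_id: str
        active_experts: list[int]   # the sparse "dynamic pathway" for this request
        shard_ids: list[int]        # the parts of the shared context it reads

    def assign_instance(request_id, token_range, shards, active_experts):
        """Map a request onto an expert set plus only the context shards it needs."""
        lo, hi = token_range
        needed = [s.shard_id for s in shards if s.start < hi and s.end > lo]
        return Instance(request_id, active_experts, needed)

    # Two concurrent requests share one pool of shards but get independent
    # expert sets and only the context parts they actually touch.
    shards = [ContextShard(i, tpu_group=i // 4, start=i * 250_000, end=(i + 1) * 250_000)
              for i in range(8)]                        # 8 x 250k = 2M shared tokens
    r1 = assign_instance("req-1", (0, 400_000), shards, active_experts=[3, 17])
    r2 = assign_instance("req-2", (1_500_000, 2_000_000), shards, active_experts=[5, 42])
    print(r1.shard_ids, r2.shard_ids)                   # [0, 1] [6, 7]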
The context is sharded and managed across many interconnected TPUs within a pod.
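As a rough illustration of what "sharded across TPUs in a pod" could look like, here is a minimal JAX sketch that places a (much smaller than 1M-10M token) KV cache across the available devices along the sequence axis. The shapes and the choice to shard only the sequence dimension are my assumptions, not anything published about Gemini.

    # Minimal JAX sketch (hypothetical): shard a KV cache along the sequence axis.
    import numpy as np
    import jax
    import jax.numpy as jnp
    from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

    devices = np.array(jax.devices())                 # chips visible to this host
    mesh = Mesh(devices, axis_names=("seq",))         # one mesh axis: the sequence

    # Illustrative sizes; a real 1M-10M token cache would be far larger.
    seq_len, num_heads, head_dim = 8192, 16, 128
    kv_cache = jnp.zeros((seq_len, num_heads, head_dim), dtype=jnp.bfloat16)

    # Each chip holds a contiguous slice of the sequence dimension; attending
    # over the full window then requires cross-chip communication over ICI.
    kv_sharded = jax.device_put(kv_cache, NamedSharding(mesh, P("seq", None, None)))
    print(kv_sharded.sharding)                        # per-device layout of the cache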
For any given input, only a sparse set of specialized "expert" subnetworks within the total model (a "dynamic pathway") is activated, depending on the complexity and context required.
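This is standard top-k gating in the spirit of the Shazeer/Switch line of work. A tiny JAX sketch of what "only a sparse set of experts is activated" means (untrained random gate, illustrative sizes, not Gemini's actual router):

    import jax
    import jax.numpy as jnp

    def route_top_k(x, w_gate, k=2):
        # x: [tokens, d_model] activations, w_gate: [d_model, num_experts] gate.
        logits = x @ w_gate                           # per-token score for each expert
        top_vals, top_idx = jax.lax.top_k(logits, k)  # keep only the k best experts
        weights = jax.nn.softmax(top_vals, axis=-1)   # mixing weights for those k
        return top_idx, weights                       # which experts fire, and how much

    key = jax.random.PRNGKey(0)
    x = jax.random.normal(key, (4, 64))               # 4 tokens, d_model = 64
    w_gate = jax.random.normal(key, (64, 8))          # 8 experts, only k=2 active per token
    idx, w = route_top_k(x, w_gate)
    print(idx)                                        # 2 expert ids chosen per token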
The overall MoE model can serve many user requests concurrently.
Each request, with its specific input and context, triggers its own distinct and isolated pathway of active experts.
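Continuing the routing sketch above: two concurrent requests pushed through the same (random, untrained) gate typically end up activating different expert subsets, i.e. their own isolated pathways. Again purely illustrative.

    import jax
    import jax.numpy as jnp

    num_experts, d_model, k = 8, 64, 2
    key_a, key_b, key_gate = jax.random.split(jax.random.PRNGKey(1), 3)
    w_gate = jax.random.normal(key_gate, (d_model, num_experts))

    def experts_for_request(tokens):
        # Union of expert ids activated by this request's tokens.
        logits = tokens @ w_gate
        _, top_idx = jax.lax.top_k(logits, k)         # [tokens, k] chosen experts
        return sorted(set(jnp.unique(top_idx).tolist()))

    req_a = jax.random.normal(key_a, (16, d_model))   # 16 tokens from request A
    req_b = jax.random.normal(key_b, (16, d_model))   # 16 tokens from request B
    print(experts_for_request(req_a))                 # pathway for request A
    print(experts_for_request(req_b))                 # usually a different subset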
The shared context can also act as independent shards of (mini) contexts.
The massively distributed MoE architecture, spread across the TPUs of a single pod, has its long context sharded and managed via parallelism. It can handle concurrent requests, each served from a part of that context window by an independent expert pathway across the pod, and it can also use the entire context window for a single request if required.
Evidence pointing to this: Google's pioneering MoE research (Shazeer, GShard, Switch); advanced TPUs (v4/v5p/Ironwood) with massive HBM and high-bandwidth 3D torus / OCS inter-chip interconnect (ICI), which enable the necessary distribution (MoE expert placement, sequence parallelism such as Ring Attention); and TPU pod memory (HBM) capacities that line up with 10M-token context needs. Google's Pathways and system-level optimizations further support this distributed, concurrent model.
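Since Ring Attention is the kind of sequence parallelism the ICI topology enables, here is a heavily simplified, hypothetical JAX sketch of the idea: each device keeps one block of keys/values and rotates it around the device ring with ppermute, so every query block eventually attends to the whole context without any single chip holding it all. It omits the running-max rescaling the real algorithm uses for numerical stability, and all shapes are illustrative.

    import numpy as np
    import jax
    import jax.numpy as jnp
    from jax.sharding import Mesh, PartitionSpec as P
    from jax.experimental.shard_map import shard_map

    mesh = Mesh(np.array(jax.devices()), axis_names=("seq",))
    n_dev = mesh.devices.size

    def ring_attention(q, k, v):
        # q, k, v: this device's block of the sequence, shape [block_len, head_dim].
        num = jnp.zeros_like(q)                        # running softmax numerator
        den = jnp.zeros((q.shape[0], 1), q.dtype)      # running softmax denominator
        def step(_, carry):
            num, den, k, v = carry
            scores = jnp.exp(q @ k.T / jnp.sqrt(q.shape[-1]))
            num = num + scores @ v
            den = den + scores.sum(axis=-1, keepdims=True)
            # pass the resident K/V block to the next device around the ring
            perm = [(j, (j + 1) % n_dev) for j in range(n_dev)]
            return num, den, jax.lax.ppermute(k, "seq", perm), jax.lax.ppermute(v, "seq", perm)
        num, den, _, _ = jax.lax.fori_loop(0, n_dev, step, (num, den, k, v))
        return num / den                               # attention over the full context

    attn = shard_map(ring_attention, mesh=mesh,
                     in_specs=(P("seq", None), P("seq", None), P("seq", None)),
                     out_specs=P("seq", None))

    # Sequence length must divide evenly across devices; sizes are toy values.
    q = k = v = jax.random.normal(jax.random.PRNGKey(0), (1024, 64))
    out = jax.jit(attn)(q, k, v)                       # context never gathered onto one chip
    print(out.shape)                                   # (1024, 64)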
og X thread: https://x.com/ditpoo/status/1923966380854157434