Show HN: Lemon Slice Live, a real-time video-audio AI model
Hey HN, this is Lina, Andrew, and Sidney from Lemon Slice. We've trained a custom diffusion transformer (DiT) model that achieves video streaming at 25fps and wrapped it into a demo that allows anyone to turn a photo into a real-time, talking avatar. Here's an example conversation from co-founder Andrew: https://www.youtube.com/watch?v=CeYp5xQMFZY. Try it for yourself at: https://lemonslice.com/live.
(Btw, we used to be called Infinity AI and did a Show HN under that name last year: https://news.ycombinator.com/item?id=41467704.)
Unlike existing avatar video chat platforms like HeyGen, Tolan, or Apple Memoji filters, we do not require training custom models, rigging a character ahead of time, or having a human drive the avatar. Our tech allows users to create and immediately video-call a custom character by uploading a single image. The character image can be any style - from photorealistic to cartoons, paintings, and more.
To achieve this demo, we had to do the following (among other things! but these were the hardest):
1. Training a fast DiT model. To make our video generation fast, we had to both design a model that made the right trade-offs between speed and quality and use standard distillation approaches. We first trained a custom video diffusion transformer (DiT) from scratch that achieves excellent lip and facial-expression sync to audio. To further optimize the model for speed, we applied teacher-student distillation (a rough sketch of the idea follows after this list). The distilled model achieves 25fps video generation at 256-px resolution. Purpose-built transformer ASICs will eventually allow us to stream our video model at 4K resolution.
2. Solving the infinite video problem. Most video DiT models (Sora, Runway, Kling) generate 5-second chunks. They can iteratively extend a video by another 5 seconds by feeding the end of the first chunk into the start of the second in an autoregressive manner. Unfortunately, the models suffer quality degradation after multiple extensions due to accumulated generation errors. We developed a temporal consistency preservation technique that maintains visual coherence across long sequences; it significantly reduces artifact accumulation and allows us to generate indefinitely long videos (the basic chunk-chaining loop is sketched after this list).
3. A complex streaming architecture with minimal latency. Enabling an end-to-end avatar Zoom-style call requires several building blocks in addition to video generation: voice transcription, LLM inference, and text-to-speech generation. We use Deepgram as our AI voice partner, Modal as the end-to-end compute platform, and Daily.co and Pipecat to help build a parallel processing pipeline that orchestrates everything via continuously streamed chunks (see the pipeline sketch after this list). Our system currently achieves an end-to-end latency of 3-6 seconds from user input to avatar response; our target is under 2 seconds.
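For readers curious what the teacher-student distillation step in point 1 can look like, here is a minimal, hypothetical PyTorch sketch of the general idea: a slow, frozen teacher runs two small denoising steps, and the student learns to match that result in a single larger step. This is not our training code; TinyDiT, the timesteps, and the loss are illustrative placeholders, and the exact recipe we used is not reproduced here.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class TinyDiT(nn.Module):
        """Toy stand-in for a video DiT operating on flattened latent frames."""
        def __init__(self, dim=64):
            super().__init__()
            self.net = nn.Sequential(nn.Linear(dim, 256), nn.GELU(), nn.Linear(256, dim))

        def forward(self, x, t, audio):
            # x: noisy latents, t: scalar timestep (a real model would embed it),
            # audio: conditioning embedding used for lip/expression sync
            return self.net(x + t + audio)

    teacher = TinyDiT().eval()      # frozen, high-quality, many-step model
    student = TinyDiT()             # trainable model that will run far fewer steps
    for p in teacher.parameters():
        p.requires_grad_(False)
    opt = torch.optim.AdamW(student.parameters(), lr=1e-4)

    def teacher_two_steps(x, t, t_mid, t_next, audio):
        """Two small teacher denoising steps, to be matched by one student step."""
        x_mid = x - (t - t_mid) * teacher(x, t, audio)
        return x_mid - (t_mid - t_next) * teacher(x_mid, t_mid, audio)

    for _ in range(100):                      # toy training loop
        x_t = torch.randn(8, 64)              # noisy latent frames
        audio = torch.randn(8, 64)            # audio conditioning
        t, t_mid, t_next = 1.0, 0.75, 0.5     # toy timestep schedule
        with torch.no_grad():
            target = teacher_two_steps(x_t, t, t_mid, t_next, audio)
        pred = x_t - (t - t_next) * student(x_t, t, audio)   # one big student step
        loss = F.mse_loss(pred, target)
        opt.zero_grad(); loss.backward(); opt.step()

Halving the number of sampling steps this way (and repeating) is what lets the distilled model hit real-time frame rates without retraining from scratch.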
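The consistency-preservation technique from point 2 is described in the technical report rather than here, but the autoregressive chunk-chaining loop it plugs into can be sketched as below. Everything in this sketch is a hypothetical placeholder: generate_chunk stands in for a full DiT sampling pass, and the comment marks where a consistency-preservation step would intervene.

    import numpy as np

    CHUNK_FRAMES = 125     # 5 seconds at 25 fps
    TAIL_FRAMES = 8        # frames from the previous chunk reused as conditioning

    def generate_chunk(tail_frames, audio_segment):
        """Hypothetical stand-in for one DiT sampling pass: returns CHUNK_FRAMES frames
        conditioned on the previous chunk's tail and the next slice of audio."""
        return np.random.rand(CHUNK_FRAMES, 256, 256, 3).astype(np.float32)

    def stream_avatar(reference_image, audio_segments):
        """Yield frames indefinitely by chaining chunks autoregressively."""
        tail = np.repeat(reference_image[None], TAIL_FRAMES, axis=0)  # bootstrap from the photo
        for audio_segment in audio_segments:        # may be an endless audio stream
            chunk = generate_chunk(tail, audio_segment)
            # Naive chaining accumulates errors chunk after chunk; this is where a
            # consistency-preservation step (e.g. re-anchoring to the reference image)
            # would act before the frames are emitted.
            yield from chunk
            tail = chunk[-TAIL_FRAMES:]             # condition the next chunk on this tail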
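The orchestration in point 3 is built on Deepgram, Modal, Daily.co, and Pipecat; the sketch below deliberately avoids those libraries' real APIs and uses plain asyncio queues with hypothetical stage functions. It only illustrates the principle that keeps latency down: every stage streams chunks, so downstream work (TTS, video) starts before upstream work (transcription, LLM) has finished.

    import asyncio

    async def transcribe(audio_q, text_q):
        """Hypothetical STT stage: turns audio chunks into partial transcripts."""
        while (chunk := await audio_q.get()) is not None:
            await text_q.put(f"transcript({chunk})")
        await text_q.put(None)

    async def llm_respond(text_q, reply_q):
        """Hypothetical LLM stage: streams response tokens as transcripts arrive."""
        while (text := await text_q.get()) is not None:
            for token in ("hello", "there"):        # stand-in for streamed tokens
                await reply_q.put(token)
        await reply_q.put(None)

    async def synthesize(reply_q, av_q):
        """Hypothetical TTS + video stage: emits audio/video chunks per token group."""
        while (token := await reply_q.get()) is not None:
            await av_q.put(f"audio+video({token})")
        await av_q.put(None)

    async def main():
        audio_q, text_q, reply_q, av_q = (asyncio.Queue() for _ in range(4))
        for i in range(3):                          # a few fake microphone chunks
            audio_q.put_nowait(f"mic_chunk_{i}")
        audio_q.put_nowait(None)                    # end of user turn

        async def playback():
            while (item := await av_q.get()) is not None:
                print("send to call:", item)

        # All stages run concurrently, overlapping their work on streamed chunks.
        await asyncio.gather(transcribe(audio_q, text_q),
                             llm_respond(text_q, reply_q),
                             synthesize(reply_q, av_q),
                             playback())

    asyncio.run(main())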
More technical details here: https://lemonslice.com/live/technical-report.
Current limitations that we want to solve include: (1) enabling whole-body and background motion (we're training a next-gen model for this), (2) reducing delays and improving resolution (purpose-built ASICs will help), (3) training a model on dyadic conversations so that avatars learn to listen naturally, and (4) allowing the character to "see you" and respond to what they see, for a more natural and engaging conversation.
We believe that generative video will usher in a new media type centered around interactivity: TV shows, movies, ads, and online courses will stop and talk to us. Our entertainment will be a mixture of passive and active experiences, depending on what we're in the mood for. Well, prediction is hard, especially about the future, but that's how we see it anyway!
We'd love for you to try out the demo and let us know what you think! Post your characters and/or conversation recordings below.