Launch HN: BlankBio (YC S25) – Making RNA Programmable
Hey HN, we're Phil, Ian, and Jonny, and we're building BlankBio ([https://blank.bio](https://blank.bio)). We're training RNA foundation models to power a computational toolkit for therapeutics. Our first application is mRNA design, where our vision is for any biologist to be able to design an effective therapeutic sequence ([https://www.youtube.com/watch?v=ZgI7WJ1SygI](https://www.youtube.com/watch?v=ZgI7WJ1SygI)).
BlankBio grew out of our PhD work in this area, which is open source: there's a model [2] and a benchmark with API access [0].
mRNA has the potential to encode vaccines, gene therapies, and cancer treatments, yet designing effective mRNA remains a bottleneck. Today, scientists design mRNA by manually editing sequences (AUGCGUAC...) and testing the results through trial and error. It's like writing assembly code and managing individual memory addresses. The field is flooded with capital aimed at therapeutics companies: Strand ($153M), Orna ($221M), Sail Biomedicines ($440M), but the tooling for these problems remains low-level. That's the gap we're aiming to close.
A big problem is that mRNA sequences are hard to interpret. They encode properties like half-life (how long the RNA survives in cells) and translation efficiency (protein output), but we don't know how to optimize for them. To get effective treatments we need more precision: scientists need sequences that target specific cell types, so that dosage and side effects can be reduced.
We envision a future where RNA designers operate at a higher level of abstraction. Imagine code like this:
```python
seq = "AUGCAUGCAUGC..."
seq = BB.half_life(seq, target="6 hours")
seq = BB.cell_type(seq, target="hepatocytes")
seq = BB.expression(seq, level="high")
```
To get there, we need generalizable RNA embeddings from pre-trained models. During our PhDs, Ian and I worked on self-supervised learning (SSL) objectives for RNA. SSL lets us train on unlabeled data, which has two advantages: (1) we don't depend on noisy experimental labels, and (2) there is far more unlabeled data than labeled data. The challenge is that standard NLP approaches don't work well on genomic sequences.
Using a joint-embedding architecture (contrastive learning), we trained a model to recognize functionally similar sequences rather than to predict every nucleotide. This worked remarkably well. Our 10M-parameter model, Orthrus, trained on 4 GPUs for 14 hours, beats Evo2, a 40B-parameter model trained on 1,000 GPUs for a month [0]. On mRNA half-life prediction, simply fitting a linear regression on our embeddings outperforms supervised models. That academic work is the foundation of what we're building now: we're improving the training algorithm, growing the pre-training dataset, and scaling up parameters, with the goal of designing effective mRNA therapeutics.
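To make the linear-probe setup concrete, here's a minimal sketch. The `embed` function below is a stand-in (a random projection of 3-mer counts) so the script runs end to end; in practice it would call a frozen pre-trained encoder such as Orthrus [2], whose actual API lives in the repo and may differ.

```python
# Sketch of linear probing on frozen RNA embeddings for half-life prediction.
import itertools
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
KMERS = ["".join(k) for k in itertools.product("AUGC", repeat=3)]

def embed(sequences, dim=256):
    # Stand-in encoder: random projection of 3-mer counts.
    # Replace with the frozen pre-trained model's embedding call.
    counts = np.array([[s.count(k) for k in KMERS] for s in sequences], dtype=float)
    return counts @ rng.normal(size=(len(KMERS), dim))

# Toy data; replace with real transcripts and measured half-lives.
seqs = ["".join(rng.choice(list("AUGC"), size=200)) for _ in range(500)]
half_life = rng.normal(size=500)

X = embed(seqs)  # frozen embeddings, no fine-tuning of the encoder
X_tr, X_te, y_tr, y_te = train_test_split(X, half_life, test_size=0.2, random_state=0)
probe = Ridge(alpha=1.0).fit(X_tr, y_tr)  # a plain linear model on top
print("held-out R^2:", probe.score(X_te, y_te))
```

The point of the setup is that the encoder stays frozen and only a linear model is fit on top, so downstream performance directly measures the quality of the embeddings.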
We have a lot to say about why other SSL approaches work better than next-token prediction and masked language modeling; some of it is in Ian's blog post [1] and our paper [2]. The big takeaway is that the current recipe of applying NLP-style scaling to biological sequence models won't get us all the way there: roughly 90% of the genome can mutate without affecting fitness, so training a model to predict that noisy sequence yields suboptimal embeddings [3].
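For readers less familiar with joint-embedding objectives, here is a generic InfoNCE-style contrastive loss in PyTorch. It's purely illustrative of the idea of pulling embeddings of functionally related sequences together (e.g. matched isoforms or orthologous transcripts) while pushing unrelated ones apart; the exact positive-pair construction and loss used in Orthrus are described in the paper [2].

```python
# Generic InfoNCE-style contrastive loss (illustrative, not the exact Orthrus objective).
# z_a[i] and z_b[i] are embeddings of two functionally related sequences;
# all other rows in the batch act as negatives.
import torch
import torch.nn.functional as F

def info_nce(z_a: torch.Tensor, z_b: torch.Tensor, temperature: float = 0.1) -> torch.Tensor:
    z_a = F.normalize(z_a, dim=-1)
    z_b = F.normalize(z_b, dim=-1)
    logits = z_a @ z_b.T / temperature                     # (batch, batch) similarities
    targets = torch.arange(z_a.size(0), device=z_a.device)  # matched pairs on the diagonal
    return F.cross_entropy(logits, targets)

# Example with random embeddings standing in for encoder outputs:
loss = info_nce(torch.randn(32, 128), torch.randn(32, 128))
print(loss.item())
```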
We see strong parallels between the digital and RNA revolutions. In the early days of computing, programmers wrote assembly, managing registers and memory addresses directly. Today's RNA designers manually tweak sequences through trial and error to improve stability or reduce immunogenicity. Just as compilers freed programmers from low-level details, we're building the abstraction layer for RNA.
We currently have pilots with a few early-stage biotechs to prove out the utility of our embeddings, and our open-source model is used by folks at Sanofi and GSK. We're looking for: (1) partners working on RNA-adjacent modalities; (2) feedback from anyone who has tried to design RNA sequences: what were your pain points?; and (3) ideas for other applications! We've chatted with some biomarker companies, and preliminary analyses show improved stratification.
Thanks for reading. Happy to answer questions about the technical approach, why genomics is different from language, or anything else.
- Phil, Ian, and Jonny
founders@blankbio.com
[0] mRNABench: [https://www.biorxiv.org/content/10.1101/2025.07.05.662870v1](https://www.biorxiv.org/content/10.1101/2025.07.05.662870v1)
[1] Ian's blog post on scaling: [https://quietflamingo.substack.com/p/scaling-is-dead-long-live-scaling](https://quietflamingo.substack.com/p/scaling-is-dead-long-live-scaling)
[2] Orthrus: [https://www.biorxiv.org/content/10.1101/2024.10.10.617658v3](https://www.biorxiv.org/content/10.1101/2024.10.10.617658v3)
[3] Zoonomia: [https://www.science.org/doi/10.1126/science.abn3943](https://www.science.org/doi/10.1126/science.abn3943)
To get there we need generalizable RNA embeddings from pre-trained models. During our PhDs, Ian and I worked on self-supervised learning (SSL) objectives for RNA. This approach allows us to train on unlabeled data and has advantages: (1) we don't require noisy experimental data, and (2) the amount of unlabeled data is significantly greater than labeled. However the challenge is that standard NLP approaches don't work well on genomic sequences.<p>Using joint embedding architecture approaches (contrastive learning), we trained model to recognize functionally similar sequences rather than predict every nucleotide. This worked remarkably well. Our 10M parameter model, Orthrus, trained on 4 GPUs for 14 hours, beats Evo2, a 40B parameter model trained on 1000 GPUs for a month [0]. On mRNA half-life prediction, just by fitting a linear regression on our embeddings, we outperform supervised models. This work done during our academic days is the foundation for what we're building. We're improving training algorithms, growing the pre-training dataset, and making use of parameter scaling with the goal of designing effective mRNA therapeutics.<p>We have a lot to say about why other SSL approaches work better than next-token prediction and masked language modeling: some of which you can check out in Ian's blog post [1] and our paper [2]. The big takeaway is that the current approaches of applying NLP to scaling models for biological sequences won't get us all the way there. 90% of the genome can mutate without affecting fitness so training models to predict this noisy sequence results in suboptimal embeddings [3].<p>We think there are strong parallels between the digital and RNA revolutions. In the early days of computing, programmers wrote assembly code, managing registers and memory addresses directly. Today's RNA designers are manually tweaking sequences, improving stability or reduce immunogenicity through trial and error. As compilers freed programmers from low-level details, we're building the abstraction layer for RNA.<p>We currently have pilots with a few early stage biotechs proving out utility of our embeddings and our open source model is used by folks at Sanofi & GSK. We're looking for: (1) partners working on RNA adjacent modalities (2) feedback from anyone who's tried to design RNA sequences what were your pain points?, and (3) Ideas for other applications! We chatted with some biomarker providing companies, and some preliminary analyses demonstrate improved stratification.<p>Thanks for reading. Happy to answer questions about the technical approach, why genomics is different from language, or anything else.<p>- Phil, Ian, and Jonny<p>founders@blankbio.com<p>[0] mRNABench: <a href="https://www.biorxiv.org/content/10.1101/2025.07.05.662870v1" rel="nofollow">https://www.biorxiv.org/content/10.1101/2025.07.05.662870v1</a><p>[1] Ian’s Blog on Scaling: <a href="https://quietflamingo.substack.com/p/scaling-is-dead-long-live-scaling" rel="nofollow">https://quietflamingo.substack.com/p/scaling-is-dead-long-li...</a><p>[2] Orthrus: <a href="https://www.biorxiv.org/content/10.1101/2024.10.10.617658v3" rel="nofollow">https://www.biorxiv.org/content/10.1101/2024.10.10.617658v3</a><p>[3] Zoonomia: <a href="https://www.science.org/doi/10.1126/science.abn3943" rel="nofollow">https://www.science.org/doi/10.1126/science.abn3943</a>