Launch HN: Gecko Security (YC F24) - AI That Finds Vulnerabilities in Code
Hey HN, I'm JJ, co-founder of Gecko Security (https://www.gecko.security). We're building a new kind of static analysis tool that uses LLMs to find the complex business logic and multi-step vulnerabilities that current scanners miss. We've used it to find 30+ CVEs in projects like Ollama, Gradio, and Ragflow (https://www.gecko.security/research). You can try it yourself on any OSS repo at https://app.gecko.security.
Anyone who's used SAST (Static Application Security Testing) tools knows their problems: high false-positive rates, while entire classes of vulnerabilities like AuthN/Z bypasses or privilege escalations are missed entirely. This limitation is a result of their core architecture. By design, SAST tools parse code into a simplistic model like an AST or call graph, which quickly loses context in dynamically typed languages or across microservice boundaries, and limits coverage to resolving only basic call chains. When detecting vulnerabilities they rely on pattern matching with regex or YAML rules, which can be effective for basic technical classes (XSS, SQLi) but is inadequate for logic flaws that don't conform to well-known shapes and need long sequences of dependent operations to reach an exploitable state.
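To make the gap concrete, here's a deliberately simplified sketch (a toy, not any real tool's rule engine) of the regex-style matching SAST relies on: it catches a textbook SQLi sink, but a missing authorization check gives it literally nothing to match on.

    import re

    # Toy regex rule of the kind SAST engines use: flag string-formatted SQL.
    SQLI_RULE = re.compile(r'\.execute\(\s*f?["\'].*(\{|%s)')

    vulnerable_sqli = 'cursor.execute(f"SELECT * FROM users WHERE id = {user_id}")'
    logic_flaw = '''
    def update_group(group_id: int, user: User):
        # "user" should gate access here, but no pattern can know that.
        db.update_group(group_id)
    '''

    print(bool(SQLI_RULE.search(vulnerable_sqli)))  # True: the sink matches the pattern
    print(bool(SQLI_RULE.search(logic_flaw)))       # False: absent checks leave no pattern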
My co-founder and I saw these limitations throughout our careers in national intelligence and military cyber forces, where we built automated tooling to defend critical infrastructure. We realized that LLMs, with the right architecture, could finally solve them.
Vulnerabilities are contextual. What's exploitable depends entirely on each application's security model. We realized that accurate detection requires understanding what is supposed to be protected and why breaking it matters. This meant embedding threat modeling directly into our analysis rather than treating it as an afterthought.
To achieve this, we first had to solve the code parsing problem. Our solution was to build a custom, compiler-accurate indexer, inspired by GitHub's stack graphs approach, that navigates code precisely, like an IDE. We build on the LSIF approach (https://lsif.dev/) but replace the verbose JSON with a compact protobuf schema that serializes symbol definitions and references in a binary format. We use language-specific tools to parse and type-check code, emitting a sequence of protobuf messages that record each symbol's position, definition, and reference information. Protobuf's efficiency and strong typing let us produce much smaller indexes while preserving the compiler-accurate semantic information required for detecting complex call chains.
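As a rough Python sketch of what each index record carries (the field and symbol names here are illustrative, not our actual protobuf schema):

    from dataclasses import dataclass, field
    from enum import Enum

    class Role(Enum):
        DEFINITION = 1
        REFERENCE = 2

    @dataclass
    class Occurrence:
        # One record per symbol occurrence, emitted by a language-specific
        # parser/type checker and serialized as a protobuf message on disk.
        symbol: str   # e.g. "onyx.server.user_group.update_user_group"
        path: str     # file containing the occurrence
        line: int
        col: int
        role: Role    # definition vs. reference

    @dataclass
    class Index:
        occurrences: list[Occurrence] = field(default_factory=list)

        def references(self, symbol: str) -> list[Occurrence]:
            # IDE-style "find all references": the primitive queried later
            # when reconstructing call chains from source to sink.
            return [o for o in self.occurrences
                    if o.symbol == symbol and o.role is Role.REFERENCE]

In the real system these records live as binary protobuf rather than Python objects, which is where the size and strong-typing benefits come from.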
This is why most "SAST + LLM" tools that use AST parsing fail: they feed LLMs incomplete or incorrect code information from traditional parsers, making it difficult to reason accurately about security issues with missing context.
With our indexer providing accurate code structure, we use an LLM to perform threat modeling: it analyzes developer intent, data and trust boundaries, and exposed endpoints to generate potential attack scenarios. This is where LLMs' tendency to hallucinate becomes a breakthrough feature.
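One way to picture the output of this step is as structured, falsifiable hypotheses. The shape below is a simplified illustration (not our actual schema, and the endpoint path is invented); the point is that each hypothesis names an invariant to break and a concrete entry point, so the follow-up search has something checkable:

    from dataclasses import dataclass

    @dataclass
    class AttackScenario:
        invariant: str    # property the application is supposed to uphold
        entry_point: str  # exposed endpoint an attacker would go through
        hypothesis: str   # concrete way the invariant might be violated

    # Hypothetical output for the ONYX case discussed below.
    scenario = AttackScenario(
        invariant="Curators can only modify groups they are assigned to",
        entry_point="PATCH /manage/admin/user-group/{id}",
        hypothesis="handler accepts a user argument but never checks its scope",
    )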
For each potential attack path generated, we perform a systematic search, querying the indexer to gather all necessary context and reconstruct the full call chain from source to sink. To validate a vulnerability, we use a Monte Carlo Tree Self-refine (MCTSr) algorithm with a 'win function' that estimates the likelihood that a hypothesized attack could work. Once a finding scores above a set practicality threshold, it is confirmed as a true positive.
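A minimal sketch of that validation loop, under simplifying assumptions (in production the win function is an LLM-guided evaluator and refinement is an LLM critique-and-rewrite step; here both are placeholders):

    import math
    import random

    def refine(hypothesis: str) -> str:
        # Placeholder for the self-refine step: in practice an LLM critiques
        # the current attack hypothesis and proposes a stronger variant.
        return hypothesis + " (refined)"

    def validate(root: str, win_fn, iterations: int = 50,
                 threshold: float = 0.8, c: float = 1.4):
        # MCTSr sketch: pick the most promising hypothesis (UCB1), refine it,
        # score the refinement, and confirm a finding as soon as any
        # refinement clears the practicality threshold.
        visits = {root: 1}
        wins = {root: win_fn(root)}
        for _ in range(iterations):
            total = sum(visits.values())
            # Selection: balance exploiting high-scoring hypotheses against
            # exploring rarely visited ones.
            node = max(visits, key=lambda n: wins[n] / visits[n]
                       + c * math.sqrt(math.log(total) / visits[n]))
            child = refine(node)
            score = win_fn(child)            # likelihood the attack works
            if score >= threshold:
                return child                 # confirmed true positive
            visits[child], wins[child] = 1, score
            visits[node] += 1                # backpropagate to the parent
            wins[node] += score
        return None                          # never cleared the threshold: discard

    # Toy run: random scores stand in for the LLM-guided win function.
    print(validate("curator edits unassigned group", lambda h: random.random()))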
Using this approach, we discovered vulnerabilities like CVE-2025-51479 in ONYX (an OSS enterprise search platform), where Curators could modify any group instead of just their assigned ones. The user-group API had a user parameter that was supposed to gate permissions but was never used. Gecko inferred that the developers intended to restrict Curator access because both the UI and similar API functions validate this permission properly. That established "Curators have limited scope" as a security invariant, one that this specific API violated. Traditional SAST can't detect this. Any rule that flags unused user parameters would drown you in false positives, since many functions legitimately keep unused parameters. More importantly, detecting it requires knowing which functions handle authorization, understanding ONYX's Curator permission model, and recognizing the validation pattern across multiple files: contextual reasoning that SAST simply cannot do.
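Reduced to its essence (illustrative code, not ONYX's actual source), the pattern looks like this; the unused parameter is only suspicious once you know that sibling endpoints use it for a permission check:

    from dataclasses import dataclass, field

    @dataclass
    class User:
        is_admin: bool = False
        curated_group_ids: set[int] = field(default_factory=set)

    def patch_user_group(group_id: int, user: User, db) -> None:
        # VULNERABLE: "user" is accepted but never consulted, so any
        # Curator can modify any group.
        db.update_group(group_id)

    def delete_user_group(group_id: int, user: User, db) -> None:
        # Sibling endpoint that establishes the invariant Gecko inferred:
        # Curators may only touch groups they are assigned to.
        if not user.is_admin and group_id not in user.curated_group_ids:
            raise PermissionError("Curator lacks access to this group")
        db.delete_group(group_id)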
We have several enterprise customers using Gecko because it solves problems they couldn't address with traditional SAST tools. They're seeing 50% fewer false positives on the same codebases and finding vulnerabilities that previously only showed up in manual pentests.
Digging into false positives: no static analysis tool will ever achieve perfect accuracy, AI or otherwise. We reduce them at two key points. First, our indexer eliminates the programmatic parsing errors that create the incorrect call chains traditional AST tools are susceptible to. Second, we avoid unwanted LLM hallucinations and reasoning errors by asking specific, contextual questions rather than open-ended ones. The LLM knows which security invariants need to hold and can make deterministic assessments grounded in that context. When we do flag something, manual review is quick because we provide complete source-to-sink dataflow analysis with proof-of-concept code, and we rank findings by confidence score.
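For illustration, a flagged finding bundles everything a reviewer needs; the fields below are a simplified sketch of that output, not our exact format:

    finding = {
        "invariant": "Curators can only modify their assigned groups",
        "source": "patch_user_group(user=...)",  # attacker-controlled entry
        "sink": "db.update_group",               # state-changing operation
        "call_chain": ["patch_user_group", "update_user_group",
                       "db.update_group"],
        "proof_of_concept": "PATCH /manage/admin/user-group/7 as a Curator",
        "confidence": 0.91,
    }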
We'd love any feedback from the community, ideas for future direction, or experiences in this space. I'll be in the comments to respond!