HackerNews中文版

大家好，我整理了一个代码库，演示如何从零开始在单个 TSPLIB 实例（lin318）上直接训练 PPO——无需预训练或 GPU。代码库链接：[https://github.com/jivaprime/TSP](https://github.com/jivaprime/TSP) 1. 实验设置问题：TSPLIB lin318（最优解：42,029）和 rd400 硬件：Google Colab（仅 CPU）模型：单实例 PPO 策略 + 价值网络。以随机初始化开始。局部搜索：训练期间使用轻量级的 2-opt，评估时使用 Numba 加速的 3-opt。核心概念：与其说是“稳定的平均误差最小化器”，不如说这个策略被设计为高方差的探索者。目标不是保持平均差距低，而是偶尔“激增”出非常低误差的路径，以便局部搜索进行优化。 2. 结果：lin318 最佳结果：42,064（差距约 +0.08%）时间：在 Colab CPU 上大约 20 分钟内达到。根据日志（包含在代码库中），小于 0.1% 的结果出现在经过时间为 0:19:49 时。虽然平均误差在 3% 到 4% 之间波动，但该策略成功找到了一个深的盆地，3-opt 可以利用。 3. 扩展实验：智能 ILS 和 rd400 我扩展了流程，使用“智能 ILS”（迭代局部搜索）后处理，看看我们是否能达到精确的最优解。 A. lin318 + ILS 以 PPO 生成的路径（0.08% 差距）作为种子。运行智能 ILS 大约 20 分钟。结果：达到了精确的最优解（42,029）。 B. rd400 + ILS PPO 阶段：在 CPU 上约 2 小时。生成的路径差距约为 1.9%。 ILS 阶段：使用 PPO 路径作为种子。运行约 40 分钟。结果：达到了 0.079% 的差距（成本 15,293 对比最优解 15,281）。总结该工作流程有效地分离了关注点： PPO：引导搜索进入高质量的盆地（1% - 2% 差距）。 ILS：在该盆地内深入挖掘以找到最优解。如果您对实例级的强化学习、基于 CPU 的优化，或与 ML-TSP 基线（POMO、AM、NeuroLKH）进行比较感兴趣，欢迎查看代码。欢迎建设性的反馈！

查看原文

Hi allI’ve put together a repo demonstrating how to train PPO directly on a single TSPLIB instance (lin318) from scratch—without pre-training or GPUs.Repo:https://github.com/jivaprime/TSP1. Experiment SetupProblem: TSPLIB lin318 (Opt: 42,029) & rd400Hardware: Google Colab (CPU only)Model: Single-instance PPO policy + Value network. Starts from random initialization.Local Search: Light 2-opt during training, Numba-accelerated 3-opt for evaluation.Core Concept: Instead of a "stable average-error minimizer," this policy is designed as a high-variance explorer. The goal isn't to keep the average gap low, but to occasionally "spike" very low-error tours that local search can polish.2. Results: lin318Best Shot: 42,064 (Gap ≈ +0.08%)Time: Reached within ~20 minutes on Colab CPU.According to the logs (included in the repo), the sub-0.1% shot appeared around elapsed=0:19:49. While the average error oscillates around 3–4%, the policy successfully locates a deep basin that 3-opt can exploit.3. Extended Experiment: Smart ILS & rd400I extended the pipeline with "Smart ILS" (Iterated Local Search) post-processing to see if we could hit the exact optimum.A. lin318 + ILSTook the PPO-generated tour (0.08% gap) as a seed.Ran Smart ILS for ~20 mins.Result: Reached the exact optimal (42,029).B. rd400 + ILSPPO Phase: ~2 hours on CPU. Produced tours with ~1.9% gap.ILS Phase: Used PPO tours as seeds. Ran for ~40 mins.Result: Reached 0.079% gap (Cost 15,293 vs Opt 15,281).SummaryThe workflow separates concerns effectively:PPO: Drives the search into a high-quality basin (1–2% gap).ILS: Digs deep within that basin to find the optimum.If you are interested in instance-wise RL, CPU-based optimization, or comparing against ML-TSP baselines (POMO, AM, NeuroLKH), feel free to check out the code.Constructive feedback is welcome!

仅使用CPU的PPO算法在20分钟内解决了TSPLIB中的lin318问题（差距为0.08%）。