How to Build Your Second Brain

Chapter 1: Introduction & Data Ingest

📝 Section Summary

The author shares recent experience using large language models (LLMs) to build personal knowledge bases for topics of research interest. A shift in focus means that much of the author's token throughput now goes into manipulating knowledge, stored as Markdown and images, rather than code alone. During data ingest, source files of all kinds are collected into a raw/ directory, from which an LLM incrementally "compiles" a highly interlinked wiki; tools such as the Obsidian Web Clipper make it easy to capture web articles and sync their images locally.

[Author]: LLM Knowledge Bases

[Author]: Something I'm finding very useful recently: using LLMs to build personal knowledge bases for various topics of research interest.

[Author]: In this way, a large fraction of my recent token throughput is going less into manipulating code, and more into manipulating knowledge (stored as markdown and images).

[Author]: The latest LLMs are quite good at it.

[Author]: So:

[Author]: Data ingest:

[Author]: I index source documents (articles, papers, repos, datasets, images, etc.) into a raw/ directory, then I use an LLM to incrementally "compile" a wiki, which is just a collection of .md files in a directory structure.
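The "incremental compile" step can be sketched as a small helper that diffs raw/ against the wiki to find which sources still need a note written for them. This is a minimal sketch under an assumed (hypothetical) naming convention, raw/foo.pdf ↔ wiki/foo.md; the actual summarization is done by the LLM, not this script.

```python
from pathlib import Path

def files_needing_summary(raw_dir: str, wiki_dir: str) -> list[str]:
    """Return raw files that have no corresponding note in the wiki yet.

    Assumes the convention raw/<name>.<ext> -> wiki/<name>.md, so the
    LLM only has to (re)compile the files listed here, not everything.
    """
    wiki_stems = {p.stem for p in Path(wiki_dir).glob("*.md")}
    return sorted(
        str(p) for p in Path(raw_dir).iterdir()
        if p.is_file() and p.stem not in wiki_stems
    )
```

Each path returned would then be handed to the LLM as a "compile this one source into a wiki note" task, which is what keeps the build incremental.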

[Author]: The wiki includes summaries of all the data in raw/, backlinks, and then it categorizes data into concepts, writes articles for them, and links them all.
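Backlinks are the glue here: each note records which other notes point at it. A minimal sketch of computing a backlink index from Obsidian-style `[[wikilink]]` syntax (the function name and in-memory `pages` dict are illustrative assumptions, not the author's actual tooling):

```python
import re
from collections import defaultdict

# Matches the target of [[Target]], [[Target|alias]], and [[Target#heading]].
WIKILINK = re.compile(r"\[\[([^\]|#]+)")

def backlink_index(pages: dict[str, str]) -> dict[str, set[str]]:
    """Map each link target to the set of pages that link to it.

    `pages` maps page name -> markdown body.
    """
    index = defaultdict(set)
    for name, body in pages.items():
        for target in WIKILINK.findall(body):
            index[target.strip()].add(name)
    return dict(index)
```

Obsidian computes this view for you in its UI, but materializing it as data is useful when the LLM (rather than a human) is the one traversing the graph.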

[Author]: To convert web articles into .md files I like to use the Obsidian Web Clipper extension, and then I also use a hotkey to download all the related images to local so that my LLM can easily reference them.
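Localizing the images amounts to rewriting each remote `![](https://…)` link in the clipped markdown to point at a downloaded copy. A minimal sketch, with the downloader injected as a callable (`fetch(url) -> bytes`, e.g. wrapping `urllib.request.urlopen`) and a naive filename-from-URL convention; both are assumptions, not the author's actual hotkey script:

```python
import re
from pathlib import Path

# Markdown image syntax with an absolute http(s) URL.
IMG = re.compile(r"!\[([^\]]*)\]\((https?://[^)]+)\)")

def localize_images(md_text: str, out_dir: str, fetch) -> str:
    """Download remote images referenced in markdown and relink them locally."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)

    def replace(m: re.Match) -> str:
        alt, url = m.group(1), m.group(2)
        name = url.rstrip("/").rsplit("/", 1)[-1] or "image"
        local = out / name
        local.write_bytes(fetch(url))      # save the image next to the note
        return f"![{alt}]({local})"        # rewrite the link to the local copy

    return IMG.sub(replace, md_text)
```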


Chapter 2: IDE & Q&A

📝 Section Summary

This section covers the knowledge base's frontend and its question-answering workflow. The author uses Obsidian as a visual IDE, while writing and maintaining the underlying data is left entirely to the LLM. Once the wiki reaches a certain size, the user can pose complex, targeted questions directly to an LLM agent. Notably, at small-to-medium scale, the LLM's ability to auto-maintain index files and summaries proves strong enough that complex retrieval-augmented generation (RAG) is unnecessary.

[Author]: IDE:

[Author]: I use Obsidian as the IDE "frontend" where I can view the raw data, the compiled wiki, and the derived visualizations.

[Author]: Important to note that the LLM writes and maintains all of the data of the wiki; I rarely touch it directly.

[Author]: I've played with a few Obsidian plugins to render and view data in other ways (e.g. Marp for slides).

[Author]: Q&A:

[Author]: Where things get interesting is that once your wiki is big enough (e.g. mine on some recent research is ~100 articles and ~400K words), you can ask your LLM agent all kinds of complex questions against the wiki, and it will go off, research the answers, etc.

[Author]: I thought I had to reach for fancy RAG, but the LLM has been pretty good about auto-maintaining index files and brief summaries of all the documents and it reads all the important related data fairly easily at this ~small scale.
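The "index + brief summaries" pattern can be made concrete with a toy stand-in: rank documents by keyword overlap between a question and each one-line summary, then read only the top few in full. In the post's actual workflow the LLM itself does this selection by reading the index; this hypothetical sketch only illustrates why, at ~100 articles, nothing fancier than summary-scanning is needed.

```python
def pick_relevant(question: str, summaries: dict[str, str], k: int = 3) -> list[str]:
    """Rank docs by word overlap between the question and each summary.

    `summaries` maps doc path -> one-line summary (as auto-maintained in
    an index file). Returns the k best candidates to read in full.
    """
    q_words = set(question.lower().split())
    return sorted(
        summaries,
        key=lambda doc: -len(q_words & set(summaries[doc].lower().split())),
    )[:k]
```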


Chapter 3: Output & Linting

📝 Section Summary

The author discusses output formats and automated quality checks for the knowledge base. Rather than plain text in a terminal, the LLM is asked to render Markdown, slide decks, or charts, and valuable answers are filed back into the wiki so explorations compound. The author also runs periodic LLM "health checks" over the wiki, finding inconsistencies, imputing missing data via web search, and proposing new links, steadily improving the knowledge base's data integrity.

[Author]: Output:

[Author]: Instead of getting answers in text/terminal, I like to have it render markdown files for me, or slide shows (Marp format), or matplotlib images, all of which I then view again in Obsidian.
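Of the formats mentioned, Marp is the least familiar: a Marp deck is just a markdown file with `marp: true` front matter and `---` between slides, so it files straight into the vault like any other note. A minimal sketch of packaging an answer as such a deck (the function and its inputs are illustrative assumptions):

```python
def to_marp_deck(title: str, sections: dict[str, list[str]]) -> str:
    """Render an answer as a Marp slide deck: title slide, then one
    bulleted slide per section, separated by `---`."""
    slides = [f"---\nmarp: true\n---\n\n# {title}"]
    for heading, bullets in sections.items():
        body = "\n".join(f"- {b}" for b in bullets)
        slides.append(f"## {heading}\n\n{body}")
    return "\n\n---\n\n".join(slides)
```

The resulting string is written to a `.md` file and previewed with an Obsidian Marp plugin or the `marp` CLI.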

[Author]: You can imagine many other visual output formats depending on the query.

[Author]: Often, I end up "filing" the outputs back into the wiki to enhance it for further queries.

[Author]: So my own explorations and queries always "add up" in the knowledge base.

[Author]: Linting:

[Author]: I've run some LLM "health checks" over the wiki to e.g. find inconsistent data, impute missing data (with web searches), find interesting connections for new article candidates, etc., to incrementally clean up the wiki and enhance its overall data integrity.
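One such health check is mechanical enough to sketch without an LLM at all: find `[[wikilinks]]` that point at pages which don't exist yet. Each dangling link is a candidate for imputation (via web search) or for a new article. A minimal sketch over an in-memory page map (an assumption for illustration; the real check runs over files):

```python
import re

def broken_links(pages: dict[str, str]) -> dict[str, list[str]]:
    """Report, per page, the [[wikilink]] targets that have no page.

    `pages` maps page name -> markdown body.
    """
    link_re = re.compile(r"\[\[([^\]|#]+)")  # target of [[T]], [[T|alias]], [[T#h]]
    report = {}
    for name, body in pages.items():
        dangling = [t.strip() for t in link_re.findall(body)
                    if t.strip() not in pages]
        if dangling:
            report[name] = dangling
    return report
```

The report is exactly the kind of structured TODO list that is handed back to the LLM: "for each missing page, search the web and draft an article."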

[Author]: The LLMs are quite good at suggesting further questions to ask and look into.


Chapter 4: Extra Tools, Further Explorations & TLDR

📝 Section Summary

The final chapter looks at how the workflow could be extended and productized. The author has built extra tools, such as a small search engine over the wiki, and hands them to the LLM as CLI tools. As the knowledge base grows, synthetic data generation plus finetuning could let an LLM internalize the knowledge in its weights rather than its context window. The author concludes that this is more than a hacky pile of scripts: it points toward an entirely new kind of product.

[Author]: Extra tools:

[Author]: I find myself developing additional tools to process the data, e.g. I vibe coded a small and naive search engine over the wiki, which I sometimes use directly (in a web UI), but more often I hand it off to an LLM via CLI as a tool for larger queries.
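"Small and naive" can be taken at face value: a ranked keyword search needs only a term-frequency score. A minimal sketch (the function and its in-memory `docs` dict are illustrative assumptions about what such a tool might look like, not the author's actual code):

```python
from collections import Counter

def search(query: str, docs: dict[str, str], k: int = 5) -> list[str]:
    """Naive ranked search: score each doc by how often the query's
    terms occur in it, and return the top-k docs with any hit at all.

    `docs` maps doc name -> body text.
    """
    terms = query.lower().split()

    def score(body: str) -> int:
        counts = Counter(body.lower().split())
        return sum(counts[t] for t in terms)

    ranked = sorted(docs, key=lambda name: -score(docs[name]))
    return [name for name in ranked if score(docs[name]) > 0][:k]
```

Wrapped in an `argparse` entry point, the same function serves both a web UI and the LLM-facing CLI, which is the point: one tool, two callers.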

[Author]: Further explorations:

[Author]: As the repo grows, the natural desire is to also think about synthetic data generation + finetuning to have your LLM "know" the data in its weights instead of just context windows.
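Generating the synthetic Q&A pairs would itself be an LLM job, but the packaging step is simple: write them out as chat-style JSONL, the format most finetuning pipelines accept. A hedged sketch of that last step only (the exact schema varies by provider, so treat this shape as one common convention, not a universal spec):

```python
import json

def to_finetune_jsonl(qa_pairs: list[tuple[str, str]]) -> str:
    """Pack (question, answer) pairs into chat-style JSONL records,
    one JSON object per line, for a finetuning pipeline to consume."""
    lines = []
    for question, answer in qa_pairs:
        lines.append(json.dumps({
            "messages": [
                {"role": "user", "content": question},
                {"role": "assistant", "content": answer},
            ]
        }))
    return "\n".join(lines)
```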

[Author]: TLDR: raw data from a number of sources is collected, then compiled by an LLM into a .md wiki, then operated on by the LLM through various CLI tools to do Q&A and to incrementally enhance the wiki, all of it viewable in Obsidian.

[Author]: You rarely ever write or edit the wiki manually; it's the domain of the LLM.

[Author]: I think there is room here for an incredible new product instead of a hacky collection of scripts.