Knowledge Graph RAG — The Promise of Structured Retrieval and the Hidden Cost of Building It
My thesis was on knowledge graph embeddings, so when GraphRAG started trending I was genuinely excited. Finally, knowledge graphs getting the attention they deserve in the LLM era. But having lived in that world, I also know what people aren’t talking about: the cost of actually building and maintaining a knowledge graph from scratch.
- What Is a Knowledge Graph?
- Why Traditional RAG Falls Short on Global Questions
- The GraphRAG Approach
- The Hidden Cost of Building the Knowledge Graph
- The Maintenance Problem — What Happens When Knowledge Changes
- What About Knowledge Graphs for Source Code?
- When to Use GraphRAG
What Is a Knowledge Graph?
A knowledge graph is a graph where nodes are entities and edges are relations between them. The canonical example is a triple: (Albert Einstein, bornIn, Ulm). Millions of these triples form a structured representation of knowledge; Wikidata alone holds well over a billion statements about more than 100 million items. The key property is that knowledge is explicit and traversable. You can follow links, reason over paths, and answer multi-hop questions that are hard to answer reliably from flat text.
graph LR
E["Einstein"] -->|bornIn| U["Ulm"]
E -->|field| P["Physics"]
E -->|won| NP["Nobel Prize 1921"]
NP -->|category| P
U -->|country| DE["Germany"]
style E fill:#264653,stroke:#264653,color:#fff
style U fill:#2a9d8f,stroke:#2a9d8f,color:#fff
style P fill:#e9c46a,stroke:#e9c46a,color:#000
style NP fill:#f4a261,stroke:#f4a261,color:#000
style DE fill:#2d6a4f,stroke:#2d6a4f,color:#fff
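To make "explicit and traversable" concrete, here is a minimal sketch that stores the triples from the diagram above in a plain Python dictionary and follows a chain of relations. The helper names (`build_index`, `follow`) are illustrative, not part of any library; a real system would use a triple store or graph database.

```python
from collections import defaultdict

# Triples from the Einstein diagram: (subject, relation, object)
TRIPLES = [
    ("Einstein", "bornIn", "Ulm"),
    ("Einstein", "field", "Physics"),
    ("Einstein", "won", "Nobel Prize 1921"),
    ("Nobel Prize 1921", "category", "Physics"),
    ("Ulm", "country", "Germany"),
]


def build_index(triples):
    """Index triples as subject -> relation -> list of objects."""
    index = defaultdict(lambda: defaultdict(list))
    for subj, rel, obj in triples:
        index[subj][rel].append(obj)
    return index


def follow(index, start, relations):
    """Follow a chain of relations from `start` and return the entities reached."""
    frontier = [start]
    for rel in relations:
        # .get() avoids creating empty entries for unknown nodes or relations
        frontier = [
            obj
            for node in frontier
            for obj in index.get(node, {}).get(rel, [])
        ]
    return frontier


index = build_index(TRIPLES)
# Two-hop question: in which country was Einstein born?
print(follow(index, "Einstein", ["bornIn", "country"]))
```

The two-hop question is answered by walking `bornIn` and then `country`; no single triple contains the answer, which is the same property that makes multi-hop questions awkward for flat text.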
Why Traditional RAG Falls Short on Global Questions
Traditional RAG works like this: chunk documents, embed them into vectors, retrieve the top-k most similar chunks for a query. It works well for factual lookups — “what does this API return?” or “what’s the refund policy?” But it falls apart on questions that require synthesizing information across many documents. Try asking “what are the main themes in this dataset?” or “how are these companies connected?” — vector similarity doesn’t help because no single chunk contains the answer.
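A minimal sketch of that retrieval loop, assuming a toy bag-of-words "embedding" in place of a real embedding model (the chunks and function names are made up for illustration):

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query, chunks, k=2):
    """Return the k chunks most similar to the query."""
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]


CHUNKS = [
    "Refund policy: purchases can be returned within 14 days.",
    "API responses include a JSON status field.",
    "Orders over 50 dollars ship free.",
    "Email support runs on weekdays.",
]

# A specific lookup lands on the right chunk
print(top_k("refund policy", CHUNKS, k=1))

# A global question matches nothing in particular
global_query = embed("what are the main themes")
print([cosine(global_query, embed(c)) for c in CHUNKS])
```

With this toy embedding the global question scores zero against every chunk. A real embedding model would return some nearest neighbours, but the underlying issue is the same: the answer lives in the collection as a whole, not in any chunk that similarity search can surface.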
The GraphRAG Approach
This is exactly what GraphRAG (Edge et al., 2024) [1] addresses. The core idea: build a knowledge graph from your corpus, detect communities of related entities using the Leiden algorithm [3], generate summaries for each community, then use those summaries to answer global questions via map-reduce.
graph LR
DOC["Documents"] --> CHUNK["Chunk"]
CHUNK --> EXT["Extract Entities"]
EXT --> KG["Knowledge Graph"]
KG --> COM["Detect Communities"]
COM --> SUM["Summarize Communities"]
SUM --> QA["Map-Reduce QA"]
style DOC fill:#264653,stroke:#264653,color:#fff
style CHUNK fill:#e76f51,stroke:#e76f51,color:#fff
style EXT fill:#f4a261,stroke:#f4a261,color:#000
style KG fill:#e9c46a,stroke:#e9c46a,color:#000
style COM fill:#2a9d8f,stroke:#2a9d8f,color:#fff
style SUM fill:#40916c,stroke:#40916c,color:#fff
style QA fill:#2d6a4f,stroke:#2d6a4f,color:#fff
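The last stage of the pipeline above, map-reduce over community summaries, can be sketched roughly as follows. This is a simplified reading of the paper's query step, not its actual code: `map_fn` and `reduce_fn` are hypothetical stand-ins for LLM calls, and the word budget stands in for a token limit.

```python
from typing import Callable, List, Tuple

# Hypothetical signatures: map_fn returns (partial_answer, helpfulness 0-100),
# reduce_fn merges the selected partial answers into one final answer.
MapFn = Callable[[str, str], Tuple[str, int]]
ReduceFn = Callable[[str, List[str]], str]


def map_reduce_answer(question, summaries, map_fn: MapFn, reduce_fn: ReduceFn,
                      max_words=50):
    # Map: one partial answer per community summary
    partials = [map_fn(question, s) for s in summaries]
    # Drop unhelpful answers, then rank the rest by helpfulness
    ranked = sorted((p for p in partials if p[1] > 0),
                    key=lambda p: p[1], reverse=True)
    # Fill the context budget with the most helpful answers first
    context, used = [], 0
    for answer, _score in ranked:
        n_words = len(answer.split())
        if used + n_words > max_words:
            break
        context.append(answer)
        used += n_words
    # Reduce: produce the final answer from the selected partial answers
    return reduce_fn(question, context)


# Toy stand-ins so the sketch runs without an LLM
def toy_map(question, summary):
    overlap = set(question.lower().split()) & set(summary.lower().split())
    return summary, min(100, 50 * len(overlap))


def toy_reduce(question, partial_answers):
    return " | ".join(partial_answers)


SUMMARIES = [
    "regional banks share deposit exposure",
    "chip makers depend on one foundry",
    "banks and regulators are connected through stress tests",
]
answer = map_reduce_answer("how are banks connected", SUMMARIES, toy_map, toy_reduce)
print(answer)
```

The point of the structure is that every community gets a chance to contribute, which is what lets the final answer cover the whole corpus instead of the top-k nearest chunks.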
The results are compelling. On the paper’s benchmarks (~1M-token datasets), GraphRAG beat vector RAG head-to-head on comprehensiveness (72-83% win rate) and diversity (62-82% win rate), as judged by an LLM. Vector RAG still won on directness: it gives more concise, pointed answers to specific questions. This makes sense. Vector RAG is great at finding the needle; GraphRAG is great at describing the haystack.
| Metric | GraphRAG vs Vector RAG | What it means |
|---|---|---|
| Comprehensiveness | 72-83% win | Covers more aspects of the answer |
| Diversity | 62-82% win | Provides more varied perspectives |
| Directness | Vector RAG wins | More concise for specific questions |
| Empowerment | Mixed | Depends on whether quotes or summaries help more |
The Hidden Cost of Building the Knowledge Graph
Here’s what it actually takes to go from documents to a usable knowledge graph:
Step 1: Entity and Relationship Extraction. An LLM reads every chunk and extracts entities and their relationships. The Neo4j implementation [2] using LangChain’s LLMGraphTransformer with GPT-4o extracted ~13,000 entities and ~16,000 relationships from 2,000 news articles. Cost: ~$30, time: ~35 minutes with 10 parallel workers. And this is a small dataset.
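# LangChain's LLMGraphTransformer, simplified (requires an OpenAI API key)
from langchain_core.documents import Document
from langchain_experimental.graph_transformers import LLMGraphTransformer
from langchain_openai import ChatOpenAI
# Placeholder input; in practice, pass one Document per chunk of the corpus
documents = [Document(page_content="Silicon Valley Bank collapsed in March 2023.")]
transformer = LLMGraphTransformer(llm=ChatOpenAI(model="gpt-4o"))
graph_documents = transformer.convert_to_graph_documents(documents)
# Each GraphDocument holds nodes (entities) and relationships (edges)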
Step 2: Entity Resolution. This is the step the original paper mentions but doesn’t ship code for — and it’s arguably the hardest. The same entity appears with different names: “Silicon Valley Bank”, “Silicon_Valley_Bank”, “SVB”. You need to deduplicate them. The Neo4j blog’s approach: compute text embeddings, build a KNN graph (cosine similarity > 0.95), find connected components, filter by edit distance, then LLM verification for final merge decisions. Even with all that, it still has failure modes for dates, abbreviations, and domain-specific terms.
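Here is a rough, dependency-free sketch of the first stages of that pipeline. It is not the Neo4j blog's code: character-trigram Jaccard similarity stands in for embedding cosine similarity, a union-find stands in for the graph database's connected-components step, and the thresholds (`sim_threshold`, `max_edits`) are made-up values. The LLM verification step is left out.

```python
from itertools import combinations


def normalize(name):
    return name.lower().replace("_", " ").strip()


def trigrams(name):
    padded = f"  {normalize(name)} "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}


def similarity(a, b):
    """Jaccard similarity of character trigrams (stand-in for embedding cosine)."""
    ta, tb = trigrams(a), trigrams(b)
    return len(ta & tb) / len(ta | tb)


def edit_distance(a, b):
    """Levenshtein distance via the standard row-by-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def candidate_groups(names, sim_threshold=0.8, max_edits=3):
    """Group names that are similar enough to be merge candidates."""
    parent = {n: n for n in names}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in combinations(names, 2):
        close = edit_distance(normalize(a), normalize(b)) <= max_edits
        if similarity(a, b) >= sim_threshold and close:
            parent[find(a)] = find(b)  # union the two components

    groups = {}
    for n in names:
        groups.setdefault(find(n), []).append(n)
    # Only groups with more than one name need a merge decision
    return [g for g in groups.values() if len(g) > 1]


NAMES = ["Silicon Valley Bank", "Silicon_Valley_Bank", "SVB", "Signature Bank"]
groups = candidate_groups(NAMES)
print(groups)
```

The two spelled-out variants end up in one candidate group, while "SVB" is left on its own. That is the abbreviation failure mode in miniature: surface similarity cannot connect an acronym to its expansion, which is why the later, more expensive verification stages exist.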
Step 3: Community Detection. Run the Leiden algorithm to partition the graph into hierarchical communities, i.e. clusters of closely related entities. On the paper’s podcast dataset, the hierarchy ranged from 34 communities at the coarsest level to 1,310 at the finest. Not every level adds meaningful information: the Neo4j blog found that levels 3 and 4 differed by only 4 communities.
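Leiden itself is best left to a library, but the thing it optimizes is easy to show. The sketch below computes modularity, one of the quality functions Leiden can optimize, for a given partition of a small undirected graph. Which quality function a particular GraphRAG implementation uses is an assumption here, and the toy graph is made up.

```python
def modularity(edges, community_of):
    """Modularity Q of a partition of an undirected, unweighted graph.

    Q = sum over communities c of: L_c / m - (d_c / (2m))^2
    where m is the edge count, L_c the edges inside c, d_c the total degree of c.
    """
    m = len(edges)
    degree, internal = {}, {}
    for u, v in edges:
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1
        if community_of[u] == community_of[v]:
            c = community_of[u]
            internal[c] = internal.get(c, 0) + 1

    degree_sum = {}
    for node, d in degree.items():
        c = community_of[node]
        degree_sum[c] = degree_sum.get(c, 0) + d

    return sum(
        internal.get(c, 0) / m - (degree_sum[c] / (2 * m)) ** 2
        for c in degree_sum
    )


# Two triangles joined by a single bridge edge (c-d)
EDGES = [("a", "b"), ("b", "c"), ("a", "c"),
         ("d", "e"), ("e", "f"), ("d", "f"),
         ("c", "d")]

two_clusters = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}
one_cluster = {n: 0 for n in "abcdef"}

print(modularity(EDGES, two_clusters))  # 5/14, about 0.357
print(modularity(EDGES, one_cluster))   # 0.0
```

Splitting the graph at the bridge scores higher than lumping everything together, which matches intuition. A community detection algorithm searches for partitions that score well on a function like this, and repeating the process inside each community is what yields the hierarchy of levels.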
Step 4: Community Summarization. An LLM generates natural language summaries for each community, bottom-up through the hierarchy. That’s potentially thousands of LLM calls. The paper’s indexing step took 281 minutes for ~1M tokens of source documents.
| Pipeline Step | What it does | Cost/Complexity |
|---|---|---|
| Entity Extraction | LLM reads every chunk | ~$30 for 2K articles (GPT-4o) |
| Entity Resolution | Deduplicate entities | Multi-step pipeline, domain-dependent |
| Community Detection | Cluster related entities | Needs a graph library or graph DB (the Neo4j implementation uses Neo4j [5] + GDS plugin) |
| Community Summarization | LLM summarizes each community | Potentially thousands of LLM calls |
| Total indexing time | End to end | ~281 min for ~1M tokens (paper) |
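For a feel of how the extraction numbers scale, here is a back-of-envelope extrapolation from the Neo4j figures (~$30 and ~35 minutes for 2,000 articles). It assumes cost and time grow linearly with article count at the same parallelism, which is a simplification: pricing, rate limits, and article length all vary. It also covers only the extraction step, not resolution or summarization.

```python
# Reference point from the Neo4j implementation (approximate figures)
REF_ARTICLES = 2_000
REF_COST_USD = 30.0
REF_MINUTES = 35.0  # with 10 parallel workers


def extrapolate(n_articles):
    """Linear extrapolation of extraction cost (USD) and time (minutes)."""
    scale = n_articles / REF_ARTICLES
    return REF_COST_USD * scale, REF_MINUTES * scale


cost, minutes = extrapolate(100_000)
print(f"~${cost:,.0f} and ~{minutes / 60:.1f} hours for 100,000 articles")
```

Under those assumptions, a 100,000-article corpus lands around $1,500 and a bit over a day of wall-clock time for extraction alone.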
The Maintenance Problem — What Happens When Knowledge Changes
If you have a fixed, curated knowledge graph (like Wikidata or a domain-specific ontology that rarely changes), GraphRAG works beautifully. The graph is your ground truth: you run community detection once, generate summaries, and you’re done. Query time is efficient too: root-level community summaries use 9-43x fewer tokens than processing raw text.
But if you’re building the knowledge graph from your own documents — which is the whole point of GraphRAG for most use cases — every update is painful:
Adding new documents means re-extracting entities, but now you need to resolve them against the existing graph. Is “OpenAI” in the new document the same “OpenAI” already in the graph? Probably yes, but what about “GPT-5” vs “GPT 5” vs “the new GPT model”? Every new entity needs link prediction against the full graph.
Entity deduplication gets harder as the graph grows. With 13,000 entities, exhaustive pairwise comparison already means roughly 84 million pairs. At 100K+ entities, you need approximate methods (LSH, blocking strategies), each with its own failure modes.
Community structure shifts. Adding a few hundred nodes can completely reorganize communities, invalidating existing summaries. Do you re-run Leiden on the full graph? Only on affected subgraphs?
Summary staleness. Even if you detect which communities changed, regenerating summaries means more LLM calls. And if higher-level summaries depend on lower-level ones (they do — the paper uses bottom-up summarization), a change at the leaf level cascades through the entire hierarchy.
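The cascade is easy to see in code. A minimal sketch, assuming the hierarchy is available as a child-to-parent map (the community ids below are hypothetical):

```python
def stale_summaries(parent_of, changed):
    """Return every community whose summary must be regenerated.

    A changed community invalidates its own summary and, because summaries
    are built bottom-up, every ancestor's summary as well.
    """
    stale = set()
    for community in changed:
        node = community
        # Walk up until the top level or an ancestor already marked stale
        while node is not None and node not in stale:
            stale.add(node)
            node = parent_of.get(node)
    return stale


# Hypothetical three-level hierarchy: C2 (finest) -> C1 -> C0 (coarsest)
PARENT_OF = {
    "C2-a": "C1-a", "C2-b": "C1-a", "C2-c": "C1-b",
    "C1-a": "C0-a", "C1-b": "C0-a", "C1-c": "C0-b",
}

print(sorted(stale_summaries(PARENT_OF, {"C2-a"})))
```

One changed leaf forces three regenerations here, one per level. This also assumes community membership stayed stable; if re-running detection reshuffles the communities, the parent map itself is no longer valid and far more summaries are affected.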
graph LR
NEW["New Documents"] --> EXT["Re-extract Entities"]
EXT --> RES["Resolve Against\nExisting Graph"]
RES --> LINK["Link Prediction"]
LINK --> DEDUP["Entity Dedup\n(full graph)"]
DEDUP --> RECOM["Re-detect\nCommunities"]
RECOM --> RESUM["Re-summarize\n(cascade)"]
style NEW fill:#e76f51,stroke:#e76f51,color:#fff
style EXT fill:#f4a261,stroke:#f4a261,color:#000
style RES fill:#e9c46a,stroke:#e9c46a,color:#000
style LINK fill:#e9c46a,stroke:#e9c46a,color:#000
style DEDUP fill:#f4a261,stroke:#f4a261,color:#000
style RECOM fill:#e76f51,stroke:#e76f51,color:#fff
style RESUM fill:#e76f51,stroke:#e76f51,color:#fff
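The "Entity Dedup (full graph)" step in the flow above is where scale bites. The sketch below counts exhaustive pairs and shows a simple blocking strategy that only compares names sharing a key. The key used here (the first three characters of the normalized name) is an arbitrary choice for illustration, not a recommendation.

```python
from collections import defaultdict
from itertools import combinations
from math import comb

# Exhaustive comparison grows quadratically with the number of entities
print(comb(13_000, 2))    # pairs at 13,000 entities
print(comb(100_000, 2))   # pairs at 100,000 entities


def blocking_key(name):
    """Illustrative blocking key: first three characters, normalized."""
    return name.lower().replace("_", " ")[:3]


def candidate_pairs(names):
    """Only compare names that fall into the same block."""
    blocks = defaultdict(list)
    for name in names:
        blocks[blocking_key(name)].append(name)
    return [pair for block in blocks.values() for pair in combinations(block, 2)]


NAMES = ["Silicon Valley Bank", "Silicon_Valley_Bank", "SVB", "GPT-5", "GPT 5"]
pairs = candidate_pairs(NAMES)
print(pairs)
print(f"{len(pairs)} of {comb(len(NAMES), 2)} possible pairs compared")
```

Blocking cuts the comparisons from 10 to 2 in this toy case, but "SVB" never shares a block with "Silicon Valley Bank", so that pair is never even considered. Every approximate method trades recall for speed in some way like this.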
This is the fundamental tension: GraphRAG converts unstructured data into structured data, and structured data is harder to update than unstructured data. With vector RAG, adding a new document is trivial — chunk it, embed it, append to the index. With GraphRAG, adding a new document means potentially restructuring your entire knowledge representation.
What About Knowledge Graphs for Source Code?
There’s another place where knowledge graphs have been proposed recently: source code. Projects like CodexGraph (Liu et al., 2024) [9] and GraphCoder [10] build knowledge graphs from codebases — extracting entities like functions, classes, and modules, with edges for calls, imports, inheritance, and type relationships — then use graph retrieval to give LLMs better repository-level context.
The idea sounds appealing: code is full of relationships, and understanding a function often means understanding what it calls, what calls it, and what types it uses. A knowledge graph could capture all of that.
But here’s my issue: code is already structured data. Unlike natural language documents, source code has a formal grammar. We already have tools that parse it perfectly:
| Tool | What it provides | Maintenance cost |
|---|---|---|
| AST (tree-sitter) | Complete syntactic structure of every file | Zero — deterministic parse |
| LSP | Go-to-definition, find-references, call hierarchy, type info | Zero — runs on-demand from source |
| Package managers | Dependency graphs (pip, npm, cargo) | Zero — reads lockfiles |
| CodeQL / Semgrep | Data flow, taint tracking, control flow graphs | Near-zero — static analysis |
These tools give you the exact same entities and relations that a code knowledge graph would extract — functions, classes, call graphs, import chains, type hierarchies — but with perfect accuracy, zero LLM cost, and no maintenance burden. An AST is a lossless representation of code structure. An LLM-extracted knowledge graph is a lossy, probabilistic approximation of the same thing.
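As a small illustration of how cheap deterministic extraction is, the sketch below builds a call graph with Python's built-in `ast` module (standing in for tree-sitter, which plays the same role across languages). It only records calls to plain names; method calls and dynamic dispatch need type information, which is what LSP servers and static analyzers add. Calls inside nested functions would also be attributed to the enclosing function in this simplified version.

```python
import ast
from collections import defaultdict

SOURCE = '''
def load(path):
    return open(path).read()

def parse(text):
    return text.split()

def main(path):
    return parse(load(path))
'''


def call_graph(source):
    """Map each function name to the plain-name functions it calls."""
    tree = ast.parse(source)
    graph = defaultdict(set)
    for func in ast.walk(tree):
        if isinstance(func, ast.FunctionDef):
            for node in ast.walk(func):
                # Record calls like `load(...)`; skip attribute calls like `x.read()`
                if isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
                    graph[func.name].add(node.func.id)
    return {name: sorted(callees) for name, callees in graph.items()}


print(call_graph(SOURCE))
```

The same source always yields the same graph, with no API calls and no entity resolution. Re-running it after every commit costs milliseconds, which is the contrast with an LLM-extracted graph.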
Claude Code’s LSP integration [11] is a good example: it can jump to definitions, find all references, and traverse call hierarchies in real time, directly from the source. No graph database, no entity extraction pipeline, no community detection. Just the language server doing what it was designed to do.
graph LR
CODE["Source Code"] --> AST["AST\n(tree-sitter)"]
CODE --> LSP["LSP\n(definitions, refs)"]
CODE --> DEP["Dependency\nGraph"]
CODE --> SA["Static Analysis\n(CodeQL)"]
KG_APPROACH["LLM-Extracted\nCode KG"]
AST -->|"lossless, free"| SAME["Same Relations"]
LSP -->|"real-time, free"| SAME
DEP -->|"exact, free"| SAME
SA -->|"precise, free"| SAME
KG_APPROACH -->|"lossy, $$$"| SAME
style CODE fill:#264653,stroke:#264653,color:#fff
style AST fill:#2a9d8f,stroke:#2a9d8f,color:#fff
style LSP fill:#2a9d8f,stroke:#2a9d8f,color:#fff
style DEP fill:#2a9d8f,stroke:#2a9d8f,color:#fff
style SA fill:#2a9d8f,stroke:#2a9d8f,color:#fff
style KG_APPROACH fill:#e76f51,stroke:#e76f51,color:#fff
style SAME fill:#e9c46a,stroke:#e9c46a,color:#000
And the maintenance problem is even worse for code than for documents. Codebases change constantly: every commit modifies functions, adds files, changes call patterns. If maintaining a knowledge graph for a slowly changing document corpus is already painful, imagine maintaining one for a codebase with dozens of commits per day. Every refactor, every rename, every new dependency would require re-extraction and re-resolution.
The one argument for code KGs is cross-repository or cross-language reasoning — “which services depend on this shared library?” or “how does the Python backend connect to the TypeScript frontend?” LSP doesn’t cross language boundaries, and package managers don’t trace internal function calls across repos. But even here, tools like Sourcegraph [12] solve this with SCIP-based code intelligence [13] — deterministic, not probabilistic.
My take: knowledge graphs make sense when your data has no inherent structure. Documents, research papers, news articles: these genuinely benefit from having structure imposed on them. But code already has structure. Building a knowledge graph on top of code is building a lossy approximation of something you can already access losslessly. It’s solving a problem that’s already solved.
When to Use GraphRAG
| Scenario | Recommendation | Why |
|---|---|---|
| Static knowledge base, global queries | GraphRAG | Upfront cost is amortized; GraphRAG excels at global queries |
| Rapidly changing documents | Vector RAG | Update cost too high for GraphRAG |
| Specific factual lookups | Vector RAG | No need for global synthesis |
| Existing curated KG (Wikidata, domain ontology) | KG-augmented RAG | Skip the construction step entirely |
| Mixed: some global, some specific | Hybrid | Vector for specific, graph for thematic |
The honest answer for most teams: if you already have a knowledge graph, absolutely use it for retrieval. If you need to build one from scratch for GraphRAG, think very carefully about whether the maintenance cost is worth the improvement over vector RAG. The benchmarks are real — GraphRAG genuinely outperforms on global questions. But benchmarks run on static datasets. Production systems don’t stay static.
Tools like LangChain’s LLMGraphTransformer [4], Neo4j [5], and Microsoft’s GraphRAG library [6] have made the initial construction more accessible. But accessible construction doesn’t mean accessible maintenance.
My take: the future of GraphRAG is in incremental graph updates — methods that can add new knowledge without restructuring the entire graph. Some work is happening here (incremental community detection [7], streaming knowledge graph construction [8]), but it’s still early. Until incremental updates are solved, GraphRAG is best suited for corpora that change infrequently and are queried frequently with global, thematic questions.
What’s your experience with knowledge graphs in production? Are you building them from scratch or leveraging existing ones?
References:
[1] Edge et al. “From Local to Global: A Graph RAG Approach to Query-Focused Summarization.” arXiv 2024.
[2] Bratanic, T. “Implementing ‘From Local to Global’ GraphRAG with Neo4j and LangChain.” Neo4j Blog 2024.
[3] Traag, V. A. et al. “From Louvain to Leiden: guaranteeing well-connected communities.” Scientific Reports 2019.
[4] “How to construct knowledge graphs.” LangChain Documentation.
[5] “Neo4j Graph Database.” Neo4j.
[6] “GraphRAG.” Microsoft GitHub.
[7] Banerjee, P. et al. “Incremental Community Detection in Distributed Dynamic Graph.” arXiv 2023.
[8] Chuang, Y. et al. “Streaming Knowledge Graph Construction.” arXiv 2023.
[9] Liu et al. “CodexGraph: Bridging Large Language Models and Code Repositories via Code Graph Databases.” arXiv 2024.
[10] Liu et al. “GraphCoder: Enhancing Repository-Level Code Completion via Code Context Graph-based Retrieval and Language Model.” arXiv 2024.
[11] “Claude Code Overview.” Anthropic.
[12] “Sourcegraph — Code Intelligence Platform.” Sourcegraph.
[13] “Announcing SCIP — a better code indexing format.” Sourcegraph Blog.