GraphRAG + Blockchain Provenance on AWS: Relationship-Aware and Tamper-Evident QA
Scenario
A legal and compliance platform gets good recall from vector RAG but struggles with multi-hop reasoning. Teams need to answer questions that span entities, obligations, jurisdictions, and time, and they need to show that the evidence behind each answer has not been altered.
Why combine GraphRAG and blockchain-style provenance
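GraphRAG improves reasoning over relationships by expanding retrieval along graph edges instead of relying on chunk similarity alone. Provenance anchoring (hashing ingested artifacts and anchoring a Merkle root in a ledger) makes later tampering detectable, which improves trust and auditability. Combined, they support both “what the model found” and “why to trust that source path.”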
Architecture
graph TD
S3[Raw Docs in S3] --> ETL[Extraction Pipeline]
ETL --> KG[(Neptune/Neo4j Knowledge Graph)]
ETL --> VEC[(Vector Store)]
ETL --> PROOF[Hash + Signature Generator]
PROOF --> ROOT[Merkle Root + Ledger Anchor]
API[FastAPI Orchestrator] --> VEC
API --> KG
API --> VERIFY[Proof Verifier]
VERIFY --> ROOT
API --> LLM[Bedrock Model]
API --> AUDIT[(CloudWatch + DynamoDB Audit Trails)]
Trade-offs
- Higher retrieval quality for relationship-heavy queries.
- Increased ingestion complexity.
- Need robust schema governance for graph evolution.
Step-by-step tutorial
1) Provision graph + vector foundations
Bash:
aws neptune create-db-cluster \
    --db-cluster-identifier graphrag-cluster \
    --engine neptune \
    --db-subnet-group-name my-neptune-subnets \
    --vpc-security-group-ids sg-0123456789abcdef0
aws opensearchserverless create-collection \
    --name graphrag-vectors \
    --type VECTORSEARCH
PowerShell:
aws neptune create-db-cluster `
    --db-cluster-identifier graphrag-cluster `
    --engine neptune `
    --db-subnet-group-name my-neptune-subnets `
    --vpc-security-group-ids sg-0123456789abcdef0
aws opensearchserverless create-collection `
    --name graphrag-vectors `
    --type VECTORSEARCH
2) Extract entities and relations
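import re

# Heuristic: one or more capitalized tokens followed by a legal-entity suffix.
# Requiring capitalized tokens stops a match from running across a whole sentence.
ORG_PATTERN = re.compile(r"\b(?:[A-Z][A-Za-z0-9&]*\s+)+(?:LLC|Inc|Ltd|Corp|Bank)\b")

def extract_entities(text: str) -> list[str]:
    """Return the unique organization names found in text, sorted."""
    return sorted({m.group(0) for m in ORG_PATTERN.finditer(text)})

def build_relations(entities: list[str]) -> list[tuple[str, str, str]]:
    """Placeholder: chain adjacent entities with a generic RELATED_TO edge."""
    return [(a, "RELATED_TO", b) for a, b in zip(entities, entities[1:])]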
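Chaining adjacent entities is only a placeholder for real relation extraction. As a slightly more meaningful sketch, the example below links organizations that appear in the same sentence. The `CO_MENTIONED_WITH` label, the `cooccurrence_relations` name, the naive sentence splitter, and the capitalized-token pattern are all assumptions made for illustration; a production pipeline would more likely use an NER or LLM-based extractor.

```python
import re
from itertools import combinations

# Heuristic pattern (assumption): capitalized tokens followed by a legal suffix.
ORG_PATTERN = re.compile(r"\b(?:[A-Z][A-Za-z0-9&]*\s+)+(?:LLC|Inc|Ltd|Corp|Bank)\b")

def cooccurrence_relations(text: str) -> list[tuple[str, str, str]]:
    """Link organizations mentioned in the same sentence."""
    relations = set()
    # Naive sentence split: ., ! or ? followed by whitespace.
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        orgs = sorted({m.group(0) for m in ORG_PATTERN.finditer(sentence)})
        # Every unordered pair within the sentence becomes one edge.
        for a, b in combinations(orgs, 2):
            relations.add((a, "CO_MENTIONED_WITH", b))
    return sorted(relations)

sample = "Acme LLC signed a lease with Beta Bank. Gamma Corp was not involved."
print(cooccurrence_relations(sample))
```

Sentence-level co-occurrence still produces untyped edges, but it avoids linking entities only because they happen to sort next to each other.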
3) Load graph edges
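from gremlin_python.driver import client

# Script-mode client for the Neptune Gremlin endpoint.
gremlin = client.Client("wss://<neptune-endpoint>:8182/gremlin", "g")

# Vertex upsert: reuse the vertex if it exists, otherwise create it.
UPSERT_VERTEX = (
    "g.V().has('Entity','name',name).fold()"
    ".coalesce(unfold(), addV('Entity').property('name',name))"
)
# Edge upsert: no fold() after as('a'), so the label survives until from('a').
UPSERT_EDGE = (
    "g.V().has('Entity','name',src).as('a')"
    ".V().has('Entity','name',dst)"
    ".coalesce(__.inE(rel).where(outV().as('a')), __.addE(rel).from('a'))"
)

def upsert_edge(src, rel, dst):
    # Bindings keep values out of the query text, which avoids Gremlin injection.
    for name in (src, dst):
        gremlin.submit(UPSERT_VERTEX, {"name": name}).all().result()
    gremlin.submit(UPSERT_EDGE, {"src": src, "rel": rel, "dst": dst}).all().result()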
4) Query orchestration with graph expansion
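from fastapi import FastAPI

app = FastAPI()

def vector_candidates(question: str) -> list[str]:
    # Stub: replace with a k-NN query against the vector store.
    return ["chunk_12", "chunk_45", "chunk_90"]

def graph_expand(seed_entities: list[str]) -> list[str]:
    # Stub: replace with a bounded traversal of the knowledge graph.
    return ["clause_17", "jurisdiction_uk", "obligation_renewal"]

@app.post("/ask")
def ask(payload: dict):
    q = payload["question"]
    chunks = vector_candidates(q)
    # The seed entity is hard-coded for illustration; derive it from the question in practice.
    related = graph_expand(["Acme LLC"])
    return {"chunks": chunks, "graph_context": related}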
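The `graph_expand` stub returns fixed values. To show what a bounded expansion might look like, here is a minimal sketch of a breadth-first traversal with a depth cap. The `GRAPH` adjacency map and the `notice_period_90d` node are hypothetical stand-ins for the knowledge graph; against Neptune or Neo4j the same idea would be expressed as a traversal query with a hop limit.

```python
from collections import deque

# Toy adjacency map standing in for the knowledge graph (hypothetical data).
GRAPH = {
    "Acme LLC": ["clause_17"],
    "clause_17": ["obligation_renewal", "jurisdiction_uk"],
    "obligation_renewal": ["notice_period_90d"],
}

def graph_expand(seed_entities: list[str], max_depth: int = 2) -> list[str]:
    """Breadth-first expansion from the seeds, stopping at max_depth hops."""
    seen = set(seed_entities)
    queue = deque((seed, 0) for seed in seed_entities)
    found = []
    while queue:
        node, depth = queue.popleft()
        if depth == max_depth:
            continue  # do not expand beyond the depth cap
        for neighbor in GRAPH.get(node, []):
            if neighbor not in seen:
                seen.add(neighbor)
                found.append(neighbor)
                queue.append((neighbor, depth + 1))
    return found

print(graph_expand(["Acme LLC"]))               # two hops from the seed
print(graph_expand(["Acme LLC"], max_depth=3))  # one hop further
```

The `seen` set keeps cycles in the graph from causing repeated visits, and the depth cap bounds both latency and the amount of context passed to the model.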
5) Add provenance receipts
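import hashlib
import json

def receipt(doc_id: str, chunk_text: str, graph_path: str) -> dict:
    # Serialize the fields as a JSON array so field boundaries are unambiguous;
    # plain concatenation would hash ("ab", "c") and ("a", "bc") identically.
    payload = json.dumps([doc_id, chunk_text, graph_path], ensure_ascii=False)
    digest = hashlib.sha256(payload.encode("utf-8")).hexdigest()
    return {"doc_id": doc_id, "graph_path": graph_path, "digest": digest}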
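The architecture anchors a Merkle root in a ledger and checks proofs at query time, but the tutorial stops at individual receipts. The sketch below shows one way receipt digests could be combined into a root and how a single receipt could be verified against it. All function names are hypothetical. The 0x00/0x01 prefixes separate leaf hashes from interior hashes; duplicating the last node on odd levels is a simplification, and a real deployment should follow a specified tree construction.

```python
import hashlib

def _h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def _parents(level: list[bytes]) -> list[bytes]:
    # Hash adjacent pairs; the caller guarantees an even number of nodes.
    return [_h(b"\x01" + level[i] + level[i + 1]) for i in range(0, len(level), 2)]

def merkle_root(leaves: list[bytes]) -> bytes:
    """Root over the leaves; an odd node is paired with itself."""
    if not leaves:
        raise ValueError("at least one leaf is required")
    level = [_h(b"\x00" + leaf) for leaf in leaves]
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])
        level = _parents(level)
    return level[0]

def merkle_proof(leaves: list[bytes], index: int) -> list[tuple[bytes, bool]]:
    """Sibling hashes from leaf to root; the flag is True when the sibling is on the right."""
    level = [_h(b"\x00" + leaf) for leaf in leaves]
    proof = []
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])
        sibling = index ^ 1  # neighbor within the pair
        proof.append((level[sibling], sibling > index))
        level = _parents(level)
        index //= 2
    return proof

def verify_inclusion(leaf: bytes, proof: list[tuple[bytes, bool]], root: bytes) -> bool:
    """Recompute the path from the leaf and compare with the anchored root."""
    node = _h(b"\x00" + leaf)
    for sibling, sibling_is_right in proof:
        pair = node + sibling if sibling_is_right else sibling + node
        node = _h(b"\x01" + pair)
    return node == root

receipts = [b"receipt-1", b"receipt-2", b"receipt-3"]
root = merkle_root(receipts)
print(verify_inclusion(receipts[2], merkle_proof(receipts, 2), root))
print(verify_inclusion(b"tampered", merkle_proof(receipts, 2), root))
```

Only the root needs to be anchored in the ledger; each answer can then carry its receipt plus a proof whose length grows logarithmically with the number of receipts in the batch.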
6) Security and governance
- isolate graph write and read permissions
- enforce schema migration approvals
- sign ingestion artifacts
- maintain document lineage metadata
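As an illustration of the "sign ingestion artifacts" item, here is a minimal sketch that tags an artifact with HMAC-SHA256 and verifies it later. This is a stand-in, not a recommendation: HMAC is a symmetric MAC, so anyone who can verify can also forge, and a production setup would more likely use an asymmetric signing key held in a key management service. The function names and the inline key are hypothetical.

```python
import hashlib
import hmac

def sign_artifact(key: bytes, artifact: bytes) -> str:
    """Return a hex HMAC-SHA256 tag over the artifact bytes."""
    return hmac.new(key, artifact, hashlib.sha256).hexdigest()

def verify_artifact(key: bytes, artifact: bytes, tag: str) -> bool:
    """Recompute the tag and compare in constant time."""
    expected = sign_artifact(key, artifact)
    return hmac.compare_digest(expected, tag)

key = b"demo-key-do-not-use"  # assumption: real keys come from a secrets manager
artifact = b'{"doc_id": "contract-001", "chunks": 42}'
tag = sign_artifact(key, artifact)
print(verify_artifact(key, artifact, tag))         # unchanged artifact
print(verify_artifact(key, artifact + b" ", tag))  # modified artifact
```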
Monitoring and quality
Track:
- graph expansion depth and latency
- citation completeness
- grounded answer rate
- graph schema drift events
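The article does not define these metrics, so the sketch below assumes simple definitions: citation completeness is the share of claims that carry a citation, and grounded answer rate is the share of answers flagged as grounded by some upstream check. The record fields (`claims`, `cited_claims`, `grounded`) are hypothetical and would come from the audit trail in practice.

```python
def citation_completeness(records: list[dict]) -> float:
    """Cited claims divided by total claims across all answers."""
    total = sum(r["claims"] for r in records)
    cited = sum(r["cited_claims"] for r in records)
    return cited / total if total else 0.0

def grounded_answer_rate(records: list[dict]) -> float:
    """Fraction of answers flagged as grounded."""
    if not records:
        return 0.0
    return sum(1 for r in records if r["grounded"]) / len(records)

audit_records = [
    {"claims": 4, "cited_claims": 3, "grounded": True},
    {"claims": 4, "cited_claims": 4, "grounded": False},
]
print(citation_completeness(audit_records))
print(grounded_answer_rate(audit_records))
```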
Cost optimization
- cap graph traversal depth on the default query path
- warm caches for frequent entities
- run heavy enrichment offline
Pricing reminder: verify current pricing for Neptune, OpenSearch Serverless, Bedrock, and S3.
Production checklist
- Graph schema and ontology reviewed by domain experts
- Ingestion is idempotent and replay-safe
- Retrieval and proof receipts logged for each answer
- Security tests include poisoning and relationship tampering
- Cost dashboard includes graph and vector components
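To make the "idempotent and replay-safe" item concrete, the following sketch derives an idempotency key from the document ID and a content hash, and skips work when the key has been seen. The `IngestionLog` class is a hypothetical in-memory stand-in; a durable version might use a conditional write in a store such as DynamoDB so that concurrent replays cannot both succeed.

```python
import hashlib

class IngestionLog:
    """In-memory stand-in for a durable store with conditional writes."""

    def __init__(self) -> None:
        self._seen: set[str] = set()

    def ingest(self, doc_id: str, content: bytes) -> bool:
        """Return True if the document version is new, False on a replay."""
        key = f"{doc_id}:{hashlib.sha256(content).hexdigest()}"
        if key in self._seen:
            return False  # same document and content: nothing to do
        self._seen.add(key)
        # Extraction, graph upserts, and receipt generation would run here.
        return True

log = IngestionLog()
print(log.ingest("contract-001", b"v1 text"))  # first delivery
print(log.ingest("contract-001", b"v1 text"))  # replayed delivery
print(log.ingest("contract-001", b"v2 text"))  # changed content
```

Keying on the content hash as well as the ID means a revised document is processed again while an exact replay is ignored.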
References
- https://docs.aws.amazon.com/architecture-diagrams/latest/knowledge-graphs-and-graphrag-with-neo4j/knowledge-graphs-and-graphrag-with-neo4j.html
- https://docs.aws.amazon.com/prescriptive-guidance/latest/retrieval-augmented-generation-options/choosing-option.html
- https://docs.aws.amazon.com/prescriptive-guidance/latest/choosing-an-aws-vector-database-for-rag-use-cases/introduction.html