Fix Data Leakage in Vector Databases with Role-Based Access Control

Stop unauthorized users from retrieving sensitive embeddings in vector DBs. Implement RBAC with metadata filtering in Pinecone, Weaviate, and pgvector.

Problem: Your Vector Search Returns Data Users Shouldn't See

You've built a semantic search or RAG pipeline. Users are querying it. And somewhere in those results, a document from another tenant — or a confidential HR record, or a private customer file — is leaking through.

You'll learn:

  • Why vector databases skip access control by default
  • How to enforce RBAC using metadata filtering
  • Implementation patterns for Pinecone, Weaviate, and pgvector

Time: 25 min | Level: Intermediate


Why This Happens

Vector databases are designed for similarity search, not access control. When you store an embedding, most systems don't attach ownership or permissions to it — that's your job.

The query flow looks like this: a user submits a prompt → it's embedded → the nearest vectors are returned. Without filters, "nearest" means nearest across everyone's data, not just the caller's.

Common symptoms:

  • Semantic search returns documents from other users or tenants
  • RAG responses include context the current user has no business seeing
  • No errors — just wrong data silently surfacing

This gets worse in multi-tenant apps where all customers share the same vector index.
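To make the failure mode concrete, here is a toy in-memory sketch of the flow described above. The vectors, ids, and the `search` helper are invented for illustration and stand in for a real vector store; the optional `tenant_id` argument exists only to show both behaviors side by side.

```python
import math

# Toy 2-D "embeddings"; ids, tenants, and values are made up for illustration.
VECTORS = [
    {"id": "a-1", "tenant_id": "tenant-a", "values": [1.0, 0.0]},
    {"id": "a-2", "tenant_id": "tenant-a", "values": [0.0, 1.0]},
    {"id": "b-1", "tenant_id": "tenant-b", "values": [0.9, 0.1]},
]

def cosine(u, v):
    """Cosine similarity between two 2-D vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

def search(query, top_k, tenant_id=None):
    """Return ids of the top_k most similar vectors, optionally scoped to one tenant."""
    candidates = [v for v in VECTORS if tenant_id is None or v["tenant_id"] == tenant_id]
    ranked = sorted(candidates, key=lambda v: cosine(query, v["values"]), reverse=True)
    return [v["id"] for v in ranked[:top_k]]

query = [1.0, 0.05]  # a query issued by someone in tenant-a
print(search(query, top_k=2))                        # ['a-1', 'b-1']  (tenant-b leaks in)
print(search(query, top_k=2, tenant_id="tenant-a"))  # ['a-1', 'a-2']
```

The unfiltered call raises no error and returns plausible-looking results, which is why this kind of leak tends to go unnoticed.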

[Figure: unfiltered vector search crossing tenant boundaries. Without metadata filters, similarity search ignores ownership entirely.]


Solution

The fix is metadata filtering: tag every vector with ownership metadata at insert time, then apply a filter on every query. Never query without a filter.

Step 1: Tag Vectors at Insert Time

Whatever your vector store, attach user/tenant context as metadata. Don't store it in the document text — store it as structured metadata alongside the embedding.

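# Pinecone example (client-object API of the current Python SDK)
import os

from openai import OpenAI
from pinecone import Pinecone

client = OpenAI()  # reads OPENAI_API_KEY from the environment
pc = Pinecone(api_key=os.environ["PINECONE_API_KEY"])
index = pc.Index("your-index")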

def embed_and_store(doc_id: str, text: str, user_id: str, tenant_id: str, role: str):
    embedding = client.embeddings.create(
        input=text,
        model="text-embedding-3-small"
    ).data[0].embedding

    index.upsert(vectors=[{
        "id": doc_id,
        "values": embedding,
        "metadata": {
            "user_id": user_id,
            "tenant_id": tenant_id,        # Which organization owns this
            "access_roles": [role],        # Which roles can see it (list for multi-role)
            "classification": "internal"   # Optional: sensitivity label
        }
    }])

Expected: Every vector in your index now has ownership metadata attached.

If it fails:

  • Pinecone metadata size limit: Keep metadata under 40KB per vector — don't store the full document, store just the access control fields.
  • Missing metadata on old vectors: Backfill with index.update(id=doc_id, set_metadata={...}).
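Both failure modes above can be caught before the upsert. The following is a minimal sketch, assuming the three access-control keys used in this article; `validate_metadata` and its constants are hypothetical helpers, and the size check measures the compact JSON encoding, which only approximates how Pinecone counts metadata bytes.

```python
import json

# Keys every vector must carry for the access filter to work (from Step 1).
REQUIRED_KEYS = {"user_id", "tenant_id", "access_roles"}
MAX_METADATA_BYTES = 40 * 1024  # the 40KB per-vector budget mentioned above

def validate_metadata(metadata: dict) -> dict:
    """Raise ValueError if access-control fields are missing or the payload is too large."""
    missing = REQUIRED_KEYS - metadata.keys()
    if missing:
        raise ValueError(f"missing access-control keys: {sorted(missing)}")
    if not metadata["access_roles"]:
        raise ValueError("access_roles must list at least one role")
    # Approximation: byte length of the compact JSON form, not Pinecone's own accounting.
    size = len(json.dumps(metadata, separators=(",", ":")).encode("utf-8"))
    if size > MAX_METADATA_BYTES:
        raise ValueError(f"metadata is {size} bytes, over the {MAX_METADATA_BYTES} byte budget")
    return metadata
```

Calling `validate_metadata(...)` on the dictionary just before `index.upsert` turns a misspelled key such as `tenantId` into an immediate error instead of a vector that no filter will ever match.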

Step 2: Filter on Every Query

Build a query helper that requires caller identity. Never expose a raw query method — callers should never be able to opt out of filtering.

# Reuses `client` (OpenAI) and `index` (Pinecone) from Step 1

def secure_query(
    query_text: str,
    user_id: str,
    tenant_id: str,
    user_roles: list[str],
    top_k: int = 5,
) -> list[dict]:
    embedding = client.embeddings.create(
        input=query_text,
        model="text-embedding-3-small"
    ).data[0].embedding

    # Build filter — user sees their own docs OR docs shared with their roles
    # Never let this be optional
    access_filter = {
        "$and": [
            {"tenant_id": {"$eq": tenant_id}},   # Always scope to tenant first
            {
                "$or": [
                    {"user_id": {"$eq": user_id}},
                    {"access_roles": {"$in": user_roles}}
                ]
            }
        ]
    }

    results = index.query(
        vector=embedding,
        top_k=top_k,
        filter=access_filter,   # This is the critical line
        include_metadata=True
    )

    return results.matches

Expected: Query results contain only vectors matching the caller's tenant and role.

If it fails:

  • Empty results for valid queries: Confirm the metadata keys match exactly — tenant_id vs tenantId will silently return nothing.
  • $in not supported: Some index tiers don't support all filter operators — check your plan's filter support matrix.
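One way to keep the filter from ever being optional is to build it in a single function that fails closed. This is a sketch rather than part of any SDK: `build_access_filter` is a hypothetical helper that produces the same filter shape as `secure_query` above, refuses to run without an identity, and falls back to "own documents only" when the caller has no roles.

```python
def build_access_filter(user_id: str, tenant_id: str, user_roles: list[str]) -> dict:
    """Build the tenant + ownership/role filter, refusing to run without an identity."""
    if not user_id or not tenant_id:
        raise ValueError("user_id and tenant_id are required for every query")

    own_docs = {"user_id": {"$eq": user_id}}
    if user_roles:
        # Caller sees their own docs OR docs shared with any of their roles.
        visibility = {"$or": [own_docs, {"access_roles": {"$in": list(user_roles)}}]}
    else:
        # No roles: avoid sending an empty $in list, fall back to own docs only.
        visibility = own_docs

    # The tenant clause is always present and always AND-ed with the rest.
    return {"$and": [{"tenant_id": {"$eq": tenant_id}}, visibility]}
```

Keeping the key names in one place also reduces the chance of a `tenant_id` versus `tenantId` mismatch between insert and query code.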

Step 3: Weaviate — Use Built-in RBAC (v1.28+)

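Weaviate 1.28+ ships native RBAC. Use it instead of manual metadata filtering when possible. It only takes effect when authentication and RBAC authorization are enabled on the server, so an anonymous local instance will not enforce any of the roles below.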

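import weaviate
from weaviate.classes.config import Configure
from weaviate.classes.init import Auth
from weaviate.classes.rbac import Permissions

# RBAC needs an authenticated connection; the API key here is a placeholder
client = weaviate.connect_to_local(auth_credentials=Auth.api_key("admin-api-key"))

# Create a collection with tenant isolation
client.collections.create(
    name="Documents",
    multi_tenancy_config=Configure.multi_tenancy(enabled=True)
)

# Create a read-only role scoped to the Documents collection
client.roles.create(
    role_name="analyst",
    permissions=[
        Permissions.collections(collection="Documents", read_config=True),
        Permissions.data(collection="Documents", read=True),
    ]
)

# Assign role to user. The method name has changed across client releases
# (newer clients expose client.users.db / client.users.oidc), so check your version.
client.users.assign_roles(user_id="alice@company.com", role_names=["analyst"])

# Query scoped to tenant; Weaviate enforces access at the server
docs = client.collections.get("Documents")
tenant_docs = docs.with_tenant("tenant_acme")

# near_text assumes the collection has a vectorizer configured
results = tenant_docs.query.near_text(
    query="quarterly revenue",
    limit=5
)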

Why this is better: RBAC at the database level means a bug in your application code can't accidentally bypass it.


Step 4: pgvector — Use Row-Level Security

If you're running pgvector in Postgres, use Row-Level Security (RLS). This enforces access at the database engine, not the application layer.

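-- Enable RLS on your embeddings table; FORCE also applies it to the table owner
ALTER TABLE document_embeddings ENABLE ROW LEVEL SECURITY;
ALTER TABLE document_embeddings FORCE ROW LEVEL SECURITY;

-- Policy: users only see rows belonging to their tenant
CREATE POLICY tenant_isolation ON document_embeddings
    USING (tenant_id = current_setting('app.current_tenant')::uuid);

-- Policy: role-based visibility
-- RESTRICTIVE makes this AND with tenant_isolation; two permissive policies
-- would be OR-ed, letting a matching role read other tenants' rows
CREATE POLICY role_access ON document_embeddings
    AS RESTRICTIVE
    USING (
        required_role = ANY(
            string_to_array(current_setting('app.current_roles'), ',')
        )
    );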
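
# Python: set the RLS context before every query
import os

import psycopg2

DATABASE_URL = os.environ["DATABASE_URL"]  # assumed to be provided via the environment

def get_secure_connection(user_id: str, tenant_id: str, roles: list[str]):
    conn = psycopg2.connect(DATABASE_URL)
    cur = conn.cursor()

    # Set the context that the RLS policies read. The third argument (true) makes
    # each setting transaction-local, so run the search on this connection
    # before committing; after a commit or rollback the values are gone.
    cur.execute("SELECT set_config('app.current_tenant', %s, true)", (tenant_id,))
    cur.execute("SELECT set_config('app.current_roles', %s, true)", (",".join(roles),))

    return conn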
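
-- Your vector similarity query; RLS applies automatically.
-- Ordering by the distance operator itself lets pgvector use an ANN index
-- (pass the query vector for both placeholders).
SELECT id, content, 1 - (embedding <=> %s::vector) AS similarity
FROM document_embeddings
ORDER BY embedding <=> %s::vector
LIMIT 5;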

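Expected: Postgres filters rows before your application ever sees them, so application code cannot accidentally skip the check. This holds only when the application connects as a role that is subject to RLS: superusers, roles with BYPASSRLS, and (unless the table uses FORCE ROW LEVEL SECURITY) the table owner all bypass policies.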

[Figure: pgvector RLS query flow. RLS policies intercept the query at the storage engine level.]
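Because the settings are written with `set_config(..., true)`, they live only as long as the surrounding transaction. A small context manager can keep "set context, query, commit" together so the two never drift apart. This is a sketch under stated assumptions: `tenant_transaction` is a hypothetical helper, and `conn` is any DB-API connection such as the one returned by `psycopg2.connect`.

```python
from contextlib import contextmanager

@contextmanager
def tenant_transaction(conn, tenant_id: str, roles: list[str]):
    """Yield a cursor whose transaction carries the tenant and role context for RLS."""
    if not tenant_id or not roles:
        raise ValueError("tenant_id and at least one role are required")
    if any("," in role for role in roles):
        # Roles are joined with commas for string_to_array, so a comma would split one role in two.
        raise ValueError("role names must not contain commas")

    cur = conn.cursor()
    try:
        # Transaction-local settings read by the RLS policies.
        cur.execute("SELECT set_config('app.current_tenant', %s, true)", (tenant_id,))
        cur.execute("SELECT set_config('app.current_roles', %s, true)", (",".join(roles),))
        yield cur
        conn.commit()
    except Exception:
        conn.rollback()
        raise
    finally:
        cur.close()
```

Usage would look like `with tenant_transaction(conn, tenant_id, roles) as cur: cur.execute(similarity_sql, params)`, where `similarity_sql` is the query above. Fetch the rows inside the `with` block, since the cursor is closed on exit.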


Verification

Test that your filters actually block cross-tenant access. This is worth automating.

import pytest

def test_tenant_isolation(index):
    # Store a doc owned by tenant A
    embed_and_store("doc-1", "confidential revenue data", "user-a", "tenant-a", "admin")

    # Query as tenant B — should return nothing
    results = secure_query(
        query_text="revenue data",
        user_id="user-b",
        tenant_id="tenant-b",
        user_roles=["analyst"]
    )

    assert len(results) == 0, "Cross-tenant data leaked!"

def test_role_access(index):
    embed_and_store("doc-2", "salary bands", "hr-user", "tenant-a", "hr")

    # Admin should see HR docs
    admin_results = secure_query("salary", "admin-user", "tenant-a", ["admin", "hr"])
    assert len(admin_results) > 0

    # Analyst should NOT see HR docs
    analyst_results = secure_query("salary", "analyst-user", "tenant-a", ["analyst"])
    assert len(analyst_results) == 0

pytest tests/test_vector_rbac.py -v

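You should see: Both tests pass. If test_tenant_isolation fails, your filter isn't being applied. Note that test_tenant_isolation also passes against an empty index, so confirm the document was actually stored (for example by querying as its owner) before trusting a green result.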

[Figure: test output showing both RBAC tests passing. Both isolation tests should pass before you ship this to production.]
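If your vector store is eventually consistent, a freshly upserted vector may not be queryable right away, which can make the positive assertions flaky and the negative ones pass for the wrong reason. A minimal polling sketch, where `wait_until` is a hypothetical test helper and the timeout values are arbitrary:

```python
import time

def wait_until(check, timeout_s: float = 30.0, interval_s: float = 0.5) -> bool:
    """Call check() repeatedly until it returns a truthy value or timeout_s elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        if check():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)
```

In the tests above, one option is to first assert `wait_until(lambda: len(secure_query("revenue data", "user-a", "tenant-a", ["admin"])) > 0)` so the owner can see the document, and only then assert that tenant B sees nothing.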


What You Learned

  • Vector databases don't enforce access control by default, so you have to build it in from day one
  • Metadata filters (Pinecone), native RBAC (Weaviate 1.28+), and RLS (pgvector/Postgres) are the three main approaches
  • Database-level enforcement (RLS, Weaviate RBAC) is more robust than application-layer filtering

Limitation: Metadata filtering doesn't prevent embedding inversion attacks. A determined attacker with API access can still probe the embedding space. For highly sensitive data, consider physical isolation (a separate namespace or a separate index per tenant) rather than filter-based isolation.
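As a rough sketch of the per-tenant namespace variant: derive the namespace from the tenant id in one place and pass it on every query, keeping the metadata filter for role checks inside the tenant. The `namespace_for` and `query_tenant_namespace` helpers and the `tenant-` prefix are assumptions made for illustration; `index` is expected to be a Pinecone index object, whose `query` method accepts a `namespace` argument.

```python
import re

# Conservative allow-list for tenant ids; adjust to whatever your ids look like.
_TENANT_ID = re.compile(r"[A-Za-z0-9_-]{1,64}")

def namespace_for(tenant_id: str) -> str:
    """Map a tenant id to its namespace name, rejecting anything unexpected."""
    if not _TENANT_ID.fullmatch(tenant_id or ""):
        raise ValueError(f"invalid tenant id: {tenant_id!r}")
    return f"tenant-{tenant_id}"

def query_tenant_namespace(index, embedding, tenant_id, access_filter, top_k=5):
    """Query only the caller's namespace; the filter still handles roles within it."""
    return index.query(
        vector=embedding,
        top_k=top_k,
        namespace=namespace_for(tenant_id),
        filter=access_filter,
        include_metadata=True,
    )
```

Upserts need the same `namespace=namespace_for(tenant_id)` argument, otherwise vectors land in the default namespace and never show up in these queries.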

When NOT to use this pattern: If your data has no access hierarchy (fully public index), adding filters adds latency with no benefit. Profile the filter overhead — on large indexes with complex $and/$or filters, it can add 10–40ms per query.
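To measure the overhead rather than guess at it, a small timing harness is usually enough. This sketch assumes nothing about the vector store: `median_latency_ms` is a hypothetical helper that times any zero-argument callable, and the median is used so a single slow request does not skew the comparison.

```python
import statistics
import time

def median_latency_ms(run_query, repeats: int = 20) -> float:
    """Run run_query() `repeats` times and return the median wall-clock latency in ms."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        run_query()
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(samples)
```

A comparison might call it twice, once with a callable that issues the query using only the tenant clause and once with the full `$and`/`$or` filter, against the same index and the same query embedding.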


Tested on Pinecone serverless (2025-Q4), Weaviate 1.28.3, pgvector 0.7.x, Python 3.12