How to Give Your AI Agent Memory Across Sessions

Implement persistent memory for AI agents using embeddings, vector stores, and summarization — so context survives beyond a single conversation.

Problem: Your AI Agent Forgets Everything

You build a great AI agent. A user comes back the next day — the agent has no idea who they are, what they asked, or what was decided. Every session starts from zero.

You'll learn:

  • Why LLMs are stateless and what that means for agents
  • Three practical memory patterns: buffer, summary, and vector search
  • How to persist and retrieve memory across sessions using Python

Time: 25 min | Level: Intermediate


Why This Happens

LLMs don't store state. When a session ends, the context window is gone. Nothing carries over unless you explicitly save and reload it.

Most frameworks handle in-session memory fine — but cross-session memory is your problem to solve.

Common symptoms:

  • Agent re-asks for user's name, preferences, or prior decisions
  • No continuity in multi-day workflows
  • Users frustrated by repeating themselves

Solution

There are three memory patterns. Use the one that matches your use case.

Pattern       | Best For                          | Storage
------------- | --------------------------------- | ---------
Buffer        | Short sessions, exact recall      | DB / file
Summary       | Long sessions, big picture        | DB + LLM
Vector search | Large history, semantic retrieval | Vector DB
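If you want a starting point for the decision, the heuristic below captures the table as code. The thresholds are illustrative assumptions — tune them to your model's context window and traffic:

```python
def choose_memory_pattern(message_count: int, needs_exact_recall: bool) -> str:
    """Illustrative heuristic for picking a memory pattern.

    Thresholds here are assumptions, not fixed rules.
    """
    if needs_exact_recall and message_count <= 50:
        return "buffer"   # small history, verbatim recall matters
    if message_count <= 500:
        return "summary"  # compress older turns into a paragraph
    return "vector"       # large or multi-topic history: retrieve selectively
```
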

Step 1: Set Up Storage

You need somewhere to persist memory between sessions. SQLite is fine to start.

import sqlite3
from datetime import datetime, timezone

def get_db():
    conn = sqlite3.connect("agent_memory.db")
    conn.execute("""
        CREATE TABLE IF NOT EXISTS memories (
            session_id TEXT,
            user_id    TEXT,
            role       TEXT,
            content    TEXT,
            timestamp  TEXT
        )
    """)
    conn.commit()
    return conn

def save_message(user_id: str, session_id: str, role: str, content: str):
    db = get_db()
    db.execute(
        "INSERT INTO memories VALUES (?, ?, ?, ?, ?)",
        # Timezone-aware timestamp; datetime.utcnow() is deprecated in Python 3.12
        (session_id, user_id, role, content, datetime.now(timezone.utc).isoformat())
    )
    db.commit()

def load_history(user_id: str, limit: int = 20) -> list[dict]:
    db = get_db()
    rows = db.execute(
        "SELECT role, content FROM memories WHERE user_id = ? ORDER BY timestamp DESC LIMIT ?",
        (user_id, limit)
    ).fetchall()
    # Reverse so oldest messages come first
    return [{"role": r[0], "content": r[1]} for r in reversed(rows)]

Expected: an agent_memory.db file created on first run.
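To sanity-check the schema without touching agent_memory.db, you can run the same statements against an in-memory database. This mirrors the functions above but is self-contained for illustration:

```python
import sqlite3

# Same schema as agent_memory.db, but in memory so nothing hits the disk
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE IF NOT EXISTS memories (
        session_id TEXT, user_id TEXT, role TEXT, content TEXT, timestamp TEXT
    )
""")
conn.execute(
    "INSERT INTO memories VALUES (?, ?, ?, ?, ?)",
    ("s1", "user_42", "user", "My name is Dana.", "2025-01-01T10:00:00"),
)
conn.execute(
    "INSERT INTO memories VALUES (?, ?, ?, ?, ?)",
    ("s2", "user_42", "assistant", "Nice to meet you, Dana.", "2025-01-02T10:00:00"),
)

# Same query shape as load_history: newest first, then reversed to chronological
rows = conn.execute(
    "SELECT role, content FROM memories WHERE user_id = ? ORDER BY timestamp DESC LIMIT 20",
    ("user_42",),
).fetchall()
history = [{"role": r[0], "content": r[1]} for r in reversed(rows)]
print(history[0])  # oldest message comes first
```
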


Step 2: Buffer Memory (Simple Recall)

Reload the last N messages and prepend them to every new request.

import anthropic

client = anthropic.Anthropic()

def chat_with_memory(user_id: str, session_id: str, user_input: str) -> str:
    # Load prior history for this user
    history = load_history(user_id, limit=20)

    # Append the new user message
    history.append({"role": "user", "content": user_input})

    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1024,
        system="You are a helpful assistant. Use the conversation history to maintain context.",
        messages=history
    )

    reply = response.content[0].text

    # Persist both sides of the exchange
    save_message(user_id, session_id, "user", user_input)
    save_message(user_id, session_id, "assistant", reply)

    return reply

Why this works: The model sees prior exchanges just like a continuing conversation, even though they came from a different session.

If it fails:

  • Context too long error: Reduce limit or switch to summary memory below
  • Wrong user history loading: Double-check you're filtering by user_id, not just session_id
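One way to handle the "context too long" failure is to trim the buffer to a rough character budget before sending it. Characters are a crude proxy for tokens (roughly 4 characters per token for English text), so treat the budget as an assumption to tune — this helper is a sketch, not part of the storage layer above:

```python
def trim_history(history: list[dict], max_chars: int = 8000) -> list[dict]:
    """Keep the most recent messages whose total length fits the budget."""
    trimmed: list[dict] = []
    total = 0
    # Walk newest-to-oldest so recent context survives, then restore order
    for msg in reversed(history):
        total += len(msg["content"])
        if total > max_chars:
            break
        trimmed.append(msg)
    return list(reversed(trimmed))
```

You would call it as `history = trim_history(load_history(user_id, limit=50))` before building the request.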

Step 3: Summary Memory (Long-Running Agents)

Buffer memory hits token limits fast. Summary memory compresses old context into a paragraph and prepends that instead.

def summarize_history(history: list[dict]) -> str:
    if not history:
        return ""

    formatted = "\n".join(f"{m['role'].upper()}: {m['content']}" for m in history)

    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=300,
        messages=[{
            "role": "user",
            "content": (
                f"Summarize this conversation history in 3-5 sentences. "
                f"Focus on decisions made, user preferences, and open tasks.\n\n{formatted}"
            )
        }]
    )
    return response.content[0].text


def chat_with_summary_memory(user_id: str, session_id: str, user_input: str) -> str:
    history = load_history(user_id, limit=50)
    summary = summarize_history(history)

    messages = []
    if summary:
        # Inject summary as context before the live message
        messages.append({
            "role": "user",
            "content": f"[Context from prior sessions]: {summary}"
        })
        messages.append({
            "role": "assistant",
            "content": "Understood. I'll keep that context in mind."
        })

    messages.append({"role": "user", "content": user_input})

    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1024,
        system="You are a helpful assistant with memory of prior interactions.",
        messages=messages
    )

    reply = response.content[0].text
    save_message(user_id, session_id, "user", user_input)
    save_message(user_id, session_id, "assistant", reply)

    return reply

Why this works: Summaries are 10-20x smaller than raw history, so you can cover weeks of conversation in a few hundred tokens.
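One caveat: as written, chat_with_summary_memory re-summarizes 50 messages on every turn — an extra LLM call per request. A common refinement is to cache the summary and refresh it only after N new messages. The sketch below takes the summarizer as a parameter (e.g. summarize_history above) so the caching logic stays testable; the module-level cache dict stands in for a real DB column and is an assumption of this sketch:

```python
# user_id -> (message count when last summarized, summary text)
_summary_cache: dict[str, tuple[int, str]] = {}

def cached_summary(user_id: str, history: list[dict],
                   summarize, refresh_every: int = 10) -> str:
    """Return a cached summary, refreshing only after `refresh_every` new messages."""
    count, summary = _summary_cache.get(user_id, (0, ""))
    if not summary or len(history) - count >= refresh_every:
        summary = summarize(history)
        _summary_cache[user_id] = (len(history), summary)
    return summary
```

In production you would persist the cache next to the messages table so it survives restarts too.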


Step 4: Vector Search Memory (Large or Multi-Topic History)

For agents with extensive history, use embeddings to retrieve only the relevant past context — not all of it.

First install the dependencies (the local embedding model below also needs the sentence-transformers package):

pip install chromadb sentence-transformers anthropic

import chromadb
from chromadb.utils import embedding_functions

# Use a local embedding model or swap for OpenAI/Cohere
ef = embedding_functions.SentenceTransformerEmbeddingFunction(
    model_name="all-MiniLM-L6-v2"
)

chroma = chromadb.PersistentClient(path="./chroma_memory")
collection = chroma.get_or_create_collection("agent_memory", embedding_function=ef)


def store_memory(user_id: str, text: str, metadata: dict | None = None):
    # None default avoids the mutable-default-argument pitfall
    doc_id = f"{user_id}_{datetime.utcnow().timestamp()}"
    collection.add(
        documents=[text],
        metadatas=[{"user_id": user_id, **(metadata or {})}],
        ids=[doc_id]
    )


def retrieve_relevant_memory(user_id: str, query: str, top_k: int = 5) -> list[str]:
    results = collection.query(
        query_texts=[query],
        n_results=top_k,
        where={"user_id": user_id}
    )
    return results["documents"][0] if results["documents"] else []


def chat_with_vector_memory(user_id: str, session_id: str, user_input: str) -> str:
    # Fetch semantically relevant past exchanges
    relevant = retrieve_relevant_memory(user_id, user_input)
    context = "\n".join(relevant)

    system_prompt = "You are a helpful assistant."
    if context:
        system_prompt += f"\n\nRelevant past context:\n{context}"

    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1024,
        system=system_prompt,
        messages=[{"role": "user", "content": user_input}]
    )

    reply = response.content[0].text

    # Store this exchange for future retrieval
    store_memory(user_id, f"User: {user_input}\nAssistant: {reply}", {"session": session_id})

    return reply

Why this works: Instead of dumping all history into the prompt, you embed the user's current question and pull only the most semantically similar past exchanges. Fast and token-efficient.

If it fails:

  • ChromaDB import error: Run pip install chromadb --upgrade
  • Empty results on first run: Expected — the store is empty until previous exchanges have been saved
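If the retrieval step feels like a black box: under the hood, the embedding function maps each text to a vector, and the query returns the stored texts whose vectors are closest to the query's vector. A toy version with hand-made 3-dimensional vectors (real embeddings have hundreds of dimensions) shows the mechanics:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Pretend these are embeddings of past exchanges (hand-made for illustration)
memories = {
    "User prefers dark mode":      [0.9, 0.1, 0.0],
    "User's deploy target is AWS": [0.0, 0.2, 0.9],
}
query = [0.8, 0.2, 0.1]  # stands in for the embedding of "what UI theme does the user like?"

best = max(memories, key=lambda text: cosine_similarity(memories[text], query))
print(best)  # the dark-mode memory is closest to the query
```

ChromaDB does exactly this at scale, with approximate nearest-neighbor indexing instead of a linear scan.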

Verification

Run a quick two-session test:

# Session 1
reply1 = chat_with_memory("user_42", "session_001", "My name is Dana and I prefer dark mode.")
print(reply1)

# Simulate a new session
reply2 = chat_with_memory("user_42", "session_002", "What UI preference did I mention?")
print(reply2)
# Expected: Agent recalls "dark mode" preference from session_001

You should see: The agent correctly references Dana's dark mode preference in the second session without being told again.


What You Learned

  • LLMs are stateless — persistence is always your responsibility
  • Buffer memory is the simplest approach but hits token limits quickly
  • Summary memory compresses history, trading exact recall for scale
  • Vector search retrieves relevant context without loading everything — best for production agents with long-running users

Limitation: Summary memory loses specific details. If exact recall matters (e.g., a user's account number), use buffer or store structured data separately alongside your memory layer.

When NOT to use this: Single-turn tools or stateless APIs where continuity doesn't add value — the overhead isn't worth it.


Tested on Python 3.12, anthropic SDK 0.40+, ChromaDB 0.5+, Ubuntu 24 & macOS Sequoia