Build a production-ready LLM fallback chain across OpenAI, Anthropic, and Groq using Python 3.12. Prevent outages with automatic provider switching and retry logic.
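As a minimal sketch of the fallback pattern this post describes — an ordered provider list with retries and exponential backoff — using stand-in callables instead of real SDK clients (the `call_with_fallback` helper and the provider stubs are illustrative, not from any vendor SDK):

```python
import time

def call_with_fallback(providers, prompt, retries=2, backoff=0.1):
    """Try each provider in order; retry transient failures, then fall through.

    `providers` is an ordered list of (name, callable) pairs; each callable
    takes a prompt string and returns a completion string.
    """
    errors = {}
    for name, call in providers:
        for attempt in range(retries):
            try:
                return name, call(prompt)
            except Exception as exc:  # real code would catch provider-specific errors
                errors[name] = exc
                time.sleep(backoff * (2 ** attempt))  # exponential backoff
    raise RuntimeError(f"All providers failed: {errors}")

# Stand-ins; real ones would wrap the OpenAI, Anthropic, and Groq clients.
def flaky(prompt):
    raise TimeoutError("rate limited")

def stable(prompt):
    return "echo: " + prompt

provider_used, answer = call_with_fallback(
    [("openai", flaky), ("anthropic", stable)], "hello"
)
```

Because the primary stub always fails, the chain retries it, backs off, and lands on the second provider.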
Cache LLM responses in Redis with exact-match and semantic strategies. Reduce OpenAI and Anthropic API costs by up to 60% with Python 3.12 and Redis 7.2.
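The exact-match half of that strategy is simple to sketch. Here a plain dict stands in for Redis (swapping in redis-py's `get`/`set` with a TTL is the real-world change); the class and key scheme are illustrative, not from any library:

```python
import hashlib
import json

class ExactMatchCache:
    """Exact-match LLM response cache keyed on a hash of model + prompt."""

    def __init__(self):
        self._store = {}  # stand-in for a Redis connection

    def _key(self, model, prompt):
        # Hash the model+prompt pair so keys are fixed-length and Redis-safe.
        payload = json.dumps({"model": model, "prompt": prompt}, sort_keys=True)
        return "llmcache:" + hashlib.sha256(payload.encode()).hexdigest()

    def get_or_call(self, model, prompt, llm_call):
        key = self._key(model, prompt)
        if key in self._store:
            return self._store[key], True   # cache hit: no API call
        response = llm_call(prompt)
        self._store[key] = response
        return response, False              # cache miss: paid API call

cache = ExactMatchCache()
calls = []

def fake_llm(prompt):
    calls.append(prompt)
    return prompt.upper()

first, hit1 = cache.get_or_call("gpt-4o", "hi", fake_llm)
second, hit2 = cache.get_or_call("gpt-4o", "hi", fake_llm)
```

The second identical request never reaches the model, which is where the cost savings come from; semantic caching replaces the exact hash with an embedding-similarity lookup.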
Use tiktoken to count tokens exactly for GPT-4o and estimate counts for Claude and Gemini models. Avoid context-limit errors with per-model limits and Python 3.12 examples.
Build a fully local RAG pipeline using Ollama and LangChain in Python. Ingest PDFs, embed with nomic-embed-text, retrieve with FAISS, and query with Llama 3. No API keys.
Use the LM Studio REST API to build Python and Node.js apps with local LLMs. OpenAI-compatible endpoints, streaming, and zero cloud costs. Tested on LM Studio 0.3.
Add BGE Reranker cross-encoder reranking to your RAG pipeline in Python 3.12. Cut hallucinations and boost retrieval precision with FlagEmbedding + LangChain.
Implement ColBERT and PLAID late-interaction retrieval for RAG in Python 3.12. Get 30–50% better recall than dense bi-encoders with sub-100ms latency on CPU.
OpenAI prompt caching cuts latency by up to 80% and cost by 50% on repeated context. Learn how cache hits work, when to use it, and how to structure prompts. Tested in Python and Node.js.
Build a GraphRAG pipeline with Neo4j, LangChain, and Python 3.12. Boost retrieval accuracy with knowledge graphs over standard vector RAG. Tested on AWS us-east-1.
Run Groq Compound AI with Mixture-of-Agents inference in Python. Layer proposer and aggregator LLMs on Groq's LPU for sub-second ensemble responses. Tested on Python 3.12.
Build a LlamaIndex property graph for Knowledge Graph RAG using Python 3.12 and Neo4j. Extract entities, query graph context, and outperform vector-only RAG.
Implement prompt caching for system prompts and few-shot examples with the Anthropic Claude API. Cut latency by 80% and token costs by 90% with Python 3.12. Tested on real workloads.
Add reranking to your RAG pipeline using Cohere Rerank API and FlashRank local model. Boost retrieval precision with Python 3.12, LangChain, and FAISS.
Build RAG over tables from PDFs and Excel using Python, LangChain, and Chroma. Parse structured data with Camelot and openpyxl, then embed table chunks for accurate retrieval.
Master Solidity 0.8.26 transient storage with TSTORE and TLOAD opcodes. Save up to 80% gas on reentrancy locks and flash loans. Tested on Foundry and Hardhat.
Use Python requests to call Ollama's REST API directly — no SDK needed. Chat, generate, stream, and embed with full control. Tested on Python 3.12 + Ollama 0.5.
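The `/api/generate` endpoint the post covers takes a small JSON body. A hedged sketch of the raw HTTP call — using stdlib `urllib` instead of `requests` so it is self-contained, with the default `localhost:11434` bind address and a `llama3` model tag as assumptions:

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # Ollama's default bind address

def build_generate_request(model, prompt, stream=False):
    """Build the POST request for Ollama's /api/generate endpoint."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": stream}).encode()
    return urllib.request.Request(
        OLLAMA_URL + "/api/generate",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def generate(model, prompt):
    """Send the request and return the 'response' field (non-streaming)."""
    req = build_generate_request(model, prompt)
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

# Build (but don't send) a request; sending requires a running Ollama server.
req = build_generate_request("llama3", "Why is the sky blue?")
```

With `"stream": true` the server instead returns newline-delimited JSON chunks, which is what the streaming sections of these posts iterate over.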
Build llama.cpp from source for CPU, NVIDIA CUDA, and Apple Metal backends. Step-by-step compilation on Ubuntu 24, Windows 11, and macOS with M-series chips.
Set GPU layers in LM Studio to maximize VRAM usage and inference speed. Includes per-model calculations for 8GB, 16GB, and 24GB cards. Tested on RTX 4070 and M2 Max.
Set up Ollama concurrent requests and parallel inference with OLLAMA_NUM_PARALLEL, OLLAMA_MAX_QUEUE, and GPU config. Tested on RTX 4090 and M3 Max, Docker + Linux.
Use Anthropic prompt caching to slash Claude API costs by up to 90% and cut latency by 85%. Step-by-step Python and TypeScript setup for claude-sonnet-4-20250514.
Use Gemini context caching to slash API costs on long documents. Save up to 75% on repeated prompts with Python 3.12 and google-genai SDK. Tested on 1M token docs.
Serve ML models in production with BentoML 1.4. Build REST APIs, batch runners, and Docker containers from any framework. Tested on Python 3.12 + CUDA 12.
Run Modal serverless GPU compute for ML workloads: training, inference, and batch jobs on A100s with Python 3.12. No cluster management. Starts at $0.000463/GPU-sec.
Use the Replicate API to deploy open-source models like Llama 3, SDXL, and Whisper in minutes. Python + Node.js examples, pricing in USD, no GPU required.
Deploy vLLM as a production-ready OpenAI-compatible LLM API on Docker with tensor parallelism, quantization, and auth. Tested on CUDA 12.4 + Python 3.12.
Extend Ollama context length beyond the 2048-token default using num_ctx, Modelfiles, and API parameters. Tested on Llama 3.3, Qwen2.5, and Mistral with CUDA and Metal.
Add metadata filtering to your RAG pipeline to narrow vector search by document tags, date, or category. Tested with Python 3.12, LangChain 0.3, and Qdrant.
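The idea reduces to "filter on metadata first, then rank" — a sketch with a naive term-overlap scorer standing in for vector similarity (the `Doc` class, `where` dict shape, and scorer are illustrative; Qdrant applies an equivalent filter inside its ANN search):

```python
from dataclasses import dataclass, field

@dataclass
class Doc:
    text: str
    metadata: dict = field(default_factory=dict)

def filter_then_search(docs, query_terms, where):
    """Keep only docs whose metadata matches `where`, then rank by term overlap."""
    candidates = [
        d for d in docs
        if all(d.metadata.get(k) == v for k, v in where.items())
    ]
    def score(d):
        words = d.text.lower().split()
        return sum(term in words for term in query_terms)
    return sorted(candidates, key=score, reverse=True)

docs = [
    Doc("refund policy for 2024 orders", {"category": "billing", "year": 2024}),
    Doc("refund policy for 2023 orders", {"category": "billing", "year": 2023}),
    Doc("gpu setup guide", {"category": "infra", "year": 2024}),
]
hits = filter_then_search(docs, ["refund", "policy"], {"category": "billing", "year": 2024})
```

Without the `where` filter, the 2023 policy would score identically and pollute the context window; the metadata constraint removes it before ranking ever runs.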
Use the Ollama REST API to integrate local LLMs into Python, Node.js, or any HTTP client. Covers /api/generate, /api/chat, streaming, and embeddings. Tested on Ollama 0.6.
Compare LM Studio and Ollama on setup, API compatibility, model management, GPU support, and local dev workflow. Choose the right local LLM tool for your stack.
Deploy serverless GPU training and inference on Modal Labs using Python 3.12. Run Llama, Stable Diffusion, and custom models on A100s with zero cold-start config.
Serve any GGUF model as an OpenAI-compatible REST API using llama.cpp server. Drop-in replacement for GPT-4o endpoints. Tested on Ubuntu 24 + CUDA 12.4.
Deploy Mistral Pixtral 12B for multimodal vision tasks with Python 3.12 and vLLM. Process images and text with state-of-the-art vision-language inference. Tested on CUDA 12.
Load MLX-format models in LM Studio for fast local inference on Apple Silicon. Covers M1/M2/M3/M4, unified memory, model selection, and benchmark results.
Deploy SGLang for fast LLM inference and structured generation with JSON schema constraints. Tested on Python 3.12, CUDA 12, Docker — A100 and RTX 4090.
Configure LM Studio preset system prompts and custom chat templates to control model behavior, persona, and output format. Tested on LM Studio 0.3 + Llama 3.