Problem: Pandas is Slowing Down Your Data Pipeline
Your data processing runs fine on small datasets but takes 10+ minutes on production data. Polars is often 5-50x faster, but manually rewriting hundreds of lines of Pandas code feels overwhelming.
You'll learn:
- How to auto-convert Pandas code to Polars using Claude API
- Which patterns translate directly and which need manual fixes
- How to verify your conversion works correctly
Time: 20 min | Level: Intermediate
Why This Works
Polars is written in Rust and uses lazy evaluation and multithreaded execution, making it significantly faster than Pandas for most operations. The APIs are similar enough that roughly 80% of Pandas code can be mechanically translated, but different enough that doing it manually is tedious and error-prone.
Common symptoms of needing this:
- Processing 1M+ rows takes minutes instead of seconds
- Memory usage spikes during groupby or merge operations
- Need to process data in parallel across CPU cores
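The lazy-evaluation idea behind much of that speedup can be sketched in pure Python. This is a toy model only (not how Polars is actually implemented): operations are recorded instead of executed, then fused into a single pass over the data when collect() is called.

```python
class ToyLazyFrame:
    """Toy model of lazy evaluation: record operations, run them on collect()."""

    def __init__(self, rows):
        self.rows = rows
        self.ops = []  # deferred (name, function) pairs

    def filter(self, predicate):
        self.ops.append(("filter", predicate))
        return self

    def select(self, transform):
        self.ops.append(("select", transform))
        return self

    def collect(self):
        # One pass over the data, with all recorded steps fused together
        out = []
        for row in self.rows:
            keep = True
            value = row
            for name, fn in self.ops:
                if name == "filter":
                    if not fn(value):
                        keep = False
                        break
                else:  # select
                    value = fn(value)
            if keep:
                out.append(value)
        return out

lf = ToyLazyFrame([1, 2, 3, 4, 5]).filter(lambda x: x > 2).select(lambda x: x * 10)
print(lf.collect())  # [30, 40, 50]
```

Real Polars goes much further (query optimization, parallelism, columnar memory), but the deferred-execution shape is the same one you'll see in scan_csv() plus .collect() below.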
Solution
Step 1: Install Dependencies
pip install polars anthropic --break-system-packages
Expected: Both packages install without errors. Polars 0.20+ required.
If it fails:
- Error "externally managed environment": add the --break-system-packages flag
- Polars won't install: ensure Python 3.8+ with python --version
Step 2: Get Your Anthropic API Key
Visit console.anthropic.com and create an API key. Export it:
export ANTHROPIC_API_KEY='your-key-here'
Why this works: The AI script uses Claude to understand context and handle edge cases that pattern matching can't catch.
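A quick preflight check saves a confusing API error later. A minimal stdlib-only sketch (the sk-ant- prefix matches the key format shown in Troubleshooting below):

```python
import os

def api_key_present() -> bool:
    """True if ANTHROPIC_API_KEY is exported and looks like an Anthropic key."""
    key = os.environ.get("ANTHROPIC_API_KEY", "")
    return key.startswith("sk-ant-")

if not api_key_present():
    print("Export ANTHROPIC_API_KEY before running the converter")
```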
Step 3: Create the AI Conversion Script
Save this as pandas_to_polars.py:
import anthropic
import os
import sys
def convert_pandas_to_polars(pandas_code: str) -> str:
    """Convert Pandas code to Polars using Claude API."""
    client = anthropic.Anthropic(api_key=os.environ.get("ANTHROPIC_API_KEY"))

    # System prompt defines conversion rules
    system_prompt = """You are a Python expert converting Pandas code to Polars.

Rules:
1. Use polars.scan_csv() for lazy loading (not read_csv)
2. Replace df.groupby().agg() with df.group_by().agg()
3. Use pl.col() for column references in expressions
4. Chain operations with .pipe() when needed
5. Replace inplace=True operations with reassignment
6. Use .collect() at the end to execute lazy operations
7. Keep imports minimal: import polars as pl

Preserve:
- Variable names
- Comments
- Logic flow
- Error handling

Return ONLY the converted Python code, no explanations."""

    message = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=4000,
        system=system_prompt,
        messages=[{
            "role": "user",
            "content": f"Convert this Pandas code to Polars:\n\n```python\n{pandas_code}\n```"
        }]
    )

    # Extract code from response
    response_text = message.content[0].text

    # Remove markdown code fences if present
    if "```python" in response_text:
        code = response_text.split("```python")[1].split("```")[0].strip()
    elif "```" in response_text:
        code = response_text.split("```")[1].split("```")[0].strip()
    else:
        code = response_text.strip()

    return code


def main():
    if len(sys.argv) != 2:
        print("Usage: python pandas_to_polars.py <input_file.py>")
        sys.exit(1)

    input_file = sys.argv[1]
    output_file = input_file.replace('.py', '_polars.py')

    # Read original code
    with open(input_file, 'r') as f:
        pandas_code = f.read()

    print(f"Converting {input_file}...")

    # Convert using AI
    polars_code = convert_pandas_to_polars(pandas_code)

    # Write converted code
    with open(output_file, 'w') as f:
        f.write(polars_code)

    print(f"✓ Converted code saved to {output_file}")
    print("✓ Review the output and test before using in production")


if __name__ == "__main__":
    main()
Why this approach works: Claude understands context like variable usage across functions, which regex-based tools miss. The system prompt encodes Polars best practices.
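The fence-stripping logic in the script is easy to unit-test in isolation, without an API call. A standalone version of the same algorithm (the FENCE constant is just a way to avoid writing literal triple backticks inside this example):

```python
FENCE = "`" * 3  # a Markdown code fence: three backticks

def strip_code_fences(text: str) -> str:
    """Extract the code body from a Markdown-fenced model response."""
    py_fence = FENCE + "python"
    if py_fence in text:
        return text.split(py_fence)[1].split(FENCE)[0].strip()
    if FENCE in text:
        return text.split(FENCE)[1].split(FENCE)[0].strip()
    return text.strip()

print(strip_code_fences(f"{FENCE}python\nx = 1\n{FENCE}"))  # x = 1
```

Running the helper against a few synthetic responses like this is a cheap sanity check before wiring it to real API output.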
Step 4: Test with Sample Code
Create a test file sample_pandas.py:
import pandas as pd
import numpy as np
# Load data
df = pd.read_csv('sales_data.csv')
# Clean data
df['revenue'] = df['quantity'] * df['price']
df = df[df['revenue'] > 0]
# Aggregate by category
summary = df.groupby('category').agg({
    'revenue': ['sum', 'mean'],
    'quantity': 'sum'
}).reset_index()
# Add percentage
summary['pct_of_total'] = summary[('revenue', 'sum')] / summary[('revenue', 'sum')].sum()
# Save results
summary.to_csv('summary.csv', index=False)
print(f"Processed {len(df)} rows")
Run the converter:
python pandas_to_polars.py sample_pandas.py
Expected: Creates sample_pandas_polars.py with converted code:
import polars as pl
# Load data lazily
df = pl.scan_csv('sales_data.csv')
# Clean data
df = df.with_columns([
    (pl.col('quantity') * pl.col('price')).alias('revenue')
]).filter(pl.col('revenue') > 0)
# Aggregate by category
summary = df.group_by('category').agg([
    pl.col('revenue').sum().alias('revenue_sum'),
    pl.col('revenue').mean().alias('revenue_mean'),
    pl.col('quantity').sum().alias('quantity_sum')
]).collect()
# Add percentage
total_revenue = summary['revenue_sum'].sum()
summary = summary.with_columns([
    (pl.col('revenue_sum') / total_revenue).alias('pct_of_total')
])
# Save results
summary.write_csv('summary.csv')
print(f"Processed {len(df.collect())} rows")
Step 5: Handle Common Edge Cases
The AI handles most cases, but verify these manually:
Date parsing:
# Pandas
df['date'] = pd.to_datetime(df['date'])
# Polars (check conversion)
df = df.with_columns(pl.col('date').str.strptime(pl.Date, '%Y-%m-%d'))
Chained indexing:
# Pandas
df.loc[df['age'] > 18, 'category'] = 'adult'
# Polars (AI converts to when-then)
df = df.with_columns(
    pl.when(pl.col('age') > 18)
    .then(pl.lit('adult'))
    .otherwise(pl.col('category'))
    .alias('category')
)
Inplace operations:
# Pandas
df.drop_duplicates(inplace=True)
# Polars (no inplace, always reassign)
df = df.unique()
If conversion looks wrong:
- Missing .collect(): Add it before operations that need results
- Wrong column syntax: replace df['col'] with pl.col('col') in expressions
- Type errors: check if dates/categoricals need explicit casting
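Before trusting a conversion, it can also help to scan the original file for Pandas idioms that commonly need a manual check. A minimal scanner along those lines (the pattern list here is an illustrative assumption; extend it for your codebase):

```python
import re

# Pandas idioms that often need a manual check after conversion
REVIEW_PATTERNS = {
    r"inplace\s*=\s*True": "inplace ops: rewrite as reassignment",
    r"pd\.to_datetime": "date parsing: check strptime format in Polars",
    r"\.loc\[": "label indexing: may become when/then in Polars",
    r"\.iloc\[": "positional indexing: no direct Polars equivalent",
}

def flag_for_review(source: str) -> list[str]:
    """Return human-readable warnings for risky Pandas patterns in source."""
    return [msg for pat, msg in REVIEW_PATTERNS.items() if re.search(pat, source)]

code = "df.drop_duplicates(inplace=True)\ndf['d'] = pd.to_datetime(df['d'])"
for warning in flag_for_review(code):
    print(warning)
```

Run it over each input file and eyeball the flagged spots in the converted output first.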
Verification
Create a test script verify_conversion.py:
import pandas as pd
import polars as pl
import numpy as np
# Generate test data
np.random.seed(42)
data = {
    'category': np.random.choice(['A', 'B', 'C'], 1000),
    'quantity': np.random.randint(1, 100, 1000),
    'price': np.random.uniform(10, 100, 1000)
}
# Test Pandas version
df_pd = pd.DataFrame(data)
df_pd['revenue'] = df_pd['quantity'] * df_pd['price']
result_pd = df_pd.groupby('category')['revenue'].sum().sort_index()
# Test Polars version
df_pl = pl.DataFrame(data)
df_pl = df_pl.with_columns([
    (pl.col('quantity') * pl.col('price')).alias('revenue')
])
result_pl = df_pl.group_by('category').agg(
    pl.col('revenue').sum()
).sort('category')
# Compare results
print("Pandas result:")
print(result_pd)
print("\nPolars result:")
print(result_pl)
# Check if values match (within floating point precision)
assert np.allclose(result_pd.values, result_pl['revenue'].to_numpy())
print("\n✓ Results match!")
Run it:
python verify_conversion.py
You should see: Both outputs match, confirming the conversion is correct.
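One pitfall in the comparison above: np.allclose matches values by position, so both results must be sorted on the same key first. Comparing by key is more robust. A stdlib-only sketch of the idea (the result values here are made-up stand-ins):

```python
import math

# Hypothetical aggregate results, keyed by category, in different orders
pandas_result = {"A": 1234.5, "B": 987.6, "C": 555.5}
polars_result = [("C", 555.5), ("A", 1234.5), ("B", 987.6)]

def results_match(expected: dict, actual: list, rel_tol: float = 1e-9) -> bool:
    """True if both results hold the same keys with matching values."""
    if set(expected) != {k for k, _ in actual}:
        return False
    return all(math.isclose(expected[k], v, rel_tol=rel_tol) for k, v in actual)

print(results_match(pandas_result, polars_result))  # True
```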
Performance Comparison
Benchmark your converted code:
import time
import numpy as np
import polars as pl
import pandas as pd
# Create test dataset
df_pd = pd.DataFrame({
    'id': range(1_000_000),
    'category': np.random.choice(['A', 'B', 'C', 'D'], 1_000_000),
    'value': np.random.randn(1_000_000)
})
# Pandas timing
start = time.time()
result_pd = df_pd.groupby('category').agg({'value': ['mean', 'std', 'count']})
pandas_time = time.time() - start
# Polars timing
df_pl = pl.from_pandas(df_pd)
start = time.time()
result_pl = df_pl.group_by('category').agg([
    pl.col('value').mean(),
    pl.col('value').std(),
    pl.col('value').count()
])
polars_time = time.time() - start
print(f"Pandas: {pandas_time:.3f}s")
print(f"Polars: {polars_time:.3f}s")
print(f"Speedup: {pandas_time/polars_time:.1f}x")
Expected: Polars runs 5-20x faster on this operation.
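A single time.time() measurement is noisy; timeit averages repeated runs for more stable numbers. The same pattern applied to a stand-in function (substitute your own groupby call for work):

```python
import timeit

def work():
    # Stand-in for the groupby above; replace with your own query
    return sum(i * i for i in range(10_000))

runs = 50
avg = timeit.timeit(work, number=runs) / runs
print(f"avg per run: {avg * 1e6:.1f} µs")
```

Time the Pandas and Polars versions the same way and compare the averages rather than single runs.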
What You Learned
- AI-powered conversion handles 80% of Pandas→Polars translation automatically
- Polars uses lazy evaluation via scan_csv() and .collect()
- Main differences: group_by() vs groupby(), pl.col() for expressions, no inplace operations
Limitations:
- Custom Pandas functions may need manual rewrite
- Some advanced indexing patterns don't translate directly
- Always verify results match before switching in production
When NOT to use Polars:
- Small datasets (<10K rows) where Pandas is fast enough
- Heavy use of Pandas-specific features like MultiIndex
- Team unfamiliar with Polars and no time to learn
Advanced: Batch Convert Entire Projects
For multiple files, extend the script:
import os
from pathlib import Path
def convert_project(src_dir: str, out_dir: str):
    """Convert all Python files in a directory."""
    Path(out_dir).mkdir(exist_ok=True)
    for root, dirs, files in os.walk(src_dir):
        for file in files:
            if file.endswith('.py'):
                input_path = os.path.join(root, file)
                rel_path = os.path.relpath(input_path, src_dir)
                output_path = os.path.join(out_dir, rel_path)
                # Ensure output directory exists
                os.makedirs(os.path.dirname(output_path), exist_ok=True)
                with open(input_path, 'r') as f:
                    pandas_code = f.read()
                # Skip files without pandas imports
                if 'import pandas' not in pandas_code:
                    continue
                print(f"Converting {rel_path}...")
                polars_code = convert_pandas_to_polars(pandas_code)
                with open(output_path, 'w') as f:
                    f.write(polars_code)
# Usage
convert_project('src/', 'src_polars/')
Warning: Always review AI-generated code before running in production. Test thoroughly.
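Batch runs can also hit API rate limits. The anthropic SDK retries some errors on its own, but a generic backoff wrapper around convert_pandas_to_polars is a cheap safeguard. A sketch (the exception to catch should be narrowed to your SDK version's rate-limit error; the flaky function below just simulates transient failures):

```python
import time

def with_retries(fn, attempts=3, base_delay=1.0):
    """Call fn(), retrying with exponential backoff on failure."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception as exc:  # narrow this to your SDK's rate-limit error
            if attempt == attempts - 1:
                raise
            delay = base_delay * (2 ** attempt)
            print(f"Attempt {attempt + 1} failed ({exc}); retrying in {delay:.2f}s")
            time.sleep(delay)

# Demo with a function that fails twice, then succeeds
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("simulated rate limit")
    return "ok"

print(with_retries(flaky, base_delay=0.01))  # ok
```

In the batch loop, replace the direct call with with_retries(lambda: convert_pandas_to_polars(pandas_code)).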
Troubleshooting
API key errors:
# Check if key is set
echo $ANTHROPIC_API_KEY
# If empty, re-export
export ANTHROPIC_API_KEY='sk-ant-...'
Import errors after conversion:
- Add import polars as pl at the top if missing
- Remove unused import pandas as pd
Performance worse than Pandas:
- Ensure you're using scan_csv() not read_csv()
- Add .collect() only when you need the actual results
- Use .lazy() to convert eager DataFrames to lazy
Type mismatches:
# Explicitly cast when needed
df = df.with_columns([
    pl.col('date').cast(pl.Date),
    pl.col('amount').cast(pl.Float64)
])
Tested on Polars 0.20.7, Python 3.11, Anthropic API 2026-02-01, macOS & Ubuntu