Problem: Pandas is Slowing Down Your Data Pipeline
Your data processing runs fine on small datasets but takes 10+ minutes on production data. Polars is often 5-50x faster, but manually rewriting hundreds of lines of Pandas code feels overwhelming.
You'll learn:
- How to auto-convert Pandas code to Polars using Claude API
- Which patterns translate directly and which need manual fixes
- How to verify your conversion works correctly
Time: 20 min | Level: Intermediate
Why This Works
Polars is written in Rust and uses lazy evaluation and multithreaded execution, making it significantly faster than Pandas for most operations. The APIs are similar enough that roughly 80% of Pandas code can be mechanically translated, but different enough that doing it manually is tedious and error-prone.
Common symptoms of needing this:
- Processing 1M+ rows takes minutes instead of seconds
- Memory usage spikes during groupby or merge operations
- Need to process data in parallel across CPU cores
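The lazy-evaluation idea behind much of that speedup can be sketched in pure Python. This is a toy model only (not how Polars is actually implemented): operations are recorded instead of executed, then fused into a single pass over the data when collect() is called.

```python
class ToyLazyFrame:
    """Toy model of lazy evaluation: record operations, run them on collect()."""

    def __init__(self, rows):
        self.rows = rows
        self.ops = []  # deferred (name, function) pairs

    def filter(self, predicate):
        self.ops.append(("filter", predicate))
        return self

    def select(self, transform):
        self.ops.append(("select", transform))
        return self

    def collect(self):
        # One pass over the data, with all recorded steps fused together
        out = []
        for row in self.rows:
            keep = True
            value = row
            for name, fn in self.ops:
                if name == "filter":
                    if not fn(value):
                        keep = False
                        break
                else:  # select
                    value = fn(value)
            if keep:
                out.append(value)
        return out

lf = ToyLazyFrame([1, 2, 3, 4, 5]).filter(lambda x: x > 2).select(lambda x: x * 10)
print(lf.collect())  # [30, 40, 50]
```

Real Polars goes much further (query optimization, parallelism, columnar memory), but the deferred-execution shape is the same one you'll see in scan_csv() plus .collect() below.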
Solution
Step 1: Install Dependencies
pip install polars anthropic --break-system-packages
Expected: Both packages install without errors. Polars 0.20+ required.
If it fails:
- Error "externally managed environment": add the --break-system-packages flag
- Polars won't install: ensure Python 3.8+ with python --version
Step 2: Get Your Anthropic API Key
Visit console.anthropic.com and create an API key. Export it:
export ANTHROPIC_API_KEY='your-key-here'
Why this works: The AI script uses Claude to understand context and handle edge cases that pattern matching can't catch.
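A quick preflight check saves a confusing API error later. A minimal stdlib-only sketch (the sk-ant- prefix matches the key format shown in Troubleshooting below):

```python
import os

def api_key_present() -> bool:
    """True if ANTHROPIC_API_KEY is exported and looks like an Anthropic key."""
    key = os.environ.get("ANTHROPIC_API_KEY", "")
    return key.startswith("sk-ant-")

if not api_key_present():
    print("Export ANTHROPIC_API_KEY before running the converter")
```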
Step 3: Create the AI Conversion Script
Save this as pandas_to_polars.py:
import anthropic
import os
import sys
def convert_pandas_to_polars(pandas_code: str) -> str:
    """Convert Pandas code to Polars using Claude API."""
    client = anthropic.Anthropic(api_key=os.environ.get("ANTHROPIC_API_KEY"))

    # System prompt defines conversion rules
    system_prompt = """You are a Python expert converting Pandas code to Polars.

Rules:
1. Use polars.scan_csv() for lazy loading (not read_csv)
2. Replace df.groupby().agg() with df.group_by().agg()
3. Use pl.col() for column references in expressions
4. Chain operations with .pipe() when needed
5. Replace inplace=True operations with reassignment
6. Use .collect() at the end to execute lazy operations
7. Keep imports minimal: import polars as pl

Preserve:
- Variable names
- Comments
- Logic flow
- Error handling

Return ONLY the converted Python code, no explanations."""

    message = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=4000,
        system=system_prompt,
        messages=[{
            "role": "user",
            "content": f"Convert this Pandas code to Polars:\n\n```python\n{pandas_code}\n```"
        }]
    )

    # Extract code from response
    response_text = message.content[0].text

    # Remove markdown code fences if present
    if "```python" in response_text:
        code = response_text.split("```python")[1].split("```")[0].strip()
    elif "```" in response_text:
        code = response_text.split("```")[1].split("```")[0].strip()
    else:
        code = response_text.strip()

    return code


def main():
    if len(sys.argv) != 2:
        print("Usage: python pandas_to_polars.py <input_file.py>")
        sys.exit(1)

    input_file = sys.argv[1]
    output_file = input_file.replace('.py', '_polars.py')

    # Read original code
    with open(input_file, 'r') as f:
        pandas_code = f.read()

    print(f"Converting {input_file}...")

    # Convert using AI
    polars_code = convert_pandas_to_polars(pandas_code)

    # Write converted code
    with open(output_file, 'w') as f:
        f.write(polars_code)

    print(f"✓ Converted code saved to {output_file}")
    print("✓ Review the output and test before using in production")


if __name__ == "__main__":
    main()
Why this approach works: Claude understands context like variable usage across functions, which regex-based tools miss. The system prompt encodes Polars best practices.
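The fence-stripping logic in the script is easy to unit-test in isolation, without an API call. A standalone version of the same algorithm (the FENCE constant is just a way to avoid writing literal triple backticks inside this example):

```python
FENCE = "`" * 3  # a Markdown code fence: three backticks

def strip_code_fences(text: str) -> str:
    """Extract the code body from a Markdown-fenced model response."""
    py_fence = FENCE + "python"
    if py_fence in text:
        return text.split(py_fence)[1].split(FENCE)[0].strip()
    if FENCE in text:
        return text.split(FENCE)[1].split(FENCE)[0].strip()
    return text.strip()

print(strip_code_fences(f"{FENCE}python\nx = 1\n{FENCE}"))  # x = 1
```

Running the helper against a few synthetic responses like this is a cheap sanity check before wiring it to real API output.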
Step 4: Test with Sample Code
Create a test file sample_pandas.py:
import pandas as pd
import numpy as np
# Load data
df = pd.read_csv('sales_data.csv')
# Clean data
df['revenue'] = df['quantity'] * df['price']
df = df[df['revenue'] > 0]
# Aggregate by category
summary = df.groupby('category').agg({
    'revenue': ['sum', 'mean'],
    'quantity': 'sum'
}).reset_index()
# Add percentage
summary['pct_of_total'] = summary[('revenue', 'sum')] / summary[('revenue', 'sum')].sum()
# Save results
summary.to_csv('summary.csv', index=False)
print(f"Processed {len(df)} rows")
Run the converter:
python pandas_to_polars.py sample_pandas.py
Expected: Creates sample_pandas_polars.py with converted code:
import polars as pl
# Load data lazily
df = pl.scan_csv('sales_data.csv')
# Clean data
df = df.with_columns([
    (pl.col('quantity') * pl.col('price')).alias('revenue')
]).filter(pl.col('revenue') > 0)
# Aggregate by category
summary = df.group_by('category').agg([
    pl.col('revenue').sum().alias('revenue_sum'),
    pl.col('revenue').mean().alias('revenue_mean'),
    pl.col('quantity').sum().alias('quantity_sum')
]).collect()
# Add percentage
total_revenue = summary['revenue_sum'].sum()
summary = summary.with_columns([
    (pl.col('revenue_sum') / total_revenue).alias('pct_of_total')
])
# Save results
summary.write_csv('summary.csv')
print(f"Processed {len(df.collect())} rows")
Step 5: Handle Common Edge Cases
The AI handles most cases, but verify these manually:
Date parsing:
# Pandas
df['date'] = pd.to_datetime(df['date'])
# Polars (check conversion)
df = df.with_columns(pl.col('date').str.strptime(pl.Date, '%Y-%m-%d'))
Chained indexing:
# Pandas
df.loc[df['age'] > 18, 'category'] = 'adult'
# Polars (AI converts to when-then)
df = df.with_columns(
    pl.when(pl.col('age') > 18)
    .then(pl.lit('adult'))
    .otherwise(pl.col('category'))
    .alias('category')
)
Inplace operations:
# Pandas
df.drop_duplicates(inplace=True)
# Polars (no inplace, always reassign)
df = df.unique()
If conversion looks wrong:
- Missing .collect(): Add it before operations that need results
- Wrong column syntax: replace df['col'] with pl.col('col') in expressions
- Type errors: check if dates/categoricals need explicit casting
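Before trusting a conversion, it can also help to scan the original file for Pandas idioms that commonly need a manual check. A minimal scanner along those lines (the pattern list here is an illustrative assumption; extend it for your codebase):

```python
import re

# Pandas idioms that often need a manual check after conversion
REVIEW_PATTERNS = {
    r"inplace\s*=\s*True": "inplace ops: rewrite as reassignment",
    r"pd\.to_datetime": "date parsing: check strptime format in Polars",
    r"\.loc\[": "label indexing: may become when/then in Polars",
    r"\.iloc\[": "positional indexing: no direct Polars equivalent",
}

def flag_for_review(source: str) -> list[str]:
    """Return human-readable warnings for risky Pandas patterns in source."""
    return [msg for pat, msg in REVIEW_PATTERNS.items() if re.search(pat, source)]

code = "df.drop_duplicates(inplace=True)\ndf['d'] = pd.to_datetime(df['d'])"
for warning in flag_for_review(code):
    print(warning)
```

Run it over each input file and eyeball the flagged spots in the converted output first.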
Verification
Create a test script verify_conversion.py:
import pandas as pd
import polars as pl
import numpy as np
# Generate test data
np.random.seed(42)
data = {
    'category': np.random.choice(['A', 'B', 'C'], 1000),
    'quantity': np.random.randint(1, 100, 1000),
    'price': np.random.uniform(10, 100, 1000)
}
# Test Pandas version
df_pd = pd.DataFrame(data)
df_pd['revenue'] = df_pd['quantity'] * df_pd['price']
result_pd = df_pd.groupby('category')['revenue'].sum().sort_index()
# Test Polars version
df_pl = pl.DataFrame(data)
df_pl = df_pl.with_columns([
    (pl.col('quantity') * pl.col('price')).alias('revenue')
])
result_pl = df_pl.group_by('category').agg(
    pl.col('revenue').sum()
).sort('category')
# Compare results
print("Pandas result:")
print(result_pd)
print("\nPolars result:")
print(result_pl)
# Check if values match (within floating point precision)
assert np.allclose(result_pd.values, result_pl['revenue'].to_numpy())
print("\n✓ Results match!")
Run it:
python verify_conversion.py
You should see: Both outputs match, confirming the conversion is correct.
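One pitfall in the comparison above: np.allclose matches values by position, so both results must be sorted on the same key first. Comparing by key is more robust. A stdlib-only sketch of the idea (the result values here are made-up stand-ins):

```python
import math

# Hypothetical aggregate results, keyed by category, in different orders
pandas_result = {"A": 1234.5, "B": 987.6, "C": 555.5}
polars_result = [("C", 555.5), ("A", 1234.5), ("B", 987.6)]

def results_match(expected: dict, actual: list, rel_tol: float = 1e-9) -> bool:
    """True if both results hold the same keys with matching values."""
    if set(expected) != {k for k, _ in actual}:
        return False
    return all(math.isclose(expected[k], v, rel_tol=rel_tol) for k, v in actual)

print(results_match(pandas_result, polars_result))  # True
```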
Performance Comparison
Benchmark your converted code:
import time
import numpy as np
import polars as pl
import pandas as pd
# Create test dataset
df_pd = pd.DataFrame({
    'id': range(1_000_000),
    'category': np.random.choice(['A', 'B', 'C', 'D'], 1_000_000),
    'value': np.random.randn(1_000_000)
})
# Pandas timing
start = time.time()
result_pd = df_pd.groupby('category').agg({'value': ['mean', 'std', 'count']})
pandas_time = time.time() - start
# Polars timing
df_pl = pl.from_pandas(df_pd)
start = time.time()
result_pl = df_pl.group_by('category').agg([
    pl.col('value').mean(),
    pl.col('value').std(),
    pl.col('value').count()
])
polars_time = time.time() - start
print(f"Pandas: {pandas_time:.3f}s")
print(f"Polars: {polars_time:.3f}s")
print(f"Speedup: {pandas_time/polars_time:.1f}x")
Expected: Polars runs 5-20x faster on this operation.
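A single time.time() measurement is noisy; timeit averages repeated runs for more stable numbers. The same pattern applied to a stand-in function (substitute your own groupby call for work):

```python
import timeit

def work():
    # Stand-in for the groupby above; replace with your own query
    return sum(i * i for i in range(10_000))

runs = 50
avg = timeit.timeit(work, number=runs) / runs
print(f"avg per run: {avg * 1e6:.1f} µs")
```

Time the Pandas and Polars versions the same way and compare the averages rather than single runs.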
What You Learned
- AI-powered conversion handles 80% of Pandas→Polars translation automatically
- Polars uses lazy evaluation via scan_csv() and .collect()
- Main differences: group_by() vs groupby(), pl.col() for expressions, no inplace operations
Limitations:
- Custom Pandas functions may need manual rewrite
- Some advanced indexing patterns don't translate directly
- Always verify results match before switching in production
When NOT to use Polars:
- Small datasets (<10K rows) where Pandas is fast enough
- Heavy use of Pandas-specific features like MultiIndex
- Team unfamiliar with Polars and no time to learn
Advanced: Batch Convert Entire Projects
For multiple files, extend the script:
import os
from pathlib import Path
def convert_project(src_dir: str, out_dir: str):
    """Convert all Python files in a directory."""
    Path(out_dir).mkdir(exist_ok=True)
    for root, dirs, files in os.walk(src_dir):
        for file in files:
            if file.endswith('.py'):
                input_path = os.path.join(root, file)
                rel_path = os.path.relpath(input_path, src_dir)
                output_path = os.path.join(out_dir, rel_path)
                # Ensure output directory exists
                os.makedirs(os.path.dirname(output_path), exist_ok=True)
                with open(input_path, 'r') as f:
                    pandas_code = f.read()
                # Skip files without pandas imports
                if 'import pandas' not in pandas_code:
                    continue
                print(f"Converting {rel_path}...")
                polars_code = convert_pandas_to_polars(pandas_code)
                with open(output_path, 'w') as f:
                    f.write(polars_code)
# Usage
convert_project('src/', 'src_polars/')
Warning: Always review AI-generated code before running in production. Test thoroughly.
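Batch runs can also hit API rate limits. The anthropic SDK retries some errors on its own, but a generic backoff wrapper around convert_pandas_to_polars is a cheap safeguard. A sketch (the exception to catch should be narrowed to your SDK version's rate-limit error; the flaky function below just simulates transient failures):

```python
import time

def with_retries(fn, attempts=3, base_delay=1.0):
    """Call fn(), retrying with exponential backoff on failure."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception as exc:  # narrow this to your SDK's rate-limit error
            if attempt == attempts - 1:
                raise
            delay = base_delay * (2 ** attempt)
            print(f"Attempt {attempt + 1} failed ({exc}); retrying in {delay:.2f}s")
            time.sleep(delay)

# Demo with a function that fails twice, then succeeds
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("simulated rate limit")
    return "ok"

print(with_retries(flaky, base_delay=0.01))  # ok
```

In the batch loop, replace the direct call with with_retries(lambda: convert_pandas_to_polars(pandas_code)).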
Troubleshooting
API key errors:
# Check if key is set
echo $ANTHROPIC_API_KEY
# If empty, re-export
export ANTHROPIC_API_KEY='sk-ant-...'
Import errors after conversion:
- Add import polars as pl at the top if missing
- Remove unused import pandas as pd
Performance worse than Pandas:
- Ensure you're using scan_csv() not read_csv()
- Add .collect() only when you need the actual results
- Use .lazy() to convert eager DataFrames to lazy
Type mismatches:
# Explicitly cast when needed
df = df.with_columns([
    pl.col('date').cast(pl.Date),
    pl.col('amount').cast(pl.Float64)
])
Tested on Polars 0.20.7, Python 3.11, Anthropic API 2026-02-01, macOS & Ubuntu