Decode IBM Mainframe Assembly with AI in 30 Minutes

Use Claude and Python to analyze BAL/HLASM code, extract business logic, and generate documentation for legacy mainframes.

Problem: Nobody Understands the Mainframe Code Anymore

Your bank runs critical systems on IBM z/OS mainframes, written in BAL (Basic Assembler Language) or HLASM (High Level Assembler). The original developers have retired, documentation is sparse, and you need to extract the business logic before migration deadlines hit.

You'll learn:

  • How to parse mainframe assembly with Python
  • How to use the Claude API to analyze instruction sequences
  • How to extract business rules from register operations
  • How to generate human-readable documentation

Time: 30 min | Level: Advanced


Why This Happens

Mainframe assembly from the 1970s-90s was optimized for hardware, not readability. Comments are minimal. Register usage follows conventions lost to time. Business logic is buried in bit manipulation and conditional branches.

Common symptoms:

  • Code with labels like L00P372A and no comments
  • Critical calculations in packed decimal (PACK/ZAP/AP)
  • Undocumented register conventions (e.g., R7 always holds the account balance)
  • Mixed assembly and macro expansions

Solution

Step 1: Extract Assembly Source

# Transfer from mainframe (z/OS) in ASCII mode so the server translates EBCDIC
ftp mainframe.company.com
> ascii
> get 'PROD.PAYROLL.ASM(MODULE01)' module01.asm

# Or use modern tools
zowe files download ds "PROD.PAYROLL.ASM(MODULE01)" -f module01.asm

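Expected: A text file of assembly source. Text-mode transfers are translated from EBCDIC during download; binary-mode transfers arrive as raw EBCDIC.

If it fails:

  • Binary garbage: The file is still EBCDIC. Re-transfer in text mode, or convert with iconv -f EBCDIC-US -t ASCII (use the code page your system actually uses, such as IBM037 or IBM1047, if your iconv supports it)
  • Access denied: You need READ access to the dataset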
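
A binary-mode download of a fixed-block dataset usually has no newlines at all, only back-to-back records. The following is a minimal sketch for that case, assuming 80-byte records and Python's cp037 codec (many z/OS systems use IBM-1047 instead, which differs in a handful of characters). The helper name decode_ebcdic_records is hypothetical.

```python
def decode_ebcdic_records(raw: bytes, lrecl: int = 80, codec: str = "cp037") -> str:
    """Decode fixed-length EBCDIC records into newline-separated text.

    Assumes fixed-block records of `lrecl` bytes and the cp037 code page;
    adjust both to match the dataset's actual attributes.
    """
    records = [raw[i:i + lrecl] for i in range(0, len(raw), lrecl)]
    # Each record is padded with EBCDIC blanks, so strip trailing spaces
    return "\n".join(rec.decode(codec).rstrip() for rec in records)


# Round-trip demonstration with two synthetic 80-byte records
sample_lines = ["PAYCALC  CSECT", "         BR    R14"]
sample = "".join(line.ljust(80) for line in sample_lines).encode("cp037")
decoded = decode_ebcdic_records(sample)
print(decoded)
```

If the decoded text shows odd brackets or carets, the code page assumption is the first thing to check.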

Step 2: Parse Assembly Structure

# parse_asm.py
import re
from dataclasses import dataclass
from typing import List, Dict

@dataclass
class Instruction:
    label: str
    opcode: str
    operands: str
    comment: str
    line_num: int

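# Fields are blank-delimited: name (optional, column 1), operation, operands, remarks.
# Limitation: a statement with remarks but no operands is ambiguous, so the
# first remark word is read as the operand field.
STATEMENT_RE = re.compile(
    r"^(?P<label>[.&]?[A-Za-z@#$_][\w@#$]*)?"  # name field must start in column 1
    r"\s+(?P<opcode>\S+)"                       # instruction, directive, or macro
    r"(?:\s+(?P<operands>(?:'[^']*'|\S)+))?"    # quoted strings may contain blanks
    r"(?:\s+(?P<comment>.*))?$"                 # whatever is left is a remark
)

def parse_hlasm(source: str) -> List[Instruction]:
    """Parse HLASM/BAL assembly into structured format"""
    instructions = []
    
    for line_num, line in enumerate(source.split('\n'), 1):
        # Skip blank lines and full-line comments ('*' or '.*' in column 1)
        if not line.strip() or line.startswith(('*', '.*')):
            continue
            
        # Columns 1-71 hold the statement, column 72 is the continuation
        # indicator, and 73-80 are optional sequence numbers.
        # Continuation lines are not joined here.
        match = STATEMENT_RE.match(line[:71].rstrip())
        
        if match:
            instructions.append(Instruction(
                label=match.group('label') or '',
                opcode=match.group('opcode'),
                operands=match.group('operands') or '',
                comment=(match.group('comment') or '').strip(),
                line_num=line_num
            ))
    
    return instructions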

# Test it
with open('module01.asm', 'r') as f:
    source = f.read()
    
parsed = parse_hlasm(source)
print(f"Found {len(parsed)} instructions")
print(f"First 3: {parsed[:3]}")

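Why this works: HLASM statements are semi-structured. Only the name field is tied to column 1; the operation, operands, and remarks are separated by blanks rather than fixed columns, so a flexible regex handles most source lines.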
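
The parser above treats every physical line as its own statement, so a long operand list that spills onto a continuation line is split in two. Here is a minimal pre-processing sketch, assuming the default HLASM layout (statement text in columns 1-71, a non-blank continuation indicator in column 72, continued text starting in column 16). The helper name join_continuations is hypothetical, and the alternative macro format, where each line can carry its own remarks, is not handled.

```python
from typing import List, Optional


def join_continuations(lines: List[str]) -> List[str]:
    """Merge HLASM continuation lines into single logical statements."""
    logical: List[str] = []
    pending: Optional[str] = None  # statement still waiting for its continuation

    for line in lines:
        padded = line.rstrip("\r\n").ljust(72)
        continued = padded[71] != " "  # column 72 is index 71
        # The first line contributes columns 1-71; continuations contribute 16-71
        text = padded[:71] if pending is None else padded[15:71]
        pending = text if pending is None else pending + text
        if not continued:
            logical.append(pending.rstrip())
            pending = None

    if pending is not None:  # source ended in the middle of a continuation
        logical.append(pending.rstrip())
    return logical


# Synthetic example: a character constant that runs through column 71
first = "MSG      DC    C'".ljust(71, "A") + "X"  # 'X' in column 72 marks continuation
second = " " * 15 + "BBB'"
joined = join_continuations([first, second, "         BR    R14"])
print(len(joined))
```

Running this over source.split('\n') before parsing keeps one Instruction per logical statement, at the cost of line numbers that point at the first physical line only.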


Step 3: Identify Code Blocks

def find_subroutines(instructions: List[Instruction]) -> Dict[str, List[Instruction]]:
    """Group instructions into blocks by CSECT and routine-entry labels (heuristic)"""
    subroutines: Dict[str, List[Instruction]] = {}
    current_routine = "MAIN"
    
    for inst in instructions:
        # A new block starts at a CSECT, or at a labeled STM/SAVE, the usual
        # first instruction of a routine that saves the caller's registers.
        # ENTRY only declares externally visible names, so it starts nothing.
        starts_block = inst.opcode == 'CSECT' or (
            inst.label and inst.opcode in ('STM', 'SAVE')
        )
        if starts_block:
            current_routine = inst.label or current_routine
        
        # setdefault keeps earlier instructions when a name recurs
        # (for example, a CSECT that is resumed later in the source)
        subroutines.setdefault(current_routine, []).append(inst)
        
    return subroutines

routines = find_subroutines(parsed)
print(f"Found subroutines: {list(routines.keys())}")

Expected: Dictionary mapping routine names to instruction lists.


Step 4: Use Claude API for Analysis

import anthropic
import json

def analyze_routine_with_claude(
    routine_name: str,
    instructions: List[Instruction]
) -> dict:
    """Send assembly block to Claude for analysis"""
    
    client = anthropic.Anthropic()
    
    # Format assembly for Claude
    asm_text = "\n".join([
        f"{i.label:8s} {i.opcode:6s} {i.operands:40s} * {i.comment}"
        for i in instructions[:50]  # Limit to first 50 lines
    ])
    
    prompt = f"""Analyze this IBM mainframe assembly (HLASM/BAL) subroutine:

ROUTINE: {routine_name}

{asm_text}

Provide analysis in JSON format:
{{
  "purpose": "High-level description of what this routine does",
  "inputs": ["List of expected inputs (registers, parameters)"],
  "outputs": ["What it returns or modifies"],
  "business_logic": ["Key business rules in plain English"],
  "risk_areas": ["Potential issues for modernization"]
}}

Focus on BUSINESS LOGIC, not low-level register mechanics."""

    message = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=2000,
        messages=[
            {"role": "user", "content": prompt}
        ]
    )
    
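    # Extract JSON from response; the model may wrap it in prose or code fences
    response_text = message.content[0].text
    json_match = re.search(r'\{.*\}', response_text, re.DOTALL)
    if not json_match:
        return {}
    try:
        return json.loads(json_match.group(0))
    except json.JSONDecodeError:
        # Truncated or malformed JSON: keep the raw text for manual review
        return {"purpose": "Unparsed response", "raw_response": response_text}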

# Analyze first routine
first_routine = list(routines.keys())[0]
analysis = analyze_routine_with_claude(first_routine, routines[first_routine])
print(json.dumps(analysis, indent=2))

Why this works: Claude has seen mainframe assembly in its training data and can recognize common HLASM idioms. Asking for JSON makes the response easy to parse, but it does not guarantee valid JSON, so parse defensively.

If it fails:

  • "Invalid JSON": Add error handling for partial responses
  • "Context too long": Split large routines into smaller chunks
  • Rate limit: Add time.sleep(1) between API calls
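
The first two fixes can be sketched without touching the API call itself. Both helpers below are hypothetical names, not part of the anthropic SDK: extract_first_json tolerates prose or code fences around the JSON, and chunk_with_overlap splits a long routine into windows instead of silently dropping everything past the first 50 lines. The window size and overlap are assumptions to tune.

```python
import json
from typing import Iterator, Sequence, TypeVar

T = TypeVar("T")


def extract_first_json(text: str) -> dict:
    """Return the first JSON object embedded in `text`, or {} if none parses."""
    decoder = json.JSONDecoder()
    start = text.find("{")
    while start != -1:
        try:
            obj, _ = decoder.raw_decode(text, start)
            if isinstance(obj, dict):
                return obj
        except json.JSONDecodeError:
            pass  # not valid JSON from this brace; try the next one
        start = text.find("{", start + 1)
    return {}


def chunk_with_overlap(
    items: Sequence[T], size: int = 50, overlap: int = 5
) -> Iterator[Sequence[T]]:
    """Yield windows of `size` items; consecutive windows share `overlap` items
    so code near a boundary is seen with some surrounding context."""
    if size <= overlap:
        raise ValueError("size must be larger than overlap")
    if not items:
        return
    step = size - overlap
    for start in range(0, max(len(items) - overlap, 1), step):
        yield items[start:start + size]


reply = 'Here is the analysis:\n```json\n{"purpose": "demo", "inputs": []}\n```'
print(extract_first_json(reply))
print([len(c) for c in chunk_with_overlap(list(range(120)))])
```

Each chunk would be analyzed separately and the results merged afterwards; how to merge overlapping findings is left open here.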

Step 5: Generate Documentation

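from datetime import datetime

def generate_markdown_docs(analyses: Dict[str, dict], output_file: str):
    """Create readable documentation from AI analysis"""
    
    # UTF-8 so the warning emoji below writes correctly on any platform
    with open(output_file, 'w', encoding='utf-8') as f: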
        f.write("# Mainframe Module Documentation\n\n")
        f.write(f"**Generated:** {datetime.now().strftime('%Y-%m-%d')}\n\n")
        f.write("---\n\n")
        
        for routine_name, analysis in analyses.items():
            f.write(f"## {routine_name}\n\n")
            f.write(f"**Purpose:** {analysis.get('purpose', 'Unknown')}\n\n")
            
            if analysis.get('inputs'):
                f.write("**Inputs:**\n")
                for inp in analysis['inputs']:
                    f.write(f"- {inp}\n")
                f.write("\n")
            
            if analysis.get('outputs'):
                f.write("**Outputs:**\n")
                for out in analysis['outputs']:
                    f.write(f"- {out}\n")
                f.write("\n")
            
            if analysis.get('business_logic'):
                f.write("**Business Logic:**\n")
                for rule in analysis['business_logic']:
                    f.write(f"- {rule}\n")
                f.write("\n")
            
            if analysis.get('risk_areas'):
                f.write("**⚠️ Migration Risks:**\n")
                for risk in analysis['risk_areas']:
                    f.write(f"- {risk}\n")
                f.write("\n")
            
            f.write("---\n\n")

# Analyze all routines
all_analyses = {}
for name, instructions in routines.items():
    print(f"Analyzing {name}...")
    all_analyses[name] = analyze_routine_with_claude(name, instructions)

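# Save the raw analyses so they can be validated and version-controlled
with open("analyses.json", "w", encoding="utf-8") as f:
    json.dump(all_analyses, f, indent=2)

generate_markdown_docs(all_analyses, "mainframe_docs.md")
print("Documentation generated: mainframe_docs.md")

Expected: A Markdown file with the business logic extracted from the assembly, plus analyses.json holding the raw results.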


Verification

# Check output
head -50 mainframe_docs.md

# Validate JSON structure
python3 -c "
import json
with open('analyses.json') as f:
    data = json.load(f)
    print(f'Analyzed {len(data)} routines')
"

You should see: Human-readable descriptions of assembly routines, not just instruction lists.


What You Learned

  • Decades-old mainframe assembly is regular enough to parse with a regex
  • LLMs can recognize common BAL/HLASM patterns from training data
  • Extracting business logic is a more tractable goal than full translation
  • JSON output enables automated documentation pipelines

Limitations:

  • Claude may misinterpret shop-specific macros, which are unlikely to appear in its training data
  • Packed decimal calculations need manual verification
  • Register conventions vary by shop - context is critical
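
One way to act on the custom-macro limitation is to list operation codes that are not recognized instructions or directives, so reviewers know which definitions to attach as context. This is a rough sketch: KNOWN_OPCODES is a deliberately small, illustrative subset rather than the full instruction set, and find_probable_macros is a hypothetical helper. IBM-supplied macros such as SAVE or RETURN would also be flagged unless added to the set.

```python
from collections import Counter
from typing import Iterable

# Illustrative subset only; extend it for real use
KNOWN_OPCODES = {
    # machine instructions
    "L", "ST", "LA", "LR", "LM", "STM", "MVC", "MVI", "CLC", "CLI",
    "B", "BR", "BC", "BCR", "BE", "BNE", "BH", "BL", "BAL", "BALR", "BAS", "BASR",
    "PACK", "UNPK", "ZAP", "AP", "SP", "MP", "DP", "CP",
    # assembler directives
    "CSECT", "DSECT", "USING", "DROP", "DS", "DC", "EQU", "LTORG", "END",
    "ENTRY", "EXTRN",
}


def find_probable_macros(opcodes: Iterable[str]) -> Counter:
    """Count operation codes that are not in KNOWN_OPCODES."""
    return Counter(op.upper() for op in opcodes if op.upper() not in KNOWN_OPCODES)


ops = ["STM", "GETACCT", "AP", "GETACCT", "BR", "CALCTAX"]
flagged = find_probable_macros(ops)
print(flagged.most_common())
```

With the parser from Step 2, the input would be (i.opcode for i in parsed). The most frequent entries are usually the macros worth documenting first.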

Advanced: Handling Packed Decimal

def explain_packed_decimal(instructions: List[Instruction]) -> str:
    """Find PACK/ZAP/AP sequences and explain business logic"""
    
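    # Conversion, arithmetic, comparison, and shift/round instructions on packed decimal
    decimal_opcodes = ('PACK', 'UNPK', 'ZAP', 'AP', 'SP', 'MP', 'DP', 'CP', 'SRP')
    packed_ops = [i for i in instructions if i.opcode in decimal_opcodes]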
    
    if not packed_ops:
        return "No packed decimal operations"
    
    # Send to Claude with specialized prompt
    client = anthropic.Anthropic()
    
    asm_text = "\n".join([f"{i.opcode} {i.operands}" for i in packed_ops])
    
    prompt = f"""These mainframe assembly instructions manipulate packed decimal numbers:

{asm_text}

Packed decimal is used for financial calculations. Explain what calculation is being performed in plain English, assuming this is accounting or payroll logic."""

    message = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=500,
        messages=[{"role": "user", "content": prompt}]
    )
    
    return message.content[0].text

# Use it
first_routine_insts = routines[first_routine]
decimal_explanation = explain_packed_decimal(first_routine_insts)
print(f"\nPacked decimal logic:\n{decimal_explanation}")

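Why packed decimal matters: Packed decimal stores base-10 digits exactly, two per byte with the sign in the final nibble, so amounts such as 0.10 never pick up binary floating-point rounding error. The target system has to reproduce that arithmetic, including its truncation and rounding behavior, to match the mainframe to the cent.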


Real-World Example Output

## PAYROLL_CALC

**Purpose:** Calculates gross pay with overtime premium for hourly employees

**Inputs:**
- R3: Employee record address
- R5: Address of hours worked (packed decimal)
- R7: Address of hourly rate (packed decimal)

**Outputs:**
- R8: Gross pay amount
- Updates employee record at offset +24 (gross pay field)

**Business Logic:**
- If hours > 40, apply 1.5x multiplier to excess hours
- Regular pay = min(hours, 40) × hourly_rate
- Overtime pay = max(0, hours - 40) × hourly_rate × 1.5
- Gross = regular + overtime

**⚠️ Migration Risks:**
- Uses packed decimal for precision - must preserve in target system
- Overtime threshold hardcoded as 40 - should be configurable
- No validation for negative hours or rates
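
Extracted rules like these are easiest for SMEs to validate as a small reference implementation that can be run against known pay records. A minimal sketch follows, using Python's decimal module to keep base-10 precision. The function name gross_pay, the two-decimal result, and the ROUND_HALF_UP rounding are assumptions; the original assembly may truncate instead, which is one of the things to confirm.

```python
from decimal import Decimal, ROUND_HALF_UP

OVERTIME_THRESHOLD = Decimal("40")    # hardcoded in the routine per the analysis
OVERTIME_MULTIPLIER = Decimal("1.5")


def gross_pay(hours: Decimal, rate: Decimal) -> Decimal:
    """Gross pay with an overtime premium on hours above the threshold."""
    regular = min(hours, OVERTIME_THRESHOLD) * rate
    overtime_hours = max(Decimal("0"), hours - OVERTIME_THRESHOLD)
    overtime = overtime_hours * rate * OVERTIME_MULTIPLIER
    # Rounding mode is an assumption; verify against the mainframe's behavior
    return (regular + overtime).quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


print(gross_pay(Decimal("45"), Decimal("20.00")))    # 40 regular + 5 overtime hours
print(gross_pay(Decimal("38.5"), Decimal("17.25")))  # no overtime
```

Comparing this output with actual mainframe results for a sample of employees is a quick way to catch a misread rule before it reaches the migration spec.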

Production Tips

  1. Batch processing: Process multiple modules overnight with rate limiting
  2. Version control: Track assembly source and analyses together
  3. Human review: AI explains logic; SMEs validate accuracy
  4. Incremental approach: Start with highest-risk modules
  5. Preserve context: Include JCL and copybooks for complete picture

Tools that help:

  • zowe CLI for mainframe access
  • pygments for syntax highlighting in docs
  • graphviz for call graph visualization
  • Git LFS for storing large assembly dumps
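
For the call graph, Graphviz only needs DOT text, which can be produced without any extra library. The sketch below is a simplification: it follows only BAL and BAS instructions whose second operand names a known routine, and ignores BALR/BASR calls through a register, which need data-flow analysis to resolve. The function call_graph_dot and the input shape (routine name mapped to opcode/operand pairs) are assumptions for illustration.

```python
from typing import Dict, List, Tuple


def call_graph_dot(routines: Dict[str, List[Tuple[str, str]]]) -> str:
    """Build DOT text with one edge per (caller, callee) pair."""
    edges = set()
    for caller, instructions in routines.items():
        for opcode, operands in instructions:
            if opcode in ("BAL", "BAS"):
                parts = operands.split(",", 1)
                # Only keep targets that are routines we actually parsed
                if len(parts) == 2 and parts[1] in routines:
                    edges.add((caller, parts[1]))
    lines = ["digraph calls {"]
    lines += [f'    "{a}" -> "{b}";' for a, b in sorted(edges)]
    lines.append("}")
    return "\n".join(lines)


demo = {
    "MAIN": [("BAL", "R14,CALCPAY"), ("BAL", "R14,CALCTAX"), ("BR", "R14")],
    "CALCPAY": [("AP", "GROSS,REGPAY"), ("BAL", "R14,CALCTAX")],
    "CALCTAX": [("MP", "TAX,RATE"), ("BR", "R14")],
}
dot_text = call_graph_dot(demo)
print(dot_text)
```

Saving the output to a file and running it through the Graphviz dot command (for example, dot -Tsvg calls.dot -o calls.svg) renders the graph.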

Tested with Claude Sonnet 4, Python 3.12, z/OS 2.5 assembly exports