IBM: Scalable and Efficient Large-Scale Log Analysis with LLMs

Takeaway

  • Core pattern: Drain clustering first, then LLM classification on representatives only - this is the right approach for LAPP too
  • The three label types are highly reusable for LAPP:
    • Golden Signal Classification (error / availability / latency / saturation / information)
    • Fault Category Prediction (application / network / I/O / etc.)
    • Named Entity Recognition (host, session ID, error code, etc.)
  • Label Broadcasting = the same “control plane / data plane” split as LILAC’s cache, just at a different granularity
  • Edge case (3.2%): template variables themselves carry diagnostic cues - LAPP needs to handle this
  • Engineering-oriented: runs on CPU (BERTOps, a small fine-tuned BERT rather than a GPT-class model), deployed across 70 IBM products for 15 months
  • Report generation - important for LAPP product design:
    • Summary Report: representative lines sorted by rarity (rarest = most important), filterable by label
    • Temporal Trend: golden signal counts over time; answers “when did it start breaking?”
    • Causal Graph: Granger causality on cluster time series; answers “how did the fault propagate?”
    • Diagnosis Report: only time windows containing faults, chronological, searchable by entity; answers “what exactly happened at that time?”
    • Workflow: Summary (overview) → Temporal Trend (locate time) → Causal Graph (understand causality) → Diagnosis Report (deep dive)
  • 425K lines → 74 representative lines (99.9% reduction) in a real case study
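The core pattern (cluster first, classify representatives only, broadcast labels) can be sketched as below. This is a heavily simplified stand-in for Drain (token-count bucketing plus position-wise similarity, no parse tree), and `classify_with_llm` is a hypothetical keyword stub where the real system would call BERTOps; names and thresholds are my assumptions, not the paper's.

```python
def similarity(a, b):
    """Position-wise token match ratio for equal-length token lists (Drain-style)."""
    same = sum(1 for x, y in zip(a, b) if x == y)
    return same / len(a)

def cluster_logs(lines, threshold=0.5):
    """Greedy simplified Drain: lines join a cluster with the same token
    count and enough matching tokens; differing positions become <*>."""
    clusters = []  # each: {"rep": template tokens, "lines": member lines}
    for line in lines:
        toks = line.split()
        for c in clusters:
            if len(c["rep"]) == len(toks) and similarity(c["rep"], toks) >= threshold:
                c["lines"].append(line)
                c["rep"] = [x if x == y else "<*>" for x, y in zip(c["rep"], toks)]
                break
        else:
            clusters.append({"rep": toks, "lines": [line]})
    return clusters

def classify_with_llm(template):
    """Hypothetical stub standing in for the BERTOps golden-signal classifier."""
    if "error" in template.lower():
        return "error"
    if "timeout" in template.lower():
        return "latency"
    return "information"

def label_broadcast(lines):
    """One classifier call per cluster; the label is broadcast to all members."""
    clusters = cluster_logs(lines)
    labeled = {}
    for c in clusters:
        label = classify_with_llm(" ".join(c["rep"]))
        for line in c["lines"]:
            labeled[line] = label
    return clusters, labeled
```

With 425K lines collapsing to ~74 clusters, the expensive model runs ~74 times instead of 425K - the same control-plane/data-plane split as LILAC's cache.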
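The Summary Report and Temporal Trend bullets reduce to a few lines each; the data shapes here (a `cluster_sizes` dict, `(timestamp, label)` event pairs) and the 60-second window are illustrative assumptions, not the paper's schema.

```python
from collections import Counter

def summary_report(cluster_sizes):
    """cluster_sizes: {representative_line: occurrence_count}.
    Rarest templates first, per the paper's heuristic that rare
    lines are the most diagnostically important."""
    return sorted(cluster_sizes.items(), key=lambda kv: kv[1])

def temporal_trend(events, window=60):
    """events: iterable of (epoch_seconds, golden_signal_label).
    Buckets label counts per time window, so a spike in 'error'
    counts answers 'when did it start breaking?'."""
    buckets = {}
    for ts, label in events:
        start = ts - ts % window
        buckets.setdefault(start, Counter())[label] += 1
    return buckets
```

Filtering the summary by label (e.g. only `error` clusters) then follows directly from the label-broadcast output.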
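For the Causal Graph, a proper Granger test (F-test on restricted vs. unrestricted autoregressions) needs a stats library; as a dependency-free sketch of the same idea, the lag-correlation proxy below draws an edge u → v when u's past per-window counts track v's present counts. The lag and threshold are arbitrary assumptions.

```python
def corr(a, b):
    """Pearson correlation of two equal-length sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a) ** 0.5
    vb = sum((y - mb) ** 2 for y in b) ** 0.5
    return cov / (va * vb) if va and vb else 0.0

def causal_edges(series, lag=1, threshold=0.8):
    """series: {cluster_id: [count per time window]}.
    Edge u -> v when u's counts at t-lag correlate strongly with
    v's counts at t - a crude stand-in for the Granger test."""
    edges = []
    for u, xu in series.items():
        for v, xv in series.items():
            if u != v and corr(xu[:-lag], xv[lag:]) >= threshold:
                edges.append((u, v))
    return edges
```

The resulting directed edges over cluster time series are what let the report answer “how did the fault propagate?”.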