Loghub 2.0: Large-scale Evaluation for Log Parsing

Takeaway

  • 14 datasets, avg 3.6M log lines each (vs Loghub-2k’s 2,000 lines per dataset) - far more realistic
  • Datasets: Hadoop, HDFS, OpenStack, Spark, Zookeeper, BGL, HPC, Thunderbird, Linux, Mac, Apache, OpenSSH, HealthApp, Proxifier
  • 15 parsers evaluated: AEL, Drain, IPLoM, LenMa, LFA, LogCluster, LogMine, Logram, LogSig, MoLFI, SHISO, SLCT, Spell, UniParser, LogPPT
  • Key finding: all parsers perform significantly worse on large-scale data vs 2k samples
  • Best traditional parsers on GA (grouping accuracy): Drain (0.85), AEL (0.81)
  • ML-based parsers (UniParser, LogPPT) dominate on PA (parsing accuracy): LogPPT 0.76, UniParser 0.68 vs Drain 0.47
  • Rare log events are hardest to parse correctly - critical for diagnosis but poorly handled by all parsers
  • New metric FGA (F1-score of Grouping Accuracy), computed at the template level, to handle imbalanced template-frequency distributions
  • Many parsers crash or time out on large datasets (marked as ------ in results)
  • Drain is the best balance of speed + accuracy among traditional parsers
  • Most important value for LAPP: the dataset itself. Use Loghub-2.0 for integration testing and benchmarking
  • Use Drain as the baseline; target beating LogPPT’s PA scores with an LLM-based approach
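
The grouping metrics above can be sketched concretely (a minimal implementation based on the standard definitions; function and variable names are my own, not from the paper):

```python
from collections import defaultdict

def grouping_metrics(predicted, truth):
    """predicted/truth: one template id per log line.
    GA  = fraction of lines whose predicted group exactly matches its
          ground-truth group (message-level, biased toward frequent templates).
    FGA = F1 over template groups (template-level, robust to imbalance)."""
    pred_groups, true_groups = defaultdict(set), defaultdict(set)
    for i, (p, t) in enumerate(zip(predicted, truth)):
        pred_groups[p].add(i)
        true_groups[t].add(i)

    true_sets = {frozenset(g) for g in true_groups.values()}
    # a predicted group is correct iff it contains exactly one
    # ground-truth group's lines, no more, no less
    correct = [g for g in pred_groups.values() if frozenset(g) in true_sets]

    ga = sum(len(g) for g in correct) / len(truth)
    precision = len(correct) / len(pred_groups)
    recall = len(correct) / len(true_groups)
    fga = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return ga, fga
```

GA is message-weighted, so a parser that nails a few huge template groups scores high even if it mangles every rare template; FGA counts each template group once, which is why the frequency imbalance motivates it.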

Parser survivability on full-scale datasets

Only 7 of 15 parsers completed all 14 large-scale datasets without crashing or timing out: Drain, IPLoM, LFA, LogCluster, LogSig, UniParser, LogPPT
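
The crash/timeout handling this implies can be sketched with a subprocess harness (my own sketch; the benchmark's actual runner and timeout values are not specified here):

```python
import multiprocessing as mp

def _worker(q, parse_fn, args):
    try:
        q.put(("ok", parse_fn(*args)))
    except Exception as e:
        q.put(("crash", repr(e)))

def run_with_timeout(parse_fn, args, timeout_s):
    """Run a parser in a child process; return ("ok", result),
    ("crash", error) or ("timeout", None) - the latter two mirror
    the ------ entries in the results tables."""
    q = mp.Queue()
    p = mp.Process(target=_worker, args=(q, parse_fn, args))
    p.start()
    p.join(timeout_s)
    if p.is_alive():
        p.terminate()  # hard-kill parsers that exceed the budget
        p.join()
        return ("timeout", None)
    return q.get() if not q.empty() else ("crash", None)
```

A subprocess (rather than a thread) matters here: a parser stuck in native code or eating all memory can be killed cleanly without taking the benchmark driver down with it.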

Parser ranking (combining efficiency, FGA, PA/FTA, stability)

1st: Drain

  • Highest average GA and FGA (paper conclusion)
  • Strongest grouping, stable template clustering, low variance at scale
  • Linear complexity, no GPU, online tree-based, streaming friendly
  • Best for: real-time log streams, high throughput, production use
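
A toy version of Drain's core loop - bucket by token count, merge on token similarity, wildcard the mismatches. The real parser adds a fixed-depth prefix tree over leading tokens (which is where the linear complexity comes from); this sketch omits it:

```python
def drain_like_parse(lines, sim_threshold=0.5):
    """Toy Drain: logs with the same token count share a bucket;
    a line joins the first template whose token-wise similarity
    clears the threshold, and differing positions become <*>."""
    buckets = {}      # token count -> list of templates (token lists)
    assignments = []  # (token count, template index) per input line
    for line in lines:
        tokens = line.split()
        templates = buckets.setdefault(len(tokens), [])
        for idx, tpl in enumerate(templates):
            same = sum(a == b for a, b in zip(tpl, tokens))
            if same / len(tokens) >= sim_threshold:
                # merge: wildcard every position that differs
                templates[idx] = [a if a == b else "<*>"
                                  for a, b in zip(tpl, tokens)]
                assignments.append((len(tokens), idx))
                break
        else:
            templates.append(tokens)  # no match: start a new template
            assignments.append((len(tokens), len(templates) - 1))
    return buckets, assignments
```

Because each line is matched and merged as it arrives, the approach is naturally online/streaming, which is what makes Drain a good fit for real-time pipelines.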

2nd: IPLoM (conservative/stable)

  • Also statistical, decent efficiency, stable grouping
  • Slightly weaker than Drain overall
  • Simple implementation, interpretable, few parameters
  • Best for: rule-heavy log systems, minimal tuning
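
IPLoM's first two partitioning steps can be sketched as follows (simplified; the real algorithm adds bijection-based partitioning and a template-extraction pass):

```python
from collections import defaultdict

def iplom_partition(lines):
    """Sketch of IPLoM steps 1-2: partition logs by token count,
    then split each partition on the token position with the fewest
    distinct values (the most constant-like column)."""
    by_len = defaultdict(list)
    for line in lines:
        tokens = line.split()
        by_len[len(tokens)].append(tokens)

    partitions = []
    for group in by_len.values():
        n = len(group[0])
        # position whose column has the fewest unique tokens
        pos = min(range(n), key=lambda i: len({t[i] for t in group}))
        by_value = defaultdict(list)
        for t in group:
            by_value[t[pos]].append(t)
        partitions.extend(by_value.values())
    return partitions
```

The whole pipeline is counting and grouping - no training, few parameters - which is why it stays interpretable and cheap at scale.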

3rd tier: LogCluster / LogSig / LFA

  • Can finish all datasets, statistical, decent efficiency
  • FGA lower than Drain, weak on rare templates and high-parameter templates
  • Best for: low accuracy requirements, batch processing, legacy migration

Semantic-based: UniParser / LogPPT (conditional recommendation)

  • Best PA/FTA scores (token-level accuracy): LogPPT 0.76, UniParser 0.68
  • But their GA/FGA (grouping) scores are lower than the best traditional parsers’
  • GPU required, high compute cost
  • Performance degrades significantly on Loghub-2.0 (paper explicitly states this)
  • Best for: when per-line template accuracy matters more than grouping
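
For contrast with the grouping metrics, PA is a straight message-level template match (sketch, my naming):

```python
def parsing_accuracy(pred_templates, true_templates):
    """PA: fraction of log lines whose predicted template string
    exactly equals the ground-truth template. Unlike GA, a line can
    be grouped correctly yet still fail PA if any token is
    mis-identified as constant vs <*> variable."""
    correct = sum(p == t for p, t in zip(pred_templates, true_templates))
    return correct / len(true_templates)
```

This token-level strictness is exactly where the semantic parsers earn their lead: LogPPT and UniParser recover constant-vs-variable boundaries that frequency heuristics miss.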