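GA (Grouping Accuracy): fraction of log messages whose predicted group contains exactly the same messages as their ground-truth group, i.e. do logs that share a template get grouped together correctly? Measures clustering quality only: a parser can produce the wrong template text but still group well.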
PA (Parsing Accuracy): fraction of log messages where the extracted template exactly matches the ground truth. Strict per-message correctness.
FTA (F1-score of Template Accuracy): F1 over unique templates, i.e. the harmonic mean of precision and recall over the template set. Measures how many ground-truth templates the parser discovers correctly.
NED (Normalized Edit Distance): edit distance between predicted and ground-truth templates, normalized to 0-1. Captures partial correctness when templates are close but not exact.
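A minimal Python sketch of the four metrics, to make the differences concrete. It assumes `pred` and `truth` are parallel lists holding one template string per log message. The function names are illustrative, the FTA here compares template strings only (benchmark implementations may also require the grouped messages to match), and NED is computed at character level as a distance, whereas some reports may present one minus this value so that higher is better.

```python
from collections import defaultdict


def grouping_accuracy(pred, truth):
    """Fraction of messages whose predicted group equals their true group."""
    pred_groups, true_groups = defaultdict(set), defaultdict(set)
    for i, (p, t) in enumerate(zip(pred, truth)):
        pred_groups[p].add(i)
        true_groups[t].add(i)
    correct = sum(pred_groups[p] == true_groups[t] for p, t in zip(pred, truth))
    return correct / len(truth)


def parsing_accuracy(pred, truth):
    """Fraction of messages whose template string matches exactly."""
    return sum(p == t for p, t in zip(pred, truth)) / len(truth)


def template_f1(pred, truth):
    """F1 over unique template strings (simplified: text match only)."""
    pred_set, true_set = set(pred), set(truth)
    hits = len(pred_set & true_set)
    if hits == 0:
        return 0.0
    precision = hits / len(pred_set)
    recall = hits / len(true_set)
    return 2 * precision * recall / (precision + recall)


def edit_distance(a, b):
    """Character-level Levenshtein distance, computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]


def mean_ned(pred, truth):
    """Mean per-message edit distance, each normalized by the longer string."""
    total = 0.0
    for p, t in zip(pred, truth):
        total += edit_distance(p, t) / max(len(p), len(t), 1)
    return total / len(truth)


# Toy data: the third message keeps a variable ("sda") in its template.
truth = ["Connected to <*>", "Connected to <*>", "Disk <*> full", "User <*> logged in"]
pred = ["Connected to <*>", "Connected to <*>", "Disk sda full", "User <*> logged in"]

print("GA ", grouping_accuracy(pred, truth))
print("PA ", parsing_accuracy(pred, truth))
print("FTA", template_f1(pred, truth))
print("NED", mean_ned(pred, truth))
```

On this toy data the grouping is perfect even though one template is wrong, so GA stays at 1.0 while PA drops to 0.75; NED stays small because the wrong template differs by only three characters.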
Only two LLM parsers clearly lead the pack: LogBatcher and LILAC.
For LAPP: ICL (in-context learning, i.e. few-shot prompting with strategically retrieved examples) is the sweet spot.
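A rough Python sketch of what few-shot prompting with retrieval could look like. Everything here is an assumption made for illustration: the token-level Jaccard similarity, the prompt wording, and the helper names `retrieve_demos` and `build_prompt` are hypothetical and are not taken from LAPP, LogBatcher, or LILAC.

```python
def jaccard(a, b):
    """Token-set Jaccard similarity between two log lines."""
    ta, tb = set(a.split()), set(b.split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def retrieve_demos(query, labeled, k=2):
    """Pick the k labeled (log, template) pairs most similar to the query."""
    ranked = sorted(labeled, key=lambda ex: jaccard(query, ex[0]), reverse=True)
    return ranked[:k]


def build_prompt(query, demos):
    """Assemble a few-shot prompt: instruction, demonstrations, then the query."""
    lines = ["Extract the log template; replace variable parts with <*>.", ""]
    for log, template in demos:
        lines += [f"Log: {log}", f"Template: {template}", ""]
    lines += [f"Log: {query}", "Template:"]
    return "\n".join(lines)


# Hypothetical labeled pool and query.
labeled = [
    ("Connected to 10.0.0.1", "Connected to <*>"),
    ("Disk sda full", "Disk <*> full"),
    ("User alice logged in", "User <*> logged in"),
]
query = "Connected to 10.0.0.7"
demos = retrieve_demos(query, labeled, k=2)
prompt = build_prompt(query, demos)
print(prompt)
```

The retrieval step is where the "strategic" part would live: swapping `jaccard` for an embedding-based or diversity-aware selector changes which demonstrations the model sees without touching the prompt builder.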