
LogAn Performance Optimization #16

Open
rh-rahulshetty wants to merge 3 commits into Log-Analyzer:main from rh-rahulshetty:main

Conversation

@rh-rahulshetty (Collaborator) commented Mar 10, 2026

Reduces end-to-end LogAn processing time by ~57% (413s → 178s) through targeted optimizations across preprocessing, template mining, diagnosis backtracking, and anomaly grouping — with no changes to output behavior.

Note: code generation assisted by Cursor (Claude 4.6)

Key Changes

Preprocessing (preprocessing.py)

  • Threaded parallel file I/O — Replace sequential file reads with ThreadPoolExecutor, and collect file stats (size, line count) during the initial read to eliminate a redundant second pass in compute_preprocessing_statistics.
  • Timestamp pattern reordering — Sample 200 log lines to profile which timestamp regex patterns actually match, then reorder the master pattern list so the most common formats are tried first (reduces average iterations per log line).
  • Fast-path timestamp parsing — Try datetime.strptime() first (~5–10x faster than dateutil.parse), falling back to dateutil only as a last resort.
  • Precompiled regex for character counting — Replace per-character isalpha()/isdigit() with precompiled regex sub() + len().
  • Eliminate per-row pd.Series construction — parallel_apply now returns plain tuples; a single DataFrame is built from the collected list, avoiding the overhead of constructing a pd.Series per row.
  • Vectorized log truncation — Replace parallel_apply row-by-row truncation with np.where + str[:upper_bound].
  • JSON detection fast-path — Quick lstrip()[0] == '{' prefix check avoids calling json.loads() on non-JSON lines.
  • pandarallel tuning — Use all available CPU cores (removed min(..., 4) cap) and disable progress bar to reduce synchronization overhead.
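
The threaded file I/O described above can be sketched as follows. This is a minimal illustration, not the actual preprocessing.py code; the helper names (read_with_stats, read_all) and return shape are hypothetical.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def read_with_stats(path):
    """Read one log file and collect its stats in the same pass."""
    with open(path, "r", encoding="utf-8", errors="replace") as f:
        lines = f.read().splitlines()
    # Recording size/line count here removes the need for a second
    # pass over the files in compute_preprocessing_statistics.
    return path, lines, os.path.getsize(path), len(lines)


def read_all(paths, max_workers=None):
    # Threads suit this workload: it is I/O-bound, and the GIL is
    # released while each file is being read from disk.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(read_with_stats, paths))
```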
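The fast-path timestamp parsing amounts to trying cheap fixed-format strptime calls before falling back to dateutil's heuristic parser. A sketch, with an illustrative (not actual) format list:

```python
from datetime import datetime

# Hypothetical format list; the real code would carry the formats
# observed in the log corpus.
FAST_FORMATS = ("%Y-%m-%d %H:%M:%S", "%d/%b/%Y:%H:%M:%S")


def parse_timestamp(text):
    # Fast path: strptime against known formats is roughly 5-10x
    # cheaper than dateutil's format inference.
    for fmt in FAST_FORMATS:
        try:
            return datetime.strptime(text, fmt)
        except ValueError:
            pass
    # Last resort: dateutil handles the unusual formats.
    from dateutil import parser as dateutil_parser
    return dateutil_parser.parse(text)
```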
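The precompiled-regex character counting can be sketched like this (function name and exact patterns are illustrative):

```python
import re

# Compiled once at module load instead of testing each character
# with isalpha()/isdigit() in a Python-level loop.
NON_ALPHA = re.compile(r"[^A-Za-z]")
NON_DIGIT = re.compile(r"[^0-9]")


def char_counts(line):
    # sub() strips unwanted characters in C, so len() of the
    # remainder counts the kept characters directly.
    return len(NON_ALPHA.sub("", line)), len(NON_DIGIT.sub("", line))
```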
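The vectorized truncation replaces a per-row apply with one np.where over the whole column. A sketch, assuming a hypothetical upper bound and column shape:

```python
import numpy as np
import pandas as pd

UPPER_BOUND = 10  # hypothetical truncation length


def truncate_logs(series, upper_bound=UPPER_BOUND):
    # .str[:n] slices every row at once; np.where leaves rows that
    # are already short enough untouched, all without a Python loop.
    return pd.Series(
        np.where(series.str.len() > upper_bound,
                 series.str[:upper_bound],
                 series),
        index=series.index,
    )
```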
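The JSON detection fast-path is a prefix check that skips json.loads() for the common non-JSON case. A minimal sketch:

```python
import json


def try_parse_json(line):
    # Fast path: most log lines are not JSON, and a one-character
    # prefix check is far cheaper than letting json.loads() raise.
    stripped = line.lstrip()
    if not stripped.startswith("{"):
        return None
    try:
        return json.loads(stripped)
    except json.JSONDecodeError:
        return None
```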

Diagnosis backtracking (core.py)

  • Vectorized golden-signal lookup — Replace progress_apply row-by-row backpropagation with a list-comprehension lookup + direct column assignment.
  • np.where for error_test_ids — Replace row-level apply(lambda) with a vectorized np.where conditional.
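
The np.where replacement for the row-level apply can be sketched as below. The column names are illustrative, not the actual LogAn schema:

```python
import numpy as np
import pandas as pd

# Hypothetical frame standing in for the diagnosis output.
df = pd.DataFrame({
    "prediction": [1, 0, 1, 0],
    "test_id": ["t1", "t2", "t3", "t4"],
})

# Before: df.apply(lambda r: r.test_id if r.prediction else "", axis=1)
# builds a pd.Series per row. After: one vectorized np.where call.
df["error_test_id"] = np.where(df["prediction"] == 1, df["test_id"], "")
```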

Anomaly grouping (anomaly.py)

  • Union-Find (disjoint-set) merge — Replace the recursive O(n² per pass) superset/subset merging with a single-pass union-find algorithm with path compression.
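
The union-find merge can be sketched as follows: one pass over the group pairs, with path compression in find(), instead of repeated recursive merge passes. The group contents are illustrative:

```python
def merge_subset_groups(groups):
    """Merge groups related by subset/superset in a single pass.
    `groups` is a list of sets of anomaly ids."""
    parent = list(range(len(groups)))

    def find(i):
        # Path compression: point visited nodes closer to the root.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    def union(i, j):
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[rj] = ri

    # One pass over pairs; union-find makes each merge near O(1),
    # so no repeated passes to reach a fixpoint are needed.
    for i in range(len(groups)):
        for j in range(i + 1, len(groups)):
            if groups[i] <= groups[j] or groups[j] <= groups[i]:
                union(i, j)

    merged = {}
    for i, g in enumerate(groups):
        merged.setdefault(find(i), set()).update(g)
    return list(merged.values())
```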

Template mining (run_drain.py)

  • Direct column iteration — Replace df.apply(lambda, axis=1) with a loop over .values, avoiding per-row pd.Series construction.
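
The direct column iteration looks roughly like this (the frame and formatting logic are illustrative):

```python
import pandas as pd

df = pd.DataFrame({"level": ["INFO", "WARN"], "msg": ["a b", "c d"]})

# Before: df.apply(lambda row: f"{row['level']}: {row['msg']}", axis=1)
# constructs a pd.Series for every row. Iterating the underlying
# ndarray via .values unpacks plain values instead.
results = [
    f"{level}: {msg}"
    for level, msg in df[["level", "msg"]].values
]
```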

Performance

Metric           Before       After        Improvement
End-to-end time  413,365 ms   177,893 ms   2.32x faster

Benchmarked on the same log corpus with identical hardware.

Signed-off-by: Rahul Shetty <rashetty@redhat.com>
