
LogAn Performance Optimization #16

Open
rh-rahulshetty wants to merge 3 commits into Log-Analyzer:main from rh-rahulshetty:main

Conversation

@rh-rahulshetty (Collaborator) commented Mar 10, 2026

Reduces end-to-end LogAn processing time by ~57% (413s → 178s) through targeted optimizations across preprocessing, template mining, diagnosis backtracking, and anomaly grouping — with no changes to output behavior.

Note: code generation assisted by Cursor (Claude 4.6)

Key Changes

Preprocessing (preprocessing.py)

  • Threaded parallel file I/O — Replace sequential file reads with ThreadPoolExecutor, and collect file stats (size, line count) during the initial read to eliminate a redundant second pass in compute_preprocessing_statistics.
  • Timestamp pattern reordering — Sample 200 log lines to profile which timestamp regex patterns actually match, then reorder the master pattern list so the most common formats are tried first (reduces average iterations per log line).
  • Fast-path timestamp parsing — Try datetime.strptime() first (~5–10x faster than dateutil.parse), falling back to dateutil only as a last resort.
  • Precompiled regex for character counting — Replace per-character isalpha()/isdigit() with precompiled regex sub() + len().
  • Eliminate per-row pd.Series construction — parallel_apply now returns plain tuples; a single DataFrame is built from the collected list, avoiding the overhead of constructing a pd.Series per row.
  • Vectorized log truncation — Replace parallel_apply row-by-row truncation with np.where + str[:upper_bound].
  • JSON detection fast-path — Quick lstrip()[0] == '{' prefix check avoids calling json.loads() on non-JSON lines.
  • pandarallel tuning — Use all available CPU cores (removed min(..., 4) cap) and disable progress bar to reduce synchronization overhead.
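
The threaded file I/O described above can be sketched as follows. This is a minimal illustration, not the actual preprocessing.py code; the helper names (read_with_stats, read_all) and return shape are hypothetical.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def read_with_stats(path):
    """Read one log file and collect its stats in the same pass."""
    with open(path, "r", encoding="utf-8", errors="replace") as f:
        lines = f.read().splitlines()
    # Recording size/line count here removes the need for a second
    # pass over the files in compute_preprocessing_statistics.
    return path, lines, os.path.getsize(path), len(lines)


def read_all(paths, max_workers=None):
    # Threads suit this workload: it is I/O-bound, and the GIL is
    # released while each file is being read from disk.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(read_with_stats, paths))
```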
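The fast-path timestamp parsing amounts to trying cheap fixed-format strptime calls before falling back to dateutil's heuristic parser. A sketch, with an illustrative (not actual) format list:

```python
from datetime import datetime

# Hypothetical format list; the real code would carry the formats
# observed in the log corpus.
FAST_FORMATS = ("%Y-%m-%d %H:%M:%S", "%d/%b/%Y:%H:%M:%S")


def parse_timestamp(text):
    # Fast path: strptime against known formats is roughly 5-10x
    # cheaper than dateutil's format inference.
    for fmt in FAST_FORMATS:
        try:
            return datetime.strptime(text, fmt)
        except ValueError:
            pass
    # Last resort: dateutil handles the unusual formats.
    from dateutil import parser as dateutil_parser
    return dateutil_parser.parse(text)
```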
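The precompiled-regex character counting can be sketched like this (function name and exact patterns are illustrative):

```python
import re

# Compiled once at module load instead of testing each character
# with isalpha()/isdigit() in a Python-level loop.
NON_ALPHA = re.compile(r"[^A-Za-z]")
NON_DIGIT = re.compile(r"[^0-9]")


def char_counts(line):
    # sub() strips unwanted characters in C, so len() of the
    # remainder counts the kept characters directly.
    return len(NON_ALPHA.sub("", line)), len(NON_DIGIT.sub("", line))
```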
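The vectorized truncation replaces a per-row apply with one np.where over the whole column. A sketch, assuming a hypothetical upper bound and column shape:

```python
import numpy as np
import pandas as pd

UPPER_BOUND = 10  # hypothetical truncation length


def truncate_logs(series, upper_bound=UPPER_BOUND):
    # .str[:n] slices every row at once; np.where leaves rows that
    # are already short enough untouched, all without a Python loop.
    return pd.Series(
        np.where(series.str.len() > upper_bound,
                 series.str[:upper_bound],
                 series),
        index=series.index,
    )
```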
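The JSON detection fast-path is a prefix check that skips json.loads() for the common non-JSON case. A minimal sketch:

```python
import json


def try_parse_json(line):
    # Fast path: most log lines are not JSON, and a one-character
    # prefix check is far cheaper than letting json.loads() raise.
    stripped = line.lstrip()
    if not stripped.startswith("{"):
        return None
    try:
        return json.loads(stripped)
    except json.JSONDecodeError:
        return None
```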

Diagnosis backtracking (core.py)

  • Vectorized golden-signal lookup — Replace progress_apply row-by-row backpropagation with a list-comprehension lookup + direct column assignment.
  • np.where for error_test_ids — Replace row-level apply(lambda) with a vectorized np.where conditional.
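
The np.where replacement for the row-level apply can be sketched as below. The column names are illustrative, not the actual LogAn schema:

```python
import numpy as np
import pandas as pd

# Hypothetical frame standing in for the diagnosis output.
df = pd.DataFrame({
    "prediction": [1, 0, 1, 0],
    "test_id": ["t1", "t2", "t3", "t4"],
})

# Before: df.apply(lambda r: r.test_id if r.prediction else "", axis=1)
# builds a pd.Series per row. After: one vectorized np.where call.
df["error_test_id"] = np.where(df["prediction"] == 1, df["test_id"], "")
```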

Anomaly grouping (anomaly.py)

  • Union-Find (disjoint-set) merge — Replace the recursive O(n² per pass) superset/subset merging with a single-pass union-find algorithm with path compression.
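
The union-find merge can be sketched as follows: one pass over the group pairs, with path compression in find(), instead of repeated recursive merge passes. The group contents are illustrative:

```python
def merge_subset_groups(groups):
    """Merge groups related by subset/superset in a single pass.
    `groups` is a list of sets of anomaly ids."""
    parent = list(range(len(groups)))

    def find(i):
        # Path compression: point visited nodes closer to the root.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    def union(i, j):
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[rj] = ri

    # One pass over pairs; union-find makes each merge near O(1),
    # so no repeated passes to reach a fixpoint are needed.
    for i in range(len(groups)):
        for j in range(i + 1, len(groups)):
            if groups[i] <= groups[j] or groups[j] <= groups[i]:
                union(i, j)

    merged = {}
    for i, g in enumerate(groups):
        merged.setdefault(find(i), set()).update(g)
    return list(merged.values())
```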

Template mining (run_drain.py)

  • Direct column iteration — Replace df.apply(lambda, axis=1) with a loop over .values, avoiding per-row pd.Series construction.
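
The direct column iteration looks roughly like this (the frame and formatting logic are illustrative):

```python
import pandas as pd

df = pd.DataFrame({"level": ["INFO", "WARN"], "msg": ["a b", "c d"]})

# Before: df.apply(lambda row: f"{row['level']}: {row['msg']}", axis=1)
# constructs a pd.Series for every row. Iterating the underlying
# ndarray via .values unpacks plain values instead.
results = [
    f"{level}: {msg}"
    for level, msg in df[["level", "msg"]].values
]
```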

Performance

Metric           Before       After        Improvement
End-to-end time  413,365 ms   177,893 ms   2.32x faster

Benchmarked on the same log corpus with identical hardware.

Signed-off-by: Rahul Shetty <rashetty@redhat.com>
