Skip to content

cRennert/split-miner-in-python

Repository files navigation

Split Miner in Python

An implementation of Split Miner v.1.0 and 2.0 in Python derived from "Split miner: automated discovery of accurate and simple business process models from event logs" and "Automated Discovery of Process Models with True Concurrency and Inclusive Choices"

The project covers the full pipeline end-to-end:

  1. Directly-follows graph + self/short-loop detection
  2. Concurrency discovery and pruned DFG
  3. Source-to-sink filtering
  4. Initial BPMN construction
  5. Splits discovery
  6. Joins discovery
  7. OR-join minimization
  8. SM 2.0 only: lifecycle-aware DFG, refined concurrency, improper completion fix, OR-split heuristic.

Requirements

  • Python 3.10+
  • The packages listed in requirements.txt:
  • Graphviz (only for PNG rendering through pm4py.save_vis_bpmn). Install it on Windows from https://graphviz.org/download/ and make sure dot.exe is on PATH.

Install packages in requirements.txt with:

pip install -r requirements.txt

Reproducing the paper's running example

python debug.py

Reads the 10-trace running example from the first version paper, runs the pipeline phase-by-phase, prints intermediate sizes, and saves paper_example.bpmn and paper_example.png.

Quick start

1. Discover a BPMN from an XES log (Split Miner 1.x)

python main.py path/to/log.xes.gz --out model.bpmn --png model.png

This produces model.bpmn (BPMN 2.0 XML, openable in any BPMN tool such as bpmn.io) and a PNG rendering at model.png.

2. Discover with Split Miner 2.0

If your log has lifecycle:transition attributes that distinguish start and complete/end events, SM 2.0 can detect true concurrency (truly overlapping activity lifecycles) and inclusive (OR) splits:

python main.py path/to/log.xes.gz --v2 --out model_v2.bpmn --png model_v2.png

On logs without lifecycle info the algorithm degrades gracefully to SM 1.x behaviour and prints a warning.

3. CLI options

Flag Meaning Default
log Path to a .xes / .xes.gz file required
--eps FLOAT Concurrency threshold ε ∈ [0, 1] 0.1
--eta FLOAT Filtering percentile η ∈ [0, 1] 0.4
--no-or-min Skip the OR-join minimisation phase off
--v2 Use Split Miner 2.0 algorithm off
--out PATH Output BPMN XML path model.bpmn
--png PATH Optional PNG rendering path none

ε controls when two activities are considered concurrent: lower ε is more permissive (more || relations), higher ε is stricter. η controls the filter — lower η retains more edges (higher fitness), higher η keeps the model simple (higher precision).


Programmatic usage

The CLI is a thin wrapper around two functions in main.py. You can call them directly:

from log_loader import load_traces, load_traces_v2
from main import run, run_v2
from bpmn_export import to_bpmn, save_bpmn_xml, save_bpmn_png

# ---- Split Miner 1.x ----------------------------------------------------
traces = load_traces("path/to/log.xes.gz")  # list[list[str]]
wg = run(traces, eps=0.1, eta=0.4, or_minimise=True)

# ---- Split Miner 2.0 ----------------------------------------------------
refined = load_traces_v2("path/to/log.xes.gz")  # list[list[(label, lc, ts)]]
wg = run_v2(refined, eps=0.1, eta=0.4, or_minimise=True)

# ---- Export -------------------------------------------------------------
bpmn = to_bpmn(wg)  # pm4py BPMN object
save_bpmn_xml(bpmn, "model.bpmn")  # writes BPMN 2.0 XML
save_bpmn_png(bpmn, "model.png")  # writes PNG (needs Graphviz)

wg is a WorkingGraph (see types_.py) — a mutable adjacency representation of the BPMN. You can inspect / post-process it before exporting.


Project layout

.
├── main.py              CLI + run() / run_v2() entry points
├── debug.py             Phase-by-phase runner on the paper's example
├── log_loader.py        XES -> trace list (with optional lifecycle info)
├── types_.py            DFG, Node, WorkingGraph data classes
├── dfg_builder.py       Phase 1 — DFG, self/short-loops, refined DFG
├── concurrency.py       Phase 2 — pruned DFG, SM 2.0 Eq.5 oracle
├── filtering.py         Phase 3 — Algorithm 1 (max-min-frequency BFS)
├── bpmn_init.py         Phase 4 — initial BPMN from filtered PDFG
├── splits.py            Phase 5 — Algorithms 5, 6, 7 (XOR / AND splits)
├── rpst.py              Lightweight RPST decomposition for joins
├── joins.py             Phase 6 — split-aware join insertion (Alg. 8)
├── or_minimize.py       Phase 7 — Algorithm 9 (OR-min)
├── heuristics.py        SM 2.0 — improper-completion + OR-split heuristics
├── bpmn_export.py       WorkingGraph -> pm4py BPMN object / XML / PNG

Algorithmic notes & caveats

  • Concurrency threshold for SM 1.x. The original paper writes the imbalance condition as < ε, but its own running example only matches Figure 3c with ≤ ε (the pair e||g has imbalance exactly 0.2 at the paper's reported ε = 0.2). The Java reference implementation also uses . The Python port follows the Java behaviour to keep the running example reproducible.

  • Joins discovery. The implementation follows Algorithm 8 of the paper but groups incoming edges of every join target by their nearest split-gateway ancestor, then recurses. This produces several "smaller" joins instead of one large multi-input join when the SESE structure has multiple nested fragments. Both shapes are semantically equivalent; the Java reference tends to emit larger n-ary joins, so its nodes/edges counts are slightly lower than this port's.

  • SM 2.0 with complete-only logs. Many real logs (Sepsis, BPIC12, …) record only lifecycle:transition = complete. In that case the loader synthesises a start event with the same timestamp so the SM 2.0 pipeline stays well-formed, but it cannot detect any overlaps and the output is identical to SM 1.x. The Java reference behaves the same way.


References

  1. Augusto, A., Conforti, R., Dumas, M. et al. Split miner: automated discovery of accurate and simple business process models from event logs. Knowl Inf Syst 59, 251–284 (2019). https://doi.org/10.1007/s10115-018-1214-x
  2. Augusto, A., Dumas, M., La Rosa, M. (2021). Automated Discovery of Process Models with True Concurrency and Inclusive Choices. In: Leemans, S., Leopold, H. (eds) Process Mining Workshops. ICPM 2020. Lecture Notes in Business Information Processing, vol 406. Springer, Cham. https://doi.org/10.1007/978-3-030-72693-5_4

License

This work is licensed under GPL-3.0.

About

An implementation of Split Miner v.1.0 and 2.0 in Python derived from "Split miner: automated discovery of accurate and simple business process models from event logs" and "Automated Discovery of Process Models with True Concurrency and Inclusive Choices"

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages