An implementation of Split Miner v.1.0 and 2.0 in Python derived from "Split miner: automated discovery of accurate and simple business process models from event logs" and "Automated Discovery of Process Models with True Concurrency and Inclusive Choices"
The project covers the full pipeline end-to-end:
- Directly-follows graph + self/short-loop detection
- Concurrency discovery and pruned DFG
- Source-to-sink filtering
- Initial BPMN construction
- Splits discovery
- Joins discovery
- OR-join minimization
- SM 2.0 only: lifecycle-aware DFG, refined concurrency, improper completion fix, OR-split heuristic.
- Python 3.10+
- The packages listed in
requirements.txt: - Graphviz (only for PNG rendering through
pm4py.save_vis_bpmn). Install it on Windows from https://graphviz.org/download/ and make suredot.exeis onPATH.
Install packages in requirements.txt with:
pip install -r requirements.txtpython debug.pyReads the 10-trace running example from the first version paper, runs the
pipeline phase-by-phase, prints intermediate sizes, and saves
paper_example.bpmn and paper_example.png.
python main.py path/to/log.xes.gz --out model.bpmn --png model.pngThis produces model.bpmn (BPMN 2.0 XML, openable in any BPMN tool such as
bpmn.io) and a PNG rendering at model.png.
If your log has lifecycle:transition attributes that distinguish start
and complete/end events, SM 2.0 can detect true concurrency (truly
overlapping activity lifecycles) and inclusive (OR) splits:
python main.py path/to/log.xes.gz --v2 --out model_v2.bpmn --png model_v2.pngOn logs without lifecycle info the algorithm degrades gracefully to SM 1.x behaviour and prints a warning.
| Flag | Meaning | Default |
|---|---|---|
log |
Path to a .xes / .xes.gz file |
required |
--eps FLOAT |
Concurrency threshold ε ∈ [0, 1] | 0.1 |
--eta FLOAT |
Filtering percentile η ∈ [0, 1] | 0.4 |
--no-or-min |
Skip the OR-join minimisation phase | off |
--v2 |
Use Split Miner 2.0 algorithm | off |
--out PATH |
Output BPMN XML path | model.bpmn |
--png PATH |
Optional PNG rendering path | none |
ε controls when two activities are considered concurrent: lower ε is more
permissive (more || relations), higher ε is stricter. η controls the
filter — lower η retains more edges (higher fitness), higher η keeps the
model simple (higher precision).
The CLI is a thin wrapper around two functions in main.py. You can call
them directly:
from log_loader import load_traces, load_traces_v2
from main import run, run_v2
from bpmn_export import to_bpmn, save_bpmn_xml, save_bpmn_png
# ---- Split Miner 1.x ----------------------------------------------------
traces = load_traces("path/to/log.xes.gz") # list[list[str]]
wg = run(traces, eps=0.1, eta=0.4, or_minimise=True)
# ---- Split Miner 2.0 ----------------------------------------------------
refined = load_traces_v2("path/to/log.xes.gz") # list[list[(label, lc, ts)]]
wg = run_v2(refined, eps=0.1, eta=0.4, or_minimise=True)
# ---- Export -------------------------------------------------------------
bpmn = to_bpmn(wg) # pm4py BPMN object
save_bpmn_xml(bpmn, "model.bpmn") # writes BPMN 2.0 XML
save_bpmn_png(bpmn, "model.png") # writes PNG (needs Graphviz)wg is a WorkingGraph (see types_.py) — a mutable adjacency
representation of the BPMN. You can inspect / post-process it before
exporting.
.
├── main.py CLI + run() / run_v2() entry points
├── debug.py Phase-by-phase runner on the paper's example
├── log_loader.py XES -> trace list (with optional lifecycle info)
├── types_.py DFG, Node, WorkingGraph data classes
├── dfg_builder.py Phase 1 — DFG, self/short-loops, refined DFG
├── concurrency.py Phase 2 — pruned DFG, SM 2.0 Eq.5 oracle
├── filtering.py Phase 3 — Algorithm 1 (max-min-frequency BFS)
├── bpmn_init.py Phase 4 — initial BPMN from filtered PDFG
├── splits.py Phase 5 — Algorithms 5, 6, 7 (XOR / AND splits)
├── rpst.py Lightweight RPST decomposition for joins
├── joins.py Phase 6 — split-aware join insertion (Alg. 8)
├── or_minimize.py Phase 7 — Algorithm 9 (OR-min)
├── heuristics.py SM 2.0 — improper-completion + OR-split heuristics
├── bpmn_export.py WorkingGraph -> pm4py BPMN object / XML / PNG
-
Concurrency threshold for SM 1.x. The original paper writes the imbalance condition as
< ε, but its own running example only matches Figure 3c with≤ ε(the paire||ghas imbalance exactly 0.2 at the paper's reportedε = 0.2). The Java reference implementation also uses≤. The Python port follows the Java behaviour to keep the running example reproducible. -
Joins discovery. The implementation follows Algorithm 8 of the paper but groups incoming edges of every join target by their nearest split-gateway ancestor, then recurses. This produces several "smaller" joins instead of one large multi-input join when the SESE structure has multiple nested fragments. Both shapes are semantically equivalent; the Java reference tends to emit larger n-ary joins, so its
nodes/edgescounts are slightly lower than this port's. -
SM 2.0 with
complete-only logs. Many real logs (Sepsis, BPIC12, …) record onlylifecycle:transition = complete. In that case the loader synthesises astartevent with the same timestamp so the SM 2.0 pipeline stays well-formed, but it cannot detect any overlaps and the output is identical to SM 1.x. The Java reference behaves the same way.
- Augusto, A., Conforti, R., Dumas, M. et al. Split miner: automated discovery of accurate and simple business process models from event logs. Knowl Inf Syst 59, 251–284 (2019). https://doi.org/10.1007/s10115-018-1214-x
- Augusto, A., Dumas, M., La Rosa, M. (2021). Automated Discovery of Process Models with True Concurrency and Inclusive Choices. In: Leemans, S., Leopold, H. (eds) Process Mining Workshops. ICPM 2020. Lecture Notes in Business Information Processing, vol 406. Springer, Cham. https://doi.org/10.1007/978-3-030-72693-5_4
This work is licensed under GPL-3.0.