diff --git a/.gitignore b/.gitignore
index d0d3515..201888b 100644
--- a/.gitignore
+++ b/.gitignore
@@ -23,6 +23,9 @@ benchpark/
 # BenchPark workspace (generated during CI)
 benchpark-workspace/
 
+# Estimator tool checkouts prepared during CI/local smoke tests
+.benchkit_estimation_tools/
+
 # Dev mode data and config (NEVER commit)
 result_server/_dev_data/
 result_server/config/allowed_emails.json
diff --git a/.gitlab-ci.yml b/.gitlab-ci.yml
index a8b5a43..20a9cde 100644
--- a/.gitlab-ci.yml
+++ b/.gitlab-ci.yml
@@ -30,6 +30,17 @@ variables:
   estimate_result_uuid: ""
   reestimation_reason: ""
   reestimation_trigger: ""
+  # Temporary bring-up switches for the GPU estimation integration.
+  # Remove or replace them once the real estimator runner/package flow is finalized.
+  BK_QWS_GPU_MLP_SMOKE: "true"
+  BK_QWS_GPU_MLP_SMOKE_MODE: "perftools"
+  BK_ESTIMATE_RUNNER_TAG: "fncx-estimate-python"
+  BK_GPU_MLP_PERFTOOLS_REPO: "https://github.com/masaaki-kondo/PerfTools.git"
+  BK_GPU_MLP_PERFTOOLS_REF: "main"
+  BK_GENESIS_GPU_MLP_PROFILE: "true"
+  BK_GPU_MLP_NCU_LAUNCH_COUNT: "20"
+  BK_GPU_MLP_SOURCE_GPU: "H100"
+  BK_GPU_MLP_KERNEL_COUNT: "20"
 
 # Extract system and code filters from API variables or commit message
 .filters: &filters
diff --git a/docs/cx/REESTIMATION_SPEC.md b/docs/cx/REESTIMATION_SPEC.md
index b519b92..d2f5292 100644
--- a/docs/cx/REESTIMATION_SPEC.md
+++ b/docs/cx/REESTIMATION_SPEC.md
@@ -199,7 +199,7 @@ However, when those artifacts do not exist on the server:
 
 - estimate result UUID で estimate JSON を返す取得 API
 - estimate JSON に記録された `source_result_uuid` に基づいて Result JSON を返す取得 API
-- `source_result_uuid` に対応する estimation input artifact を返す取得 API
+- `source_result_uuid` に対応する estimation artifact bundle を返す取得 API
 
 現時点では、再推定の shell フローと取得口自体は実装済みである。
 一方で、取得 API の公開方針と認証条件、ならびに compare UI や portal からの再推定起動導線は文書としてまだ十分に整理されていない。
@@ -216,7 +216,18 @@ Re-estimation from `estimate_result_uuid` requires retrieval paths for:
 
 - Estimate JSON by estimate-result UUID
 - Result JSON through the resolved `source_result_uuid`
-- estimation-input artifacts associated with that source result
+- estimation artifact bundles associated with that source result
+
+The canonical artifact endpoints are:
+
+- `POST /api/ingest/estimation-artifacts`
+- `GET /api/query/estimation-artifacts?uuid=<source_result_uuid>`
+
+The older `estimation-inputs` endpoints remain as compatibility aliases during
+the transition, but new clients should use `estimation-artifacts`. The artifact
+bundle may contain prepared estimation inputs, prediction outputs, and logs; it
+must not be used to duplicate large profiler archives such as PA Data or
+`*.ncu-rep`.
 
 At present, the shell-side re-estimation flow and these retrieval endpoints exist, but the exposure rules, authentication conditions, and portal-facing documentation are not yet fixed clearly enough in the documents.
 
@@ -296,23 +307,23 @@ Re-estimation in BenchKit should preferably satisfy at least:
 4. different estimation methods can coexist for the same benchmark result
 5. `weakscaling`-based minimum estimation and detailed estimation can be compared along the same comparison axis
 6. insufficient inputs can be reported explicitly as not applicable, fallback, or re-measurement required
-7. detailed re-estimation should be able to restore estimation-input artifacts associated with the source result
+7. detailed re-estimation should be able to restore estimation artifacts associated with the source result
 
-## 8.1 estimation input artifact の復元 / Restoration of Estimation Input Artifacts
+## 8.1 estimation artifact の復元 / Restoration of Estimation Artifacts
 
 当面の再推定では、artifact 復元を次の流れで扱う。
 
 1. `estimate_result_uuid` から開始し、estimate JSON から `source_result_uuid` を解決する
 2. source result JSON を取得する
-3. `received_estimation_inputs/<result-stem>/` が存在する場合は、その内容を `results/estimation_inputs/` に復元する
-4. server 側に estimation input artifact が無い場合は、artifact 不要な推定のみを許可し、必要な推定は `not_applicable` とする
+3. `received_estimation_artifacts/<result-stem>/` が存在する場合は、その内容を `results/estimation_artifacts/` に復元する
+4. server 側に estimation artifact が無い場合は、artifact 不要な推定のみを許可し、必要な推定は `not_applicable` とする
 
 Current restoration should follow this flow:
 
 1. if starting from `estimate_result_uuid`, resolve `source_result_uuid` from the estimate JSON
 2. fetch the source result JSON
-3. if `received_estimation_inputs/<result-stem>/` exists, restore its contents into `results/estimation_inputs/`
-4. if no stored estimation inputs exist, allow only methods that do not require them; otherwise terminate as `not_applicable`
+3. if `received_estimation_artifacts/<result-stem>/` exists, restore its contents into `results/estimation_artifacts/`
+4. if no stored estimation artifacts exist, allow only methods that do not require them; otherwise terminate as `not_applicable`
 
 ## 9. 次の実装候補 / Next Implementation Candidates
 
@@ -337,7 +348,7 @@ Candidate next steps include:
 - 利用者向け入口として `estimate_result_uuid` を使える
 - `estimate_result_uuid` から stored estimate JSON を取得し、そこから `source_result_uuid` を解決できる
 - source result JSON を結果サーバから再取得できる
-- `received_estimation_inputs/<result-stem>/` から detailed estimation input artifact を復元できる
+- `received_estimation_artifacts/<result-stem>/` から detailed estimation artifact bundle を復元できる
 - 復元した artifact を使って detailed re-estimation を実行できる
 - 保存済み estimate JSON に `reestimation` ブロックを持てる
 - `reestimation` の既定値として `scope=both` と `baseline_policy=reuse-recorded-baseline` を持てる
diff --git a/docs/deploy/hardening-guide.md b/docs/deploy/hardening-guide.md
index 87a8dad..2110bbd 100644
--- a/docs/deploy/hardening-guide.md
+++ b/docs/deploy/hardening-guide.md
@@ -10,13 +10,13 @@ The portal enforces an application-level request body limit:
 RESULT_SERVER_MAX_UPLOAD_MB=512
 ```
 
-Large estimation input archives are also checked per member:
+Large estimation artifact archives are also checked per member:
 
 ```text
 RESULT_SERVER_MAX_ARCHIVE_MEMBER_MB=1024
 ```
 
-Set these values to match the largest expected PA Data or estimation input
+Set these values to match the largest expected PA Data or estimation artifact
 archive. Keep the reverse proxy body limit at or below the Flask limit so that
 oversized uploads are rejected before they consume worker memory.
 
diff --git a/docs/guides/add-estimation-package.md b/docs/guides/add-estimation-package.md
index 9d1a6b5..4b81662 100644
--- a/docs/guides/add-estimation-package.md
+++ b/docs/guides/add-estimation-package.md
@@ -43,6 +43,7 @@
   - `counter_papi_detailed.sh`
   - `trace_mpi_basic.sh`
   - `overlap_max_basic.sh`
+  - `gpu_kernel_mlp_v15.sh`
 
 ## 3. top-level package の責務
 
@@ -67,6 +68,55 @@ section package はもっと小さくてかまいません。
 
 ここでは「1 区間の変換規則」に集中し、Estimate JSON 全体の組み立てや current / future の side 管理は BenchKit 共通層や top-level package 側へ寄せる方が自然です。
 
+External estimation tools that operate per GPU kernel are normally handled as section packages.
+For example, `gpu_kernel_mlp_v15` connects PerfTools `MLP_NN/v1.5` as a package that converts only the GPU section.
+Keep the top-level package as `instrumented_app_sections_dummy` or similar, and assign `gpu_kernel_mlp_v15` only to the GPU section.
+
+```bash
+bk_declare_section --side future gpu_kernel_region gpu_kernel_mlp_v15
+bk_emit_declared_section --side future gpu_kernel_region "$measured_gpu_time" results/estimation_artifacts/gpu_kernel_region_input.csv
+```
+
+PerfTools itself is not vendored into BenchKit; pass it at run time through the following environment variables.
+
+```bash
+export BK_GPU_MLP_PERFTOOLS_ROOT=/path/to/PerfTools
+export BK_GPU_MLP_PYTHON=python3
+```
+
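+The section artifact is expected to be a prepared CSV built from the static GPU spec sheet on the PerfTools side.
+BenchKit does not collect GPU specs dynamically at run time.
+For testing and debugging, you can use a prediction CSV that has already been generated.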
+
+```bash
+export BK_GPU_MLP_ARTIFACT_MODE=prediction
+# or section-specific override:
+export BK_GPU_MLP_PREDICTION_CSV_GPU_KERNEL_REGION=/path/to/pred.csv
+```
+
+The section package sums the `Execution Time [ns]` column of the prediction CSV and uses the total as the future-side `time` of that section.
+
+To check only the CI plumbing with qws, you can enable the GPU MLP smoke test even though qws itself has not been ported to GPU.
+With `BK_QWS_GPU_MLP_SMOKE_MODE=prediction`, the run job uses the bundled sample prediction CSV and embeds the `gpu_kernel_region` section and the prediction CSV artifact in the result.
+With `BK_QWS_GPU_MLP_SMOKE_MODE=perftools`, the estimate job checks out the PerfTools repo and passes `MLP_NN/examples/example_input_mixed-src_20kernels.csv` to `predict_v15.py` to generate the prediction CSV.
+In both modes, the test confirms that the estimate job can convert the section into Estimate JSON through the `gpu_kernel_mlp_v15` section package.
+The smoke test is disabled by default in the qws estimation script itself, but the GitLab CI default is temporarily set to enabled during the bring-up of the GPU estimator integration.
+
+```bash
+export BK_QWS_GPU_MLP_SMOKE=true
+export BK_QWS_GPU_MLP_SMOKE_MODE=perftools
+export BK_ESTIMATE_RUNNER_TAG=<python-and-jq-estimator-runner-tag>
+export BK_GPU_MLP_PERFTOOLS_REPO=https://github.com/masaaki-kondo/PerfTools.git
+export BK_GPU_MLP_PERFTOOLS_REF=main
+```
+
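+These variables are temporary switches for the bring-up period of the GPU estimator integration only.
+`BK_QWS_GPU_MLP_SMOKE` enables the qws-based plumbing check, `BK_QWS_GPU_MLP_SMOKE_MODE` switches between ingesting the prediction fixture and running PerfTools, and `BK_ESTIMATE_RUNNER_TAG` manually redirects the estimate job to a different runner/container.
+Once the operation of real GPU profiling inputs and the estimation runner is settled, replace them with dedicated package/runner settings and review these temporary variables for removal.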
+
+Because `perftools` smoke mode fetches PerfTools from GitHub, the estimation runner/container needs `git`, outbound network access, Python 3.12 or later, and numpy/pandas/torch.
+In production, do not use smoke mode; provide a PerfTools checkout on the estimation runner/container and pass a prepared input CSV derived from the real application as the section artifact.
+
 ## 5. metadata に持たせるもの
 
 現在の実装では、package metadata がかなり重要です。
diff --git a/docs/guides/add-estimation-to-app.md b/docs/guides/add-estimation-to-app.md
index 387bbd1..e71d43a 100644
--- a/docs/guides/add-estimation-to-app.md
+++ b/docs/guides/add-estimation-to-app.md
@@ -192,13 +192,13 @@ bk_emit_declared_section \
 bk_emit_declared_section \
   --side future \
   compute_solver 1.03 \
-  results/estimation_inputs/compute_solver_papi.tgz \
+  results/estimation_artifacts/compute_solver_papi.tgz \
   >> results/result
 
 bk_emit_declared_overlap \
   --side future \
   compute_hopping,halo_exchange 0.23 \
-  results/estimation_inputs/compute_halo_overlap.json \
+  results/estimation_artifacts/compute_halo_overlap.json \
   >> results/result
 ```
 
@@ -288,7 +288,7 @@ app 側では、まず section 名と `estimation_package` を決めることを
 
 特に PAPI のように複数回実行が必要になる採取は、app 側に細かく書かせすぎると重くなります。package 側は「`papi` が必要」と定義し、BenchKit 側が採取や保存の共通処理を引き受ける形が自然です。
 
-現状の参照実装では `results/estimation_inputs/` を使う例がありますが、これは将来も app 側が細かく書き続けるべきという意味ではありません。
+現状の参照実装では `results/estimation_artifacts/` を使う例がありますが、これは将来も app 側が細かく書き続けるべきという意味ではありません。
 
 `bk_emit_section` や `bk_emit_overlap` は残してよく、`estimate.sh` 内の宣言と共存できます。宣言は package 割当てを先に示し、`bk_emit_*` は実際に得られた値を Result JSON に流し込む手段として使います。
 
diff --git a/docs/guides/developer-reference.md b/docs/guides/developer-reference.md
index 33f3553..9c968a2 100644
--- a/docs/guides/developer-reference.md
+++ b/docs/guides/developer-reference.md
@@ -71,7 +71,7 @@ The supported baseline is that contributors can add apps, sites, and estimation
 
 `result_server/` provides:
 
-- ingest APIs for results, estimates, profiler archives, and estimation inputs
+- ingest APIs for results, estimates, profiler archives, and estimation artifacts
 - public and confidential result views
 - detailed result and estimate pages
 - usage reporting
@@ -101,6 +101,21 @@ The supported baseline is that contributors can add apps, sites, and estimation
 - `result_server/routes/admin.py`
   Admin-only user management.
 
+### Main API Endpoints
+
+The canonical estimation artifact endpoints are:
+
+- `POST /api/ingest/estimation-artifacts`
+  Upload a lightweight estimation artifact bundle associated with a source result UUID.
+- `GET /api/query/estimation-artifacts?uuid=<source_result_uuid>`
+  Download the stored estimation artifact bundle for re-estimation.
+
+The older `estimation-inputs` endpoint names remain as compatibility aliases
+only. New client code and documentation should use `estimation-artifacts`.
+Estimation artifact bundles may contain prepared estimator inputs, prediction
+outputs, and logs, but should not duplicate large profiler archives such as PA
+Data or `*.ncu-rep`.
+
 ### Main Templates
 
 - `result_server/templates/_results_base.html`
diff --git a/programs/genesis/estimate.sh b/programs/genesis/estimate.sh
new file mode 100644
index 0000000..1c26103
--- /dev/null
+++ b/programs/genesis/estimate.sh
@@ -0,0 +1,66 @@
+#!/bin/bash
+# estimate.sh — GENESIS estimation entrypoint and run-time section metadata.
+
+genesis_declare_estimation_layout() {
+  bk_clear_estimation_defaults
+  bk_clear_estimation_declarations
+  bk_define_current_estimation_package weakscaling
+  bk_define_future_estimation_package instrumented_app_sections_dummy
+  bk_define_baseline_system "${BK_ESTIMATION_BASELINE_SYSTEM:-MiyabiG}"
+  bk_define_baseline_exp "${BK_ESTIMATION_BASELINE_EXP:-${BK_GENESIS_EXP:-p8}}"
+  bk_define_future_system "${BK_ESTIMATION_FUTURE_SYSTEM:-GPU_MLP_TARGET}"
+  bk_define_current_target_nodes "${BK_ESTIMATION_CURRENT_TARGET_NODES:-1}"
+  bk_define_future_target_nodes "${BK_ESTIMATION_FUTURE_TARGET_NODES:-1}"
+  bk_declare_section --side future gpu_kernel_region gpu_kernel_mlp_v15
+}
+
+genesis_emit_estimation_data_from_fom() {
+  local fom="$1"
+  local artifact_path="results/padata0.tgz"
+  local padata_path="$artifact_path"
+
+  case "${BK_GENESIS_GPU_MLP_PROFILE:-false}" in
+    1|true|TRUE|yes|YES|on|ON) ;;
+    *) return 0 ;;
+  esac
+
+  if [[ -n "${GENESIS_BENCHKIT_ROOT:-}" ]]; then
+    padata_path="${GENESIS_BENCHKIT_ROOT}/${artifact_path}"
+  fi
+  if [[ ! -f "$padata_path" ]]; then
+    echo "Genesis GPU MLP estimation requested but profiler archive was not found: ${padata_path}" >&2
+    return 0
+  fi
+
+  bk_emit_declared_section --side future gpu_kernel_region "$fom" "$artifact_path"
+}
+
+source scripts/bk_functions.sh
+source scripts/estimation/common.sh
+
+BK_ESTIMATION_SECTION_DEFAULT_FACTOR="${BK_ESTIMATION_SECTION_DEFAULT_FACTOR:-1.0}"
+BK_GPU_MLP_ARTIFACT_MODE="${BK_GPU_MLP_ARTIFACT_MODE:-ncu}"
+BK_GPU_MLP_SOURCE_GPU="${BK_GPU_MLP_SOURCE_GPU:-H100}"
+BK_GPU_MLP_KERNEL_COUNT="${BK_GPU_MLP_KERNEL_COUNT:-20}"
+export BK_GPU_MLP_ARTIFACT_MODE
+export BK_GPU_MLP_SOURCE_GPU
+export BK_GPU_MLP_KERNEL_COUNT
+
+genesis_declare_estimation_layout
+bk_estimation_apply_declared_defaults
+BK_ESTIMATION_PACKAGE="${BK_ESTIMATION_PACKAGE:-$BK_ESTIMATION_FUTURE_PACKAGE}"
+
+if [[ "${BASH_SOURCE[0]}" != "$0" ]]; then
+  return 0 2>/dev/null || exit 0
+fi
+
+BK_ESTIMATION_INPUT_JSON="$1"
+
+bk_estimation_run_declared_future_package "$BK_ESTIMATION_INPUT_JSON"
+bk_estimation_run_recorded_current_with_weakscaling \
+  "${BK_ESTIMATION_BASELINE_SYSTEM:-MiyabiG}" \
+  "${BK_ESTIMATION_BASELINE_EXP:-}" \
+  "${BK_ESTIMATION_CURRENT_TARGET_NODES:-1}" \
+  "${BK_ESTIMATION_CURRENT_PACKAGE:-weakscaling}"
+
+bk_estimation_write_output "results/estimate_${est_code}_0.json"
diff --git a/programs/genesis/run.sh b/programs/genesis/run.sh
index dc15d40..05f97cf 100644
--- a/programs/genesis/run.sh
+++ b/programs/genesis/run.sh
@@ -8,13 +8,16 @@ nthreads="$4"
 numproc=$(( numproc_node * nodes ))
 
 source "${PWD}/scripts/bk_functions.sh"
+source "${PWD}/programs/genesis/estimate.sh"
 
 SCRIPT_DIR="${PWD}"
+export GENESIS_BENCHKIT_ROOT="$SCRIPT_DIR"
 REPO_DIR="genesis_benchmark_input"
 REPO_URL="https://github.com/genesis-release-r-ccs/${REPO_DIR}.git"
 BRANCH="main"
 dir_path="npt/genesis2.0beta_3.5fs/apoa1"
 header=p8
+exp="${BK_GENESIS_EXP:-$header}"
 input=${header}.inp
 resultsdir=${SCRIPT_DIR}/results
 artifactsdir=${SCRIPT_DIR}/artifacts
@@ -152,7 +155,15 @@ run_genesis_gh200_gpu() {
     fi
 
     genesis_profiler_tool=$(bk_get_profiler_tool "$genesis_profiler_requested") || return 1
-    genesis_profiler_level="${!profiler_level_var:-${GENESIS_PROFILER_LEVEL:-single}}"
+    local genesis_default_profiler_level="single"
+    case "${BK_GENESIS_GPU_MLP_PROFILE:-false}" in
+      1|true|TRUE|yes|YES|on|ON)
+        genesis_default_profiler_level="detailed"
+        export BK_PROFILER_NCU_RAW_CSV="${BK_PROFILER_NCU_RAW_CSV:-true}"
+        export BK_PROFILER_ARGS="${BK_PROFILER_ARGS:---launch-count ${BK_GPU_MLP_NCU_LAUNCH_COUNT:-20}}"
+        ;;
+    esac
+    genesis_profiler_level="${!profiler_level_var:-${GENESIS_PROFILER_LEVEL:-${genesis_default_profiler_level}}}"
     if [ -n "$genesis_profiler_tool" ]; then
         if [ "$genesis_profiler_tool" = "ncu" ] && ! command -v ncu >/dev/null 2>&1; then
             if [ "$genesis_profiler_explicit" -eq 1 ]; then
@@ -223,14 +234,17 @@ fom_val=$(awk -F'=' '/^[[:space:]]*dynamics[[:space:]]*=/ {
 			print $2;
 			exit
 			}' ${output})
-cd - > /dev/null
+cd "$SCRIPT_DIR" > /dev/null
 
 if [[ -z "$fom_val" ]]; then
     echo "Warning: FOM value not found in ${output}" >&2
     fom_val="nan"   # or 0.0
 fi
 
-bk_emit_result --fom "$fom_val" --nodes "$nodes" --numproc-node "$numproc_node" --nthreads "$nthreads" >> ${resultsdir}/result
+{
+    bk_emit_result --fom "$fom_val" --exp "$exp" --nodes "$nodes" --numproc-node "$numproc_node" --nthreads "$nthreads"
+    genesis_emit_estimation_data_from_fom "$fom_val"
+} >> "${resultsdir}/result"
 # if information is requierd
 #printf "%-10s nodes=%2d numproc=%3d  FOM: %.3f\n" \
 #    "$system" "$nodes" "$numproc" "$fom_val" >> ../results/result
diff --git a/programs/qws/estimate.sh b/programs/qws/estimate.sh
index 796355a..0b90341 100644
--- a/programs/qws/estimate.sh
+++ b/programs/qws/estimate.sh
@@ -1,6 +1,41 @@
 #!/bin/bash
 # estimate.sh — Reference package-based estimation entrypoint for qws
 
+qws_gpu_mlp_smoke_enabled() {
+  case "${BK_QWS_GPU_MLP_SMOKE:-false}" in
+    1|true|TRUE|yes|YES|on|ON) return 0 ;;
+    *) return 1 ;;
+  esac
+}
+
+qws_gpu_mlp_smoke_mode() {
+  case "${BK_QWS_GPU_MLP_SMOKE_MODE:-prediction}" in
+    perftools|input|predictor) printf 'perftools\n' ;;
+    *) printf 'prediction\n' ;;
+  esac
+}
+
+qws_repo_root() {
+  if [[ -f programs/qws/estimate.sh ]]; then
+    printf '.\n'
+  elif [[ -f ../programs/qws/estimate.sh ]]; then
+    printf '..\n'
+  else
+    printf '.\n'
+  fi
+}
+
+qws_results_dir() {
+  local root
+
+  root=$(qws_repo_root)
+  if [[ "$root" == "." ]]; then
+    printf 'results\n'
+  else
+    printf '%s/results\n' "$root"
+  fi
+}
+
 qws_declare_estimation_layout() {
   bk_clear_estimation_defaults
   bk_clear_estimation_declarations
@@ -17,17 +52,51 @@ qws_declare_estimation_layout() {
   bk_declare_section --side future halo_exchange quarter
   bk_declare_section --side future allreduce logp
   bk_declare_section --side future write_result half
+  if qws_gpu_mlp_smoke_enabled; then
+    bk_declare_section --side future gpu_kernel_region gpu_kernel_mlp_v15
+  fi
   bk_declare_overlap --side future compute_hopping,halo_exchange half
 }
 
 qws_create_dummy_estimation_artifact() {
   local rel_path="$1"
   local content="$2"
-  local full_path="results/${rel_path}"
+  local full_path
+
+  full_path="$(qws_results_dir)/${rel_path}"
   mkdir -p "$(dirname "$full_path")"
   printf '%s\n' "$content" > "$full_path"
 }
 
+qws_create_gpu_mlp_smoke_artifact() {
+  local mode
+  local rel_path
+  local root
+  local fixture_path
+  local full_path
+
+  mode=$(qws_gpu_mlp_smoke_mode)
+  if [[ "$mode" == "perftools" ]]; then
+    rel_path="estimation_artifacts/qws_gpu_kernel_mlp_v15_input.csv"
+    full_path="$(qws_results_dir)/${rel_path}"
+    mkdir -p "$(dirname "$full_path")"
+    cat > "$full_path" <<'EOF'
+kernel_name,src_gpu,tgt_gpu
+qws_smoke_uses_perftools_example,A100,H100
+EOF
+    printf 'results/%s\n' "$rel_path"
+    return 0
+  fi
+
+  rel_path="estimation_artifacts/qws_gpu_kernel_mlp_v15_pred.csv"
+  root=$(qws_repo_root)
+  fixture_path="${root}/programs/qws/fixtures/gpu_kernel_mlp_v15_pred.csv"
+  full_path="$(qws_results_dir)/${rel_path}"
+  mkdir -p "$(dirname "$full_path")"
+  cp "$fixture_path" "$full_path"
+  printf 'results/%s\n' "$rel_path"
+}
+
 qws_emit_estimation_data_from_fom() {
   local fom="$1"
   local section_prepare_rhs
@@ -36,6 +105,8 @@ qws_emit_estimation_data_from_fom() {
   local section_halo_exchange
   local section_allreduce
   local section_write_result
+  local section_gpu_kernel_region
+  local gpu_mlp_artifact
   local overlap_compute_halo
 
   section_prepare_rhs=$(awk -v x="$fom" 'BEGIN {printf "%.3f", x * 0.16}')
@@ -45,22 +116,29 @@ qws_emit_estimation_data_from_fom() {
   section_allreduce=$(awk -v x="$fom" 'BEGIN {printf "%.3f", x * 0.16}')
   section_write_result=$(awk -v x="$fom" 'BEGIN {printf "%.3f", x * 0.08}')
   overlap_compute_halo=$(awk -v x="$fom" 'BEGIN {printf "%.3f", x * 0.04}')
+  if qws_gpu_mlp_smoke_enabled; then
+    section_gpu_kernel_region=$(awk -v x="$fom" 'BEGIN {printf "%.3f", x * 0.10}')
+    gpu_mlp_artifact=$(qws_create_gpu_mlp_smoke_artifact)
+  fi
+
+  qws_create_dummy_estimation_artifact "estimation_artifacts/prepare_rhs_interval.json" "{\"section\":\"prepare_rhs\",\"kind\":\"interval_time\"}"
+  qws_create_dummy_estimation_artifact "estimation_artifacts/compute_hopping_papi.tgz" "dummy papi archive for compute_hopping"
+  qws_create_dummy_estimation_artifact "estimation_artifacts/compute_solver_papi.tgz" "dummy papi archive for compute_solver"
+  qws_create_dummy_estimation_artifact "estimation_artifacts/halo_exchange_trace.tgz" "dummy mpi trace archive for halo_exchange"
+  qws_create_dummy_estimation_artifact "estimation_artifacts/allreduce_trace.tgz" "dummy collective trace archive for allreduce"
+  qws_create_dummy_estimation_artifact "estimation_artifacts/write_result_interval.json" "{\"section\":\"write_result\",\"kind\":\"interval_time\"}"
+  qws_create_dummy_estimation_artifact "estimation_artifacts/compute_halo_overlap.json" "{\"overlap\":[\"compute_hopping\",\"halo_exchange\"],\"kind\":\"overlap_time\"}"
 
-  qws_create_dummy_estimation_artifact "estimation_inputs/prepare_rhs_interval.json" "{\"section\":\"prepare_rhs\",\"kind\":\"interval_time\"}"
-  qws_create_dummy_estimation_artifact "estimation_inputs/compute_hopping_papi.tgz" "dummy papi archive for compute_hopping"
-  qws_create_dummy_estimation_artifact "estimation_inputs/compute_solver_papi.tgz" "dummy papi archive for compute_solver"
-  qws_create_dummy_estimation_artifact "estimation_inputs/halo_exchange_trace.tgz" "dummy mpi trace archive for halo_exchange"
-  qws_create_dummy_estimation_artifact "estimation_inputs/allreduce_trace.tgz" "dummy collective trace archive for allreduce"
-  qws_create_dummy_estimation_artifact "estimation_inputs/write_result_interval.json" "{\"section\":\"write_result\",\"kind\":\"interval_time\"}"
-  qws_create_dummy_estimation_artifact "estimation_inputs/compute_halo_overlap.json" "{\"overlap\":[\"compute_hopping\",\"halo_exchange\"],\"kind\":\"overlap_time\"}"
-
-  bk_emit_declared_section --side future prepare_rhs "$section_prepare_rhs" results/estimation_inputs/prepare_rhs_interval.json
-  bk_emit_declared_section --side future compute_hopping "$section_compute_hopping" results/estimation_inputs/compute_hopping_papi.tgz
-  bk_emit_declared_section --side future compute_solver "$section_compute_solver" results/estimation_inputs/compute_solver_papi.tgz
-  bk_emit_declared_section --side future halo_exchange "$section_halo_exchange" results/estimation_inputs/halo_exchange_trace.tgz
-  bk_emit_declared_section --side future allreduce "$section_allreduce" results/estimation_inputs/allreduce_trace.tgz
-  bk_emit_declared_section --side future write_result "$section_write_result" results/estimation_inputs/write_result_interval.json
-  bk_emit_declared_overlap --side future compute_hopping,halo_exchange "$overlap_compute_halo" results/estimation_inputs/compute_halo_overlap.json
+  bk_emit_declared_section --side future prepare_rhs "$section_prepare_rhs" results/estimation_artifacts/prepare_rhs_interval.json
+  bk_emit_declared_section --side future compute_hopping "$section_compute_hopping" results/estimation_artifacts/compute_hopping_papi.tgz
+  bk_emit_declared_section --side future compute_solver "$section_compute_solver" results/estimation_artifacts/compute_solver_papi.tgz
+  bk_emit_declared_section --side future halo_exchange "$section_halo_exchange" results/estimation_artifacts/halo_exchange_trace.tgz
+  bk_emit_declared_section --side future allreduce "$section_allreduce" results/estimation_artifacts/allreduce_trace.tgz
+  bk_emit_declared_section --side future write_result "$section_write_result" results/estimation_artifacts/write_result_interval.json
+  if qws_gpu_mlp_smoke_enabled; then
+    bk_emit_declared_section --side future gpu_kernel_region "$section_gpu_kernel_region" "$gpu_mlp_artifact"
+  fi
+  bk_emit_declared_overlap --side future compute_hopping,halo_exchange "$overlap_compute_halo" results/estimation_artifacts/compute_halo_overlap.json
 }
 
 source scripts/bk_functions.sh
@@ -70,6 +148,14 @@ BK_ESTIMATION_SECTION_DEFAULT_FACTOR="${BK_ESTIMATION_SECTION_DEFAULT_FACTOR:-0.
 BK_ESTIMATION_LOGP_SECTION_NAME="${BK_ESTIMATION_LOGP_SECTION_NAME:-allreduce}"
 BK_ESTIMATION_INPUT_JSON="$1"
 
+if qws_gpu_mlp_smoke_enabled; then
+  if [[ "$(qws_gpu_mlp_smoke_mode)" == "perftools" ]]; then
+    export BK_GPU_MLP_ARTIFACT_MODE="${BK_GPU_MLP_ARTIFACT_MODE:-input}"
+  else
+    export BK_GPU_MLP_ARTIFACT_MODE="${BK_GPU_MLP_ARTIFACT_MODE:-prediction}"
+  fi
+fi
+
 qws_declare_estimation_layout
 bk_estimation_apply_declared_defaults
 BK_ESTIMATION_PACKAGE="${BK_ESTIMATION_PACKAGE:-$BK_ESTIMATION_FUTURE_PACKAGE}"
diff --git a/programs/qws/fixtures/gpu_kernel_mlp_v15_pred.csv b/programs/qws/fixtures/gpu_kernel_mlp_v15_pred.csv
new file mode 100644
index 0000000..18c6336
--- /dev/null
+++ b/programs/qws/fixtures/gpu_kernel_mlp_v15_pred.csv
@@ -0,0 +1,5 @@
+# Minimal PerfTools MLP_NN/v1.5-style prediction fixture for CI plumbing tests.
+kernel_name,src_gpu,tgt_gpu,Execution Time [ns],Memory Throughput [%],Achieved Occupancy,brk_memory,brk_pipeline_contention,brk_sync,brk_scheduling_overhead,t_mem_ns,t_comp_ns,t_roof_ns,efficiency_eta
+qws_smoke_kernel_0,A100,H100,1500000,48.0,0.72,0.42,0.28,0.20,0.10,700000,650000,700000,0.54
+qws_smoke_kernel_1,A100,H100,2500000,53.0,0.77,0.46,0.24,0.20,0.10,1200000,900000,1200000,0.62
+qws_smoke_kernel_2,A100,H100,2000000,58.0,0.81,0.50,0.20,0.20,0.10,1000000,780000,1000000,0.68
diff --git a/result_server/app.py b/result_server/app.py
index a94d41a..ac43cfe 100644
--- a/result_server/app.py
+++ b/result_server/app.py
@@ -77,7 +77,7 @@ def _configure_result_directories(app, base_dir):
     dir_map = {
         "RECEIVED_DIR": os.path.join(base_dir, "received"),
         "RECEIVED_PADATA_DIR": os.path.join(base_dir, "received_padata"),
-        "RECEIVED_ESTIMATION_INPUTS_DIR": os.path.join(base_dir, "received_estimation_inputs"),
+        "RECEIVED_ESTIMATION_ARTIFACTS_DIR": os.path.join(base_dir, "received_estimation_artifacts"),
         "ESTIMATED_DIR": os.path.join(base_dir, "estimated_results"),
     }
     for path in dir_map.values():
diff --git a/result_server/app_dev.py b/result_server/app_dev.py
index 89756fe..cb537ce 100644
--- a/result_server/app_dev.py
+++ b/result_server/app_dev.py
@@ -50,12 +50,12 @@ def setup_dev_environment(base_dir):
     for sub in [
         "main/received",
         "main/received_padata",
-        "main/received_estimation_inputs",
+        "main/received_estimation_artifacts",
         "main/estimated_results",
         "main/flask_session",
         "dev1/received",
         "dev1/received_padata",
-        "dev1/received_estimation_inputs",
+        "dev1/received_estimation_artifacts",
         "dev1/estimated_results",
         "dev1/flask_session",
     ]:
@@ -209,16 +209,16 @@ def payload_too_large(_error):
 
     received_dir = os.path.join(base_dir, "main", "received")
     received_padata_dir = os.path.join(base_dir, "main", "received_padata")
-    received_estimation_inputs_dir = os.path.join(base_dir, "main", "received_estimation_inputs")
+    received_estimation_artifacts_dir = os.path.join(base_dir, "main", "received_estimation_artifacts")
     estimated_dir = os.path.join(base_dir, "main", "estimated_results")
     os.makedirs(received_dir, exist_ok=True)
     os.makedirs(received_padata_dir, exist_ok=True)
-    os.makedirs(received_estimation_inputs_dir, exist_ok=True)
+    os.makedirs(received_estimation_artifacts_dir, exist_ok=True)
     os.makedirs(estimated_dir, exist_ok=True)
 
     app.config["RECEIVED_DIR"] = received_dir
     app.config["RECEIVED_PADATA_DIR"] = received_padata_dir
-    app.config["RECEIVED_ESTIMATION_INPUTS_DIR"] = received_estimation_inputs_dir
+    app.config["RECEIVED_ESTIMATION_ARTIFACTS_DIR"] = received_estimation_artifacts_dir
     app.config["ESTIMATED_DIR"] = estimated_dir
 
     # Home routes and loaders pull everything from current_app.config.
diff --git a/result_server/routes/api.py b/result_server/routes/api.py
index 1ce499e..1c6b255 100644
--- a/result_server/routes/api.py
+++ b/result_server/routes/api.py
@@ -361,10 +361,11 @@ def ingest_padata():
     return response, 200
 
 
+@api_bp.route("/api/ingest/estimation-artifacts", methods=["POST"])
 @api_bp.route("/api/ingest/estimation-inputs", methods=["POST"])
 @rate_limited(max_per_minute=120, key_fn=_api_rate_key, scope="api_ingest")
-def ingest_estimation_inputs():
-    """Estimation input archive (tgz) upload and expansion."""
+def ingest_estimation_artifacts():
+    """Estimation artifact archive (tgz) upload and expansion."""
     runner_id = require_api_key()
 
     uuid_str = request.form.get("id")
@@ -381,10 +382,10 @@ def ingest_estimation_inputs():
         abort(404, description=f"No result found for uuid={uuid_str}")
 
     result_stem = os.path.splitext(result_filename)[0]
-    inputs_root = current_app.config["RECEIVED_ESTIMATION_INPUTS_DIR"]
-    os.makedirs(inputs_root, exist_ok=True)
-    target_dir = os.path.join(inputs_root, result_stem)
-    temp_dir = tempfile.mkdtemp(prefix=f".{result_stem}.", dir=inputs_root)
+    artifacts_root = current_app.config["RECEIVED_ESTIMATION_ARTIFACTS_DIR"]
+    os.makedirs(artifacts_root, exist_ok=True)
+    target_dir = os.path.join(artifacts_root, result_stem)
+    temp_dir = tempfile.mkdtemp(prefix=f".{result_stem}.", dir=artifacts_root)
     try:
         _safe_extract_tar_bytes(uploaded_file, temp_dir)
         replaced = _replace_directory_after_success(temp_dir, target_dir)
@@ -393,7 +394,7 @@ def ingest_estimation_inputs():
             shutil.rmtree(temp_dir)
         raise
 
-    print(f"Saved estimation inputs: {target_dir}", flush=True)
+    print(f"Saved estimation artifacts: {target_dir}", flush=True)
     response = {
         "status": "uploaded",
         "id": uuid_str,
@@ -406,7 +407,7 @@ def ingest_estimation_inputs():
         target=result_stem,
         result="success",
         details={
-            "ingest_type": "estimation_inputs",
+            "ingest_type": "estimation_artifacts",
             "id": uuid_str,
             "replaced": replaced,
         },
@@ -508,10 +509,11 @@ def query_result():
     abort(404, description=f"No result found for system={system}, code={code}, exp={exp}")
 
 
+@api_bp.route("/api/query/estimation-artifacts", methods=["GET"])
 @api_bp.route("/api/query/estimation-inputs", methods=["GET"])
 @rate_limited(max_per_minute=60, key_fn=_api_rate_key, scope="api_query")
-def query_estimation_inputs():
-    """Return estimation input artifacts for a result UUID as a tar.gz archive."""
+def query_estimation_artifacts():
+    """Return estimation artifacts for a result UUID as a tar.gz archive."""
     runner_id = require_api_key()
 
     uuid_value = request.args.get("uuid")
@@ -526,16 +528,16 @@ def query_estimation_inputs():
 
     result_stem = os.path.splitext(result_filename)[0]
     source_dir = os.path.join(
-        current_app.config["RECEIVED_ESTIMATION_INPUTS_DIR"], result_stem
+        current_app.config["RECEIVED_ESTIMATION_ARTIFACTS_DIR"], result_stem
     )
     if not os.path.isdir(source_dir):
-        abort(404, description=f"No estimation inputs found for uuid={uuid_value}")
+        abort(404, description=f"No estimation artifacts found for uuid={uuid_value}")
 
     audit_event(
         "api_query_accepted",
         actor=runner_id,
         result="success",
-        details={"query_type": "estimation_inputs"},
+        details={"query_type": "estimation_artifacts"},
     )
 
     buffer = io.BytesIO()
@@ -551,7 +553,7 @@ def query_estimation_inputs():
         buffer,
         mimetype="application/gzip",
         as_attachment=True,
-        download_name=f"estimation_inputs_{result_stem}.tgz",
+        download_name=f"estimation_artifacts_{result_stem}.tgz",
     )
 
 
diff --git a/result_server/test_support.py b/result_server/test_support.py
index 8cda7d0..437609c 100644
--- a/result_server/test_support.py
+++ b/result_server/test_support.py
@@ -152,14 +152,14 @@ def build_api_route_app(
     *,
     received_dir,
     received_padata_dir,
-    received_estimation_inputs_dir,
+    received_estimation_artifacts_dir,
     estimated_dir,
 ):
     """Build a Flask app with the API, results, and estimated blueprints for API tests."""
     app = Flask(__name__)
     app.config["RECEIVED_DIR"] = received_dir
     app.config["RECEIVED_PADATA_DIR"] = received_padata_dir
-    app.config["RECEIVED_ESTIMATION_INPUTS_DIR"] = received_estimation_inputs_dir
+    app.config["RECEIVED_ESTIMATION_ARTIFACTS_DIR"] = received_estimation_artifacts_dir
     app.config["ESTIMATED_DIR"] = estimated_dir
     app.config["TESTING"] = True
 
diff --git a/result_server/tests/test_api_routes.py b/result_server/tests/test_api_routes.py
index 170b073..b33dd61 100644
--- a/result_server/tests/test_api_routes.py
+++ b/result_server/tests/test_api_routes.py
@@ -25,24 +25,24 @@ def tmp_dirs():
     """Create temporary directories used by the API tests."""
     received = tempfile.mkdtemp()
     received_padata = tempfile.mkdtemp()
-    received_estimation_inputs = tempfile.mkdtemp()
+    received_estimation_artifacts = tempfile.mkdtemp()
     estimated = tempfile.mkdtemp()
-    yield received, received_padata, received_estimation_inputs, estimated
+    yield received, received_padata, received_estimation_artifacts, estimated
     shutil.rmtree(received)
     shutil.rmtree(received_padata)
-    shutil.rmtree(received_estimation_inputs)
+    shutil.rmtree(received_estimation_artifacts)
     shutil.rmtree(estimated)
 
 
 @pytest.fixture
 def app(tmp_dirs):
     """Build a Flask app configured for API route tests."""
-    received, received_padata, received_estimation_inputs, estimated = tmp_dirs
+    received, received_padata, received_estimation_artifacts, estimated = tmp_dirs
 
     app = build_api_route_app(
         received_dir=received,
         received_padata_dir=received_padata,
-        received_estimation_inputs_dir=received_estimation_inputs,
+        received_estimation_artifacts_dir=received_estimation_artifacts,
         estimated_dir=estimated,
     )
     app.config["INGEST_KEYS"] = {API_KEY: "test-runner"}
@@ -126,11 +126,11 @@ def test_legacy_result_server_key_env_is_still_accepted(self, tmp_dirs, monkeypa
         """RESULT_SERVER_KEY should remain valid as the default runner fallback."""
         monkeypatch.delenv("RESULT_SERVER_KEYS", raising=False)
         monkeypatch.setenv("RESULT_SERVER_KEY", "legacy-key-12345678901234567890")
-        received, received_padata, received_estimation_inputs, estimated = tmp_dirs
+        received, received_padata, received_estimation_artifacts, estimated = tmp_dirs
         app = build_api_route_app(
             received_dir=received,
             received_padata_dir=received_padata,
-            received_estimation_inputs_dir=received_estimation_inputs,
+            received_estimation_artifacts_dir=received_estimation_artifacts,
             estimated_dir=estimated,
         )
 
@@ -419,9 +419,9 @@ def _seed_result(self, received_dir, uuid_value):
             json.dump({"code": "qws", "_server_uuid": uuid_value}, f)
         return os.path.splitext(filename)[0]
 
-    def test_ingest_estimation_inputs_expands_under_result_stem(self, client, tmp_dirs):
+    def test_ingest_estimation_artifacts_expands_under_result_stem(self, client, tmp_dirs):
         received = tmp_dirs[0]
-        estimation_inputs_dir = tmp_dirs[2]
+        estimation_artifacts_dir = tmp_dirs[2]
         uuid_value = "12345678-1234-1234-1234-123456789abc"
         result_stem = self._seed_result(received, uuid_value)
 
@@ -434,16 +434,16 @@ def test_ingest_estimation_inputs_expands_under_result_stem(self, client, tmp_di
         archive_bytes.seek(0)
 
         resp = client.post(
-            "/api/ingest/estimation-inputs",
-            data={"id": uuid_value, "file": (archive_bytes, "estimation_inputs.tgz")},
+            "/api/ingest/estimation-artifacts",
+            data={"id": uuid_value, "file": (archive_bytes, "estimation_artifacts.tgz")},
             headers={"X-API-Key": API_KEY},
             content_type="multipart/form-data",
         )
         assert resp.status_code == 200
-        saved_path = os.path.join(estimation_inputs_dir, result_stem, "prepare_rhs_interval.json")
+        saved_path = os.path.join(estimation_artifacts_dir, result_stem, "prepare_rhs_interval.json")
         assert os.path.exists(saved_path)
 
-    def test_ingest_estimation_inputs_rejects_parent_path_entry(self, client, tmp_dirs):
+    def test_ingest_estimation_artifacts_rejects_parent_path_entry(self, client, tmp_dirs):
         received = tmp_dirs[0]
         uuid_value = "12345678-1234-1234-1234-123456789abc"
         self._seed_result(received, uuid_value)
@@ -457,19 +457,19 @@ def test_ingest_estimation_inputs_rejects_parent_path_entry(self, client, tmp_di
         archive_bytes.seek(0)
 
         resp = client.post(
-            "/api/ingest/estimation-inputs",
-            data={"id": uuid_value, "file": (archive_bytes, "estimation_inputs.tgz")},
+            "/api/ingest/estimation-artifacts",
+            data={"id": uuid_value, "file": (archive_bytes, "estimation_artifacts.tgz")},
             headers={"X-API-Key": API_KEY},
             content_type="multipart/form-data",
         )
         assert resp.status_code == 400
 
-    def test_ingest_estimation_inputs_keeps_existing_data_on_bad_archive(self, client, tmp_dirs):
+    def test_ingest_estimation_artifacts_keeps_existing_data_on_bad_archive(self, client, tmp_dirs):
         received = tmp_dirs[0]
-        estimation_inputs_dir = tmp_dirs[2]
+        estimation_artifacts_dir = tmp_dirs[2]
         uuid_value = "12345678-1234-1234-1234-123456789abc"
         result_stem = self._seed_result(received, uuid_value)
-        target_dir = os.path.join(estimation_inputs_dir, result_stem)
+        target_dir = os.path.join(estimation_artifacts_dir, result_stem)
         os.makedirs(target_dir, exist_ok=True)
         existing_path = os.path.join(target_dir, "existing.json")
         with open(existing_path, "w", encoding="utf-8") as f:
@@ -484,15 +484,15 @@ def test_ingest_estimation_inputs_keeps_existing_data_on_bad_archive(self, clien
         archive_bytes.seek(0)
 
         resp = client.post(
-            "/api/ingest/estimation-inputs",
-            data={"id": uuid_value, "file": (archive_bytes, "estimation_inputs.tgz")},
+            "/api/ingest/estimation-artifacts",
+            data={"id": uuid_value, "file": (archive_bytes, "estimation_artifacts.tgz")},
             headers={"X-API-Key": API_KEY},
             content_type="multipart/form-data",
         )
         assert resp.status_code == 400
         assert os.path.exists(existing_path)
 
-    def test_ingest_estimation_inputs_rejects_absolute_path_entry(self, client, tmp_dirs):
+    def test_ingest_estimation_artifacts_rejects_absolute_path_entry(self, client, tmp_dirs):
         received = tmp_dirs[0]
         uuid_value = "12345678-1234-1234-1234-123456789abc"
         self._seed_result(received, uuid_value)
@@ -506,14 +506,14 @@ def test_ingest_estimation_inputs_rejects_absolute_path_entry(self, client, tmp_
         archive_bytes.seek(0)
 
         resp = client.post(
-            "/api/ingest/estimation-inputs",
-            data={"id": uuid_value, "file": (archive_bytes, "estimation_inputs.tgz")},
+            "/api/ingest/estimation-artifacts",
+            data={"id": uuid_value, "file": (archive_bytes, "estimation_artifacts.tgz")},
             headers={"X-API-Key": API_KEY},
             content_type="multipart/form-data",
         )
         assert resp.status_code == 400
 
-    def test_ingest_estimation_inputs_rejects_absolute_symlink(self, client, tmp_dirs):
+    def test_ingest_estimation_artifacts_rejects_absolute_symlink(self, client, tmp_dirs):
         received = tmp_dirs[0]
         uuid_value = "12345678-1234-1234-1234-123456789abc"
         self._seed_result(received, uuid_value)
@@ -527,14 +527,14 @@ def test_ingest_estimation_inputs_rejects_absolute_symlink(self, client, tmp_dir
         archive_bytes.seek(0)
 
         resp = client.post(
-            "/api/ingest/estimation-inputs",
-            data={"id": uuid_value, "file": (archive_bytes, "estimation_inputs.tgz")},
+            "/api/ingest/estimation-artifacts",
+            data={"id": uuid_value, "file": (archive_bytes, "estimation_artifacts.tgz")},
             headers={"X-API-Key": API_KEY},
             content_type="multipart/form-data",
         )
         assert resp.status_code == 400
 
-    def test_ingest_estimation_inputs_rejects_absolute_hardlink(self, client, tmp_dirs):
+    def test_ingest_estimation_artifacts_rejects_absolute_hardlink(self, client, tmp_dirs):
         received = tmp_dirs[0]
         uuid_value = "12345678-1234-1234-1234-123456789abc"
         self._seed_result(received, uuid_value)
@@ -548,25 +548,25 @@ def test_ingest_estimation_inputs_rejects_absolute_hardlink(self, client, tmp_di
         archive_bytes.seek(0)
 
         resp = client.post(
-            "/api/ingest/estimation-inputs",
-            data={"id": uuid_value, "file": (archive_bytes, "estimation_inputs.tgz")},
+            "/api/ingest/estimation-artifacts",
+            data={"id": uuid_value, "file": (archive_bytes, "estimation_artifacts.tgz")},
             headers={"X-API-Key": API_KEY},
             content_type="multipart/form-data",
         )
         assert resp.status_code == 400
 
-    def test_query_estimation_inputs_returns_archive(self, client, tmp_dirs):
+    def test_query_estimation_artifacts_returns_archive(self, client, tmp_dirs):
         received = tmp_dirs[0]
-        estimation_inputs_dir = tmp_dirs[2]
+        estimation_artifacts_dir = tmp_dirs[2]
         uuid_value = "12345678-1234-1234-1234-123456789abc"
         result_stem = self._seed_result(received, uuid_value)
-        target_dir = os.path.join(estimation_inputs_dir, result_stem)
+        target_dir = os.path.join(estimation_artifacts_dir, result_stem)
         os.makedirs(target_dir, exist_ok=True)
         with open(os.path.join(target_dir, "compute_solver_papi.tgz"), "wb") as f:
             f.write(b"dummy")
 
         resp = client.get(
-            f"/api/query/estimation-inputs?uuid={uuid_value}",
+            f"/api/query/estimation-artifacts?uuid={uuid_value}",
             headers={"X-API-Key": API_KEY},
         )
         assert resp.status_code == 200
diff --git a/result_server/tests/test_audit_logging.py b/result_server/tests/test_audit_logging.py
index 94e730c..b8c5a3d 100644
--- a/result_server/tests/test_audit_logging.py
+++ b/result_server/tests/test_audit_logging.py
@@ -74,16 +74,16 @@ def create_invitation(self, email, affiliations):
 def _api_app():
     received = tempfile.mkdtemp()
     received_padata = tempfile.mkdtemp()
-    received_estimation_inputs = tempfile.mkdtemp()
+    received_estimation_artifacts = tempfile.mkdtemp()
     estimated = tempfile.mkdtemp()
     app = build_api_route_app(
         received_dir=received,
         received_padata_dir=received_padata,
-        received_estimation_inputs_dir=received_estimation_inputs,
+        received_estimation_artifacts_dir=received_estimation_artifacts,
         estimated_dir=estimated,
     )
     app.config["INGEST_KEYS"] = {API_KEY: "test-runner"}
-    return app, (received, received_padata, received_estimation_inputs, estimated)
+    return app, (received, received_padata, received_estimation_artifacts, estimated)
 
 
 def _portal_app():
diff --git a/result_server/tests/test_csrf.py b/result_server/tests/test_csrf.py
index d8f4968..fe344d6 100644
--- a/result_server/tests/test_csrf.py
+++ b/result_server/tests/test_csrf.py
@@ -111,13 +111,13 @@ def test_admin_post_with_invalid_csrf_token_is_rejected():
 def test_api_ingest_is_exempt_from_csrf():
     received = tempfile.mkdtemp()
     received_padata = tempfile.mkdtemp()
-    received_estimation_inputs = tempfile.mkdtemp()
+    received_estimation_artifacts = tempfile.mkdtemp()
     estimated = tempfile.mkdtemp()
     try:
         app = build_api_route_app(
             received_dir=received,
             received_padata_dir=received_padata,
-            received_estimation_inputs_dir=received_estimation_inputs,
+            received_estimation_artifacts_dir=received_estimation_artifacts,
             estimated_dir=estimated,
         )
         app.secret_key = "test-secret"
@@ -136,5 +136,5 @@ def test_api_ingest_is_exempt_from_csrf():
 
         assert resp.status_code == 200
     finally:
-        for path in (received, received_padata, received_estimation_inputs, estimated):
+        for path in (received, received_padata, received_estimation_artifacts, estimated):
             shutil.rmtree(path)
diff --git a/result_server/tests/test_rate_limit.py b/result_server/tests/test_rate_limit.py
index 24396c2..0daab8c 100644
--- a/result_server/tests/test_rate_limit.py
+++ b/result_server/tests/test_rate_limit.py
@@ -81,12 +81,12 @@ def create_invitation(self, email, affiliations):
 def _api_app():
     received = tempfile.mkdtemp()
     received_padata = tempfile.mkdtemp()
-    received_estimation_inputs = tempfile.mkdtemp()
+    received_estimation_artifacts = tempfile.mkdtemp()
     estimated = tempfile.mkdtemp()
     app = build_api_route_app(
         received_dir=received,
         received_padata_dir=received_padata,
-        received_estimation_inputs_dir=received_estimation_inputs,
+        received_estimation_artifacts_dir=received_estimation_artifacts,
         estimated_dir=estimated,
     )
     app.config["INGEST_KEYS"] = {
@@ -96,7 +96,7 @@ def _api_app():
     app.config["REDIS_CONN"] = FakeRedis()
     app.config["REDIS_PREFIX"] = "test:"
     app.config["RATE_LIMITS"] = {"api_ingest": 1, "api_query": 1}
-    return app, (received, received_padata, received_estimation_inputs, estimated)
+    return app, (received, received_padata, received_estimation_artifacts, estimated)
 
 
 def _portal_app():
diff --git a/result_server/tests/test_upload_limits.py b/result_server/tests/test_upload_limits.py
index 6171067..59078a1 100644
--- a/result_server/tests/test_upload_limits.py
+++ b/result_server/tests/test_upload_limits.py
@@ -20,16 +20,16 @@
 def _api_app():
     received = tempfile.mkdtemp()
     received_padata = tempfile.mkdtemp()
-    received_estimation_inputs = tempfile.mkdtemp()
+    received_estimation_artifacts = tempfile.mkdtemp()
     estimated = tempfile.mkdtemp()
     app = build_api_route_app(
         received_dir=received,
         received_padata_dir=received_padata,
-        received_estimation_inputs_dir=received_estimation_inputs,
+        received_estimation_artifacts_dir=received_estimation_artifacts,
         estimated_dir=estimated,
     )
     app.config["INGEST_KEYS"] = {API_KEY: "test-runner"}
-    return app, (received, received_padata, received_estimation_inputs, estimated)
+    return app, (received, received_padata, received_estimation_artifacts, estimated)
 
 
 def _cleanup(paths):
@@ -58,7 +58,7 @@ def test_padata_upload_over_max_content_length_returns_413():
         _cleanup(temp_dirs)
 
 
-def test_estimation_inputs_rejects_archive_member_over_limit():
+def test_estimation_artifacts_rejects_archive_member_over_limit():
     app, temp_dirs = _api_app()
     received = temp_dirs[0]
     app.config["MAX_ARCHIVE_MEMBER_SIZE"] = 3
@@ -78,7 +78,7 @@ def test_estimation_inputs_rejects_archive_member_over_limit():
     try:
         with app.test_client() as client:
             resp = client.post(
-                "/api/ingest/estimation-inputs",
+                "/api/ingest/estimation-artifacts",
                 data={"id": uuid_value, "file": (archive_bytes, "inputs.tgz")},
                 headers={"X-API-Key": API_KEY},
                 content_type="multipart/form-data",
diff --git a/scripts/bk_functions.sh b/scripts/bk_functions.sh
index 9dc8c91..e4f098d 100644
--- a/scripts/bk_functions.sh
+++ b/scripts/bk_functions.sh
@@ -908,6 +908,8 @@ bk_profiler_write_meta() {
           else
             _bk_meta_ncu_report_path=""
           fi
+          _bk_meta_ncu_raw_csv_path="raw/${_bk_meta_name}/profile_raw.csv"
+          _bk_meta_ncu_raw_csv_abs="${_bk_meta_stage_dir}/${_bk_meta_ncu_raw_csv_path}"
           ;;
         *)
           _bk_meta_text_path=""
@@ -965,6 +967,13 @@ bk_profiler_write_meta() {
         printf '        {"kind": "ncu_report", "path": "%s"}' "$_bk_meta_ncu_report_path"
         _bk_meta_has_report=1
       fi
+      if [ "${_bk_meta_tool}" = "ncu" ] && [ -s "${_bk_meta_ncu_raw_csv_abs:-}" ]; then
+        if [ "$_bk_meta_has_report" -eq 1 ]; then
+          printf ',\n'
+        fi
+        printf '        {"kind": "ncu_raw_csv", "path": "%s"}' "$_bk_meta_ncu_raw_csv_path"
+        _bk_meta_has_report=1
+      fi
       if [ "$_bk_meta_has_report" -eq 1 ]; then
         printf '\n'
       fi
@@ -1141,6 +1150,19 @@ bk_profiler() {
       else
         echo "bk_profiler[ncu]: failed ${_bk_ncu_rep_name} level=${_bk_profiler_level} status=${_bk_profiler_status}" >&2
       fi
+      case "${BK_PROFILER_NCU_RAW_CSV:-false}" in
+        1|true|TRUE|yes|YES|on|ON)
+          _bk_ncu_report_file=$(bk_profiler_find_ncu_report "$_bk_ncu_rep_dir" || true)
+          if [ -n "$_bk_ncu_report_file" ]; then
+            ncu --import "$_bk_ncu_report_file" \
+              --page raw \
+              --csv \
+              --print-units base \
+              --print-fp \
+              > "${_bk_ncu_rep_dir}/profile_raw.csv" 2> "${_bk_ncu_rep_dir}/profile_raw.csv.log" || true
+          fi
+          ;;
+      esac
       cp -R "$_bk_ncu_rep_dir" "$_bk_stage_dir/raw/${_bk_ncu_rep_name}"
       _bk_profiler_run_names="${_bk_ncu_rep_name}"
       _bk_profiler_run_events="${_bk_profiler_level}"
diff --git a/scripts/estimation/packages/instrumented_app_sections_dummy.sh b/scripts/estimation/packages/instrumented_app_sections_dummy.sh
index 2382abb..1549914 100644
--- a/scripts/estimation/packages/instrumented_app_sections_dummy.sh
+++ b/scripts/estimation/packages/instrumented_app_sections_dummy.sh
@@ -31,6 +31,7 @@ bk_estimation_package_metadata() {
     "quarter",
     "counter_papi_detailed",
     "trace_mpi_basic",
+    "gpu_kernel_mlp_v15",
     "logp"
   ],
   "supported_overlap_packages": [
diff --git a/scripts/estimation/prepare_gpu_mlp_ncu_input.py b/scripts/estimation/prepare_gpu_mlp_ncu_input.py
new file mode 100644
index 0000000..50a4a06
--- /dev/null
+++ b/scripts/estimation/prepare_gpu_mlp_ncu_input.py
@@ -0,0 +1,391 @@
+#!/usr/bin/env python3
+"""Prepare a PerfTools MLP_NN/v1.5 input CSV from Nsight Compute profile data.
+
+This is a small compatibility bridge for BenchKit. It takes the wide Nsight
+Compute raw CSV exported from ``profile.ncu-rep`` (passed directly or found in
+a BenchKit padata archive), converts it into the CSV layout expected by
+PerfTools' ``MLP_NN/examples/prepare_data.py``, then fills the current v1.5
+spec-sheet gaps that otherwise leave required SRC/TGT columns as NaN.
+"""
+
+from __future__ import annotations
+
+import argparse
+import csv
+import math
+import os
+import shutil
+import subprocess
+import sys
+import tarfile
+import tempfile
+import zipfile
+from pathlib import Path
+
+import pandas as pd
+
+
+SPEC_DEFAULTS = {
+    "A100": {
+        "GPU Maximum Warps Per Scheduler [warp]": 16,
+        "Theoretical Active Warps per SM [warp]": 64,
+        "Theoretical Active Warps Per Scheduler [warp]": 16,
+        "Shared Memory Configuration Size [byte]": 167936,
+        "Block Limit Warps [block]": 64,
+    },
+    "H100": {
+        "GPU Maximum Warps Per Scheduler [warp]": 16,
+        "Theoretical Active Warps per SM [warp]": 64,
+        "Theoretical Active Warps Per Scheduler [warp]": 16,
+        "Shared Memory Configuration Size [byte]": 233472,
+        "Block Limit Warps [block]": 64,
+    },
+    "GB200": {
+        "GPU Maximum Warps Per Scheduler [warp]": 16,
+        "Theoretical Active Warps per SM [warp]": 64,
+        "Theoretical Active Warps Per Scheduler [warp]": 16,
+        "Shared Memory Configuration Size [byte]": 233472,
+        "Block Limit Warps [block]": 64,
+    },
+    "GB10": {
+        "GPU Maximum Warps Per Scheduler [warp]": 16,
+        "Theoretical Active Warps per SM [warp]": 64,
+        "Theoretical Active Warps Per Scheduler [warp]": 16,
+        "Shared Memory Configuration Size [byte]": 101376,
+        "Block Limit Warps [block]": 64,
+    },
+}
+
+ALLOWED_NAN_COLUMNS = {"Warp Cycles Per Executed Instruction [cycle/inst]"}
+
+
+def parse_args() -> argparse.Namespace:
+    parser = argparse.ArgumentParser()
+    input_group = parser.add_mutually_exclusive_group(required=True)
+    input_group.add_argument("--padata", help="BenchKit padata*.tgz archive")
+    input_group.add_argument("--raw-csv", help="Nsight Compute raw wide CSV")
+    parser.add_argument("--perftools-root", required=True)
+    parser.add_argument("--source-gpu", default="H100")
+    parser.add_argument("--kernel-count", type=int, default=20)
+    parser.add_argument("--out-csv", required=True)
+    parser.add_argument("--work-dir")
+    parser.add_argument("--keep-work", action="store_true")
+    parser.add_argument(
+        "--allow-nan",
+        action="append",
+        default=[],
+        help="Additional prepared-input column allowed to remain NaN",
+    )
+    return parser.parse_args()
+
+
+def safe_members(tgz: Path) -> list[tarfile.TarInfo]:
+    members: list[tarfile.TarInfo] = []
+    with tarfile.open(tgz, "r:gz") as archive:
+        for member in archive.getmembers():
+            member_path = Path(member.name)
+            if member_path.is_absolute() or ".." in member_path.parts:
+                raise SystemExit(f"unsafe padata member path: {member.name}")
+            members.append(member)
+    return members
+
+
+def extract_padata(tgz: Path, dest: Path) -> Path:
+    members = safe_members(tgz)
+    with tarfile.open(tgz, "r:gz") as archive:
+        archive.extractall(dest, members=members)
+
+    candidates = sorted(dest.rglob("profile_raw.csv"))
+    if candidates:
+        return candidates[0]
+
+    raise SystemExit(
+        f"{tgz} does not contain profile_raw.csv; enable BK_PROFILER_NCU_RAW_CSV=true"
+    )
+
+
+def strip_ncu_log_preamble(raw_csv: Path, clean_csv: Path) -> None:
+    lines = raw_csv.read_text(encoding="utf-8", errors="replace").splitlines()
+    start = None
+    for idx, line in enumerate(lines):
+        if line.startswith('"ID","Process ID"'):
+            start = idx
+            break
+    if start is None:
+        raise SystemExit(f"no Nsight Compute CSV header found in {raw_csv}")
+    clean_csv.write_text("\n".join(lines[start:]) + "\n", encoding="utf-8")
+
+
+def read_clean_raw_csv(clean_csv: Path) -> pd.DataFrame:
+    df = pd.read_csv(clean_csv, low_memory=False)
+    if "Kernel Name" not in df.columns:
+        raise SystemExit(f"raw CSV has no Kernel Name column: {clean_csv}")
+    return df[df["Kernel Name"].notna()].copy()
+
+
+def numeric(df: pd.DataFrame, column: str) -> pd.Series:
+    if column not in df.columns:
+        return pd.Series([pd.NA] * len(df), index=df.index)
+    return pd.to_numeric(
+        df[column].astype(str).str.replace(",", "", regex=False),
+        errors="coerce",
+    )
+
+
+def first_numeric(df: pd.DataFrame, *columns: str) -> pd.Series:
+    for column in columns:
+        series = numeric(df, column)
+        if series.notna().any():
+            return series
+    return pd.Series([pd.NA] * len(df), index=df.index)
+
+
+def build_wide_ncu_csv(raw_df: pd.DataFrame, out_csv: Path, source_gpu: str) -> None:
+    out = raw_df.copy()
+
+    duration_ns = first_numeric(raw_df, "gpu__time_duration.sum")
+    dram_bps = first_numeric(raw_df, "dram__bytes.sum.per_second")
+    if not dram_bps.notna().any():
+        dram_bps = first_numeric(raw_df, "dram__bytes.sum") / (duration_ns * 1e-9)
+
+    values = {
+        "Duration [ns]": duration_ns,
+        "Block Size": first_numeric(raw_df, "launch__block_size"),
+        "Grid Size": first_numeric(raw_df, "launch__grid_size"),
+        "Threads": first_numeric(raw_df, "launch__thread_count"),
+        "Registers Per Thread [register/thread]": first_numeric(
+            raw_df, "launch__registers_per_thread"
+        ),
+        "Static Shared Memory Per Block [byte/block]": first_numeric(
+            raw_df, "launch__shared_mem_per_block_static"
+        ),
+        "Dynamic Shared Memory Per Block [byte/block]": first_numeric(
+            raw_df, "launch__shared_mem_per_block_dynamic"
+        ),
+        "Shared Memory Per Block [byte/block]": first_numeric(
+            raw_df, "launch__shared_mem_per_block"
+        ),
+        "Memory Throughput [byte/s]": dram_bps,
+        "Achieved Occupancy [%]": first_numeric(
+            raw_df, "sm__warps_active.avg.pct_of_peak_sustained_active"
+        ),
+        "Achieved Active Warps Per SM [warp]": first_numeric(
+            raw_df, "sm__warps_active.avg.per_cycle_active"
+        ),
+        "Eligible Warps Per Scheduler [warp]": first_numeric(
+            raw_df, "smsp__warps_eligible.avg.per_cycle_active"
+        ),
+        "Compute (SM) Throughput [%]": first_numeric(
+            raw_df, "sm__throughput.avg.pct_of_peak_sustained_elapsed"
+        ),
+        "Memory Throughput [%]": first_numeric(
+            raw_df,
+            "gpu__compute_memory_throughput.avg.pct_of_peak_sustained_elapsed",
+            "gpu__dram_throughput.avg.pct_of_peak_sustained_elapsed",
+            "FBSP.TriageCompute.dramc__throughput.avg.pct_of_peak_sustained_elapsed",
+        ),
+        "L1/TEX Cache Throughput [%]": first_numeric(
+            raw_df,
+            "l1tex__throughput.avg.pct_of_peak_sustained_active",
+            "l1tex__throughput.avg.pct_of_peak_sustained_elapsed",
+            "SM_A.TriageCompute.l1tex__throughput.avg.pct_of_peak_sustained_elapsed",
+        ),
+        "L2 Cache Throughput [%]": first_numeric(
+            raw_df,
+            "lts__throughput.avg.pct_of_peak_sustained_elapsed",
+            "LTS.TriageCompute.lts__throughput.avg.pct_of_peak_sustained_elapsed",
+        ),
+        "Waves Per SM": first_numeric(raw_df, "launch__waves_per_multiprocessor"),
+        "Elapsed Cycles [cycle]": first_numeric(raw_df, "sm__cycles_elapsed.avg"),
+        "Theoretical Active Warps per SM [warp]": pd.Series(
+            [64] * len(raw_df), index=raw_df.index
+        ),
+        "Block Limit Registers [block]": first_numeric(
+            raw_df, "launch__occupancy_limit_registers"
+        ),
+        "Block Limit Warps [block]": first_numeric(raw_df, "launch__occupancy_limit_warps"),
+        "Block Limit SM [block]": first_numeric(raw_df, "launch__occupancy_limit_blocks"),
+        "Block Limit Shared Mem [block]": first_numeric(
+            raw_df, "launch__occupancy_limit_shared_mem"
+        ),
+    }
+
+    for label, raw_name in {
+        "Stall Barrier [inst]": "barrier",
+        "Stall Branch Resolving [inst]": "branch_resolving",
+        "Stall Dispatch Stall [inst]": "dispatch_stall",
+        "Stall Drain [inst]": "drain",
+        "Stall LG Throttle [inst]": "lg_throttle",
+        "Stall Long Scoreboard [inst]": "long_scoreboard",
+        "Stall MIO Throttle [inst]": "mio_throttle",
+        "Stall Math Pipe Throttle [inst]": "math_pipe_throttle",
+        "Stall Membar [inst]": "membar",
+        "Stall Misc [inst]": "misc",
+        "Stall No Instruction [inst]": "no_instruction",
+        "Stall Not Selected [inst]": "not_selected",
+        "Stall Short Scoreboard [inst]": "short_scoreboard",
+        "Stall Sleeping [inst]": "sleeping",
+        "Stall Tex Throttle [inst]": "tex_throttle",
+        "Stall Wait [inst]": "wait",
+    }.items():
+        values[label] = first_numeric(
+            raw_df,
+            f"smsp__average_warps_issue_stalled_{raw_name}_per_issue_active.ratio",
+            f"smsp__pcsamp_warps_issue_stalled_{raw_name}",
+        )
+
+    for label, op in {
+        "Predicated-On FFMA Operations Per Cycle [inst]": "ffma",
+        "Predicated-On FADD Thread Instructions Executed Per Cycle [inst/cycle]": "fadd",
+        "Predicated-On FMUL Thread Instructions Executed Per Cycle [inst/cycle]": "fmul",
+        "Predicated-On DFMA Operations Per Cycle [inst]": "dfma",
+        "Predicated-On DADD Thread Instructions Executed Per Cycle [inst/cycle]": "dadd",
+        "Predicated-On DMUL Thread Instructions Executed Per Cycle [inst/cycle]": "dmul",
+    }.items():
+        values[label] = first_numeric(
+            raw_df, f"smsp__sass_thread_inst_executed_op_{op}_pred_on.avg.per_cycle_elapsed"
+        )
+
+    for label, series in values.items():
+        out[label] = series
+
+    out["SRC GPU"] = source_gpu
+    out_csv.parent.mkdir(parents=True, exist_ok=True)
+    out.to_csv(out_csv, index=False)
+
+
+def make_prepare_data_zip(source_gpu: str, wide_csv: Path, out_zip: Path) -> None:
+    out_zip.parent.mkdir(parents=True, exist_ok=True)
+    with zipfile.ZipFile(out_zip, "w", zipfile.ZIP_DEFLATED) as archive:
+        archive.write(wide_csv, f"{source_gpu}/benchkit_ncu.csv")
+
+
+def run_prepare_data(
+    perftools_root: Path,
+    raw_zip: Path,
+    source_gpu: str,
+    kernel_count: int,
+    out_csv: Path,
+) -> None:
+    prepare_data = perftools_root / "MLP_NN" / "examples" / "prepare_data.py"
+    if not prepare_data.is_file():
+        raise SystemExit(f"PerfTools prepare_data.py not found: {prepare_data}")
+
+    cmd = [
+        sys.executable,
+        str(prepare_data),
+        "--raw",
+        str(raw_zip),
+        "--src",
+        source_gpu,
+        "--n",
+        str(kernel_count),
+        "--out",
+        str(out_csv),
+    ]
+    subprocess.run(cmd, cwd=perftools_root, check=True)
+
+
+def fill_spec_defaults(df: pd.DataFrame) -> None:
+    for role, gpu_col in (("SRC", "src_gpu"), ("TGT", "tgt_gpu")):
+        if gpu_col not in df.columns:
+            continue
+        for row_idx, gpu_name in df[gpu_col].astype(str).items():
+            defaults = SPEC_DEFAULTS.get(gpu_name.upper())
+            if defaults is None:
+                continue
+            for suffix, value in defaults.items():
+                column = f"{role} {suffix}"
+                if column in df.columns and is_missing(df.at[row_idx, column]):
+                    df.at[row_idx, column] = value
+
+
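+def is_missing(value: object) -> bool:
+    """Return True for None, NaN or pd.NA scalars, and blank strings."""
+    if value is None:
+        return True
+    if isinstance(value, str):
+        return value.strip() == ""
+    # Cells read back via DataFrame.at are scalars, so pd.isna returns a single
+    # boolean here; it covers float NaN, numpy NaN, pd.NA, and pd.NaT.
+    return bool(pd.isna(value))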
+
+
+def finalize_prepared_input(
+    prepared_csv: Path,
+    raw_df: pd.DataFrame,
+    out_csv: Path,
+    allowed_nan: set[str],
+) -> None:
+    df = pd.read_csv(prepared_csv)
+
+    ipc = first_numeric(
+        raw_df,
+        "sm__inst_executed.avg.per_cycle_active",
+        "TPC.TriageCompute.sm__inst_executed_realtime.avg.per_cycle_active",
+    ).reset_index(drop=True)
+    if "Executed Ipc Active [inst/cycle]" in df.columns:
+        df["Executed Ipc Active [inst/cycle]"] = ipc.iloc[: len(df)].to_numpy()
+        mean_ipc = df["Executed Ipc Active [inst/cycle]"].mean()
+        df["Executed Ipc Active [inst/cycle]"] = df[
+            "Executed Ipc Active [inst/cycle]"
+        ].fillna(mean_ipc)
+
+    fill_spec_defaults(df)
+    out_csv.parent.mkdir(parents=True, exist_ok=True)
+    df.to_csv(out_csv, index=False, quoting=csv.QUOTE_MINIMAL)
+
+    nan_counts = df.isna().sum()
+    bad_columns = sorted(col for col, count in nan_counts.items() if count > 0)
+    unexpected = [col for col in bad_columns if col not in allowed_nan]
+    if unexpected:
+        formatted = "\n".join(f"  {col}: {int(nan_counts[col])}" for col in unexpected)
+        raise SystemExit(f"prepared input still has unexpected NaN columns:\n{formatted}")
+
+
+def main() -> None:
+    args = parse_args()
+    perftools_root = Path(args.perftools_root).resolve()
+    out_csv = Path(args.out_csv).resolve()
+    work_dir_owned = False
+    if args.work_dir:
+        work_dir = Path(args.work_dir).resolve()
+        work_dir.mkdir(parents=True, exist_ok=True)
+    else:
+        work_dir = Path(tempfile.mkdtemp(prefix="benchkit-gpu-mlp-"))
+        work_dir_owned = True
+
+    try:
+        if args.raw_csv:
+            raw_csv = Path(args.raw_csv).resolve()
+        else:
+            raw_csv = extract_padata(Path(args.padata).resolve(), work_dir / "padata")
+
+        clean_csv = work_dir / "profile_raw_clean.csv"
+        wide_csv = work_dir / "wide" / args.source_gpu / "benchkit_ncu.csv"
+        raw_zip = work_dir / "benchkit_ncu_wide.zip"
+        prepared_csv = work_dir / "perftools_prepared.csv"
+
+        strip_ncu_log_preamble(raw_csv, clean_csv)
+        raw_df = read_clean_raw_csv(clean_csv)
+        if raw_df.empty:
+            raise SystemExit(f"no kernel rows found in {raw_csv}")
+        kernel_count = min(max(args.kernel_count, 1), len(raw_df))
+
+        build_wide_ncu_csv(raw_df, wide_csv, args.source_gpu)
+        make_prepare_data_zip(args.source_gpu, wide_csv, raw_zip)
+        run_prepare_data(perftools_root, raw_zip, args.source_gpu, kernel_count, prepared_csv)
+        finalize_prepared_input(
+            prepared_csv,
+            raw_df,
+            out_csv,
+            allowed_nan=ALLOWED_NAN_COLUMNS | set(args.allow_nan or []),
+        )
+        print(f"wrote {out_csv}: {kernel_count} kernels")
+    finally:
+        if work_dir_owned and not args.keep_work:
+            shutil.rmtree(work_dir, ignore_errors=True)
+
+
+if __name__ == "__main__":
+    main()
diff --git a/scripts/estimation/run.sh b/scripts/estimation/run.sh
index 7d1732f..15c5e19 100644
--- a/scripts/estimation/run.sh
+++ b/scripts/estimation/run.sh
@@ -12,12 +12,94 @@ set -euo pipefail
 code="$1"
 estimate_script="programs/${code}/estimate.sh"
 
+bk_estimation_bool_enabled() {
+  case "${1:-}" in
+    1|true|TRUE|yes|YES|on|ON) return 0 ;;
+    *) return 1 ;;
+  esac
+}
+
+bk_estimation_gpu_mlp_perftools_needed() {
+  if bk_estimation_bool_enabled "${BK_GPU_MLP_FETCH_PERFTOOLS:-false}"; then
+    return 0
+  fi
+
+  if [[ "${code:-}" == "genesis" ]] && bk_estimation_bool_enabled "${BK_GENESIS_GPU_MLP_PROFILE:-false}"; then
+    return 0
+  fi
+
+  if bk_estimation_bool_enabled "${BK_QWS_GPU_MLP_SMOKE:-false}"; then
+    case "${BK_QWS_GPU_MLP_SMOKE_MODE:-prediction}" in
+      perftools|input|predictor) return 0 ;;
+    esac
+  fi
+
+  return 1
+}
+
+bk_estimation_prepare_gpu_mlp_perftools() {
+  local repo="${BK_GPU_MLP_PERFTOOLS_REPO:-https://github.com/masaaki-kondo/PerfTools.git}"
+  local ref="${BK_GPU_MLP_PERFTOOLS_REF:-main}"
+  local root="${BK_GPU_MLP_PERFTOOLS_ROOT:-.benchkit_estimation_tools/PerfTools}"
+  local input_csv
+  local use_qws_example=0
+  local use_genesis_ncu=0
+
+  if ! bk_estimation_gpu_mlp_perftools_needed; then
+    return 0
+  fi
+
+  if bk_estimation_bool_enabled "${BK_QWS_GPU_MLP_SMOKE:-false}"; then
+    case "${BK_QWS_GPU_MLP_SMOKE_MODE:-prediction}" in
+      perftools|input|predictor) use_qws_example=1 ;;
+    esac
+  fi
+  if [[ "${code:-}" == "genesis" ]] && bk_estimation_bool_enabled "${BK_GENESIS_GPU_MLP_PROFILE:-false}"; then
+    use_genesis_ncu=1
+  fi
+
+  if [[ ! -f "${root}/MLP_NN/v1.5/predict_v15.py" ]]; then
+    if ! command -v git >/dev/null 2>&1; then
+      echo "ERROR: git is required to fetch PerfTools for GPU MLP estimation" >&2
+      return 1
+    fi
+
+    mkdir -p "$(dirname "$root")"
+    echo "Fetching PerfTools for GPU MLP estimation: ${repo} (${ref})"
+    git clone --depth 1 "$repo" "$root"
+    if [[ "$ref" != "main" && "$ref" != "master" ]]; then
+      git -C "$root" fetch --depth 1 origin "$ref" || true
+      git -C "$root" checkout "$ref"
+    fi
+  fi
+
+  export BK_GPU_MLP_PERFTOOLS_ROOT="$root"
+  export BK_GPU_MLP_OUTPUT_DIR="${BK_GPU_MLP_OUTPUT_DIR:-results/estimation_artifacts/gpu_kernel_mlp_v15}"
+
+  echo "GPU MLP estimator root: ${BK_GPU_MLP_PERFTOOLS_ROOT}"
+  if [[ "$use_genesis_ncu" -eq 1 ]]; then
+    export BK_GPU_MLP_ARTIFACT_MODE="${BK_GPU_MLP_ARTIFACT_MODE:-ncu}"
+    echo "GPU MLP estimator artifact mode: ${BK_GPU_MLP_ARTIFACT_MODE}"
+  elif [[ "$use_qws_example" -eq 1 ]]; then
+    input_csv="${BK_GPU_MLP_INPUT_CSV:-${root}/MLP_NN/examples/example_input_mixed-src_20kernels.csv}"
+    if [[ ! -f "$input_csv" ]]; then
+      echo "ERROR: PerfTools GPU MLP input CSV not found: ${input_csv}" >&2
+      return 1
+    fi
+    export BK_GPU_MLP_INPUT_CSV="$input_csv"
+    export BK_GPU_MLP_ARTIFACT_MODE="${BK_GPU_MLP_ARTIFACT_MODE:-input}"
+    echo "GPU MLP estimator input CSV: ${BK_GPU_MLP_INPUT_CSV}"
+  fi
+}
+
 # Check if the application has an estimate script
 if [[ ! -f "$estimate_script" ]]; then
   echo "WARNING: $estimate_script not found, skipping estimation"
   exit 0
 fi
 
+bk_estimation_prepare_gpu_mlp_perftools
+
 # Run estimation for each result JSON
 found=0
 for json_file in results/result[0-9]*.json; do
diff --git a/scripts/estimation/section_packages/gpu_kernel_mlp_v15.sh b/scripts/estimation/section_packages/gpu_kernel_mlp_v15.sh
new file mode 100644
index 0000000..93b826a
--- /dev/null
+++ b/scripts/estimation/section_packages/gpu_kernel_mlp_v15.sh
@@ -0,0 +1,520 @@
+#!/bin/bash
+# gpu_kernel_mlp_v15.sh - Section package for the PerfTools MLP_NN/v1.5 GPU estimator.
+
+bk_section_package_metadata_gpu_kernel_mlp_v15() {
+  cat <<'EOF'
+{
+  "name": "gpu_kernel_mlp_v15",
+  "fallback_target": "identity",
+  "source_system_scope": {
+    "kind": "benchmark_system",
+    "accepted_values": ["any"]
+  },
+  "target_system_scope": {
+    "accepted_values": ["any"]
+  },
+  "item_kind_scope": ["section"],
+  "required_result_fields": ["name", "time or bench_time"],
+  "required_artifact_kinds": [
+    "PerfTools MLP_NN/v1.5 prepared input CSV",
+    "precomputed prediction CSV",
+    "or BenchKit padata archive with Nsight Compute raw CSV"
+  ],
+  "acquisition_mode": "external",
+  "output_fields": [
+    "time",
+    "bench_time",
+    "scaling_method",
+    "metrics",
+    "package_applicability"
+  ],
+  "not_applicable_when": [
+    "item kind is not section",
+    "neither section artifact nor BK_GPU_MLP_INPUT_CSV/BK_GPU_MLP_PREDICTION_CSV is available",
+    "padata artifact mode is requested but the archive has no Nsight Compute raw CSV",
+    "PerfTools checkout is not available when running the external predictor",
+    "Python runtime for CSV parsing or external inference is not available",
+    "prediction CSV does not contain a recognized execution-time column"
+  ]
+}
+EOF
+}
+
+_bk_gpu_mlp_section_key() {
+  local section_name="$1"
+  printf '%s' "$section_name" | tr '[:lower:]' '[:upper:]' | tr -c 'A-Z0-9' '_'
+}
+
+_bk_gpu_mlp_section_var() {
+  local prefix="$1"
+  local section_name="$2"
+  local key
+
+  key=$(_bk_gpu_mlp_section_key "$section_name")
+  printf '%s_%s\n' "$prefix" "$key"
+}
+
+_bk_gpu_mlp_env_value() {
+  local var_name="$1"
+  printf '%s\n' "${!var_name:-}"
+}
+
+_bk_gpu_mlp_perftools_root() {
+  printf '%s\n' "${BK_GPU_MLP_PERFTOOLS_ROOT:-${BK_PERFTOOLS_ROOT:-}}"
+}
+
+_bk_gpu_mlp_predictor() {
+  local root="$1"
+
+  if [[ -z "$root" ]]; then
+    printf '%s\n' ""
+    return 0
+  fi
+
+  printf '%s\n' "${root}/MLP_NN/v1.5/predict_v15.py"
+}
+
+_bk_gpu_mlp_python_exists() {
+  local python_bin="$1"
+
+  if [[ "$python_bin" == */* ]]; then
+    [[ -x "$python_bin" ]]
+    return $?
+  fi
+
+  command -v "$python_bin" >/dev/null 2>&1
+}
+
+_bk_gpu_mlp_abs_existing_path() {
+  local path="$1"
+  local dir
+  local base
+
+  if [[ -z "$path" ]]; then
+    printf '%s\n' ""
+    return 0
+  fi
+
+  if [[ "$path" == /* ]]; then
+    printf '%s\n' "$path"
+    return 0
+  fi
+
+  dir=$(dirname "$path")
+  base=$(basename "$path")
+  if [[ -d "$dir" ]]; then
+    (cd "$dir" && printf '%s/%s\n' "$PWD" "$base")
+  else
+    printf '%s/%s\n' "$PWD" "$path"
+  fi
+}
+
+_bk_gpu_mlp_first_artifact_path() {
+  local item_json="$1"
+
+  echo "$item_json" | jq -r '(.artifacts // [])[0].path // empty'
+}
+
+_bk_gpu_mlp_resolve_section_input_csv() {
+  local item_json="$1"
+  local section_name="$2"
+  local scoped_var
+  local value
+  local artifact_path
+
+  scoped_var=$(_bk_gpu_mlp_section_var "BK_GPU_MLP_INPUT_CSV" "$section_name")
+  value=$(_bk_gpu_mlp_env_value "$scoped_var")
+  if [[ -n "$value" ]]; then
+    printf '%s\n' "$value"
+    return 0
+  fi
+
+  if [[ -n "${BK_GPU_MLP_INPUT_CSV:-}" ]]; then
+    printf '%s\n' "$BK_GPU_MLP_INPUT_CSV"
+    return 0
+  fi
+
+  artifact_path=$(_bk_gpu_mlp_first_artifact_path "$item_json")
+  if [[ -n "$artifact_path" && "$(_bk_gpu_mlp_artifact_mode)" == "input" ]]; then
+    printf '%s\n' "$artifact_path"
+    return 0
+  fi
+
+  printf '%s\n' ""
+}
+
+_bk_gpu_mlp_artifact_mode() {
+  case "${BK_GPU_MLP_ARTIFACT_MODE:-input}" in
+    ncu|padata|profiler|profile) printf 'ncu\n' ;;
+    prediction) printf 'prediction\n' ;;
+    *) printf 'input\n' ;;
+  esac
+}
+
+_bk_gpu_mlp_resolve_section_ncu_archive() {
+  local item_json="$1"
+  local section_name="$2"
+  local scoped_var
+  local value
+  local artifact_path
+
+  scoped_var=$(_bk_gpu_mlp_section_var "BK_GPU_MLP_NCU_ARCHIVE" "$section_name")
+  value=$(_bk_gpu_mlp_env_value "$scoped_var")
+  if [[ -n "$value" ]]; then
+    printf '%s\n' "$value"
+    return 0
+  fi
+
+  if [[ -n "${BK_GPU_MLP_NCU_ARCHIVE:-}" ]]; then
+    printf '%s\n' "$BK_GPU_MLP_NCU_ARCHIVE"
+    return 0
+  fi
+
+  artifact_path=$(_bk_gpu_mlp_first_artifact_path "$item_json")
+  if [[ -n "$artifact_path" ]]; then
+    case "$(_bk_gpu_mlp_artifact_mode):${artifact_path}" in
+      ncu:*|*:*.tgz|*:*.tar.gz)
+        printf '%s\n' "$artifact_path"
+        return 0
+        ;;
+    esac
+  fi
+
+  printf '%s\n' ""
+}
+
+_bk_gpu_mlp_resolve_section_prediction_csv() {
+  local item_json="$1"
+  local section_name="$2"
+  local scoped_var
+  local value
+  local artifact_path
+
+  scoped_var=$(_bk_gpu_mlp_section_var "BK_GPU_MLP_PREDICTION_CSV" "$section_name")
+  value=$(_bk_gpu_mlp_env_value "$scoped_var")
+  if [[ -n "$value" ]]; then
+    printf '%s\n' "$value"
+    return 0
+  fi
+
+  if [[ -n "${BK_GPU_MLP_PREDICTION_CSV:-}" ]]; then
+    printf '%s\n' "$BK_GPU_MLP_PREDICTION_CSV"
+    return 0
+  fi
+
+  artifact_path=$(_bk_gpu_mlp_first_artifact_path "$item_json")
+  if [[ -n "$artifact_path" && "${BK_GPU_MLP_ARTIFACT_MODE:-input}" == "prediction" ]]; then
+    printf '%s\n' "$artifact_path"
+    return 0
+  fi
+
+  printf '%s\n' ""
+}
+
+_bk_gpu_mlp_section_slug() {
+  local section_name="$1"
+  printf '%s_%s_%s' "${est_code:-unknown}" "$section_name" "${est_uuid:-local}" |
+    tr -c 'A-Za-z0-9._-' '_'
+}
+
+bk_section_package_check_applicability_gpu_kernel_mlp_v15() {
+  local item_json="$1"
+  local item_kind="$2"
+  local section_name
+  local prediction_csv
+  local input_csv
+  local ncu_archive
+  local root
+  local predictor
+  local python_bin="${BK_GPU_MLP_PYTHON:-python3}"
+  local missing=()
+
+  if [[ "$item_kind" != "section" ]]; then
+    cat <<'EOF'
+{"status":"not_applicable","missing_inputs":["item_kind:section_required"]}
+EOF
+    return 1
+  fi
+
+  section_name=$(echo "$item_json" | jq -r '.name // "gpu_section"')
+  prediction_csv=$(_bk_gpu_mlp_resolve_section_prediction_csv "$item_json" "$section_name")
+  input_csv=$(_bk_gpu_mlp_resolve_section_input_csv "$item_json" "$section_name")
+  ncu_archive=$(_bk_gpu_mlp_resolve_section_ncu_archive "$item_json" "$section_name")
+
+  if ! _bk_gpu_mlp_python_exists "$python_bin"; then
+    missing+=("\"python:${python_bin}\"")
+  fi
+
+  if [[ -n "$prediction_csv" ]]; then
+    if [[ ! -f "$prediction_csv" ]]; then
+      missing+=("\"prediction_csv:${prediction_csv}\"")
+    fi
+  else
+    root=$(_bk_gpu_mlp_perftools_root)
+    predictor=$(_bk_gpu_mlp_predictor "$root")
+
+    if [[ -z "$input_csv" && -z "$ncu_archive" ]]; then
+      missing+=('"gpu_mlp_input_csv"')
+    fi
+    if [[ -n "$input_csv" && ! -f "$input_csv" ]]; then
+      missing+=("\"input_csv:${input_csv}\"")
+    fi
+    if [[ -n "$ncu_archive" && ! -f "$ncu_archive" ]]; then
+      missing+=("\"ncu_archive:${ncu_archive}\"")
+    fi
+    if [[ -z "$root" || ! -d "$root" ]]; then
+      missing+=('"BK_GPU_MLP_PERFTOOLS_ROOT"')
+    fi
+    if [[ -z "$predictor" || ! -f "$predictor" ]]; then
+      missing+=('"PerfTools MLP_NN/v1.5/predict_v15.py"')
+    fi
+  fi
+
+  if (( ${#missing[@]} > 0 )); then
+    printf '{"status":"not_applicable","missing_inputs":[%s]}\n' "$(IFS=,; echo "${missing[*]}")"
+    return 1
+  fi
+
+  cat <<'EOF'
+{"status":"applicable","missing_inputs":[]}
+EOF
+}
+
+_bk_gpu_mlp_parse_prediction_csv() {
+  local prediction_csv="$1"
+  local package_name="$2"
+  local model_version="$3"
+  local python_bin="${BK_GPU_MLP_PYTHON:-python3}"
+
+  "$python_bin" - "$prediction_csv" "$package_name" "$model_version" <<'PY'
+import csv
+import json
+import math
+import sys
+
+prediction_csv, package_name, model_version = sys.argv[1:4]
+
+time_columns = [
+    "Execution Time [ns]",
+    "O-Execution Time [ns]",
+    "O-Execution Time",
+    "Predicted Execution Time [ns]",
+    "predicted_execution_time_ns",
+]
+name_columns = ["kernel_name", "Kernel Name", "kernel", "Kernel", "name", "Name"]
+metric_columns = [
+    "Memory Throughput [%]",
+    "Achieved Occupancy",
+    "brk_memory",
+    "brk_pipeline_contention",
+    "brk_sync",
+    "brk_scheduling_overhead",
+    "t_mem_ns",
+    "t_comp_ns",
+    "t_roof_ns",
+    "efficiency_eta",
+]
+
+
+def cleaned_lines(path):
+    with open(path, newline="", encoding="utf-8-sig") as handle:
+        for line in handle:
+            if not line.strip() or line.lstrip().startswith("#"):
+                continue
+            yield line
+
+
+def as_number(value):
+    if value is None or value == "":
+        return None
+    try:
+        number = float(value)
+    except ValueError:
+        return None
+    if math.isnan(number) or math.isinf(number):
+        return None
+    return number
+
+
+reader = csv.DictReader(cleaned_lines(prediction_csv))
+if not reader.fieldnames:
+    raise SystemExit(f"prediction CSV has no header: {prediction_csv}")
+
+time_column = next((col for col in time_columns if col in reader.fieldnames), None)
+if time_column is None:
+    raise SystemExit(
+        "prediction CSV does not contain a supported execution-time column: "
+        + ", ".join(time_columns)
+    )
+
+kernels = []
+source_gpus = []
+target_gpus = []
+total_seconds = 0.0
+
+for idx, row in enumerate(reader, start=1):
+    predicted_ns = as_number(row.get(time_column))
+    if predicted_ns is None:
+        raise SystemExit(f"row {idx} has no numeric predicted execution time in {time_column}")
+
+    raw_name = next((row[col].strip() for col in name_columns if (row.get(col) or "").strip()), "")
+    source_gpu = (row.get("src_gpu") or row.get("source_gpu") or "").strip()
+    target_gpu = (row.get("tgt_gpu") or row.get("target_gpu") or "").strip()
+    if source_gpu:
+        source_gpus.append(source_gpu)
+    if target_gpu:
+        target_gpus.append(target_gpu)
+
+    seconds = predicted_ns / 1e9
+    total_seconds += seconds
+
+    metrics = {
+        key: as_number(row.get(key))
+        for key in metric_columns
+        if key in row and as_number(row.get(key)) is not None
+    }
+    kernel = {
+        "name": raw_name or f"kernel_{idx}",
+        "predicted_time_ns": predicted_ns,
+        "predicted_time": seconds,
+    }
+    if source_gpu:
+        kernel["source_gpu"] = source_gpu
+    if target_gpu:
+        kernel["target_gpu"] = target_gpu
+    if metrics:
+        kernel["metrics"] = metrics
+    kernels.append(kernel)
+
+print(json.dumps({
+    "time": total_seconds,
+    "metrics": {
+        "kernel_count": len(kernels),
+        "time_column": time_column,
+        "total_predicted_time_ns": total_seconds * 1e9,
+        "source_gpus": sorted(set(source_gpus)),
+        "target_gpus": sorted(set(target_gpus)),
+        "kernels": kernels,
+    },
+    "package_applicability": {
+        "status": "applicable",
+        "missing_inputs": [],
+    },
+    "model": {
+        "type": "cross_gpu_kernel_prediction_model",
+        "name": "PerfTools MLP_NN/v1.5",
+        "version": model_version,
+        "repository": "https://github.com/masaaki-kondo/PerfTools",
+    },
+    "estimation_package": package_name,
+}))
+PY
+}
+
+_bk_gpu_mlp_prepare_input_from_ncu() {
+  local ncu_archive="$1"
+  local section_name="$2"
+  local root="$3"
+  local output_dir="$4"
+  local slug="$5"
+  local python_bin="${BK_GPU_MLP_PYTHON:-python3}"
+  local source_gpu="${BK_GPU_MLP_SOURCE_GPU:-${BK_GPU_MLP_SRC_GPU:-H100}}"
+  local kernel_count="${BK_GPU_MLP_KERNEL_COUNT:-20}"
+  local prepared_csv="${output_dir}/${slug}_input.csv"
+  local script_path="scripts/estimation/prepare_gpu_mlp_ncu_input.py"
+  local archive_abs
+  local prepared_abs
+
+  archive_abs=$(_bk_gpu_mlp_abs_existing_path "$ncu_archive")
+  prepared_abs=$(_bk_gpu_mlp_abs_existing_path "$prepared_csv")
+
+  "$python_bin" "$script_path" \
+    --padata "$archive_abs" \
+    --perftools-root "$root" \
+    --source-gpu "$source_gpu" \
+    --kernel-count "$kernel_count" \
+    --out-csv "$prepared_abs" >&2
+
+  printf '%s\n' "$prepared_csv"
+}
+
+_bk_gpu_mlp_run_predictor() {
+  local item_json="$1"
+  local section_name="$2"
+  local root
+  local input_csv
+  local ncu_archive
+  local output_dir="${BK_GPU_MLP_OUTPUT_DIR:-results/estimation_artifacts/gpu_kernel_mlp_v15}"
+  local prediction_csv
+  local prediction_log
+  local input_csv_abs
+  local prediction_csv_abs
+  local prediction_log_abs
+  local python_bin="${BK_GPU_MLP_PYTHON:-python3}"
+  local slug
+
+  root=$(_bk_gpu_mlp_perftools_root)
+  input_csv=$(_bk_gpu_mlp_resolve_section_input_csv "$item_json" "$section_name")
+  ncu_archive=$(_bk_gpu_mlp_resolve_section_ncu_archive "$item_json" "$section_name")
+  slug=$(_bk_gpu_mlp_section_slug "$section_name")
+
+  mkdir -p "$output_dir"
+  if [[ -z "$input_csv" && -n "$ncu_archive" ]]; then
+    input_csv=$(_bk_gpu_mlp_prepare_input_from_ncu "$ncu_archive" "$section_name" "$root" "$output_dir" "$slug")
+  fi
+
+  prediction_csv="${output_dir}/${slug}_pred.csv"
+  prediction_log="${output_dir}/${slug}.log"
+  input_csv_abs=$(_bk_gpu_mlp_abs_existing_path "$input_csv")
+  prediction_csv_abs=$(_bk_gpu_mlp_abs_existing_path "$prediction_csv")
+  prediction_log_abs=$(_bk_gpu_mlp_abs_existing_path "$prediction_log")
+
+  (
+    cd "$root"
+    "$python_bin" MLP_NN/v1.5/predict_v15.py \
+      --csv "$input_csv_abs" \
+      --row "${BK_GPU_MLP_ROW:-all}" \
+      --out "$prediction_csv_abs" \
+      --log "$prediction_log_abs"
+  ) >/dev/null
+
+  printf '%s\n' "$prediction_csv"
+}
+
+bk_section_package_transform_gpu_kernel_mlp_v15() {
+  local item_json="$1"
+  local _target_nodes="$2"
+  local _bench_nodes="$3"
+  local _default_factor="$4"
+  local _item_kind="$5"
+  local section_name
+  local prediction_csv
+  local parsed_json
+  local package_name="gpu_kernel_mlp_v15"
+  local model_version="${BK_GPU_MLP_MODEL_VERSION:-v1.5}"
+
+  section_name=$(echo "$item_json" | jq -r '.name // "gpu_section"')
+  prediction_csv=$(_bk_gpu_mlp_resolve_section_prediction_csv "$item_json" "$section_name")
+
+  if [[ -z "$prediction_csv" ]]; then
+    prediction_csv=$(_bk_gpu_mlp_run_predictor "$item_json" "$section_name")
+  fi
+
+  parsed_json=$(_bk_gpu_mlp_parse_prediction_csv "$prediction_csv" "$package_name" "$model_version")
+
+  echo "$item_json" | jq -c \
+    --arg prediction_csv "$prediction_csv" \
+    --argjson parsed "$parsed_json" '
+    .
+    + {
+        time: $parsed.time,
+        bench_time: (.bench_time // .time // null),
+        scaling_method: "gpu-kernel-mlp-v1.5",
+        estimation_package: $parsed.estimation_package,
+        package_applicability: $parsed.package_applicability,
+        model: $parsed.model,
+        metrics: $parsed.metrics
+      }
+    | .artifacts = ((.artifacts // []) + [{kind: "gpu_mlp_prediction_csv", path: $prediction_csv}])
+  '
+}
diff --git a/scripts/job_functions.sh b/scripts/job_functions.sh
index 51cad84..8aa686e 100644
--- a/scripts/job_functions.sh
+++ b/scripts/job_functions.sh
@@ -278,12 +278,15 @@ emit_estimate_job() {
     local run_job="$3"
     local code="$4"
     local output="$5"
+    local estimate_runner_tag="${BK_ESTIMATE_RUNNER_TAG:-fncx-estimate-python}"
+
+    estimate_runner_tag=$(printf '%s' "$estimate_runner_tag" | sed 's/"/\\"/g')
 
     echo "
 ${job_prefix}_estimate:
   stage: estimate
   needs: [\"${depends_on}\"]
-  tags: [fncx-curl-jq]
+  tags: [\"${estimate_runner_tag}\"]
   environment:
     name: \$CI_COMMIT_BRANCH
   script:
diff --git a/scripts/matrix_generate.sh b/scripts/matrix_generate.sh
index 3f922b7..257e186 100644
--- a/scripts/matrix_generate.sh
+++ b/scripts/matrix_generate.sh
@@ -12,11 +12,30 @@ SYSTEM_FILE="config/system.csv"
 QUEUE_FILE="config/queue.csv"
 SYSTEM_INFO_FILE="config/system_info.csv"
 OUTPUT_FILE=".gitlab-ci.generated.yml"
+PARENT_PIPELINE_SOURCE="${CI_PIPELINE_SOURCE:-local}"
 
 source ./scripts/job_functions.sh
 
 CODE_FILTER=""
 SYSTEM_FILTER=""
+QWS_GPU_MLP_SMOKE="${BK_QWS_GPU_MLP_SMOKE:-true}"
+QWS_GPU_MLP_SMOKE=$(printf '%s' "$QWS_GPU_MLP_SMOKE" | sed 's/"/\\"/g')
+QWS_GPU_MLP_SMOKE_MODE="${BK_QWS_GPU_MLP_SMOKE_MODE:-perftools}"
+QWS_GPU_MLP_SMOKE_MODE=$(printf '%s' "$QWS_GPU_MLP_SMOKE_MODE" | sed 's/"/\\"/g')
+ESTIMATE_RUNNER_TAG="${BK_ESTIMATE_RUNNER_TAG:-fncx-estimate-python}"
+ESTIMATE_RUNNER_TAG=$(printf '%s' "$ESTIMATE_RUNNER_TAG" | sed 's/"/\\"/g')
+GPU_MLP_PERFTOOLS_REPO="${BK_GPU_MLP_PERFTOOLS_REPO:-https://github.com/masaaki-kondo/PerfTools.git}"
+GPU_MLP_PERFTOOLS_REPO=$(printf '%s' "$GPU_MLP_PERFTOOLS_REPO" | sed 's/"/\\"/g')
+GPU_MLP_PERFTOOLS_REF="${BK_GPU_MLP_PERFTOOLS_REF:-main}"
+GPU_MLP_PERFTOOLS_REF=$(printf '%s' "$GPU_MLP_PERFTOOLS_REF" | sed 's/"/\\"/g')
+GENESIS_GPU_MLP_PROFILE="${BK_GENESIS_GPU_MLP_PROFILE:-true}"
+GENESIS_GPU_MLP_PROFILE=$(printf '%s' "$GENESIS_GPU_MLP_PROFILE" | sed 's/"/\\"/g')
+GPU_MLP_NCU_LAUNCH_COUNT="${BK_GPU_MLP_NCU_LAUNCH_COUNT:-20}"
+GPU_MLP_NCU_LAUNCH_COUNT=$(printf '%s' "$GPU_MLP_NCU_LAUNCH_COUNT" | sed 's/"/\\"/g')
+GPU_MLP_SOURCE_GPU="${BK_GPU_MLP_SOURCE_GPU:-H100}"
+GPU_MLP_SOURCE_GPU=$(printf '%s' "$GPU_MLP_SOURCE_GPU" | sed 's/"/\\"/g')
+GPU_MLP_KERNEL_COUNT="${BK_GPU_MLP_KERNEL_COUNT:-20}"
+GPU_MLP_KERNEL_COUNT=$(printf '%s' "$GPU_MLP_KERNEL_COUNT" | sed 's/"/\\"/g')
 
 while [[ $# -gt 0 ]]; do
   case $1 in
@@ -39,7 +58,16 @@ stages:
   - send_estimate
 
 variables:
-  PARENT_PIPELINE_SOURCE: \"$CI_PIPELINE_SOURCE\"
+  PARENT_PIPELINE_SOURCE: \"$PARENT_PIPELINE_SOURCE\"
+  BK_QWS_GPU_MLP_SMOKE: \"$QWS_GPU_MLP_SMOKE\"
+  BK_QWS_GPU_MLP_SMOKE_MODE: \"$QWS_GPU_MLP_SMOKE_MODE\"
+  BK_ESTIMATE_RUNNER_TAG: \"$ESTIMATE_RUNNER_TAG\"
+  BK_GPU_MLP_PERFTOOLS_REPO: \"$GPU_MLP_PERFTOOLS_REPO\"
+  BK_GPU_MLP_PERFTOOLS_REF: \"$GPU_MLP_PERFTOOLS_REF\"
+  BK_GENESIS_GPU_MLP_PROFILE: \"$GENESIS_GPU_MLP_PROFILE\"
+  BK_GPU_MLP_NCU_LAUNCH_COUNT: \"$GPU_MLP_NCU_LAUNCH_COUNT\"
+  BK_GPU_MLP_SOURCE_GPU: \"$GPU_MLP_SOURCE_GPU\"
+  BK_GPU_MLP_KERNEL_COUNT: \"$GPU_MLP_KERNEL_COUNT\"
 " >> "$OUTPUT_FILE"
 
 
diff --git a/scripts/result_server/fetch_result_by_uuid.sh b/scripts/result_server/fetch_result_by_uuid.sh
index 49b847e..c5e7303 100644
--- a/scripts/result_server/fetch_result_by_uuid.sh
+++ b/scripts/result_server/fetch_result_by_uuid.sh
@@ -119,17 +119,25 @@ echo "Wrote re-estimation context to results/reestimation_context.json"
 
 set +e
 bk_result_server_download_to_file \
-  "/api/query/estimation-inputs?uuid=${resolved_result_uuid}" \
-  "results/estimation_inputs.tgz"
+  "/api/query/estimation-artifacts?uuid=${resolved_result_uuid}" \
+  "results/estimation_artifacts.tgz"
 download_exit=$?
+if [[ $download_exit -ne 0 ]]; then
+  rm -f "results/estimation_artifacts.tgz"
+  echo "New estimation artifact query endpoint failed; trying legacy estimation-inputs endpoint."
+  bk_result_server_download_to_file \
+    "/api/query/estimation-inputs?uuid=${resolved_result_uuid}" \
+    "results/estimation_artifacts.tgz"
+  download_exit=$?
+fi
 set -e
 
-if [[ $download_exit -eq 0 && -f "results/estimation_inputs.tgz" ]]; then
-  mkdir -p "results/estimation_inputs"
-  tar -xzf "results/estimation_inputs.tgz" -C "results/estimation_inputs"
-  rm -f "results/estimation_inputs.tgz"
-  echo "Restored estimation inputs to results/estimation_inputs/"
+if [[ $download_exit -eq 0 && -f "results/estimation_artifacts.tgz" ]]; then
+  mkdir -p "results/estimation_artifacts"
+  tar -xzf "results/estimation_artifacts.tgz" -C "results/estimation_artifacts"
+  rm -f "results/estimation_artifacts.tgz"
+  echo "Restored estimation artifacts to results/estimation_artifacts/"
 else
-  rm -f "results/estimation_inputs.tgz"
-  echo "No stored estimation inputs found for UUID: $resolved_result_uuid"
+  rm -f "results/estimation_artifacts.tgz"
+  echo "No stored estimation artifacts found for UUID: $resolved_result_uuid"
 fi
diff --git a/scripts/result_server/send_estimate.sh b/scripts/result_server/send_estimate.sh
index f706cda..1f08708 100644
--- a/scripts/result_server/send_estimate.sh
+++ b/scripts/result_server/send_estimate.sh
@@ -6,10 +6,86 @@ set -euo pipefail
 
 echo "Sending estimate results to server"
 
+upload_estimation_artifacts() {
+  local json_file="$1"
+  local source_uuid="$2"
+  local archive
+  local endpoint
+  local endpoints
+  local response
+  local upload_ok=0
+
+  if [[ ! -d "results/estimation_artifacts" ]] || ! compgen -G "results/estimation_artifacts/*" > /dev/null; then
+    echo "No files found under results/estimation_artifacts for $json_file. Skipping estimation artifact upload."
+    return 0
+  fi
+
+  if [[ -z "$source_uuid" || "$source_uuid" == "null" ]]; then
+    echo "WARNING: Could not resolve source result UUID for $json_file. Skipping estimation artifact upload." >&2
+    return 0
+  fi
+
+  archive="results/estimation_artifacts_${source_uuid}.tgz"
+  tar \
+    --exclude='*_prepare' \
+    --exclude='*_prepare/*' \
+    --exclude='*.ncu-rep' \
+    --exclude='profile_raw.csv' \
+    --exclude='padata*.tgz' \
+    --exclude='*.tgz' \
+    -C "results/estimation_artifacts" \
+    -czf "$archive" .
+  echo "Uploading $archive with source result UUID $source_uuid"
+
+  endpoints=("/api/ingest/estimation-artifacts" "/api/ingest/estimation-inputs")
+  for endpoint in "${endpoints[@]}"; do
+    if response=$(curl --fail -sS -X POST "${RESULT_SERVER}${endpoint}" \
+      -H "X-API-Key: ${RESULT_SERVER_KEY}" \
+      -F "id=${source_uuid}" \
+      -F "file=@${archive}" 2>&1); then
+      upload_ok=1
+      break
+    fi
+    if [[ "$endpoint" == "/api/ingest/estimation-artifacts" ]] && printf '%s\n' "$response" | grep -q 'error: 404'; then
+      echo "WARNING: ${endpoint} was not available; retrying legacy /api/ingest/estimation-inputs endpoint." >&2
+      continue
+    fi
+    break
+  done
+
+  if [[ "$upload_ok" -eq 1 ]]; then
+    if [[ -n "$response" ]]; then
+      echo "$response"
+    fi
+    rm -f "$archive"
+    echo "Uploaded estimation artifacts for $json_file"
+    return 0
+  fi
+
+  rm -f "$archive"
+  if printf '%s\n' "$response" | grep -q 'error: 413'; then
+    echo "WARNING: Skipping estimation artifact upload because the server rejected ${archive} as too large (HTTP 413)." >&2
+    echo "WARNING: Estimate JSON was already ingested; estimation artifacts remain available as GitLab artifacts." >&2
+    return 0
+  fi
+
+  echo "ERROR: Failed to upload estimation artifacts for ${json_file}" >&2
+  echo "$response" >&2
+  return 1
+}
+
 found=0
 for json_file in results/estimate*.json; do
   [[ ! -f "$json_file" ]] && continue
   found=1
+  source_uuid=$(jq -r '
+    .estimate_metadata.source_result_uuid
+    // .estimate_metadata.source_result.uuid
+    // .current_system.benchmark.uuid
+    // .current_system.uuid
+    // ._server_uuid
+    // empty
+  ' "$json_file")
   echo "Posting $json_file to ${RESULT_SERVER}/api/ingest/estimate"
   curl --fail -sS -X POST "${RESULT_SERVER}/api/ingest/estimate" \
     -H "X-API-Key: ${RESULT_SERVER_KEY}" \
@@ -17,6 +93,7 @@ for json_file in results/estimate*.json; do
     --data-binary @"$json_file"
   echo ""
   echo "Sent: $json_file"
+  upload_estimation_artifacts "$json_file" "$source_uuid"
 done
 
 if [[ "$found" -eq 0 ]]; then
diff --git a/scripts/result_server/send_results.sh b/scripts/result_server/send_results.sh
index dc7d597..bc36c17 100644
--- a/scripts/result_server/send_results.sh
+++ b/scripts/result_server/send_results.sh
@@ -58,6 +58,36 @@ build_profile_data_summary() {
   ' 2>/dev/null || true
 }
 
+upload_padata_archive() {
+  local tgz_file="$1"
+  local uuid="$2"
+  local timestamp="$3"
+  local response
+
+  echo "Uploading $tgz_file with UUID $uuid"
+  if response=$(curl --fail -sS -X POST "${RESULT_SERVER}/api/ingest/padata" \
+    -H "X-API-Key: ${RESULT_SERVER_KEY}" \
+    -F "id=${uuid}" \
+    -F "timestamp=${timestamp}" \
+    -F "file=@${tgz_file}" 2>&1); then
+    if [[ -n "$response" ]]; then
+      echo "$response"
+    fi
+    echo "Uploaded $tgz_file"
+    return 0
+  fi
+
+  if printf '%s\n' "$response" | grep -q 'error: 413'; then
+    echo "WARNING: Skipping padata upload because the server rejected ${tgz_file} as too large (HTTP 413)." >&2
+    echo "WARNING: Result JSON was already ingested; the padata archive remains available as a GitLab artifact for downstream jobs." >&2
+    return 0
+  fi
+
+  echo "ERROR: Failed to upload ${tgz_file}" >&2
+  echo "$response" >&2
+  return 1
+}
+
 # Loop over all result*.json files
 for json_file in results/result*.json; do
   [[ ! -f "$json_file" ]] && continue
@@ -136,31 +166,11 @@ for json_file in results/result*.json; do
   
   # Upload TGZ if it exists
   if [[ -f "$tgz_file" ]]; then
-    echo "Uploading $tgz_file with UUID $uuid"
-    curl --fail -sS -X POST "${RESULT_SERVER}/api/ingest/padata" \
-      -H "X-API-Key: ${RESULT_SERVER_KEY}" \
-      -F "id=${uuid}" \
-      -F "timestamp=${timestamp}" \
-      -F "file=@${tgz_file}"
-    echo "Uploaded $tgz_file"
+    upload_padata_archive "$tgz_file" "$uuid" "$timestamp"
   else
     echo "No matching TGZ found for $json_file (expected: $tgz_file). Skipping upload."
   fi
 
-  if [[ -d "results/estimation_inputs" ]] && compgen -G "results/estimation_inputs/*" > /dev/null; then
-    estimation_inputs_archive="results/estimation_inputs_${uuid}.tgz"
-    tar -C "results/estimation_inputs" -czf "$estimation_inputs_archive" .
-    echo "Uploading $estimation_inputs_archive with UUID $uuid"
-    curl --fail -sS -X POST "${RESULT_SERVER}/api/ingest/estimation-inputs" \
-      -H "X-API-Key: ${RESULT_SERVER_KEY}" \
-      -F "id=${uuid}" \
-      -F "file=@${estimation_inputs_archive}"
-    rm -f "$estimation_inputs_archive"
-    echo "Uploaded estimation inputs for $json_file"
-  else
-    echo "No estimation_inputs directory found for $json_file. Skipping estimation input upload."
-  fi
-
 done
 
 echo "Final result metadata manifest:"
diff --git a/scripts/test_estimate_submit.sh b/scripts/test_estimate_submit.sh
new file mode 100644
index 0000000..e605ce7
--- /dev/null
+++ b/scripts/test_estimate_submit.sh
@@ -0,0 +1,127 @@
+#!/bin/bash
+set -euo pipefail
+
+usage() {
+  cat <<'EOF'
+Usage:
+  scripts/test_estimate_submit.sh <code> <line_number>
+  scripts/test_estimate_submit.sh <code> <line_number> --estimate-only
+
+The first form submits a local scheduler job that runs the benchmark with
+GPU-MLP profiler settings and creates results/result*.json.  The second form
+runs the estimation step from the existing results directory.  When SIF or
+BK_ESTIMATE_APPTAINER_IMAGE is set, --estimate-only runs inside Apptainer.
+EOF
+}
+
+if [ "$#" -lt 2 ] || [ "$#" -gt 3 ]; then
+  usage
+  exit 1
+fi
+
+code="$1"
+list_csv_line_num="$2"
+mode="${3:-submit}"
+
+if ! [[ "$list_csv_line_num" =~ ^[0-9]+$ ]] || [ "$list_csv_line_num" -le 0 ]; then
+  echo "Error: <line_number> must be a positive integer" >&2
+  exit 1
+fi
+
+source ./scripts/job_functions.sh
+
+list_file="programs/${code}/list.csv"
+if [ ! -f "$list_file" ]; then
+  echo "Error: $list_file does not exist" >&2
+  exit 1
+fi
+
+line=$(tail -n +2 "$list_file" | sed -n "${list_csv_line_num}p")
+if [ -z "$line" ]; then
+  echo "Error: line $list_csv_line_num does not exist in $list_file" >&2
+  exit 1
+fi
+
+IFS=, read -r -a cols <<< "$line"
+system="${cols[0]}"
+enable="${cols[1]}"
+nodes="${cols[2]}"
+numproc_node="${cols[3]}"
+nthreads="${cols[4]}"
+elapse="${cols[5]}"
+
+if [[ "$enable" != "yes" ]]; then
+  echo "Error: selected line is disabled: $line" >&2
+  exit 1
+fi
+
+if [[ "$mode" == "--estimate-only" ]]; then
+  export BK_GPU_MLP_PERFTOOLS_ROOT="${BK_GPU_MLP_PERFTOOLS_ROOT:-${PERFTOOLS:-}}"
+  image="${BK_ESTIMATE_APPTAINER_IMAGE:-${SIF:-}}"
+  if [[ -n "$image" ]]; then
+    binds="${PWD}:${PWD},/tmp:/tmp"
+    if [[ -n "${BK_GPU_MLP_PERFTOOLS_ROOT:-}" ]]; then
+      binds="${binds},${BK_GPU_MLP_PERFTOOLS_ROOT}:${BK_GPU_MLP_PERFTOOLS_ROOT}"
+    fi
+    apptainer exec --bind "$binds" --pwd "$PWD" "$image" \
+      bash scripts/estimation/run.sh "$code"
+  else
+    bash scripts/estimation/run.sh "$code"
+  fi
+  exit 0
+fi
+
+if [[ "$mode" != "submit" ]]; then
+  usage
+  exit 1
+fi
+
+echo "Selected estimation test configuration:"
+echo "  code=$code"
+echo "  line=$list_csv_line_num"
+echo "  system=$system nodes=$nodes numproc_node=$numproc_node nthreads=$nthreads elapse=$elapse"
+
+cat > script.estimate.sh <<EOF
+#!/bin/bash
+set -euo pipefail
+cd "$PWD"
+
+rm -rf results
+mkdir -p results
+
+export BK_GENESIS_GPU_MLP_PROFILE="\${BK_GENESIS_GPU_MLP_PROFILE:-true}"
+export BK_GPU_MLP_NCU_LAUNCH_COUNT="\${BK_GPU_MLP_NCU_LAUNCH_COUNT:-20}"
+export BK_GPU_MLP_SOURCE_GPU="\${BK_GPU_MLP_SOURCE_GPU:-H100}"
+export BK_GPU_MLP_KERNEL_COUNT="\${BK_GPU_MLP_KERNEL_COUNT:-20}"
+
+bash programs/${code}/run.sh ${system} ${nodes} ${numproc_node} ${nthreads}
+bash scripts/result.sh ${code} ${system} local-estimate "" test_estimate_submit ""
+EOF
+
+chmod +x script.estimate.sh
+
+case "$system" in
+  MiyabiG)
+    group_name=$(groups | awk '{print $2}')
+    echo qsub -q debug-g -l "select=${nodes}:mpiprocs=${numproc_node}:ompthreads=${nthreads}" -l "walltime=${elapse}" -W "group_list=${group_name}" script.estimate.sh
+    qsub -q debug-g -l "select=${nodes}:mpiprocs=${numproc_node}:ompthreads=${nthreads}" -l "walltime=${elapse}" -W "group_list=${group_name}" script.estimate.sh
+    ;;
+  RC_GH200)
+    echo sbatch -p qc-gh200 -N "${nodes}" -t "${elapse}" --ntasks-per-node="${numproc_node}" --cpus-per-task="${nthreads}" --wrap="bash script.estimate.sh"
+    sbatch -p qc-gh200 -N "${nodes}" -t "${elapse}" --ntasks-per-node="${numproc_node}" --cpus-per-task="${nthreads}" --wrap="bash script.estimate.sh"
+    ;;
+  *)
+    echo "Error: test_estimate_submit currently supports MiyabiG and RC_GH200, got ${system}" >&2
+    exit 1
+    ;;
+esac
+
+cat <<EOF
+
+After the scheduler job finishes, run:
+
+  scripts/test_estimate_submit.sh ${code} ${list_csv_line_num} --estimate-only
+
+Set SIF or BK_ESTIMATE_APPTAINER_IMAGE to run the estimate step in Apptainer.
+Set PERFTOOLS or BK_GPU_MLP_PERFTOOLS_ROOT to the PerfTools checkout.
+EOF
diff --git a/scripts/tests/test_bk_profiler.sh b/scripts/tests/test_bk_profiler.sh
index 85150b7..3fe1b1a 100644
--- a/scripts/tests/test_bk_profiler.sh
+++ b/scripts/tests/test_bk_profiler.sh
@@ -184,6 +184,17 @@ mkdir -p "$ncu_detailed_extract"
 tar -xzf "$ncu_detailed_archive" -C "$ncu_detailed_extract"
 grep -q '"ncu_options": \["--target-processes", "all", "--set", "full", "--nvtx"\]' "${ncu_detailed_extract}/bk_profiler_artifact/meta.json"
 
+ncu_raw_csv_archive="${TMP_DIR}/ncu_raw_csv.tgz"
+ncu_raw_csv_extract="${TMP_DIR}/ncu_raw_csv_extract"
+ncu_raw_csv_raw="${TMP_DIR}/ncu_raw_csv_pa"
+export BK_PROFILER_NCU_RAW_CSV=true
+bk_profiler ncu --level single --archive "$ncu_raw_csv_archive" --raw-dir "$ncu_raw_csv_raw" -- bash -c 'printf "ncu raw csv target\n"'
+unset BK_PROFILER_NCU_RAW_CSV
+mkdir -p "$ncu_raw_csv_extract"
+tar -xzf "$ncu_raw_csv_archive" -C "$ncu_raw_csv_extract"
+test -f "${ncu_raw_csv_extract}/bk_profiler_artifact/raw/rep1/profile_raw.csv"
+grep -q '"kind": "ncu_raw_csv"' "${ncu_raw_csv_extract}/bk_profiler_artifact/meta.json"
+
 fapp_fail_archive="${TMP_DIR}/fapp_fail.tgz"
 fapp_fail_extract="${TMP_DIR}/fapp_fail_extract"
 fapp_fail_raw="${TMP_DIR}/fapp_fail_pa"
diff --git a/scripts/tests/test_estimation_gpu_kernel_mlp_v15.sh b/scripts/tests/test_estimation_gpu_kernel_mlp_v15.sh
new file mode 100644
index 0000000..5b0be3d
--- /dev/null
+++ b/scripts/tests/test_estimation_gpu_kernel_mlp_v15.sh
@@ -0,0 +1,138 @@
+#!/bin/bash
+set -euo pipefail
+
+SCRIPT_DIR=$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)
+REPO_DIR=$(cd "${SCRIPT_DIR}/../.." && pwd)
+
+TMP_DIR=$(mktemp -d)
+trap 'rm -rf "${TMP_DIR}"' EXIT
+PREDICTION_FIXTURE="${REPO_DIR}/programs/qws/fixtures/gpu_kernel_mlp_v15_pred.csv"
+
+if ! command -v jq >/dev/null 2>&1; then
+  echo "jq not found; skipping gpu_kernel_mlp_v15 estimation test"
+  exit 0
+fi
+if ! command -v python3 >/dev/null 2>&1; then
+  echo "python3 not found; skipping gpu_kernel_mlp_v15 estimation test"
+  exit 0
+fi
+if [[ ! -f "$PREDICTION_FIXTURE" ]]; then
+  echo "prediction fixture not found: $PREDICTION_FIXTURE" >&2
+  exit 1
+fi
+
+cat > "${TMP_DIR}/breakdown.json" <<EOF
+{
+  "sections": [
+    {
+      "name": "gpu_kernel_region",
+      "bench_time": 0.009,
+      "estimation_package": "gpu_kernel_mlp_v15",
+      "artifacts": [
+        {"path": "${PREDICTION_FIXTURE}"}
+      ]
+    },
+    {
+      "name": "cpu_tail",
+      "bench_time": 0.001,
+      "estimation_package": "identity"
+    }
+  ],
+  "overlaps": []
+}
+EOF
+
+pushd "${REPO_DIR}" >/dev/null
+source scripts/estimation/common.sh
+source scripts/estimation/packages/instrumented_app_sections_dummy.sh
+
+export BK_GPU_MLP_ARTIFACT_MODE="prediction"
+export BK_GPU_MLP_PYTHON="python3"
+
+transformed=$(bk_top_level_transform_breakdown "$(cat "${TMP_DIR}/breakdown.json")" "1" "1" "1" "identity" "identity")
+popd >/dev/null
+
+echo "$transformed" | jq -e '
+  (.sections | length == 2) and
+  .sections[0].name == "gpu_kernel_region" and
+  .sections[0].time == 0.006 and
+  .sections[0].bench_time == 0.009 and
+  .sections[0].scaling_method == "gpu-kernel-mlp-v1.5" and
+  .sections[0].estimation_package == "gpu_kernel_mlp_v15" and
+  .sections[0].package_applicability.status == "applicable" and
+  .sections[0].metrics.kernel_count == 3 and
+  .sections[0].metrics.kernels[0].metrics."Memory Throughput [%]" == 48 and
+  .sections[1].time == 0.001
+' >/dev/null
+
+FAKE_PERFTOOLS="${TMP_DIR}/PerfTools"
+mkdir -p "${FAKE_PERFTOOLS}/MLP_NN/v1.5"
+cat > "${FAKE_PERFTOOLS}/MLP_NN/v1.5/predict_v15.py" <<'PY'
+import argparse
+
+parser = argparse.ArgumentParser()
+parser.add_argument("--csv", required=True)
+parser.add_argument("--row", required=True)
+parser.add_argument("--out", required=True)
+parser.add_argument("--log")
+args = parser.parse_args()
+
+if args.row != "all":
+    raise SystemExit(f"unexpected row selector: {args.row}")
+with open(args.csv, encoding="utf-8") as handle:
+    if "probe_kernel" not in handle.read():
+        raise SystemExit("input CSV was not passed to fake predictor")
+
+with open(args.out, "w", encoding="utf-8") as handle:
+    handle.write("kernel_name,src_gpu,tgt_gpu,Execution Time [ns],Memory Throughput [%]\n")
+    handle.write("probe_kernel,A100,H100,4000000,51\n")
+
+if args.log:
+    with open(args.log, "w", encoding="utf-8") as handle:
+        handle.write("fake predictor called\n")
+PY
+
+cat > "${TMP_DIR}/input.csv" <<'EOF'
+kernel_name,src_gpu,tgt_gpu
+probe_kernel,A100,H100
+EOF
+
+cat > "${TMP_DIR}/breakdown_input.json" <<EOF
+{
+  "sections": [
+    {
+      "name": "gpu_kernel_region",
+      "bench_time": 0.011,
+      "estimation_package": "gpu_kernel_mlp_v15",
+      "artifacts": [
+        {"path": "${TMP_DIR}/input.csv"}
+      ]
+    }
+  ],
+  "overlaps": []
+}
+EOF
+
+pushd "${REPO_DIR}" >/dev/null
+export BK_GPU_MLP_ARTIFACT_MODE="input"
+export BK_GPU_MLP_PERFTOOLS_ROOT="${FAKE_PERFTOOLS}"
+export BK_GPU_MLP_OUTPUT_DIR="${TMP_DIR}/mlp_outputs"
+
+transformed_from_input=$(bk_top_level_transform_breakdown "$(cat "${TMP_DIR}/breakdown_input.json")" "1" "1" "1" "identity" "identity")
+popd >/dev/null
+
+echo "$transformed_from_input" | jq -e '
+  (.sections | length == 1) and
+  .sections[0].name == "gpu_kernel_region" and
+  .sections[0].time == 0.004 and
+  .sections[0].bench_time == 0.011 and
+  .sections[0].scaling_method == "gpu-kernel-mlp-v1.5" and
+  .sections[0].metrics.kernel_count == 1 and
+  .sections[0].metrics.kernels[0].name == "probe_kernel" and
+  .sections[0].artifacts[-1].kind == "gpu_mlp_prediction_csv"
+' >/dev/null
+
+test -f "${TMP_DIR}/mlp_outputs/unknown_gpu_kernel_region_local_pred.csv"
+test -f "${TMP_DIR}/mlp_outputs/unknown_gpu_kernel_region_local.log"
+
+echo "gpu_kernel_mlp_v15 section estimation test passed"
diff --git a/scripts/tests/test_genesis_gpu_mlp_estimation.sh b/scripts/tests/test_genesis_gpu_mlp_estimation.sh
new file mode 100644
index 0000000..e28c40b
--- /dev/null
+++ b/scripts/tests/test_genesis_gpu_mlp_estimation.sh
@@ -0,0 +1,41 @@
+#!/bin/bash
+set -euo pipefail
+
+SCRIPT_DIR=$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)
+REPO_DIR=$(cd "${SCRIPT_DIR}/../.." && pwd)
+
+TMP_DIR=$(mktemp -d)
+trap 'rm -rf "${TMP_DIR}"' EXIT
+
+mkdir -p "${TMP_DIR}/programs" "${TMP_DIR}/scripts" "${TMP_DIR}/results"
+cp -R "${REPO_DIR}/programs/genesis" "${TMP_DIR}/programs/genesis"
+cp "${REPO_DIR}/scripts/bk_functions.sh" "${TMP_DIR}/scripts/bk_functions.sh"
+cp -R "${REPO_DIR}/scripts/estimation" "${TMP_DIR}/scripts/estimation"
+cp -R "${REPO_DIR}/scripts/result_server" "${TMP_DIR}/scripts/result_server"
+
+pushd "${TMP_DIR}" >/dev/null
+source programs/genesis/estimate.sh
+test "${BK_ESTIMATION_BASELINE_EXP}" = "p8"
+
+export BK_GENESIS_GPU_MLP_PROFILE=false
+genesis_emit_estimation_data_from_fom 10 > results/no_profile.result
+! grep -q '^SECTION:gpu_kernel_region ' results/no_profile.result || exit 1  # errexit ignores a bare "!" status
+
+export BK_GENESIS_GPU_MLP_PROFILE=true
+genesis_emit_estimation_data_from_fom 10 > results/no_archive.result 2> results/no_archive.err
+! grep -q '^SECTION:gpu_kernel_region ' results/no_archive.result || exit 1  # errexit ignores a bare "!" status
+grep -q 'profiler archive was not found' results/no_archive.err
+
+touch results/padata0.tgz
+genesis_emit_estimation_data_from_fom 10 > results/with_archive.result
+grep -q '^SECTION:gpu_kernel_region ' results/with_archive.result
+grep -q 'artifact:results/padata0.tgz' results/with_archive.result
+
+mkdir -p genesis_benchmark_input/npt/genesis2.0beta_3.5fs/apoa1
+GENESIS_BENCHKIT_ROOT="$PWD" \
+  bash -c 'source programs/genesis/estimate.sh; cd genesis_benchmark_input/npt/genesis2.0beta_3.5fs/apoa1; export BK_GENESIS_GPU_MLP_PROFILE=true; genesis_emit_estimation_data_from_fom 10' \
+  > results/from_subdir.result
+grep -q 'artifact:results/padata0.tgz' results/from_subdir.result
+popd >/dev/null
+
+echo "genesis gpu mlp estimation metadata test passed"
diff --git a/scripts/tests/test_qws_gpu_mlp_smoke_estimation.sh b/scripts/tests/test_qws_gpu_mlp_smoke_estimation.sh
new file mode 100644
index 0000000..41eb4d1
--- /dev/null
+++ b/scripts/tests/test_qws_gpu_mlp_smoke_estimation.sh
@@ -0,0 +1,42 @@
+#!/bin/bash
+set -euo pipefail
+
+SCRIPT_DIR=$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)
+REPO_DIR=$(cd "${SCRIPT_DIR}/../.." && pwd)
+
+TMP_DIR=$(mktemp -d)
+trap 'rm -rf "${TMP_DIR}"' EXIT
+
+mkdir -p "${TMP_DIR}/programs" "${TMP_DIR}/scripts" "${TMP_DIR}/results" "${TMP_DIR}/qws"
+cp -R "${REPO_DIR}/programs/qws" "${TMP_DIR}/programs/qws"
+cp "${REPO_DIR}/scripts/bk_functions.sh" "${TMP_DIR}/scripts/bk_functions.sh"
+cp -R "${REPO_DIR}/scripts/estimation" "${TMP_DIR}/scripts/estimation"
+cp -R "${REPO_DIR}/scripts/result_server" "${TMP_DIR}/scripts/result_server"
+
+pushd "${TMP_DIR}" >/dev/null
+set -- results/result0.json
+export BK_QWS_GPU_MLP_SMOKE=true
+export BK_QWS_GPU_MLP_SMOKE_MODE=prediction
+source programs/qws/estimate.sh
+
+pushd qws >/dev/null
+qws_emit_estimation_data_from_fom 10 > ../results/result
+popd >/dev/null
+
+grep -q '^SECTION:gpu_kernel_region ' results/result
+test -f results/estimation_artifacts/qws_gpu_kernel_mlp_v15_pred.csv
+grep -q 'qws_smoke_kernel_0' results/estimation_artifacts/qws_gpu_kernel_mlp_v15_pred.csv
+
+rm -rf results
+mkdir -p results qws
+export BK_QWS_GPU_MLP_SMOKE_MODE=perftools
+pushd qws >/dev/null
+qws_emit_estimation_data_from_fom 10 > ../results/result
+popd >/dev/null
+
+grep -q '^SECTION:gpu_kernel_region ' results/result
+test -f results/estimation_artifacts/qws_gpu_kernel_mlp_v15_input.csv
+grep -q 'qws_smoke_uses_perftools_example' results/estimation_artifacts/qws_gpu_kernel_mlp_v15_input.csv
+popd >/dev/null
+
+echo "qws gpu mlp smoke estimation test passed"
diff --git a/scripts/tests/test_send_estimate_artifacts.sh b/scripts/tests/test_send_estimate_artifacts.sh
new file mode 100644
index 0000000..3d81a96
--- /dev/null
+++ b/scripts/tests/test_send_estimate_artifacts.sh
@@ -0,0 +1,103 @@
+#!/bin/bash
+set -euo pipefail
+
+SCRIPT_DIR=$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)
+REPO_DIR=$(cd "${SCRIPT_DIR}/../.." && pwd)
+
+TMP_DIR=$(mktemp -d)
+trap 'rm -rf "${TMP_DIR}"' EXIT
+
+mkdir -p "${TMP_DIR}/bin" "${TMP_DIR}/results/estimation_artifacts/gpu_kernel_mlp_v15"
+mkdir -p "${TMP_DIR}/results/estimation_artifacts/gpu_kernel_mlp_v15/genesis_prepare/padata/raw/rep1"
+
+cat > "${TMP_DIR}/results/estimate0.json" <<'JSON'
+{
+  "code": "genesis",
+  "exp": "p8",
+  "current_system": {
+    "benchmark": {
+      "uuid": "11111111-2222-3333-4444-555555555555"
+    }
+  },
+  "estimate_metadata": {
+    "source_result_uuid": "11111111-2222-3333-4444-555555555555"
+  }
+}
+JSON
+
+cat > "${TMP_DIR}/results/estimation_artifacts/gpu_kernel_mlp_v15/input.csv" <<'EOF'
+kernel_name,Execution Time [ns]
+dummy,1
+EOF
+cat > "${TMP_DIR}/results/estimation_artifacts/gpu_kernel_mlp_v15/pred.csv" <<'EOF'
+kernel_name,Execution Time [ns]
+dummy,2
+EOF
+echo "predictor log" > "${TMP_DIR}/results/estimation_artifacts/gpu_kernel_mlp_v15/gpu_kernel_region.log"
+echo "raw report" > "${TMP_DIR}/results/estimation_artifacts/gpu_kernel_mlp_v15/genesis_prepare/padata/raw/rep1/profile.ncu-rep"
+echo "raw csv" > "${TMP_DIR}/results/estimation_artifacts/gpu_kernel_mlp_v15/genesis_prepare/padata/raw/rep1/profile_raw.csv"
+echo "padata duplicate" > "${TMP_DIR}/results/estimation_artifacts/gpu_kernel_mlp_v15/padata0.tgz"
+
+cat > "${TMP_DIR}/bin/curl" <<'EOF'
+#!/bin/bash
+set -euo pipefail
+
+printf '%s\n' "$*" >> "${CURL_LOG:?CURL_LOG is required}"
+
+archive=""
+for arg in "$@"; do
+  case "$arg" in
+    file=@*) archive="${arg#file=@}" ;;
+  esac
+done
+
+if printf '%s\n' "$*" | grep -q '/api/ingest/estimation-artifacts'; then
+  if [ "${FAKE_ESTIMATION_ARTIFACTS_NEW_STATUS:-200}" = "404" ]; then
+    echo "curl: (22) The requested URL returned error: 404" >&2
+    exit 22
+  fi
+fi
+
+if printf '%s\n' "$*" | grep -Eq '/api/ingest/estimation-(artifacts|inputs)'; then
+  if [ "${FAKE_ESTIMATION_ARTIFACTS_STATUS:-200}" = "413" ]; then
+    echo "curl: (22) The requested URL returned error: 413" >&2
+    exit 22
+  fi
+  test -n "$archive"
+  tar -tzf "$archive" > "${ESTIMATION_ARTIFACTS_TAR_LIST:?ESTIMATION_ARTIFACTS_TAR_LIST is required}"
+fi
+
+printf '%s\n' '{"status":"ok"}'
+EOF
+chmod +x "${TMP_DIR}/bin/curl"
+
+export PATH="${TMP_DIR}/bin:${PATH}"
+export CURL_LOG="${TMP_DIR}/curl.log"
+export ESTIMATION_ARTIFACTS_TAR_LIST="${TMP_DIR}/estimation_artifacts_tar_list.txt"
+export RESULT_SERVER="https://result.example.test"
+export RESULT_SERVER_KEY="dummy-key"
+
+cd "$TMP_DIR"
+bash "${REPO_DIR}/scripts/result_server/send_estimate.sh"
+
+grep -q '/api/ingest/estimate' "$CURL_LOG"
+grep -q '/api/ingest/estimation-artifacts' "$CURL_LOG"
+grep -q 'id=11111111-2222-3333-4444-555555555555' "$CURL_LOG"
+grep -q './gpu_kernel_mlp_v15/input.csv' "$ESTIMATION_ARTIFACTS_TAR_LIST"
+grep -q './gpu_kernel_mlp_v15/pred.csv' "$ESTIMATION_ARTIFACTS_TAR_LIST"
+grep -q './gpu_kernel_mlp_v15/gpu_kernel_region.log' "$ESTIMATION_ARTIFACTS_TAR_LIST"
+! grep -q 'profile.ncu-rep' "$ESTIMATION_ARTIFACTS_TAR_LIST" || exit 1  # errexit ignores a bare "!" status
+! grep -q 'profile_raw.csv' "$ESTIMATION_ARTIFACTS_TAR_LIST" || exit 1
+! grep -q 'padata0.tgz' "$ESTIMATION_ARTIFACTS_TAR_LIST" || exit 1
+
+rm -f "$CURL_LOG" "$ESTIMATION_ARTIFACTS_TAR_LIST"
+FAKE_ESTIMATION_ARTIFACTS_STATUS=413 bash "${REPO_DIR}/scripts/result_server/send_estimate.sh"
+grep -q '/api/ingest/estimate' "$CURL_LOG"
+grep -q '/api/ingest/estimation-artifacts' "$CURL_LOG"
+test ! -e results/estimation_artifacts_11111111-2222-3333-4444-555555555555.tgz
+
+rm -f "$CURL_LOG" "$ESTIMATION_ARTIFACTS_TAR_LIST"
+FAKE_ESTIMATION_ARTIFACTS_NEW_STATUS=404 bash "${REPO_DIR}/scripts/result_server/send_estimate.sh"
+grep -q '/api/ingest/estimation-artifacts' "$CURL_LOG"
+grep -q '/api/ingest/estimation-inputs' "$CURL_LOG"
+grep -q './gpu_kernel_mlp_v15/input.csv' "$ESTIMATION_ARTIFACTS_TAR_LIST"
diff --git a/scripts/tests/test_send_results_profile_data.sh b/scripts/tests/test_send_results_profile_data.sh
index 58c157d..ba0d8be 100644
--- a/scripts/tests/test_send_results_profile_data.sh
+++ b/scripts/tests/test_send_results_profile_data.sh
@@ -59,6 +59,10 @@ if printf '%s\n' "$*" | grep -q '/api/ingest/result'; then
   exit 0
 fi
 if printf '%s\n' "$*" | grep -q '/api/ingest/padata'; then
+  if [ "${FAKE_PADATA_STATUS:-200}" = "413" ]; then
+    echo "curl: (22) The requested URL returned error: 413" >&2
+    exit 22
+  fi
   printf '%s\n' '{"status":"uploaded"}'
   exit 0
 fi
@@ -213,4 +217,18 @@ grep -Eq '"ncu_report"' "${TMP_DIR}/results/result0.json"
 grep -q '"_server_uuid": "11111111-2222-3333-4444-555555555555"' "${TMP_DIR}/results/result0.json"
 grep -q '"result0.json"' "${TMP_DIR}/results/server_result_meta.json"
 
+mkdir -p "${TMP_DIR}/case413/results"
+cp "${TMP_DIR}/results/result0.json" "${TMP_DIR}/case413/results/result0.json"
+cp "${TMP_DIR}/results/padata0.tgz" "${TMP_DIR}/case413/results/padata0.tgz"
+
+export FAKE_PADATA_STATUS=413
+pushd "${TMP_DIR}/case413" >/dev/null
+bash "${REPO_DIR}/scripts/result_server/send_results.sh" > send_results_413.log 2>&1
+popd >/dev/null
+unset FAKE_PADATA_STATUS
+
+grep -q 'HTTP 413' "${TMP_DIR}/case413/send_results_413.log"
+grep -q 'All done.' "${TMP_DIR}/case413/send_results_413.log"
+grep -q '"_server_uuid": "11111111-2222-3333-4444-555555555555"' "${TMP_DIR}/case413/results/result0.json"
+
 echo "send_results profile_data test passed"