Telemetry enhancement #1677

Merged
minjieqiu merged 26 commits into develop from feature/telemetry1 on Feb 18, 2026

Conversation

@minjieqiu

@minjieqiu minjieqiu commented Jan 29, 2026

Description

This PR implements the SOK Telemetry enhancement. ERD:
https://cisco-my.sharepoint.com/:w:/p/mqiu/IQBoVUuEEY1SR4rDjbja0iPuAeN5dxFG-K-ZPpvO6RoWJp0?e=n5R1Ow

What does this PR have in it?

Periodically (once per day) collect and send SOK telemetry, which includes:

  1. SOK telemetry.
    a. SOK version.
    b. CPU/Memory settings (limit and request) of containers including standalone, searchheadcluster, indexercluster,
    clustermaster, clustermanager, licensemaster and licensemanager.
    c. LicenseInfo (Splunk license ID and license type).
  2. Other components' telemetry, which is submitted to SOK by adding key/value pairs to the new telemetry configmap splunk-operator-manager-telemetry (see the sketch below).
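
For illustration only (not from this PR): a component could submit such a key/value pair by updating that configmap through a controller-runtime client. The namespace, the key name, and the k8sClient variable below are assumptions.

// Sketch: submit one telemetry key/value by updating the shared configmap.
// The "splunk-operator" namespace and the key name are illustrative only.
cm := &corev1.ConfigMap{}
key := types.NamespacedName{Namespace: "splunk-operator", Name: "splunk-operator-manager-telemetry"}
if err := k8sClient.Get(ctx, key, cm); err != nil {
	return err
}
if cm.Data == nil {
	cm.Data = map[string]string{}
}
cm.Data["exampleComponent.featureEnabled"] = "true" // illustrative key/value
if err := k8sClient.Update(ctx, cm); err != nil {
	return err
}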

Key Changes

  • Created a new configmap splunk-operator-manager-telemetry
  • Created a new controller that reconciles on the telemetry configmap
  • Renamed the telemetry app to app_tel_for_sok


Testing and Verification

Tested on s1, c3 and m4.

How did you test these changes? What automated tests were added?
Added telemetry verification to existing s1, c3 and m4 tests.

Related Issues

https://splunk.atlassian.net/browse/CSPL-4371.

PR Checklist

  • [✅] Code changes adhere to the project's coding standards.
  • [✅] Relevant unit and integration tests are included.
  • [✅] Documentation has been updated accordingly.
  • [✅] All tests pass locally.
  • [✅] The PR description follows the project's guidelines.

@github-actions
Contributor

github-actions bot commented Jan 29, 2026

CLA Assistant Lite bot: All contributors have signed the COC ✍️ ✅

@minjieqiu
Author

I have read the CLA Document and I hereby sign the CLA

@minjieqiu
Author

I have read the Code of Conduct and I hereby accept the Terms

@coveralls
Collaborator

coveralls commented Jan 29, 2026

Pull Request Test Coverage Report for Build 21975003576

Details

  • 372 of 465 (80.0%) changed or added relevant lines in 6 files are covered.
  • 3 unchanged lines in 1 file lost coverage.
  • Overall coverage decreased (-0.3%) to 86.022%

Changes Missing Coverage                      Covered Lines   Changed/Added Lines   %
pkg/splunk/client/enterprise.go               13              15                    86.67%
pkg/splunk/enterprise/names.go                0               6                     0.0%
internal/controller/telemetry_controller.go   37              46                    80.43%
pkg/splunk/enterprise/telemetry.go            316             392                   80.61%

Files with Coverage Reduction                 New Missed Lines   %
pkg/splunk/enterprise/afwscheduler.go         3                  92.51%

Totals Coverage Status
Change from base Build 21948077040: -0.3%
Covered Lines: 11293
Relevant Lines: 13128

💛 - Coveralls

@minjieqiu minjieqiu marked this pull request as ready for review February 2, 2026 17:11
@minjieqiu minjieqiu changed the title [Draft]: Telemetry enhancement Telemetry enhancement Feb 2, 2026
@kasiakoziol
Collaborator

I think it might be worth adding/updating the docs.

"sigs.k8s.io/controller-runtime/pkg/reconcile"
)

var _ = Describe("Telemetry Controller", func() {
Collaborator

We should have some controller test cases.
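
For example, a minimal case might look roughly like the sketch below. It assumes the suite's standard Ginkgo/Gomega dot-imports, a controller-runtime fake client, and that the reconciler exposes Client and Scheme fields; none of this is taken from the PR itself.

// Hedged sketch of one It(...) case that could live inside the existing
// Describe("Telemetry Controller", ...) block; the fake client setup and the
// reconciler's Client/Scheme field names are assumptions.
It("reconciles the telemetry configmap without error", func() {
	ctx := context.TODO()
	cm := &corev1.ConfigMap{
		ObjectMeta: metav1.ObjectMeta{
			Name:      "splunk-operator-manager-telemetry", // name from the PR description
			Namespace: "default",
		},
	}
	c := fake.NewClientBuilder().WithObjects(cm).Build()
	r := &TelemetryReconciler{Client: c, Scheme: c.Scheme()}
	_, err := r.Reconcile(ctx, reconcile.Request{
		NamespacedName: types.NamespacedName{Name: cm.Name, Namespace: cm.Namespace},
	})
	Expect(err).ToNot(HaveOccurred())
})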

scopedLog.Info("Updated last transmission time in configmap", "newStatus", cm.Data[telStatusKey])
}

func collectResourceTelData(resources corev1.ResourceRequirements, data map[string]string) {
Collaborator

Should we refactor this code to make it much easier to read, or use generics? An example:

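The example references two helper types, crItem and crListHandler, that it does not define; inferred from how they are used below, they might look roughly like this (a sketch, not part of the suggested change):

// Sketch of the helper types assumed by the example; field names are inferred
// from usage below and may not match the actual change.
type crItem struct {
	name      string
	namespace string
	kind      string
	resources interface{} // corev1.ResourceRequirements in practice
	hasTelApp bool
	cr        splcommon.MetaObject
}

type crListHandler struct {
	kind        string
	listFunc    func(ctx context.Context, client splcommon.ControllerClient) ([]crItem, error)
	checkTelApp bool
}
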
func collectDeploymentTelDataRefactored(ctx context.Context, client splcommon.ControllerClient, deploymentData map[string]interface{}) map[string][]splcommon.MetaObject {
	reqLogger := log.FromContext(ctx)
	scopedLog := reqLogger.WithName("collectDeploymentTelData")

	crWithTelAppList := make(map[string][]splcommon.MetaObject)
	scopedLog.Info("Start collecting deployment telemetry data")

	// Define all CR handlers in a slice
	handlers := []crListHandler{
		{kind: "Standalone", listFunc: listStandalones, checkTelApp: true},
		{kind: "LicenseManager", listFunc: listLicenseManagers, checkTelApp: true},
		{kind: "LicenseMaster", listFunc: listLicenseMasters, checkTelApp: true},
		{kind: "SearchHeadCluster", listFunc: listSearchHeadClusters, checkTelApp: true},
		{kind: "IndexerCluster", listFunc: listIndexerClusters, checkTelApp: false},
		{kind: "ClusterManager", listFunc: listClusterManagers, checkTelApp: true},
		{kind: "ClusterMaster", listFunc: listClusterMasters, checkTelApp: true},
		{kind: "MonitoringConsole", listFunc: listMonitoringConsoles, checkTelApp: false},
	}

	// Process each CR type using the same logic
	for _, handler := range handlers {
		processCRType(ctx, client, handler, deploymentData, crWithTelAppList, scopedLog)
	}

	return crWithTelAppList
}

// processCRType is the common processing logic for all CR types
func processCRType(
	ctx context.Context,
	client splcommon.ControllerClient,
	handler crListHandler,
	deploymentData map[string]interface{},
	crWithTelAppList map[string][]splcommon.MetaObject,
	scopedLog logr.Logger,
) {
	items, err := handler.listFunc(ctx, client)
	if err != nil {
		scopedLog.Error(err, "Failed to list objects", "kind", handler.kind)
		return
	}

	if len(items) == 0 {
		return
	}

	// Create per-kind data map
	perKindData := make(map[string]interface{})
	deploymentData[handler.kind] = perKindData

	// Process each item
	for _, item := range items {
		// scopedLog.Info("Collecting data", "kind", item.kind, "name", item.name, "namespace", item.namespace)

		crResourceData := make(map[string]string)
		perKindData[item.name] = crResourceData

		// Collect resource telemetry data
		if resources, ok := item.resources.(corev1.ResourceRequirements); ok {
			collectResourceTelData(resources, crResourceData)
		}

		// Add to telemetry app list if applicable
		if handler.checkTelApp && item.hasTelApp {
			crWithTelAppList[handler.kind] = append(crWithTelAppList[handler.kind], item.cr)
		} else if handler.checkTelApp && !item.hasTelApp {
			// scopedLog.Info("Telemetry app is not installed for this CR", "kind", item.kind, "name", item.name)
		}
	}
}

// List functions for each CR type - these extract the common pattern

func listStandalones(ctx context.Context, client splcommon.ControllerClient) ([]crItem, error) {
	var list enterpriseApi.StandaloneList
	err := client.List(ctx, &list)
	if err != nil {
		return nil, err
	}

	items := make([]crItem, 0, len(list.Items))
	for i := range list.Items {
		cr := &list.Items[i]
		items = append(items, crItem{
			name:      cr.GetName(),
			namespace: cr.GetNamespace(),
			kind:      cr.Kind,
			resources: cr.Spec.CommonSplunkSpec.Resources,
			hasTelApp: cr.Status.TelAppInstalled,
			cr:        cr,
		})
	}
	return items, nil
}

func listLicenseManagers(ctx context.Context, client splcommon.ControllerClient) ([]crItem, error) {
	var list enterpriseApi.LicenseManagerList
	err := client.List(ctx, &list)
	if err != nil {
		return nil, err
	}

	items := make([]crItem, 0, len(list.Items))
	for i := range list.Items {
		cr := &list.Items[i]
		items = append(items, crItem{
			name:      cr.GetName(),
			namespace: cr.GetNamespace(),
			kind:      cr.Kind,
			resources: cr.Spec.CommonSplunkSpec.Resources,
			hasTelApp: cr.Status.TelAppInstalled,
			cr:        cr,
		})
	}
	return items, nil
}

func listLicenseMasters(ctx context.Context, client splcommon.ControllerClient) ([]crItem, error) {
	var list enterpriseApiV3.LicenseMasterList
	err := client.List(ctx, &list)
	if err != nil {
		return nil, err
	}

	items := make([]crItem, 0, len(list.Items))
	for i := range list.Items {
		cr := &list.Items[i]
		items = append(items, crItem{
			name:      cr.GetName(),
			namespace: cr.GetNamespace(),
			kind:      cr.Kind,
			resources: cr.Spec.CommonSplunkSpec.Resources,
			hasTelApp: cr.Status.TelAppInstalled,
			cr:        cr,
		})
	}
	return items, nil
}

func listSearchHeadClusters(ctx context.Context, client splcommon.ControllerClient) ([]crItem, error) {
	var list enterpriseApi.SearchHeadClusterList
	err := client.List(ctx, &list)
	if err != nil {
		return nil, err
	}

	items := make([]crItem, 0, len(list.Items))
	for i := range list.Items {
		cr := &list.Items[i]
		items = append(items, crItem{
			name:      cr.GetName(),
			namespace: cr.GetNamespace(),
			kind:      cr.Kind,
			resources: cr.Spec.CommonSplunkSpec.Resources,
			hasTelApp: cr.Status.TelAppInstalled,
			cr:        cr,
		})
	}
	return items, nil
}

func listIndexerClusters(ctx context.Context, client splcommon.ControllerClient) ([]crItem, error) {
	var list enterpriseApi.IndexerClusterList
	err := client.List(ctx, &list)
	if err != nil {
		return nil, err
	}

	items := make([]crItem, 0, len(list.Items))
	for i := range list.Items {
		cr := &list.Items[i]
		items = append(items, crItem{
			name:      cr.GetName(),
			namespace: cr.GetNamespace(),
			kind:      cr.Kind,
			resources: cr.Spec.CommonSplunkSpec.Resources,
			hasTelApp: false, // IndexerClusters don't track TelAppInstalled
			cr:        cr,
		})
	}
	return items, nil
}

func listClusterManagers(ctx context.Context, client splcommon.ControllerClient) ([]crItem, error) {
	var list enterpriseApi.ClusterManagerList
	err := client.List(ctx, &list)
	if err != nil {
		return nil, err
	}

	items := make([]crItem, 0, len(list.Items))
	for i := range list.Items {
		cr := &list.Items[i]
		items = append(items, crItem{
			name:      cr.GetName(),
			namespace: cr.GetNamespace(),
			kind:      cr.Kind,
			resources: cr.Spec.CommonSplunkSpec.Resources,
			hasTelApp: cr.Status.TelAppInstalled,
			cr:        cr,
		})
	}
	return items, nil
}

func listClusterMasters(ctx context.Context, client splcommon.ControllerClient) ([]crItem, error) {
	var list enterpriseApiV3.ClusterMasterList
	err := client.List(ctx, &list)
	if err != nil {
		return nil, err
	}

	items := make([]crItem, 0, len(list.Items))
	for i := range list.Items {
		cr := &list.Items[i]
		items = append(items, crItem{
			name:      cr.GetName(),
			namespace: cr.GetNamespace(),
			kind:      cr.Kind,
			resources: cr.Spec.CommonSplunkSpec.Resources,
			hasTelApp: cr.Status.TelAppInstalled,
			cr:        cr,
		})
	}
	return items, nil
}

func listMonitoringConsoles(ctx context.Context, client splcommon.ControllerClient) ([]crItem, error) {
	var list enterpriseApi.MonitoringConsoleList
	err := client.List(ctx, &list)
	if err != nil {
		return nil, err
	}

	items := make([]crItem, 0, len(list.Items))
	for i := range list.Items {
		cr := &list.Items[i]
		items = append(items, crItem{
			name:      cr.GetName(),
			namespace: cr.GetNamespace(),
			kind:      cr.Kind,
			resources: cr.Spec.CommonSplunkSpec.Resources,
			hasTelApp: false, // MonitoringConsoles don't track TelAppInstalled
			cr:        cr,
		})
	}
	return items, nil
}
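
On the "or use generics" alternative: a single generic helper plus a thin per-kind extractor could collapse the eight list functions above. This is a sketch only; it assumes splcommon.ControllerClient exposes the usual controller-runtime List(ctx, list, ...) method and that client refers to sigs.k8s.io/controller-runtime/pkg/client.

// Generic list helper (sketch): L is the concrete *XxxList type, T the element type.
func listCRItems[L client.ObjectList, T any](
	ctx context.Context,
	c splcommon.ControllerClient,
	list L,
	items func(L) []T,
	toItem func(*T) crItem,
) ([]crItem, error) {
	if err := c.List(ctx, list); err != nil {
		return nil, err
	}
	elems := items(list)
	out := make([]crItem, 0, len(elems))
	for i := range elems {
		out = append(out, toItem(&elems[i]))
	}
	return out, nil
}

// Example: listStandalones then shrinks to a thin wrapper around the helper.
func listStandalonesGeneric(ctx context.Context, c splcommon.ControllerClient) ([]crItem, error) {
	return listCRItems(ctx, c, &enterpriseApi.StandaloneList{},
		func(l *enterpriseApi.StandaloneList) []enterpriseApi.Standalone { return l.Items },
		func(cr *enterpriseApi.Standalone) crItem {
			return crItem{
				name:      cr.GetName(),
				namespace: cr.GetNamespace(),
				kind:      cr.Kind,
				resources: cr.Spec.CommonSplunkSpec.Resources,
				hasTelApp: cr.Status.TelAppInstalled,
				cr:        cr,
			}
		})
}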

Author

Thanks for the code suggestion. I have made the change.

Collaborator

Code has 47% test coverage; let's try to move to 90%.

Author

I have added more tests.

telAppReloadString = "curl -k -u admin:`cat /mnt/splunk-secrets/password` https://localhost:8089/services/apps/local/_reload"

// Name of the telemetry configmap: <namePrefix>-manager-telemetry
telConfigMapTemplateStr = "%smanager-telemetry"
Collaborator

Is this hardcoded?

Author

Yes. This config map is not accessed by multiple CRs.
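
For reference, expanding the template with the operator's name prefix would produce the configmap name used in the PR description; the "splunk-operator-" prefix below is an assumption, as the real prefix comes from the operator's naming helpers.

// Illustrative only: prefix value assumed, not taken from the change.
telConfigMapName := fmt.Sprintf(telConfigMapTemplateStr, "splunk-operator-")
// telConfigMapName == "splunk-operator-manager-telemetry"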


// SetupWithManager sets up the controller with the Manager.
func (r *TelemetryReconciler) SetupWithManager(mgr ctrl.Manager) error {
return ctrl.NewControllerManagedBy(mgr).
Collaborator

Should you be watching for CR resource creation and processing CRs only when a new CR is created?

Collaborator

Can you implement an event-driven approach where the telemetry controller watches the actual Splunk custom resources and only triggers reconciliation when:

  1. A new CR is created (Standalone, ClusterMaster, IndexerCluster, SearchHeadCluster, etc.)
  2. An existing CR is modified (configuration changes, scaling events)
  3. A CR is deleted (to track removal events)

Benefits of This Approach

1. Reduced Resource Consumption

  • No periodic reconciliation when nothing has changed
  • CPU and memory usage only when actual events occur
  • More efficient for clusters with stable configurations

2. Immediate Response

  • Telemetry collected immediately when CRs are created/modified
  • No waiting for the next 10-minute requeue cycle
  • More accurate timestamps for resource creation events

3. Better Alignment with Kubernetes Best Practices

  • Controllers should react to resource changes, not poll
  • Leverages Kubernetes watch mechanism efficiently
  • Reduces unnecessary API server load

4. Clearer Intent

  • The controller's purpose becomes explicit: "Send telemetry when Splunk resources change"
  • Easier to understand and maintain
  • Better for debugging (logs show which CR triggered telemetry)

Proposed Implementation Changes

Current Setup (from SetupWithManager):

func (r *TelemetryReconciler) SetupWithManager(mgr ctrl.Manager) error {
    return ctrl.NewControllerManagedBy(mgr).
        For(&corev1.ConfigMap{}).  // Watching ConfigMaps
        WithEventFilter(predicate.Funcs{
            CreateFunc: func(e event.CreateEvent) bool {
                return r.isTelemetryConfigMap(e.Object)
            },
            // ... more predicates
        }).
        WithOptions(controller.Options{
            MaxConcurrentReconciles: 1,
        }).
        Complete(r)
}

Suggested Alternative:

func (r *TelemetryReconciler) SetupWithManager(mgr ctrl.Manager) error {
    return ctrl.NewControllerManagedBy(mgr).
        // Watch Splunk CRs directly
        For(&enterprisev4.Standalone{}).
        Owns(&enterprisev4.ClusterMaster{}).
        Owns(&enterprisev4.IndexerCluster{}).
        Owns(&enterprisev4.SearchHeadCluster{}).
        // ... other Splunk CRs
        WithEventFilter(predicate.Funcs{
            CreateFunc: func(e event.CreateEvent) bool {
                // Trigger on CR creation
                return true
            },
            UpdateFunc: func(e event.UpdateEvent) bool {
                // Optionally trigger on significant updates
                return shouldCollectTelemetry(e.ObjectOld, e.ObjectNew)
            },
            DeleteFunc: func(e event.DeleteEvent) bool {
                // Optionally track deletions
                return false
            },
        }).
        WithOptions(controller.Options{
            MaxConcurrentReconciles: 1,
        }).
        Complete(r)
}

Modified Reconcile Method:

func (r *TelemetryReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
    log := r.Log.WithValues("telemetry", req.NamespacedName)

    // Fetch the actual Splunk CR that triggered this reconciliation
    // Determine CR type and get relevant telemetry data

    // Collect telemetry for THIS specific resource
    telemetryData := r.collectResourceTelemetry(ctx, req)

    // Send telemetry immediately (no requeue needed!)
    if err := r.applyTelemetryFn(ctx, telemetryData); err != nil {
        log.Error(err, "Failed to send telemetry")
        // Only requeue on actual errors, not as a periodic timer
        return ctrl.Result{Requeue: true}, err
    }

    // Done! No automatic requeue
    return ctrl.Result{}, nil
}

Additional Considerations

1. Rate Limiting

If watching CRs directly, consider:

  • Implementing rate limiting to avoid telemetry spam
  • Batching multiple CR events within a time window
  • Using a "debounce" mechanism for rapid successive changes

2. Daily Telemetry Requirement

The PR mentions "collecting and sending telemetry data once per day". If this is the actual requirement:

Option A: Use a CronJob instead of a controller

apiVersion: batch/v1
kind: CronJob
metadata:
  name: splunk-operator-telemetry
spec:
  schedule: "0 2 * * *"  # Daily at 2 AM
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: telemetry-collector
            # Collect and send telemetry

Option B: If controller is needed, add timestamp-based logic:

// Check last telemetry send time
lastSent := getLastTelemetrySendTime()
if time.Since(lastSent) < 24*time.Hour {
    // Skip telemetry, already sent today
    return ctrl.Result{}, nil
}
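
If the controller is also responsible for keeping the daily cadence, the same check can requeue for the remainder of the window instead of returning without a requeue (a sketch continuing the snippet above):

// Already sent within the last 24h: come back when the window elapses.
lastSent := getLastTelemetrySendTime()
if remaining := 24*time.Hour - time.Since(lastSent); remaining > 0 {
    return ctrl.Result{RequeueAfter: remaining}, nil
}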

@minjieqiu minjieqiu merged commit d413ee8 into develop Feb 18, 2026
52 of 53 checks passed
@minjieqiu minjieqiu deleted the feature/telemetry1 branch February 18, 2026 15:13
@github-actions github-actions bot locked and limited conversation to collaborators Feb 18, 2026
