
Udf dev 1.3 #17435

Open
suyx1999 wants to merge 4 commits into apache:dev/1.3 from suyx1999:udf-dev-1.3

@suyx1999 commented Apr 7, 2026

Update 3 UDFs: Percentile, Quantile and Cluster

Fix issues in Percentile and Quantile UDFs

Percentile

  • Fix out-of-bounds index issue
  • Add corresponding unit tests

Quantile

  • Fix incorrect type conversion
  • Add corresponding unit tests

Update Cluster UDF

  • This function takes a single input time series, splits it into non-overlapping contiguous subsequences (windows) of fixed length l, and clusters those subsequences to discover local patterns or segment structure.

Input series

  • One input series only.
  • Types: INT32 / INT64 / FLOAT / DOUBLE.
  • Points are read in time order; trailing samples that do not fill a full window are dropped (only ⌊n/l⌋ windows are used, where n is the number of valid points).
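The windowing rule above can be sketched in a few lines. This is only an illustration, not the UDF's actual code: the helper name `split_windows` and the `(timestamp, value)` tuple layout are assumptions, and the input is assumed to already contain only valid points in time order.

```python
def split_windows(points, l):
    """Split time-ordered (timestamp, value) pairs into non-overlapping
    windows of exactly l points. Hypothetical helper, not the UDF's code."""
    if l <= 0:
        raise ValueError("l must be a positive integer")
    n_windows = len(points) // l  # floor(n / l) full windows
    # Points after the last full window are silently dropped.
    return [points[i * l:(i + 1) * l] for i in range(n_windows)]


# 10 points with l = 4 give 2 full windows; the last 2 points are dropped.
points = [(t, float(t % 3)) for t in range(10)]
windows = split_windows(points, 4)
```

With n = 10 and l = 4, only ⌊10/4⌋ = 2 windows are produced, starting at timestamps 0 and 4.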

Parameters

| Name | Meaning | Default | Notes |
| --- | --- | --- | --- |
| l | Subsequence (window) length | (required) | Positive integer; each window has l consecutive samples. |
| k | Number of clusters | (required) | Integer ≥ 2. |
| method | Clustering algorithm | kmeans | Optional: kmeans, kshape, medoidshape (case-insensitive). Defaults to k-means if omitted. |
| norm | Z-score normalize each subsequence | true | Boolean; if true, each subsequence is standardized before clustering. |
| maxiter | Maximum iterations | 200 | Positive integer. |
| output | Output mode | label | label: one cluster id per window; centroid: concatenate the k centroid vectors in cluster order. |
| sample_rate | Greedy sampling rate | 0.3 | Used only when method = medoidshape; must be in (0, 1]. |
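To make the defaults and constraints in the table concrete, here is a minimal validation sketch. The function `resolve_params` and the dict-based attribute passing are hypothetical and do not reflect how the UDF reads its attributes. Checking `sample_rate` only when `method` is `medoidshape` is also an assumption drawn from the table's note.

```python
DEFAULTS = {"method": "kmeans", "norm": True, "maxiter": 200,
            "output": "label", "sample_rate": 0.3}
METHODS = {"kmeans", "kshape", "medoidshape"}
OUTPUTS = {"label", "centroid"}


def resolve_params(attrs):
    """Apply the documented defaults and check the documented ranges.
    Hypothetical helper; names and error messages are illustrative only."""
    if "l" not in attrs or "k" not in attrs:
        raise ValueError("l and k are required")
    p = dict(DEFAULTS)
    p.update(attrs)
    p["l"], p["k"], p["maxiter"] = int(p["l"]), int(p["k"]), int(p["maxiter"])
    p["method"] = str(p["method"]).lower()  # method is case-insensitive
    if isinstance(p["norm"], str):
        p["norm"] = p["norm"].lower() == "true"
    p["sample_rate"] = float(p["sample_rate"])
    if p["l"] <= 0:
        raise ValueError("l must be a positive integer")
    if p["k"] < 2:
        raise ValueError("k must be an integer >= 2")
    if p["method"] not in METHODS:
        raise ValueError("unknown method: " + p["method"])
    if p["maxiter"] <= 0:
        raise ValueError("maxiter must be a positive integer")
    if p["output"] not in OUTPUTS:
        raise ValueError("output must be label or centroid")
    # Assumption: sample_rate is validated only when it is actually used.
    if p["method"] == "medoidshape" and not (0.0 < p["sample_rate"] <= 1.0):
        raise ValueError("sample_rate must be in (0, 1]")
    return p


params = resolve_params({"l": "8", "k": "3", "method": "KShape"})
```

Here `params` carries `method` as `kshape` plus the table's defaults for everything that was not supplied.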

method details

  • kmeans: k-means in Euclidean space (optionally after per-window normalization).
  • kshape: Assign by shape-based distance (SBD from normalized cross-correlation, NCC); centroids updated via SVD on the cluster matrix.
  • medoidshape: Coarse k-means with min(2k, number of windows) clusters, then greedy selection of k representative subsequences; sample_rate controls how many candidates are sampled each round.
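As a rough illustration of the distance behind kshape, the sketch below computes per-window z-score normalization and the shape-based distance as 1 minus the maximum normalized cross-correlation over all shifts. It is a plain O(l²) version written for readability, not the PR's implementation. The use of the population standard deviation and the handling of constant or all-zero windows are assumptions.

```python
import math


def z_normalize(x):
    """Standardize one window to zero mean and unit variance.
    Assumption: population standard deviation; constant windows map to zeros."""
    mean = sum(x) / len(x)
    std = math.sqrt(sum((v - mean) ** 2 for v in x) / len(x))
    if std == 0:
        return [0.0] * len(x)
    return [(v - mean) / std for v in x]


def sbd(x, y):
    """Shape-based distance: 1 - max over shifts of NCC(x, y).
    x and y must have the same length. Result lies in [0, 2]."""
    n = len(x)
    denom = math.sqrt(sum(v * v for v in x) * sum(v * v for v in y))
    if denom == 0:
        return 1.0  # convention chosen here for all-zero windows
    best = -math.inf
    for shift in range(-(n - 1), n):
        # Cross-correlation of x with y shifted by `shift` positions.
        cc = sum(x[i] * y[i - shift] for i in range(n) if 0 <= i - shift < n)
        best = max(best, cc)
    return 1.0 - best / denom


a = [0.0, 1.0, 2.0, 1.0, 0.0, 0.0]
b = [0.0, 0.0, 1.0, 2.0, 1.0, 0.0]  # same bump, shifted by one sample
d_shape = sbd(a, b)
d_euclid = math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))
```

The two windows contain the same bump one sample apart, so the shape-based distance is essentially zero while the Euclidean distance is not. This is the kind of difference that separates kshape from kmeans on shifted patterns.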

Output series

Controlled by output:

output = label (default)

  • One output series, type INT32.
  • Number of points = number of full windows, ⌊n/l⌋.
  • Timestamp of each point = time of the first sample in that window; value = cluster id 0 … k−1.
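The label output described above could be assembled roughly as follows. `label_output` is a hypothetical helper; it assumes each window is a list of `(timestamp, value)` pairs and that a cluster id has already been assigned to each window.

```python
def label_output(windows, labels):
    """Pair each window's first timestamp with its cluster id.
    Hypothetical helper illustrating the output = label layout."""
    if len(windows) != len(labels):
        raise ValueError("one label per window is expected")
    return [(window[0][0], int(label)) for window, label in zip(windows, labels)]


# Two windows of length 3 starting at timestamps 100 and 130.
demo_windows = [
    [(100, 1.0), (110, 2.0), (120, 3.0)],
    [(130, 3.0), (140, 2.0), (150, 1.0)],
]
label_series = label_output(demo_windows, [1, 0])
```

Each output point reuses the timestamp of the first sample in its window, so the result lines up with the input series on the time axis.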

output = centroid

  • One output series, type DOUBLE.
  • Number of points = k × l: for clusters 0 → k−1, emit the l components of each centroid in order (concatenated).
  • Timestamps are 0, 1, 2, … (placeholders only, no physical time meaning).
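The centroid layout can be sketched the same way. `centroid_output` is a hypothetical helper that flattens k centroids of length l into k × l points with placeholder timestamps 0, 1, 2, …, so component j of cluster c lands at index c × l + j.

```python
def centroid_output(centroids):
    """Concatenate centroid vectors in cluster order and attach
    placeholder timestamps 0, 1, 2, ... (hypothetical helper)."""
    flat = []
    for centroid in centroids:  # clusters 0 .. k-1, in order
        flat.extend(float(v) for v in centroid)
    return list(enumerate(flat))


# k = 2 centroids of length l = 3 give 6 output points.
demo_centroids = [[1, 2, 3], [4, 5, 6]]
centroid_series = centroid_output(demo_centroids)
```

A consumer that knows l can recover centroid c by taking the points with timestamps c × l through (c + 1) × l − 1.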

Contributor

@waynextz waynextz left a comment

Two small issues.

Contributor

Shouldn't the version on dev/1.3 be 1.3.x?

Author

I have restored the pom file.

Contributor

The bat files are not in the sbin/windows folder on the dev/1.3 branch.

Author

I have restored the bat file and added the cluster registration.
