A80: Export TCP Telemetry from gRPC #519
Open: AananthV wants to merge 7 commits into master from AananthV-A80
Commits:
- 99bb6c1 A80: Export TCP Telemetry from gRPC (AananthV)
- 82f606b Incorporate formatting and other fixes from PR discussion (AananthV)
- 9c75f06 Update A80-tcp-telemetry.md (AananthV)
- dc2e106 Update A80-tcp-telemetry.md (AananthV)
- 7c43505 Update A80-tcp-telemetry.md (AananthV)
- 6fb558e Update A80-tcp-telemetry.md (AananthV)
- a93800c Update A80-tcp-telemetry.md (AananthV)
# A80: Export TCP Telemetry from gRPC

* Author(s): Aananth V (@aananthv), Nana Pang (@nanahpang), Yash Tibrewal (@yashykt), Yousuk Seung (@yousukseung)
* Approver(s): Craig Tiller (@ctiller), Mark Roth (@markdroth)
* Status: In Review
* Implemented in: C-Core
* Last updated: 2025-12-16
* Discussion at: https://groups.google.com/g/grpc-io/c/MoLrWPsFB3s

## Abstract

This document proposes collecting and exposing new TCP endpoint-level telemetry in gRPC for improved network analysis and debugging.

## Background

The Linux kernel exposes two telemetry hooks that can be used to collect TCP-level metrics:
1. **TCP socket state** can be retrieved using the `getsockopt()` system call with `level` set to `IPPROTO_TCP` and `optname` set to `TCP_INFO`. The state is returned in a `struct tcp_info`, which gives details about the TCP connection (see the sketch after this list). At present, the machinery to collect this information is available on Linux 2.6 or later kernels.
2. **Per-message transmission timestamps** can be collected from TCP sockets using the [SO_TIMESTAMPING](https://docs.kernel.org/networking/timestamping.html) interface, also available on Linux 2.6 or later kernels. These timestamps are very valuable for diagnosing network-level issues and can be used to break down the time spent in the "network".
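
A minimal sketch of the first hook, assuming a connected Linux socket `fd`; this is illustrative only, not gRPC's actual endpoint code, and uses the standard field names from `<linux/tcp.h>`.

```cpp
#include <linux/tcp.h>   // struct tcp_info, TCP_INFO
#include <netinet/in.h>  // IPPROTO_TCP
#include <sys/socket.h>  // getsockopt, socklen_t

#include <cstdio>

// Query the kernel's TCP state for a connected socket `fd`.
// Returns true on success and fills `info` with the current connection stats.
bool QueryTcpInfo(int fd, tcp_info* info) {
  socklen_t len = sizeof(*info);
  return getsockopt(fd, IPPROTO_TCP, TCP_INFO, info, &len) == 0;
}

void DumpTcpInfo(int fd) {
  tcp_info info{};
  if (!QueryTcpInfo(fd, &info)) return;
  // A few of the fields referenced by the metrics proposed below.
  std::printf("min_rtt (us):  %u\n", info.tcpi_min_rtt);
  std::printf("data segs out: %u\n", info.tcpi_data_segs_out);
  std::printf("total retrans: %u\n", info.tcpi_total_retrans);
}
```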

[gRFC A79][A79] provides a framework for adding non-per-call metrics in gRPC. This document uses that framework to expose the proposed TCP metrics.

### *Related Proposals:*

* [gRFC A79][A79]: gRPC Non-Per-Call Metrics Framework

[A79]: A79-non-per-call-metrics-architecture.md

## Proposal

This document proposes exporting the following TCP metrics from gRPC to improve the network debugging capabilities for gRPC users. These metrics will be implemented inside the gRPC endpoint layer. All metrics have the following optional labels:

* network.local.address
* network.local.port
* network.peer.address
* network.peer.port

### Per-Connection Metrics

| Name | Type | Unit | Description |
| ----- | :---- | :---- | :---- |
| grpc.tcp.min_rtt | Histogram (floating-point) | s | TCP's current estimate of the minimum round-trip time (RTT). It can be used as an indication of the network health between two endpoints. Corresponds to `tcpi_min_rtt` from `struct tcp_info`. |
| grpc.tcp.delivery_rate | Histogram (floating-point) | By/s | TCP's most recent measure of the connection's "non-app-limited" throughput. Non-app-limited means that the link is saturated by the application. The delivery rate is only reported when it is non-app-limited. Corresponds to `tcpi_delivery_rate` from `struct tcp_info` when `tcpi_delivery_rate_app_limited` is `false`. |
| grpc.tcp.packets_sent | Counter (integer) | {packet} | Total packets sent by TCP, including retransmissions and spurious retransmissions. Corresponds to `tcpi_data_segs_out` from `struct tcp_info`. |
| grpc.tcp.packets_retransmitted | Counter (integer) | {packet} | Total packets sent by TCP except those sent for the first time. A packet may be retransmitted multiple times and is counted each time as retransmitted. Retransmission counts include spurious retransmissions. Corresponds to `tcpi_total_retrans` from `struct tcp_info`. |
| grpc.tcp.packets_spurious_retransmitted | Counter (integer) | {packet} | Total packets retransmitted by TCP that were later found to be unnecessary, i.e. packets acknowledged a second time or more. Multiple spurious retransmissions of the same packet are counted multiple times. Corresponds to `tcpi_dsack_dups` from `struct tcp_info`. |
| grpc.tcp.recurring_retransmits | Counter (integer) | {packet} | The number of times the latest TCP packet (TCP sequence) was retransmitted due to expiration of the TCP retransmission timer (RTO) and was still unacknowledged when the connection was closed. Corresponds to `tcpi_retransmits` at connection close time. |
| grpc.tcp.bytes_sent | Counter (integer) | By | Total bytes sent by TCP, including retransmissions and spurious retransmissions. Corresponds to `tcpi_bytes_sent` from `struct tcp_info`. |
| grpc.tcp.bytes_retransmitted | Counter (integer) | By | Total bytes sent by TCP except those sent for the first time. A byte sequence may be retransmitted multiple times and is counted each time as retransmitted. Retransmission counts include spurious retransmissions. Corresponds to `tcpi_bytes_retrans` from `struct tcp_info`. |

#### Suggested Metric Collection Algorithm

* Set TCP_CONNECTION_METRICS_RECORD_INTERVAL to a default of 5 minutes.
  * Implementations can choose to make it configurable.
* For each newly connected TCP socket, set an initial alarm at 10% to 110% (randomly selected) of TCP_CONNECTION_METRICS_RECORD_INTERVAL.
* When the alarm fires:
  * Use `getsockopt(TCP_INFO)` or an equivalent mechanism to retrieve and record the connection metrics.
  * Re-arm the alarm at 10% to 110% (randomly selected) of TCP_CONNECTION_METRICS_RECORD_INTERVAL and repeat.
* Before the socket is closed, cancel the alarm set above and retrieve and record the connection metrics one last time, providing observability for short-lived connections as well. This final snapshot also allows collection of `grpc.tcp.recurring_retransmits`. A sketch of this collection loop follows.
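
One way this loop could be structured is sketched below, assuming a plain Linux socket and a caller-provided alarm mechanism (gRPC's timer machinery is out of scope here). `RecordConnectionMetrics` is a hypothetical stand-in for the stats-plugin recording calls.

```cpp
#include <linux/tcp.h>
#include <netinet/in.h>
#include <sys/socket.h>

#include <chrono>
#include <random>

using namespace std::chrono_literals;

// Default from this proposal; implementations may make it configurable.
constexpr std::chrono::milliseconds kTcpConnectionMetricsRecordInterval = 5min;

// Hypothetical stand-in for recording the snapshot into the per-connection
// histograms/counters defined above. Left as a stub in this sketch.
void RecordConnectionMetrics(const tcp_info& /*info*/) {}

// Pick a delay uniformly in [10%, 110%] of the record interval so that
// collection across many connections is not synchronized.
std::chrono::milliseconds NextCollectionDelay(std::mt19937& rng) {
  std::uniform_real_distribution<double> fraction(0.10, 1.10);
  return std::chrono::duration_cast<std::chrono::milliseconds>(
      fraction(rng) * kTcpConnectionMetricsRecordInterval);
}

// Alarm callback: snapshot TCP_INFO, record it, and (in a real
// implementation) re-arm the alarm with NextCollectionDelay().
void OnConnectionMetricsAlarm(int fd) {
  tcp_info info{};
  socklen_t len = sizeof(info);
  if (getsockopt(fd, IPPROTO_TCP, TCP_INFO, &info, &len) == 0) {
    RecordConnectionMetrics(info);
  }
}

// Called just before the socket is closed: cancel the pending alarm (not
// shown) and take a final snapshot so that short-lived connections are
// observable. grpc.tcp.recurring_retransmits is read from tcpi_retransmits
// in this final snapshot.
void OnConnectionClose(int fd) { OnConnectionMetricsAlarm(fd); }
```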

### Per-Connection Op Metrics

| Name | Type | Unit | Description |
| ----- | :---- | :---- | :---- |
| grpc.tcp.connections_created | Counter (integer) | {connection} | Number of TCP connections created. |
| grpc.tcp.connection_count | UpDownCounter (integer) | {connection} | Number of active TCP connections. |
| grpc.tcp.syscall_writes | Counter (integer) | {syscall} | The number of times we invoked the sendmsg (or sendmmsg) syscall and wrote data to the TCP socket. Measured at the endpoint level. |
| grpc.tcp.write_size | Histogram (floating-point) | By | The number of bytes offered to each syscall_write. Measured at the endpoint level. |
| grpc.tcp.syscall_reads | Counter (integer) | {syscall} | The number of times we invoked the recvmsg (or recvmmsg, or zero-copy getsockopt) syscall and read data from the TCP socket. Measured at the endpoint level. |
| grpc.tcp.read_size | Histogram (floating-point) | By | The number of bytes received by each syscall_read. Measured at the endpoint level. |

#### Suggested Metric Collection Algorithm

* `grpc.tcp.connections_created` will be incremented when a connection is created.
* `grpc.tcp.connection_count` will be incremented when a connection is created and decremented when it is destroyed.
* The `grpc.tcp.syscall_writes` and `grpc.tcp.write_size` metrics will be updated whenever we write to the socket.
* The `grpc.tcp.syscall_reads` and `grpc.tcp.read_size` metrics will be updated whenever we read from the socket. A sketch of these update points follows.
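
A sketch of where these updates could hook into the endpoint's read and write paths. The `Record*` helpers below are hypothetical stand-ins for the stats-plugin API; only the placement of the increments is the point.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical stand-ins for the stats-plugin recording API.
inline void RecordCounter(const char* /*metric*/, int64_t /*delta*/) {}
inline void RecordHistogram(const char* /*metric*/, double /*value*/) {}

// Called once when a connection (endpoint) is created.
void OnEndpointCreated() {
  RecordCounter("grpc.tcp.connections_created", 1);
  RecordCounter("grpc.tcp.connection_count", 1);   // UpDownCounter: +1
}

// Called once when the connection is destroyed.
void OnEndpointDestroyed() {
  RecordCounter("grpc.tcp.connection_count", -1);  // UpDownCounter: -1
}

// Called for every sendmsg()/sendmmsg() that writes to the TCP socket.
// `bytes_offered` is the number of bytes handed to that syscall.
void OnSyscallWrite(size_t bytes_offered) {
  RecordCounter("grpc.tcp.syscall_writes", 1);
  RecordHistogram("grpc.tcp.write_size", static_cast<double>(bytes_offered));
}

// Called for every recvmsg()/recvmmsg() (or zero-copy read) that returns data.
// `bytes_received` is the number of bytes that syscall produced.
void OnSyscallRead(size_t bytes_received) {
  RecordCounter("grpc.tcp.syscall_reads", 1);
  RecordHistogram("grpc.tcp.read_size", static_cast<double>(bytes_received));
}
```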

### Per-Write Metrics

| Name | Type | Unit | Labels | Description |
| ----- | :---- | :---- | :---- | :---- |
| grpc.tcp.sender_latency | Histogram (floating-point) | s | None | Time taken by the TCP socket to write the first byte of a write onto the NIC. This includes the latency incurred by traffic shaping, qdisc, throttling, and pacing at the sender. Corresponds to the time taken between the final `SCHED` timestamp and the `SENT` timestamp. Sampled periodically. |
| grpc.tcp.transfer_latency | Histogram (floating-point) | s | grpc.transfer_size<sup>1</sup> | Time taken to transmit the first `size` bytes of a write. Transfer latency is measured from when the first byte is handed to the NIC until TCP receives the acknowledgement for the last byte. Corresponds to the time taken between the `SENT` timestamp and the `ACKED` timestamp. Sampled periodically. |

<sup>1</sup> Since transfer latency is strongly affected by the write size, it is broken down into buckets based on the size of the write. Further, we only measure the latencies of certain benchmark sample sizes so that the measurements are not skewed by varying write sizes. The proposed buckets are:

| Buffered Size | Benchmark Size | `grpc.transfer_size` (Bytes) |
| :---- | :---- | :---- |
| [0, 1KiB) | Whole buffer | 1024 |
| [1KiB, 8KiB) | First 1KiB | 1024 |
| [8KiB, 64KiB) | First 8KiB | 8192 |
| [64KiB, 256KiB) | First 64KiB | 65536 |
| [256KiB, 2MiB) | First 256KiB | 262144 |
| [2MiB, +inf) | First 2MiB | 2097152 |

Writes smaller than 1024 bytes are labelled with `size=1024` to reduce cardinality. Their size also has little impact on transfer latency, since they fit inside a single TCP packet. A helper for this bucketing is sketched below.
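
The bucketing above can be captured in a small helper; the names here are illustrative, not part of the proposal.

```cpp
#include <algorithm>
#include <cstddef>

// Returns the grpc.transfer_size label for a sampled write, following the
// bucket table above. The measured prefix is min(buffered_size, label); for
// writes under 1 KiB this is the whole buffer, but the label stays 1024 to
// keep cardinality low.
size_t TransferSizeLabel(size_t buffered_size) {
  if (buffered_size < 8 * 1024) return 1024;               // [0, 8 KiB)
  if (buffered_size < 64 * 1024) return 8 * 1024;          // [8 KiB, 64 KiB)
  if (buffered_size < 256 * 1024) return 64 * 1024;        // [64 KiB, 256 KiB)
  if (buffered_size < 2 * 1024 * 1024) return 256 * 1024;  // [256 KiB, 2 MiB)
  return 2 * 1024 * 1024;                                  // [2 MiB, +inf)
}

// The chunk whose latency is actually timestamped for a given write.
size_t MeasuredPrefix(size_t buffered_size) {
  return std::min(buffered_size, TransferSizeLabel(buffered_size));
}
```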

#### Suggested Metric Collection Algorithm

* Set TCP_LATENCY_RECORD_PERIOD to a default of 1000, denoting a sampling frequency of 1 in every 1000 writes.
  * Implementations can choose to make it configurable.
* For each newly connected TCP socket:
  * Set `writes_since_last_latency_measurement_` to a random integer in [0, TCP_LATENCY_RECORD_PERIOD). Increment this value for every write.
  * Perform any prerequisites needed for the socket to support timestamping.
    * For Linux TCP, this involves making a setsockopt(SO_TIMESTAMPING) call with the flag value set to SOF_TIMESTAMPING_SOFTWARE | SOF_TIMESTAMPING_OPT_ID | SOF_TIMESTAMPING_OPT_TSONLY | SOF_TIMESTAMPING_OPT_ID_TCP | SOF_TIMESTAMPING_OPT_STATS.
  * If `writes_since_last_latency_measurement_` % TCP_LATENCY_RECORD_PERIOD == 0:
    * Enable latency measurement for the write.
      * For Linux TCP, this involves splitting the write into two chunks based on the buckets listed above and adding a SO_TIMESTAMPING cmsg header to the sendmsg call, with the flags SOF_TIMESTAMPING_TX_SCHED | SOF_TIMESTAMPING_TX_SOFTWARE on the sampled chunk and SOF_TIMESTAMPING_TX_ACK on the other chunk.
      * There is only one chunk if the write is smaller than 1024 bytes; in that case all three flags are set on that chunk.
    * Set `writes_since_last_latency_measurement_` to 1 and repeat. A Linux sketch of the setup and the sampled write follows this list.
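
A Linux-only sketch of the two per-socket steps named above: enabling TX timestamping once at connection setup, and attaching the per-write SO_TIMESTAMPING control message to a sampled chunk. Error handling, splitting the write into its two chunks, and reading the resulting SCHED/SENT/ACK timestamps back from the socket error queue are all elided, and SOF_TIMESTAMPING_OPT_ID_TCP requires reasonably recent kernel headers.

```cpp
#include <linux/net_tstamp.h>  // SOF_TIMESTAMPING_* flags
#include <sys/socket.h>        // setsockopt, sendmsg, SO_TIMESTAMPING, cmsg macros
#include <sys/types.h>         // ssize_t, size_t
#include <sys/uio.h>           // struct iovec

#include <cstdint>
#include <cstring>

// Step 1 (once per connected socket): enable software TX timestamping with
// per-byte IDs and TCP stats, using the flag set listed in the proposal.
bool EnableTxTimestamping(int fd) {
  uint32_t flags = SOF_TIMESTAMPING_SOFTWARE | SOF_TIMESTAMPING_OPT_ID |
                   SOF_TIMESTAMPING_OPT_TSONLY | SOF_TIMESTAMPING_OPT_ID_TCP |
                   SOF_TIMESTAMPING_OPT_STATS;
  return setsockopt(fd, SOL_SOCKET, SO_TIMESTAMPING, &flags, sizeof(flags)) == 0;
}

// Step 2 (for a sampled write): attach a per-call SO_TIMESTAMPING control
// message requesting SCHED and SENT timestamps for this chunk. The second
// chunk (carrying SOF_TIMESTAMPING_TX_ACK) would be sent the same way with a
// different flag word.
ssize_t SendSampledChunk(int fd, const void* buf, size_t len) {
  iovec iov;
  iov.iov_base = const_cast<void*>(buf);
  iov.iov_len = len;

  // Room for one cmsg carrying the 32-bit flag word.
  alignas(cmsghdr) char control[CMSG_SPACE(sizeof(uint32_t))] = {};

  msghdr msg{};
  msg.msg_iov = &iov;
  msg.msg_iovlen = 1;
  msg.msg_control = control;
  msg.msg_controllen = sizeof(control);

  cmsghdr* cmsg = CMSG_FIRSTHDR(&msg);
  cmsg->cmsg_level = SOL_SOCKET;
  cmsg->cmsg_type = SO_TIMESTAMPING;
  cmsg->cmsg_len = CMSG_LEN(sizeof(uint32_t));
  uint32_t flags = SOF_TIMESTAMPING_TX_SCHED | SOF_TIMESTAMPING_TX_SOFTWARE;
  std::memcpy(CMSG_DATA(cmsg), &flags, sizeof(flags));

  return sendmsg(fd, &msg, 0);
}
```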

### Metric Stability

All metrics added in this proposal will start as experimental. The long-term goal is to de-experimentalize them and potentially have some metrics on by default, but the exact criteria for that change are TBD. We may also add new labels (e.g. `grpc.lb.locality`, `grpc.lb.backend_service`) in the future. This gRFC will be amended when that happens.

### Temporary environment variable protection

This proposal does not include any features enabled via external I/O, so it does not need environment variable protection.

## Implementation

This will be implemented in C-Core initially. Other languages may implement only a subset of the metrics, depending on kernel API availability.