All you need to manage your network: Cumulus Linux, NetQ and WJH

Cumulus Linux is a specialized network operating system designed to run on bare-metal switches (switches shipped without an operating system) that provide the ONIE installation environment. It was created by Cumulus Networks for data center switches and extends their functionality by letting you use standard Linux applications for orchestration, management, configuration, and automation.

Need to know. A network switch is an electronic device that connects multiple computers and/or other digital devices in a local area network and allows them to exchange data.

In other words, Cumulus Linux is a technology for configuring network switches that is built on open source Linux. It provides switch management and configuration functionality, allowing administrators to configure aspects of network equipment such as VLANs, routing, security, and network layer protocols. Cumulus Linux also supports a wide range of switches from different vendors, giving you flexibility when choosing hardware for your network infrastructure.

NetQ: a Cumulus Linux tool

NetQ is a network monitoring and analysis tool developed by Cumulus Networks, the company behind Cumulus Linux. NetQ is tightly integrated with Cumulus Linux and provides detailed information about the network infrastructure running on it. It collects real-time telemetry data, continuously checks network configurations, and detects problems. NetQ provides analytical capabilities for network monitoring, troubleshooting, and performance optimization. As a result, NetQ and Cumulus Linux work together to give network administrators control over, and a deep understanding of, how a Cumulus Linux network operates.

Example

Let's say you have a network based on the Cumulus Linux operating system that runs on multiple switches. You use NetQ along with Cumulus Linux to manage and monitor this network.

NetQ will collect data about the status of network devices such as switches, their configurations, port status, and data flows. This data will help you monitor network activity and identify any problems such as configuration errors, connection failures, or packet loss.

When a problem occurs, such as some ports on a switch not working, NetQ will provide you with information about the status of those ports, their configuration, and any errors that might be causing the problem. This will help you identify and fix the problem faster.

In addition, NetQ lets you monitor network performance. It can analyze throughput, latency, and packet loss, allowing you to evaluate how the network is performing and optimize it.

All of these NetQ features are integrated with Cumulus Linux, making it easier and more efficient to manage and monitor networks that run on this operating system. You can use NetQ to detect problems, check configuration, monitor performance, and analyze traffic in a Cumulus Linux network, which helps you keep the network stable, reliable, and secure.

WJH: How to Track Network Problems Quickly and Efficiently

Cumulus Linux integrates with NVIDIA Spectrum switches, which provide the What Just Happened (WJH) streaming telemetry feature. WJH is telemetry generated at the application-specific integrated circuit (ASIC) level that allows real-time, detailed monitoring of network flows within the switch. It analyzes packets passing through the switch and provides alerts on performance issues such as packet drops, congestion, high latency, or misconfigurations.

Need to know. Telemetry is the process of collecting, measuring, and transmitting data about the operation or condition of a specific system or device. In the context of networks and information technology, telemetry is used to gather information about various parameters such as performance, resource utilization, network traffic, and other characteristics.

WJH telemetry works by capturing information at the ASIC level when a packet is dropped. It provides detailed packet header information, including the 5-tuple or 12-tuple fields, without including the payload. This allows for quick identification of the root cause of data-plane anomalies. WJH also describes why, when, and where the packet was dropped, along with recommended corrective actions. It raises alerts when packet latency or buffer utilization percentage crosses a set threshold, helping to detect network bottlenecks and avoid overflow-related drops.
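To make the shape of this data more concrete, here is a minimal Python sketch of what a drop record and a threshold check could look like. It is not the WJH data format or API: the class names, field names, threshold values, and the sample reason string are all assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import List


@dataclass(frozen=True)
class FiveTuple:
    """Flow identity read from the packet headers; the payload is never stored."""
    src_ip: str
    dst_ip: str
    src_port: int
    dst_port: int
    protocol: str


@dataclass(frozen=True)
class DropEvent:
    """Hypothetical drop record: why, when, and where a packet was dropped."""
    flow: FiveTuple
    switch: str
    port: str
    timestamp: datetime
    reason: str
    recommended_action: str


def threshold_alerts(latency_us: float, buffer_pct: float,
                     max_latency_us: float = 100.0,
                     max_buffer_pct: float = 80.0) -> List[str]:
    """Return one alert per measurement that exceeds its (assumed) threshold."""
    alerts = []
    if latency_us > max_latency_us:
        alerts.append(f"latency {latency_us} us exceeds {max_latency_us} us")
    if buffer_pct > max_buffer_pct:
        alerts.append(f"buffer utilization {buffer_pct}% exceeds {max_buffer_pct}%")
    return alerts


# Illustrative values only; none of them come from a real switch.
event = DropEvent(
    flow=FiveTuple("10.0.0.1", "10.0.0.2", 49152, 443, "tcp"),
    switch="leaf01",
    port="swp1",
    timestamp=datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc),
    reason="VLAN tag mismatch",
    recommended_action="Check the VLAN configuration on the ingress port",
)
print(event.reason, "on", event.switch, event.port)
print(threshold_alerts(latency_us=150.0, buffer_pct=60.0))
```

The record carries only header fields and context, which mirrors the point above: the flow can be identified without ever copying the packet payload.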

WJH monitors various categories of events, each with its own set of drop reasons and notifications. These include Layer 1 events such as cable issues and signal degradation, Layer 2 drops caused by VLAN misconfiguration or incorrect VLAN tags, Layer 3 (Router) drops related to routing issues, overlay (VXLAN) encapsulation or decapsulation errors, ACL drops indicating specific rules that dropped packets, congestion-related drops, and latency exceeding set thresholds.
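The categories listed above can be modeled as a simple enumeration. The Python sketch below is only an illustration: the keyword rules and the reason strings are invented for this example and do not reproduce the actual WJH drop reason catalog.

```python
from enum import Enum
from typing import Optional


class DropCategory(Enum):
    LAYER_1 = "Layer 1"
    LAYER_2 = "Layer 2"
    ROUTER = "Layer 3 (Router)"
    TUNNEL = "Overlay (VXLAN)"
    ACL = "ACL"
    CONGESTION = "Congestion"
    LATENCY = "Latency"


# Hypothetical keyword rules, checked in order. "vxlan" comes before "vlan"
# so that overlay reasons are never mistaken for Layer 2 reasons.
KEYWORD_RULES = [
    ("vxlan", DropCategory.TUNNEL),
    ("vlan", DropCategory.LAYER_2),
    ("cable", DropCategory.LAYER_1),
    ("signal", DropCategory.LAYER_1),
    ("route", DropCategory.ROUTER),
    ("acl", DropCategory.ACL),
    ("congestion", DropCategory.CONGESTION),
    ("latency", DropCategory.LATENCY),
]


def categorize(reason: str) -> Optional[DropCategory]:
    """Return the first category whose keyword appears in the reason text."""
    text = reason.lower()
    for keyword, category in KEYWORD_RULES:
        if keyword in text:
            return category
    return None  # unknown reasons are left unclassified


for sample in ("VLAN tag mismatch", "VXLAN decapsulation error", "No route to destination"):
    print(sample, "->", categorize(sample))
```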

By using the WJH streaming telemetry data provided by NVIDIA Spectrum switches, Cumulus Linux users can gain insight into the network's performance, detect and troubleshoot anomalies in real time, and take proactive measures to optimize network operations.

Integrating NetQ with WJH

The NetQ agent installed on the switch aggregates the WJH events by category and type and streams them to the NetQ server, which runs either on-premises or as SaaS, using the gRPC protocol. You can then access the WJH data through the NetQ user interface (UI) and command-line interface (CLI).
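The aggregation step can be pictured with a short Python sketch. This is not the NetQ agent's code: the event fields and function names are assumptions, and the gRPC transport is left out entirely. The sketch only shows how individual events might be collapsed into per-category, per-reason counts before being sent.

```python
from collections import Counter


def aggregate(events):
    """Count events per (category, reason) pair."""
    return Counter((e["category"], e["reason"]) for e in events)


def to_messages(counts):
    """Flatten the counts into dictionaries ready to be serialized and streamed."""
    return [
        {"category": category, "reason": reason, "count": count}
        for (category, reason), count in sorted(counts.items())
    ]


# Illustrative events; the categories and reasons are made up for this example.
events = [
    {"category": "Layer 2", "reason": "VLAN tag mismatch"},
    {"category": "Layer 2", "reason": "VLAN tag mismatch"},
    {"category": "ACL", "reason": "ACL rule match"},
]

for message in to_messages(aggregate(events)):
    print(message)
```

Sending one aggregated message per category and reason, rather than one message per dropped packet, keeps the volume of streamed data manageable when many identical drops occur.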

By utilizing NVIDIA NetQ, you can monitor and analyze the WJH events to identify performance issues, packet drops, congestion, latency, misconfigurations, and other anomalies in your network.

Above is the NetQ WJH dashboard. NVIDIA has released a detailed video tutorial that shows how to work with WJH data in NetQ.

The dashboard interactively presents highly detailed WJH event information. You can easily examine event distribution through a pie chart and time-based graph. The dashboard also provides information on the top affected switches and drop type distribution.

The dashboard also includes a complete table listing all events by their reasons, detailed information, timestamps, and aggregated count. To view specific WJH data, you can filter the events by time, devices, drop types, and reasons and then export them into JSON or CSV files.
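If you want to post-process exported events yourself, the same filter-then-export workflow can be sketched in Python with the standard library alone. The field names and sample events below are assumptions for illustration and do not describe the actual layout of a NetQ export.

```python
import csv
import io
import json
from datetime import datetime

FIELDS = ["timestamp", "device", "drop_type", "reason", "count"]

# Illustrative events only.
EVENTS = [
    {"timestamp": datetime(2024, 1, 1, 12, 0), "device": "leaf01",
     "drop_type": "Layer 2", "reason": "VLAN tag mismatch", "count": 4},
    {"timestamp": datetime(2024, 1, 1, 12, 5), "device": "leaf02",
     "drop_type": "ACL", "reason": "ACL rule match", "count": 2},
    {"timestamp": datetime(2024, 1, 1, 13, 0), "device": "leaf01",
     "drop_type": "Router", "reason": "No route to destination", "count": 1},
]


def filter_events(events, start=None, end=None, devices=None, drop_types=None):
    """Keep the events that satisfy every filter that was supplied."""
    selected = []
    for e in events:
        if start is not None and e["timestamp"] < start:
            continue
        if end is not None and e["timestamp"] > end:
            continue
        if devices is not None and e["device"] not in devices:
            continue
        if drop_types is not None and e["drop_type"] not in drop_types:
            continue
        selected.append(e)
    return selected


def _serializable(event):
    """Copy an event, replacing the datetime with an ISO 8601 string."""
    row = dict(event)
    row["timestamp"] = event["timestamp"].isoformat()
    return row


def to_json(events):
    return json.dumps([_serializable(e) for e in events], indent=2)


def to_csv(events):
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(_serializable(e) for e in events)
    return buffer.getvalue()


leaf01_events = filter_events(EVENTS, devices={"leaf01"},
                              end=datetime(2024, 1, 1, 12, 30))
print(to_json(leaf01_events))
print(to_csv(leaf01_events))
```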

Conclusion

The integration of Cumulus Linux with NetQ enables seamless aggregation and streaming of WJH events to the NetQ server, facilitating quick identification and resolution of network issues. Together, NetQ and Cumulus Linux help organizations optimize the performance of their AI infrastructure and keep operations reliable and efficient.

A powerful network infrastructure is as essential as high-end GPUs and storage systems in any AI deployment. It’s crucial to have strong telemetry methods that quickly identify the root causes of network issues when they happen.

NVIDIA What Just Happened adds a new dimension to network streaming telemetry by providing detailed, contextual information about packet drops and data-plane anomalies. WJH reduces troubleshooting time and time to root cause, enabling you to get the best out of your AI infrastructure.