Operational Innovations

ESnet’s network is operated as a high-availability network 24 hours per day, 365 days per year, and significant resources are devoted to ensuring its secure, continuous availability. While ESnet is required to deliver “three 9s” (99.9% availability), it routinely exceeds that standard. In fact, sites with redundant connections to ESnet often see 100% availability (zero downtime) over the course of a year. Our staff continues to work on enhancing the technical systems that benefit both our internal team and external production services, in relentless pursuit of performance and reliability. These improvements encompass tools for gathering, assessing, and studying network performance data, as well as automating the setup of devices on the production network. In 2022, ESnet’s operational innovations included (but were not limited to) the following projects:
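As a rough illustration of what the “three 9s” requirement means in practice, the short sketch below converts an availability percentage into the downtime it permits over a year. The function names are ours, chosen for illustration, and the calculation assumes a 365-day year measured in hours; it is not taken from any ESnet tooling.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours in a non-leap year


def allowed_downtime_hours(availability_percent: float,
                           period_hours: float = HOURS_PER_YEAR) -> float:
    """Return the maximum downtime (in hours) that an availability target permits."""
    if not 0.0 <= availability_percent <= 100.0:
        raise ValueError("availability_percent must be between 0 and 100")
    return period_hours * (1.0 - availability_percent / 100.0)


def achieved_availability(downtime_hours: float,
                          period_hours: float = HOURS_PER_YEAR) -> float:
    """Return availability (as a percentage) given the observed downtime."""
    if not 0.0 <= downtime_hours <= period_hours:
        raise ValueError("downtime_hours must lie within the period")
    return 100.0 * (1.0 - downtime_hours / period_hours)


if __name__ == "__main__":
    # "Three 9s" leaves a budget of roughly 8.76 hours of downtime per year.
    print(f"{allowed_downtime_hours(99.9):.2f} hours")
    # A site with zero downtime achieves 100% availability.
    print(f"{achieved_availability(0.0):.1f}%")
```

Seen this way, a 99.9% target corresponds to a little under nine hours of permitted downtime per year, which is the budget that sites with redundant connections often leave entirely unused.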

Improving Service-Oriented Notifications

In CY2022, ESnet’s overall user satisfaction rating continued to improve, rising to 4.86 (out of 5) from 4.79 in CY2021. Captured via a survey of ESnet Site Coordinator Committee (ESCC) members that was conducted anonymously by the ESCC Chair, this rating has improved steadily for several years, a trend attributable to a focus on operational improvements.

Computer Systems Engineers Charles Shiflett and Rémy Doucet in ESnet’s Berkeley Lab data center.

This year, the ESnet Network Operations Center (NOC) upgraded the communications workflow to better prioritize outage response and communications based on impact and urgency. Additionally, for most of our IP and OSCARS services (OSCARS is an open-source, ESnet-developed advanced software system for reserving network resources), ESnet began sending targeted notifications to our site users that allow them to determine whether a planned outage or unplanned impairment will affect their services. ESnet also built new capabilities for visibility into site uptime. Next-gen operations will continue to be a key ESnet initiative in 2023, transitioning to service-based notifications, reducing the number of notifications sent to individual sites, and making the information in the notifications easier to understand.
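Prioritizing by impact and urgency is commonly done with a simple scoring matrix. The sketch below shows one generic way such a scheme could work. This report does not describe the NOC’s actual scheme, so the three levels, the 1-to-5 priority scale, and the sample events are all assumptions made purely for illustration.

```python
from enum import IntEnum
from typing import List, Tuple


class Level(IntEnum):
    """Hypothetical three-step scale used for both impact and urgency."""
    LOW = 1
    MEDIUM = 2
    HIGH = 3


def priority(impact: Level, urgency: Level) -> int:
    """Map impact and urgency to a priority from 1 (handle first) to 5 (handle last)."""
    return 7 - (int(impact) + int(urgency))


def order_events(events: List[Tuple[str, Level, Level]]) -> List[str]:
    """Return event summaries ordered from highest to lowest priority."""
    ranked = sorted(events, key=lambda event: priority(event[1], event[2]))
    return [summary for summary, _, _ in ranked]


if __name__ == "__main__":
    # Made-up events, not real ESnet incidents.
    sample = [
        ("Planned maintenance on a redundant link", Level.LOW, Level.LOW),
        ("Unplanned outage on a single-homed site", Level.HIGH, Level.HIGH),
        ("Degraded performance on one service", Level.MEDIUM, Level.HIGH),
    ]
    for summary in order_events(sample):
        print(summary)
```

The design choice here is that impact and urgency contribute equally; a real operations team might weight them differently or use a lookup table instead of a formula.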

Open-Source Visualization Tools

Stardust is a time-series data collection, analysis, and visualization platform developed by ESnet and drawing on the expertise gained from the National Science Foundation–funded NetSage project. The platform provides a scalable means to collect and process multiple network measurements, allowing users to create custom dashboards and visuals. It gives ESnet technical staff improved visibility into the performance and behavior of network services while fostering greater collaboration within the R&E community.

In 2022, the Stardust analytics team developed and made public several open-source tools for both data visualization and processing. The most notable were four visualization plugins submitted to and accepted by the Grafana platform: a bump chart, slope graph, chord diagram, and matrix. In addition to visualization plugins, ESnet also developed an open-source tool for managing Grafana dashboards named Grafana Dash-N-Grab (GDG), available on GitHub.

ESnet’s new matrix plugin can compare two sets of categorical data, whereas existing Grafana plugins all required time-series data on both axes. This allows, for example, packet loss between hosts to be visualized as shown.
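To make the idea of a categorical matrix concrete, the sketch below pivots a list of (source, destination, value) records into a grid with one category on each axis. It illustrates only the shape of the data, not the plugin’s actual code or configuration; the function name, host names, and packet-loss figures are invented for the example.

```python
from typing import Dict, Iterable, List, Optional, Tuple

Record = Tuple[str, str, float]  # (source host, destination host, packet loss %)


def build_matrix(records: Iterable[Record]
                 ) -> Tuple[List[str], List[str], List[List[Optional[float]]]]:
    """Pivot records into (row labels, column labels, grid of values).

    Cells with no measurement are left as None.
    """
    cells: Dict[Tuple[str, str], float] = {}
    for source, destination, value in records:
        cells[(source, destination)] = value  # the last record for a pair wins

    rows = sorted({source for source, _ in cells})
    columns = sorted({destination for _, destination in cells})
    grid = [[cells.get((row, column)) for column in columns] for row in rows]
    return rows, columns, grid


if __name__ == "__main__":
    # Made-up measurements for illustration only.
    sample = [
        ("host-a", "host-b", 0.0),
        ("host-a", "host-c", 0.2),
        ("host-b", "host-c", 1.5),
    ]
    rows, columns, grid = build_matrix(sample)
    print("rows:", rows)
    print("columns:", columns)
    for row_label, row_values in zip(rows, grid):
        print(row_label, row_values)
```

Neither axis is a time axis: both are sets of host names, which is the kind of comparison the caption describes.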

Upgrading perfSONAR

perfSONAR hosts worldwide.

perfSONAR is a platform for end-to-end network performance measurement and monitoring, used by network performance engineers to help pinpoint and diagnose subtle network performance issues. ESnet is a key member of the highly successful perfSONAR collaboration, along with Internet2, Indiana University, GÉANT, the University of Michigan, and Rede Nacional de Ensino e Pesquisa (RNP). perfSONAR is influential in the National Research and Education Network (NREN) community and is the de facto standard platform for network testing and measurement in science networking.

More than 2,000 registered perfSONAR hosts have been deployed on 400 networks in more than 50 countries. For example, the Worldwide Large Hadron Collider (LHC) Computing Grid (WLCG) project has intensive data transmission requirements; a perfSONAR “mesh” has been built between all LHC sites and is monitored with a unified dashboard. In 2022, ESnet performed a hardware upgrade of the 50 hosts it maintains, to enable testing of network speeds up to 100 Gbps, and shared the experience gained in tuning this new hardware with the R&E community.

Network Automation Innovations

One of the significant improvements represented by ESnet6 is its service orchestration and automation system. A key driver of this is the open-source Workflow Orchestrator tool, first developed in 2019 by the Netherlands R&E consortium SURF, which helps network administrators both automate (execute repetitive tasks reliably and easily) and orchestrate (add a layer of intelligence to tasks being automated and a complete audit log of changes). ESnet has contributed significantly to this open-source project and partnered with SURF to expand its adoption in the R&E community.

In this architecture, the Orchestrator functions as a quarterback, coordinating calls to various systems such as ESDB, NameSurfer (IPAM), Cisco NSO, and Ansible to implement configuration changes on network elements. In 2022, ESnet added several new workflows that automated the deployment of ESnet6 and migrated existing services to new network devices. In particular, the Internal Host Connectivity (IHC, used for managing direct connections to ESnet from servers) service workflows gained extra functionality. The team implemented the ability to configure BERT, Gateway, perfSONAR, and High Touch hosts. The Infrastructure and Networking teams extensively used the High Touch IHC workflows to push configurations from ESDB to newly deployed hosts.
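The coordinating role described above can be pictured as a workflow that runs a sequence of steps, passes shared state between them, and records an audit entry for each. The sketch below is a generic illustration of that pattern only. It does not use the Workflow Orchestrator’s real API, and the step functions are hypothetical stand-ins for calls to an inventory database, an IPAM system, and a configuration tool.

```python
from typing import Callable, Dict, List, Tuple

State = Dict[str, str]
Step = Callable[[State], State]


def run_workflow(steps: List[Tuple[str, Step]],
                 initial: State) -> Tuple[State, List[str]]:
    """Run steps in order, merging each step's output into the shared state.

    Returns the final state plus an audit log naming the keys each step set.
    """
    state = dict(initial)
    audit_log: List[str] = []
    for name, step in steps:
        updates = step(state)
        state.update(updates)
        audit_log.append(f"{name}: set {', '.join(sorted(updates))}")
    return state, audit_log


# Hypothetical steps; a real deployment would query external systems here.
def lookup_inventory(state: State) -> State:
    return {"hostname": f"{state['service']}-host-1"}


def allocate_address(state: State) -> State:
    return {"address": "2001:db8::10"}  # address from the IPv6 documentation prefix


def render_config(state: State) -> State:
    return {"config": f"host {state['hostname']} address {state['address']}"}


if __name__ == "__main__":
    final_state, log = run_workflow(
        [("lookup_inventory", lookup_inventory),
         ("allocate_address", allocate_address),
         ("render_config", render_config)],
        {"service": "perfsonar"},
    )
    print(final_state["config"])
    for entry in log:
        print(entry)
```

Keeping each step small and recording what it changed is what distinguishes orchestration from a plain script: the same steps can be recombined into new workflows, and every change leaves a trace.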

ESnet also made many operational enhancements to the development process of the Orchestrator application: expansion of the automated test suite improved the quality of the code, and Jira automation increased the team’s velocity and ability to monitor workload.

The network service orchestration architecture.

Supporting IPv6 Adoption and Standardization

The new Internet protocol IPv6 was first introduced in 1998 to address the shortfall of unique IP addresses available under IPv4. Despite the improvements in efficiency and security that come with IPv6, adoption in the United States and around the world has been slow.
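The scale of the address shortfall is easy to check with Python’s standard `ipaddress` module. The sketch below compares the total IPv4 and IPv6 address spaces and counts the addresses in a single /64 subnet; the subnet used is taken from the reserved documentation prefix 2001:db8::/32 and is an example, not an ESnet allocation.

```python
import ipaddress

# Total number of addresses in each protocol's full address space.
ipv4_total = ipaddress.ip_network("0.0.0.0/0").num_addresses  # 32-bit addresses
ipv6_total = ipaddress.ip_network("::/0").num_addresses       # 128-bit addresses

# A single /64 subnet carved from the documentation prefix.
example_subnet = ipaddress.ip_network("2001:db8::/64")

if __name__ == "__main__":
    print(f"IPv4 addresses: {ipv4_total:,}")
    print(f"IPv6 addresses: {ipv6_total:,}")
    print(f"Addresses in one /64: {example_subnet.num_addresses:,}")
    print(f"Whole IPv4 spaces that fit in one /64: "
          f"{example_subnet.num_addresses // ipv4_total:,}")
```

IPv4 offers 2^32 (about 4.3 billion) addresses, while a single IPv6 /64 subnet alone contains 2^64 addresses, enough to hold the entire IPv4 address space about 4.3 billion times over.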

As the DOE laboratories’ scientific network, ESnet has naturally co-led both the DOE IPv6-only Implementation Team and the community of practice groups, providing forward momentum and input on migration away from IPv4 and developing transition mechanisms for that migration path. With the team, ESnet staff organized multiple community practice sessions and worked with many entities, both within the DOE and in the larger federal government space. ESnet team members presented at and participated in the larger USG Federal IPv6 Task Force, providing important input and updates on the recently published NSA IPv6 security guidelines. ESnet staff also authored three active Internet Engineering Task Force drafts, offering information about extensive testing, lab work, and operational deployments relevant to the success of this effort.

Strengthening the Security Landscape

ESnet has built a strong security program with a rich history of expertise in network security monitoring. It is one of the key partners supporting the Zeek open-source project, and in response to the White House’s 2022 cybersecurity strategy, ESnet has launched its own Zero Trust program and developed a whitepaper that sets its security strategy for years to come. The security team was also essential in the successful completion of the ESnet6 project: it developed ESnet’s first security service as one of the key performance parameters, allowing ESnet to block and isolate traffic quickly on the WAN for customers. The team also released the open-source SCRAM (Security Capture and Release Automation Manager) tool, enabling our partner sites to better protect their networks.