You might be wondering - what is NRE all about? Why does networking need its own thing?
Easy answer - it doesn't.
If you read up on SRE, you'll find that it isn't actually separate from DevOps either. Where DevOps is a set of principles, SRE is a specific role and set of skills that align with those principles. You might say SRE is simply an implementation of DevOps.
In the same way, NRE is also an implementation of DevOps. An NRE's job role and day-to-day reality are not the same as those of a "traditional SRE", so it's meaningful to have the role of NRE to aspire to, one that implements the same back-end principles that SRE does. NRE is not meant to put up more silo walls - it is an extremely tangible path for traditional Network Engineers to adopt DevOps practices.
What follows is a WIP list of high-level NRE topics to dive into later. For now, the exercise is about making sure the list is reasonably complete and concise. It is EXTREMELY WIP: I'm open to merging, splitting, adding, or deleting categories as needed to make it make more sense.
Python, automation tools (Ansible/StackStorm/Salt, etc.)
Autonomous workflows. Software gets its inputs from other software (see event driven).
Everything is API-driven. Not just network device APIs but also cloud service APIs.
Defining all of the above as code: triggers, tests, workflows, configs, telemetry, policies. Treat the code as the source of truth.
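One way to picture "code as the source of truth" is a drift check: the intended state lives in version control, and anything the device reports that differs from it is treated as an error to fix. The sketch below is only an illustration. The function name `config_drift` and the flat key/value shape of the state are assumptions made for the example, not part of any particular tool.

```python
from typing import Any, Dict


def config_drift(intended: Dict[str, Any], observed: Dict[str, Any]) -> Dict[str, Dict[str, Any]]:
    """Return every key whose observed value differs from the intended one.

    `intended` is assumed to come from version control (the source of truth);
    `observed` is assumed to come from the device or a controller API.
    A key missing on either side shows up with a value of None.
    """
    drift = {}
    for key in sorted(intended.keys() | observed.keys()):
        want = intended.get(key)
        have = observed.get(key)
        if want != have:
            drift[key] = {"intended": want, "observed": have}
    return drift


if __name__ == "__main__":
    # Hypothetical values, purely for illustration.
    intended = {"hostname": "rtr1", "mtu": 9000}
    observed = {"hostname": "rtr1", "mtu": 1500, "snmp_community": "public"}
    for key, values in config_drift(intended, observed).items():
        print(key, values)
```

An empty result means the device matches the code. Anything else is a candidate for an automated push of the intended state, or for a change to the code if the device is actually right.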
Workflows have to be:
- End-to-end (campus, DC, WAN, etc)
- Top-to-bottom (L2 - L7)
- Service-Level Indicators (SLI): X should be true...
- Service-Level Objectives (SLO): Y proportion of the time...
- Service-Level Agreements (SLA): Or else Z.
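The SLI/SLO split above can be sketched in a few lines: the SLI is a yes/no statement about a single measurement, and the SLO is the proportion of measurements for which that statement must hold. Everything named here (`ProbeResult`, the 50 ms threshold, the 99.9% target) is a made-up assumption for illustration. The SLA is left out because it is a contractual consequence and not something you compute.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class ProbeResult:
    """One synthetic probe between two endpoints (hypothetical data shape)."""
    latency_ms: float
    reachable: bool


def sli_good(probe: ProbeResult, max_latency_ms: float = 50.0) -> bool:
    """SLI: "X should be true". Here: reachable and within the latency threshold."""
    return probe.reachable and probe.latency_ms <= max_latency_ms


def slo_attainment(probes: list[ProbeResult], max_latency_ms: float = 50.0) -> float:
    """Fraction of probes for which the SLI held."""
    if not probes:
        raise ValueError("no probes to evaluate")
    good = sum(sli_good(p, max_latency_ms) for p in probes)
    return good / len(probes)


def slo_met(probes: list[ProbeResult], target: float = 0.999,
            max_latency_ms: float = 50.0) -> bool:
    """SLO: "Y proportion of the time". True if attainment reaches the target."""
    return slo_attainment(probes, max_latency_ms) >= target
```

Latency and reachability appear here only as inputs to the SLI, which is the relationship the next line describes.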
Metrics. Latency, bandwidth, reachability, etc. serve as supporting metrics for the SLI.
Knowing the applications and users of the network is 100% crucial. You cannot calculate SLI or MTBF/MTTR from any other perspective.
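As a rough sketch of the MTBF/MTTR arithmetic, assume outages are recorded as (start, end) pairs from the point of view of the application or user, not the device. The function name and data shape are assumptions for the example. MTTR is taken as total downtime divided by the number of failures, and MTBF as total uptime divided by the number of failures.

```python
from datetime import datetime, timedelta
from typing import List, Tuple

Outage = Tuple[datetime, datetime]  # (start, end) of a user-visible outage


def mttr_and_mtbf(outages: List[Outage], window_start: datetime,
                  window_end: datetime) -> Tuple[timedelta, timedelta]:
    """Return (MTTR, MTBF) for non-overlapping outages inside the window."""
    if not outages:
        raise ValueError("no outages in window; MTTR and MTBF are undefined")
    downtime = sum((end - start for start, end in outages), timedelta())
    uptime = (window_end - window_start) - downtime
    failures = len(outages)
    return downtime / failures, uptime / failures
```

The important design choice is what counts as an outage. If the list is built from interface flaps and not from what the application experienced, the numbers will be precise but will not describe the service.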
Everything is measured, and everything is actionable, either by humans or by machines.
Understanding triggers from #2 and tying them to workflows in #1.
Building a culture of seeing a "new" issue and creating auto-remediation for it if possible.
The aim is to never log into a device to remediate issues.
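Tying triggers to workflows can be as simple as a lookup from event type to handler, with a human escalation path for anything not yet automated. The sketch below is a toy, in-process version of that idea. The event fields, the `bgp_peer_down` trigger name and the handler are all hypothetical, and a real setup would more likely use an event-driven platform such as StackStorm than a Python dict.

```python
from typing import Any, Callable, Dict

Event = Dict[str, Any]
Handler = Callable[[Event], str]

# Registry mapping a trigger name to the workflow that remediates it.
REMEDIATIONS: Dict[str, Handler] = {}


def remediation(trigger: str) -> Callable[[Handler], Handler]:
    """Decorator that registers a workflow for a given trigger name."""
    def register(func: Handler) -> Handler:
        REMEDIATIONS[trigger] = func
        return func
    return register


@remediation("bgp_peer_down")
def reset_bgp_session(event: Event) -> str:
    # Placeholder: a real workflow would call a device or controller API here.
    return f"requested soft reset of {event['peer']} on {event['device']}"


def handle(event: Event) -> str:
    """Run the registered workflow, or escalate if none exists yet."""
    handler = REMEDIATIONS.get(event["type"])
    if handler is None:
        return f"no auto-remediation for {event['type']}; escalating to a human"
    return handler(event)
```

The escalation branch is where the culture part comes in: every event that lands there is a "new" issue and a candidate for its own registered workflow.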
Tests that make assertions about how the network is working, and integrating this into the automation on top
- Network Testing (config linting, operational assertions like "rtr1 must have 3 bgp peers up")
- Application Testing (application performance over the network)
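The operational assertion quoted above ("rtr1 must have 3 bgp peers up") might look like this as a test. It is a minimal sketch that assumes the BGP state has already been collected into a dict of peer address to session state. How that dict gets filled (CLI scraping, NETCONF, gNMI, a vendor API) is deliberately left out, and the function names are made up for the example.

```python
from typing import Dict


def established_peers(bgp_summary: Dict[str, str]) -> int:
    """Count peers whose session state is Established (case-insensitive)."""
    return sum(1 for state in bgp_summary.values()
               if state.lower() == "established")


def check_bgp_peers(device: str, bgp_summary: Dict[str, str], expected: int) -> None:
    """Raise AssertionError unless exactly `expected` peers are up."""
    up = established_peers(bgp_summary)
    if up != expected:
        raise AssertionError(
            f"{device}: expected {expected} BGP peers up, found {up}"
        )


if __name__ == "__main__":
    # Hypothetical collected state for rtr1.
    summary = {
        "10.0.0.1": "Established",
        "10.0.0.2": "Established",
        "10.0.0.3": "Established",
    }
    check_bgp_peers("rtr1", summary, expected=3)
    print("rtr1 BGP assertion passed")
```

Because the check raises AssertionError, it can be run by a test runner such as pytest or called directly as a gate before and after a change.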
- Cloud-Native Applications
- Cloud Networking
- Distributed systems and applications, and their deployment models
Distributed applications and systems are becoming the new norm. NREs need to be extremely familiar with Layer 7 and understand that resources are ephemeral and move often.
- Continuous Integration
- Continuous Delivery
- Continuous Improvement
- Canary Deployments (part of CD?) - making changes in a limited scope as a test, and rolling out slowly - all automated.
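A canary rollout for a network change could be sketched as follows: apply the change to a growing fraction of devices, check health after each stage, and stop as soon as anything looks wrong. The stage fractions, the function name, and the `apply_change` / `is_healthy` callbacks are assumptions for illustration. Rollback is intentionally not shown.

```python
import math
from typing import Callable, List, Sequence, Tuple


def canary_rollout(
    devices: Sequence[str],
    apply_change: Callable[[str], None],
    is_healthy: Callable[[str], bool],
    stages: Sequence[float] = (0.1, 0.5, 1.0),
) -> Tuple[List[str], List[str]]:
    """Roll a change out in stages; return (changed, unhealthy).

    Each stage extends the rollout to a cumulative fraction of `devices`.
    After each stage every changed device is health-checked, and the
    rollout halts at the first stage that reports unhealthy devices.
    """
    changed: List[str] = []
    for fraction in stages:
        target = min(len(devices), math.ceil(len(devices) * fraction))
        for device in devices[len(changed):target]:
            apply_change(device)
            changed.append(device)
        unhealthy = [d for d in changed if not is_healthy(d)]
        if unhealthy:
            return changed, unhealthy
    return changed, []
```

The health check is where the network and application tests from the testing section would plug in, which is part of why canarying fits naturally alongside CD.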
I think there's a lot that can be done for networking here, but there's not a lot of tooling or trust that this is even a good idea today. Lots of work needs to be done here to get networks ready for Chaos Testing.