Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add standard for provider networks #572

Open
wants to merge 21 commits into
base: main
Choose a base branch
from
Open

Conversation

kgube
Copy link

@kgube kgube commented Apr 19, 2024

This is a WIP for a standard covering provider networks and public IP allocation.
SovereignCloudStack/issues#166
SovereignCloudStack/issues#167

@kgube kgube marked this pull request as draft April 19, 2024 13:03
@kgube
Copy link
Author

kgube commented Apr 19, 2024

We should probably just focus the standard on external provider networks/subnets, because that is ultimately what facilitates public IP networking, as SovereignCloudStack/issues#167 and SovereignCloudStack/issues#166 are calling for.

I have been trying to find a way to meaningfully incorporate the other suggestions from #522 (and the comments), like OVN/L3-router-HA and API-extensions/plugins and port security and policies, but they are either orthogonal to the issues (like the HA-topic), or follow from the provider-network requirements (like port security and, at least partially, the API-extensions), or should be discussed somewhere else more broadly (also the HA topic, and policies, specifically codification of/deviation from default policies).

@kgube kgube changed the title Add standard for networking Add standard for provider networks May 3, 2024
@kgube kgube marked this pull request as ready for review June 5, 2024 08:46
@kgube kgube requested review from horazont and matfechner June 5, 2024 08:52
Copy link
Member

@horazont horazont left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall, this looks good to me. There are a few issues and open questions for me, which I noted below.

| RBAC | Role-based Access Control: A mechanism in the Network API to give projects limited access to resources owned by other projects. Typically used by CSPs to create provider networks. |
| Shared Network | Virtual network that is shared between projects in a way that allows direct attachment of servers. |
| External Network | Virtual network that is shared between projects in a way that only allows virtual routers to use it as external gateway. Typically used by CSPs to provide access to networks outside of the cloud environment. |
| Provider Network | Any shared or external network that is managed by the CSP. |
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OpenStack provider networks have a different property, namely that they're tied to non-virtualised networks. I think that property is relevant enough to include it in this definition.

(Or are we using the term "provider network" in a different sense here?)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In many places, including some Openstack documentation, I found the term provider network to be used (interchangeably with "external network") to mean CSP-managed networks that are tied to non-virtualized networks, because that is usually what they are used for.

I was looking at this from more of an RBAC-perspective though, where any network, regardless of it's provider:network_type and provider:physical_network can have an access_as_external rule, and where ownership and RBAC rules ultimately dictate how a network can be used in a project. In that context, it made more sense to me to define provider networks literally as "networks managed by the provider".
This will still cover the classical use-case of bridging into non-virtualized networks, but also use-cases like a CSP-managed shared network.

It's still a deviation from common Openstack terminology of course, and I can try rewording it to stay closer to that.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In that context, it made more sense to me to define provider networks literally as "networks managed by the provider".

I agree. My main problem with the definition as written above is that the "managed by the CSP" gets overpowered by the "shared or external" attributes, which make me read that definition as if the key point was that the network was "shared or external", not "managed by the CSP".

Maybe: "A network managed by the CSP, typically connected to non-virtualized infrastructure and commonly made external or shared."

Copy link
Author

@kgube kgube Jun 10, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The network being shared or external is kinda essential though, as that is what actually makes it a provider network to tenants, otherwise it's just invisible to them and might as well be non-existent.

I have changed it now to "A CSP-managed virtual network made available to projects as either shared or external, typically connected to non-virtualized infrastructure."


The Network API's Role Based Access Control (RBAC) extension can then be used to share it with other projects.
RBAC rules for networks support the two actions `access_as_external` and `access_as_shared`, and can be created automatically on `openstack network create` with the options `--external` and `--share`.
* `access_as_external` allows networks to be used as external gateway for virtual routers in the target projects. Such networks are in the following referred to as _external networks_.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another notable thing access_as_external allows is creation of floating IPs

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have added a mention of that.

Standards/scs-xxxx-v1-provider-network-standard.md Outdated Show resolved Hide resolved
Standards/scs-xxxx-v1-provider-network-standard.md Outdated Show resolved Hide resolved
@kgube
Copy link
Author

kgube commented Jun 6, 2024

The was a short discussion in the community call today on the question of mandatory internet access vs. air-gapped infrastructure that was raised in the IaaS call yesterday.
The feedback was that we also want to support standardization of private and air-gaped clouds, and thus we should not mandate internet-access. We should however find a way to communicate to users whether internet access is available.

I will try to reword the standard to reflect this.

@kgube kgube force-pushed the 166-167-networking-standard branch from 9541a83 to 1fc89c0 Compare June 10, 2024 11:01
@kgube
Copy link
Author

kgube commented Jun 20, 2024

I have been trying to extend this standard to also apply to private and air-gaped clouds, and it turns out that this is somewhat difficult.
A lot of the decision in the standard follow from requirements and limitations of public IP address allocation, which may not apply in a more isolated cloud.

The approach that I'm currently trying, is to make it more conditional and discuss specific use-cases, like:

  • If the CSP offers provider networks, there SHOULD only be one network available to projects per default, which will be called the default provider network. (For an air-gaped cloud, not having provider networks at all might be a valid choice)
  • If the CSP allows allocation of public IP addresses, they MUST be allocated and routed via the default provider network. (This would cover the current focus of the standard, while still allowing e.g. a private cloud behind a NAT)
  • If the CSP allows users network-access to their servers, this MUST happen via the default provider network. (This would cover the case where users in a private cloud can access their servers via private IPs, e.g. via VPN)

I'm still working out the details though.

An alternative would of course be to scope the standard strictly to clouds with public IPs.

@kgube kgube force-pushed the 166-167-networking-standard branch from e115ebf to 76601c2 Compare July 16, 2024 08:31
@kgube
Copy link
Author

kgube commented Jul 25, 2024

Finding a common ground regarding provider networks between public, private, and specifically air-gaped clouds is hard, even with individually scoped rules. There is some overlap, but the requirements are very different.

I have now given up on that and made some changes to the draft to limit the standard to clouds that provide public IPs.
This still has some issues, though, like the awkwardness of mandating public IPv6 support when the overall support of public IP addresses is not mandatory.

@mbuechse
Copy link
Contributor

@kgube Thanks for all this impressive work! I see that this is a difficult situation. Please do go into the meetings (Team IaaS) and raise awareness as soon as possible. Please try to get people to go into an in-depth break-out session with you.

@matfechner
Copy link
Contributor

@kgube overall it looks good to me, maybe it makes sense, besides neutron-dynamic-routing, to refer also to ovn-bgp-agent. because OVN will perhaps take over this task in the future

@kgube kgube force-pushed the 166-167-networking-standard branch from da9ec1a to 364c01a Compare August 27, 2024 13:27
@kgube kgube force-pushed the 166-167-networking-standard branch from 364c01a to 2e06a86 Compare August 27, 2024 13:33
Copy link
Contributor

@matfechner matfechner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Standards/scs-xxxx-v1-provider-network-standard.md Outdated Show resolved Hide resolved
| DHCP | Dynamic Host Configuration Protocol: Used to automatically configure hosts in a network with IP addresses, default routes, and other information such as DNS servers. |
| Prefix | IP address prefix of a given bit-length N, written as _ADDRESS/N_. Divides addresses into a network and a host part, a shorter prefix allows more hosts but takes up more address space. |
| NAT | Network Address Translation: mapping (usually public) IPv4 addresses onto a different (usually private) address space. May allows multiple hosts to share a public address by multiplexing TCP/UDP ports. |
| RBAC | Role-based Access Control: A mechanism in the Network API to give projects limited access to resources owned by other projects. Typically used by CSPs to create provider networks. |
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This maybe a little bit nitpicking:
When I think of RBAC in OpenStack, I first think about user roles and access to projects. While we also have access to projects here, it has nothing to do with user roles and is a very Neutron specific thing.
Could we maybe change the wording a bit? Maybe to "network RBAC" like in the openstack CLI command?

Also this would have to be adapted overall in this standard.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I changed it to "Networking RBAC" now, because Networking is the name of the API, and there is less chance for confusion when talking about RBAC rules for networks.


## Conformance Tests

(TBD, current requirements should mostly be testable by API. Testing external routing is more tricky and will require external testing infrastructure of some sort)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will review again, when tests are implemented.

@kgube
Copy link
Author

kgube commented Sep 12, 2024

While working on the conformance tests I stumbled on Openstacks network auto-allocation feature:
https://docs.openstack.org/neutron/latest/admin/config-auto-allocation.html

I'm not sure how I missed that previously, but I am going to update the standard to conform to the auto-allocation requirements.

This is great, actually, because it will greatly simplify conformance in the presence of multiple provider networks or subnet pools, which is currently discouraged but allowed: we can just mandate the SCS default resources to be marked as exclusive auto-allocation defaults.

CSPs **MUST** externally route any public IP addresses allocated from subnets of the standard provider network.
CSPs **MUST** provide dynamic routing for all project-allocated public IP-prefixes via the standard provider network.

By default, users **SHOULD** be prohibited by policy from creating Networking RBAC rules, to prevent the creation of faux provider networks.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add an implementation note that shows the necessary policy change?

@kgube
Copy link
Author

kgube commented Oct 7, 2024

There haven't been much updates here, even though I have continuously been trying to build compliance tests for this.
I have been restructuring the code back and forth, without ever really reaching a runnable state (and I don't like pushing code that is not at least able to run). Some aspects of the standard that are just hard to test, some of the requirements are too loosely defined, some are kinda circular, and some I am just personally unhappy with.

I will try to identify those issues and approach them more systematically:

The scope of the standard.

  • Currently, the requirements are limited to "CSPs [that] offer public IP addresses to projects". However, whether a CSP is supposed to be offering public IP addresses is impossible to determine for the compliance tests.
  • The current wording would also include private clouds that are offering public IPs for some selected projects, and would force them to offer public IPs to all of their projects.
  • Standardized IP allocation would also be useful for private clouds, offering private IPs reachable via VPN, which are currently excluded from the standard. This would also aid the development of compliance tests, as they could be run against a private test environment.

I now think that the better approach is to not generally mandate public IP addresses, but to define external IP addresses as addresses reachable by a user with access to the openstack API, such that e.g. automated tooling can talk directly to both API and VMs. CSPs offering public cloud services can then specifically be required to provide public IPs as external IPs. This will make the standard more useful and allow easier compliance testing, at least for the basic requirements.
The requirement for public IP addresses could be enabled by an optional CLI flag or environment variable for the test script.

Support for multiple provider networks and/or multiple shared subnet pools per IP version.

  • The standard does not forbid this (though its not recommended), and I'm not sure we should forbid this. However, compliance-testing would mean potentially trying multiple subnet-pool/provider-network combinations until we find one that meets our requirements.
  • This is mostly a hypothetical scenario, but if the standard allows it, the tests should be able to cover it. To test this scenario, we need to build the tests around this scenario, it is not something we can easily add later.
  • I tried adding a requirement for setting the is_default flags of the network auto-allocation feature on the mandated resources to circumvent this, but when discussing this in last week's IaaS call got the suggestion to leave that as a recommendation until the SCS reference implementation has support for this and providers have gotten a chance to test it.

There may be other parts of the standard that are not yet part of the reference implementation and may need a grace period (IPv6 dynamic routing?). It is much more useful to keep the requirement of the is_default flag in the standard and treat the whole standard as a recommendation until it has been properly verified. This is especially true because all of the requirements for auto-allocation (at least for IPv6) are already mandated by the standard, the only addition really is the flag.
So in the interest of getting done with the tests, that is how I will proceed.

Compliance tests are limited to the project perspective.

  • The Standard makes some global requirements, like the single standard provider network available to all projects, but the compliance tests can only check it from the perspective of a single project and don't know if it is actually available to all projects.
  • This is not a big problem, in fact it probably makes no difference to users if their standard provider network is different from the one of other users, as long as it performs the same function. But It might be cleaner if that was reflected in the wording of the standard.

This is not a huge issue and does not actually affect the tests themselves (because this is the only way to do them, in this regard).
I will approach this if I have time for it, and only when I'm done with the compliance tests.

Limited feedback on the standard.

  • Not really an issue of the standard itself, but something that has made me very cautious in working on the tests has been the limited feedback from CSPs on the standard.
  • Some of the decisions have been contentious in previous discussions, such as making IPv6 mandatory and IPv4 only recommended. It is not unlikely that parts of the requirements will change later, and the compliance tests with them.

There isn't really much I can do about this. I will try to build the compliance tests somewhat modular, but not more than aids general readability. There is no point in considering potential future test scenarios if we aren't there yet, and refactoring will be easier than preparing, in this case.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants