[DPE-3661] Add support for large deployments backup (#248)
Implements [DPE-3661](https://warthogs.atlassian.net/browse/DPE-3661):
extend backup feature to support large deployment scenarios.

Currently, there are four types of scenarios:
1) Small deployments: the cluster performs all the different types of
node roles
2) Large deployments - orchestrator: the app is in charge not only of
its own application units but also of coordinating across the
different clusters
3) Large deployments - failover orchestrator: very similar to (2), but
this app must also publish its information in the peer relation,
although all the clusters will only listen to the active manager
4) Large deployments - data only: performs no management tasks and
receives any relevant information via the peer relation

For backups, clusters of type (3) and (4) have a special behavior: they
will receive the backup data via the peer-cluster relation and should
refuse: i. to execute backup-related actions; and ii. to process the
s3-relation events themselves. The latter avoids confusion, e.g. a
user inadvertently relating the cluster to a different s3-integrator.

The implementations of (1) and (2) are very similar.
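The refusal rule for types (3) and (4) can be sketched in plain Python. This is an illustrative stand-in only: the `DeploymentType` values and the handler mirror the charm's behavior described above, but the names and signature are simplified, not the charm's actual API.

```python
# Illustrative sketch only -- names mirror, but are not, the charm's real API.
from enum import Enum

class DeploymentType(Enum):
    MAIN_ORCHESTRATOR = "main-orchestrator"          # cases (1) and (2)
    FAILOVER_ORCHESTRATOR = "failover-orchestrator"  # case (3)
    DATA_ONLY = "data-only"                          # case (4)

def handle_backup_action(typ: DeploymentType) -> str:
    # Types (3) and (4) refuse backup-related actions outright; only the
    # (main) orchestrator executes them.
    if typ is not DeploymentType.MAIN_ORCHESTRATOR:
        return "Failed: execute the action on the orchestrator cluster instead."
    return "ok: action accepted"
```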

It contains the same fix as:
#253

Adds the following fixes related to testing in general:
1. `ContinuousWrites` is updated to hold the right count of documents in
`writes_value` internally
2. Adds an `is_burst` option to `ContinuousWrites`: a test may choose to
send docs in 100-document bursts vs. doc-by-doc - `is_burst` defaults
to `True`
3. `ContinuousWrites` terminates its process as part of `stop`, avoiding
a stranded process generating docs to `ContinuousWrites.INDEX_NAME`
after a given test
4. `start_and_check_continuous_writes` is renamed to
`assert_start_and_check_continuous_writes`

# Implementation Details

For developers, there is no meaningful difference between small and
large deployments.
They both use the same backup_factory() to return the correct object for
their case.

Large deployments expand the original concept of OpenSearchBackup to
include other juju applications that are not the cluster_manager. This
means a cluster may be data-only or even a failover cluster-manager and
still interact with the s3-integrator at a certain level.

The baseline is that every unit in the cluster must import the S3
credentials. The main
orchestrator will share these credentials via the peer-cluster relation.
Failover and data
clusters will import that information from the peer-cluster relation.
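The credential flow above can be sketched with plain dicts standing in for Juju relation databags (databags are string-to-string maps, so the credentials travel as JSON). The dataclass below mirrors the `S3RelDataCredentials` model and the `s3_credentials` key added in this diff, but is a simplified stand-in, not the charm's code.

```python
import json
from dataclasses import dataclass, asdict
from typing import Optional

# Simplified stand-in for the S3RelDataCredentials model added in this commit.
@dataclass
class S3RelDataCredentials:
    access_key: str
    secret_key: str

# Key under which the credentials travel on the peer-cluster relation (per the diff).
PEER_CLUSTER_S3_CONFIG_KEY = "s3_credentials"

def publish(databag: dict, creds: S3RelDataCredentials) -> None:
    # Main orchestrator side: serialize credentials into the peer-cluster databag.
    databag[PEER_CLUSTER_S3_CONFIG_KEY] = json.dumps(asdict(creds))

def consume(databag: dict) -> Optional[S3RelDataCredentials]:
    # Failover / data-only side: read credentials back, if published yet.
    raw = databag.get(PEER_CLUSTER_S3_CONFIG_KEY)
    return S3RelDataCredentials(**json.loads(raw)) if raw else None
```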

To implement the points above without causing too much disruption to the
existing code, a factory pattern has been adopted: the main charm
receives an OpenSearchBackupBase object that corresponds to its own case
(cluster-manager, failover, data, etc).
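A stripped-down mirror of the `backup()` factory added in `opensearch_backups.py` in this commit. The dispatch logic follows the diff; the class bodies are stubbed out and the real functions take the charm object rather than a bare type string.

```python
from typing import Optional

MAIN_ORCHESTRATOR = "main-orchestrator"

class OpenSearchBackupBase:
    """Defers everything until the deployment description exists."""

class OpenSearchNonOrchestratorClusterBackup(OpenSearchBackupBase):
    """Failover / data-only clusters: S3 config arrives via peer-cluster."""

class OpenSearchBackup(OpenSearchBackupBase):
    """Main orchestrator (covers small deployments too): owns the s3 relation."""

def backup(deployment_type: Optional[str]) -> OpenSearchBackupBase:
    # Dispatch condensed from the backup() factory in this commit.
    if deployment_type is None:
        # No deployment description yet: base class defers all s3 events.
        return OpenSearchBackupBase()
    if deployment_type == MAIN_ORCHESTRATOR:
        return OpenSearchBackup()
    return OpenSearchNonOrchestratorClusterBackup()
```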
"""


[DPE-3661]:
https://warthogs.atlassian.net/browse/DPE-3661?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ
phvalguima authored May 2, 2024
1 parent c2c19f2 commit 3a85973
Showing 14 changed files with 616 additions and 100 deletions.
4 changes: 4 additions & 0 deletions lib/charms/opensearch/v0/constants_charm.py
@@ -64,6 +64,10 @@
)
PluginConfigError = "Unexpected error during plugin configuration, check the logs"
BackupSetupFailed = "Backup setup failed, check logs for details"
S3RelMissing = "Backup failover cluster missing S3 relation."
S3RelShouldNotExist = "This unit should not be related to S3"
S3RelDataIncomplete = "S3 relation data missing or incomplete."
S3RelUneligible = "Only orchestrator clusters should relate to S3."

# Wait status
RequestUnitServiceOps = "Requesting lock on operation: {}"
8 changes: 8 additions & 0 deletions lib/charms/opensearch/v0/models.py
@@ -202,13 +202,21 @@ def set_promotion_time(cls, values):  # noqa: N805
return values


class S3RelDataCredentials(Model):
"""Model class for credentials passed on the PCluster relation."""

access_key: str
secret_key: str


class PeerClusterRelDataCredentials(Model):
"""Model class for credentials passed on the PCluster relation."""

admin_username: str
admin_password: str
admin_password_hash: str
admin_tls: Dict[str, Optional[str]]
s3: Optional[S3RelDataCredentials]


class PeerClusterRelData(Model):
217 changes: 203 additions & 14 deletions lib/charms/opensearch/v0/opensearch_backups.py
@@ -3,8 +3,14 @@

"""OpenSearch Backup.
This file holds the implementation of the OpenSearchBackup class, as well as the state enum
and configuration.
This library holds the implementation of the OpenSearchBackup class, as well as the state enum
and configuration. It contains all the components for both small and large deployments.
###########################################################################################
#
# Small deployments
#
###########################################################################################
The OpenSearchBackup class listens to both relation changes from S3_RELATION and API calls
and responses. The OpenSearchBackupPlugin holds the configuration info. The classes together
@@ -41,12 +47,32 @@
class OpenSearchBaseCharm(CharmBase):
def __init__(...):
...
self.backup = OpenSearchBackup(self)
self.backup = OpenSearchBackupFactory(self)
###########################################################################################
#
# Large deployments
#
###########################################################################################
For developers, there is no meaningful difference between small and large deployments.
They both use the same backup_factory() to return the correct object for their case.
Large deployments expand the original concept of OpenSearchBackup to include other
juju applications that are not the cluster_manager. This means a cluster may be data-only or
even a failover cluster-manager and still interact with the s3-integrator at a certain level.
The baseline is that every unit in the cluster must import the S3 credentials. The main
orchestrator will share these credentials via the peer-cluster relation. Failover and data
clusters will import that information from the peer-cluster relation.
To implement the points above without causing too much disruption to the existing code,
a factory pattern has been adopted, where the main charm receives an
object that corresponds to its own case (cluster-manager, failover, data, etc).
"""

import json
import logging
import typing
from datetime import datetime
from typing import Any, Dict, List, Optional, Set, Tuple

@@ -58,26 +84,31 @@ def __init__(...):
BackupInDisabling,
BackupSetupFailed,
BackupSetupStart,
PeerClusterRelationName,
PluginConfigError,
RestoreInProgress,
S3RelMissing,
S3RelShouldNotExist,
)
from charms.opensearch.v0.helper_cluster import ClusterState, IndexStateEnum
from charms.opensearch.v0.helper_enums import BaseStrEnum
from charms.opensearch.v0.models import DeploymentType, PeerClusterRelData
from charms.opensearch.v0.opensearch_exceptions import (
OpenSearchError,
OpenSearchHttpError,
OpenSearchNotFullyReadyError,
)
from charms.opensearch.v0.opensearch_locking import OpenSearchNodeLock
from charms.opensearch.v0.opensearch_plugins import OpenSearchBackupPlugin, PluginState
from ops.charm import ActionEvent
from charms.opensearch.v0.opensearch_plugins import (
OpenSearchBackupPlugin,
OpenSearchPluginConfig,
PluginState,
)
from ops.charm import ActionEvent, CharmBase
from ops.framework import EventBase, Object
from ops.model import BlockedStatus, MaintenanceStatus, WaitingStatus
from tenacity import RetryError, Retrying, stop_after_attempt, wait_fixed

if typing.TYPE_CHECKING:
from charms.opensearch.v0.opensearch_base_charm import OpenSearchBaseCharm

# The unique Charmhub library identifier, never change it
LIBID = "d301deee4d2c4c1b8e30cd3df8034be2"

@@ -94,6 +125,7 @@ def __init__(...):
# OpenSearch Backups
S3_RELATION = "s3-credentials"
S3_REPOSITORY = "s3-repository"
PEER_CLUSTER_S3_CONFIG_KEY = "s3_credentials"


S3_REPO_BASE_PATH = "/"
@@ -153,15 +185,140 @@ class BackupServiceState(BaseStrEnum):
SNAPSHOT_FAILED_UNKNOWN = "snapshot failed for unknown reason"


class OpenSearchBackup(Object):
class OpenSearchBackupBase(Object):
"""Works as parent for all backup classes.
    This class provides a smooth transition between orchestrator and non-orchestrator clusters.
"""

def __init__(self, charm: Object, relation_name: str = PeerClusterRelationName):
"""Initializes the opensearch backup base.
        This class will not hold an s3_client object, as it is not intended to actually
        manage the relation beyond waiting for the deployment description.
"""
super().__init__(charm, relation_name)
self.charm = charm

for event in [
self.charm.on[S3_RELATION].relation_created,
self.charm.on[S3_RELATION].relation_joined,
self.charm.on[S3_RELATION].relation_changed,
self.charm.on[S3_RELATION].relation_departed,
self.charm.on[S3_RELATION].relation_broken,
]:
self.framework.observe(event, self._on_s3_relation_event)
for event in [
self.charm.on.create_backup_action,
self.charm.on.list_backups_action,
self.charm.on.restore_action,
]:
self.framework.observe(event, self._on_s3_relation_action)

def _on_s3_relation_event(self, event: EventBase) -> None:
"""Defers the s3 relation events."""
logger.info("Deployment description not yet available, deferring s3 relation event")
event.defer()

def _on_s3_relation_action(self, event: EventBase) -> None:
"""No deployment description yet, fail any actions."""
logger.info("Deployment description not yet available, failing actions.")
event.fail("Failed: deployment description not yet available")


class OpenSearchNonOrchestratorClusterBackup(OpenSearchBackupBase):
"""Simpler implementation of backup relation for non-orchestrator clusters.
    In a nutshell, non-orchestrator clusters should receive the backup information via the
    peer-cluster relation instead, and must fail any action or major s3-relation event.
"""

def __init__(self, charm: Object, relation_name: str = PeerClusterRelationName):
"""Manager of OpenSearch backup relations."""
super().__init__(charm, relation_name)
self.framework.observe(
self.charm.on[PeerClusterRelationName].relation_changed,
self._on_peer_relation_changed,
)
self.framework.observe(
self.charm.on[S3_RELATION].relation_broken, self._on_s3_relation_broken
)

def _on_peer_relation_changed(self, event) -> None:
"""Processes the non-orchestrator cluster events."""
if not self.charm.plugin_manager.check_plugin_manager_ready():
logger.warning("s3-changed: cluster not ready yet")
event.defer()
return

if not (data := event.relation.data.get(event.app)):
return
data = PeerClusterRelData.from_str(data["data"])
s3_credentials = data.credentials.s3
if not s3_credentials or not s3_credentials.access_key or not s3_credentials.secret_key:
# Just abandon this event, as the relation is not fully ready yet
return

# https://github.com/canonical/opensearch-operator/issues/252
# We need the repository-s3 to support two main relations: s3 OR peer-cluster
# Meanwhile, create the plugin manually and apply it
try:
plugin = OpenSearchPluginConfig(
secret_entries_to_del=[
"s3.client.default.access_key",
"s3.client.default.secret_key",
],
)
self.charm.plugin_manager.apply_config(plugin)
except OpenSearchError as e:
logger.warning(
f"s3-changed: failed disabling with {str(e)}\n"
"repository-s3 maybe it was not enabled yet"
)
# It must be able to enable the plugin
try:
plugin = OpenSearchPluginConfig(
secret_entries_to_add={
"s3.client.default.access_key": s3_credentials.access_key,
"s3.client.default.secret_key": s3_credentials.secret_key,
},
)
self.charm.plugin_manager.apply_config(plugin)
except OpenSearchError as e:
self.charm.status.set(BlockedStatus(S3RelMissing))
# There was an unexpected error, log it and block the unit
logger.error(e)
event.defer()
return
self.charm.status.clear(S3RelMissing)

def _on_s3_relation_event(self, event: EventBase) -> None:
"""Processes the non-orchestrator cluster events."""
self.charm.status.set(BlockedStatus(S3RelShouldNotExist), app=True)
logger.info("Non-orchestrator cluster, abandon s3 relation event")
return

def _on_s3_relation_broken(self, event: EventBase) -> None:
"""Processes the non-orchestrator cluster events."""
self.charm.status.clear(S3RelMissing)
self.charm.status.clear(S3RelShouldNotExist, app=True)
logger.info("Non-orchestrator cluster, abandon s3 relation event")
return

def _on_s3_relation_action(self, event: EventBase) -> None:
"""Deployment description available, non-orchestrator, fail any actions."""
event.fail("Failed: execute the action on the orchestrator cluster instead.")


class OpenSearchBackup(OpenSearchBackupBase):
"""Implements backup relation and API management."""

def __init__(self, charm: "OpenSearchBaseCharm"):
def __init__(self, charm: Object, relation_name: str = S3_RELATION):
"""Manager of OpenSearch backup relations."""
super().__init__(charm, S3_RELATION)
self.charm = charm
super().__init__(charm, relation_name)
self.s3_client = S3Requirer(self.charm, relation_name)

# s3 relation handles the config options for s3 backups
self.s3_client = S3Requirer(self.charm, S3_RELATION)
self.framework.observe(self.charm.on[S3_RELATION].relation_created, self._on_s3_created)
self.framework.observe(self.charm.on[S3_RELATION].relation_broken, self._on_s3_broken)
self.framework.observe(
@@ -171,6 +328,17 @@ def __init__(self, charm: "OpenSearchBaseCharm"):
self.framework.observe(self.charm.on.list_backups_action, self._on_list_backups_action)
self.framework.observe(self.charm.on.restore_action, self._on_restore_backup_action)

def _on_s3_relation_event(self, event: EventBase) -> None:
"""Overrides the parent method to process the s3 relation events, as we use s3_client.
        We run the peer cluster orchestrator's refresh whenever new s3 information arrives.
"""
self.charm.peer_cluster_provider.refresh_relation_data(event)

def _on_s3_relation_action(self, event: EventBase) -> None:
"""Just overloads the base method, as we process each action in this class."""
pass

@property
def _plugin_status(self):
return self.charm.plugin_manager.get_plugin_status(OpenSearchBackupPlugin)
@@ -822,3 +990,24 @@ def get_snapshot_status(self, response: Dict[str, Any] | None) -> BackupServiceState:
if "FAILED" in r_str:
return BackupServiceState.SNAPSHOT_FAILED_UNKNOWN
return BackupServiceState.SUCCESS


def backup(charm: CharmBase) -> OpenSearchBackupBase:
"""Implements the logic that returns the correct class according to the cluster type.
    This function is solely responsible for the creation of the correct S3 client manager.
    If this cluster is the main orchestrator, then return the OpenSearchBackup.
    Otherwise, return the OpenSearchNonOrchestratorClusterBackup.
    There is also the condition where the deployment description does not exist yet. In this
    case, return the base class OpenSearchBackupBase, which solely defers all s3-related
    events until the deployment description is available and the actual S3 object is allocated.
"""
if not charm.opensearch_peer_cm.deployment_desc():
        # Temporary condition: we are waiting for the CM to show up and define which
        # type of cluster this is. Once we have that defined, we will process events.
return OpenSearchBackupBase(charm)
elif charm.opensearch_peer_cm.deployment_desc().typ == DeploymentType.MAIN_ORCHESTRATOR:
return OpenSearchBackup(charm)
return OpenSearchNonOrchestratorClusterBackup(charm)
26 changes: 19 additions & 7 deletions lib/charms/opensearch/v0/opensearch_base_charm.py
@@ -54,7 +54,7 @@
generate_password,
)
from charms.opensearch.v0.models import DeploymentDescription, DeploymentType
from charms.opensearch.v0.opensearch_backups import OpenSearchBackup
from charms.opensearch.v0.opensearch_backups import backup
from charms.opensearch.v0.opensearch_config import OpenSearchConfig
from charms.opensearch.v0.opensearch_distro import OpenSearchDistribution
from charms.opensearch.v0.opensearch_exceptions import (
@@ -89,7 +89,10 @@
from charms.opensearch.v0.opensearch_relation_provider import OpenSearchProvider
from charms.opensearch.v0.opensearch_secrets import OpenSearchSecrets
from charms.opensearch.v0.opensearch_tls import OpenSearchTLS
from charms.opensearch.v0.opensearch_users import OpenSearchUserManager
from charms.opensearch.v0.opensearch_users import (
OpenSearchUserManager,
OpenSearchUserMgmtError,
)
from charms.tls_certificates_interface.v3.tls_certificates import (
CertificateAvailableEvent,
)
@@ -208,7 +211,7 @@ def __init__(self, *args, distro: Type[OpenSearchDistribution] = None):
)

self.plugin_manager = OpenSearchPluginManager(self)
self.backup = OpenSearchBackup(self)
self.backup = backup(self)

self.user_manager = OpenSearchUserManager(self)
self.opensearch_provider = OpenSearchProvider(self)
@@ -828,7 +831,16 @@ def _start_opensearch(self, event: _StartOpenSearch) -> None:  # noqa: C901
if self.opensearch.is_started():
try:
self._post_start_init(event)
except (OpenSearchHttpError, OpenSearchNotFullyReadyError):
except (
OpenSearchHttpError,
OpenSearchNotFullyReadyError,
):
event.defer()
except OpenSearchUserMgmtError as e:
                # Either generic start failure or cluster is not ready to create the internal users
logger.warning(e)
self.node_lock.release()
self.status.set(BlockedStatus(ServiceStartError))
event.defer()
return

@@ -873,10 +885,10 @@ def _start_opensearch(self, event: _StartOpenSearch) -> None:  # noqa: C901
)
)
self._post_start_init(event)
except (OpenSearchStartTimeoutError, OpenSearchNotFullyReadyError):
except (OpenSearchHttpError, OpenSearchStartTimeoutError, OpenSearchNotFullyReadyError):
event.defer()
except OpenSearchStartError as e:
logger.exception(e)
except (OpenSearchStartError, OpenSearchUserMgmtError) as e:
logger.warning(e)
self.node_lock.release()
self.status.set(BlockedStatus(ServiceStartError))
event.defer()