# [DPE-4931] fix locking with unassigned shards (#387)
## Issue
This PR addresses issues #327 and #324.

When OpenSearch is shutting down, the operator currently does not wait
for data to be moved away from the stopping unit. This can leave shards
unassigned and may cause data loss. If the `.charm_node_lock` index is
affected, the operator can no longer acquire the lock it needs to start
or stop OpenSearch, which shows up as `503` errors in the log.

The behavior can be seen in this CI run, with some additional logging
information added for debugging:
https://github.com/canonical/opensearch-operator/actions/runs/10269420444/job/28415611890?pr=387
```
Shards before relocation: [...{'index': '.charm_node_lock', 'shard': '0', 'prirep': 'p', 'state': 'STARTED', 'docs': '1', 'store': '31.6kb', 'ip': '10.19.29.219', 'node': 'opensearch-0.e42'}, {'index': '.charm_node_lock', 'shard': '0', 'prirep': 'r', 'state': 'STARTED', 'docs': '1', 'store': '8.7kb', 'ip': '10.19.29.239', 'node': 'opensearch-1.e42'}]

Shards after relocation: [... {'index': '.opendistro_security', 'shard': '0', 'prirep': 'p', 'state': 'RELOCATING', 'docs': '10', 'store': '61.5kb', 'ip': '10.19.29.219', 'node': 'opensearch-0.e42 -> 10.19.29.239 yt3jiuSZRTCY8NnoeIni5w opensearch-1.e42'}, {'index': '.charm_node_lock', 'shard': '0', 'prirep': 'p', 'state': 'STARTED', 'docs': '1', 'store': '31.6kb', 'ip': '10.19.29.219', 'node': 'opensearch-0.e42'}]
```
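
For reference, listings like the ones above come from the `_cat/shards`
API. A minimal sketch of how to fetch them outside the charm; the host,
credentials, and `verify=False` are placeholders for a test deployment
with self-signed certificates, not values taken from this PR:
```python
import requests

HOST = "https://10.19.29.219:9200"   # placeholder address of any cluster node
AUTH = ("admin", "admin-password")   # hypothetical credentials

# One entry per shard copy: index, shard number, primary/replica, state, node.
resp = requests.get(f"{HOST}/_cat/shards?format=json", auth=AUTH, verify=False)
resp.raise_for_status()
for shard in resp.json():
    print(shard["index"], shard["shard"], shard["prirep"], shard["state"], shard.get("node"))
```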

Shortly afterwards, the error appears:
```
unit-opensearch-0: 16:25:11 DEBUG unit.opensearch/0.juju-log opensearch-peers:1: https://10.19.29.239:9200 "GET /.charm_node_lock/_source/0 HTTP/1.1" 503 287
unit-opensearch-0: 16:25:11 ERROR unit.opensearch/0.juju-log opensearch-peers:1: Error checking which unit has OpenSearch lock
Traceback (most recent call last):
...
File "/var/lib/juju/agents/unit-opensearch-0/charm/lib/charms/opensearch/v0/opensearch_locking.py", line 263, in acquired
    unit = self._unit_with_lock(host)
  File "/var/lib/juju/agents/unit-opensearch-0/charm/lib/charms/opensearch/v0/opensearch_locking.py", line 225, in _unit_with_lock
    document_data = self._opensearch.request(
  File "/var/lib/juju/agents/unit-opensearch-0/charm/lib/charms/opensearch/v0/opensearch_distro.py", line 306, in request
    raise OpenSearchHttpError(
charms.opensearch.v0.opensearch_exceptions.OpenSearchHttpError: HTTP error self.response_code=503
self.response_body={'error': {'root_cause': [{'type': 'no_shard_available_action_exception', 'reason': 'No shard available for [get [.charm_node_lock][0]: routing [null]]'}], 'type': 'no_shard_available_action_exception', 'reason': 'No shard available for [get [.charm_node_lock][0]: routing [null]]'}, 'status': 503}
unit-opensearch-0: 16:25:11 DEBUG unit.opensearch/0.juju-log opensearch-peers:1: Lock to start opensearch not acquired. Will retry next event
```

## Solution
When stopping OpenSearch, the operator should wait for shard relocation
to complete. This should happen right after the stopping unit has been
added to the allocation exclusions. The check must be blocking: the
OpenSearch service must not be stopped until relocation has finished.
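
The allocation exclusion itself is a standard cluster setting: adding
the stopping node to `cluster.routing.allocation.exclude._name` asks
OpenSearch to move its shards onto the remaining nodes. A rough sketch
with placeholder connection details and node name (the charm does this
through its own request helpers, not plain `requests`):
```python
import requests

HOST = "https://10.19.29.219:9200"   # placeholder address of any cluster node
AUTH = ("admin", "admin-password")   # hypothetical credentials
NODE_NAME = "opensearch-0.e42"       # name of the unit that is about to stop

# Excluding the node by name triggers relocation of all shards it holds.
settings = {"transient": {"cluster.routing.allocation.exclude._name": NODE_NAME}}
resp = requests.put(f"{HOST}/_cluster/settings", json=settings, auth=AUTH, verify=False)
resp.raise_for_status()
```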

In the unit logs, the relocation then looks something like this:
```
unit-opensearch-0: 10:07:28 DEBUG unit.opensearch/0.juju-log Shards before relocation: [... {'index': '.plugins-ml-config', 'shard': '0', 'prirep': 'p', 'state': 'STARTED', 'docs': '1', 'store': '3.9kb', 'ip': '10.18.168.228', 'node': 'opensearch-0.ccc'}, ...]

[...]

unit-opensearch-0: 10:07:32 DEBUG unit.opensearch/0.juju-log Shards after relocation: [... {'index': '.plugins-ml-config', 'shard': '0', 'prirep': 'p', 'state': 'STARTED', 'docs': '1', 'store': '3.9kb', 'ip': '10.18.168.28', 'node': 'opensearch-1.ccc'} ...]
```
To check whether shards are still moving, the `_cluster/health` API can
be queried for `relocating_shards`. As long as the value is not `0`, the
stop process is halted. Depending on the amount of data, relocation can
take quite some time, so a maximum waiting time of 15 minutes (90
retries at 10-second intervals) has been added; after that, an error is
raised.
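
Outside the charm, the waiting loop described above can be sketched
roughly as follows (placeholder connection details again; in the charm
it is implemented by `wait_for_shards_relocation()` with a tenacity
`@retry` of 90 attempts at 10-second intervals, as shown in the diff
below):
```python
import time
import requests

HOST = "https://10.19.29.219:9200"   # placeholder address of any cluster node
AUTH = ("admin", "admin-password")   # hypothetical credentials
TIMEOUT = 15 * 60                    # give up after 15 minutes
INTERVAL = 10                        # poll every 10 seconds

deadline = time.monotonic() + TIMEOUT
while True:
    health = requests.get(f"{HOST}/_cluster/health", auth=AUTH, verify=False).json()
    # relocating_shards drops to 0 once all shards have left the excluded node;
    # only then is it safe to actually stop the service.
    if health["relocating_shards"] == 0:
        break
    if time.monotonic() > deadline:
        raise TimeoutError("Shards haven't completed relocating.")
    time.sleep(INTERVAL)
```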
reneradoi authored Aug 9, 2024
1 parent 9b619d4 commit a308db8
Showing 2 changed files with 23 additions and 22 deletions.
`lib/charms/opensearch/v0/opensearch_base_charm.py` (2 additions, 1 deletion):
```diff
@@ -1022,7 +1022,8 @@ def _stop_opensearch(self, *, restart=False) -> None:
         except OpenSearchHttpError:
             logger.debug("Failed to get online nodes, voting and alloc exclusions not added")
 
-        # TODO: should block until all shards move addressed in PR DPE-2234
+        # block until all primary shards are moved away from the unit that is stopping
+        self.health.wait_for_shards_relocation()
 
         # 2. stop the service
         self.opensearch.stop()
```
`lib/charms/opensearch/v0/opensearch_health.py` (21 additions, 21 deletions):
```diff
@@ -112,34 +112,34 @@ def get( # noqa: C901
             logger.error(e)  # means the status was reported as an int (i.e: 503)
             return HealthColors.UNKNOWN
 
-        if status != HealthColors.YELLOW:
-            return status
-
-        try:
-            logger.debug(
-                f"\n\nHealth: {status} -- Shards: {ClusterState.shards(self._opensearch, host, verbose=True)}\n\n"
-            )
-            logger.debug(
-                f"Allocation explanations: {ClusterState.allocation_explain(self._opensearch, host)}\n\n"
-            )
-        except OpenSearchHttpError:
-            pass
-
         # we differentiate between a temp yellow (moving shards) and a permanent
         # one (such as: missing replicas)
-        if response["initializing_shards"] > 0 or response["relocating_shards"] > 0:
+        if status in [HealthColors.GREEN, HealthColors.YELLOW] and (
+            response["initializing_shards"] > 0 or response["relocating_shards"] > 0
+        ):
+            try:
+                logger.debug(
+                    f"\n\nHealth: {status} -- Shards: {ClusterState.shards(self._opensearch, host, verbose=True)}\n\n"
+                )
+                logger.debug(
+                    f"Allocation explanations: {ClusterState.allocation_explain(self._opensearch, host)}\n\n"
+                )
+            except OpenSearchHttpError:
+                pass
             return HealthColors.YELLOW_TEMP
-        return HealthColors.YELLOW
 
-    @retry(stop=stop_after_attempt(15), wait=wait_fixed(5), reraise=True)
+        return status
+
+    @retry(stop=stop_after_attempt(90), wait=wait_fixed(10), reraise=True)
     def wait_for_shards_relocation(self) -> None:
         """Blocking function until the shards relocation completes in the cluster."""
-        if self.get(wait_for_green_first=True) != HealthColors.YELLOW_TEMP:
-            return
+        health = self.get(local_app_only=False)
 
-        # we throw an error because various operations should NOT start while data
-        # is being relocated. Examples are: simple stop, unit removal, upgrade
-        raise OpenSearchHAError("Shards haven't completed relocating.")
+        if health == HealthColors.YELLOW_TEMP:
+            logger.info("Shards still moving before stopping Opensearch.")
+            # we throw an error because various operations should NOT start while data
+            # is being relocated. Examples are: simple stop, unit removal, upgrade
+            raise OpenSearchHAError("Shards haven't completed relocating.")
 
     def _apply_for_app(self, status: str) -> None:
         """Cluster wide / app status."""
```
