# [DPE-4931] fix locking with unassigned shards (#387)
## Issue
This PR addresses issues #327 and #324.

When OpenSearch is shutting down, the operator currently does not wait
for data to be moved away from the stopping unit. This can leave shards
unassigned and may cause data loss. If the `.charm_node_lock` index is
affected, the operator can no longer acquire the lock it needs to start
or stop OpenSearch, which shows up as `503` errors in the log.

The behavior can be seen in this CI run, with some additional logging
information added for debugging:
https://github.com/canonical/opensearch-operator/actions/runs/10269420444/job/28415611890?pr=387
```
Shards before relocation: [...{'index': '.charm_node_lock', 'shard': '0', 'prirep': 'p', 'state': 'STARTED', 'docs': '1', 'store': '31.6kb', 'ip': '10.19.29.219', 'node': 'opensearch-0.e42'}, {'index': '.charm_node_lock', 'shard': '0', 'prirep': 'r', 'state': 'STARTED', 'docs': '1', 'store': '8.7kb', 'ip': '10.19.29.239', 'node': 'opensearch-1.e42'}]

Shards after relocation: [... {'index': '.opendistro_security', 'shard': '0', 'prirep': 'p', 'state': 'RELOCATING', 'docs': '10', 'store': '61.5kb', 'ip': '10.19.29.219', 'node': 'opensearch-0.e42 -> 10.19.29.239 yt3jiuSZRTCY8NnoeIni5w opensearch-1.e42'}, {'index': '.charm_node_lock', 'shard': '0', 'prirep': 'p', 'state': 'STARTED', 'docs': '1', 'store': '31.6kb', 'ip': '10.19.29.219', 'node': 'opensearch-0.e42'}]
```
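
For reference, listings like the ones above come from the `_cat/shards`
API. A minimal sketch of how to fetch them outside the charm; the host,
credentials, and `verify=False` are placeholders for a test deployment
with self-signed certificates, not values taken from this PR:
```python
import requests

HOST = "https://10.19.29.219:9200"   # placeholder address of any cluster node
AUTH = ("admin", "admin-password")   # hypothetical credentials

# One entry per shard copy: index, shard number, primary/replica, state, node.
resp = requests.get(f"{HOST}/_cat/shards?format=json", auth=AUTH, verify=False)
resp.raise_for_status()
for shard in resp.json():
    print(shard["index"], shard["shard"], shard["prirep"], shard["state"], shard.get("node"))
```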

Shortly afterwards, the error appears:
```
unit-opensearch-0: 16:25:11 DEBUG unit.opensearch/0.juju-log opensearch-peers:1: https://10.19.29.239:9200 "GET /.charm_node_lock/_source/0 HTTP/1.1" 503 287
unit-opensearch-0: 16:25:11 ERROR unit.opensearch/0.juju-log opensearch-peers:1: Error checking which unit has OpenSearch lock
Traceback (most recent call last):
...
File "/var/lib/juju/agents/unit-opensearch-0/charm/lib/charms/opensearch/v0/opensearch_locking.py", line 263, in acquired
    unit = self._unit_with_lock(host)
  File "/var/lib/juju/agents/unit-opensearch-0/charm/lib/charms/opensearch/v0/opensearch_locking.py", line 225, in _unit_with_lock
    document_data = self._opensearch.request(
  File "/var/lib/juju/agents/unit-opensearch-0/charm/lib/charms/opensearch/v0/opensearch_distro.py", line 306, in request
    raise OpenSearchHttpError(
charms.opensearch.v0.opensearch_exceptions.OpenSearchHttpError: HTTP error self.response_code=503
self.response_body={'error': {'root_cause': [{'type': 'no_shard_available_action_exception', 'reason': 'No shard available for [get [.charm_node_lock][0]: routing [null]]'}], 'type': 'no_shard_available_action_exception', 'reason': 'No shard available for [get [.charm_node_lock][0]: routing [null]]'}, 'status': 503}
unit-opensearch-0: 16:25:11 DEBUG unit.opensearch/0.juju-log opensearch-peers:1: Lock to start opensearch not acquired. Will retry next event
```

## Solution
When stopping OpenSearch, the operator should wait for shard relocation
to complete. This should happen right after the stopping unit has been
added to the allocation exclusions. The check must be blocking: the
OpenSearch service must not be stopped until relocation has finished.
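
The allocation exclusion itself is a standard cluster setting: adding
the stopping node to `cluster.routing.allocation.exclude._name` asks
OpenSearch to move its shards onto the remaining nodes. A rough sketch
with placeholder connection details and node name (the charm does this
through its own request helpers, not plain `requests`):
```python
import requests

HOST = "https://10.19.29.219:9200"   # placeholder address of any cluster node
AUTH = ("admin", "admin-password")   # hypothetical credentials
NODE_NAME = "opensearch-0.e42"       # name of the unit that is about to stop

# Excluding the node by name triggers relocation of all shards it holds.
settings = {"transient": {"cluster.routing.allocation.exclude._name": NODE_NAME}}
resp = requests.put(f"{HOST}/_cluster/settings", json=settings, auth=AUTH, verify=False)
resp.raise_for_status()
```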

In the unit logs, the relocation then looks something like this:
```
unit-opensearch-0: 10:07:28 DEBUG unit.opensearch/0.juju-log Shards before relocation: [... {'index': '.plugins-ml-config', 'shard': '0', 'prirep': 'p', 'state': 'STARTED', 'docs': '1', 'store': '3.9kb', 'ip': '10.18.168.228', 'node': 'opensearch-0.ccc'}, ...]

[...]

unit-opensearch-0: 10:07:32 DEBUG unit.opensearch/0.juju-log Shards after relocation: [... {'index': '.plugins-ml-config', 'shard': '0', 'prirep': 'p', 'state': 'STARTED', 'docs': '1', 'store': '3.9kb', 'ip': '10.18.168.28', 'node': 'opensearch-1.ccc'} ...]
```
To check whether shards are still moving, the `_cluster/health` API can
be queried for `relocating_shards`. As long as the value is not `0`, the
stop process is halted. Depending on the amount of data, relocation can
take quite some time, so a maximum waiting time of 15 minutes (90
retries at 10-second intervals) has been added; after that, an error is
raised.
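
Outside the charm, the waiting loop described above can be sketched
roughly as follows (placeholder connection details again; in the charm
it is implemented by `wait_for_shards_relocation()` with a tenacity
`@retry` of 90 attempts at 10-second intervals, as shown in the diff
below):
```python
import time
import requests

HOST = "https://10.19.29.219:9200"   # placeholder address of any cluster node
AUTH = ("admin", "admin-password")   # hypothetical credentials
TIMEOUT = 15 * 60                    # give up after 15 minutes
INTERVAL = 10                        # poll every 10 seconds

deadline = time.monotonic() + TIMEOUT
while True:
    health = requests.get(f"{HOST}/_cluster/health", auth=AUTH, verify=False).json()
    # relocating_shards drops to 0 once all shards have left the excluded node;
    # only then is it safe to actually stop the service.
    if health["relocating_shards"] == 0:
        break
    if time.monotonic() > deadline:
        raise TimeoutError("Shards haven't completed relocating.")
    time.sleep(INTERVAL)
```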
reneradoi authored Aug 9, 2024
1 parent 9b619d4 commit a308db8
Showing 2 changed files with 23 additions and 22 deletions.
`lib/charms/opensearch/v0/opensearch_base_charm.py` (2 additions, 1 deletion):
```diff
@@ -1022,7 +1022,8 @@ def _stop_opensearch(self, *, restart=False) -> None:
         except OpenSearchHttpError:
             logger.debug("Failed to get online nodes, voting and alloc exclusions not added")
 
-        # TODO: should block until all shards move addressed in PR DPE-2234
+        # block until all primary shards are moved away from the unit that is stopping
+        self.health.wait_for_shards_relocation()
 
         # 2. stop the service
         self.opensearch.stop()
```
`lib/charms/opensearch/v0/opensearch_health.py` (21 additions, 21 deletions):
```diff
@@ -112,34 +112,34 @@ def get( # noqa: C901
             logger.error(e)  # means the status was reported as an int (i.e: 503)
             return HealthColors.UNKNOWN
 
-        if status != HealthColors.YELLOW:
-            return status
-
-        try:
-            logger.debug(
-                f"\n\nHealth: {status} -- Shards: {ClusterState.shards(self._opensearch, host, verbose=True)}\n\n"
-            )
-            logger.debug(
-                f"Allocation explanations: {ClusterState.allocation_explain(self._opensearch, host)}\n\n"
-            )
-        except OpenSearchHttpError:
-            pass
-
         # we differentiate between a temp yellow (moving shards) and a permanent
         # one (such as: missing replicas)
-        if response["initializing_shards"] > 0 or response["relocating_shards"] > 0:
+        if status in [HealthColors.GREEN, HealthColors.YELLOW] and (
+            response["initializing_shards"] > 0 or response["relocating_shards"] > 0
+        ):
+            try:
+                logger.debug(
+                    f"\n\nHealth: {status} -- Shards: {ClusterState.shards(self._opensearch, host, verbose=True)}\n\n"
+                )
+                logger.debug(
+                    f"Allocation explanations: {ClusterState.allocation_explain(self._opensearch, host)}\n\n"
+                )
+            except OpenSearchHttpError:
+                pass
             return HealthColors.YELLOW_TEMP
-        return HealthColors.YELLOW
 
-    @retry(stop=stop_after_attempt(15), wait=wait_fixed(5), reraise=True)
+        return status
+
+    @retry(stop=stop_after_attempt(90), wait=wait_fixed(10), reraise=True)
     def wait_for_shards_relocation(self) -> None:
         """Blocking function until the shards relocation completes in the cluster."""
-        if self.get(wait_for_green_first=True) != HealthColors.YELLOW_TEMP:
-            return
+        health = self.get(local_app_only=False)
 
-        # we throw an error because various operations should NOT start while data
-        # is being relocated. Examples are: simple stop, unit removal, upgrade
-        raise OpenSearchHAError("Shards haven't completed relocating.")
+        if health == HealthColors.YELLOW_TEMP:
+            logger.info("Shards still moving before stopping Opensearch.")
+            # we throw an error because various operations should NOT start while data
+            # is being relocated. Examples are: simple stop, unit removal, upgrade
+            raise OpenSearchHAError("Shards haven't completed relocating.")
 
     def _apply_for_app(self, status: str) -> None:
         """Cluster wide / app status."""
```
