
Synchronization issue when the model is just launched #170

Open
kpouget opened this issue Oct 24, 2023 · 3 comments

kpouget commented Oct 24, 2023

Describe the bug

There is a synchronization issue at Pod launch with the current images:

  • all of the containers become Ready:
flan-t5-small-gpu-predictor-00001-deployment-6768c548d8-8btqc   4/4     Running   0          41s
  • the model appears as Loaded in the inference service:
  modelStatus:
    copies:
      failedCopies: 0
      totalCopies: 1
    states:
      activeModelState: Loaded
      targetModelState: Loaded
  • but the model takes several extra seconds before it can actually serve requests:
HOST=...
METHOD=caikit.runtime.Nlp.NlpService/TextGenerationTaskPredict
while true; do
  GRPCURL_DATA='{"max_new_tokens": 25, "min_new_tokens": 25, "text": "At what temperature does liquid Nitrogen boil?"}'
  grpcurl -insecure -d "$GRPCURL_DATA" -H "mm-model-id: flan-t5-small-caikit" "$HOST" "$METHOD"
  sleep 1
done

ERROR:
  Code: Internal
  Message: Unhandled exception during prediction
(the same error repeats once per request for several seconds, until the first successful response:)
{
  "generated_text": "74 degrees F.C., a temperature of 74 degrees F.C., a temperature of ",
  "generated_tokens": "25",
  "finish_reason": "MAX_TOKENS",
  "producer_id": {
    "name": "Text Generation",
    "version": "0.1.0"
  },
  "input_token_count": "10"
}
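As a client-side workaround while the race exists, a caller can retry until the first successful prediction instead of trusting the Ready condition. A minimal sketch of such a retry gate (the helper name and the wrapped callable are hypothetical, not part of caikit):

```python
import time


def wait_until_ready(call, timeout=60.0, interval=1.0):
    """Repeatedly invoke `call` until it succeeds or `timeout` elapses.

    `call` is any zero-argument callable that raises on failure,
    e.g. a lambda wrapping the gRPC TextGenerationTaskPredict
    request shown above. Returns the first successful result.
    """
    deadline = time.monotonic() + timeout
    last_exc = None
    while time.monotonic() < deadline:
        try:
            return call()
        except Exception as exc:  # with grpcio, catch grpc.RpcError instead
            last_exc = exc
            time.sleep(interval)
    raise TimeoutError("model did not become ready in time") from last_exc
```

This only papers over the problem on the client side; the Ready condition itself should not be reported before the runtime can serve.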

In the transformer-container logs, we can see this error:

{"channel": "GP-SERVICR-I", "exception": null, "level": "warning", "log_code": "<RUN49049070W>", "message": "<_InactiveRpcError of RPC that terminated with:
\tstatus = StatusCode.UNAVAILABLE
\tdetails = \"failed to connect to all addresses; last error: UNKNOWN: ipv4:127.0.0.1:8033: Failed to connect to remote host: Connection refused\"
\tdebug_error_string = \"UNKNOWN:failed to connect to all addresses; last error: UNKNOWN: ipv4:127.0.0.1:8033: Failed to connect to remote host: Connection refused {created_time:\"2023-10-24T11:48:51.016344787+00:00\", grpc_status:14}\"
>", "model_id": "flan-t5-small-caikit", "num_indent": 0, "stack_trace": "Traceback (most recent call last):
  File \"/caikit/lib/python3.9/site-packages/caikit/runtime/servicers/global_predict_servicer.py\", line 283, in _handle_predict_exceptions
    yield
  File \"/caikit/lib/python3.9/site-packages/caikit/runtime/servicers/global_predict_servicer.py\", line 260, in predict_model
    response = work.do()
  File \"/caikit/lib/python3.9/site-packages/caikit/runtime/work_management/abortable_action.py\", line 118, in do
    return self.__work_thread.get_or_throw()
  File \"/caikit/lib/python3.9/site-packages/caikit/core/toolkit/destroyable_thread.py\", line 188, in get_or_throw
    raise self.__runnable_exception
  File \"/caikit/lib/python3.9/site-packages/caikit/core/toolkit/destroyable_thread.py\", line 124, in run
    self.__runnable_result = self.runnable_func(
  File \"/caikit/lib/python3.9/site-packages/caikit_nlp/modules/text_generation/text_generation_tgis.py\", line 237, in run
    return self.tgis_generation_client.unary_generate(
  File \"/caikit/lib/python3.9/site-packages/caikit_nlp/toolkit/text_generation/tgis_utils.py\", line 315, in unary_generate
    batch_response = self.tgis_client.Generate(request)
  File \"/caikit/lib64/python3.9/site-packages/grpc/_channel.py\", line 1161, in __call__
    return _end_unary_response_blocking(state, call, False, None)
  File \"/caikit/lib64/python3.9/site-packages/grpc/_channel.py\", line 1004, in _end_unary_response_blocking
    raise _InactiveRpcError(state)  # pytype: disable=not-instantiable
grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
\tstatus = StatusCode.UNAVAILABLE
\tdetails = \"failed to connect to all addresses; last error: UNKNOWN: ipv4:127.0.0.1:8033: Failed to connect to remote host: Connection refused\"
\tdebug_error_string = \"UNKNOWN:failed to connect to all addresses; last error: UNKNOWN: ipv4:127.0.0.1:8033: Failed to connect to remote host: Connection refused {created_time:\"2023-10-24T11:48:51.016344787+00:00\", grpc_status:14}\"
>
", "thread_id": 140123215742720, "timestamp": "2023-10-24T11:48:51.017178"}

Platform

  • quay.io/opendatahub/text-generation-inference@sha256:0e3d00961fed95a8f8b12ed7ce50305acbbfe37ee33d37e81ba9e7ed71c73b69
  • quay.io/opendatahub/caikit-tgis-serving@sha256:adb8d1153b900e304fbcc934189c68cffea035d4b82848446c72c3d5554ee0ca

Sample Code

caikit_tgit_config.yaml.log
inference_service.yaml.log
serving_runtime.yaml.log

@dtrifiro

This comment was marked as resolved.

@kpouget

This comment was marked as outdated.

@dtrifiro dtrifiro self-assigned this Nov 28, 2023
@dtrifiro dtrifiro transferred this issue from opendatahub-io/caikit-tgis-backend Nov 28, 2023
dtrifiro (Contributor) commented:

This will be fixed when #156 is closed
