
Bug when calculating correctness_by_metadata mean #2089

Open
lailanelkoussy opened this issue Dec 13, 2024 · 0 comments
Labels: bug Something isn't working
Issue Type

Bug

Source

source

Giskard Library Version

2.16.0

OS Platform and Distribution

Ubuntu 24.04.1

Python version

3.11

Installed python packages

aiohappyeyeballs==2.4.4
aiohttp==3.11.10
aiosignal==1.3.1
annotated-types==0.7.0
anyio==4.7.0
appdirs==1.4.4
asttokens==3.0.0
attrs==24.2.0
bert-score==0.3.13
bokeh==3.4.3
cachetools==5.5.0
certifi==2024.8.30
chardet==5.2.0
charset-normalizer==3.4.0
click==8.1.7
cloudpickle==3.1.0
colorama==0.4.6
contourpy==1.3.1
cycler==0.12.1
databricks-sdk==0.39.0
dataclasses-json==0.6.7
datasets==3.2.0
decorator==5.1.1
Deprecated==1.2.15
dill==0.3.8
distro==1.9.0
docopt==0.6.2
eval_type_backport==0.2.0
evaluate==0.4.3
executing==2.1.0
faiss-cpu==1.8.0
filelock==3.16.1
fonttools==4.55.3
frozenlist==1.5.0
fsspec==2024.9.0
giskard==2.16.0
gitdb==4.0.11
GitPython==3.1.43
google-auth==2.37.0
greenlet==3.1.1
griffe==0.48.0
h11==0.14.0
httpcore==1.0.7
httpx==0.27.2
httpx-sse==0.4.0
huggingface-hub==0.26.5
idna==3.10
importlib_metadata==8.5.0
ipython==8.30.0
jedi==0.19.2
Jinja2==3.1.4
jiter==0.8.2
joblib==1.4.2
jsonpatch==1.33
jsonpath-python==1.0.6
jsonpointer==3.0.0
jsonschema==4.23.0
jsonschema-specifications==2024.10.1
kiwisolver==1.4.7
langchain==0.3.11
langchain-community==0.3.11
langchain-core==0.3.24
langchain-openai==0.2.12
langchain-text-splitters==0.3.2
langdetect==1.0.9
langsmith==0.2.3
litellm==1.50.4
llvmlite==0.43.0
Markdown==3.7
MarkupSafe==3.0.2
marshmallow==3.23.1
matplotlib==3.9.4
matplotlib-inline==0.1.7
mistralai==1.2.5
mixpanel==4.10.1
mlflow-skinny==2.19.0
mpmath==1.3.0
multidict==6.1.0
multiprocess==0.70.16
mypy-extensions==1.0.0
nest-asyncio==1.6.0
networkx==3.4.2
num2words==0.5.13
numba==0.60.0
numpy==1.26.4
nvidia-cublas-cu12==12.4.5.8
nvidia-cuda-cupti-cu12==12.4.127
nvidia-cuda-nvrtc-cu12==12.4.127
nvidia-cuda-runtime-cu12==12.4.127
nvidia-cudnn-cu12==9.1.0.70
nvidia-cufft-cu12==11.2.1.3
nvidia-curand-cu12==10.3.5.147
nvidia-cusolver-cu12==11.6.1.9
nvidia-cusparse-cu12==12.3.1.170
nvidia-nccl-cu12==2.21.5
nvidia-nvjitlink-cu12==12.4.127
nvidia-nvtx-cu12==12.4.127
openai==1.57.3
opentelemetry-api==1.29.0
opentelemetry-sdk==1.29.0
opentelemetry-semantic-conventions==0.50b0
orjson==3.10.12
packaging==24.2
pandas==2.2.3
parso==0.8.4
pexpect==4.9.0
pillow==11.0.0
pip==24.2
prompt_toolkit==3.0.48
propcache==0.2.1
protobuf==5.29.1
ptyprocess==0.7.0
pure_eval==0.2.3
pyarrow==18.1.0
pyasn1==0.6.1
pyasn1_modules==0.4.1
pydantic==2.10.3
pydantic_core==2.27.1
pydantic-settings==2.7.0
Pygments==2.18.0
pynndescent==0.5.13
pyparsing==3.2.0
pysbd==0.3.4
python-dateutil==2.9.0.post0
python-dotenv==1.0.1
pytz==2024.2
PyYAML==6.0.2
ragas==0.2.8
referencing==0.35.1
regex==2024.11.6
requests==2.32.3
requests-toolbelt==1.0.0
rpds-py==0.22.3
rsa==4.9
safetensors==0.4.5
scikit-learn==1.6.0
scipy==1.11.4
setuptools==75.1.0
six==1.17.0
smmap==5.0.1
sniffio==1.3.1
SQLAlchemy==2.0.36
sqlparse==0.5.3
stack-data==0.6.3
sympy==1.13.1
tenacity==9.0.0
threadpoolctl==3.5.0
tiktoken==0.8.0
tokenizers==0.21.0
torch==2.5.1
tornado==6.4.2
tqdm==4.67.1
traitlets==5.14.3
transformers==4.47.0
triton==3.1.0
typing_extensions==4.12.2
typing-inspect==0.9.0
tzdata==2024.2
umap-learn==0.5.7
urllib3==2.2.3
wcwidth==0.2.13
wheel==0.44.0
wrapt==1.17.0
xxhash==3.5.0
xyzservices==2024.9.0
yarl==1.18.3
zipp==3.21.0
zstandard==0.23.0

Current Behaviour?

I am trying to evaluate my RAG system using the `evaluate` method.

Standalone code OR list down the steps to reproduce the issue

1. I first created the test set using this script:

```python
import giskard
from dotenv import load_dotenv
from giskard.rag import generate_testset, KnowledgeBase
import pandas as pd

load_dotenv()

print('setting llm and embedding models')
giskard.llm.set_llm_model("mistral/mistral-large-latest")
giskard.llm.set_embedding_model("mistral/mistral-embed")

print('Reading data')
df = pd.read_csv("test_data/DataTest.ChunksTest.csv")

# Drop chunks longer than 1024 characters and make sure contents are strings
df = df.drop(df[df['page_content'].str.len() > 1024].index)
df['page_content'] = df['page_content'].astype(str)

knowledge_base = KnowledgeBase.from_pandas(df, columns=['page_content', 'metadata.source'])

print('Creating test set')
testset = generate_testset(
    knowledge_base,
    num_questions=10,
    language='fr',
)
print('Saving test set')
testset.save("my_testset_10.jsonl")
```
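As a quick, Giskard-independent sanity check, the saved test set can be validated line by line; this sketch assumes each line of the `.jsonl` file is a single JSON record (the helper name is mine, not part of Giskard):

```python
import json

def count_jsonl_records(path: str) -> int:
    """Count well-formed JSON records in a JSONL file, raising on a malformed line."""
    n = 0
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                json.loads(line)  # raises ValueError if a record is malformed
                n += 1
    return n
```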

2. Then I try to evaluate it as follows:

```python
from giskard.rag import evaluate, AgentAnswer, QATestset, KnowledgeBase
import pandas as pd

# Imports specific to my use case
from Betty import BettyBot
from ChatModel import ChatModel
from dotenv import load_dotenv

# Loads relevant API keys
load_dotenv()


chat_model = ChatModel()

df = pd.read_csv("test_data/DataTest.ChunksTest.csv")
df = df.drop(df[df['page_content'].str.len() > 1024].index)

knowledge_base = KnowledgeBase.from_pandas(df, columns=['page_content','metadata.source'])
loaded_testset = QATestset.load("my_testset_10.jsonl")


# Wrap your RAG model
def get_answer_fn(question: str, history=None):
    """A function representing your RAG agent."""

    messages = history if history else []
    messages.append({"role": "user", "content": question})

    # Function that retrieves documents from the index as a list of dicts
    # with a 'page_content' key
    documents_dicts = search_documents(question)

    documents = concat_documents_for_prompting(documents_dicts)

    # String answer from the model
    answer = chat_model.question_model(context=documents, query=question)

    # List of string document contents
    docs_content = [doc['page_content'] for doc in documents_dicts]

    return AgentAnswer(
        message=answer,
        documents=docs_content
    )

# Run the evaluation and get a report
# I am deliberately not using the RAGAS metrics because their latest version has a bug
report = evaluate(get_answer_fn, testset=loaded_testset, knowledge_base=knowledge_base)
report.to_html("rag_eval_report.html")
```

The run completes, but an error is raised when `evaluate` internally computes the mean of the correctness metric.



Relevant log output

```shell
Asking questions to the agent: 100%|██████████████████████████████████████████████████████████| 10/10 [01:31<00:00,  9.16s/it]
CorrectnessMetric evaluation: 100%|███████████████████████████████████████████████████████████| 10/10 [00:10<00:00,  1.05s/it]
Traceback (most recent call last):
  File "/home/ali-belaich/anaconda3/envs/BETTY-MODEL-GISKARD/lib/python3.11/site-packages/pandas/core/groupby/groupby.py", line 1942, in _agg_py_fallback
    res_values = self._grouper.agg_series(ser, alt, preserve_dtype=True)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ali-belaich/anaconda3/envs/BETTY-MODEL-GISKARD/lib/python3.11/site-packages/pandas/core/groupby/ops.py", line 864, in agg_series
    result = self._aggregate_series_pure_python(obj, func)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ali-belaich/anaconda3/envs/BETTY-MODEL-GISKARD/lib/python3.11/site-packages/pandas/core/groupby/ops.py", line 885, in _aggregate_series_pure_python
    res = func(group)
          ^^^^^^^^^^^
  File "/home/ali-belaich/anaconda3/envs/BETTY-MODEL-GISKARD/lib/python3.11/site-packages/pandas/core/groupby/groupby.py", line 2454, in <lambda>
    alt=lambda x: Series(x, copy=False).mean(numeric_only=numeric_only),
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ali-belaich/anaconda3/envs/BETTY-MODEL-GISKARD/lib/python3.11/site-packages/pandas/core/series.py", line 6549, in mean
    return NDFrame.mean(self, axis, skipna, numeric_only, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ali-belaich/anaconda3/envs/BETTY-MODEL-GISKARD/lib/python3.11/site-packages/pandas/core/generic.py", line 12420, in mean
    return self._stat_function(
           ^^^^^^^^^^^^^^^^^^^^
  File "/home/ali-belaich/anaconda3/envs/BETTY-MODEL-GISKARD/lib/python3.11/site-packages/pandas/core/generic.py", line 12377, in _stat_function
    return self._reduce(
           ^^^^^^^^^^^^^
  File "/home/ali-belaich/anaconda3/envs/BETTY-MODEL-GISKARD/lib/python3.11/site-packages/pandas/core/series.py", line 6457, in _reduce
    return op(delegate, skipna=skipna, **kwds)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ali-belaich/anaconda3/envs/BETTY-MODEL-GISKARD/lib/python3.11/site-packages/pandas/core/nanops.py", line 147, in f
    result = alt(values, axis=axis, skipna=skipna, **kwds)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ali-belaich/anaconda3/envs/BETTY-MODEL-GISKARD/lib/python3.11/site-packages/pandas/core/nanops.py", line 404, in new_func
    result = func(values, axis=axis, skipna=skipna, mask=mask, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ali-belaich/anaconda3/envs/BETTY-MODEL-GISKARD/lib/python3.11/site-packages/pandas/core/nanops.py", line 719, in nanmean
    the_sum = values.sum(axis, dtype=dtype_sum)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ali-belaich/anaconda3/envs/BETTY-MODEL-GISKARD/lib/python3.11/site-packages/numpy/core/_methods.py", line 49, in _sum
    return umr_sum(a, axis, dtype, out, keepdims, initial, where)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: unsupported operand type(s) for +: 'bool' and 'str'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/ali-belaich/DataspellProjects/ModelService/tests/giskar/evaluate_rag.py", line 89, in <module>
    report = evaluate(get_answer_fn, testset=loaded_testset, knowledge_base=knowledge_base)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ali-belaich/anaconda3/envs/BETTY-MODEL-GISKARD/lib/python3.11/site-packages/giskard/rag/evaluate.py", line 110, in evaluate
    report.correctness_by_question_type().to_dict()[metrics[0].name],
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ali-belaich/anaconda3/envs/BETTY-MODEL-GISKARD/lib/python3.11/site-packages/giskard/rag/report.py", line 276, in correctness_by_question_type
    correctness = self._correctness_by_metadata("question_type")
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ali-belaich/anaconda3/envs/BETTY-MODEL-GISKARD/lib/python3.11/site-packages/giskard/rag/report.py", line 323, in _correctness_by_metadata
    .mean()
     ^^^^^^
  File "/home/ali-belaich/anaconda3/envs/BETTY-MODEL-GISKARD/lib/python3.11/site-packages/pandas/core/groupby/groupby.py", line 2452, in mean
    result = self._cython_agg_general(
             ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ali-belaich/anaconda3/envs/BETTY-MODEL-GISKARD/lib/python3.11/site-packages/pandas/core/groupby/groupby.py", line 1998, in _cython_agg_general
    new_mgr = data.grouped_reduce(array_func)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ali-belaich/anaconda3/envs/BETTY-MODEL-GISKARD/lib/python3.11/site-packages/pandas/core/internals/base.py", line 367, in grouped_reduce
    res = func(arr)
          ^^^^^^^^^
  File "/home/ali-belaich/anaconda3/envs/BETTY-MODEL-GISKARD/lib/python3.11/site-packages/pandas/core/groupby/groupby.py", line 1995, in array_func
    result = self._agg_py_fallback(how, values, ndim=data.ndim, alt=alt)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ali-belaich/anaconda3/envs/BETTY-MODEL-GISKARD/lib/python3.11/site-packages/pandas/core/groupby/groupby.py", line 1946, in _agg_py_fallback
    raise type(err)(msg) from err
TypeError: agg function failed [how->mean,dtype->object]
```
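Reading the traceback, the grouped mean fails because the correctness column ends up with object dtype, mixing booleans with strings (`'bool' and 'str'`). This is a minimal sketch of that failure mode with hypothetical data, not Giskard internals, plus the kind of coercion that would sidestep it:

```python
import pandas as pd

# Hypothetical data: correctness holds booleans for scored rows and a
# string for a row where evaluation went wrong, giving object dtype.
df = pd.DataFrame({
    "question_type": ["simple", "simple", "complex"],
    "correctness": [True, "evaluation error", False],  # mixed bool/str
})

err = None
try:
    # Fails in pandas with "agg function failed [how->mean,dtype->object]"
    df.groupby("question_type")["correctness"].mean()
except TypeError as exc:
    err = exc

# A defensive fix: map booleans to floats and stray strings to NaN,
# so mean() simply skips the unscored rows.
df["correctness"] = df["correctness"].map(
    lambda v: float(v) if isinstance(v, bool) else float("nan")
)
means = df.groupby("question_type")["correctness"].mean()
# simple -> 1.0, complex -> 0.0
```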
@henchaves henchaves self-assigned this Dec 13, 2024
@henchaves henchaves added the bug Something isn't working label Dec 13, 2024