
Bug when calculating correctness_by_metadata mean #2089

Open
lailanelkoussy opened this issue Dec 13, 2024 · 0 comments
Labels: bug Something isn't working
Issue Type

Bug

Source

source

Giskard Library Version

2.16.0

OS Platform and Distribution

Ubuntu 24.04.1

Python version

3.11

Installed python packages

aiohappyeyeballs==2.4.4
aiohttp==3.11.10
aiosignal==1.3.1
annotated-types==0.7.0
anyio==4.7.0
appdirs==1.4.4
asttokens==3.0.0
attrs==24.2.0
bert-score==0.3.13
bokeh==3.4.3
cachetools==5.5.0
certifi==2024.8.30
chardet==5.2.0
charset-normalizer==3.4.0
click==8.1.7
cloudpickle==3.1.0
colorama==0.4.6
contourpy==1.3.1
cycler==0.12.1
databricks-sdk==0.39.0
dataclasses-json==0.6.7
datasets==3.2.0
decorator==5.1.1
Deprecated==1.2.15
dill==0.3.8
distro==1.9.0
docopt==0.6.2
eval_type_backport==0.2.0
evaluate==0.4.3
executing==2.1.0
faiss-cpu==1.8.0
filelock==3.16.1
fonttools==4.55.3
frozenlist==1.5.0
fsspec==2024.9.0
giskard==2.16.0
gitdb==4.0.11
GitPython==3.1.43
google-auth==2.37.0
greenlet==3.1.1
griffe==0.48.0
h11==0.14.0
httpcore==1.0.7
httpx==0.27.2
httpx-sse==0.4.0
huggingface-hub==0.26.5
idna==3.10
importlib_metadata==8.5.0
ipython==8.30.0
jedi==0.19.2
Jinja2==3.1.4
jiter==0.8.2
joblib==1.4.2
jsonpatch==1.33
jsonpath-python==1.0.6
jsonpointer==3.0.0
jsonschema==4.23.0
jsonschema-specifications==2024.10.1
kiwisolver==1.4.7
langchain==0.3.11
langchain-community==0.3.11
langchain-core==0.3.24
langchain-openai==0.2.12
langchain-text-splitters==0.3.2
langdetect==1.0.9
langsmith==0.2.3
litellm==1.50.4
llvmlite==0.43.0
Markdown==3.7
MarkupSafe==3.0.2
marshmallow==3.23.1
matplotlib==3.9.4
matplotlib-inline==0.1.7
mistralai==1.2.5
mixpanel==4.10.1
mlflow-skinny==2.19.0
mpmath==1.3.0
multidict==6.1.0
multiprocess==0.70.16
mypy-extensions==1.0.0
nest-asyncio==1.6.0
networkx==3.4.2
num2words==0.5.13
numba==0.60.0
numpy==1.26.4
nvidia-cublas-cu12==12.4.5.8
nvidia-cuda-cupti-cu12==12.4.127
nvidia-cuda-nvrtc-cu12==12.4.127
nvidia-cuda-runtime-cu12==12.4.127
nvidia-cudnn-cu12==9.1.0.70
nvidia-cufft-cu12==11.2.1.3
nvidia-curand-cu12==10.3.5.147
nvidia-cusolver-cu12==11.6.1.9
nvidia-cusparse-cu12==12.3.1.170
nvidia-nccl-cu12==2.21.5
nvidia-nvjitlink-cu12==12.4.127
nvidia-nvtx-cu12==12.4.127
openai==1.57.3
opentelemetry-api==1.29.0
opentelemetry-sdk==1.29.0
opentelemetry-semantic-conventions==0.50b0
orjson==3.10.12
packaging==24.2
pandas==2.2.3
parso==0.8.4
pexpect==4.9.0
pillow==11.0.0
pip==24.2
prompt_toolkit==3.0.48
propcache==0.2.1
protobuf==5.29.1
ptyprocess==0.7.0
pure_eval==0.2.3
pyarrow==18.1.0
pyasn1==0.6.1
pyasn1_modules==0.4.1
pydantic==2.10.3
pydantic_core==2.27.1
pydantic-settings==2.7.0
Pygments==2.18.0
pynndescent==0.5.13
pyparsing==3.2.0
pysbd==0.3.4
python-dateutil==2.9.0.post0
python-dotenv==1.0.1
pytz==2024.2
PyYAML==6.0.2
ragas==0.2.8
referencing==0.35.1
regex==2024.11.6
requests==2.32.3
requests-toolbelt==1.0.0
rpds-py==0.22.3
rsa==4.9
safetensors==0.4.5
scikit-learn==1.6.0
scipy==1.11.4
setuptools==75.1.0
six==1.17.0
smmap==5.0.1
sniffio==1.3.1
SQLAlchemy==2.0.36
sqlparse==0.5.3
stack-data==0.6.3
sympy==1.13.1
tenacity==9.0.0
threadpoolctl==3.5.0
tiktoken==0.8.0
tokenizers==0.21.0
torch==2.5.1
tornado==6.4.2
tqdm==4.67.1
traitlets==5.14.3
transformers==4.47.0
triton==3.1.0
typing_extensions==4.12.2
typing-inspect==0.9.0
tzdata==2024.2
umap-learn==0.5.7
urllib3==2.2.3
wcwidth==0.2.13
wheel==0.44.0
wrapt==1.17.0
xxhash==3.5.0
xyzservices==2024.9.0
yarl==1.18.3
zipp==3.21.0
zstandard==0.23.0

Current Behaviour?

I am trying to evaluate my RAG system using the `evaluate` method.

Standalone code OR list down the steps to reproduce the issue

1. I first created the test set using this script:

```python
import giskard
from dotenv import load_dotenv
from giskard.rag import generate_testset, KnowledgeBase
import pandas as pd

load_dotenv()

print('setting llm and embedding models')
giskard.llm.set_llm_model("mistral/mistral-large-latest")
giskard.llm.set_embedding_model("mistral/mistral-embed")

print('Reading data')
df = pd.read_csv("test_data/DataTest.ChunksTest.csv")

# Drop chunks longer than 1024 characters and make sure contents are strings
df = df.drop(df[df['page_content'].str.len() > 1024].index)
df['page_content'] = df['page_content'].astype(str)

knowledge_base = KnowledgeBase.from_pandas(df, columns=['page_content', 'metadata.source'])

print('Creating test set')
testset = generate_testset(
    knowledge_base,
    num_questions=10,
    language='fr',
)
print('Saving test set')
testset.save("my_testset_10.jsonl")
```
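As a quick, Giskard-independent sanity check, the saved test set can be validated line by line; this sketch assumes each line of the `.jsonl` file is a single JSON record (the helper name is mine, not part of Giskard):

```python
import json

def count_jsonl_records(path: str) -> int:
    """Count well-formed JSON records in a JSONL file, raising on a malformed line."""
    n = 0
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                json.loads(line)  # raises ValueError if a record is malformed
                n += 1
    return n
```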

2. Then I try to evaluate it as follows:

```python
from giskard.rag import evaluate, AgentAnswer, QATestset, KnowledgeBase
import pandas as pd

# Imports specific to my use case
from Betty import BettyBot
from ChatModel import ChatModel
from dotenv import load_dotenv

# Loads relevant API keys
load_dotenv()


chat_model = ChatModel()

df = pd.read_csv("test_data/DataTest.ChunksTest.csv")
df = df.drop(df[df['page_content'].str.len() > 1024].index)

knowledge_base = KnowledgeBase.from_pandas(df, columns=['page_content','metadata.source'])
loaded_testset = QATestset.load("my_testset_10.jsonl")


# Wrap your RAG model
def get_answer_fn(question: str, history=None):
    """A function representing your RAG agent."""

    messages = history if history else []
    messages.append({"role": "user", "content": question})

    # Function that retrieves documents from the index as a list of dicts
    # with a 'page_content' key
    documents_dicts = search_documents(question)

    documents = concat_documents_for_prompting(documents_dicts)

    # String answer from the model
    answer = chat_model.question_model(context=documents, query=question)

    # List of string document contents
    docs_content = [doc['page_content'] for doc in documents_dicts]

    return AgentAnswer(
        message=answer,
        documents=docs_content
    )

# Run the evaluation and get a report
# I am deliberately not using the RAGAS metrics because their latest version has a bug
report = evaluate(get_answer_fn, testset=loaded_testset, knowledge_base=knowledge_base)
report.to_html("rag_eval_report.html")
```

The run completes, but an error is raised when `evaluate` internally computes the mean of the correctness metric.



Relevant log output

```shell
Asking questions to the agent: 100%|██████████████████████████████████████████████████████████| 10/10 [01:31<00:00,  9.16s/it]
CorrectnessMetric evaluation: 100%|███████████████████████████████████████████████████████████| 10/10 [00:10<00:00,  1.05s/it]
Traceback (most recent call last):
  File "/home/ali-belaich/anaconda3/envs/BETTY-MODEL-GISKARD/lib/python3.11/site-packages/pandas/core/groupby/groupby.py", line 1942, in _agg_py_fallback
    res_values = self._grouper.agg_series(ser, alt, preserve_dtype=True)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ali-belaich/anaconda3/envs/BETTY-MODEL-GISKARD/lib/python3.11/site-packages/pandas/core/groupby/ops.py", line 864, in agg_series
    result = self._aggregate_series_pure_python(obj, func)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ali-belaich/anaconda3/envs/BETTY-MODEL-GISKARD/lib/python3.11/site-packages/pandas/core/groupby/ops.py", line 885, in _aggregate_series_pure_python
    res = func(group)
          ^^^^^^^^^^^
  File "/home/ali-belaich/anaconda3/envs/BETTY-MODEL-GISKARD/lib/python3.11/site-packages/pandas/core/groupby/groupby.py", line 2454, in <lambda>
    alt=lambda x: Series(x, copy=False).mean(numeric_only=numeric_only),
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ali-belaich/anaconda3/envs/BETTY-MODEL-GISKARD/lib/python3.11/site-packages/pandas/core/series.py", line 6549, in mean
    return NDFrame.mean(self, axis, skipna, numeric_only, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ali-belaich/anaconda3/envs/BETTY-MODEL-GISKARD/lib/python3.11/site-packages/pandas/core/generic.py", line 12420, in mean
    return self._stat_function(
           ^^^^^^^^^^^^^^^^^^^^
  File "/home/ali-belaich/anaconda3/envs/BETTY-MODEL-GISKARD/lib/python3.11/site-packages/pandas/core/generic.py", line 12377, in _stat_function
    return self._reduce(
           ^^^^^^^^^^^^^
  File "/home/ali-belaich/anaconda3/envs/BETTY-MODEL-GISKARD/lib/python3.11/site-packages/pandas/core/series.py", line 6457, in _reduce
    return op(delegate, skipna=skipna, **kwds)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ali-belaich/anaconda3/envs/BETTY-MODEL-GISKARD/lib/python3.11/site-packages/pandas/core/nanops.py", line 147, in f
    result = alt(values, axis=axis, skipna=skipna, **kwds)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ali-belaich/anaconda3/envs/BETTY-MODEL-GISKARD/lib/python3.11/site-packages/pandas/core/nanops.py", line 404, in new_func
    result = func(values, axis=axis, skipna=skipna, mask=mask, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ali-belaich/anaconda3/envs/BETTY-MODEL-GISKARD/lib/python3.11/site-packages/pandas/core/nanops.py", line 719, in nanmean
    the_sum = values.sum(axis, dtype=dtype_sum)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ali-belaich/anaconda3/envs/BETTY-MODEL-GISKARD/lib/python3.11/site-packages/numpy/core/_methods.py", line 49, in _sum
    return umr_sum(a, axis, dtype, out, keepdims, initial, where)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: unsupported operand type(s) for +: 'bool' and 'str'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/ali-belaich/DataspellProjects/ModelService/tests/giskar/evaluate_rag.py", line 89, in <module>
    report = evaluate(get_answer_fn, testset=loaded_testset, knowledge_base=knowledge_base)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ali-belaich/anaconda3/envs/BETTY-MODEL-GISKARD/lib/python3.11/site-packages/giskard/rag/evaluate.py", line 110, in evaluate
    report.correctness_by_question_type().to_dict()[metrics[0].name],
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ali-belaich/anaconda3/envs/BETTY-MODEL-GISKARD/lib/python3.11/site-packages/giskard/rag/report.py", line 276, in correctness_by_question_type
    correctness = self._correctness_by_metadata("question_type")
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ali-belaich/anaconda3/envs/BETTY-MODEL-GISKARD/lib/python3.11/site-packages/giskard/rag/report.py", line 323, in _correctness_by_metadata
    .mean()
     ^^^^^^
  File "/home/ali-belaich/anaconda3/envs/BETTY-MODEL-GISKARD/lib/python3.11/site-packages/pandas/core/groupby/groupby.py", line 2452, in mean
    result = self._cython_agg_general(
             ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ali-belaich/anaconda3/envs/BETTY-MODEL-GISKARD/lib/python3.11/site-packages/pandas/core/groupby/groupby.py", line 1998, in _cython_agg_general
    new_mgr = data.grouped_reduce(array_func)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ali-belaich/anaconda3/envs/BETTY-MODEL-GISKARD/lib/python3.11/site-packages/pandas/core/internals/base.py", line 367, in grouped_reduce
    res = func(arr)
          ^^^^^^^^^
  File "/home/ali-belaich/anaconda3/envs/BETTY-MODEL-GISKARD/lib/python3.11/site-packages/pandas/core/groupby/groupby.py", line 1995, in array_func
    result = self._agg_py_fallback(how, values, ndim=data.ndim, alt=alt)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ali-belaich/anaconda3/envs/BETTY-MODEL-GISKARD/lib/python3.11/site-packages/pandas/core/groupby/groupby.py", line 1946, in _agg_py_fallback
    raise type(err)(msg) from err
TypeError: agg function failed [how->mean,dtype->object]
```
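Reading the traceback, the grouped mean fails because the correctness column ends up with object dtype, mixing booleans with strings (`'bool' and 'str'`). This is a minimal sketch of that failure mode with hypothetical data, not Giskard internals, plus the kind of coercion that would sidestep it:

```python
import pandas as pd

# Hypothetical data: correctness holds booleans for scored rows and a
# string for a row where evaluation went wrong, giving object dtype.
df = pd.DataFrame({
    "question_type": ["simple", "simple", "complex"],
    "correctness": [True, "evaluation error", False],  # mixed bool/str
})

err = None
try:
    # Fails in pandas with "agg function failed [how->mean,dtype->object]"
    df.groupby("question_type")["correctness"].mean()
except TypeError as exc:
    err = exc

# A defensive fix: map booleans to floats and stray strings to NaN,
# so mean() simply skips the unscored rows.
df["correctness"] = df["correctness"].map(
    lambda v: float(v) if isinstance(v, bool) else float("nan")
)
means = df.groupby("question_type")["correctness"].mean()
# simple -> 1.0, complex -> 0.0
```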
@henchaves henchaves self-assigned this Dec 13, 2024
@henchaves henchaves added the bug Something isn't working label Dec 13, 2024