Clinical trials docs and bugfixes #819

Merged: 9 commits, Jan 17, 2025
163 changes: 163 additions & 0 deletions docs/tutorials/querying_with_clinical_trials.md
@@ -0,0 +1,163 @@
# PaperQA2 for Clinical Trials

PaperQA2 now natively supports querying clinical trials in addition to any documents supplied by the user. It
uses a new tool, the aptly named `clinical_trials_search`. Users don't have to provide any clinical
trials to the tool itself; it uses the `clinicaltrials.gov` API to retrieve them on the fly. As of
January 2025, the tool is not enabled by default, but it's easy to configure. Here's an example
where we query only clinical trials, without using any documents:

```python
from paperqa import Settings, agent_query

answer_response = await agent_query(
    query="What drugs have been found to effectively treat Ulcerative Colitis?",
    settings=Settings.from_name("search_only_clinical_trials"),
)

print(answer_response.session.answer)
```

### Output

Several drugs have been found to effectively treat Ulcerative Colitis (UC),
targeting different mechanisms of the disease.

Golimumab, a tumor necrosis factor (TNF) inhibitor marketed as Simponi®, has demonstrated efficacy
in treating moderate-to-severe UC. Administered subcutaneously, it was shown to maintain clinical
response through Week 54 in patients, as assessed by the Partial Mayo Score (NCT02092285).

Mesalazine, an anti-inflammatory drug, is commonly used for UC treatment. In a study comparing
mesalazine enemas to faecal microbiota transplantation (FMT) for left-sided UC,
mesalazine enemas (4g daily) were effective in inducing clinical remission (Mayo score ≤ 2) (NCT03104036).

Antibiotics have also shown potential in UC management. A combination of doxycycline,
amoxicillin, and metronidazole induced remission in 60-70% of patients with moderate-to-severe
UC in prior studies. These antibiotics are thought to alter gut microbiota, reducing pathobionts
and promoting beneficial bacteria (NCT02217722, NCT03986996).

Roflumilast, a phosphodiesterase-4 (PDE4) inhibitor, is being investigated for mild-to-moderate UC.
Preliminary findings suggest it may improve disease severity and biochemical markers when
added to conventional treatments (NCT05684484).

These treatments highlight diverse therapeutic approaches, including immunosuppression,
microbiota modulation, and anti-inflammatory mechanisms.
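
The query snippet above uses a top-level `await`, which works in a notebook or an async REPL. In a plain Python script, a minimal sketch would wrap the call in a coroutine and run it with `asyncio.run`. This assumes `paperqa` is installed and an LLM API key is configured in the environment; the `main` wrapper is a hypothetical name, not part of PaperQA2.

```python
import asyncio


async def main() -> str:
    # Imported lazily so that defining main() does not itself require paperqa.
    from paperqa import Settings, agent_query

    answer_response = await agent_query(
        query="What drugs have been found to effectively treat Ulcerative Colitis?",
        settings=Settings.from_name("search_only_clinical_trials"),
    )
    return answer_response.session.answer


# In a script, run the coroutine with: print(asyncio.run(main()))
```

The lazy import is only a convenience for this sketch; a real script would normally import `paperqa` at the top of the file.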

Each clinical trial used in the answer is cited in-line. If you'd like
to see more data on the specific contexts that were used to answer the query:

```python
print(answer_response.session.contexts)
```

[Context(context='The excerpt mentions that a search on ClinicalTrials.gov for clinical trials related to drugs
treating Ulcerative Colitis yielded 689 trials. However, it does not provide specific information about which
drugs have been found effective for treating Ulcerative Colitis.', text=Text(text='', name=...
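
To skim the contexts rather than print the whole list, a small helper can pull out one line per context. This is a sketch: `summarize_contexts` is a hypothetical helper, and it assumes only the `context`, `text`, and `text.name` attributes visible in the printed repr above. The `SimpleNamespace` objects stand in for real `Context` objects so the example runs without PaperQA2.

```python
from types import SimpleNamespace


def summarize_contexts(contexts, width=80):
    """Return one 'source: summary' line per context, truncated to `width` chars."""
    lines = []
    for ctx in contexts:
        # Collapse newlines and repeated spaces so each summary fits on one line.
        summary = " ".join(ctx.context.split())
        lines.append(f"{ctx.text.name}: {summary}"[:width])
    return lines


# Stand-in objects that only mimic the attribute names shown in the repr above.
fake_contexts = [
    SimpleNamespace(
        context="First line\nsecond line",
        text=SimpleNamespace(name="example-trial"),
    ),
]
print("\n".join(summarize_contexts(fake_contexts)))
# prints: example-trial: First line second line
```

With a real response, the call would be `summarize_contexts(answer_response.session.contexts)`.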

Using `Settings.from_name('search_only_clinical_trials')` is a shortcut, but note that you can easily
add `clinical_trials_search` into any custom `Settings` by explicitly naming it as a tool:

```python
from pathlib import Path

from paperqa import Settings, agent_query
from paperqa.agents.tools import DEFAULT_TOOL_NAMES

# you can start with the default list of PaperQA tools
print(DEFAULT_TOOL_NAMES)
# >>> ['paper_search', 'gather_evidence', 'gen_answer', 'reset', 'complete']

# we can start with a directory with a potentially useful paper in it
print(list(Path("my_papers").iterdir()))

# now let's query using standard tools + clinical_trials
answer_response = await agent_query(
    query="What drugs have been found to effectively treat Ulcerative Colitis?",
    settings=Settings(
        paper_directory="my_papers",
        agent={"tool_names": DEFAULT_TOOL_NAMES + ["clinical_trials_search"]},
    ),
)

# let's check out the formatted answer (with references included)
print(answer_response.session.formatted_answer)
```

Question: What drugs have been found to effectively treat Ulcerative Colitis?

Several drugs have been found effective in treating Ulcerative Colitis (UC), with treatment
strategies varying based on disease severity and extent. For mild-to-moderate UC, 5-aminosalicylic
acid (5-ASA) is the first-line therapy. Topical 5-ASA, such as mesalazine suppositories (1 g/day),
is effective for proctitis or distal colitis, inducing remission in 31-80% of patients. Oral mesalazine
at higher doses (e.g., 4.8 g/day) can accelerate clinical improvement in more extensive disease
(meier2011currenttreatmentof pages 1-2; meier2011currenttreatmentof pages 3-4).

For moderate-to-severe cases, corticosteroids are commonly used. Oral steroids like prednisolone
(40-60 mg/day) or intravenous steroids such as methylprednisolone (60 mg/day) and hydrocortisone
(400 mg/day) are standard for inducing remission (meier2011currenttreatmentof pages 3-4). Tumor
necrosis factor (TNF)-α blockers, such as infliximab, are effective for steroid-refractory cases
(meier2011currenttreatmentof pages 2-3; meier2011currenttreatmentof pages 3-4).

Immunosuppressive agents, including azathioprine and 6-mercaptopurine, are used for maintenance
therapy in steroid-dependent or refractory cases (meier2011currenttreatmentof pages 2-3;
meier2011currenttreatmentof pages 3-4). Antibiotics, such as combinations of penicillin,
tetracycline, and metronidazole, have shown promise in altering the microbiota and inducing
remission in some patients, though their efficacy varies (NCT02217722).

References

1. (meier2011currenttreatmentof pages 2-3): Johannes Meier and Andreas Sturm. Current treatment
of ulcerative colitis. World journal of gastroenterology, 17 27:3204-12, 2011.
URL: https://doi.org/10.3748/wjg.v17.i27.3204, doi:10.3748/wjg.v17.i27.3204.

2. (meier2011currenttreatmentof pages 3-4): Johannes Meier and Andreas Sturm. Current treatment
of ulcerative colitis. World journal of gastroenterology, 17 27:3204-12, 2011. URL:
https://doi.org/10.3748/wjg.v17.i27.3204, doi:10.3748/wjg.v17.i27.3204.

3. (NCT02217722): Prof. Arie Levine. Use of the Ulcerative Colitis Diet for Induction of
Remission. Prof. Arie Levine. 2014. ClinicalTrials.gov Identifier: NCT02217722

4. (meier2011currenttreatmentof pages 1-2): Johannes Meier and Andreas Sturm. Current
treatment of ulcerative colitis. World journal of gastroenterology, 17 27:3204-12, 2011.
URL: https://doi.org/10.3748/wjg.v17.i27.3204, doi:10.3748/wjg.v17.i27.3204.

We now see both papers and clinical trials cited in our response. For convenience, the same combination of
paper search and clinical trials search is available as a named configuration via `Settings.from_name`:

```python
from paperqa import Settings, agent_query

answer_response = await agent_query(
    query="What drugs have been found to effectively treat Ulcerative Colitis?",
    settings=Settings.from_name("clinical_trials"),
)
```

This works with the `pqa` CLI as well:

```bash
pqa --settings 'search_only_clinical_trials' ask 'what is Ibuprofen effective at treating?'
```

...
[13:29:50] Completing 'what is Ibuprofen effective at treating?' as 'certain'.
Answer: Ibuprofen is a non-steroidal anti-inflammatory drug (NSAID) effective
in treating various conditions, including pain, inflammation, and fever.
It is widely used for tension-type
headaches, with studies showing that ibuprofen sodium provides significant
pain relief and reduces pain intensity compared to standard ibuprofen and placebo
over a 3-hour period (NCT01362491).
Intravenous ibuprofen is effective in managing postoperative pain, particularly
in orthopedic surgeries, and helps control the inflammatory process. When combined
with opioids, it reduces opioid
consumption and associated side effects, making it a key component of
multimodal analgesia (NCT05401916, NCT01773005).

Ibuprofen is also effective in pediatric populations as a first-line
anti-inflammatory and antipyretic agent due to its relatively
low adverse effects compared to other NSAIDs (NCT01478022).
Additionally, it has been studied for its potential use in managing
chronic periodontitis through subgingival irrigation with a 2% ibuprofen
mouthwash, which reduces periodontal pocket depth and
bleeding on probing, improving periodontal health (NCT02538237).

These findings highlight ibuprofen's versatility in treating pain, inflammation,
fever, and specific conditions like tension headaches, postoperative pain, and periodontal diseases.
6 changes: 5 additions & 1 deletion paperqa/agents/main.py
@@ -31,6 +31,7 @@
 from .models import AgentStatus, AnswerResponse, SimpleProfiler
 from .search import SearchDocumentStorage, SearchIndex, get_directory_index
 from .tools import (
+    DEFAULT_TOOL_NAMES,
     Complete,
     EnvironmentState,
     GatherEvidence,
@@ -117,7 +118,10 @@ async def run_agent(
     )

     # Build the index once here, and then all tools won't need to rebuild it
-    await get_directory_index(settings=settings)
+    # only build if a search tool is requested
+    if PaperSearch.TOOL_FN_NAME in (settings.agent.tool_names or DEFAULT_TOOL_NAMES):
+        await get_directory_index(settings=settings)
+
     if isinstance(agent_type, str) and agent_type.lower() == FAKE_AGENT_TYPE:
         session, agent_status = await run_fake_agent(
             query, settings, docs, **runner_kwargs
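
The new guard in `run_agent` can be read in isolation. The sketch below mirrors its logic with plain lists; `needs_paper_index` is a hypothetical helper, `PAPER_SEARCH_TOOL` is an assumed value for `PaperSearch.TOOL_FN_NAME`, and the default list is copied from the tutorial above rather than imported from `paperqa`.

```python
from __future__ import annotations

PAPER_SEARCH_TOOL = "paper_search"  # assumed value of PaperSearch.TOOL_FN_NAME
DEFAULT_TOOL_NAMES = ["paper_search", "gather_evidence", "gen_answer", "reset", "complete"]


def needs_paper_index(tool_names: list[str] | None) -> bool:
    """Return True when the paper search tool is among the effective tools."""
    # `or` falls back to the defaults when tool_names is None or an empty list.
    return PAPER_SEARCH_TOOL in (tool_names or DEFAULT_TOOL_NAMES)


print(needs_paper_index(None))  # True
print(needs_paper_index(["gather_evidence", "gen_answer", "clinical_trials_search", "complete"]))  # False
```

Because of the `or` fallback, leaving `tool_names` unset still builds the index, while a trials-only tool list skips it.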
21 changes: 21 additions & 0 deletions paperqa/configs/clinical_trials.json
@@ -0,0 +1,21 @@
{
  "answer": {
    "evidence_k": 15,
    "answer_max_sources": 5,
    "max_concurrent_requests": 10
  },
  "agent": {
    "tool_names": [
      "gather_evidence",
      "paper_search",
      "gen_answer",
      "clinical_trials_search",
      "complete"
    ]
  },
  "parsing": {
    "use_doc_details": true,
    "chunk_size": 9000,
    "overlap": 750
  }
}
20 changes: 20 additions & 0 deletions paperqa/configs/search_only_clinical_trials.json
@@ -0,0 +1,20 @@
{
  "answer": {
    "evidence_k": 15,
    "answer_max_sources": 5,
    "max_concurrent_requests": 10
  },
  "agent": {
    "tool_names": [
      "gather_evidence",
      "gen_answer",
      "clinical_trials_search",
      "complete"
    ]
  },
  "parsing": {
    "use_doc_details": true,
    "chunk_size": 9000,
    "overlap": 750
  }
}
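
Because tools are selected by name in these configs, a misspelled entry is easy to miss. A small check like the one below can flag names outside a known set. This is a sketch: `unknown_tools` is a hypothetical helper, the known names are the ones that appear in this PR, and PaperQA2 itself may validate tool names differently.

```python
import json

# Tool names that appear in this PR (the defaults plus the new tool).
KNOWN_TOOL_NAMES = {
    "paper_search",
    "gather_evidence",
    "gen_answer",
    "reset",
    "complete",
    "clinical_trials_search",
}


def unknown_tools(config_text: str) -> list[str]:
    """Return the configured tool names that are not in KNOWN_TOOL_NAMES."""
    config = json.loads(config_text)
    names = config.get("agent", {}).get("tool_names", [])
    return [name for name in names if name not in KNOWN_TOOL_NAMES]


# Hypothetical config with a singular/plural slip in one tool name.
example = '{"agent": {"tool_names": ["gather_evidence", "clinical_trial_search"]}}'
print(unknown_tools(example))  # ['clinical_trial_search']
```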
76 changes: 50 additions & 26 deletions paperqa/sources/clinical_trials.py
@@ -17,7 +17,7 @@
 from paperqa.docs import Docs
 from paperqa.settings import Settings
 from paperqa.types import DocDetails, Embeddable, Text
-from paperqa.utils import gather_with_concurrency
+from paperqa.utils import gather_with_concurrency, logging_filters

 logger = logging.getLogger(__name__)

@@ -29,45 +29,61 @@
 SEARCH_PAGE_SIZE = 1000
 TRIAL_API_FIELDS = "protocolSection,derivedSection"
 DOWNLOAD_CONCURRENCY = 20
-TRIAL_CHAR_TRUNCATION_SIZE = 30_000  # larger will prevent embeddings from working
+TRIAL_CHAR_TRUNCATION_SIZE = 28_000  # stay under 8k tokens for embeddings context limit
 MALFORMATTED_QUERY_STATUS: int = 400


+class CookieWarningFilter(logging.Filter):
+    """Filters out invalid cookie warning.
+
+    clinicaltrials.gov always sends an x-enc header which aiohttp parsers can't handle
+    """
+
+    def filter(self, record):
+        return "Can not load response cookies" not in record.getMessage()
+
+
 @retry(
     stop=stop_after_attempt(3),
     wait=wait_incrementing(0.1, 0.1),
     retry=retry_if_exception_type(ClientResponseError),
 )
 async def api_search_clinical_trials(query: str, session: ClientSession) -> dict:
-    async with session.get(
-        STUDIES_API_URL,
-        params={
-            "query.term": query,
-            "fields": SEARCH_API_FIELDS,
-            "pageSize": SEARCH_PAGE_SIZE,
-            "countTotal": "true",
-            "sort": "@relevance",
-        },
-    ) as response:
-        if response.status == MALFORMATTED_QUERY_STATUS:
-            # the 400s from clinicaltrials.gov are not JSON
-            raise HTTPBadRequest(reason=await response.text())
-        response.raise_for_status()
-        return await response.json()
+
+    with logging_filters(loggers={"aiohttp.client"}, filters={CookieWarningFilter}):
+        async with (
+            session.get(
+                STUDIES_API_URL,
+                params={
+                    "query.term": query,
+                    "fields": SEARCH_API_FIELDS,
+                    "pageSize": SEARCH_PAGE_SIZE,
+                    "countTotal": "true",
+                    "sort": "@relevance",
+                },
+            ) as response,
+        ):
+            if response.status == MALFORMATTED_QUERY_STATUS:
+                # the 400s from clinicaltrials.gov are not JSON
+                raise HTTPBadRequest(reason=await response.text())
+            response.raise_for_status()
+            return await response.json()


 @retry(
     stop=stop_after_attempt(3),
     wait=wait_incrementing(0.1, 0.1),
 )
 async def api_get_clinical_trial(nct_id: str, session: ClientSession) -> dict | None:
-    with suppress(ClientResponseError):
-        async with session.get(
-            f"{STUDIES_API_URL}/{nct_id}", params={"fields": TRIAL_API_FIELDS}
-        ) as response:
-            response.raise_for_status()
-            return await response.json()
-    return None
+    with logging_filters(loggers={"aiohttp.client"}, filters={CookieWarningFilter}):
+        with suppress(ClientResponseError):
+            async with session.get(
+                f"{STUDIES_API_URL}/{nct_id}",
+                params={"fields": TRIAL_API_FIELDS},
+            ) as response:
+                response.raise_for_status()
+                return await response.json()
+        return None


 async def search_retrieve_clinical_trials(
@@ -234,16 +250,20 @@ async def add_clinical_trials_to_docs(
         tuple[int, int, str | None]:
             Total number of trials found, number of trials added, and error message if any.
     """
-    session = aiohttp.ClientSession() if session is None else session
+    # Cookies are not needed, and malformed via clinicaltrials.gov
+    _session = aiohttp.ClientSession() if session is None else session

     logger.info(f"Querying clinical trials for: {query}.")

     try:
         trials, total_result_count = await search_retrieve_clinical_trials(
-            query, session, limit, offset
+            query, _session, limit, offset
         )
     except Exception as e:
         logger.warning(f"Failed to retrieve clinical trials for query: {query}.")
+        # close session if it was ephemeral
+        if session is None:
+            await _session.close()
         return (0, 0, str(e))

     logger.info(f"Successfully found {len(trials)} trials.")
@@ -300,6 +320,10 @@ async def add_clinical_trials_to_docs(
         settings=settings,
     )

+    # close session if it was ephemeral
+    if session is None:
+        await _session.close()
+
     return (total_result_count, len(docs.texts) - inital_docs_size, None)
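
The change above closes an ephemeral session at two separate exit points. Another way to express the same ownership rule is a single `try`/`finally`, which also covers exceptions raised between those two points. The sketch below is not the PR's implementation: `FakeSession` and `use_session` are hypothetical stand-ins so the pattern runs without `aiohttp`.

```python
import asyncio


class FakeSession:
    """Hypothetical stand-in for aiohttp.ClientSession that records closing."""

    def __init__(self):
        self.closed = False

    async def close(self):
        self.closed = True


async def use_session(session=None, fail=False):
    # Create a session only when the caller did not supply one.
    _session = FakeSession() if session is None else session
    try:
        if fail:
            raise RuntimeError("simulated request failure")
        return _session, None
    except RuntimeError as e:
        return _session, str(e)
    finally:
        # Close only the session this function created; leave the caller's open.
        if session is None:
            await _session.close()


owned, _ = asyncio.run(use_session())
shared = FakeSession()
_, error = asyncio.run(use_session(shared, fail=True))
print(owned.closed, shared.closed, error)  # True False simulated request failure
```

The rule is the same one the diff applies: a session passed in by the caller is never closed by the function, on either the success or the failure path.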
23 changes: 23 additions & 0 deletions paperqa/utils.py
@@ -1,6 +1,7 @@
 from __future__ import annotations

 import asyncio
+import contextlib
 import hashlib
 import logging
 import logging.config
@@ -519,3 +520,25 @@ def extract_thought(content: str | None) -> str:
     "peer-review": "misc",  # No direct equivalent, so 'misc' is used
     "other": "article",  # Assume an article if we don't know the type
 }
+
+
+@contextlib.contextmanager
+def logging_filters(loggers: set[str], filters: set[type[logging.Filter]]):
+    """Temporarily add a filter to each specified logger."""
+    filters_added: dict[str, list[logging.Filter]] = {}
+    try:
+        for logger_name in loggers:
+            log_to_filter = logging.getLogger(logger_name)
+            for log_filter in filters:
+                _filter = log_filter()
+                log_to_filter.addFilter(_filter)
+                if logger_name not in filters_added:
+                    filters_added[logger_name] = [_filter]
+                else:
+                    filters_added[logger_name] += [_filter]
+        yield
+    finally:
+        for logger_name, log_filters_to_remove in filters_added.items():
+            log_with_filter = logging.getLogger(logger_name)
+            for log_filter_to_remove in log_filters_to_remove:
+                log_with_filter.removeFilter(log_filter_to_remove)
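
To see the effect of `logging_filters` without making any HTTP requests, the sketch below uses a condensed copy of the context manager and of `CookieWarningFilter`. The `demo.client` logger name and the `ListHandler` class are hypothetical, introduced only to capture messages for inspection.

```python
import contextlib
import logging


class CookieWarningFilter(logging.Filter):
    def filter(self, record):
        return "Can not load response cookies" not in record.getMessage()


@contextlib.contextmanager
def logging_filters(loggers, filters):
    """Condensed copy of the context manager added in this PR."""
    added = []
    try:
        for name in loggers:
            for filter_cls in filters:
                log_filter = filter_cls()
                logging.getLogger(name).addFilter(log_filter)
                added.append((name, log_filter))
        yield
    finally:
        for name, log_filter in added:
            logging.getLogger(name).removeFilter(log_filter)


class ListHandler(logging.Handler):
    """Hypothetical handler that stores messages in a list."""

    def __init__(self):
        super().__init__()
        self.messages = []

    def emit(self, record):
        self.messages.append(record.getMessage())


demo_logger = logging.getLogger("demo.client")  # hypothetical logger name
demo_logger.propagate = False
handler = ListHandler()
demo_logger.addHandler(handler)

with logging_filters(loggers={"demo.client"}, filters={CookieWarningFilter}):
    demo_logger.warning("Can not load response cookies: bad header")  # dropped
demo_logger.warning("Can not load response cookies: bad header")  # recorded

print(handler.messages)  # ['Can not load response cookies: bad header']
```

Loggers are process-wide objects, so while the context is active the filter applies to every record emitted on that logger, including records from other concurrently running tasks.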