@jerowe
Last active December 5, 2024 18:40
Opentargets data load and query
{
"cells": [
{
"cell_type": "markdown",
"id": "efb1d763-68f4-4b50-8e51-d8f26be168c0",
"metadata": {},
"source": [
"# RAG Experiments with LLMs for Drug Discovery\n",
"\n",
"RAG Experiments with Opentargets data.\n"
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "6d2ab485-7cb4-43eb-9e85-ccdbe3ba57d5",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<pre style=\"white-space:pre;overflow-x:auto;line-height:normal;font-family:Menlo,'DejaVu Sans Mono',consolas,'Courier New',monospace\"><span style=\"color: #7fbfbf; text-decoration-color: #7fbfbf\">[17:12:17] </span><span style=\"color: #808000; text-decoration-color: #808000\">WARNING </span> USER_AGENT environment variable not set, consider setting it to identify your <a href=\"file:///opt/conda/lib/python3.11/site-packages/langchain_community/utils/user_agent.py\" target=\"_blank\"><span style=\"color: #7f7f7f; text-decoration-color: #7f7f7f\">user_agent.py</span></a><span style=\"color: #7f7f7f; text-decoration-color: #7f7f7f\">:</span><a href=\"file:///opt/conda/lib/python3.11/site-packages/langchain_community/utils/user_agent.py#11\" target=\"_blank\"><span style=\"color: #7f7f7f; text-decoration-color: #7f7f7f\">11</span></a>\n",
"<span style=\"color: #7fbfbf; text-decoration-color: #7fbfbf\"> </span> requests. <span style=\"color: #7f7f7f; text-decoration-color: #7f7f7f\"> </span>\n",
"</pre>\n"
],
"text/plain": [
"\u001b[2;36m[17:12:17]\u001b[0m\u001b[2;36m \u001b[0m\u001b[33mWARNING \u001b[0m USER_AGENT environment variable not set, consider setting it to identify your \u001b]8;id=802329;file:///opt/conda/lib/python3.11/site-packages/langchain_community/utils/user_agent.py\u001b\\\u001b[2muser_agent.py\u001b[0m\u001b]8;;\u001b\\\u001b[2m:\u001b[0m\u001b]8;id=972358;file:///opt/conda/lib/python3.11/site-packages/langchain_community/utils/user_agent.py#11\u001b\\\u001b[2m11\u001b[0m\u001b]8;;\u001b\\\n",
"\u001b[2;36m \u001b[0m requests. \u001b[2m \u001b[0m\n"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"import logging\n",
"import os\n",
"from typing import Optional, List, Dict, Any\n",
"import glob\n",
"import boto3\n",
"import json\n",
"import funcy\n",
"from IPython.display import Markdown, display\n",
"import pyarrow as pa\n",
"import pandas as pd\n",
"\n",
"from aws_bedrock_utilities.models.base import BedrockBase, RAGResults\n",
"from aws_bedrock_utilities.models.pgvector_knowledgebase import BedrockPGWrapper\n",
"\n",
"from pprint import pprint\n",
"import time\n",
"import logging\n",
"from rich.logging import RichHandler"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "d482cfd0-bfbf-4a25-afe8-60a63af9b538",
"metadata": {},
"outputs": [],
"source": [
"FORMAT = \"%(message)s\"\n",
"logging.basicConfig(\n",
" level=\"INFO\", format=FORMAT, datefmt=\"[%X]\", handlers=[RichHandler()]\n",
")"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "a3cfb3de-03a6-4637-8dfd-208c61718113",
"metadata": {},
"outputs": [],
"source": [
"os.environ['POSTGRES_USER'] = 'postgres'"
]
},
{
"cell_type": "markdown",
"id": "b3048bf3-a0a7-414c-b6a5-d4e252a78256",
"metadata": {},
"source": [
"### Structuring Your Queries\n",
"\n",
"You'll need to first have the collection name you're querying along with your queries.\n",
"\n",
"I always recommend running a few QA queries. Ask the obvious questions in several different ways.\n",
"\n",
"You'll also want to adjust the `MAX_DOCS_RETURNED` based on your time constraints and how many articles are in your knowledgebase. The LLM will search until it hits that maximum, and then stops. You'll need to increase that number for an exhaustive search."
]
},
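{
"cell_type": "markdown",
"id": "sketch-k-vs-fetch-k",
"metadata": {},
"source": [
"In LangChain-style retrievers, `fetch_k` is commonly the size of the candidate pool for maximal marginal relevance (MMR) search, and `k` is how many of those candidates are finally returned. Whether `BedrockPGWrapper` uses MMR internally is an assumption here, so treat the following as a minimal plain-Python sketch of the idea rather than the wrapper's implementation. `cosine` and `mmr_select` are hypothetical helper names.\n",
"\n",
"```python\n",
"import math\n",
"\n",
"\n",
"def cosine(a, b):\n",
"    # Cosine similarity between two equal-length vectors.\n",
"    dot = sum(x * y for x, y in zip(a, b))\n",
"    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))\n",
"    return dot / norm if norm else 0.0\n",
"\n",
"\n",
"def mmr_select(query_vec, doc_vecs, k, fetch_k, lambda_mult=0.5):\n",
"    # Step 1: the candidate pool is the fetch_k documents most similar to the query.\n",
"    relevance = [cosine(query_vec, d) for d in doc_vecs]\n",
"    ranked = sorted(range(len(doc_vecs)), key=lambda i: relevance[i], reverse=True)\n",
"    pool = ranked[:fetch_k]\n",
"\n",
"    def score(i):\n",
"        # Reward similarity to the query, penalize similarity to documents already chosen.\n",
"        redundancy = max((cosine(doc_vecs[i], doc_vecs[j]) for j in selected), default=0.0)\n",
"        return lambda_mult * relevance[i] - (1 - lambda_mult) * redundancy\n",
"\n",
"    # Step 2: greedily pick up to k documents from the pool.\n",
"    selected = []\n",
"    while pool and len(selected) < k:\n",
"        best = max(pool, key=score)\n",
"        selected.append(best)\n",
"        pool.remove(best)\n",
"    return selected\n",
"```\n",
"\n",
"With the settings used in this notebook, that would mean ranking up to 5,000 candidates and keeping 100 of them. A lower `lambda_mult` favors diversity; a higher one favors raw similarity to the query."
]
},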
{
"cell_type": "code",
"execution_count": 4,
"id": "723f54b9-a3fa-4e3d-b9b7-aa56635f41cd",
"metadata": {},
"outputs": [],
"source": [
"# Make sure to keep the collection name consistent!\n",
"#COLLECTION_NAME = \"RA_LITERATURE\"\n",
"COLLECTION_NAME = \"opentargets\"\n",
"MAX_DOCS_RETURNED = 100\n",
"FETCH_K = 5_000"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "348c6260-a7f1-4891-99fd-b3fcb6b1d52f",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<pre style=\"white-space:pre;overflow-x:auto;line-height:normal;font-family:Menlo,'DejaVu Sans Mono',consolas,'Courier New',monospace\"><span style=\"color: #7fbfbf; text-decoration-color: #7fbfbf\"> </span><span style=\"color: #000080; text-decoration-color: #000080\">INFO </span> Found credentials in environment variables. <a href=\"file:///opt/conda/lib/python3.11/site-packages/botocore/credentials.py\" target=\"_blank\"><span style=\"color: #7f7f7f; text-decoration-color: #7f7f7f\">credentials.py</span></a><span style=\"color: #7f7f7f; text-decoration-color: #7f7f7f\">:</span><a href=\"file:///opt/conda/lib/python3.11/site-packages/botocore/credentials.py#1147\" target=\"_blank\"><span style=\"color: #7f7f7f; text-decoration-color: #7f7f7f\">1147</span></a>\n",
"</pre>\n"
],
"text/plain": [
"\u001b[2;36m \u001b[0m\u001b[2;36m \u001b[0m\u001b[34mINFO \u001b[0m Found credentials in environment variables. \u001b]8;id=605355;file:///opt/conda/lib/python3.11/site-packages/botocore/credentials.py\u001b\\\u001b[2mcredentials.py\u001b[0m\u001b]8;;\u001b\\\u001b[2m:\u001b[0m\u001b]8;id=143396;file:///opt/conda/lib/python3.11/site-packages/botocore/credentials.py#1147\u001b\\\u001b[2m1147\u001b[0m\u001b]8;;\u001b\\\n"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"/opt/conda/lib/python3.11/site-packages/aws_bedrock_utilities/models/pgvector_knowledgebase.py:188: LangChainDeprecationWarning: The class `BedrockEmbeddings` was deprecated in LangChain 0.2.11 and will be removed in 1.0. An updated version of the class exists in the :class:`~langchain-aws package and should be used instead. To use it run `pip install -U :class:`~langchain-aws` and import as `from :class:`~langchain_aws import BedrockEmbeddings``.\n",
" self.bedrock_embeddings = BedrockEmbeddings(\n"
]
}
],
"source": [
"p = BedrockPGWrapper(collection_name=COLLECTION_NAME)"
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "be343ecf-2fa8-4065-a490-99b0f45dc6ce",
"metadata": {},
"outputs": [],
"source": [
"#model = \"anthropic.claude-3-sonnet-20240229-v1:0\"\n",
"model = \"anthropic.claude-3-haiku-20240307-v1:0\""
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "54dcdfaf-7114-452c-8567-2b77ae57206e",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<pre style=\"white-space:pre;overflow-x:auto;line-height:normal;font-family:Menlo,'DejaVu Sans Mono',consolas,'Courier New',monospace\"><span style=\"color: #7fbfbf; text-decoration-color: #7fbfbf\">[17:12:18] </span><span style=\"color: #000080; text-decoration-color: #000080\">INFO </span> Found credentials in environment variables. <a href=\"file:///opt/conda/lib/python3.11/site-packages/botocore/credentials.py\" target=\"_blank\"><span style=\"color: #7f7f7f; text-decoration-color: #7f7f7f\">credentials.py</span></a><span style=\"color: #7f7f7f; text-decoration-color: #7f7f7f\">:</span><a href=\"file:///opt/conda/lib/python3.11/site-packages/botocore/credentials.py#1147\" target=\"_blank\"><span style=\"color: #7f7f7f; text-decoration-color: #7f7f7f\">1147</span></a>\n",
"</pre>\n"
],
"text/plain": [
"\u001b[2;36m[17:12:18]\u001b[0m\u001b[2;36m \u001b[0m\u001b[34mINFO \u001b[0m Found credentials in environment variables. \u001b]8;id=594576;file:///opt/conda/lib/python3.11/site-packages/botocore/credentials.py\u001b\\\u001b[2mcredentials.py\u001b[0m\u001b]8;;\u001b\\\u001b[2m:\u001b[0m\u001b]8;id=996302;file:///opt/conda/lib/python3.11/site-packages/botocore/credentials.py#1147\u001b\\\u001b[2m1147\u001b[0m\u001b]8;;\u001b\\\n"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"<pre style=\"white-space:pre;overflow-x:auto;line-height:normal;font-family:Menlo,'DejaVu Sans Mono',consolas,'Courier New',monospace\"><span style=\"color: #7fbfbf; text-decoration-color: #7fbfbf\">[17:13:01] </span><span style=\"color: #000080; text-decoration-color: #000080\">INFO </span> Found credentials in environment variables. <a href=\"file:///opt/conda/lib/python3.11/site-packages/botocore/credentials.py\" target=\"_blank\"><span style=\"color: #7f7f7f; text-decoration-color: #7f7f7f\">credentials.py</span></a><span style=\"color: #7f7f7f; text-decoration-color: #7f7f7f\">:</span><a href=\"file:///opt/conda/lib/python3.11/site-packages/botocore/credentials.py#1147\" target=\"_blank\"><span style=\"color: #7f7f7f; text-decoration-color: #7f7f7f\">1147</span></a>\n",
"</pre>\n"
],
"text/plain": [
"\u001b[2;36m[17:13:01]\u001b[0m\u001b[2;36m \u001b[0m\u001b[34mINFO \u001b[0m Found credentials in environment variables. \u001b]8;id=27589;file:///opt/conda/lib/python3.11/site-packages/botocore/credentials.py\u001b\\\u001b[2mcredentials.py\u001b[0m\u001b]8;;\u001b\\\u001b[2m:\u001b[0m\u001b]8;id=275914;file:///opt/conda/lib/python3.11/site-packages/botocore/credentials.py#1147\u001b\\\u001b[2m1147\u001b[0m\u001b]8;;\u001b\\\n"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"<pre style=\"white-space:pre;overflow-x:auto;line-height:normal;font-family:Menlo,'DejaVu Sans Mono',consolas,'Courier New',monospace\"><span style=\"color: #7fbfbf; text-decoration-color: #7fbfbf\">[17:13:21] </span><span style=\"color: #000080; text-decoration-color: #000080\">INFO </span> Found credentials in environment variables. <a href=\"file:///opt/conda/lib/python3.11/site-packages/botocore/credentials.py\" target=\"_blank\"><span style=\"color: #7f7f7f; text-decoration-color: #7f7f7f\">credentials.py</span></a><span style=\"color: #7f7f7f; text-decoration-color: #7f7f7f\">:</span><a href=\"file:///opt/conda/lib/python3.11/site-packages/botocore/credentials.py#1147\" target=\"_blank\"><span style=\"color: #7f7f7f; text-decoration-color: #7f7f7f\">1147</span></a>\n",
"</pre>\n"
],
"text/plain": [
"\u001b[2;36m[17:13:21]\u001b[0m\u001b[2;36m \u001b[0m\u001b[34mINFO \u001b[0m Found credentials in environment variables. \u001b]8;id=513787;file:///opt/conda/lib/python3.11/site-packages/botocore/credentials.py\u001b\\\u001b[2mcredentials.py\u001b[0m\u001b]8;;\u001b\\\u001b[2m:\u001b[0m\u001b]8;id=376218;file:///opt/conda/lib/python3.11/site-packages/botocore/credentials.py#1147\u001b\\\u001b[2m1147\u001b[0m\u001b]8;;\u001b\\\n"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"<pre style=\"white-space:pre;overflow-x:auto;line-height:normal;font-family:Menlo,'DejaVu Sans Mono',consolas,'Courier New',monospace\"><span style=\"color: #7fbfbf; text-decoration-color: #7fbfbf\">[17:13:45] </span><span style=\"color: #000080; text-decoration-color: #000080\">INFO </span> Found credentials in environment variables. <a href=\"file:///opt/conda/lib/python3.11/site-packages/botocore/credentials.py\" target=\"_blank\"><span style=\"color: #7f7f7f; text-decoration-color: #7f7f7f\">credentials.py</span></a><span style=\"color: #7f7f7f; text-decoration-color: #7f7f7f\">:</span><a href=\"file:///opt/conda/lib/python3.11/site-packages/botocore/credentials.py#1147\" target=\"_blank\"><span style=\"color: #7f7f7f; text-decoration-color: #7f7f7f\">1147</span></a>\n",
"</pre>\n"
],
"text/plain": [
"\u001b[2;36m[17:13:45]\u001b[0m\u001b[2;36m \u001b[0m\u001b[34mINFO \u001b[0m Found credentials in environment variables. \u001b]8;id=890787;file:///opt/conda/lib/python3.11/site-packages/botocore/credentials.py\u001b\\\u001b[2mcredentials.py\u001b[0m\u001b]8;;\u001b\\\u001b[2m:\u001b[0m\u001b]8;id=271352;file:///opt/conda/lib/python3.11/site-packages/botocore/credentials.py#1147\u001b\\\u001b[2m1147\u001b[0m\u001b]8;;\u001b\\\n"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"<pre style=\"white-space:pre;overflow-x:auto;line-height:normal;font-family:Menlo,'DejaVu Sans Mono',consolas,'Courier New',monospace\"><span style=\"color: #7fbfbf; text-decoration-color: #7fbfbf\">[17:14:07] </span><span style=\"color: #000080; text-decoration-color: #000080\">INFO </span> Found credentials in environment variables. <a href=\"file:///opt/conda/lib/python3.11/site-packages/botocore/credentials.py\" target=\"_blank\"><span style=\"color: #7f7f7f; text-decoration-color: #7f7f7f\">credentials.py</span></a><span style=\"color: #7f7f7f; text-decoration-color: #7f7f7f\">:</span><a href=\"file:///opt/conda/lib/python3.11/site-packages/botocore/credentials.py#1147\" target=\"_blank\"><span style=\"color: #7f7f7f; text-decoration-color: #7f7f7f\">1147</span></a>\n",
"</pre>\n"
],
"text/plain": [
"\u001b[2;36m[17:14:07]\u001b[0m\u001b[2;36m \u001b[0m\u001b[34mINFO \u001b[0m Found credentials in environment variables. \u001b]8;id=537912;file:///opt/conda/lib/python3.11/site-packages/botocore/credentials.py\u001b\\\u001b[2mcredentials.py\u001b[0m\u001b]8;;\u001b\\\u001b[2m:\u001b[0m\u001b]8;id=685984;file:///opt/conda/lib/python3.11/site-packages/botocore/credentials.py#1147\u001b\\\u001b[2m1147\u001b[0m\u001b]8;;\u001b\\\n"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"<pre style=\"white-space:pre;overflow-x:auto;line-height:normal;font-family:Menlo,'DejaVu Sans Mono',consolas,'Courier New',monospace\"><span style=\"color: #7fbfbf; text-decoration-color: #7fbfbf\">[17:14:32] </span><span style=\"color: #000080; text-decoration-color: #000080\">INFO </span> Found credentials in environment variables. <a href=\"file:///opt/conda/lib/python3.11/site-packages/botocore/credentials.py\" target=\"_blank\"><span style=\"color: #7f7f7f; text-decoration-color: #7f7f7f\">credentials.py</span></a><span style=\"color: #7f7f7f; text-decoration-color: #7f7f7f\">:</span><a href=\"file:///opt/conda/lib/python3.11/site-packages/botocore/credentials.py#1147\" target=\"_blank\"><span style=\"color: #7f7f7f; text-decoration-color: #7f7f7f\">1147</span></a>\n",
"</pre>\n"
],
"text/plain": [
"\u001b[2;36m[17:14:32]\u001b[0m\u001b[2;36m \u001b[0m\u001b[34mINFO \u001b[0m Found credentials in environment variables. \u001b]8;id=644582;file:///opt/conda/lib/python3.11/site-packages/botocore/credentials.py\u001b\\\u001b[2mcredentials.py\u001b[0m\u001b]8;;\u001b\\\u001b[2m:\u001b[0m\u001b]8;id=150273;file:///opt/conda/lib/python3.11/site-packages/botocore/credentials.py#1147\u001b\\\u001b[2m1147\u001b[0m\u001b]8;;\u001b\\\n"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"<pre style=\"white-space:pre;overflow-x:auto;line-height:normal;font-family:Menlo,'DejaVu Sans Mono',consolas,'Courier New',monospace\"><span style=\"color: #7fbfbf; text-decoration-color: #7fbfbf\">[17:14:52] </span><span style=\"color: #000080; text-decoration-color: #000080\">INFO </span> Found credentials in environment variables. <a href=\"file:///opt/conda/lib/python3.11/site-packages/botocore/credentials.py\" target=\"_blank\"><span style=\"color: #7f7f7f; text-decoration-color: #7f7f7f\">credentials.py</span></a><span style=\"color: #7f7f7f; text-decoration-color: #7f7f7f\">:</span><a href=\"file:///opt/conda/lib/python3.11/site-packages/botocore/credentials.py#1147\" target=\"_blank\"><span style=\"color: #7f7f7f; text-decoration-color: #7f7f7f\">1147</span></a>\n",
"</pre>\n"
],
"text/plain": [
"\u001b[2;36m[17:14:52]\u001b[0m\u001b[2;36m \u001b[0m\u001b[34mINFO \u001b[0m Found credentials in environment variables. \u001b]8;id=308972;file:///opt/conda/lib/python3.11/site-packages/botocore/credentials.py\u001b\\\u001b[2mcredentials.py\u001b[0m\u001b]8;;\u001b\\\u001b[2m:\u001b[0m\u001b]8;id=487158;file:///opt/conda/lib/python3.11/site-packages/botocore/credentials.py#1147\u001b\\\u001b[2m1147\u001b[0m\u001b]8;;\u001b\\\n"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"queries = [\n",
" \"Tell me about T cell–derived cytokines in relation to rheumatoid arthritis and provide citations and article titles\",\n",
" \"Tell me about PTPN22 in relation to rheumatoid arthritis.\",\n",
" \"Tell me about the findings of GWAS studies in rheumatoid arthritis.\",\n",
" \"Tell me about HLA-DRB1 in relation to rheumatoid arthritis\",\n",
" \"Tell me something interesting about this dataset.\",\n",
" \"Tell me about the association between ENSG00000105675 diseases\",\n",
" \"Tell me about all associated data with diseaseId MONDO_0013154\",\n",
"]\n",
"ai_responses = []\n",
"\n",
"for query in queries:\n",
" answer = p.run_kb_chat(query=query, collection_name=COLLECTION_NAME, model_id=model, search_kwargs={'k': MAX_DOCS_RETURNED, 'fetch_k': FETCH_K })\n",
" ai_responses.append(answer)\n",
" time.sleep(1)\n"
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "4002f6be-0518-4bd2-b739-b931189fbf12",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"dict_keys(['source_documents', 'result', 'query'])"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"answer.keys()"
]
},
{
"cell_type": "code",
"execution_count": 9,
"id": "8b0510ac-e922-4055-b034-c8635d9fd240",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"100"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"len(answer['source_documents'])"
]
},
{
"cell_type": "code",
"execution_count": 10,
"id": "f9d450ca-20fb-4cd7-b79a-0be5b8738b1c",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"7"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"len(ai_responses)"
]
},
{
"cell_type": "code",
"execution_count": 11,
"id": "93933142-7cac-490a-bea2-dcf00074ab9f",
"metadata": {},
"outputs": [
{
"data": {
"text/markdown": [
"\n",
"### Query \n",
"**Tell me about T cell–derived cytokines in relation to rheumatoid arthritis and provide citations and article titles**\n",
"\n",
"### Response\n",
"T cell-derived cytokines play a key role in the pathogenesis of rheumatoid arthritis (RA). Some key findings from the literature:\n",
"\n",
"1. Interleukin-17 (IL-17): IL-17 is a pro-inflammatory cytokine produced by T helper 17 (Th17) cells. Elevated levels of IL-17 have been found in the synovial fluid and serum of RA patients compared to healthy controls (Chabaud et al. 1999, \"Enhancing effect of IL-17 on IL-1-induced IL-6 and leukemia inhibitory factor production by rheumatoid arthritis synoviocytes\").\n",
"\n",
"2. Tumor necrosis factor-alpha (TNF-α): TNF-α is a key cytokine produced by T cells and other immune cells. Increased levels of TNF-α are found in the synovial fluid and serum of RA patients and contribute to joint inflammation and destruction (Feldmann et al. 1996, \"Role of cytokines in rheumatoid arthritis\").\n",
"\n",
"3. Interferon-gamma (IFN-γ): IFN-γ is produced by T helper 1 (Th1) cells and has been implicated in the pathogenesis of RA. Elevated levels of IFN-γ have been detected in the synovial fluid of RA patients (Canete et al. 2003, \"Differential Th1/Th2 cytokine patterns in chronic arthritis: interferon-gamma is highly expressed in synovium of rheumatoid arthritis compared with seronegative spondylarthropathies\").\n",
"\n",
"4. Interleukin-6 (IL-6): IL-6 is a pleiotropic cytokine produced by various cell types, including T cells. Increased levels of IL-6 have been found in the synovial fluid and serum of RA patients and contribute to joint inflammation and bone destruction (Hashizume et al. 2011, \"The role of interleukin-6 in the pathogenesis of rheumatoid arthritis\").\n",
"\n",
"These T cell-derived cytokines play a crucial role in the pathogenesis of RA by promoting inflammation, joint destruction, and autoimmunity. Targeting these cytokines has been a successful therapeutic approach in the management of RA, as evidenced by the development of biologic therapies such as anti-TNF, anti-IL-6, and anti-IL-17 agents.\n",
"\n",
"--- \n",
" "
],
"text/plain": [
"<IPython.core.display.Markdown object>"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/markdown": [
"\n",
"### Query \n",
"**Tell me about PTPN22 in relation to rheumatoid arthritis.**\n",
"\n",
"### Response\n",
"PTPN22 (protein tyrosine phosphatase non-receptor type 22) is a gene that has been associated with an increased risk of developing rheumatoid arthritis. Some key points about the relationship between PTPN22 and rheumatoid arthritis:\n",
"\n",
"- Genetic variants in the PTPN22 gene, particularly the R620W variant, have been consistently identified as a risk factor for rheumatoid arthritis in multiple studies.\n",
"- The PTPN22 R620W variant is estimated to increase the risk of developing rheumatoid arthritis by around 1.5 to 2 times compared to individuals without the variant.\n",
"- Approximately 10-15% of rheumatoid arthritis patients carry the PTPN22 R620W variant, compared to around 5-7% of the general population.\n",
"- PTPN22 plays a role in regulating T-cell activation, and the R620W variant is thought to lead to dysregulation of the immune system, contributing to the development of autoimmune diseases like rheumatoid arthritis.\n",
"\n",
"In summary, genetic variations in the PTPN22 gene, particularly the R620W variant, are considered an established genetic risk factor for rheumatoid arthritis, increasing the risk of developing the disease by around 1.5 to 2 times in affected individuals.\n",
"\n",
"--- \n",
" "
],
"text/plain": [
"<IPython.core.display.Markdown object>"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/markdown": [
"\n",
"### Query \n",
"**Tell me about the findings of GWAS studies in rheumatoid arthritis.**\n",
"\n",
"### Response\n",
"According to the information provided, several genetic associations have been identified through genome-wide association studies (GWAS) for rheumatoid arthritis (RA):\n",
"\n",
"1. The gene ENSG00000069329 has a high genetic association score of 0.7657582601 with evidence from 114 studies.\n",
"2. Variants in the genes ENSG00000242515, ENSG00000244122, ENSG00000244474, ENSG00000288702, and ENSG00000288705 have all been associated with RA, with a genetic association score of 0.7962204752 and evidence from 4 studies.\n",
"3. The gene ENSG00000036828 has a genetic association score of 0.5471377179 with RA, based on 1 study.\n",
"4. The genes ENSG00000143921, ENSG00000005471, ENSG00000138075, and ENSG00000143921 have been found to affect pathways related to RA, with a score of 0.6079307976 and evidence from 1 study each.\n",
"\n",
"Overall, the provided information highlights several genetic loci that have been implicated in rheumatoid arthritis through GWAS, with varying levels of statistical evidence supporting their association.\n",
"\n",
"--- \n",
" "
],
"text/plain": [
"<IPython.core.display.Markdown object>"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/markdown": [
"\n",
"### Query \n",
"**Tell me about HLA-DRB1 in relation to rheumatoid arthritis**\n",
"\n",
"### Response\n",
"The HLA-DRB1 gene is strongly associated with an increased risk of developing rheumatoid arthritis. Specific alleles of the HLA-DRB1 gene, such as HLA-DRB1*04:01 and HLA-DRB1*04:04, have been found to confer a significantly higher risk of rheumatoid arthritis.\n",
"\n",
"Studies have shown that individuals who carry the HLA-DRB1*04:01 or HLA-DRB1*04:04 alleles have a 3-5 fold increased risk of developing rheumatoid arthritis compared to individuals without these alleles. Additionally, the presence of these high-risk HLA-DRB1 alleles is associated with more severe disease and earlier onset of rheumatoid arthritis.\n",
"\n",
"Approximately 60-70% of individuals with rheumatoid arthritis carry one of the high-risk HLA-DRB1 alleles, compared to only 30-40% of the general population. This genetic association is one of the strongest known risk factors for rheumatoid arthritis, highlighting the important role of the HLA-DRB1 gene in the development and progression of this autoimmune disease.\n",
"\n",
"--- \n",
" "
],
"text/plain": [
"<IPython.core.display.Markdown object>"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/markdown": [
"\n",
"### Query \n",
"**Tell me something interesting about this dataset.**\n",
"\n",
"### Response\n",
"Here are some interesting observations about the dataset:\n",
"\n",
"1. The dataset contains information about 2 diseases, \"OTAR_0000003\" and \"HP_0002019\", and their associations with 1,326 different target genes.\n",
"\n",
"2. The target gene with the highest cumulative score across both diseases is ENSG00000070019, with a total score of 0.5660489059 and 32 pieces of evidence.\n",
"\n",
"3. The target gene with the most evidence across both diseases is ENSG00000198793, with 153 pieces of evidence, although its cumulative score is relatively lower at 0.0405485057.\n",
"\n",
"4. There are 75 target genes that have evidence for the \"OTAR_0000003\" disease, but no evidence for the \"HP_0002019\" disease, suggesting they may be more specifically associated with the \"OTAR_0000003\" disease.\n",
"\n",
"5. The dataset includes information from multiple data types, including literature, known drugs, and animal models. The literature data type has the most evidence, with 144 total pieces of evidence across the two diseases.\n",
"\n",
"6. The target gene with the highest literature-based score for \"OTAR_0000003\" is ENSG00000121957, with a score of 0.6495036853 based on 9 pieces of evidence.\n",
"\n",
"In summary, this dataset provides a detailed view of the genetic associations for two diseases, highlighting the target genes with the strongest evidence and the data types contributing the most information.\n",
"\n",
"--- \n",
" "
],
"text/plain": [
"<IPython.core.display.Markdown object>"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/markdown": [
"\n",
"### Query \n",
"**Tell me about the association between ENSG00000105675 diseases**\n",
"\n",
"### Response\n",
"Unfortunately, I do not have any information about the association between the gene ENSG00000105675 and any diseases. The provided context does not contain any data related to this gene or its disease associations. Without additional information, I cannot provide a specific answer about the relationship between ENSG00000105675 and any diseases.\n",
"\n",
"--- \n",
" "
],
"text/plain": [
"<IPython.core.display.Markdown object>"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/markdown": [
"\n",
"### Query \n",
"**Tell me about all associated data with diseaseId MONDO_0013154**\n",
"\n",
"### Response\n",
"Unfortunately, there is no information provided in the given context about the disease with ID MONDO_0013154. The context only contains information about other diseases and their associated gene targets. Without any data on MONDO_0013154, I do not have enough information to provide a detailed response about this disease.\n",
"\n",
"--- \n",
" "
],
"text/plain": [
"<IPython.core.display.Markdown object>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"for answer in ai_responses:\n",
" t = Markdown(f\"\"\"\n",
"### Query \n",
"**{answer['query']}**\n",
"\n",
"### Response\n",
"{answer['result']}\n",
"\n",
"--- \n",
" \"\"\")\n",
" display(t)\n"
]
},
{
"cell_type": "markdown",
"id": "1b50742b-3121-4f73-b385-3c6fb9b3544f",
"metadata": {},
"source": [
"## No Information Provided\n",
"\n",
"Now, this is interesting. When I give the LLM just an ID and tell it to go forth, then its not able to. I think this is because having only the ID isn't passing the similarity threshold. I think this would be a case for breaking the RAG up into several steps. First, run a full text search, return those records, and pass those to the RAG using a contextual search. That approach is outside the scope of this notebook.\n",
"\n",
"Let's try with a more represented disease."
]
},
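{
"cell_type": "markdown",
"id": "sketch-two-step-retrieval",
"metadata": {},
"source": [
"The multi-step idea above could be sketched roughly as follows. This is a plain-Python illustration under stated assumptions, not part of `aws_bedrock_utilities`: `extract_ids`, `exact_match_filter`, and `build_context` are hypothetical helpers, the ID pattern is a guess, and `records` is assumed to be a list of flat dicts with `diseaseId` and `targetId` keys. In practice the exact-match step would more likely be a SQL or full-text query against Postgres.\n",
"\n",
"```python\n",
"import json\n",
"import re\n",
"\n",
"# Matches identifiers such as MONDO_0013154 or ENSG00000105675 (the pattern is an assumption).\n",
"ID_PATTERN = re.compile(r'\\b(?:[A-Za-z]+_\\d+|ENSG\\d{11})\\b')\n",
"\n",
"\n",
"def extract_ids(query):\n",
"    # Pull ontology-style or Ensembl-style identifiers out of a free-text query.\n",
"    return ID_PATTERN.findall(query)\n",
"\n",
"\n",
"def exact_match_filter(records, ids):\n",
"    # Step 1: keep only records whose diseaseId or targetId was named in the query.\n",
"    wanted = set(ids)\n",
"    return [r for r in records if r.get('diseaseId') in wanted or r.get('targetId') in wanted]\n",
"\n",
"\n",
"def build_context(records, limit=100):\n",
"    # Step 2: serialize the matched records, one JSON object per line, for use as LLM context.\n",
"    return '\\n'.join(json.dumps(r, sort_keys=True) for r in records[:limit])\n",
"```\n",
"\n",
"The string returned by `build_context` would then be supplied to the model alongside the original question, so the similarity search no longer has to find an ID on its own."
]
},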
{
"cell_type": "code",
"execution_count": 12,
"id": "3920eb67-4204-4970-aef9-61eccf2f0595",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<pre style=\"white-space:pre;overflow-x:auto;line-height:normal;font-family:Menlo,'DejaVu Sans Mono',consolas,'Courier New',monospace\"><span style=\"color: #7fbfbf; text-decoration-color: #7fbfbf\">[17:15:12] </span><span style=\"color: #000080; text-decoration-color: #000080\">INFO </span> Found credentials in environment variables. <a href=\"file:///opt/conda/lib/python3.11/site-packages/botocore/credentials.py\" target=\"_blank\"><span style=\"color: #7f7f7f; text-decoration-color: #7f7f7f\">credentials.py</span></a><span style=\"color: #7f7f7f; text-decoration-color: #7f7f7f\">:</span><a href=\"file:///opt/conda/lib/python3.11/site-packages/botocore/credentials.py#1147\" target=\"_blank\"><span style=\"color: #7f7f7f; text-decoration-color: #7f7f7f\">1147</span></a>\n",
"</pre>\n"
],
"text/plain": [
"\u001b[2;36m[17:15:12]\u001b[0m\u001b[2;36m \u001b[0m\u001b[34mINFO \u001b[0m Found credentials in environment variables. \u001b]8;id=335029;file:///opt/conda/lib/python3.11/site-packages/botocore/credentials.py\u001b\\\u001b[2mcredentials.py\u001b[0m\u001b]8;;\u001b\\\u001b[2m:\u001b[0m\u001b]8;id=112443;file:///opt/conda/lib/python3.11/site-packages/botocore/credentials.py#1147\u001b\\\u001b[2m1147\u001b[0m\u001b]8;;\u001b\\\n"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"queries = [\n",
" \"Tell me about all associated data with diseaseId MONDO_0004979\",\n",
"]\n",
"ai_responses = []\n",
"\n",
"for query in queries:\n",
" answer = p.run_kb_chat(query=query, collection_name=COLLECTION_NAME, model_id=model, search_kwargs={'k': MAX_DOCS_RETURNED, 'fetch_k': FETCH_K })\n",
" ai_responses.append(answer)\n",
" time.sleep(1)\n"
]
},
{
"cell_type": "code",
"execution_count": 13,
"id": "eabfb523-50f9-4479-875e-07d448a1d988",
"metadata": {},
"outputs": [
{
"data": {
"text/markdown": [
"\n",
"### Query \n",
"**Tell me about all associated data with diseaseId MONDO_0004979**\n",
"\n",
"### Response\n",
"Based on the provided context, here are the key details about the disease with ID MONDO_0004979:\n",
"\n",
"- There are 26 target genes associated with this disease, including ENSG00000084207 (score 0.8588680111, evidenceCount 112), ENSG00000085662 (score 0.1589561579, evidenceCount 9), and ENSG00000084674 (score 0.3709170002, evidenceCount 8).\n",
"- The target genes with the highest association scores are ENSG00000084207 (score 0.8588680111), ENSG00000085265 (score 0.5714549498), and ENSG00000085662 (score 0.1589561579).\n",
"- The target genes with the highest evidence counts are ENSG00000084207 (evidenceCount 112), ENSG00000083457 (evidenceCount 21), and ENSG00000085662 (evidenceCount 9).\n",
"- Overall, the data suggests that this disease has relatively strong associations with several target genes, particularly ENSG00000084207, which has a high association score and evidence count.\n",
"\n",
"--- \n",
" "
],
"text/plain": [
"<IPython.core.display.Markdown object>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"for answer in ai_responses:\n",
" t = Markdown(f\"\"\"\n",
"### Query \n",
"**{answer['query']}**\n",
"\n",
"### Response\n",
"{answer['result']}\n",
"\n",
"--- \n",
" \"\"\")\n",
" display(t)\n"
]
},
{
"cell_type": "code",
"execution_count": 14,
"id": "321d0e07-899e-4c5c-8e66-fce865a05fa4",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"{\n",
" \"diseaseId\": \"MONDO_0004979\",\n",
" \"targetGenes\": [\n",
" {\n",
" \"targetId\": \"ENSG00000066336\",\n",
" \"score\": 0.0306468608,\n",
" \"evidenceCount\": 9\n",
" },\n",
" {\n",
" \"targetId\": \"ENSG00000066379\",\n",
" \"score\": 0.0044349583,\n",
" \"evidenceCount\": 1\n",
" },\n",
" {\n",
" \"targetId\": \"ENSG00000066405\",\n",
" \"score\": 0.0505578736,\n",
" \"evidenceCount\": 11\n",
" },\n",
" {\n",
" \"targetId\": \"ENSG00000066422\",\n",
" \"score\": 0.0336205232,\n",
" \"evidenceCount\": 1\n",
" },\n",
" {\n",
" \"targetId\": \"ENSG00000066468\",\n",
" \"score\": 0.0771189963,\n",
" \"evidenceCount\": 4\n",
" },\n",
" {\n",
" \"targetId\": \"ENSG00000067082\",\n",
" \"score\": 0.0081307568,\n",
" \"evidenceCount\": 1\n",
" },\n",
" {\n",
" \"targetId\": \"ENSG00000067182\",\n",
" \"score\": 0.056238595,\n",
" \"evidenceCount\": 21\n",
" },\n",
" {\n",
" \"targetId\": \"ENSG00000067208\",\n",
" \"score\": 0.2020186209,\n",
" \"evidenceCount\": 2\n",
" },\n",
" {\n",
" \"targetId\": \"ENSG00000067225\",\n",
" \"score\": 0.0488590728,\n",
" \"evidenceCount\": 5\n",
" },\n",
" {\n",
" \"targetId\": \"ENSG00000067560\",\n",
" \"score\": 0.0695231231,\n",
" \"evidenceCount\": 58\n",
" },\n",
" {\n",
" \"targetId\": \"ENSG00000067606\",\n",
" \"score\": 0.016302578,\n",
" \"evidenceCount\": 3\n",
" },\n",
" {\n",
" \"targetId\": \"ENSG00000067798\",\n",
" \"score\": 0.005174118,\n",
" \"evidenceCount\": 1\n",
" },\n",
" {\n",
" \"targetId\": \"ENSG00000067900\",\n",
" \"score\": 0.0075787128,\n",
" \"evidenceCount\": 9\n",
" },\n",
" {\n",
" \"targetId\": \"ENSG00000068024\",\n",
" \"score\": 0.0355242084,\n",
" \"evidenceCount\": 7\n",
" },\n",
" {\n",
" \"targetId\": \"ENSG00000068028\",\n",
" \"score\": 0.0014783194,\n",
" \"evidenceCount\": 1\n",
" },\n",
" {\n",
" \"targetId\": \"ENSG00000068078\",\n",
" \"score\": 0.3384006777,\n",
" \"evidenceCount\": 12\n",
" },\n",
" {\n",
" \"targetId\": \"ENSG00000068079\",\n",
" \"score\": 0.0014783194,\n",
" \"evidenceCount\": 1\n",
" },\n",
" {\n",
" \"targetId\": \"ENSG00000068305\",\n",
" \"score\": 0.0029566388,\n",
" \"evidenceCount\": 1\n",
" },\n",
" {\n",
" \"targetId\": \"ENSG00000068366\",\n",
" \"score\": 0.0127299728,\n",
" \"evidenceCount\": 3\n",
" },\n",
" {\n",
" \"targetId\": \"ENSG00000068796\",\n",
" \"score\": 0.0073915971,\n",
" \"evidenceCount\": 1\n",
" },\n",
" {\n",
" \"targetId\": \"ENSG00000068903\",\n",
" \"score\": 0.0075455887,\n",
" \"evidenceCount\": 4\n",
" },\n",
" {\n",
" \"targetId\": \"ENSG00000069011\",\n",
" \"score\": 0.0110873956,\n",
" \"evidenceCount\": 1\n",
" },\n",
" {\n",
" \"targetId\": \"ENSG00000069020\",\n",
" \"score\": 0.0118265553,\n",
" \"evidenceCount\": 1\n",
" },\n",
" {\n",
" \"targetId\": \"ENSG00\n"
]
}
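],
"source": [
"# One query: list every target gene associated with a disease, formatted as JSON.\n",
"queries = [\n",
"    \"Tell me about all associated target genes with their scores for diseaseId MONDO_0004979. Format the response in JSON.\",\n",
"]\n",
"ai_responses = []\n",
"\n",
"for query in queries:\n",
"    # k caps the documents returned; fetch_k sets how many candidates are fetched first (as the constant names suggest).\n",
"    answer = p.run_kb_chat(\n",
"        query=query,\n",
"        collection_name=COLLECTION_NAME,\n",
"        model_id=model,\n",
"        search_kwargs={'k': MAX_DOCS_RETURNED, 'fetch_k': FETCH_K},\n",
"    )\n",
"    ai_responses.append(answer)\n",
"    time.sleep(1)  # brief pause between requests\n",
"\n",
"print(ai_responses[0]['result'])"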
]
},
{
"cell_type": "markdown",
"id": "bf11f609-8581-41d5-bcdc-888240b4de66",
"metadata": {},
"source": [
"The response is cut off mid-record (most likely at the model's output token limit), so the JSON is incomplete as printed, but we still get the general information. The LLM inferred enough about the data structures and their relationships from the retrieved documents alone, without any need for a graph database or an explicit schema!"
]
},
{
"cell_type": "markdown",
"id": "910a23a5-29b3-4f10-8474-bad0251b30a4",
"metadata": {},
"source": [
"## Wrap Up\n",
"\n",
"There you have it! We built a knowledge base on the cheap, used AWS Bedrock to load the embeddings, and then used a Claude LLM to run our queries."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.9"
}
},
"nbformat": 4,
"nbformat_minor": 5
}