GeneMania_DataProcessingPipeline.ipynb
{ | |
"nbformat": 4, | |
"nbformat_minor": 0, | |
"metadata": { | |
"colab": { | |
"name": "GeneMania_DataProcessingPipeline.ipynb", | |
"provenance": [], | |
"collapsed_sections": [], | |
"mount_file_id": "https://gist.github.com/callahantiff/871f6bcbdbd6603d1eb19a38ddc7321f#file-genemania_dataprocessingpipeline-ipynb", | |
"authorship_tag": "ABX9TyNFzvUgJ5ESgR7n1f2BgIf7", | |
"include_colab_link": true | |
}, | |
"kernelspec": { | |
"name": "python3", | |
"display_name": "Python 3" | |
} | |
}, | |
"cells": [ | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"id": "view-in-github", | |
"colab_type": "text" | |
}, | |
"source": [ | |
"<a href=\"https://colab.research.google.com/gist/callahantiff/871f6bcbdbd6603d1eb19a38ddc7321f/genemania_dataprocessingpipeline.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"id": "1cBdztLei-TT", | |
"colab_type": "text" | |
}, | |
"source": [ | |
"<img align=\"right\" width=\"300\" alt=\"Screen Shot 2019-12-12 at 21 59 22\" src=\"https://user-images.githubusercontent.com/8030363/70771518-9d1f5980-1d2e-11ea-9201-d5aade3fe376.png\">\n", | |
"\n", | |
"## Gene Mania Data Processing Pipeline\n", | |
"**Creation Date:** `06/16/20` \n", | |
"**Contact Notebook Author:** [`TJCallahan`](https://mail.google.com/mail/u/0/?view=cm&fs=1&tf=1&[email protected]) \n", | |
"\n", | |
"<br>\n", | |
"\n", | |
"### Data \n", | |
"**Release:** `2017-Mar-14` \n", | |
"**Downloaded URL:** [`COMBINED.DEFAULT_NETWORKS.BP_COMBINING.txt`](http://genemania.org/data/current/Homo_sapiens.COMBINED/COMBINED.DEFAULT_NETWORKS.BP_COMBINING.txt) \n", | |
"**PubMed ID:** [`20576703`](https://pubmed.ncbi.nlm.nih.gov/20576703/) \n", | |
"**Description:** This file contains `3` columns, where each row represents an edge. Within an edge, the first two columns contain a single **[`Ensembl Gene`](https://uswest.ensembl.org/index.html)** identifier and the third column contains a float representing a weight. Please note the following details copied from GeneMANIA's [`Data Archive`](http://pages.genemania.org/data/) page: \n", | |
"- Each interacting pair of genes will be present exactly once in the file (symmetric interactions are not included) \n", | |
"- Non-interacting genes are not present \n", | |
"- No assumptions are made regarding the order of the records in the file or the order of genes in a record\n", | |
"\n", | |
"<br>\n", | |
"\n", | |
"### Purpose \n", | |
"The goal of this notebook is to provide a reproducible workflow for downloading gene-gene interaction data from **[`GeneMANIA`](http://pages.genemania.org/)**. This pipeline consists of the following 3 steps: (1) *Download Data* (i.e. data is downloaded into a `Pandas.DataFrame` object directly from the URL referenced above); (2) *Data Processing* (i.e. `GeneMANIA` this workflow provides optional functionality to convert the default provided asymmetric edge list into a symmetric set of edges); and (3) *Data Output* (i.e. the processed edge lis tis output as a tab-delimited `csv` file). \n", | |
"\n", | |
"**Data Documentation for use in Publications:** Gene-gene interaction (GGI) data was downloaded from GeneMANIA [**[`PMID:20576703`](https://pubmed.ncbi.nlm.nih.gov/20576703/)**] (release date: 03/14/2017). GeneMANIA provides species-specific networks built using co-expression data, physical interactions, genetic interactions, shared protein domains, co-localization, pathways, computational inference, and others (e.g. phenotype, disease, and chemical relationships from OMIM and Ensembl). These relationships are obtained by processing data from GEO, BioGRID, EMBL-EBI, Pfam, Ensembl, NCBI, MGI, I2D, InParanoid, and Pathway Commons. See GeneMANIA's [help page](http://pages.genemania.org/help/#network-data-sources) for more information. While GeneMANIA provides many different types of networks, we utilized the Homo sapiens Combined network, which includes all of the interaction network types described above merged into a single network by leveraging Gene Ontology Biological Process-based functional enrichment analysis (described in detail [here](http://pages.genemania.org/help/#network-data-sources)). The Homo sapien combined GGI data was downloaded on 06/16/20 and all GGIs were included resulting in a analysis set of 13,959,260 asymmetric GGIs.\n", | |
"\n", | |
" " | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"id": "F-34WrDXwgG3", | |
"colab_type": "text" | |
}, | |
"source": [ | |
"#### Set-up Environment" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"metadata": { | |
"id": "cOLAMYHFiza2", | |
"colab_type": "code", | |
"colab": {} | |
}, | |
"source": [ | |
"# load needed libraries\n", | |
"import ftplib\n", | |
"import pandas as pd\n", | |
"\n", | |
"from contextlib import closing\n", | |
"from google.colab import drive\n", | |
"from tqdm import tqdm" | |
], | |
"execution_count": 1, | |
"outputs": [] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"id": "8_41IUF8wp2E", | |
"colab_type": "text" | |
}, | |
"source": [ | |
"#### STEP 1 - DOWNLOAD DATA\n" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"metadata": { | |
"id": "Yf3jIFS4wp_V", | |
"colab_type": "code", | |
"colab": {} | |
}, | |
"source": [ | |
"# GGI URL\n", | |
"url = 'http://genemania.org/data/current/Homo_sapiens.COMBINED/COMBINED.DEFAULT_NETWORKS.BP_COMBINING.txt'\n", | |
"\n", | |
"# load data from URL into Pandas DataFrame\n", | |
"ggi_raw = pd.read_csv(url, sep='\\t', header=0)\n", | |
"\n", | |
"# preview first few rows of the data\n", | |
"ggi_raw.head(n=10)" | |
], | |
"execution_count": null, | |
"outputs": [] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"id": "4ZzseFaIzI-s", | |
"colab_type": "text" | |
}, | |
"source": [ | |
"#### STEP 2 - DATA PROCESSING" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"metadata": { | |
"id": "uTkPgz3szfzL", | |
"colab_type": "code", | |
"colab": { | |
"base_uri": "https://localhost:8080/", | |
"height": 35 | |
}, | |
"outputId": "7d7121db-6194-4e31-b27e-8f9ffe4418cf" | |
}, | |
"source": [ | |
"# print unique counts of edges and source/target nodes\n", | |
"ggi_unq = ggi_raw.drop_duplicates()\n", | |
"\n", | |
"edges = len(ggi_unq)\n", | |
"source_nodes = len(set(list(ggi_unq['Gene_A'])))\n", | |
"target_nodes = len(set(list(ggi_unq['Gene_B'])))\n", | |
"\n", | |
"# print counts\n", | |
"'There are {edges} unique edges, {source} unique source nodes, and {target} unique target nodes'.format(edges=edges,\n", | |
" source=source_nodes,\n", | |
" target=target_nodes)" | |
], | |
"execution_count": 3, | |
"outputs": [ | |
{ | |
"output_type": "execute_result", | |
"data": { | |
"application/vnd.google.colaboratory.intrinsic": { | |
"type": "string" | |
}, | |
"text/plain": [ | |
"'There are 6979630 unique edges, 19167 unique source nodes, and 19503 unique target nodes'" | |
] | |
}, | |
"metadata": { | |
"tags": [] | |
}, | |
"execution_count": 3 | |
} | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"metadata": { | |
"id": "20PbFkYCzJJZ", | |
"colab_type": "code", | |
"colab": {} | |
}, | |
"source": [ | |
"# create a symmetric version of data\n", | |
"source = list(ggi_unq['Gene_A']) + list(ggi_unq['Gene_B'])\n", | |
"target = list(ggi_unq['Gene_B']) + list(ggi_unq['Gene_A'])\n", | |
"weight = list(ggi_unq['Weight']) + list(ggi_unq['Weight'])\n", | |
"\n", | |
"# convert lists to Pandas DataFrame\n", | |
"ggi_sym = pd.DataFrame(list(zip(source, target, weight)),\n", | |
" columns =['Gene_A', 'Gene_B', 'Weight'])\n", | |
"\n", | |
"# remove duplicates\n", | |
"ggi_sym_unq = ggi_sym.drop_duplicates()\n", | |
"\n", | |
"edges = len(ggi_sym_unq)\n", | |
"source_nodes = len(set(list(ggi_sym_unq['Gene_A'])))\n", | |
"target_nodes = len(set(list(ggi_sym_unq['Gene_B'])))\n", | |
"\n", | |
"# print counts\n", | |
"'There are {edges} unique edges, {source} unique source nodes, and {target} unique target nodes'.format(edges=edges,\n", | |
" source=source_nodes,\n", | |
" target=target_nodes)" | |
], | |
"execution_count": null, | |
"outputs": [] | |
}, | |
{ | |
"cell_type": "code", | |
"metadata": { | |
"id": "RTKlY-RK5q7w", | |
"colab_type": "code", | |
"colab": {} | |
}, | |
"source": [ | |
"# sort data by Weight to verify file looks correct\n", | |
"ggi_sym_unq_srt = ggi_sym_unq.sort_values(by=['Weight', 'Gene_A'])\n", | |
"\n", | |
"# preview data\n", | |
"ggi_sym_unq_srt.head(n=10)" | |
], | |
"execution_count": null, | |
"outputs": [] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"id": "dscIgsJ2zJT8", | |
"colab_type": "text" | |
}, | |
"source": [ | |
"#### STEP 3 - DATA OUTPUT" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"metadata": { | |
"id": "y38iY8FOzJar", | |
"colab_type": "code", | |
"colab": {} | |
}, | |
"source": [ | |
"# mount GoogleDrive - you will be prompted to authenticate your GoogleDrive\n", | |
"# if you get stuck follow instructions here: https://stackoverflow.com/questions/49394737/exporting-data-from-google-colab-to-local-machine\n", | |
"drive.mount('/drive', force_remount=True)" | |
], | |
"execution_count": 7, | |
"outputs": [] | |
}, | |
{ | |
"cell_type": "code", | |
"metadata": { | |
"id": "eiXaofTMzqqs", | |
"colab_type": "code", | |
"colab": {} | |
}, | |
"source": [ | |
"# save processed DataFrame locally - edges\n", | |
"ggi_sym_unq_srt.to_csv('/drive/My Drive/Colab Notebooks/data/GGI_Combined_HomoSapien_16June2020.csv', sep='\\t', header=True, index=False)" | |
], | |
"execution_count": null, | |
"outputs": [] | |
}, | |
{ | |
"cell_type": "code", | |
"metadata": { | |
"id": "ooeEttC61qgi", | |
"colab_type": "code", | |
"colab": {} | |
}, | |
"source": [ | |
"# save node list\n", | |
"unique_genes = ggi_sym_unq_srt['Gene_A'].drop_duplicates()\n", | |
"unique_genes.to_csv('/drive/My Drive/Colab Notebooks/data/GGI_Combined_HomoSapien_UniqueNodes_04July2020.csv', sep='\\t', header=True, index=False)" | |
], | |
"execution_count": 8, | |
"outputs": [] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"id": "hPRFFcQrxsbe", | |
"colab_type": "text" | |
}, | |
"source": [ | |
"### Node Attributes \n", | |
"In order to make the edges more interpretable, we also pull some node attribute data from the sources listed in the table below. \n", | |
"\n", | |
"Tab | Source | Source Version/Release Date | Source URL | Download Date\n", | |
"-- | -- | -- | -- | --\n", | |
"Ensembl_HS.GRCh38.100.Uniprot | Ensembl | 100 | [URL](ftp://ftp.ensembl.org/pub/release-100/tsv/homo_sapiens/Homo_sapiens.GRCh38.100.uniprot.tsv.gz) | 7/4/20\n", | |
"GOA_human | Gene Ontology Consortium | 6/1/20 | [URL](http://geneontology.org/gene-associations/goa_human.gaf.gz) | 7/4/20\n", | |
"\n" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"id": "KXlZjW7iL1_e", | |
"colab_type": "text" | |
}, | |
"source": [ | |
"#### Download Node Data" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"metadata": { | |
"id": "HzSdELuGEtF8", | |
"colab_type": "code", | |
"colab": {} | |
}, | |
"source": [ | |
"def gzipped_ftp_url_download(url: str, write_location: str):\n", | |
"    \"\"\"Downloads a gzipped file from an ftp server.\n", | |
"\n", | |
"    Args:\n", | |
"        url: A string containing the FTP URL of the gzipped file to download.\n", | |
"        write_location: A string that points to a file directory (must end with a path separator).\n", | |
"\n", | |
"    Returns:\n", | |
"        write_loc: A string containing the directory and filename where the data was downloaded.\n", | |
"    \"\"\"\n", | |
"\n", | |
"    server = url.replace('ftp://', '').split('/')[0]\n", | |
"    directory = '/'.join(url.replace('ftp://', '').split('/')[1:-1])\n", | |
"    file = url.replace('ftp://', '').split('/')[-1]\n", | |
"    write_loc = write_location + '{filename}'.format(filename=file)\n", | |
"\n", | |
"    print('Downloading Gzipped data from FTP Server: {}'.format(url))\n", | |
"    # both the FTP connection and the output file are closed when the with block exits\n", | |
"    with closing(ftplib.FTP(server)) as ftp, open(write_loc, 'wb') as fid:\n", | |
"        ftp.login()\n", | |
"        ftp.cwd(directory)\n", | |
"        ftp.retrbinary('RETR {}'.format(file), fid.write)\n", | |
"\n", | |
"    return write_loc\n", | |
"\n", | |
"def convert_to_dict(data, col_a, col_b):\n", | |
"    \"\"\"Converts a Pandas DataFrame into a dictionary.\n", | |
"\n", | |
"    Args:\n", | |
"        data: A Pandas DataFrame.\n", | |
"        col_a: A string containing a column name to be used as the dictionary key.\n", | |
"        col_b: A string containing a column name to be used as the dictionary value.\n", | |
"\n", | |
"    Returns:\n", | |
"        node_metadata: A dictionary where keys are gene identifiers and values are a set of identifiers.\n", | |
"    \"\"\"\n", | |
"\n", | |
"    node_metadata = dict()\n", | |
"\n", | |
"    for idx, row in tqdm(data.iterrows(), total=data.shape[0]):\n", | |
"        if row[col_a] in node_metadata:\n", | |
"            node_metadata[row[col_a]] |= {row[col_b]}\n", | |
"        else:\n", | |
"            node_metadata[row[col_a]] = {row[col_b]}\n", | |
"\n", | |
"    return node_metadata" | |
], | |
"execution_count": 62, | |
"outputs": [] | |
}, | |
{ | |
"cell_type": "code", | |
"metadata": { | |
"id": "hjcNcR-ECh9D", | |
"colab_type": "code", | |
"colab": {} | |
}, | |
"source": [ | |
"# Ensembl gene - UniProt\n", | |
"url = 'ftp://ftp.ensembl.org/pub/release-100/tsv/homo_sapiens/Homo_sapiens.GRCh38.100.uniprot.tsv.gz'\n", | |
"file_loc = gzipped_ftp_url_download(url, '/drive/My Drive/Colab Notebooks/data/')\n", | |
"\n", | |
"# read in data\n", | |
"ensembl_uniprot = pd.read_csv(file_loc, sep='\\t', header=0, compression='gzip')\n", | |
"ensembl_uniprot.head(n=5)\n", | |
"\n", | |
"# convert to dictionary\n", | |
"ensembl_uniprot_dict = convert_to_dict(ensembl_uniprot, 'gene_stable_id', 'xref')" | |
], | |
"execution_count": null, | |
"outputs": [] | |
}, | |
{ | |
"cell_type": "code", | |
"metadata": { | |
"id": "UISXUIiLCiGm", | |
"colab_type": "code", | |
"colab": {} | |
}, | |
"source": [ | |
"# GOA_Human Annotations - Gene Ontology Consortium\n", | |
"url= 'http://geneontology.org/gene-associations/goa_human.gaf.gz'\n", | |
"columns = ['DB', 'DB_Object_ID', 'DB_Object_Symbol', 'Qualifier', 'GO_ID', 'DB:Reference',\n", | |
" 'Evidence_Code', 'With (or) From', 'Aspect', 'DB_Object_Name', 'DB_Object_Synonym',\n", | |
" 'DB_Object_Type', 'Taxon', 'Date', 'Assigned_By', 'Annotation Extension', 'Gene Product Form ID']\n", | |
"\n", | |
"goa = pd.read_csv(url, sep='\\t', header=None, names=columns, compression='gzip', skiprows=32, low_memory=False)\n", | |
"goa.head(n=5)\n", | |
"\n", | |
"# convert to dictionary\n", | |
"goa_dict_GO = convert_to_dict(goa, 'DB_Object_ID', 'GO_ID')\n", | |
"goa_dict_GO_aspect = convert_to_dict(goa, 'GO_ID', 'Aspect')" | |
], | |
"execution_count": null, | |
"outputs": [] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"id": "0ent3pmVL5P3", | |
"colab_type": "text" | |
}, | |
"source": [ | |
"### Aggregate Node Data\n", | |
"Join all of the node data into a single file keyed by node." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"metadata": { | |
"id": "Tx8RgyyqMCH2", | |
"colab_type": "code", | |
"colab": {} | |
}, | |
"source": [ | |
"# combine results into single data structure\n", | |
"data = []\n", | |
"\n", | |
"for gene in tqdm(list(unique_genes)):\n", | |
"    if gene in ensembl_uniprot_dict:\n", | |
"        proteins = list(ensembl_uniprot_dict[gene])\n", | |
"        # uniprot id\n", | |
"        for protein in proteins:\n", | |
"            # get go annotations\n", | |
"            if protein in goa_dict_GO:\n", | |
"                for go in goa_dict_GO[protein]:\n", | |
"                    data += [[gene, protein, go, list(goa_dict_GO_aspect[go])[0]]]" | |
], | |
"execution_count": null, | |
"outputs": [] | |
}, | |
{ | |
"cell_type": "code", | |
"metadata": { | |
"id": "Bli2m3S9XLJI", | |
"colab_type": "code", | |
"colab": {} | |
}, | |
"source": [ | |
"# convert list to Pandas DataFrame\n", | |
"ensembl_gene_annotations = pd.DataFrame({'ensembl_gene_id': [x[0] for x in data],\n", | |
" 'uniprot_id': [x[1] for x in data],\n", | |
" 'go_id': [x[2] for x in data],\n", | |
" 'go_aspect': [x[3] for x in data]})" | |
], | |
"execution_count": 116, | |
"outputs": [] | |
}, | |
{ | |
"cell_type": "code", | |
"metadata": { | |
"id": "EakfFZj1UUbH", | |
"colab_type": "code", | |
"colab": {} | |
}, | |
"source": [ | |
"# save output\n", | |
"ensembl_gene_annotations.to_csv('/drive/My Drive/Colab Notebooks/data/GGI_Combined_HomoSapien_NodeAnnotations_04July2020.csv', sep='\\t', header=True, index=False)" | |
], | |
"execution_count": 118, | |
"outputs": [] | |
} | |
] | |
} |
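The symmetric conversion in STEP 2 can also be written without building intermediate Python lists. The following is a minimal sketch, assuming the edge list uses the `Gene_A`, `Gene_B`, and `Weight` column names from the notebook; the helper name `symmetrize_edges` and the toy identifiers are hypothetical and are not part of the original pipeline.

```python
import pandas as pd


def symmetrize_edges(edges: pd.DataFrame) -> pd.DataFrame:
    """Return the edges plus their reversed copies, with exact duplicate rows removed.

    Hypothetical helper; assumes columns named Gene_A, Gene_B, and Weight.
    """
    # swap the two gene columns to obtain the reversed direction of every edge
    reversed_edges = edges.rename(columns={'Gene_A': 'Gene_B', 'Gene_B': 'Gene_A'})
    # restore the original column order before stacking the two frames
    both = pd.concat([edges, reversed_edges[edges.columns]], ignore_index=True)
    return both.drop_duplicates().reset_index(drop=True)


# toy data with made-up identifiers (not real Ensembl IDs)
toy = pd.DataFrame({'Gene_A': ['ENSG_1', 'ENSG_1'],
                    'Gene_B': ['ENSG_2', 'ENSG_3'],
                    'Weight': [0.5, 0.25]})
toy_sym = symmetrize_edges(toy)
print(toy_sym)
```

On the full network this avoids zipping several million Python tuples, which may reduce memory use in Colab, although that has not been measured here.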
Unique Nodes (07/04/20): GGI_Combined_HomoSapien_UniqueNodes_04July2020.csv
Node Attributes (GOA Annotations - 07/04/20): [GGI_Combined_HomoSapien_NodeAnnotations_04July2020.csv](https://drive.google.com/file/d/1-3iiEoaZc0m4pWEW77Q-bf3EQXg4rh4n/view?usp=sharing)
Data Exported on 06/16/20 (tab-delimited):
GGI_Combined_HomoSapien_16June2020.csv
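The notebook's `convert_to_dict` function walks the DataFrame row by row with `iterrows`, which can be slow on large annotation files. A possible alternative, sketched below, builds the same key-to-set mapping with a pandas `groupby`. The function name `convert_to_dict_grouped` and the toy identifiers are hypothetical; one difference to be aware of is that `groupby` drops rows whose key is missing by default.

```python
import pandas as pd


def convert_to_dict_grouped(data: pd.DataFrame, col_a: str, col_b: str) -> dict:
    """Map each value in col_a to the set of col_b values found in the same rows.

    Hypothetical alternative to the notebook's convert_to_dict.
    """
    # group rows by the key column and collapse each group's values into a set
    return data.groupby(col_a)[col_b].apply(set).to_dict()


# toy data with made-up identifiers (not real Ensembl or UniProt IDs)
toy_map = pd.DataFrame({'gene_stable_id': ['ENSG_1', 'ENSG_1', 'ENSG_2'],
                        'xref': ['P1', 'P2', 'P3']})
toy_dict = convert_to_dict_grouped(toy_map, 'gene_stable_id', 'xref')
print(toy_dict)
```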