@stephenturner
Created December 30, 2024 14:36
Get information about human genes from RefSeq
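library(tidyverse)
# janitor and nanoparquet must also be installed; they are called below via `::`

# Get gene summary info (all taxa), then keep human (NCBI taxonomy ID 9606)
gs_orig <- read_tsv("https://ftp.ncbi.nlm.nih.gov/gene/DATA/gene_summary.gz")
gs <- gs_orig |>
  janitor::clean_names() |>
  # clean_names() turns the "#tax_id" header into "number_tax_id"; rename it
  set_names(str_replace, "number_tax_id", "tax_id") |>
  filter(tax_id == 9606) |>
  distinct()
gs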

# Get gene information for human
gi_orig <- read_tsv("https://ftp.ncbi.nlm.nih.gov/gene/DATA/GENE_INFO/Mammalia/Homo_sapiens.gene_info.gz")
gi <- gi_orig |>
  janitor::clean_names() |>
  set_names(str_replace, "number_tax_id", "tax_id") |>
  filter(tax_id == 9606) |>
  distinct()
gi
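
# Get gene-to-Gene Ontology (GO) annotations and collapse the human
# biological process terms into one string per gene
go_orig <- read_tsv("https://ftp.ncbi.nlm.nih.gov/gene/DATA/gene2go.gz")
go <- go_orig |>
  janitor::clean_names() |>
  set_names(str_replace, "number_tax_id", "tax_id") |>
  filter(tax_id == 9606, category == "Process") |>
  # The same gene/term pair can appear on several rows (for example with
  # different evidence codes), so de-duplicate on the pair before collapsing
  distinct(gene_id, go_term) |>
  summarize(biological_process = paste(go_term, collapse = "; "), .by = gene_id)
go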
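
# Join them all together
g <- gi |>
  filter(type_of_gene == "protein-coding") |>
  # inner_join drops protein-coding genes that have no summary
  inner_join(gs, by = "gene_id") |>
  # left_join keeps genes without GO process terms (biological_process is NA)
  left_join(go, by = "gene_id") |>
  select(symbol, description, gene_info = summary, biological_process) |>
  distinct()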
# Write out files
g |> write_csv("hs_gene_info.csv")
g |> nanoparquet::write_parquet("hs_gene_info.parquet")
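
# Illustrative sketch, not part of the original gist: a collapse-and-join along
# the lines of the script above, run on small made-up tables so the effect of
# each join is easy to see. All toy_* objects and their values are hypothetical.
toy_gi <- tibble(
  gene_id      = c(1, 2, 3, 4),
  symbol       = c("AAA", "BBB", "CCC", "DDD"),
  type_of_gene = c("protein-coding", "protein-coding", "ncRNA", "protein-coding")
)
toy_gs <- tibble(
  gene_id = c(1, 2),
  summary = c("Summary for AAA.", "Summary for BBB.")
)
toy_go <- tibble(
  gene_id = c(1, 1, 1),
  go_term = c("apoptotic process", "apoptotic process", "DNA repair")
) |>
  # De-duplicate gene/term pairs before collapsing them into one string
  distinct(gene_id, go_term) |>
  summarize(biological_process = paste(go_term, collapse = "; "), .by = gene_id)
toy_g <- toy_gi |>
  filter(type_of_gene == "protein-coding") |>  # drops CCC (ncRNA)
  inner_join(toy_gs, by = "gene_id") |>        # drops DDD (no summary)
  left_join(toy_go, by = "gene_id")            # keeps BBB with NA GO terms
# toy_g has two rows: AAA ("apoptotic process; DNA repair") and BBB (NA)
toy_g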