Created
August 6, 2020 01:05
arXiv Citation Recommendations via Collaborative Filtering
{
  "nbformat": 4,
  "nbformat_minor": 0,
  "metadata": {
    "colab": {
      "name": "ArxivCitationRecommender.ipynb",
      "provenance": []
    },
    "kernelspec": {
      "name": "python3",
      "display_name": "Python 3"
    },
    "accelerator": "GPU"
  },
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "FgiPvjubRjfz",
        "colab_type": "text"
      },
      "source": [
        "## Citation Recommendations via Collaborative Filtering\n",
        "\n",
        "Download the [internal citation data](https://www.kaggle.com/Cornell-University/arxiv?select=internal-citations.json) from the Kaggle arXiv dataset and upload it to this notebook session.\n",
        "\n",
        "*Thanks to [this fastai.collab tutorial](https://towardsdatascience.com/collaborative-filtering-using-fastai-a2ec5a2a4049)*"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "YQLo0E9TRtgA",
        "colab_type": "text"
      },
      "source": [
        "### Imports"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "A9BP2qm9LHWX",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "import fastai\n",
        "from google.colab import files\n",
        "import json\n",
        "import pandas as pd\n",
        "from tqdm import tqdm"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "vLeyIORDRzbG",
        "colab_type": "text"
      },
      "source": [
        "### Upload/unzip citation data"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "hblNi-B1LJRw",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "files.upload()"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "qAQjUy4-LUXl",
        "colab_type": "code",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 51
        },
        "outputId": "4a60f36a-f9bf-4e4c-ae59-81da03acae0b"
      },
      "source": [
        "!unzip 612177_1135627_compressed_internal-citations.json.zip"
      ],
      "execution_count": null,
      "outputs": [
        {
          "output_type": "stream",
          "text": [
            "Archive: 612177_1135627_compressed_internal-citations.json.zip\n",
            "replace internal-citations.json? [y]es, [n]o, [A]ll, [N]one, [r]ename: "
          ],
          "name": "stdout"
        }
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "UyXB-TztR67B",
        "colab_type": "text"
      },
      "source": [
        "### Get citation JSON"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "q0fxHmpoLhG3",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "with open(\"internal-citations.json\", \"r\") as fp:\n",
        "    citations = json.load(fp)\n",
        "\n",
        "print(list(citations.keys())[:10])"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "z-bWdHr9MHAB",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "for paper, list_of_citations in list(citations.items())[:10]:\n",
        "    print(\"{}: {}\".format(paper, list_of_citations))"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "M0zzxz1VR9ql",
        "colab_type": "text"
      },
      "source": [
        "### Generate citation DataFrame"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "6P6Vo0mqMQYF",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "# Flatten the {paper: [cited papers]} mapping into (paper, citation) pairs\n",
        "citing_papers = []\n",
        "citees = []\n",
        "for paper, list_of_citations in tqdm(citations.items()):\n",
        "    for citation in list_of_citations:\n",
        "        citing_papers.append(paper)\n",
        "        citees.append(citation)\n",
        "\n",
        "# Every observed citation is a positive example, so the target is a constant 1.0\n",
        "citation_df = pd.DataFrame({'paperID': citing_papers, 'citationID': citees, 'target': 1.0})"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "3gORPN8aNeTk",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "citation_df"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "gRGwYtRmSDGj",
        "colab_type": "text"
      },
      "source": [
        "### Collaborative filtering via fastai"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "VnSnzTqqNnd_",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "# CollabDataBunch and collab_learner below are the fastai v1 API\n",
        "from fastai.collab import *\n",
        "from fastai.tabular import *"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "ESwlE3vJNv-W",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "# Hold out 20% of the citation pairs for validation\n",
        "data = CollabDataBunch.from_df(citation_df, seed=42, valid_pct=0.2)"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "ylhKS5PAN39l",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "# Constrain predictions to the range of the targets\n",
        "y_range = [0.0, 1.0]\n",
        "learn = collab_learner(data, n_factors=50, y_range=y_range, wd=1e-1)"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "8uUDFUfZOHHy",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "# Train for 5 epochs with a maximum learning rate of 5e-3\n",
        "learn.fit_one_cycle(5, 5e-3)"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "7ZnyuIYiOPRx",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        ""
      ],
      "execution_count": null,
      "outputs": []
    }
  ]
}
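
The notebook stops right after training, so it may help to see how a dot-product collaborative filtering model turns learned factors into recommendations. The sketch below is only an illustration in plain Python. It assumes the scoring rule used by fastai v1's default collab model: the dot product of the two embeddings plus two biases, passed through a sigmoid and scaled into `y_range`. The function names `score` and `recommend`, the candidate IDs, and all the numbers are hypothetical. Nothing here is read from the trained `learn` object.

```python
import math


def score(paper_vec, paper_bias, cand_vec, cand_bias, y_range=(0.0, 1.0)):
    """Dot product plus both biases, squashed by a sigmoid into y_range."""
    raw = sum(p * c for p, c in zip(paper_vec, cand_vec)) + paper_bias + cand_bias
    low, high = y_range
    return low + (high - low) / (1.0 + math.exp(-raw))


def recommend(paper_vec, paper_bias, candidates, k=2):
    """Return the k candidate IDs with the highest predicted score."""
    scored = [
        (score(paper_vec, paper_bias, vec, bias), cand_id)
        for cand_id, (vec, bias) in candidates.items()
    ]
    scored.sort(reverse=True)  # highest score first
    return [cand_id for _, cand_id in scored[:k]]


# Toy, hand-written factors (hypothetical values, not from a trained model)
paper_vec, paper_bias = [1.0, 0.0], 0.0
candidates = {
    "A": ([2.0, 0.0], 0.0),   # raw score 2.0
    "B": ([-1.0, 0.5], 0.0),  # raw score -1.0
    "C": ([0.5, 3.0], 0.1),   # raw score 0.6
}

print(recommend(paper_vec, paper_bias, candidates, k=2))  # prints ['A', 'C']
```

In practice the vectors and biases would come from the trained model rather than being written by hand, and papers the query paper already cites would typically be filtered out before ranking.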