@warenlg
Last active September 5, 2018 12:38
{
"cells": [
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Populating the interactive namespace from numpy and matplotlib\n"
]
}
],
"source": [
"%pylab inline\n",
"import numpy as np\n",
"import pandas as pd\n",
"from collections import defaultdict"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## The dataset of JS UASTs extracted from PGA repos"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"CSV File"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {},
"outputs": [],
"source": [
"# Shareable link to the CSV file: https://drive.google.com/open?id=1es02UUFUWlR9k4hswCSQCAsSOqjma06y"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The dataset contains 62.6G of parquet files, one per repository.\n",
"The table in each parquet file contains 3 columns:\n",
"1. `path` to the JS file\n",
"2. `content` of the file as a binary string\n",
"3. `uast` of the file, serialized rather than parsed; it needs the `Node.FromString()` method from bblfsh to be decoded"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>URL</th>\n",
" <th>PARQUET_FILENAME</th>\n",
" <th>JS_FILE_COUNT</th>\n",
" <th>JS_LINE_COUNT</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>github.com/LearningLocker/learninglocker</td>\n",
" <td>learninglocker.parquet</td>\n",
" <td>736</td>\n",
" <td>48952</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>github.com/maxogden/gut</td>\n",
" <td>gut.parquet</td>\n",
" <td>169</td>\n",
" <td>21447</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>github.com/syzoj/syzoj</td>\n",
" <td>syzoj.parquet</td>\n",
" <td>421</td>\n",
" <td>58760</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>github.com/etaoux/brix</td>\n",
" <td>brix.parquet</td>\n",
" <td>346</td>\n",
" <td>58994</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>github.com/jstacoder/flask-cms</td>\n",
" <td>flask-cms.parquet</td>\n",
" <td>670</td>\n",
" <td>191701</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" URL PARQUET_FILENAME \\\n",
"0 github.com/LearningLocker/learninglocker learninglocker.parquet \n",
"1 github.com/maxogden/gut gut.parquet \n",
"2 github.com/syzoj/syzoj syzoj.parquet \n",
"3 github.com/etaoux/brix brix.parquet \n",
"4 github.com/jstacoder/flask-cms flask-cms.parquet \n",
"\n",
" JS_FILE_COUNT JS_LINE_COUNT \n",
"0 736 48952 \n",
"1 169 21447 \n",
"2 421 58760 \n",
"3 346 58994 \n",
"4 670 191701 "
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"res = pd.read_csv(\"PGA_JS_repos_uasts.csv\")\n",
"res.head()"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Number of repos in PGA with more than 200 JS files : 2825\n"
]
}
],
"source": [
"print(\"Number of repos in PGA with more than 200 JS files :\", len(res))"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Average number of JS files : 708\n",
"Max number of JS files : 27933\n"
]
}
],
"source": [
"mean_file_count = res[\"JS_FILE_COUNT\"].sum() / len(res)\n",
"max_file_count = res[\"JS_FILE_COUNT\"].max()\n",
"\n",
"print(\"Average number of JS files : %d\" % (mean_file_count))\n",
"print(\"Max number of JS files : %d\" % (max_file_count))"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Average number of JS LOC per repo : 124328\n",
"Max number of JS LOC per repo : 6523060\n"
]
}
],
"source": [
"mean_line_count = res[\"JS_LINE_COUNT\"].sum() / len(res)\n",
"max_line_count = res[\"JS_LINE_COUNT\"].max()\n",
"\n",
"print(\"Average number of JS LOC per repo : %d\" % (mean_line_count))\n",
"print(\"Max number of JS LOC per repo : %d\" % (max_line_count))"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.3"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
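The per-repository counts in the CSV loaded above can also be bucketed to see how repositories are distributed by `JS_FILE_COUNT`. The following is a minimal sketch using only the standard library, not the notebook's own code: the bucket edges and the helper names `bucket_label` and `file_count_distribution` are assumptions made for illustration, and the inline data is just the five rows shown by `res.head()`.

```python
import csv
import io
from collections import Counter

# The five sample rows displayed by res.head() in the notebook. For the real
# analysis, pass open("PGA_JS_repos_uasts.csv") instead of this string.
SAMPLE_CSV = """URL,PARQUET_FILENAME,JS_FILE_COUNT,JS_LINE_COUNT
github.com/LearningLocker/learninglocker,learninglocker.parquet,736,48952
github.com/maxogden/gut,gut.parquet,169,21447
github.com/syzoj/syzoj,syzoj.parquet,421,58760
github.com/etaoux/brix,brix.parquet,346,58994
github.com/jstacoder/flask-cms,flask-cms.parquet,670,191701
"""

# Hypothetical bucket upper bounds (inclusive); adjust to the data at hand.
BUCKET_EDGES = [200, 500, 1000, 5000]


def bucket_label(count, edges=BUCKET_EDGES):
    """Return the label of the first bucket whose upper edge is >= count."""
    for upper in edges:
        if count <= upper:
            return "<=%d" % upper
    return ">%d" % edges[-1]


def file_count_distribution(csv_file):
    """Count repositories per JS_FILE_COUNT bucket."""
    reader = csv.DictReader(csv_file)
    return Counter(bucket_label(int(row["JS_FILE_COUNT"])) for row in reader)


distribution = file_count_distribution(io.StringIO(SAMPLE_CSV))

# Print buckets in edge order; a Counter returns 0 for empty buckets.
for upper in BUCKET_EDGES:
    label = "<=%d" % upper
    print(label, distribution[label])
print(">%d" % BUCKET_EDGES[-1], distribution[">%d" % BUCKET_EDGES[-1]])
```

With a maximum of 27933 JS files against an average of 708, the counts are heavily skewed, so wider buckets at the top end (or logarithmic edges) are probably easier to read than uniform ones.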
@EgorBu

EgorBu commented Sep 5, 2018

Hi, thanks for the analysis!
Is it possible to add the distribution of repositories by JS file count?

@warenlg
Author

warenlg commented Sep 5, 2018

You can find the initial distribution compiled from the PGA CSV file in this gist https://gist.github.com/warenlg/44bd576637ee161929a3f7e1a88554f5

However, you'll see that the numbers don't match, for two reasons:

  1. The step that preprocesses all of PGA into parquet files misses some repositories.
  2. At the time it was run, the preprocess command from src-d/ml did not include `lang` in the output parquet files, so I had to filter by file extension, and that missed a lot of files.
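To illustrate why filtering by extension loses files, here is a minimal sketch. It is not the filter that was actually run: the extension set, the helper name `is_js_by_extension`, and the example paths are all assumptions chosen for illustration.

```python
import os

# Assumed extension set; the set actually used for the dataset is not stated.
JS_EXTENSIONS = {".js"}


def is_js_by_extension(path, extensions=JS_EXTENSIONS):
    """Guess whether a path is JavaScript from its file extension alone."""
    _, ext = os.path.splitext(path)
    return ext.lower() in extensions


# Hypothetical paths, for illustration only.
paths = ["src/index.js", "src/App.jsx", "lib/module.mjs", "bin/cli", "README.md"]

kept = [p for p in paths if is_js_by_extension(p)]
dropped = [p for p in paths if not is_js_by_extension(p)]

print("kept:", kept)
print("dropped:", dropped)
```

Files such as `src/App.jsx`, `lib/module.mjs`, or an extensionless script like `bin/cli` are dropped here, although a language detector might well classify them as JavaScript. A `lang` column in the parquet files would avoid that guesswork.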

@warenlg
Author

warenlg commented Sep 5, 2018

I put the shareable link to the CSV file at the beginning of the notebook. Just in case, here it is: https://drive.google.com/open?id=1es02UUFUWlR9k4hswCSQCAsSOqjma06y
