Last active
September 5, 2018 12:38
{ | |
"cells": [ | |
{ | |
"cell_type": "code", | |
"execution_count": 1, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"Populating the interactive namespace from numpy and matplotlib\n" | |
] | |
} | |
], | |
"source": [ | |
"%pylab inline\n", | |
"import numpy as np\n", | |
"import pandas as pd\n", | |
"from collections import defaultdict" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"## The dataset of JS UASTs extracted from PGA repos" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"CSV File" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 19, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"# https://drive.google.com/open?id=1es02UUFUWlR9k4hswCSQCAsSOqjma06y" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"The dataset contains 62.6G of parquet files, one parquet file per repository.\n", | |
"Each parquet file holds a table with 3 columns:\n", | |
"1. `path` to the JS file\n", | |
"2. `content` of the file as a binary string\n", | |
"3. `uast` of the file, serialized and not yet parsed; decode it with the `Node.FromString()` method from bblfsh" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 10, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/html": [ | |
"<div>\n", | |
"<style scoped>\n", | |
" .dataframe tbody tr th:only-of-type {\n", | |
" vertical-align: middle;\n", | |
" }\n", | |
"\n", | |
" .dataframe tbody tr th {\n", | |
" vertical-align: top;\n", | |
" }\n", | |
"\n", | |
" .dataframe thead th {\n", | |
" text-align: right;\n", | |
" }\n", | |
"</style>\n", | |
"<table border=\"1\" class=\"dataframe\">\n", | |
" <thead>\n", | |
" <tr style=\"text-align: right;\">\n", | |
" <th></th>\n", | |
" <th>URL</th>\n", | |
" <th>PARQUET_FILENAME</th>\n", | |
" <th>JS_FILE_COUNT</th>\n", | |
" <th>JS_LINE_COUNT</th>\n", | |
" </tr>\n", | |
" </thead>\n", | |
" <tbody>\n", | |
" <tr>\n", | |
" <th>0</th>\n", | |
" <td>github.com/LearningLocker/learninglocker</td>\n", | |
" <td>learninglocker.parquet</td>\n", | |
" <td>736</td>\n", | |
" <td>48952</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>1</th>\n", | |
" <td>github.com/maxogden/gut</td>\n", | |
" <td>gut.parquet</td>\n", | |
" <td>169</td>\n", | |
" <td>21447</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>2</th>\n", | |
" <td>github.com/syzoj/syzoj</td>\n", | |
" <td>syzoj.parquet</td>\n", | |
" <td>421</td>\n", | |
" <td>58760</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>3</th>\n", | |
" <td>github.com/etaoux/brix</td>\n", | |
" <td>brix.parquet</td>\n", | |
" <td>346</td>\n", | |
" <td>58994</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>4</th>\n", | |
" <td>github.com/jstacoder/flask-cms</td>\n", | |
" <td>flask-cms.parquet</td>\n", | |
" <td>670</td>\n", | |
" <td>191701</td>\n", | |
" </tr>\n", | |
" </tbody>\n", | |
"</table>\n", | |
"</div>" | |
], | |
"text/plain": [ | |
" URL PARQUET_FILENAME \\\n", | |
"0 github.com/LearningLocker/learninglocker learninglocker.parquet \n", | |
"1 github.com/maxogden/gut gut.parquet \n", | |
"2 github.com/syzoj/syzoj syzoj.parquet \n", | |
"3 github.com/etaoux/brix brix.parquet \n", | |
"4 github.com/jstacoder/flask-cms flask-cms.parquet \n", | |
"\n", | |
" JS_FILE_COUNT JS_LINE_COUNT \n", | |
"0 736 48952 \n", | |
"1 169 21447 \n", | |
"2 421 58760 \n", | |
"3 346 58994 \n", | |
"4 670 191701 " | |
] | |
}, | |
"execution_count": 10, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"res = pd.read_csv(\"PGA_JS_repos_uasts.csv\")\n", | |
"res.head()" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 13, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"Number of repos in PGA with more than 200 JS files : 2825\n" | |
] | |
} | |
], | |
"source": [ | |
"print(\"Number of repos in PGA with more than 200 JS files :\", len(res))" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 16, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"Average number of JS files : 708\n", | |
"Max number of JS files : 27933\n" | |
] | |
} | |
], | |
"source": [ | |
"mean_file_count = res[\"JS_FILE_COUNT\"].sum() / len(res)\n", | |
"max_file_count = res[\"JS_FILE_COUNT\"].max()\n", | |
"\n", | |
"print(\"Average number of JS files : %d\" % (mean_file_count))\n", | |
"print(\"Max number of JS files : %d\" % (max_file_count))" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 17, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"Average number of JS LOC per repo : 124328\n", | |
"Max number of JS LOC per repo : 6523060\n" | |
] | |
} | |
], | |
"source": [ | |
"mean_line_count = res[\"JS_LINE_COUNT\"].sum() / len(res)\n", | |
"max_line_count = res[\"JS_LINE_COUNT\"].max()\n", | |
"\n", | |
"print(\"Average number of JS LOC per repo : %d\" % (mean_line_count))\n", | |
"print(\"Max number of JS LOC per repo : %d\" % (max_line_count))" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": {}, | |
"outputs": [], | |
"source": [] | |
} | |
], | |
"metadata": { | |
"kernelspec": { | |
"display_name": "Python 3", | |
"language": "python", | |
"name": "python3" | |
}, | |
"language_info": { | |
"codemirror_mode": { | |
"name": "ipython", | |
"version": 3 | |
}, | |
"file_extension": ".py", | |
"mimetype": "text/x-python", | |
"name": "python", | |
"nbconvert_exporter": "python", | |
"pygments_lexer": "ipython3", | |
"version": "3.6.3" | |
} | |
}, | |
"nbformat": 4, | |
"nbformat_minor": 2 | |
} |
You can find the initial distribution, compiled from the PGA CSV file, in this gist: https://gist.github.com/warenlg/44bd576637ee161929a3f7e1a88554f5
However, you'll see that the numbers don't match, for two reasons:
- The step that preprocesses all of PGA into parquet files misses some repositories.
- At the time it was run, the preprocess command from src-d/ml did not include `lang` in the output parquet files, so I had to filter by file extension, and that missed a lot of files.
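To illustrate the second point, here is a minimal sketch of extension-based filtering. The paths are made up, the `.js`-only filter is an assumption about what the filtering looked like, and `has_js_extension` is a hypothetical helper; the actual filter used to build the dataset is not shown here. It only shows how files such as JSX sources or extensionless scripts could slip through when no `lang` column is available.

```python
from pathlib import PurePosixPath

# Hypothetical repository paths, for illustration only.
paths = ["src/index.js", "src/App.jsx", "bin/cli", "lib/util.mjs", "README.md"]


def has_js_extension(path, extensions=(".js",)):
    """Return True if the file suffix (case-insensitive) is in `extensions`."""
    return PurePosixPath(path).suffix.lower() in extensions


# Only files ending in ".js" survive; any JavaScript stored under another
# suffix, or with no suffix at all, is dropped by this kind of filter.
kept = [p for p in paths if has_js_extension(p)]
dropped = [p for p in paths if not has_js_extension(p)]

print("kept:", kept)
print("dropped:", dropped)
```

A language label per file, as a `lang` column would provide, avoids guessing from the file name.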
I put the shareable link to the CSV file at the beginning of the notebook. Here it is again, just in case: https://drive.google.com/open?id=1es02UUFUWlR9k4hswCSQCAsSOqjma06y
Hi, thanks for the analysis!
Is it possible to add the distribution of repositories by JS file count?
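One way to get such a distribution is to bin repositories by `JS_FILE_COUNT`. The sketch below is only an illustration: it uses the five rows shown in the notebook's `res.head()` output as inline sample data, assumes the CSV is comma-separated with the column names displayed there, and uses a bin width of 200 picked arbitrarily. `file_count_histogram` is a hypothetical helper, not part of the notebook.

```python
import csv
import io
from collections import Counter

# Five sample rows copied from the res.head() output in the notebook.
SAMPLE_CSV = """URL,PARQUET_FILENAME,JS_FILE_COUNT,JS_LINE_COUNT
github.com/LearningLocker/learninglocker,learninglocker.parquet,736,48952
github.com/maxogden/gut,gut.parquet,169,21447
github.com/syzoj/syzoj,syzoj.parquet,421,58760
github.com/etaoux/brix,brix.parquet,346,58994
github.com/jstacoder/flask-cms,flask-cms.parquet,670,191701
"""

BIN_WIDTH = 200  # arbitrary bin width, chosen for illustration only


def file_count_histogram(csv_file, bin_width=BIN_WIDTH):
    """Count repositories per JS_FILE_COUNT bin; keys are bin lower bounds."""
    bins = Counter()
    for row in csv.DictReader(csv_file):
        count = int(row["JS_FILE_COUNT"])
        bins[(count // bin_width) * bin_width] += 1
    return dict(sorted(bins.items()))


histogram = file_count_histogram(io.StringIO(SAMPLE_CSV))

# Print a small text histogram: one '#' per repository in each bin.
for lower, repos in histogram.items():
    print(f"{lower:>6}-{lower + BIN_WIDTH - 1:<6} {'#' * repos} ({repos})")
```

Passing `open("PGA_JS_repos_uasts.csv", newline="")` instead of the inline sample would bin the full dataset; with a maximum of 27933 files in one repository, logarithmic bins may read better than fixed-width ones.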