@raimonizard
Last active December 9, 2024 10:19
machine_learning_intro.ipynb
{
"nbformat": 4,
"nbformat_minor": 0,
"metadata": {
"colab": {
"provenance": [],
"toc_visible": true,
"include_colab_link": true
},
"kernelspec": {
"name": "python3",
"display_name": "Python 3"
},
"language_info": {
"name": "python"
}
},
"cells": [
{
"cell_type": "markdown",
"metadata": {
"id": "view-in-github",
"colab_type": "text"
},
"source": [
"<a href=\"https://colab.research.google.com/gist/raimonizard/60abb6bdf76c8b3157e9632b212f4977/machine_learning_intro.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
]
},
{
"cell_type": "markdown",
"source": [
"# About"
],
"metadata": {
"id": "l1uWVK2uWpLS"
}
},
{
"cell_type": "markdown",
"source": [
"An example for working with the basics of **Machine Learning (ML)** in Python.\n"
],
"metadata": {
"id": "UtWQe9dgRrcd"
}
},
{
"cell_type": "markdown",
"source": [
"# Theory"
],
"metadata": {
"id": "BmKwwiIsl6mN"
}
},
{
"cell_type": "markdown",
"source": [
"## Machine Learning"
],
"metadata": {
"id": "Kqtnfo5B4c49"
}
},
{
"cell_type": "markdown",
"source": [
"**Machine Learning** means making **the computer learn** by studying data and statistics.\n",
"\n",
"Machine Learning is a step in the direction of **artificial intelligence** (AI).\n",
"\n",
"A Machine Learning program analyses data and learns to **predict the outcome**."
],
"metadata": {
"id": "0SOGlkSA4hiN"
}
},
{
"cell_type": "markdown",
"source": [
"## Data Set"
],
"metadata": {
"id": "4njstV_h5Ob7"
}
},
{
"cell_type": "markdown",
"source": [
"In the mind of a computer, a data set is any collection of data. It can be anything from an array to a complete database."
],
"metadata": {
"id": "0MsgaooA5R8m"
}
},
{
"cell_type": "markdown",
"source": [
"## Mean, Median and Mode"
],
"metadata": {
"id": "O2j0DD4y5XwP"
}
},
{
"cell_type": "markdown",
"source": [
"In Machine Learning (and in mathematics) there are often three values that interest us:\n",
"\n",
"- **Mean**: The average value.\n",
"- **Median**: The midpoint value after sorting the array. If there are two numbers in the middle, divide the sum of those numbers by two.\n",
"- **Mode**: The most common value (the one that appears most often)."
],
"metadata": {
"id": "fLwheQwM5Z_M"
}
},
{
"cell_type": "code",
"source": [
"import numpy as np\n",
"from scipy import stats as st\n",
"import random\n",
"\n",
"# Use random.choices to select k elements from a range; values may repeat.\n",
"# range(1, 220, 1) yields the integers from 1 to 219 in steps of 1\n",
"# (the stop value 220 is excluded), so the values are in (1, 2, 3, ..., 219).\n",
"# Source: https://pynative.com/python-random-sample/\n",
"speed2 = random.choices(range(1, 220, 1), k = 15)\n",
"\n",
"print(speed2)\n",
"\n",
"mean = np.mean(speed2)\n",
"median = np.median(speed2)\n",
"\n",
"# Note that for mode() we use scipy.stats library instead of numpy:\n",
"mode = st.mode(speed2)\n",
"\n",
"print('Mean: ', mean, ' Median: ', median, ' Mode: ', mode)"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "Z8Wutgwc5pW3",
"outputId": "98dada64-0ed8-4985-ac87-a21967d87450"
},
"execution_count": null,
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"[169, 2, 7, 11, 73, 130, 68, 75, 8, 65, 51, 139, 145, 55, 194]\n",
"Mean: 79.46666666666667 Median: 68.0 Mode: ModeResult(mode=2, count=1)\n"
]
}
]
},
{
"cell_type": "markdown",
"source": [
"## Standard Deviation σ"
],
"metadata": {
"id": "EbNk337K_kli"
}
},
{
"cell_type": "markdown",
"source": [
"**Standard deviation** is a number that describes **how spread out the values are**.\n",
"\n",
"The standard deviation is expressed in the same unit as the values in the data set.\n",
"\n",
"- A **low standard deviation** means that **most of the numbers** are **close to** the **mean** (average) value.\n",
"\n",
"- A **high standard deviation** means that the **values are spread out** over a wider range."
],
"metadata": {
"id": "kK8DZtm9_oVA"
}
},
{
"cell_type": "code",
"source": [
"import pandas as pd\n",
"speed = random.choices(range(0, 220, 1), k = 400)\n",
"print(pd.DataFrame({ 'speed' : speed }))\n",
"\n",
"mean = np.mean(speed)\n",
"print(mean)\n",
"\n",
"standard_deviation = np.std(speed)\n",
"print(standard_deviation)"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "6ZHj_UuQAMCp",
"outputId": "ac385e11-b27d-4033-e14e-bd617b0488f0"
},
"execution_count": null,
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
" speed\n",
"0 13\n",
"1 21\n",
"2 36\n",
"3 147\n",
"4 37\n",
".. ...\n",
"395 40\n",
"396 149\n",
"397 100\n",
"398 188\n",
"399 116\n",
"\n",
"[400 rows x 1 columns]\n",
"107.84\n",
"61.76649091538226\n"
]
}
]
},
{
"cell_type": "markdown",
"source": [
"## Variance σ^2"
],
"metadata": {
"id": "2zoiwAHzCCH2"
}
},
{
"cell_type": "markdown",
"source": [
"**Variance** is **another number** that **indicates how spread out** the values are.\n",
"\n",
"In fact, if you take the square root of the variance, you get the standard deviation!\n",
"\n",
"Or the other way around, if you multiply the standard deviation by itself, you get the variance!"
],
"metadata": {
"id": "OG6BuunUCD4I"
}
},
{
"cell_type": "markdown",
"source": [
"To calculate the variance, proceed as follows:\n",
"\n",
"1. Find the mean:\n",
"\n",
"(32+111+138+28+59+77+97) / 7 ≈ 77.4\n",
"\n",
"2. For each value: find the difference from the mean:\n",
"\n",
" 32 - 77.4 = -45.4\n",
"\n",
" 111 - 77.4 = 33.6\n",
"\n",
" 138 - 77.4 = 60.6\n",
"\n",
" 28 - 77.4 = -49.4\n",
"\n",
" 59 - 77.4 = -18.4\n",
"\n",
" 77 - 77.4 = -0.4\n",
"\n",
" 97 - 77.4 = 19.6\n",
"\n",
"3. For each difference: find the square value:\n",
"\n",
" (-45.4)^2 = 2061.16\n",
"\n",
" (33.6)^2 = 1128.96\n",
"\n",
" (60.6)^2 = 3672.36\n",
"\n",
" (-49.4)^2 = 2440.36\n",
"\n",
" (-18.4)^2 = 338.56\n",
"\n",
" (-0.4)^2 = 0.16\n",
"\n",
" (19.6)^2 = 384.16\n",
"\n",
"4. The variance is the average of these squared differences:\n",
"\n",
" (2061.16 + 1128.96 + 3672.36 + 2440.36 + 338.56 + 0.16 + 384.16) / 7 ≈ 1432.2"
],
"metadata": {
"id": "Ko6JW-pDCdnr"
}
},
{
"cell_type": "code",
"source": [
"import numpy\n",
"\n",
"speed = [32,111,138,28,59,77,97]\n",
"\n",
"x = numpy.var(speed)\n",
"\n",
"print(x)"
],
"metadata": {
"id": "hEGW-kIdDCPf",
"colab": {
"base_uri": "https://localhost:8080/"
},
"outputId": "8b5e98ed-7dff-4aa1-a73f-f03b83a16a86"
},
"execution_count": null,
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"1432.2448979591834\n"
]
}
]
},
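{
"cell_type": "markdown",
"source": [
"As a quick check of the relation between variance and standard deviation, the following sketch reuses the same seven speed values and compares the square root of the variance with the standard deviation. It relies on NumPy's defaults, which compute the population variance and standard deviation (dividing by n)."
],
"metadata": {}
},
{
"cell_type": "code",
"source": [
"import numpy as np\n",
"\n",
"speed = [32, 111, 138, 28, 59, 77, 97]\n",
"\n",
"variance = np.var(speed)\n",
"standard_deviation = np.std(speed)\n",
"\n",
"# The square root of the variance should match the standard deviation\n",
"print(np.sqrt(variance))\n",
"print(standard_deviation)\n",
"print(np.isclose(np.sqrt(variance), standard_deviation))"
],
"metadata": {},
"execution_count": null,
"outputs": []
},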
{
"cell_type": "markdown",
"source": [
"## Percentiles"
],
"metadata": {
"id": "595_JdAkDp4n"
}
},
{
"cell_type": "markdown",
"source": [
"A percentile is the value below which a given percentage of the values in a data set fall."
],
"metadata": {
"id": "ECrBHph4DsAU"
}
},
{
"cell_type": "code",
"source": [
"ages = random.choices(range(0, 100, 1), k = 100)\n",
"print(ages)\n",
"\n",
"# Sort the ages in ascending order\n",
"ages.sort()\n",
"print(ages)\n",
"\n",
"# Find the age at or below which 50% of the population falls\n",
"print(np.percentile(ages, 50))\n",
"\n",
"# Find the age at or below which the youngest 10% of the population falls:\n",
"print(np.percentile(ages, 10))"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "eKmUHYveD193",
"outputId": "496fd273-aac1-4270-b9c2-a633f984e36d"
},
"execution_count": null,
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"[51, 55, 3, 35, 78, 89, 41, 18, 23, 37, 75, 31, 29, 20, 98, 97, 11, 91, 92, 69, 16, 50, 77, 45, 23, 58, 58, 99, 49, 49, 55, 82, 98, 59, 38, 48, 39, 17, 47, 94, 78, 26, 57, 66, 39, 5, 9, 20, 98, 45, 89, 60, 2, 78, 11, 99, 87, 12, 21, 37, 71, 84, 22, 85, 75, 72, 28, 36, 25, 0, 45, 98, 90, 8, 25, 67, 46, 89, 29, 95, 6, 34, 78, 75, 60, 19, 99, 17, 2, 76, 31, 36, 65, 1, 71, 63, 11, 90, 52, 49]\n",
"[0, 1, 2, 2, 3, 5, 6, 8, 9, 11, 11, 11, 12, 16, 17, 17, 18, 19, 20, 20, 21, 22, 23, 23, 25, 25, 26, 28, 29, 29, 31, 31, 34, 35, 36, 36, 37, 37, 38, 39, 39, 41, 45, 45, 45, 46, 47, 48, 49, 49, 49, 50, 51, 52, 55, 55, 57, 58, 58, 59, 60, 60, 63, 65, 66, 67, 69, 71, 71, 72, 75, 75, 75, 76, 77, 78, 78, 78, 78, 82, 84, 85, 87, 89, 89, 89, 90, 90, 91, 92, 94, 95, 97, 98, 98, 98, 98, 99, 99, 99]\n",
"49.0\n",
"11.0\n"
]
}
]
},
{
"cell_type": "markdown",
"source": [
"## Data Distribution: Histogram"
],
"metadata": {
"id": "HBzCsi8JG2l7"
}
},
{
"cell_type": "markdown",
"source": [
"To visualize the **distribution** of a data set we can draw a **histogram** of the data we collected.\n",
"\n",
"We will use the Python module **Matplotlib** to draw a histogram.\n",
"\n",
"The **histogram** is a **two-axis bar diagram**.\n",
"\n",
"- The x-axis: contains the ordered values from the dataset, grouped into intervals (one per bar).\n",
"- The y-axis: contains the number of elements of the dataset whose value falls inside each x-axis interval.\n",
"\n",
"The histogram will have multiple vertical bars.\n",
"\n",
"It can be useful to check the standard deviation of the data when deciding how many bars to use in the histogram.\n",
"\n",
"The more bars, the more detailed the information about the data distribution.\n"
],
"metadata": {
"id": "nokQI-r8G6_R"
}
},
{
"cell_type": "code",
"source": [
"import matplotlib.pyplot as plt\n",
"\n",
"# Example 1 with 400 integer pseudorandom values between 0 and 999:\n",
"dataset = random.choices(range(0, 1000, 1), k = 400)\n",
"dataset.sort()\n",
"\n",
"print(type(dataset))\n",
"print(dataset)\n",
"\n",
"# matplotlib.pyplot.hist returns 3 values (counts, edges, bars)\n",
"# Calling the function plt.hist(dataset, number_of_bars) generates the graph\n",
"# and we can also keep the returned values to use them later for the bar_label\n",
"counts, edges, bars = plt.hist(dataset, 20)\n",
"plt.bar_label(bars)\n",
"\n",
"plt.show()"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "O6iw4QrTIcZh",
"outputId": "b83832b4-46c6-45b4-8a23-8d8117966dec"
},
"execution_count": null,
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"<class 'list'>\n",
"[2, 6, 7, 8, 11, 13, 15, 17, 20, 23, 23, 24, 26, 28, 29, 32, 35, 36, 39, 45, 47, 52, 52, 54, 58, 60, 60, 65, 66, 69, 69, 72, 73, 78, 82, 82, 88, 90, 91, 92, 93, 93, 97, 99, 101, 107, 111, 111, 113, 114, 115, 119, 120, 126, 129, 131, 138, 140, 143, 143, 145, 145, 146, 149, 156, 161, 164, 165, 168, 169, 180, 180, 184, 185, 186, 188, 188, 190, 191, 198, 204, 207, 209, 216, 217, 218, 218, 223, 227, 231, 233, 234, 241, 244, 245, 246, 247, 250, 255, 256, 259, 264, 268, 268, 274, 280, 283, 288, 291, 297, 299, 300, 301, 301, 306, 312, 317, 317, 317, 322, 324, 327, 329, 339, 345, 348, 361, 361, 366, 366, 367, 372, 373, 375, 376, 377, 379, 379, 380, 383, 385, 388, 391, 392, 393, 398, 399, 400, 401, 401, 401, 403, 405, 406, 407, 408, 408, 408, 410, 411, 413, 414, 416, 421, 424, 425, 435, 438, 438, 439, 440, 441, 442, 450, 452, 459, 460, 462, 463, 471, 474, 477, 478, 482, 492, 492, 494, 495, 498, 498, 498, 501, 502, 507, 509, 516, 523, 524, 524, 527, 530, 531, 531, 531, 532, 533, 535, 540, 542, 542, 542, 544, 544, 547, 548, 555, 557, 558, 559, 559, 562, 567, 568, 568, 570, 572, 575, 580, 588, 593, 597, 597, 598, 598, 600, 602, 602, 603, 604, 607, 607, 608, 609, 616, 617, 618, 620, 626, 632, 635, 637, 638, 642, 648, 649, 650, 653, 655, 659, 661, 664, 666, 671, 673, 678, 681, 685, 691, 694, 697, 697, 701, 705, 707, 711, 711, 711, 713, 714, 714, 717, 718, 721, 721, 723, 725, 725, 726, 735, 738, 739, 741, 746, 747, 751, 751, 751, 752, 761, 762, 763, 765, 765, 766, 769, 773, 775, 778, 780, 780, 783, 787, 788, 791, 793, 793, 795, 798, 799, 802, 803, 803, 804, 805, 806, 814, 820, 823, 824, 825, 825, 828, 829, 829, 830, 831, 833, 835, 835, 836, 837, 839, 843, 845, 845, 847, 852, 852, 858, 860, 862, 865, 865, 872, 873, 873, 877, 879, 881, 884, 890, 891, 892, 892, 900, 900, 902, 902, 914, 916, 925, 931, 934, 935, 937, 940, 941, 942, 944, 944, 951, 951, 952, 959, 962, 966, 966, 972, 980, 981, 982, 985, 986, 986, 989, 991, 992, 996, 998, 998]\n"
]
},
{
"output_type": "display_data",
"data": {
"text/plain": [
"<Figure size 640x480 with 1 Axes>"
],
"image/png": "\n"
},
"metadata": {}
}
]
},
{
"cell_type": "code",
"source": [
"# Example 2 with 400 float pseudorandom values from 0.0 to 100.0:\n",
"dataset = np.random.uniform(0.0, 100.0, 400)\n",
"dataset.sort()\n",
"\n",
"print(type(dataset))\n",
"print(dataset)"
],
"metadata": {
"id": "ZNgm9g3dKqYY"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"source": [
"counts, edges, bars = plt.hist(dataset, 10)\n",
"plt.bar_label(bars)\n",
"\n",
"plt.show()\n"
],
"metadata": {
"id": "1OmUVDC-L5Ae"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"## Normal Data Distribution"
],
"metadata": {
"id": "tYZ6owkJQhqZ"
}
},
{
"cell_type": "markdown",
"source": [
"In this chapter we will learn how to **create an array** where the **values are concentrated** **around a given value**.\n",
"\n",
"In probability theory this kind of data distribution is known as the **normal data distribution**, or the **Gaussian data distribution**, after the mathematician Carl Friedrich Gauss who came up with the formula of this data distribution.\n",
"\n",
"The normal distribution is abbreviated here as **nd**.\n",
"\n",
"**Note**: A normal distribution graph is also known as the bell curve because of its characteristic bell shape."
],
"metadata": {
"id": "SdSzFfc-Qj6l"
}
},
{
"cell_type": "code",
"source": [
"# The instruction numpy.random.normal(center_value, standard_deviation, num_elements)\n",
"# generates a numpy.ndarray with values around the mean value = 5 with a standard deviation of 1.\n",
"# Since the series follows a normal distribution, about 68% of the values of the ndarray will be between [4, 6], because (5-1) = 4 and (5+1) = 6. Those values are said to lie within one standard deviation of the mean.\n",
"# About 95% of the values will be between [3, 7], within two standard deviations (standard deviation value * 2).\n",
"# About 99.73% of the values will be within three standard deviations, between [2, 8].\n",
"dataset_nd = np.random.normal(5.0, 1.0, 100000)\n",
"print(type(dataset_nd))\n",
"\n",
"counts, edges, bars = plt.hist(dataset_nd, 20)\n",
"plt.bar_label(bars)\n",
"\n",
"plt.show()"
],
"metadata": {
"id": "0e3U47glRCO9"
},
"execution_count": null,
"outputs": []
},
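{
"cell_type": "markdown",
"source": [
"The 68% / 95% / 99.73% rule can be checked empirically. The following is a minimal sketch that assumes the same mean (5.0) and standard deviation (1.0) as above. Because the sample is random, the printed shares will only be close to the theoretical percentages."
],
"metadata": {}
},
{
"cell_type": "code",
"source": [
"import numpy as np\n",
"\n",
"mean_value = 5.0\n",
"std_value = 1.0\n",
"values = np.random.normal(mean_value, std_value, 100000)\n",
"\n",
"for k in (1, 2, 3):\n",
"    # Share of values within k standard deviations of the mean\n",
"    share = np.mean(np.abs(values - mean_value) <= k * std_value)\n",
"    print(k, 'standard deviation(s):', share)"
],
"metadata": {},
"execution_count": null,
"outputs": []
},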
{
"cell_type": "markdown",
"source": [
"### Cars example with normal distribution"
],
"metadata": {
"id": "XJrOPcuIUztv"
}
},
{
"cell_type": "markdown",
"source": [
"Generate the two sets (cars age and cars speed)"
],
"metadata": {
"id": "AD3ztQOgaGGk"
}
},
{
"cell_type": "code",
"source": [
"import numpy as np\n",
"\n",
"cars_years_old = np.random.normal(5.0, 1.0, 1000)\n",
"cars_speed = np.random.normal(50.0, 20.0, 1000)"
],
"metadata": {
"id": "H4hWXwBXaECS"
},
"execution_count": 2,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"Plot the datasets:"
],
"metadata": {
"id": "lK9_MDZcaLP1"
}
},
{
"cell_type": "code",
"source": [
"import matplotlib.pyplot as plt\n",
"\n",
"# Generate a scatter plot representing all the points\n",
"plt.scatter(cars_years_old, cars_speed)\n",
"\n",
"# Add a title for the graph\n",
"plt.title('Distribution of speed and age for cars', fontsize = 14)\n",
"\n",
"# Add labels for axis\n",
"plt.xlabel('car age (years)', fontsize = 8, color = 'blue', fontweight = 'bold')\n",
"plt.ylabel('car speed (km/h)', fontsize = 8, color = 'blue', fontweight = 'bold')\n",
"\n",
"# Show a background grid\n",
"plt.grid()\n",
"\n",
"plt.show()"
],
"metadata": {
"id": "1MR16h7wU5Qs",
"outputId": "3e11c175-9ba9-4dda-ea1c-293ea970f9cc",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 469
}
},
"execution_count": 7,
"outputs": [
{
"output_type": "display_data",
"data": {
"text/plain": [
"<Figure size 640x480 with 1 Axes>"
],
"image/png": "\n"
},
"metadata": {}
}
]
},
{
"cell_type": "markdown",
"source": [
"Check the mean and standard deviation of the generated sets:"
],
"metadata": {
"id": "KHBmU59vaNuv"
}
},
{
"cell_type": "code",
"source": [
"mean_age = np.mean(cars_years_old)\n",
"print('The cars age mean is: ', mean_age)\n",
"\n",
"standard_deviation_age = np.std(cars_years_old)\n",
"print('The standard deviation for cars age is: ', standard_deviation_age)\n",
"\n",
"mean_speed = np.mean(cars_speed)\n",
"print('The mean value for cars speed is: ', mean_speed)\n",
"\n",
"standard_deviation_speed = np.std(cars_speed)\n",
"print('The std deviation for cars speed is: ', standard_deviation_speed)"
],
"metadata": {
"id": "vCch_xZPZvWL",
"outputId": "5fefbc1d-4c9b-4ddd-f6ed-0f4cd5c2b268",
"colab": {
"base_uri": "https://localhost:8080/"
}
},
"execution_count": 8,
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"The cars age mean is: 4.998381513061106\n",
"The standard deviation for cars age is: 0.9892486206465505\n",
"The mean value for cars speed is: 50.603779692796415\n",
"The std deviation for cars speed is: 20.18948558606819\n"
]
}
]
},
{
"cell_type": "markdown",
"source": [
"## The mathematics line formula"
],
"metadata": {
"id": "4Bx1PIUTo5Mn"
}
},
{
"cell_type": "markdown",
"source": [
"In mathematics, a straight line is represented by the equation y = m·x + b, where m is the slope and b is the y-intercept, as shown below:\n",
"\n",
"![image.png]()\n",
"\n",
"Source [here](https://www.geogebra.org/m/WCbjCGDC)"
],
"metadata": {
"id": "uB_ge5ITl999"
}
},
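{
"cell_type": "markdown",
"source": [
"A line of the form y = m·x + b can be sketched in Python as a small function. The slope and intercept values below are illustrative assumptions, not taken from any data set in this notebook."
],
"metadata": {}
},
{
"cell_type": "code",
"source": [
"# Hypothetical slope (m) and y-intercept (b), chosen only for illustration\n",
"m = 2.0\n",
"b = 1.0\n",
"\n",
"def line(x):\n",
"    # Evaluate y = m*x + b\n",
"    return m * x + b\n",
"\n",
"for x in [0, 1, 2]:\n",
"    print(x, line(x))"
],
"metadata": {},
"execution_count": null,
"outputs": []
},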
{
"cell_type": "markdown",
"source": [
"## Regression"
],
"metadata": {
"id": "pB4-27EFo-dq"
}
},
{
"cell_type": "markdown",
"source": [
"The **term regression** is used when **you try to find the relationship between variables**.\n",
"\n",
"In Machine Learning, and in statistical modeling, **that relationship is used to predict the outcome of future events**.\n",
"\n",
"There are **different kinds of regression** to be used depending on the distribution of the data."
],
"metadata": {
"id": "pGhSmSJao_8_"
}
},
{
"cell_type": "markdown",
"source": [
"## Linear Regression"
],
"metadata": {
"id": "01OHAB4ypE2w"
}
},
{
"cell_type": "markdown",
"source": [
"**Linear regression** finds the **straight line** that best represents the relation between two variables.\n",
"\n",
"Linear regression uses the relationship between the data points to draw a straight line that fits them as closely as possible.\n",
"\n",
"This line can be used to predict future values.\n"
],
"metadata": {
"id": "er1V0L5mpnMo"
}
},
{
"cell_type": "markdown",
"source": [
"![image.png]()"
],
"metadata": {
"id": "6VnEqxyArYA0"
}
},
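{
"cell_type": "markdown",
"source": [
"As a minimal sketch of the idea, a straight line can be fitted with numpy.polyfit using degree 1. The data points below are hypothetical values that roughly follow y = 2x + 1. The sklearn-based examples later in this notebook are the approach actually used here."
],
"metadata": {}
},
{
"cell_type": "code",
"source": [
"import numpy as np\n",
"\n",
"# Hypothetical data points that roughly follow y = 2x + 1\n",
"x = np.array([1, 2, 3, 4, 5])\n",
"y = np.array([3.1, 4.9, 7.2, 9.0, 10.8])\n",
"\n",
"# Fit a degree-1 polynomial (a straight line) by least squares\n",
"slope, intercept = np.polyfit(x, y, 1)\n",
"print('slope:', slope, 'intercept:', intercept)\n",
"\n",
"# Use the fitted line to predict y for a new x value\n",
"print('prediction for x = 6:', slope * 6 + intercept)"
],
"metadata": {},
"execution_count": null,
"outputs": []
},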
{
"cell_type": "markdown",
"source": [
"## Polynomial Regression"
],
"metadata": {
"id": "ZB7aadQ6qVdj"
}
},
{
"cell_type": "markdown",
"source": [
"**If your data points** clearly **will not fit a linear regression** (a straight line through all data points), **it might be ideal** for **polynomial regression**.\n",
"\n",
"Polynomial regression, like linear regression, uses the relationship between the variables x and y to find the best way to draw a line through the data points."
],
"metadata": {
"id": "7SdEtGVTqX8J"
}
},
{
"cell_type": "markdown",
"source": [
"![image.png]()"
],
"metadata": {
"id": "eSJa83uwq6Rs"
}
},
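{
"cell_type": "markdown",
"source": [
"The following sketch shows one possible way to fit a polynomial curve, using numpy.polyfit and numpy.poly1d. The data values and the chosen degree (2) are assumptions made for illustration. In practice the degree should be chosen by looking at the data."
],
"metadata": {}
},
{
"cell_type": "code",
"source": [
"import numpy as np\n",
"\n",
"# Hypothetical data that rises and then falls, so a straight line fits poorly\n",
"x = np.array([1, 2, 3, 4, 5, 6, 7, 8])\n",
"y = np.array([2, 6, 9, 11, 11, 9, 6, 2])\n",
"\n",
"# Fit a degree-2 polynomial and wrap the coefficients in a callable object\n",
"poly_model = np.poly1d(np.polyfit(x, y, 2))\n",
"\n",
"# Print the fitted polynomial and the predicted y at x = 4.5\n",
"print(poly_model)\n",
"print(poly_model(4.5))"
],
"metadata": {},
"execution_count": null,
"outputs": []
},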
{
"cell_type": "markdown",
"source": [
"## Multiple Regression"
],
"metadata": {
"id": "Bsb29UFj2vwb"
}
},
{
"cell_type": "markdown",
"source": [
"Multiple regression is like linear regression, but with more than one independent variable, meaning that we try to **predict a value based on two or more variables**.\n",
"\n",
"We have a **result variable** that **we want to predict** as a **function** of **two or more known variables**."
],
"metadata": {
"id": "cqQpTPLo2y-k"
}
},
{
"cell_type": "markdown",
"source": [
"Multiple regression explains how two or more independent variables determine a dependent result variable.\n",
"\n",
"Each of those independent variables can have a different impact on the result variable. The weight of each variable is called its **coefficient**.\n",
"\n",
"With the coefficients of the independent variables we can tell what would happen to the dependent variable if we increase or decrease one of the independent variables.\n",
"\n",
"We can imagine the multiple regression model as a linear function of two or more variables, each multiplied by a coefficient.\n",
"\n",
"Something such as:\n",
"dependent_variable = a·x + b·y + c·z ...\n",
"\n",
"Where a, b and c are numerical values."
],
"metadata": {
"id": "d7ghSgVvCtqQ"
}
},
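{
"cell_type": "markdown",
"source": [
"A minimal sketch of multiple regression with sklearn follows. The column names and values are hypothetical, and the result column is built as 2·x + 3·y + 5 so that the fitted coefficients can be compared with known values. Note that LinearRegression also fits an intercept term by default, in addition to one coefficient per independent variable."
],
"metadata": {}
},
{
"cell_type": "code",
"source": [
"import pandas as pd\n",
"from sklearn.linear_model import LinearRegression\n",
"\n",
"# Hypothetical data built so that result = 2*x + 3*y + 5\n",
"data = pd.DataFrame({\n",
"    'x': [1, 2, 3, 4, 5],\n",
"    'y': [2, 1, 4, 3, 6],\n",
"})\n",
"data['result'] = 2 * data['x'] + 3 * data['y'] + 5\n",
"\n",
"model = LinearRegression().fit(data[['x', 'y']], data['result'])\n",
"\n",
"# One coefficient per independent variable, plus the intercept\n",
"print('coefficients:', model.coef_)\n",
"print('intercept:', model.intercept_)"
],
"metadata": {},
"execution_count": null,
"outputs": []
},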
{
"cell_type": "markdown",
"source": [
"## Scale Features"
],
"metadata": {
"id": "lAZzfD399X3f"
}
},
{
"cell_type": "markdown",
"source": [
"When your data has different values, and even different measurement units, it can be difficult to compare them. What is kilograms compared to meters? Or altitude compared to time?\n",
"\n",
"The answer to this problem is scaling. We can scale data into new values that are easier to compare."
],
"metadata": {
"id": "yiPXszql9au3"
}
},
{
"cell_type": "markdown",
"source": [
"The standardization method uses this formula:\n",
"\n",
"z = (x - u) / s\n",
"\n",
"Where z is the new value, x is the original value, u is the mean and s is the standard deviation.\n",
"\n",
"If you take the weight column from the cars data set used in the source linked below, the first value is 790, and the scaled value will be:\n",
"\n",
" (790 - 1292.23) / 238.74 = -2.1\n",
"\n",
"If you take the volume column from the same data set, the first value is 1.0, and the scaled value will be:\n",
"\n",
" (1.0 - 1.61) / 0.38 = -1.59\n",
"\n",
"Now you can compare -2.1 with -1.59 instead of comparing 790 with 1.0.\n",
"\n",
"You do not have to do this manually: the Python **sklearn** module has a class called **StandardScaler**, and calling **StandardScaler()** returns a scaler object with methods for transforming data sets."
],
"metadata": {
"id": "lfc17ywo9oJQ"
}
},
{
"cell_type": "markdown",
"source": [
"- It is necessary to import: **from sklearn.preprocessing import StandardScaler**\n",
"\n",
"- And use: **scale = StandardScaler()**"
],
"metadata": {
"id": "IenALSdE-Aqy"
}
},
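{
"cell_type": "markdown",
"source": [
"The following sketch applies StandardScaler to a small array and compares the first column with the manual formula z = (x - u) / s. The weight and volume values are hypothetical and used only for illustration. Both StandardScaler and numpy.std use the population standard deviation by default, so the two results should agree."
],
"metadata": {}
},
{
"cell_type": "code",
"source": [
"import numpy as np\n",
"from sklearn.preprocessing import StandardScaler\n",
"\n",
"# Hypothetical weight and volume columns, for illustration only\n",
"values = np.array([\n",
"    [790.0, 1.0],\n",
"    [1160.0, 1.2],\n",
"    [1500.0, 2.0],\n",
"])\n",
"\n",
"scale = StandardScaler()\n",
"scaled = scale.fit_transform(values)\n",
"print(scaled)\n",
"\n",
"# Manual check of the first column with z = (x - u) / s\n",
"weights = values[:, 0]\n",
"print((weights - weights.mean()) / weights.std())"
],
"metadata": {},
"execution_count": null,
"outputs": []
},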
{
"cell_type": "markdown",
"source": [
"For more detail, click [here](https://www.w3schools.com/python/python_ml_scale.asp)"
],
"metadata": {
"id": "Y1MYOh4v90Oo"
}
},
{
"cell_type": "markdown",
"source": [
"\n",
"\n",
"---\n",
"\n"
],
"metadata": {
"id": "rErblYNRqOva"
}
},
{
"cell_type": "markdown",
"source": [
"# Code examples"
],
"metadata": {
"id": "xf56O6BrbWgj"
}
},
{
"cell_type": "markdown",
"source": [
"## Import libraries"
],
"metadata": {
"id": "CCPiKFWfWSRd"
}
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"id": "tmTVCqCJRfg-"
},
"outputs": [],
"source": [
"import matplotlib.pyplot as plt\n",
"from sklearn.linear_model import LinearRegression\n",
"import pandas as pd\n",
"from scipy import stats\n",
"import random\n",
"import numpy as np"
]
},
{
"cell_type": "markdown",
"source": [
"## Diamonds example - Linear Regression"
],
"metadata": {
"id": "pjAK1WQOWA5i"
}
},
{
"cell_type": "markdown",
"source": [
"For more info click [here](https://discovery.cs.illinois.edu/learn/Towards-Machine-Learning/Machine-Learning-Models-in-Python-with-sk-learn/)\n",
"\n",
"The variable **diamonds_model** uses the column variables *carat* and *price* from the dataframe coming from the [diamonds file](https://waf.cs.illinois.edu/discovery/diamonds.csv)\n",
"\n"
],
"metadata": {
"id": "cHa_GeDhXbQb"
}
},
{
"cell_type": "markdown",
"source": [
"### Get the data"
],
"metadata": {
"id": "z_07li2mfFH5"
}
},
{
"cell_type": "code",
"source": [
"diamonds_dataset = pd.read_csv(\"https://waf.cs.illinois.edu/discovery/diamonds.csv\")\n",
"\n",
"print(diamonds_dataset)"
],
"metadata": {
"id": "dTnVAIMxTvS_"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"### Train the model"
],
"metadata": {
"id": "ncCRdUBTe1o5"
}
},
{
"cell_type": "code",
"source": [
"diamonds_model = LinearRegression().fit(diamonds_dataset[ ['carat'] ]\n",
" , diamonds_dataset['price'])"
],
"metadata": {
"id": "dHTT1oZWe2oC"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"### Check the regression score"
],
"metadata": {
"id": "e2oBnrTlfW8j"
}
},
{
"cell_type": "code",
"source": [
"diamonds_model.score(diamonds_dataset[ ['carat'] ]\n",
" , diamonds_dataset['price'])"
],
"metadata": {
"id": "PxaJQb2RfZKp"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"### Test the model"
],
"metadata": {
"id": "aZS5WSkNtpsk"
}
},
{
"cell_type": "code",
"source": [
"diamonds_test = pd.DataFrame({'carat' : np.arange(0.1, 5.1, 0.1)})\n",
"\n",
"# Add a new column called 'predicted_price' to the diamonds_test dataset\n",
"diamonds_test['predicted_price'] = diamonds_model.predict(diamonds_test)\n"
],
"metadata": {
"id": "yttwNt66trxm"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"Print the calculated results"
],
"metadata": {
"id": "pmt4RSUGuksk"
}
},
{
"cell_type": "code",
"source": [
"print(diamonds_test)"
],
"metadata": {
"id": "FoZVr6C8ucyh"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"Plot the actual data vs calculated data"
],
"metadata": {
"id": "jYVtE7GeunWV"
}
},
{
"cell_type": "code",
"source": [
"# Actual data: one point per diamond (a scatter plot avoids joining unsorted rows with lines)\n",
"plt.scatter(diamonds_dataset['price']\n",
" , diamonds_dataset['carat'])\n",
"\n",
"# To filter the values from a pandas.DataFrame:\n",
"# dataframe[\n",
"# dataframe['column_name'] < > = != value\n",
"# ]\n",
"# ['column_name_to_show']\n",
"plt.plot(diamonds_test[diamonds_test['carat'] < 5]['predicted_price']\n",
" , diamonds_test[diamonds_test['carat'] < 5]['carat'])\n",
"\n",
"plt.title('Price of the diamond as a function of its weight')\n",
"plt.xlabel('Diamond price (eur)')\n",
"plt.ylabel('Diamond weight (carat)')\n",
"\n",
"plt.grid()\n",
"\n",
"plt.show()"
],
"metadata": {
"id": "hDxgDTveuptc"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"## Car speed example - Linear Regression"
],
"metadata": {
"id": "PQ6SlyfN0nw3"
}
},
{
"cell_type": "markdown",
"source": [
"The values of the Python list **years_old** represent the age of the cars; the values of **speed** represent the velocity of the cars.\n",
"\n",
"The goal here is to find the **Linear Regression** which represents the relation between the variables **years_old** and **speed**.\n",
"\n",
"Linear regression is used to determine one variable as a function of another one.\n",
"\n",
"The values will be placed on the x-y axes.\n",
"\n",
"The **cars** variable will be a pandas.DataFrame built from a Python dictionary with two *keys*, where each key has a Python list of multiple values as its value.\n",
"\n"
],
"metadata": {
"id": "CIMzasjBYGUW"
}
},
{
"cell_type": "code",
"source": [
"years_old = [5,7,8,7,2,17,2,9,4,11,12,9,6]\n",
"speed = [99,86,87,88,111,86,103,87,94,78,77,85,86]\n",
"cars = pd.DataFrame({\n",
" \"years_old\" : years_old\n",
" , \"speed\" : speed\n",
" })"
],
"metadata": {
"id": "vLRGSGW8SUqx"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"source": [
"print(cars)"
],
"metadata": {
"id": "NMYtqKe8S6A5"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"Prepare a plot from the columns *years_old* and *speed* from the DataFrame **cars**."
],
"metadata": {
"id": "_NKND0VsZ98l"
}
},
{
"cell_type": "code",
"source": [
"plt.scatter(cars['years_old'], cars['speed'])\n",
"\n",
"plt.title('Cars age vs car speed')\n",
"plt.xlabel('car age (years)')\n",
"plt.ylabel('car speed (km/h)')\n",
"\n",
"plt.grid()\n",
"\n",
"plt.show()"
],
"metadata": {
"id": "TPIKd49jVyZ5"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"### Train the model with sklearn"
],
"metadata": {
"id": "JZy8FofNzl1b"
}
},
{
"cell_type": "markdown",
"source": [
"Use the function **fit** from **sklearn.linear_model.LinearRegression** to train the model.\n",
"\n",
"In the current example, the variable **model_cars** will store the trained model, which holds the Linear Regression between *years_old* and *speed*."
],
"metadata": {
"id": "Qs1PLOxbab5U"
}
},
{
"cell_type": "code",
"source": [
"model_cars = LinearRegression()\n",
"model_cars = model_cars.fit(cars[ [\"years_old\"] ], cars[\"speed\"] )"
],
"metadata": {
"id": "tPFJbhJEabQQ"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"### Test the model"
],
"metadata": {
"id": "SnLqSkxxzrhq"
}
},
{
"cell_type": "markdown",
"source": [
"Use the function **predict** from **sklearn.linear_model.LinearRegression** to calculate the resulting **speed** for each value.\n",
"\n",
"In the current example, we create a new pandas.DataFrame with some values for the variable **years_old**.\n",
"\n",
"After that, we create a new column called **calc_speed** which will hold the speed predicted by the fitted linear regression as a function of the age of the car."
],
"metadata": {
"id": "wdDgO7z0jKqK"
}
},
{
"cell_type": "code",
"source": [
"sample_data_cars = pd.DataFrame({\"years_old\" :\n",
" [1, 3, 4, 5, 6, 7, 8, 9, 10, 20]})\n",
"sample_data_cars[\"calc_speed\"] = model_cars.predict(sample_data_cars)"
],
"metadata": {
"id": "GWlmm8Zqc4YZ"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"source": [
"print(sample_data_cars)"
],
"metadata": {
"id": "Sw4micwvhTDm"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"source": [
"plt.plot(sample_data_cars[\"years_old\"], sample_data_cars[\"calc_speed\"])\n",
"plt.scatter(sample_data_cars[\"years_old\"], sample_data_cars[\"calc_speed\"])\n",
"\n",
"plt.title('Linear Regression age vs speed')\n",
"plt.xlabel('car age (years)')\n",
"plt.ylabel('car speed (km/h)')\n",
"plt.grid()\n",
"\n",
"plt.show()"
],
"metadata": {
"id": "cF0nLcwHj2uS"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"source": [
"plt.scatter(cars[\"years_old\"], cars[\"speed\"])\n",
"plt.plot(sample_data_cars[\"years_old\"], sample_data_cars[\"calc_speed\"])\n",
"\n",
"plt.title('Actual values and Linear Regression')\n",
"plt.xlabel('car age (years)')\n",
"plt.ylabel('car speed (km/h)')\n",
"plt.grid()\n",
"\n",
"plt.show()\n"
],
"metadata": {
"id": "gzH6Se2LkNXa"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"### Check the linear regression line formula"
],
"metadata": {
"id": "EMgMk-Y5x_0X"
}
},
{
"cell_type": "markdown",
"source": [
"Once we have trained the model and obtained the linear regression, we can get the line formula as follows:"
],
"metadata": {
"id": "lAquYd8zy84l"
}
},
{
"cell_type": "code",
"source": [
"m = model_cars.coef_\n",
"b = model_cars.intercept_\n",
"\n",
"print('f(x) = ', m ,'x +', b)"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "tsvBUb43zEcF",
"outputId": "0010c0a3-4ced-4d16-d68b-53c4ea340e2a"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"Now we have the **formula** of the **linear regression line**.\n",
"\n",
"For instance, we can use it to estimate the speed of a 7.5-year-old car:"
],
"metadata": {
"id": "yA9MAskE1OBk"
}
},
{
"cell_type": "code",
"source": [
"# Regression line written out with the slope and intercept obtained from the trained model\n",
"y = lambda x : -1.75128771 * x + 103.10596026490066\n",
"\n",
"print(y(7.5))"
],
"metadata": {
"id": "RUiICt211agm"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"If we check this value against the previous graph, it looks great!\n",
"\n",
"We can also predict a single value by using **model_cars** directly:"
],
"metadata": {
"id": "Qiei4yM91umr"
}
},
{
"cell_type": "code",
"source": [
"print(model_cars.predict([[7.5]]))"
],
"metadata": {
"id": "fUOkEDLy17LW"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"Nevertheless, our model is not perfect: if we predict the speed of a 0-year-old car, we get 103.10..., which is simply the intercept of the line with the y axis (the value of the formula when x equals 0)."
],
"metadata": {
"id": "RtVtiyLe2VyE"
}
},
{
"cell_type": "markdown",
"source": [
"## Rain example - Linear Regression"
],
"metadata": {
"id": "qeTSsv6y0ttR"
}
},
{
"cell_type": "code",
"source": [
"#first_20_days = np.random.uniform(0.8, 1, 20)\n",
"#last_11_days = np.random.uniform(0, 0.20, 11)\n",
"\n",
"# Get k elements within a range(min, max, step) with random.choices()\n",
"first_20_days = random.choices(range(0, 10, 1), k = 20)\n",
"last_11_days = random.choices(range(11, 120, 1), k = 11)\n",
"\n",
"rain_values = np.concatenate((first_20_days, last_11_days), axis=None)\n",
"\n",
"print(rain_values)\n",
"\n",
"days = range(1, 32, 1)\n",
"\n",
"print(rain_values.size)\n",
"\n",
"rain_month = pd.DataFrame({\"day\" : days, \"rain\" : rain_values})\n",
"\n",
"print(rain_month)"
],
"metadata": {
"id": "qUPrbdTK0vw2"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"Plot the previous series"
],
"metadata": {
"id": "F5QWPp-nlONn"
}
},
{
"cell_type": "code",
"source": [
"plt.plot(rain_month[\"day\"], rain_month[\"rain\"])\n",
"plt.scatter(rain_month[\"day\"], rain_month[\"rain\"])\n",
"\n",
"plt.title('Poured rain per day')\n",
"plt.xlabel('Day of the month')\n",
"plt.ylabel('Rain (ml/m^2)')\n",
"plt.show()"
],
"metadata": {
"id": "g35QCoKvk1iR"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"### Train the rain model"
],
"metadata": {
"id": "uxw9VGQ2E4cI"
}
},
{
"cell_type": "code",
"source": [
"model_rain = LinearRegression()\n",
"model_rain = model_rain.fit(rain_month[ [\"day\"] ], rain_month[\"rain\"] )"
],
"metadata": {
"id": "gmqVtOfeEsGj"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"### Test the model"
],
"metadata": {
"id": "2J7syQ7necOk"
}
},
{
"cell_type": "code",
"source": [
"# Generate a new DF called days representing the 31 days of the month\n",
"days = pd.DataFrame({\"day\" : range(1, 32, 1)})\n",
"\n",
"# Test our model for each day\n",
"days[\"predict\"] = model_rain.predict(days)\n",
"\n",
"print(days)"
],
"metadata": {
"id": "GDFLLV6EE3pn"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"As the plot below shows, there is no strong linear correlation between the day of the month and the amount of rain (blue line and dots).\n",
"\n",
"That is why the linear regression line, drawn in orange, does not fit the data well."
],
"metadata": {
"id": "LzB_z-FxlrUi"
}
},
{
"cell_type": "code",
"source": [
"# Plot the continuous line for the actual data\n",
"plt.plot(rain_month[\"day\"], rain_month[\"rain\"])\n",
"\n",
"# Plot the scatter dots for the actual data\n",
"plt.scatter(rain_month[\"day\"], rain_month[\"rain\"])\n",
"\n",
"# Plot the continuous line for the prediction\n",
"# Represents the Linear Regression line\n",
"plt.plot(days[\"day\"], days[\"predict\"])\n",
"\n",
"plt.title('Poured rain per day with the linear regression')\n",
"plt.xlabel('Day of the month')\n",
"plt.ylabel('Rain (ml/m^2)')\n",
"\n",
"plt.show()"
],
"metadata": {
"id": "fZGqps9QlUii"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"### Linear Regression score"
],
"metadata": {
"id": "IoiHT-O2cXn-"
}
},
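{
"cell_type": "markdown",
"source": [
"**LinearRegression.score(x, y)** returns the **coefficient of determination** (r-squared) of the **prediction**. It is a float whose best possible value is 1 (a perfect fit); 0 means the model does no better than always predicting the mean of y, and the score can even be negative when the model fits worse than that."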
],
"metadata": {
"id": "KqVcm6U_nMBL"
}
},
{
"cell_type": "code",
"source": [
"model_rain.score(rain_month[ [\"day\"] ], rain_month[\"rain\"])"
],
"metadata": {
"id": "6VTl55Itm7PK"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"## Polynomial Regression example"
],
"metadata": {
"id": "Cwq4qp601vop"
}
},
{
"cell_type": "markdown",
"source": [
"**If your data points** clearly **do not fit a linear regression** (a straight line through the data points), **polynomial regression might be a better choice**.\n",
"\n",
"Polynomial regression, like linear regression, uses the relationship between the variables x and y to find the best way to draw a line through the data points, but the line is a curve instead of a straight line.\n",
"\n",
"For that purpose, we will use:\n",
"- **numpy.poly1d(numpy.polyfit(x, y, degree))** to create a polynomial model\n",
"- **numpy.linspace(start, end, number of points)** to generate the evenly spaced x values at which the model curve is drawn."
],
"metadata": {
"id": "R1GqzWTm2c9E"
}
},
{
"cell_type": "markdown",
"source": [
"Create sample data to be fitted with a polynomial function of degree 3:"
],
"metadata": {
"id": "1BbuYLXK6aJm"
}
},
{
"cell_type": "code",
"source": [
"x = range(1, 23, 1)\n",
"print(x[21])\n",
"y = [100,90,80,82,60,60,55,58,60,65,70,70,75,78,76,78,79,80,90,99,99,100]"
],
"metadata": {
"id": "MBDAyuiM6GB2"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"Show the created points:"
],
"metadata": {
"id": "Rgg5_rPf6iHh"
}
},
{
"cell_type": "code",
"source": [
"plt.scatter(x, y)"
],
"metadata": {
"id": "zBd_7c1t6hZY"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"Fit the polynomial model:"
],
"metadata": {
"id": "hSaKSo9l7r_g"
}
},
{
"cell_type": "code",
"source": [
"mymodel = np.poly1d(np.polyfit(x, y, 3))\n",
"\n",
"myline = np.linspace(1, 22, 100)"
],
"metadata": {
"id": "dSo0Xngn7rnl"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"Show the results:"
],
"metadata": {
"id": "qcEMYiCh76GD"
}
},
{
"cell_type": "code",
"source": [
"plt.scatter(x, y)\n",
"plt.plot(myline, mymodel(myline))\n",
"\n",
"plt.title('Actual points and polynomial regression')\n",
"plt.grid()\n",
"\n",
"plt.show()"
],
"metadata": {
"id": "775tqJGF77q6"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"Check how well the model fits the data by looking at the r-squared value.\n",
"\n",
"If the fit is good, we can place more trust in the model's predictions; if it is poor, the predictions will not be reliable.\n",
"\n",
"The r-squared value ranges from 0 to 1, where 0 means no relationship, and 1 means 100% related.\n",
"\n",
"The Sklearn module will compute this value for you; all you have to do is pass it the actual y values and the values predicted by the model:"
],
"metadata": {
"id": "047LAmnb8Cj7"
}
},
{
"cell_type": "code",
"source": [
"from sklearn.metrics import r2_score\n",
"\n",
"r_squared = r2_score(y, mymodel(x))\n",
"\n",
"print(r_squared)"
],
"metadata": {
"id": "Yrg0W6ny8Ffm"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"Test the model with individual values:"
],
"metadata": {
"id": "IleN0bac8p5k"
}
},
{
"cell_type": "code",
"source": [
"predicted_value = mymodel(15)\n",
"\n",
"print(predicted_value)"
],
"metadata": {
"id": "NTne7Wvz8xDc"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"## Car example - Multiple Regression"
],
"metadata": {
"id": "DKKu5mLP42Oh"
}
},
{
"cell_type": "markdown",
"source": [
"### Get the data from csv"
],
"metadata": {
"id": "E1kCwavSEnrd"
}
},
{
"cell_type": "markdown",
"source": [
"Take a look at the data set below; it contains some information about cars."
],
"metadata": {
"id": "Rj5S4qQq5Xkp"
}
},
{
"cell_type": "code",
"source": [
"cars_complete_dataset = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/cars.csv')"
],
"metadata": {
"id": "25eVHGrq8sX5"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"source": [
"print(cars_complete_dataset)"
],
"metadata": {
"id": "2EAlAQ6889PO"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"We can predict the CO2 emission of a car based on the size of the engine, but with multiple regression we can throw in more variables, like the weight of the car, to make the prediction more accurate."
],
"metadata": {
"id": "8FTuZnsk8wsL"
}
},
{
"cell_type": "markdown",
"source": [
"### Prepare the data for the model"
],
"metadata": {
"id": "b-jXQ5koEuub"
}
},
{
"cell_type": "markdown",
"source": [
"From the dataset, select the *independent variables* that we want to use for the prediction (columns 'Weight' and 'Volume') and the variable we want to predict, also called the *dependent variable* (column 'CO2').\n",
"\n",
"**Note**: It is common to name the list of independent values with an upper-case X, and the list of dependent values with a lower-case y."
],
"metadata": {
"id": "2PoDk6zs9HAf"
}
},
{
"cell_type": "code",
"source": [
"X = cars_complete_dataset[['Weight', 'Volume']]\n",
"y = cars_complete_dataset['CO2']"
],
"metadata": {
"id": "O4kBwJBD-SwU"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"### Train the model"
],
"metadata": {
"id": "Ys6lCp-vEzEW"
}
},
{
"cell_type": "markdown",
"source": [
"Import the **linear_model** module from **sklearn**."
],
"metadata": {
"id": "9yrpGm4X_crN"
}
},
{
"cell_type": "code",
"source": [
"from sklearn import linear_model as lm"
],
"metadata": {
"id": "gTgNvhBr_0YA"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"Create the model co2_model and train it with the fit function, passing the independent variables ['Weight', 'Volume'] as X and the dependent variable 'CO2' as y."
],
"metadata": {
"id": "ByE7mZ5l_3zx"
}
},
{
"cell_type": "code",
"source": [
"co2_model = lm.LinearRegression().fit(X, y)\n",
"print(type(co2_model))"
],
"metadata": {
"id": "t_GaXtox_dPZ"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"### Test the model"
],
"metadata": {
"id": "p5qSX40DE36N"
}
},
{
"cell_type": "markdown",
"source": [
"Now, we can use this **co2_model** to predict the CO2 emissions of other cars based on their weight and engine volume."
],
"metadata": {
"id": "4jBjXa_8AOGf"
}
},
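{
"cell_type": "code",
"source": [
"cybertruck_weight = 3104\n",
"cybertruck_engine = 6000\n",
"\n",
"# Use the training column names so the input matches the features the model was fitted on\n",
"cybertruck = pd.DataFrame({'Weight': [cybertruck_weight], 'Volume': [cybertruck_engine]})\n",
"cybertruck_co2 = co2_model.predict(cybertruck)\n",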
"\n",
"print(cybertruck_co2)"
],
"metadata": {
"id": "Ljux4LA2AiD-"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"### Get the coefficient factor"
],
"metadata": {
"id": "t3ldcBfqE6iv"
}
},
{
"cell_type": "markdown",
"source": [
"Get the **coefficient** that the model assigned to each independent variable."
],
"metadata": {
"id": "CxC4mtz6Ec5p"
}
},
{
"cell_type": "code",
"source": [
"print(co2_model.coef_)"
],
"metadata": {
"id": "qgQxLc2jEkG_"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"The resulting array holds the coefficients of weight and volume:\n",
"\n",
"- Weight: 0.00755095\n",
"- Volume: 0.00780526\n",
"\n",
"These values tell us that if the weight increases by 1 kg while the engine size stays the same, the predicted CO2 emission increases by 0.00755095 g.\n",
"\n",
"And if the engine size (Volume) increases by 1 cm3 while the weight stays the same, the predicted CO2 emission increases by 0.00780526 g."
],
"metadata": {
"id": "FcpCvg48FZAF"
}
},
{
"cell_type": "markdown",
"source": [
"### Scale the features"
],
"metadata": {
"id": "Bm37MRUw-ckP"
}
},
{
"cell_type": "code",
"source": [
"from sklearn.preprocessing import StandardScaler\n",
"from sklearn import linear_model\n",
"\n",
"scale = StandardScaler()\n",
"\n",
"# Scale the values inside X (independent variables weight and engine)\n",
"scaledX = scale.fit_transform(X)"
],
"metadata": {
"id": "SdzKKMF9-kW3"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"Regenerate the model:"
],
"metadata": {
"id": "hWFg6gjf_wNF"
}
},
{
"cell_type": "code",
"source": [
"new_model = linear_model.LinearRegression().fit(scaledX, y)"
],
"metadata": {
"id": "wxgJhl4p_yl8"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"Scale a single value to be tested:"
],
"metadata": {
"id": "GbMnq81xAY_K"
}
},
{
"cell_type": "code",
"source": [
"# Weight in kg and engine volume in cm3, the same units as the training data\n",
"scaled = scale.transform([[2300, 1300]])"
],
"metadata": {
"id": "rIEhOeiCAafy"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"Test the scaled value:"
],
"metadata": {
"id": "Vr5kWPRDAi5w"
}
},
{
"cell_type": "code",
"source": [
"predictedCO2 = new_model.predict([scaled[0]])\n",
"\n",
"print(predictedCO2)"
],
"metadata": {
"id": "nAOikw-TAkiT"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"---"
],
"metadata": {
"id": "iGNj8SEN4zMD"
}
},
{
"cell_type": "markdown",
"source": [
"# References"
],
"metadata": {
"id": "o9LMPSz7h5e8"
}
},
{
"cell_type": "markdown",
"source": [
"- [Machine Learning Models in Python with sk-learn](https://discovery.cs.illinois.edu/learn/Towards-Machine-Learning/Machine-Learning-Models-in-Python-with-sk-learn/)\n",
"- [Machine Learning - Linear Regression](https://www.w3schools.com/python/python_ml_linear_regression.asp)\n",
"- [Matplotlib axis labels](https://www.scaler.com/topics/matplotlib/matplotlib-axis-label/)\n",
"- [Machine Learning - Multiple Regression](https://www.w3schools.com/python/python_ml_multiple_regression.asp)"
],
"metadata": {
"id": "9_99KMHZh73p"
}
}
]
}