kirisakow · May 29, 2023 15:53 · Platinum-Dragon · Mar 2, 2023 · kirisakow · Mar 2, 2023
diff --git a/run_spark_mllib_scala_in_colab_with_almond.ipynb b/run_spark_mllib_scala_in_colab_with_almond.ipynb
 {
  "nbformat": 4,
  "nbformat_minor": 0,
  "metadata": {
    "colab": {
      "provenance": [],
      "collapsed_sections": [],
      "toc_visible": true,
      "include_colab_link": true
    },
    "kernelspec": {
      "display_name": "Scala",
      "name": "scala"
    }
  },
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "view-in-github",
        "colab_type": "text"
      },
      "source": [
        "<a href=\"https://colab.research.google.com/gist/kirisakow/2f6ef957673df6dcbc20bcdaa33c202a/run_spark_mllib_scala_in_colab_with_almond.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "# Run Spark MLlib and Scala in Google Colab with Almond",
        "\n\n",
        "### <u>**Deprecation warning:**</u> The Google Colab interface appears to have changed and no longer allows side kernels to be used the way it once did. I have stopped using Google Colab myself and now use Docker images and containers instead. To run Scala in Jupyter Notebook in a Docker container, you can follow this [guide of mine](https://github.com/kirisakow/scala-jupyter-container), which is based on the `jupyter/all-spark-notebook` Docker image and the latest Almond and Scala. Happy coding!"
      ],
      "metadata": {
        "id": "tnRm0YwmdLhl"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "## Important prerequisite 1 / 4\n",
        "\n",
        "Open your Colab Notebook with a text editor and make sure the `kernelspec` key is set to work with Scala, like so:\n",
        "\n",
        "```json\n",
        "{\n",
        "  ⋮\n",
        "  \"kernelspec\": {\n",
        "    \"display_name\": \"Scala\",\n",
        "    \"name\": \"scala\"\n",
        "  }\n",
        "  ⋮\n",
        "}\n",
        "```"
      ],
      "metadata": {
        "id": "UMudsO4-dQ03"
      }
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "QVJoUDPtb9gX"
      },
      "source": [
        "## Important prerequisite 2 / 4\n",
        "\n",
        "Run the cell below to [install the Almond kernel](https://almond.sh/docs/quick-start-install) into the global Jupyter kernels:"
      ]
    },
    {
      "cell_type": "code",
      "source": [
        "! curl -sS -Lo coursier https://git.io/coursier-cli\n",
        "! chmod +x coursier\n",
        "SCALA_VERSION=\"2.12.8\"\n",
        "ALMOND_VERSION=\"0.3.1\"\n",
        "! ./coursier bootstrap -r jitpack -i user -I user:sh.almond:scala-kernel-api_$SCALA_VERSION:$ALMOND_VERSION sh.almond:scala-kernel_$SCALA_VERSION:$ALMOND_VERSION -o almond 1>/dev/null 2>&1\n",
        "! ./almond --install 1>/dev/null \n",
        "! rm -f ./coursier ./almond"
      ],
      "metadata": {
        "id": "j-1b2BcOm6py"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "## Important prerequisite 3 / 4\n",
        "\n",
        "Reload the Google Colab page to activate the Scala kernel."
      ],
      "metadata": {
        "id": "wyH-FiPgxfIL"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "Now you can work in Scala:"
      ],
      "metadata": {
        "id": "5hDyl5WedYRK"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "println(scala.util.Properties.versionString)"
      ],
      "metadata": {
        "id": "N8XpeKoGnqWJ",
        "outputId": "e4fc5ea1-0f92-4608-a1d7-4993a21fa419",
        "colab": {
          "base_uri": "https://localhost:8080/"
        }
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "version 2.12.8\n"
          ]
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "## Important prerequisite 4 / 4\n",
        "\n",
        "Download the dependencies:"
      ],
      "metadata": {
        "id": "ivoZNETEXxwy"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "import $ivy.`sh.almond::almond-spark:0.3.0`\n",
        "import $ivy.`org.apache.spark::spark-sql:2.4.0`\n",
        "import $ivy.`org.apache.spark::spark-mllib:2.4.0`"
      ],
      "metadata": {
        "id": "5Dawce-sDcZB"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "source": [
        "import org.apache.log4j.{Level, Logger}\n",
        "\n",
        "Logger.getLogger(\"org\").setLevel(Level.OFF)"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "Bbwh-nrQE7cP",
        "outputId": "2289b274-0357-4973-ddf6-3ad44a5346ef"
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "\u001b[32mimport \u001b[39m\u001b[36morg.apache.log4j.{Level, Logger}\n",
              "\n",
              "\u001b[39m"
            ]
          },
          "metadata": {},
          "execution_count": 3
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "Initialize a SparkSession instance:"
      ],
      "metadata": {
        "id": "L_8OZ3r7YB8f"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "import org.apache.spark.sql._\n",
        "\n",
        "val spark = {\n",
        "  NotebookSparkSession.builder()\n",
        "    .master(\"local[*]\")\n",
        "    .config(\"spark.ui.port\", \"4050\")\n",
        "    .getOrCreate()\n",
        "}"
      ],
      "metadata": {
        "id": "kgZa-oheE_P3"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "Make a dummy dataset:"
      ],
      "metadata": {
        "id": "3t0rnZQQYHs6"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "import spark.implicits._\n",
        "\n",
        "val data = Seq((1,2,3), (4,5,6), (6,7,8), (9,19,10))\n",
        "val ds = spark.createDataset(data)\n",
        "ds.show()"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "U7u5ztb_GHCG",
        "outputId": "4e7795a6-a90a-4672-894c-f59cbcafc74c"
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "+---+---+---+\n",
            "| _1| _2| _3|\n",
            "+---+---+---+\n",
            "|  1|  2|  3|\n",
            "|  4|  5|  6|\n",
            "|  6|  7|  8|\n",
            "|  9| 19| 10|\n",
            "+---+---+---+\n",
            "\n"
          ]
        },
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "\u001b[32mimport \u001b[39m\u001b[36mspark.implicits._\n",
              "\n",
              "\u001b[39m\n",
              "\u001b[36mdata\u001b[39m: \u001b[32mSeq\u001b[39m[(\u001b[32mInt\u001b[39m, \u001b[32mInt\u001b[39m, \u001b[32mInt\u001b[39m)] = \u001b[33mList\u001b[39m((\u001b[32m1\u001b[39m, \u001b[32m2\u001b[39m, \u001b[32m3\u001b[39m), (\u001b[32m4\u001b[39m, \u001b[32m5\u001b[39m, \u001b[32m6\u001b[39m), (\u001b[32m6\u001b[39m, \u001b[32m7\u001b[39m, \u001b[32m8\u001b[39m), (\u001b[32m9\u001b[39m, \u001b[32m19\u001b[39m, \u001b[32m10\u001b[39m))\n",
              "\u001b[36mds\u001b[39m: \u001b[32mDataset\u001b[39m[(\u001b[32mInt\u001b[39m, \u001b[32mInt\u001b[39m, \u001b[32mInt\u001b[39m)] = [_1: int, _2: int ... 1 more field]"
            ]
          },
          "metadata": {},
          "execution_count": 5
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "Retrieve a remote dataset:"
      ],
      "metadata": {
        "id": "UwLXm8t6YNoD"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "import org.apache.spark.SparkFiles\n",
        "\n",
        "spark.sparkContext.addFile(\n",
        "  \"https://raw.githubusercontent.com/apache/spark/master/data/mllib/sample_libsvm_data.txt\"\n",
        ")"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "1GpokHaQVj1v",
        "outputId": "dbdb4f11-7a35-4f02-93ab-8d7a6287dc52"
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "\u001b[32mimport \u001b[39m\u001b[36morg.apache.spark.SparkFiles\n",
              "\n",
              "\u001b[39m"
            ]
          },
          "metadata": {},
          "execution_count": 11
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "Run a binomial logistic regression:"
      ],
      "metadata": {
        "id": "7o11CiK7YYFp"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "import org.apache.spark.ml.classification.LogisticRegression\n",
        "\n",
        "// Load training data\n",
        "val training = spark.read.format(\"libsvm\").load(SparkFiles.get(\"sample_libsvm_data.txt\"))\n",
        "\n",
        "val lr = new LogisticRegression()\n",
        "  .setMaxIter(10)\n",
        "  .setRegParam(0.3)\n",
        "  .setElasticNetParam(0.8)\n",
        "\n",
        "// Fit the model\n",
        "val lrModel = lr.fit(training)"
      ],
      "metadata": {
        "id": "6ag_AB6gRCxW"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "source": [
        "// Print the coefficients and intercept for logistic regression\n",
        "println(s\"Intercept: ${lrModel.intercept}\")\n",
        "println(s\"Coefficients: ${lrModel.coefficients}\")"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "1N2UiD3yWJ3o",
        "outputId": "125ce63c-9af6-45c9-da5e-392f82bc54e1"
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "Intercept: 0.22456315961250325\n",
            "Coefficients: (692,[244,263,272,300,301,328,350,351,378,379,405,406,407,428,433,434,455,456,461,462,483,484,489,490,496,511,512,517,539,540,568],[-7.353983524188197E-5,-9.102738505589466E-5,-1.9467430546904298E-4,-2.0300642473486668E-4,-3.1476183314863995E-5,-6.842977602660743E-5,1.5883626898239883E-5,1.4023497091372047E-5,3.5432047524968605E-4,1.1443272898171087E-4,1.0016712383666666E-4,6.014109303795481E-4,2.840248179122762E-4,-1.1541084736508837E-4,3.85996886312906E-4,6.35019557424107E-4,-1.1506412384575676E-4,-1.5271865864986808E-4,2.804933808994214E-4,6.070117471191634E-4,-2.008459663247437E-4,-1.421075579290126E-4,2.739010341160883E-4,2.7730456244968115E-4,-9.838027027269332E-5,-3.808522443517704E-4,-2.5315198008555033E-4,2.7747714770754307E-4,-2.443619763919199E-4,-0.0015394744687597765,-2.3073328411331293E-4])\n"
          ]
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "Run a multinomial logistic regression:"
      ],
      "metadata": {
        "id": "J6DSqYlQYgC8"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "// We can also use the multinomial family for binary classification\n",
        "val mlr = new LogisticRegression()\n",
        "  .setMaxIter(10)\n",
        "  .setRegParam(0.3)\n",
        "  .setElasticNetParam(0.8)\n",
        "  .setFamily(\"multinomial\")\n",
        "\n",
        "val mlrModel = mlr.fit(training)"
      ],
      "metadata": {
        "id": "5I0YikbXWVqS"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "source": [
        "// Print the coefficients and intercepts for logistic regression with multinomial family\n",
        "println(s\"Multinomial intercepts: ${mlrModel.interceptVector}\")\n",
        "println(s\"Multinomial coefficients: ${mlrModel.coefficientMatrix}\")"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "yWoW1vYQWeJW",
        "outputId": "63014263-aa61-4c4b-b042-b04dae8bddcc"
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "Multinomial intercepts: [-0.12065879445860686,0.12065879445860686]\n",
            "Multinomial coefficients: 2 x 692 CSCMatrix\n",
            "(0,244) 4.290365458958277E-5\n",
            "(1,244) -4.290365458958294E-5\n",
            "(0,263) 6.488313287833108E-5\n",
            "(1,263) -6.488313287833092E-5\n",
            "(0,272) 1.2140666790834663E-4\n",
            "(1,272) -1.2140666790834657E-4\n",
            "(0,300) 1.3231861518665612E-4\n",
            "(1,300) -1.3231861518665607E-4\n",
            "(0,350) -6.775444746760509E-7\n",
            "(1,350) 6.775444746761932E-7\n",
            "(0,351) -4.899237909429297E-7\n",
            "(1,351) 4.899237909430322E-7\n",
            "(0,378) -3.5812102770679596E-5\n",
            "(1,378) 3.581210277067968E-5\n",
            "(0,379) -2.3539704331222065E-5\n",
            "(1,379) 2.353970433122204E-5\n",
            "(0,405) -1.90295199030314E-5\n",
            "(1,405) 1.90295199030314E-5\n",
            "(0,406) -5.626696935778909E-4\n",
            "(1,406) 5.626696935778912E-4\n",
            "(0,407) -5.121519619099504E-5\n",
            "(1,407) 5.1215196190995074E-5\n",
            "(0,428) 8.080614545413342E-5\n",
            "(1,428) -8.080614545413331E-5\n",
            "(0,433) -4.256734915330487E-5\n",
            "(1,433) 4.256734915330495E-5\n",
            "(0,434) -7.080191510151425E-4\n",
            "(1,434) 7.080191510151435E-4\n",
            "(0,455) 8.094482475733589E-5\n",
            "(1,455) -8.094482475733582E-5\n",
            "(0,456) 1.0433687128309833E-4\n",
            "(1,456) -1.0433687128309814E-4\n",
            "(0,461) -5.4466605046259246E-5\n",
            "(1,461) 5.4466605046259286E-5\n",
            "(0,462) -5.667133061990392E-4\n",
            "(1,462) 5.667133061990392E-4\n",
            "(0,483) 1.2495896045528374E-4\n",
            "(1,483) -1.249589604552838E-4\n",
            "(0,484) 9.810519424784944E-5\n",
            "(1,484) -9.810519424784941E-5\n",
            "(0,489) -4.88440907254626E-5\n",
            "(1,489) 4.8844090725462606E-5\n",
            "(0,490) -4.324392733454803E-5\n",
            "(1,490) 4.324392733454811E-5\n",
            "(0,496) 6.903351855620161E-5\n",
            "(1,496) -6.90335185562012E-5\n",
            "(0,511) 3.946505594172827E-4\n",
            "(1,511) -3.946505594172831E-4\n",
            "(0,512) 2.621745995919226E-4\n",
            "(1,512) -2.621745995919226E-4\n",
            "(0,517) -4.459475951170906E-5\n",
            "(1,517) 4.459475951170901E-5\n",
            "(0,539) 2.5417562428184555E-4\n",
            "(1,539) -2.5417562428184555E-4\n",
            "(0,540) 5.271781246228031E-4\n",
            "(1,540) -5.271781246228032E-4\n",
            "(0,568) 1.860255150352447E-4\n",
            "(1,568) -1.8602551503524485E-4\n"
          ]
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "Run a simple ML Pipeline example over a dummy natural-language dataset:"
      ],
      "metadata": {
        "id": "SN4c8ylcbFK0"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "import org.apache.spark.ml.{Pipeline, PipelineModel}\n",
        "import org.apache.spark.ml.classification.LogisticRegression\n",
        "import org.apache.spark.ml.feature.{HashingTF, Tokenizer}\n",
        "import org.apache.spark.ml.linalg.Vector\n",
        "import org.apache.spark.sql.Row\n",
        "\n",
        "// Prepare training documents from a list of (id, text, label) tuples.\n",
        "val training = spark.createDataFrame(Seq(\n",
        "  (0L, \"a b c d e spark\", 1.0),\n",
        "  (1L, \"b d\", 0.0),\n",
        "  (2L, \"spark f g h\", 1.0),\n",
        "  (3L, \"hadoop mapreduce\", 0.0)\n",
        ")).toDF(\"id\", \"text\", \"label\")\n",
        "\n",
        "// Configure an ML pipeline, which consists of three stages: tokenizer, hashingTF, and lr.\n",
        "val tokenizer = new Tokenizer()\n",
        "  .setInputCol(\"text\")\n",
        "  .setOutputCol(\"words\")\n",
        "val hashingTF = new HashingTF()\n",
        "  .setNumFeatures(1000)\n",
        "  .setInputCol(tokenizer.getOutputCol)\n",
        "  .setOutputCol(\"features\")\n",
        "val lr = new LogisticRegression()\n",
        "  .setMaxIter(10)\n",
        "  .setRegParam(0.001)\n",
        "val pipeline = new Pipeline()\n",
        "  .setStages(Array(tokenizer, hashingTF, lr))\n",
        "\n",
        "// Fit the pipeline to training documents.\n",
        "val model = pipeline.fit(training)"
      ],
      "metadata": {
        "id": "Nj6nTB1LZx3B"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "source": [
        "// Now we can optionally save the fitted pipeline to disk\n",
        "model.write.overwrite().save(\"/tmp/spark-logistic-regression-model\")\n",
        "\n",
        "// We can also save this unfit pipeline to disk\n",
        "pipeline.write.overwrite().save(\"/tmp/unfit-lr-model\")\n",
        "\n",
        "// And load it back in during production\n",
        "val sameModel = PipelineModel.load(\"/tmp/spark-logistic-regression-model\")\n",
        "\n",
        "// Prepare test documents, which are unlabeled (id, text) tuples.\n",
        "val test = spark.createDataFrame(Seq(\n",
        "  (4L, \"spark i j k\"),\n",
        "  (5L, \"l m n\"),\n",
        "  (6L, \"spark hadoop spark\"),\n",
        "  (7L, \"apache hadoop\")\n",
        ")).toDF(\"id\", \"text\")\n",
        "\n",
        "// Make predictions on test documents.\n",
        "model.transform(test)\n",
        "  .select(\"id\", \"text\", \"probability\", \"prediction\")\n",
        "  .collect()\n",
        "  .foreach { case Row(id: Long, text: String, prob: Vector, prediction: Double) =>\n",
        "    println(s\"($id, $text) --> prob=$prob, prediction=$prediction\")\n",
        "  }"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 781
        },
        "id": "rrIgES8LZ-MS",
        "outputId": "773ad554-751e-4124-bb50-4922a08ff134"
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "display_data",
          "data": {
            "text/html": [
              "<div>\n",
              "  <span style=\"float: left; word-wrap: normal; white-space: nowrap; text-align: center\">runJob at SparkHadoopWriter.scala:78</span>\n",
              "  <span style=\"float: right; word-wrap: normal; white-space: nowrap; text-align: center\"><a href=\"#\" onclick=\"cancelStage(150);\">(kill)</a></span>\n",
              "</div>\n",
              "<br>\n"
            ]
          },
          "metadata": {}
        },
        {
          "output_type": "display_data",
          "data": {
            "text/html": [
              "<div class=\"progress\">\n",
              "  <div class=\"progress-bar bg-success\" role=\"progressbar\" style=\"width: 0%; word-wrap: normal; white-space: nowrap; text-align: center; color: white\" aria-valuenow=\"0\" aria-valuemin=\"0\" aria-valuemax=\"100\">\n",
              "    0 / 1\n",
              "  </div>\n",
              "</div>\n"
            ]
          },
          "metadata": {}
        },
        {
          "output_type": "display_data",
          "data": {
            "text/html": [
              "<div>\n",
              "  <span style=\"float: left; word-wrap: normal; white-space: nowrap; text-align: center\">runJob at SparkHadoopWriter.scala:78</span>\n",
              "  <span style=\"float: right; word-wrap: normal; white-space: nowrap; text-align: center\"><a href=\"#\" onclick=\"cancelStage(151);\">(kill)</a></span>\n",
              "</div>\n",
              "<br>\n"
            ]
          },
          "metadata": {}
        },
        {
          "output_type": "display_data",
          "data": {
            "text/html": [
              "<div class=\"progress\">\n",
              "  <div class=\"progress-bar bg-success\" role=\"progressbar\" style=\"width: 0%; word-wrap: normal; white-space: nowrap; text-align: center; color: white\" aria-valuenow=\"0\" aria-valuemin=\"0\" aria-valuemax=\"100\">\n",
              "    0 / 1\n",
              "  </div>\n",
              "</div>\n"
            ]
          },
          "metadata": {}
        },
        {
          "output_type": "display_data",
          "data": {
            "text/html": [
              "<div>\n",
              "  <span style=\"float: left; word-wrap: normal; white-space: nowrap; text-align: center\">runJob at SparkHadoopWriter.scala:78</span>\n",
              "  <span style=\"float: right; word-wrap: normal; white-space: nowrap; text-align: center\"><a href=\"#\" onclick=\"cancelStage(152);\">(kill)</a></span>\n",
              "</div>\n",
              "<br>\n"
            ]
          },
          "metadata": {}
        },
        {
          "output_type": "display_data",
          "data": {
            "text/html": [
              "<div class=\"progress\">\n",
              "  <div class=\"progress-bar bg-success\" role=\"progressbar\" style=\"width: 0%; word-wrap: normal; white-space: nowrap; text-align: center; color: white\" aria-valuenow=\"0\" aria-valuemin=\"0\" aria-valuemax=\"100\">\n",
              "    0 / 1\n",
              "  </div>\n",
              "</div>\n"
            ]
          },
          "metadata": {}
        },
        {
          "output_type": "display_data",
          "data": {
            "text/html": [
              "<div>\n",
              "  <span style=\"float: left; word-wrap: normal; white-space: nowrap; text-align: center\">runJob at SparkHadoopWriter.scala:78</span>\n",
              "  <span style=\"float: right; word-wrap: normal; white-space: nowrap; text-align: center\"><a href=\"#\" onclick=\"cancelStage(153);\">(kill)</a></span>\n",
              "</div>\n",
              "<br>\n"
            ]
          },
          "metadata": {}
        },
        {
          "output_type": "display_data",
          "data": {
            "text/html": [
              "<div class=\"progress\">\n",
              "  <div class=\"progress-bar bg-success\" role=\"progressbar\" style=\"width: 0%; word-wrap: normal; white-space: nowrap; text-align: center; color: white\" aria-valuenow=\"0\" aria-valuemin=\"0\" aria-valuemax=\"100\">\n",
              "    0 / 1\n",
              "  </div>\n",
              "</div>\n"
            ]
          },
          "metadata": {}
        },
        {
          "output_type": "display_data",
          "data": {
            "text/html": [
              "<div>\n",
              "  <span style=\"float: left; word-wrap: normal; white-space: nowrap; text-align: center\">parquet at LogisticRegression.scala:1241</span>\n",
              "  <span style=\"float: right; word-wrap: normal; white-space: nowrap; text-align: center\"><a href=\"#\" onclick=\"cancelStage(154);\">(kill)</a></span>\n",
              "</div>\n",
              "<br>\n"
            ]
          },
          "metadata": {}
        },
        {
          "output_type": "display_data",
          "data": {
            "text/html": [
              "<div class=\"progress\">\n",
              "  <div class=\"progress-bar bg-success\" role=\"progressbar\" style=\"width: 0%; word-wrap: normal; white-space: nowrap; text-align: center; color: white\" aria-valuenow=\"0\" aria-valuemin=\"0\" aria-valuemax=\"100\">\n",
              "    0 / 1\n",
              "  </div>\n",
              "</div>\n"
            ]
          },
          "metadata": {}
        },
        {
          "output_type": "display_data",
          "data": {
            "text/html": [
              "<div>\n",
              "  <span style=\"float: left; word-wrap: normal; white-space: nowrap; text-align: center\">parquet at LogisticRegression.scala:1241</span>\n",
              "  <span style=\"float: right; word-wrap: normal; white-space: nowrap; text-align: center\"><a href=\"#\" onclick=\"cancelStage(155);\">(kill)</a></span>\n",
              "</div>\n",
              "<br>\n"
            ]
          },
          "metadata": {}
        },
        {
          "output_type": "display_data",
          "data": {
            "text/html": [
              "<div class=\"progress\">\n",
              "  <div class=\"progress-bar bg-success\" role=\"progressbar\" style=\"width: 0%; word-wrap: normal; white-space: nowrap; text-align: center; color: white\" aria-valuenow=\"0\" aria-valuemin=\"0\" aria-valuemax=\"100\">\n",
              "    0 / 1\n",
              "  </div>\n",
              "</div>\n"
            ]
          },
          "metadata": {}
        },
        {
          "output_type": "display_data",
          "data": {
            "text/html": [
              "<div>\n",
              "  <span style=\"float: left; word-wrap: normal; white-space: nowrap; text-align: center\">runJob at SparkHadoopWriter.scala:78</span>\n",
              "  <span style=\"float: right; word-wrap: normal; white-space: nowrap; text-align: center\"><a href=\"#\" onclick=\"cancelStage(156);\">(kill)</a></span>\n",
              "</div>\n",
              "<br>\n"
            ]
          },
          "metadata": {}
        },
        {
          "output_type": "display_data",
          "data": {
            "text/html": [
              "<div class=\"progress\">\n",
              "  <div class=\"progress-bar bg-success\" role=\"progressbar\" style=\"width: 0%; word-wrap: normal; white-space: nowrap; text-align: center; color: white\" aria-valuenow=\"0\" aria-valuemin=\"0\" aria-valuemax=\"100\">\n",
              "    0 / 1\n",
              "  </div>\n",
              "</div>\n"
            ]
          },
          "metadata": {}
        },
        {
          "output_type": "display_data",
          "data": {
            "text/html": [
              "<div>\n",
              "  <span style=\"float: left; word-wrap: normal; white-space: nowrap; text-align: center\">runJob at SparkHadoopWriter.scala:78</span>\n",
              "  <span style=\"float: right; word-wrap: normal; white-space: nowrap; text-align: center\"><a href=\"#\" onclick=\"cancelStage(157);\">(kill)</a></span>\n",
              "</div>\n",
              "<br>\n"
            ]
          },
          "metadata": {}
        },
        {
          "output_type": "display_data",
          "data": {
            "text/html": [
              "<div class=\"progress\">\n",
              "  <div class=\"progress-bar bg-success\" role=\"progressbar\" style=\"width: 0%; word-wrap: normal; white-space: nowrap; text-align: center; color: white\" aria-valuenow=\"0\" aria-valuemin=\"0\" aria-valuemax=\"100\">\n",
              "    0 / 1\n",
              "  </div>\n",
              "</div>\n"
            ]
          },
          "metadata": {}
        },
        {
          "output_type": "display_data",
          "data": {
            "text/html": [
              "<div>\n",
              "  <span style=\"float: left; word-wrap: normal; white-space: nowrap; text-align: center\">runJob at SparkHadoopWriter.scala:78</span>\n",
              "  <span style=\"float: right; word-wrap: normal; white-space: nowrap; text-align: center\"><a href=\"#\" onclick=\"cancelStage(158);\">(kill)</a></span>\n",
              "</div>\n",
              "<br>\n"
            ]
          },
          "metadata": {}
        },
        {
          "output_type": "display_data",
          "data": {
            "text/html": [
              "<div class=\"progress\">\n",
              "  <div class=\"progress-bar bg-success\" role=\"progressbar\" style=\"width: 0%; word-wrap: normal; white-space: nowrap; text-align: center; color: white\" aria-valuenow=\"0\" aria-valuemin=\"0\" aria-valuemax=\"100\">\n",
              "    0 / 1\n",
              "  </div>\n",
              "</div>\n"
            ]
          },
          "metadata": {}
        },
        {
          "output_type": "display_data",
          "data": {
            "text/html": [
              "<div>\n",
              "  <span style=\"float: left; word-wrap: normal; white-space: nowrap; text-align: center\">runJob at SparkHadoopWriter.scala:78</span>\n",
              "  <span style=\"float: right; word-wrap: normal; white-space: nowrap; text-align: center\"><a href=\"#\" onclick=\"cancelStage(159);\">(kill)</a></span>\n",
              "</div>\n",
              "<br>\n"
            ]
          },
          "metadata": {}
        },
        {
          "output_type": "display_data",
          "data": {
            "text/html": [
              "<div class=\"progress\">\n",
              "  <div class=\"progress-bar bg-success\" role=\"progressbar\" style=\"width: 0%; word-wrap: normal; white-space: nowrap; text-align: center; color: white\" aria-valuenow=\"0\" aria-valuemin=\"0\" aria-valuemax=\"100\">\n",
              "    0 / 1\n",
              "  </div>\n",
              "</div>\n"
            ]
          },
          "metadata": {}
        },
        {
          "output_type": "display_data",
          "data": {
            "text/html": [
              "<div>\n",
              "  <span style=\"float: left; word-wrap: normal; white-space: nowrap; text-align: center\">first at ReadWrite.scala:615</span>\n",
              "  <span style=\"float: right; word-wrap: normal; white-space: nowrap; text-align: center\"><a href=\"#\" onclick=\"cancelStage(160);\">(kill)</a></span>\n",
              "</div>\n",
              "<br>\n"
            ]
          },
          "metadata": {}
        },
        {
          "output_type": "display_data",
          "data": {
            "text/html": [
              "<div class=\"progress\">\n",
              "  <div class=\"progress-bar bg-success\" role=\"progressbar\" style=\"width: 0%; word-wrap: normal; white-space: nowrap; text-align: center; color: white\" aria-valuenow=\"0\" aria-valuemin=\"0\" aria-valuemax=\"100\">\n",
              "    0 / 1\n",
              "  </div>\n",
              "</div>\n"
            ]
          },
          "metadata": {}
        },
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "(4, spark i j k) --> prob=[0.15964077387874118,0.8403592261212589], prediction=1.0\n",
            "(5, l m n) --> prob=[0.8378325685476612,0.16216743145233875], prediction=0.0\n",
            "(6, spark hadoop spark) --> prob=[0.06926633132976273,0.9307336686702373], prediction=1.0\n",
            "(7, apache hadoop) --> prob=[0.9821575333444208,0.01784246665557917], prediction=0.0\n"
          ]
        },
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "\u001b[36msameModel\u001b[39m: \u001b[32mPipelineModel\u001b[39m = pipeline_33376d963408\n",
              "\u001b[36mtest\u001b[39m: \u001b[32mDataFrame\u001b[39m = [id: bigint, text: string]"
            ]
          },
          "metadata": {},
          "execution_count": 23
        }
      ]
    }
  ]
 }