This is part of Google's annual summer program, which allows people to be involved in open-source development. Before joining, I had one experience with open-source with MindsDB, where I integrated spaCy (Python NLP package) into their system as part of Hacktoberfest 2023. Eventually, I became one of the winners, motivating me to stay in open-source!
I then remembered that my friend back in college told me about Google Summer of Code! It could be a fantastic experience. Therefore, I decided to join the program! I mainly searched for projects that were AI-focused, but there were few. However, luckily, I bumped into Joplin, and they had an idea to create a summariser for notes in their note-taking app! It instantly caught my attention, and I decided that this was the project I wanted to spend my summer on.
I proposed using LLMs with Transformers.js and TextRank to summarise all notes and notebooks and highlight content in notes in the text editor. From the discussion with the community during the competition period, I decided to implement the feature in a plugin rather than in the core application to minimize risks, ensure modularity and isolation, and give users a free choice whether they want to use the AI feature.
The project aims to create note summaries to help users synthesize main ideas and arguments to identify salient points. This means that users will have a clear idea of what the note is about in a short piece of text with less mental effort.
- Assist in processing notes to improve efficiency: Distill critical information from notes, highlight key ideas and quickly skim notes.
- Classify or cluster notes by their contents: Summarize key concepts from notes and use them to group similar notes. This could be used for tagging notes.
- Distill important information from long notes to empower solutions such as search, question, and answer.
There are two main types of summarization: extractive and abstractive.
- Extractive summarisation: This method takes sentences directly from the original note, depending on their importance. The summary obtained contains exact sentences from the original text.
- Abstractive summarisation: Abstractive summarization is closer to what a human usually does: comprehend the text, compare it with memory and related information, and then re-create its core in a brief text.
Abstractive summarization tends to be more computationally expensive since you must utilize neural networks and generative systems. On the other hand, extractive summarization does not require the use of deep learning and data labeling [1].
I started my coding period by researching NLP and implementing unsupervised machine learning methods (TextRank, LexRank, LSA, and KMeans Clustering) for extractive summarisation. Before applying any methods we need to preprocess the note content and do vectorisation. This is the usual flow:
- Getting the note body from Joplin Plugin API
- Tokenize sentences with natural: create an array of sentences from the note
- Perform "vectorization"
  - Understanding vectorization: in simple terms, it is a way to convert sentences into vector forms so that we can perform various algorithms. For example, in LSA, we create sentence vectors to form a matrix and then perform SVD to discover the most important dimensions by getting the diagonal matrix.
  - With those dimensions, we can determine which sentences are the most important: "Salient and recurring word patterns are likely to be captured and represented by a singular vector. The corresponding eigen value indicates the degree of importance of the pattern. Sentences containing this pattern will be projected along this vector and the sentence that best represents this pattern will have the largest component along this vector" [8].
  - Vectorization methods:
    - Binary Matrix -> converting sentences into binary vectors
    - TF-IDF -> convert sentences based on the frequency and importance of the words in the sentence
    - Word2Vec -> create word embeddings [3] - good for finding out semantic relationships between words
- Apply unsupervised machine learning algorithms to vectors.
  - KMeans Clustering example:
    - [STEP 1] Select random k (those will be centroids) →
    - [STEP 2] Create k clusters and start clustering sentence vectors →
    - [STEP 3] Run until convergence →
    - [STEP 4] The most important sentences will be closest to centroids →
    - [STEP 5] Take either k sentences or m sentences that are closest to k centroids to include them in the final summary
- You take n sentences from the result
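The flow above can be sketched in a few dozen lines of TypeScript. This is only an illustrative sketch, not the plugin's actual code: the function names are hypothetical, the sentence splitter is a naive regex standing in for `natural`, the IDF uses a smoothed variant, and the centroids are seeded deterministically (evenly spaced sentences) instead of randomly so that the result is reproducible.

```typescript
// Naive sentence splitter (assumption: stands in for the `natural` tokenizer).
function splitSentences(text: string): string[] {
  const parts: string[] = text.match(/[^.!?]+[.!?]*/g) || [];
  return parts.map((s) => s.trim()).filter((s) => s.length > 0);
}

function tokenize(sentence: string): string[] {
  return sentence.toLowerCase().match(/[a-z0-9]+/g) || [];
}

// One TF-IDF vector per sentence, treating each sentence as a "document".
function tfidfVectors(sentences: string[]): number[][] {
  const docs = sentences.map(tokenize);
  const df = new Map<string, number>(); // in how many sentences each term appears
  for (const doc of docs) {
    for (const term of Array.from(new Set(doc))) df.set(term, (df.get(term) || 0) + 1);
  }
  const vocab = Array.from(df.keys());
  return docs.map((doc) =>
    vocab.map((term) => {
      const tf = doc.filter((t) => t === term).length / Math.max(doc.length, 1);
      const idf = Math.log((1 + docs.length) / (1 + (df.get(term) || 0))) + 1; // smoothed IDF
      return tf * idf;
    })
  );
}

function squaredDistance(a: number[], b: number[]): number {
  let sum = 0;
  for (let i = 0; i < a.length; i++) sum += (a[i] - b[i]) * (a[i] - b[i]);
  return sum;
}

function nearestIndex(v: number[], centroids: number[][]): number {
  let best = 0;
  for (let c = 1; c < centroids.length; c++) {
    if (squaredDistance(v, centroids[c]) < squaredDistance(v, centroids[best])) best = c;
  }
  return best;
}

// Cluster sentence vectors and return the sentence closest to each centroid,
// in the order the sentences appear in the note.
function kmeansSummary(text: string, k: number, maxIterations = 50): string[] {
  const sentences = splitSentences(text);
  const clusters = Math.min(k, sentences.length);
  if (clusters === 0) return [];
  const vectors = tfidfVectors(sentences);

  // STEP 1 (simplified): evenly spaced sentences as initial centroids instead of random ones.
  let centroids = Array.from({ length: clusters }, (_, c) =>
    vectors[Math.floor((c * sentences.length) / clusters)].slice()
  );
  let assignment: number[] = new Array<number>(sentences.length).fill(-1);

  // STEP 2-3: assign sentences to clusters and recompute centroids until nothing changes.
  for (let iteration = 0; iteration < maxIterations; iteration++) {
    const next = vectors.map((v) => nearestIndex(v, centroids));
    const changed = next.some((c, i) => c !== assignment[i]);
    assignment = next;
    if (!changed) break;
    centroids = centroids.map((old, c) => {
      const members = vectors.filter((_, i) => assignment[i] === c);
      if (members.length === 0) return old; // keep an empty cluster's centroid as-is
      return old.map((_, d) => members.reduce((s, m) => s + m[d], 0) / members.length);
    });
  }

  // STEP 4-5: pick the sentence closest to each centroid.
  const chosen = new Set<number>();
  centroids.forEach((centroid, c) => {
    let best = -1;
    let bestDistance = Infinity;
    vectors.forEach((v, i) => {
      if (assignment[i] !== c) return;
      const d = squaredDistance(v, centroid);
      if (d < bestDistance) {
        bestDistance = d;
        best = i;
      }
    });
    if (best >= 0) chosen.add(best);
  });
  return Array.from(chosen).sort((a, b) => a - b).map((i) => sentences[i]);
}
```

Calling `kmeansSummary(noteBody, 3)` would return up to three sentences, one per cluster. A real implementation would probably use random or k-means++ seeding and stop-word removal, which are left out here to keep the sketch short.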
Algorithms | Description | Weakness | Link |
---|---|---|---|
TextRank | TextRank is a graph-based ranking algorithm inspired by PageRank. It connects words or sentences based on how frequently they appear near each other in the text and uses the number of shared words between sentences to establish similarity. | May not capture complex relationships between sentences accurately. | https://blogs.cornell.edu/info2040/2018/10/22/40068/ |
LexRank | LexRank is similar to TextRank but uses cosine similarity of TF-IDF vectors (sentence vectors) and is more tailored towards the extraction of information from multiple texts written about the same topic. | The algorithm may not perform well on a set of unclustered/unrelated documents | http://www.cs.cmu.edu/afs/cs/project/jair/pub/volume22/erkan04a-html/erkan04a.html |
LSA | LSA creates a term-sentence matrix (frequency of words within sentences of the document), then applies SVD (Singular Value Decomposition) to learn about relationships between words and sentences. | Struggles with polysemy and synonyms | Latent Semantic Analysis |
KMeans Clustering | KMeans Clustering groups sentence vectors into different clusters. Sentence vectors that are closest to cluster centroids are included in summaries. | Figuring out the best pre-defined k value for training | KMeans Clustering with TF-IDF and KMeans Clustering with word2vec |
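
To make the graph-based idea in the table more concrete, here is a minimal TextRank-style sketch in TypeScript. It is an illustration under assumptions, not the plugin's implementation: the function names are hypothetical, similarity is the shared-word overlap normalised by the log of the sentence lengths, and the damping factor of 0.85 and the fixed number of iterations are conventional choices.

```typescript
function words(sentence: string): string[] {
  return sentence.toLowerCase().match(/[a-z0-9]+/g) || [];
}

// Number of distinct shared words, normalised by the log of both sentence lengths.
function overlapSimilarity(a: string[], b: string[]): number {
  const inB = new Set(b);
  const shared = new Set(a.filter((w) => inB.has(w))).size;
  const norm = Math.log(a.length) + Math.log(b.length);
  return norm > 0 ? shared / norm : 0;
}

// PageRank-style power iteration over the weighted sentence graph.
function textRankScores(sentences: string[], damping = 0.85, iterations = 50): number[] {
  const n = sentences.length;
  const tokens = sentences.map(words);
  const weights = tokens.map((a, i) =>
    tokens.map((b, j) => (i === j ? 0 : overlapSimilarity(a, b)))
  );
  const outSum = weights.map((row) => row.reduce((s, w) => s + w, 0));
  let scores = new Array<number>(n).fill(1);
  for (let it = 0; it < iterations; it++) {
    // Each sentence receives a share of its neighbours' scores, weighted by similarity.
    scores = scores.map((_, i) => {
      let rank = 0;
      for (let j = 0; j < n; j++) {
        if (outSum[j] > 0) rank += (weights[j][i] / outSum[j]) * scores[j];
      }
      return 1 - damping + damping * rank;
    });
  }
  return scores;
}

// Keep the `count` highest-scoring sentences, restored to their original order.
function topSentences(sentences: string[], count: number): string[] {
  const scores = textRankScores(sentences);
  return sentences
    .map((text, index) => ({ text, index, score: scores[index] }))
    .sort((a, b) => b.score - a.score)
    .slice(0, count)
    .sort((a, b) => a.index - b.index)
    .map((entry) => entry.text);
}
```

A sentence that shares no words with the rest of the note ends up with the minimum score, so it is the first to be dropped from the summary. Swapping `overlapSimilarity` for the cosine similarity of TF-IDF vectors would move this sketch towards LexRank.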
ONNX Runtime is a cross-platform machine-learning model accelerator with a flexible interface to integrate hardware-specific libraries. It can be used with models from PyTorch, TensorFlow/Keras, TFLite, scikit-learn, and other frameworks. ONNX Runtime Web enables us to run ML models in web browsers.
Transformers.js is a new state-of-the-art machine-learning library by HuggingFace for the web. With this library, we can run pre-trained transformer models or our custom ML models in browsers! We do not have to do custom training for abstractive summarization since we will use their pre-trained models tailored for summarization. I will test and benchmark multiple models based on summary quality and inference time. The best model I found so far is Google/flan-t5-small (~60 MB); in the future, I would like to use Google/flan-t5-base (~200 MB) instead since it performs much better, although it has a longer inference time and takes more memory. If it cannot be uploaded to NPM, I will need to download Google/flan-t5-base when users install the plugin.
ML model | Link |
---|---|
facebook/bart-large-cnn | https://huggingface.co/facebook/bart-large-cnn |
sshleifer/distilbart-cnn-12-6 | https://huggingface.co/sshleifer/distilbart-cnn-12-6 |
google/pegasus-xsum | https://huggingface.co/google/pegasus-xsum |
Google/flan-t5-small | https://huggingface.co/google/flan-t5-small |
MBZUAI/LaMini-Flan-T5-248M | https://huggingface.co/MBZUAI/LaMini-Flan-T5-248M |
sshleifer/distilbart-xsum-6-6 | https://huggingface.co/sshleifer/distilbart-xsum-6-6 |
I initially had problems making Transformers.js work in the Joplin plugin. The main issue was that, when bundled with Webpack, it needed node-loader. When running the app, node-loader could not find the .node files in the dist folder. To avoid depending on node-loader, a mentor (Laurent Cozic) recommended using web workers instead, since they had run into a similar problem with Tesseract.js when dealing with .wasm in a plugin and solved it by running a web worker and loading the package in the app. Therefore, I created a tech spec for a generic web worker that future contributors and developers could use. It would bring benefits such as running computations in the background, keeping the application's main thread from being blocked, and running packages that have problems when bundled with Webpack.
That proved to be more effort than expected and out of the project's scope. However, luckily, another member of the community suggested a way to run it in a plugin: downloading and loading the ONNX .wasm files locally, running the code in a web environment, and setting this up in webpack.config.js.
With a few more changes, I made it run in the plugin! The disadvantage of this approach is that it cannot handle some cases, and developers have to be experienced in Webpack and set the configuration in webpack.config.js themselves. Furthermore, if developers update webpack.config.js, all of these configurations will be lost. Therefore, the generic web worker would be more desirable but more challenging to implement.
I created a survey to find out which unsupervised machine learning algorithm performed the best in terms of the quality and length of the summaries. It did not get much engagement, and I struggled to choose the best one.
Later, I came up with the idea to let users craft summaries! Basically, there are options to choose the algorithm and the length of summaries. Once the algorithm outputs a summary, users can edit it in the text area. When users save the summary, the plugin either redirects them to the summary note details page in the panel or places the summary at the top of the current note. In the panel, the summaries are exported into the TipTap editor, where users can freely edit and style the text too! The plugin allows users to control and craft summaries, which I think is pretty cool and unique!
joplin-plugin-ai-summarisation-panel.mp4
joplin-plugin-ai-summarisation-context-menu.mp4
flowchart LR
A[Opening Joplin]-.-> B[Using the Panel]
A[Opening Joplin]-.-> C[Using Context Menus]
C -.-> D[Click on the Notebook]
C -.-> E[Click on the Note]
E -.-> F[Right-click on the note]
E -.-> G[Highlight multiple text in the note]
F -.-> H[Summarize the note]
G -.-> I[Right-click on the text]
I -.-> J[Summarize the highlighted text]
B -.-> K[Click on the note in the notebook tree]
K -.-> L[Edit the summary, configure length and choose different algorithms]
L -.-> M[Click save]
M -.-> N[edit, change font-weight, etc.]
D -.-> O[Right-click on the notebook]
O -.-> P[Summarize the notebook]
There are still things that could be added, such as editing the summary title, getting back to crafting configuration for already summarised notes, and more.
A nice member of the community gave me some feedback and suggestions for improvement. The feedback can be found here.
Using word2vec might enhance the quality of summaries in unsupervised machine learning methods [4] since it captures semantic relationships between words, unlike TF-IDF, which is based only on the frequency and importance of the words. Instead of KMeans Clustering, there could be an option to use hierarchical clustering. The advantage is that we do not have to define the k value ourselves; the algorithm determines the number of clusters. Some members of the community recommended using HDBSCAN. You can find the discussion here.
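
As a rough illustration of why embeddings can help, the sketch below averages word vectors into a sentence vector and compares sentences with cosine similarity. Everything in it is an assumption made for the example: the three-dimensional embedding table is invented (real word2vec vectors are learned from a corpus and typically have a few hundred dimensions), and averaging is just one simple way to build a sentence vector.

```typescript
// Hypothetical toy embeddings; related words are given nearby vectors by hand.
const toyEmbeddings = new Map<string, number[]>([
  ["cat", [0.9, 0.1, 0.0]],
  ["kitten", [0.85, 0.15, 0.05]],
  ["sleeps", [0.1, 0.9, 0.1]],
  ["naps", [0.15, 0.85, 0.1]],
  ["stocks", [0.0, 0.1, 0.9]],
  ["fell", [0.1, 0.0, 0.8]],
]);

// Sentence vector = mean of the vectors of the words found in the table.
function sentenceVector(sentence: string, table: Map<string, number[]>, dims: number): number[] {
  const tokens: string[] = sentence.toLowerCase().match(/[a-z0-9]+/g) || [];
  const known = tokens
    .map((t) => table.get(t))
    .filter((v): v is number[] => v !== undefined);
  const sum = new Array<number>(dims).fill(0);
  for (const v of known) v.forEach((x, d) => (sum[d] += x));
  return known.length > 0 ? sum.map((x) => x / known.length) : sum;
}

function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return normA > 0 && normB > 0 ? dot / Math.sqrt(normA * normB) : 0;
}

const catSleeps = sentenceVector("The cat sleeps.", toyEmbeddings, 3);
const kittenNaps = sentenceVector("A kitten naps.", toyEmbeddings, 3);
const stocksFell = sentenceVector("Stocks fell.", toyEmbeddings, 3);
const similarPair = cosineSimilarity(catSleeps, kittenNaps);
const unrelatedPair = cosineSimilarity(catSleeps, stocksFell);
```

"The cat sleeps." and "A kitten naps." share no words, so a TF-IDF comparison would treat them as unrelated, while the averaged embeddings place them close together and far from "Stocks fell.".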
Another enhancement could be to apply dimensionality reduction on sentence embeddings from word2vec. That will help us tackle the curse of dimensionality: the higher the dimensionality of our data, the sparser the sentence vectors are in the space. Furthermore, in high dimensions, the difference in distances between data points tends to become negligible, making measures like Euclidean distance less meaningful. We can use dimensionality reduction techniques such as UMAP or t-SNE.
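
The distance effect described above is easy to observe numerically. The sketch below is only a demonstration under assumptions: it uses uniformly random points rather than real sentence embeddings, a small deterministic pseudo-random generator so that runs are reproducible, and hypothetical function names. It measures the relative contrast, (farthest - nearest) / nearest, of Euclidean distances from one query point to a set of random points.

```typescript
// Small deterministic linear congruential generator, returning values in [0, 1).
function makeRandom(seed: number): () => number {
  let state = seed >>> 0;
  return () => {
    state = (state * 1664525 + 1013904223) % 4294967296;
    return state / 4294967296;
  };
}

// Relative contrast of distances from one random query point to `count`
// random points in the unit hypercube of the given dimension.
function relativeContrast(dimensions: number, count: number, seed = 42): number {
  const random = makeRandom(seed);
  const randomPoint = () => Array.from({ length: dimensions }, () => random());
  const query = randomPoint();
  let nearest = Infinity;
  let farthest = 0;
  for (let i = 0; i < count; i++) {
    const p = randomPoint();
    const distance = Math.sqrt(
      p.reduce((sum, x, j) => sum + (x - query[j]) * (x - query[j]), 0)
    );
    nearest = Math.min(nearest, distance);
    farthest = Math.max(farthest, distance);
  }
  return (farthest - nearest) / nearest;
}

// Print the contrast for a few dimensions to see how it shrinks.
for (const dims of [2, 10, 100, 1000]) {
  console.log(dims, relativeContrast(dims, 200).toFixed(2));
}
```

As the number of dimensions grows, the nearest and farthest points end up at almost the same distance from the query, which is why clustering on raw high-dimensional vectors can become unreliable and why reducing the dimensionality first may help.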
In some cases, small LLMs do not perform well on some texts, especially news articles. To improve this, we can fine-tune the model on a dataset tailored towards automatic text summarisation, such as CNN/Daily Mail.
To make the inference faster, there is a possibility of using WebGPU, which can be used with either ONNX Runtime or Transformers.js [11].
As explained before, this new feature in the Plugin API would allow users to create new web workers that run in the core application instead of the plugin. For more details, you can go here.
I would like to first thank Joplin's mentors and community who embarked on this journey with me. It has been a truly amazing experience for me!! I am very grateful for that, and I hope people (including you who are reading this!) are enjoying the new plugin!!
It was really fun to dive into the world of NLP!! However, I really struggled with the AI ecosystem's limitations in Javascript. Still, it forced me to implement the algorithms from scratch, which was really cool! It was also cool to imagine and apply some linear algebra concepts! Normally, I would have used scikit-learn or other machine-learning packages if they existed in Javascript. Having those packages would be very beneficial if we want to have the most optimal unsupervised algorithms. For example, I wanted to do co-reference resolution, which means we replace pronouns with the nouns they refer to in the following sentences (Spiderman is cool. He can fly! -> Spiderman is cool. Spiderman can fly!). That would strengthen the connection between sentences. One way is to use Hobbs' algorithm. Still, the disadvantage is that it is heuristics-based, so it lacks understanding of semantic connections between sentences (the distance between sentences is also a problem). Using neural networks would be much better as they cover more cases and reduce false positives (e.g., spaCy coref)!
However, the future of the Javascript/Typescript ecosystem looks bright. Apart from ONNX Runtime and Transformers.js, there is, for example, Pyodide, which allows us to run Python packages in the browser. We could utilize sklearn and other scientific libraries, which would be very useful and would let us focus only on implementing theoretical solutions.
- WebAssembly (WASM) allows AI models and algorithms to be run on the web. It is important to understand and use it. "It provides a way to run code written in multiple languages on the web at near-native speed, with client apps running on the web that previously couldn't have done so." [12]
- Before starting to implement, it is useful to create a tech spec to provide a comprehensive overview of the system and enable other engineers to give more meaningful input.
- While there are open-source AI libraries and packages in Javascript/Typescript, they often don't match the exact implementation of the algorithms in original research papers. This gap underscores the need for more AI engineers in the Javascript/Typescript ecosystem. However, implementing algorithms from scratch gives you more control and ways of changing and improving the model.
- The community is everything in open-source; you will meet different people from all walks of life with different (technical and human) experiences. They are essential for making the application/features the best they can be.
- I understand now why some people advocate for "Release early, release often." Based on all the feedback, the application/feature improves with each release.
Technologies: Javascript/Typescript, Webpack, natural, onnxruntime-web, transformers.js, mathjs, .wasm, React, styled components, ChakraUI, TipTap, jest
Anyway, this is not the end!! I will still be here after the program since the plugin could be improved. I think it is nice that, in the end, we allow users to edit summaries and add their notes in the Tiptap editor to their actual notes! Furthermore, I definitely want to implement a new plugin API with web workers that would be easy for future contributors to create workers!
I’m happy to be the GSoC contributor for 2024 for Joplin!
Plugin GitHub repository: https://github.com/joplin/plugin-ai-summarisation
Plugin in Joplin's Plugin Repository: https://joplinapp.org/plugins/plugin/org.joplinapp.plugins.AISummarisation/
Google Summer of Code 2024: https://summerofcode.withgoogle.com/programs/2024/projects/Ble8LKDb
Weekly Updates: https://discourse.joplinapp.org/c/gsoc-projects/summarize-ai/35
Overview of Unsupervised Method for Extractive Summarisation: https://discourse.joplinapp.org/t/overview-of-unsupervised-methods-for-extractive-summarization/38529
Tech Spec: Generic Web Worker in Joplin: https://discourse.joplinapp.org/t/tech-spec-generic-web-worker-in-joplin/39490
[1] IBM - Text Summarization: https://www.ibm.com/topics/text-summarization
[2] Automatic Text Summarization Methods: https://arxiv.org/abs/2204.01849
[3] Word Embeddings: https://www.ibm.com/topics/word-embeddings#:~:text=Word%20embeddings%20capture%20semantic%20relationships,more%20nuanced%20understanding%20of%20language.
[4] Automatic Text Summarization for Public Health WeChat Official Accounts Platform Base on Improved TextRank: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9300291/#:~:text=The%20automatic%20text%20summarization%20based,model%20significantly%20improved%20extraction%20accuracy.
[5] Curse of Dimensionality: https://www.datacamp.com/blog/curse-of-dimensionality-machine-learning?dc_referrer=https%3A%2F%2Fwww.google.com%2F
[6] TextRank: A Graph-Based NLP Algorithm: https://blogs.cornell.edu/info2040/2018/10/22/40068/
[7] LexRank: Graph-based Lexical Centrality as Salience in Text Summarization: https://www.cs.cmu.edu/afs/cs/project/jair/pub/volume22/erkan04a-html/erkan04a.html
[8] Extractive Text Summarization using Latent Semantic Analysis: https://aishwaryap.github.io/iitm/NLPProject.pdf
[9] Extractive based Text Summarization Using KMeans and TF-IDF: https://www.researchgate.net/publication/333081743_Extractive_based_Text_Summarization_Using_KMeans_and_TF-IDF
[10] Understanding Text Summarization using K-means Clustering: https://medium.com/@akankshagupta371/understanding-text-summarization-using-k-means-clustering-6487d5d37255
[11] Use WebGPU + ONNX Runtime Web + Transformer.js to build RAG applications by Phi-3-mini: https://techcommunity.microsoft.com/t5/educator-developer-blog/use-webgpu-onnx-runtime-web-transformer-js-to-build-rag/ba-p/4190968
[12] WebAssembly: https://developer.mozilla.org/en-US/docs/WebAssembly
Don't be offended HaHaBill, as I scarcely know what I'm talking about. I was on the verge of trying ChatGPT to summarize my Joplins until I read about your project. A mate is a paranoid internet security type, with deep understanding of IT and I thought he'd be interested in reading about it.
FWIW he straight off said it should be run locally to prevent my Joplin data getting out. I guess ChatGPT has the same problem.
There's no need to reply as I'd probably not understand but just please consider.