
@stepney141
Last active April 19, 2025 15:27
By ChatGPT o3 (w/ Deep Research)

Understood. I will conduct a broad, comprehensive survey of technologies that can be used to cluster and visualize your book vector data, taking into account the diversity of goals and levels of granularity.

I will organize the results so that you can compare the options for visualization level (e.g., 2D maps, cluster labels, hierarchical structures) as well as the characteristics of each clustering method and visualization tool (accuracy, scalability, ease of use). Please wait a moment.

A Comparative Proposal of Clustering and Visualization Methods for Book Vectors

Coarse, Genre-Level Clustering (Broad Classification)

Clustering methods: For coarse-grained genre grouping, methods that assume a relatively small number of large clusters are appropriate. The most representative is K-means clustering. K-means partitions the data points into a specified number of clusters, computing each cluster's centroid (mean vector) and iteratively updating the assignments (2.3. Clustering — scikit-learn 1.6.1 documentation). It works well when the clusters are convex (roughly spherical) and of similar size, and it has the advantage of being fast even on large datasets (2.3. Clustering — scikit-learn 1.6.1 documentation). Implementations exist in Python's scikit-learn, R's kmeans function, Excel add-ins, and BI tools such as Tableau and Power BI (both OSS and commercial options are available), so it is supported across a wide range of platforms. Tableau (Salesforce, commercial) uses K-means internally (variance-based partitioning) for its clustering feature and is engineered so that results stay stable across repeated runs (Uncover patterns in your data with Tableau 10’s clustering feature). Power BI (Microsoft, commercial) also offers automatic clustering on scatter-chart visuals, likewise grouping data points with K-means. Clustering with a Gaussian Mixture Model (GMM) is another option (e.g., scikit-learn's GaussianMixture class, available as OSS). A GMM represents each cluster as a normal distribution and is a soft-clustering method that can also output the probability that each data point belongs to each cluster. However, it requires the number of clusters as a parameter, and its computational cost grows in high dimensions. For a coarse classification, specifying the expected number of genres (say, around 10 major genres) as the cluster count lets K-means or GMM produce a rough grouping.
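
As a concrete illustration of the scikit-learn workflow described above, here is a minimal sketch. The `book_vectors` array (one embedding per book, loaded from a hypothetical cache file) and the choice of 10 clusters are assumptions for illustration only:

```python
import numpy as np
from sklearn.preprocessing import normalize
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

# Assumption: one embedding vector per book, shape (n_books, n_dims).
book_vectors = np.load("book_vectors.npy")   # hypothetical cached embeddings
X = normalize(book_vectors)                  # unit length, cosine-friendly

# Hard assignment into ~10 coarse "genre" clusters with K-means.
kmeans = KMeans(n_clusters=10, n_init=10, random_state=42)
genre_labels = kmeans.fit_predict(X)

# Soft assignment with a Gaussian Mixture Model (probability per cluster).
gmm = GaussianMixture(n_components=10, random_state=42)
gmm.fit(X)
genre_probs = gmm.predict_proba(X)           # shape (n_books, 10)
```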

Dimensionality reduction: To visualize broad, genre-level structure, Principal Component Analysis (PCA) is a strong first choice. PCA extracts, in order, the directions of maximum variance in the high-dimensional data and linearly projects the data into a low-dimensional space. PCA tends to preserve the global structure of the data (its overall variance structure), which makes it useful for coarse classification; it is described as a method that compresses dimensionality while maintaining the global structure of high-dimensional data (the relationships between extreme data points are preserved). On the other hand, PCA alone may be insufficient when a nonlinear separation is needed. In that case, another effective approach is to project the data to 2D with t-SNE or UMAP and color each point by genre label. Because t-SNE and UMAP excel at emphasizing local structure (discussed later), when inspecting coarse genre groupings it is better to use a high perplexity (t-SNE) or a large neighborhood size n_neighbors (UMAP) so the projection weights the global structure more heavily. PCA is easy to use in Python (scikit-learn's PCA class) and R (prcomp), both OSS, and Tableau or Power BI can load the pre-compressed 2D data and render it as a scatter plot. The Embedding Projector bundled with machine-learning frameworks such as TensorFlow (an OSS tool from Google) can plot embedding vectors with PCA and other methods and is useful for grasping overall trends (Visualizing Data using the Embedding Projector in TensorBoard | TensorFlow).
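
A minimal PCA projection of the same (assumed) `book_vectors` array into 2D, following the scikit-learn usage mentioned above:

```python
from sklearn.decomposition import PCA

pca = PCA(n_components=2, random_state=42)
coords_2d = pca.fit_transform(book_vectors)          # shape (n_books, 2)
print("explained variance ratio:", pca.explained_variance_ratio_)
```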

Visualization tools: For visualizing coarse clusters (genres), a scatter plot is the standard, intuitive display: project each book vector to 2D and color the points by cluster (genre). On the open-source side you can use scatter plots from Python's Matplotlib/Seaborn, or interactive plots from Plotly or Bokeh. With Plotly or Bokeh, for example, it is easy to add interactive features such as showing a book's title and other details on mouse-over (Interactive Data Visualization with Plotly and Bokeh | by Aakash R). Plotly has Python, R, and JavaScript versions and is easy to embed in web apps and dashboards (provided as an OSS library). The commercial tools Tableau and Power BI let you build scatter plots by drag-and-drop, color-encode the clustering results, and perform interactive analysis such as filtering. Tableau can save clustering results as a group variable and reuse them in other charts within a dashboard (Uncover patterns in your data with Tableau 10’s clustering feature). If your stack is web-based, D3.js (an OSS JavaScript library) is another option for building custom visuals. D3.js is a powerful library for constructing interactive visualizations flexibly from a low level (What is D3? | D3 by Observable - D3.js), and you could build custom charts that represent coarse genres as circles or rectangles (for example, a treemap or bubble chart comparing cluster sizes). You can also visualize inter-cluster similarity with a heatmap of the distance matrix or a layout produced by multidimensional scaling (MDS), but at a scale of roughly 2,000 books a scatter plot should already give an intuitive picture.
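
A minimal interactive scatter plot with Plotly Express, assuming `coords_2d` and `genre_labels` from the sketches above and a hypothetical `titles` list for the hover text:

```python
import plotly.express as px

fig = px.scatter(
    x=coords_2d[:, 0],
    y=coords_2d[:, 1],
    color=[str(c) for c in genre_labels],   # one color per coarse cluster
    hover_name=titles,                      # show the book title on mouse-over
    labels={"x": "PC1", "y": "PC2", "color": "cluster"},
)
fig.show()
```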

Fine-Grained, Topic-Level Clustering (Detailed Classification)

Clustering methods: For fine-grained clustering by topic (theme), methods that can determine the number of clusters automatically or detect small clusters are appropriate. Representative examples are DBSCAN and its extension HDBSCAN. DBSCAN (Density-Based Spatial Clustering of Applications with Noise) treats dense regions as clusters and sparse points as noise (outliers). It does not require the number of clusters to be specified, but it does require tuning parameters such as the distance threshold ε and the minimum number of points. HDBSCAN (Hierarchical DBSCAN) is a hierarchical, improved version of DBSCAN: it can detect clusters of varying density (Discovering the Power of HDBSCAN Clustering for Unsupervised Learning | by udit | Medium) and automatically identifies noise points that do not belong to any cluster (Discovering the Power of HDBSCAN Clustering for Unsupervised Learning | by udit | Medium). HDBSCAN removes the need for the ε parameter; instead you only specify a "minimum cluster size" and it finds stable clusters (Comparing Python Clustering Algorithms — hdbscan 0.8.1 documentation). This makes it well suited to extracting the natural set of topics present in the data while avoiding meaninglessly fine splits (Comparing Python Clustering Algorithms — hdbscan 0.8.1 documentation). For example, clustering 2,000 books with HDBSCAN would automatically surface many groups of books with similar themes, while isolated books are set aside as noise (outliers) (Discovering the Power of HDBSCAN Clustering for Unsupervised Learning | by udit | Medium). HDBSCAN has an OSS Python library (hdbscan) with a good scikit-learn-style interface, and implementations also exist in R and Java (e.g., the ELKI library). Another strong approach is to apply agglomerative hierarchical clustering at a fine level. Agglomerative clustering starts from individual data points and merges them step by step, ultimately yielding a tree structure (dendrogram); for fine topic analysis you can take many of its lower branches to obtain a detailed partition. Ward-linkage hierarchical clustering, for instance, merges clusters using a variance-minimization criterion similar to K-means, which makes it good at producing many homogeneous small clusters (2.3. Clustering — scikit-learn 1.6.1 documentation). You can also use spectral clustering (a graph-based method that can capture non-convex cluster shapes) to discover fine-grained communities from a similarity matrix; it is available via scikit-learn's SpectralClustering or R's specc, though the number of clusters must be specified or tuned. If specifying the cluster count is difficult, Mean-Shift is another option (an OSS implementation exists in scikit-learn). Mean-Shift finds cluster centers by seeking density peaks and likewise needs no cluster count in advance, but it is computationally heavy and its results need stability tuning (Comparing Python Clustering Algorithms — hdbscan 0.8.1 documentation). Overall, HDBSCAN is the leading candidate for extracting fine clusters with high quality; other parameter-free or auto-estimating options include Affinity Propagation (also available as OSS; it chooses the number of clusters automatically but is computationally expensive).
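
A minimal HDBSCAN run over the (assumed) book vectors; `min_cluster_size=5` is an illustrative choice rather than a recommendation:

```python
import hdbscan
from sklearn.preprocessing import normalize

X = normalize(book_vectors)   # assumed (n_books, n_dims) embedding array

clusterer = hdbscan.HDBSCAN(min_cluster_size=5, metric="euclidean")
topic_labels = clusterer.fit_predict(X)      # -1 marks noise/outlier books

n_topics = topic_labels.max() + 1
n_noise = int((topic_labels == -1).sum())
print(f"{n_topics} topic clusters found, {n_noise} books labeled as noise")
```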

Dimensionality reduction: To visualize the distribution of fine-grained topics, nonlinear dimensionality reduction such as t-SNE (t-distributed Stochastic Neighbor Embedding) or UMAP (Uniform Manifold Approximation and Projection) is appropriate. These methods keep documents that are close in the high-dimensional space close in the low-dimensional space as well, making local cluster structure visually clear. t-SNE is particularly good at preserving local structure, focusing on placing data points that are highly similar in the high-dimensional space close together (Difference between PCA VS t-SNE | GeeksforGeeks). As a result, fine topic clusters tend to appear as compact blobs of points that are easy to tell apart visually (Difference between PCA VS t-SNE | GeeksforGeeks). However, t-SNE does not necessarily represent the global structure (the distance relationships between clusters) faithfully, so the result should be interpreted as a collection of "local islands." UMAP, by contrast, is designed to preserve local structure like t-SNE while also retaining a reasonable degree of global structure (7 Ways UMAP Revolutionizes Data Visualization in 2023). Mathematically, UMAP is based on topology and geodesic distance, and it has the advantages of being fast and easy to apply to large datasets (T-SNE vs UMAP vs SNE: Dimensionality Reduction Essentials). It has been reported that UMAP, thanks to the design of its loss function, captures the global structure of the data better than its predecessors such as t-SNE (T-SNE vs UMAP vs SNE: Dimensionality Reduction Essentials). For fine-grained topic visualization, the usual approach is to first build a 2D map with t-SNE or UMAP and inspect how the point clouds separate. Python's scikit-learn includes a t-SNE implementation, and UMAP is available via the separate umap-learn library (OSS); R offers the Rtsne and umap packages. As for parameter tuning, a smaller t-SNE perplexity (e.g., around 30) tends to bring out fine clusters for topic-level visualization, and for UMAP a small min_dist widens the gaps between clusters and makes them easier to distinguish. If there are many fine clusters, projecting to 3D and inspecting the result in TensorBoard's Embedding Projector is also effective: the tool supports 3D rotation, zooming, and highlight-search of specific clusters in the browser, so even data with dozens of topics can be explored interactively (Visualizing Data using the Embedding Projector in TensorBoard | TensorFlow).
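
A minimal UMAP projection with umap-learn; the parameter values below are illustrative, in the spirit of the tuning notes above:

```python
import umap

reducer = umap.UMAP(
    n_neighbors=15,    # smaller values emphasize local topic structure
    min_dist=0.05,     # a small min_dist widens the gaps between clusters
    metric="cosine",
    random_state=42,
)
topic_coords = reducer.fit_transform(book_vectors)   # assumed embeddings -> (n_books, 2)
```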

Visualization tools: For fine-grained cluster visualization, interactive scatter plots or cluster maps are preferable. With Plotly or Bokeh, for example, you can assign a different color or marker to each of dozens of clusters and toggle their visibility from the legend to focus on the topic groups you are interested in. Plotly supports free panning and zooming in the browser, so it is also well suited to zooming into dense regions to examine small clusters. Python frameworks such as Plotly Dash and Streamlit let you quickly build interactive web apps that include the clustering results (both are OSS and integrate well with Python). In R, you can implement dynamic plots in a Shiny application or build interactive charts with the R version of Plotly. Commercial BI tools work as well: in Power BI, for example, you can keep the cluster assignment as a categorical field and narrow it down with slicers, and if you use the R/Python scripting integration in Power BI or Tableau to import data consisting of t-SNE coordinates annotated with cluster labels, you can manipulate the plot interactively inside the tool. For fine-grained topic clusters, another idea is to visualize each cluster's characteristic words as a word cloud. For example, extracting frequent words from the titles and descriptions of the books in a cluster and displaying them as the cluster's representative terms lets users grasp each cluster's theme intuitively. This kind of analysis can also be done end to end with Python's BERTopic library (OSS). BERTopic vectorizes documents with BERT-like models, clusters them with HDBSCAN, and then computes a set of representative topic words for each cluster (Interactive Topic Modeling with BERTopic - Maarten Grootendorst). It also ships interactive Plotly-based visualizations such as inter-topic distance maps and hierarchical topic trees, making it a powerful tool for fine-grained topic analysis. If you implement purely in JavaScript, you could use a D3.js force layout to group nodes by cluster and show details in tooltips, but considering the implementation cost it is usually better to rely on existing libraries and tools.
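
A minimal BERTopic sketch on the raw feature documents, assuming `documents` is a list of Japanese title-plus-summary strings and using one illustrative multilingual embedding model (for readable Japanese topic words you would typically also supply a CountVectorizer backed by a Japanese tokenizer):

```python
from bertopic import BERTopic
from sentence_transformers import SentenceTransformer

# Assumption: `documents` is a list of "title + summary" strings per book.
embedding_model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

topic_model = BERTopic(embedding_model=embedding_model)
topics, probs = topic_model.fit_transform(documents)

print(topic_model.get_topic_info().head())   # topic sizes and top words
fig = topic_model.visualize_topics()         # Plotly inter-topic distance map
fig.show()
```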

Hierarchical Clustering and Multi-Level Visualization (Supporting Both Coarse and Fine Granularity)

(Plot Hierarchical Clustering Dendrogram — scikit-learn 1.6.1 documentation) Example of a dendrogram showing the result of hierarchical clustering. Data points are merged step by step into a tree structure: at the higher levels (the thick upper branches) you see the broad grouping, and at the lower levels (the fine lower branches) you see a fine-grained classification close to the individual points ([PDF] Reading Dendrograms - Wheaton College).

Clustering methods: If you want to organize clusters hierarchically, from coarse genres down to fine topics, hierarchical clustering is the best fit. The agglomerative clustering described earlier can be visualized as a dendrogram (tree diagram) that shows the hierarchical structure of the clusters. In a dendrogram, the vertical axis is distance (or dissimilarity), the horizontal axis lists the data items, and the height at which branches join represents the distance between clusters (How to interpret the dendrogram of a hierarchical cluster analysis). By cutting the tree at a given height, the user obtains the grouping corresponding to the desired number of clusters ([PDF] Reading Dendrograms - Wheaton College). For example, cutting high up yields two or three large clusters (genres), while cutting lower yields ten or more detailed clusters (topics) ([PDF] Reading Dendrograms - Wheaton College). Besides Ward linkage, agglomerative methods offer several distance definitions such as average linkage and complete linkage. Ward linkage favors homogeneous clusters but cannot be used with non-Euclidean distances (2.3. Clustering — scikit-learn 1.6.1 documentation). Average linkage is flexible about the distance measure and can work with, e.g., cosine distance between document vectors. Hierarchical clustering is easy to run and visualize with SciPy's linkage and dendrogram functions (Python, OSS), and R ships the hclust function by default. If the number of items grows very large in the future, hierarchical methods become expensive (O(N^2)); in that case BIRCH, a hierarchical clustering algorithm for large datasets (available as OSS via scikit-learn's Birch class), is an option (2.3. Clustering — scikit-learn 1.6.1 documentation). BIRCH clusters incrementally while building a summarizing data structure and can generate clusters progressively from coarse to fine. HDBSCAN also uses hierarchical clustering internally, and analyzing the resulting cluster tree lets you understand the multi-level cluster structure (Comparing Python Clustering Algorithms — hdbscan 0.8.1 documentation). The HDBSCAN library, for instance, provides a condensed-tree plot for visualizing cluster stability profiles and how clusters separate, so you can see at which density levels clusters split or merge and reason about hierarchical containment relationships between topics. The topic-modeling literature also has methods such as hierarchical LDA that learn a hierarchy of document topics directly, but those require training a separate model, so here we restrict ourselves to approaches that cluster the existing vectors.
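
A minimal SciPy sketch of the linkage-plus-dendrogram workflow described above, using cosine distance with average linkage on the assumed `book_vectors` array:

```python
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster
from scipy.spatial.distance import pdist

# Condensed pairwise cosine-distance matrix, then average-linkage merging.
distances = pdist(book_vectors, metric="cosine")
Z = linkage(distances, method="average")

# Cut high for a few coarse genres, lower for many fine topics.
coarse = fcluster(Z, t=3, criterion="maxclust")
fine = fcluster(Z, t=20, criterion="maxclust")

dendrogram(Z, no_labels=True)
plt.ylabel("cosine distance")
plt.show()
```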

Dimensionality reduction and visualization: There are two ways to visualize a hierarchy: (1) draw the tree structure itself, such as a dendrogram, or (2) overlay clusters of different granularities on an ordinary 2D plot. The dendrogram is the standard visualization for hierarchical clustering and lets you survey everything from coarse to fine classification at a glance ([PDF] Reading Dendrograms - Wheaton College). Drawing a dendrogram with Python + Matplotlib (SciPy) or R is useful as a figure showing the order in which clusters form, even if it is not interactive. If you want interactivity, Plotly provides tree-drawing capabilities, and D3.js offers hierarchical layouts (treemaps, sunburst charts, etc.). With a D3.js sunburst chart, for example, you can show coarse genres near the center and finer topics toward the outside, and users can click a segment to drill down into the lower level. With the second approach, you encode multiple cluster levels on a single 2D embedding by double-encoding marker shape and color: for instance, color by major genre and, within each color, vary the symbol (circle, triangle, etc.) to distinguish the fine clusters. Alternatively, draw one plot for the coarse genres and then provide separate zoomed-in plots (subplots or separate pages) for each genre. The trick with this kind of hierarchical visualization is to introduce staged interaction to reduce the viewer's cognitive load. In Tableau, for example, you can place a coarse-level view and a fine-level view on one dashboard and link them with filter actions, so that selecting a genre displays a scatter plot of the sub-topics it contains. In terms of stack fit, hierarchical data visualization is a strength of JavaScript (D3.js, ECharts, etc.), but Python can also present hierarchies interactively in the browser through Plotly or Bokeh. In addition, you can render the cluster tree to PNG/SVG with the OSS tool Graphviz, or polish the design in Adobe Illustrator. Power BI has no built-in visual dedicated to hierarchical clustering, but drawing a dendrogram with a custom visual or an R script is not impossible.
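
The paragraph above mentions sunburst charts for genre-to-topic drill-down (D3.js on the JavaScript side); as a Python-side equivalent, here is a minimal plotly.express sketch, assuming the `coarse`/`fine` labels from the SciPy sketch and a hypothetical `titles` list:

```python
import pandas as pd
import plotly.express as px

# Assumed per-book labels from the coarse and fine clustering steps above.
df = pd.DataFrame({"title": titles, "genre": coarse, "topic": fine})
df["count"] = 1

# Inner ring: coarse genres; outer ring: fine topics within each genre.
fig = px.sunburst(df, path=["genre", "topic"], values="count")
fig.show()
```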

Accuracy and interpretability: The advantages of hierarchical methods are that the choice of cluster count can be deferred and that containment relationships between clusters can be explained (Clustering and Visualising Documents using Word Embeddings | Programming Historian). From a dendrogram you can draw insights such as "fine topics A and B belong to the same genre at a higher level," and capture your areas of interest as a tree (Clustering and Visualising Documents using Word Embeddings | Programming Historian). On the other hand, a tree that grows too complex becomes hard to interpret, so in practice it is best to cut at an appropriate level and summarize the hierarchy into roughly two or three tiers. For example, first color-code top-level genres such as "literature," "science," and "self-help," and then distinguish a middle level within them, such as "literature → Japanese literature / foreign literature" and "science → physics / biology / ...". The accuracy of hierarchical clustering (the quality of the clusters) depends on the features, but if, as your question assumes, the books' contents have been vectorized appropriately, hierarchical clustering should yield semantically coherent groups (Clustering and Visualising Documents using Word Embeddings | Programming Historian). In particular, using HDBSCAN's condensed tree lets you extract only density-stable clusters and discard noise, which leads to results that are easier to interpret (Comparing Python Clustering Algorithms — hdbscan 0.8.1 documentation) (Discovering the Power of HDBSCAN Clustering for Unsupervised Learning | by udit | Medium). Finally, labeling the clusters matters for visualization. Humans can assign genre names to the coarse clusters, while for fine topics readability improves with tricks such as automatically extracting, say, the top three representative keywords per cluster and using them as labels (in Python this can be done with TF-IDF top-term extraction via sklearn.feature_extraction.text; see the sketch below). By combining clustering methods (from K-means to HDBSCAN to hierarchical), dimensionality reduction (from PCA to UMAP), and visualization tools (Matplotlib/Plotly, Tableau/Power BI, D3.js, etc.), even a collection of about 2,000 books can be visualized in ways that reveal, from multiple angles, the distribution of your interests and the proximity between genres. Based on these proposals, choose the methods that match your goal, from coarse analysis to detailed analysis. Each method and tool has abundant implementation examples and libraries in the major stacks (Python, R, JavaScript), and you can choose flexibly between OSS and commercial options according to your requirements. Please try and compare several of them to find the best cluster-visualization approach.
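
A minimal sketch of the TF-IDF-based cluster labeling described above. The fugashi (MeCab) tokenizer is an illustrative assumption, since TfidfVectorizer needs Japanese text to be pre-segmented, and `documents` / `topic_labels` are assumed from the earlier sketches:

```python
import numpy as np
from fugashi import Tagger                              # hypothetical MeCab wrapper choice
from sklearn.feature_extraction.text import TfidfVectorizer

tagger = Tagger()

def tokenize(text):
    return [word.surface for word in tagger(text)]

vectorizer = TfidfVectorizer(tokenizer=tokenize, token_pattern=None)
tfidf = vectorizer.fit_transform(documents)             # assumed feature documents
terms = np.array(vectorizer.get_feature_names_out())

# Label each topic cluster with its top-3 mean-TF-IDF terms.
for cluster_id in sorted(set(topic_labels) - {-1}):     # assumed HDBSCAN labels
    mask = topic_labels == cluster_id
    mean_scores = np.asarray(tfidf[mask].mean(axis=0)).ravel()
    top_terms = terms[mean_scores.argsort()[::-1][:3]]
    print(cluster_id, " / ".join(top_terms))
```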

Me

Thank you. After computing similarities using one of the methods you described, I want to do clustering and visualization. Specifically, I want to cluster the books and visualize the genres of the books I am interested in, as well as how close or far apart the books are from each other. Please propose technologies (including the names of techniques and specific products) that I can use for this.

I will look at the options you propose and choose a platform flexibly, prioritizing accuracy. So, for my reference, please present as many options as possible, covering the widest possible range.


ChatGPT

Understood. You want to perform clustering and visualization using the books' similarity vectors.

To do that, please tell me a few additional details:

  1. How diverse are the books' genres and themes? (For example, a mix of fiction, business books, self-help, etc., or centered on a specific genre?)
  2. What level of granularity do you have in mind for the visualization? (For example, a rough grasp of genres on a 2D map, labeling clusters and filtering by them, etc.)
  3. Are there platforms or tools you plan to use? (For example, you want to use Python, embed this in a web app, or use Tableau or Power BI.)

If you let me know these, I can propose a wide range of more appropriate tools and methods (e.g., UMAP, t-SNE, clustering algorithms, visualization libraries, etc.).


Me

  1. I think they are diverse, but I don't know exactly. That is precisely why I want to do clustering and visualization.
  2. I have never done this kind of visualization, so I can't judge. If there are various granularities, please survey comprehensively which methods exist at which granularities.
  3. I will look at the options you propose and choose a platform flexibly, prioritizing accuracy. So, for my reference, please present as many options as possible, covering the widest possible range.

Thank you — the details are very clear, which helps a lot.

I will start researching so that I can broadly propose technologies (NLP models, vector databases, APIs, libraries, etc.) that are effective for accurately vectorizing each book's "feature document" and computing similarities between books, from the perspectives of accuracy, Japanese-language support, and scalability.

Please wait a moment. I will report the results as soon as the research is complete.

Similar Book Recommendation: NLP Vectorization & Similarity Solutions

Text Embedding (Vectorization) Techniques

  • OpenAI Embeddings (e.g. text-embedding-ada-002) – Provider: OpenAI (Commercial API).

    • Japanese support: Yes. OpenAI’s embedding model is multilingual (trained on diverse Internet text). It can handle long inputs up to about 8191 tokens (characters/subwords) (OpenAI's Text Embeddings v3 | Pinecone), which covers most book summaries. This enables embedding even lengthy Japanese synopses without truncation.
    • Accuracy & strengths: OpenAI’s embeddings are state-of-the-art in capturing semantic meaning. The ADA-002 model (1536-dimensional) was one of the most widely adopted text embedding models (OpenAI's Text Embeddings v3 | Pinecone) and achieved strong performance on benchmarks (e.g. ~61% average on a broad semantic similarity benchmark, MTEB) (OpenAI's Text Embeddings v3 | Pinecone). These embeddings encode concepts and context, so semantically similar texts map to nearby vectors. (OpenAI reports that such embeddings “capture the concepts within content” and “power applications like knowledge retrieval” in systems like ChatGPT (New embedding models and API updates | OpenAI).)
    • Scalability & usage: No infrastructure needed beyond API calls – OpenAI’s cloud scales the computation. You can preprocess all ~2000 documents (and future expansions) by calling the API to get vectors and cache them. There is a cost per request, but for moderate data this is manageable. Because the embedding is done offline (cached), latency isn’t an issue. OpenAI doesn’t impose platform constraints (use from anywhere with internet).
    • Similarity integration: The output is a standard dense vector. You can use cosine similarity (or dot-product on normalized vectors) to compare book embeddings – this is the typical approach for semantic similarity (GitHub - facebookresearch/faiss: A library for efficient similarity search and clustering of dense vectors.). The high quality of these vectors means cosine similarity will accurately reflect semantic closeness of books’ content. You can store the vectors and use any similarity search library (FAISS, etc.) or even brute-force search due to small scale.
  • Cohere Multilingual Embeddings – Provider: Cohere, via API (Commercial SaaS).

    • Japanese support: Yes. Cohere’s latest embedding model (embed-v4.0) is a multilingual model supporting 100+ languages (including Japanese) (Introduction to Embeddings at Cohere — Cohere). It’s designed to work well on non-English texts and even cross-lingually.
    • Accuracy & strengths: The model produces 1024-dimensional embeddings optimized for semantic meaning. It’s considered “best-in-class multilingual” by Cohere (Introduction to Embeddings at Cohere — Cohere). A key strength is that it can capture nuances in Japanese text as effectively as in English. Cohere also allows optimizing embeddings for different tasks: for example, one can embed queries vs. documents slightly differently (using an input_type parameter) for better similarity search results (Introduction to Embeddings at Cohere — Cohere). For this book similarity (a symmetric comparison of documents), you would embed all books as search_document. The embeddings handle long input reasonably (exact token limit isn’t public, but thousands of characters should be fine).
    • Scalability & usage: As a cloud service, it scales seamlessly – you send texts and get vectors. It’s commercial (with free trial options). For ~2000 books, the cost is moderate; if the dataset grows, you pay for more calls but don’t need to manage any servers or GPU yourself. Embedding is done offline and cached, so runtime performance is not a concern.
    • Similarity integration: Outputs are dense vectors (float arrays). These can be compared with cosine similarity (as shown in Cohere’s docs (Introduction to Embeddings at Cohere — Cohere)) or used in any ANN index. Cohere’s vectors work well with standard similarity metrics – you can load them into a vector database or use libraries like FAISS for fast retrieval. (If using Cohere’s API for queries as well, you could also embed a new book description and find nearest neighbors by cosine.)
  • Sentence Transformers (Multilingual BERT Models) – Provider: Open source (Sentence-Transformers by UKP/TU Darmstadt, Hugging Face 🤗; Apache 2.0 License).

    • Japanese support: Yes. There are pre-trained multilingual models that cover Japanese (e.g. distiluse-base-multilingual or paraphrase-multilingual-MiniLM models supporting 50+ languages (Semantic Search — Sentence Transformers documentation)). In addition, dedicated Japanese sentence-transformer models exist – for example, the model sbert-jsnli-luke-japanese-base-lite (by the OSS community) maps Japanese sentences/paragraphs into a 768-dimensional vector space for semantic search (awesome-japanese-nlp-resources/docs/huggingface.md at main · taishi-i/awesome-japanese-nlp-resources · GitHub). These models are trained on Japanese data (like Japanese NLI or paraphrase datasets), so they capture Japanese semantics well.
    • Accuracy & strengths: Sentence-Transformer models are BERT/RoBERTa-based bi-encoder models fine-tuned specifically for producing semantic sentence embeddings. They excel at encoding the meaning of long texts (up to a limit) into a single vector. They handle synonyms and context effectively – e.g. they will place “面白い本” (“interesting book”) close to “興味深い書籍” (synonymous phrase) in vector space, unlike simple keyword methods. Many multilingual models are available; you can choose one known to perform well for Japanese. The quality is high – these models often approach state-of-the-art on semantic textual similarity tasks. One consideration: base BERT models have an input length limit (typically 512 WordPiece tokens). If a summary is very long, you may need to truncate or split the text. However, publisher summaries usually fit within this limit. If not, strategies like chunking the text and averaging embeddings can be used.
    • Scalability & usage: This solution is open-source, running on your own infrastructure. For ~2000 documents, you could even compute all embeddings on a standard PC or a single GPU. As the collection grows, you might need a machine with more memory or a GPU for faster computation, but there’s no hard limit – you can always batch-process. No external service is required (good for on-premises needs). Once vectors are computed and cached, the system can handle queries quickly. Scaling to tens of thousands of books is feasible; if millions in the future, you might then combine with a distributed index, but the model inference itself can be done offline in batches.
    • Similarity integration: The vectors (typically 384–768 dimensions) can be compared with cosine similarity or dot product. This method is very compatible with cosine similarity – indeed, Sentence-Transformers are often used in semantic search by embedding corpus documents and then finding nearest vectors by cosine (Semantic Search — Sentence Transformers documentation). You can use a library like FAISS or even brute-force search (for 2000 items) to retrieve the most similar book vectors. Approximate nearest neighbor (ANN) indexes also work effectively if the dataset grows large, since these embeddings are in a high-dimensional Euclidean space just like any other. (Many vector search tools even have examples using Sentence-BERT embeddings; a minimal embedding-and-cosine sketch appears after this list.)
  • Doc2Vec (Paragraph Vectors) – Provider: Algorithm by Le & Mikolov (Google Research); implementations in open-source libraries like Gensim (OSS).

    • Japanese support: Yes. Doc2Vec is language-agnostic – it learns vectors from text through unsupervised training, so it can be applied to Japanese text. (You would need to tokenize Japanese text into words or subwords before training, using a tokenizer like MeCab, since Doc2Vec operates on word sequences.)
    • Accuracy & strengths: Doc2Vec learns a fixed-length dense vector for each document by training on a corpus, predicting words from the document vector ([1405.4053] Distributed Representations of Sentences and Documents). It captures semantics better than simple bag-of-words. In the original paper, Paragraph Vector representations outperformed bag-of-words models on tasks, overcoming their weaknesses of losing word order and meaning ([1405.4053] Distributed Representations of Sentences and Documents). The resulting embedding can encode topics and subtle similarities (e.g. two books with related themes even if not sharing exact keywords). However, its accuracy in practice can depend on the training data size and quality. With only ~2000 documents, a Doc2Vec model may not generalize as well (risk of overfitting to these summaries). It works best when trained on a larger corpus (perhaps augment with other Japanese book summaries or Wikipedia data). If well-trained, it produces roughly hundred-dimensional vectors (you choose the size, e.g. 100–300) that reflect semantic content. It handles long text naturally (no hard token limit; longer documents will influence the vector through the training process).
    • Scalability & usage: Training Doc2Vec is computationally heavier than using a pre-trained model, but with 2000 docs it’s still quite fast. Gensim’s implementation (Python, C optimized) can train on CPU reasonably. Once trained, getting a vector for each book is instantaneous. This can be fully on-premise. For future expansion: if new books come in, you might update the model (which is an offline process). Doc2Vec doesn’t natively handle incremental updates easily; you might retrain with the new data. In terms of runtime search, since vectors are cached, the similarity computation is the same as any method (scales with number of vectors, which is not an issue at 2k or even 20k).
    • Similarity integration: Doc2Vec yields a dense vector per document. You can compare them with cosine similarity just like other embedding vectors. There is nothing special in integration – once you have the vectors, you could use brute-force cosine or ANN indexes for faster search. Cosine is appropriate since the vectors capture semantic closeness (the training objective itself is to place similar contexts close in vector space).
  • fastText Word Embeddings (with Averaging) – Provider: Facebook AI Research (OSS library, MIT license).

    • Japanese support: Yes. fastText provides pre-trained word vectors for 157 languages including Japanese (Word vectors for 157 languages · fastText). These are 300-dimensional word embeddings trained on Wikipedia and Common Crawl. It handles Japanese by using character n-grams, so it can generate vectors for words even if they weren’t seen explicitly.
    • Accuracy & strengths: This is a lighter-weight, older approach to represent text. Each word is represented as a vector; to get a document vector (title + summary), one common technique is to average the word vectors (or take a weighted average). The strength of this approach is primarily in its speed and simplicity. It will capture some semantic similarity — e.g. if two book summaries share many common words or related words, their averaged vectors will be closer. However, it doesn’t capture word order or context-specific meanings well (it’s a purely additive bag-of-words model). It also struggles with polysemy: the word “Python” in two summaries will contribute the same vector even if one is about snakes and the other about programming. Thus, semantic accuracy is lower than transformer-based methods. On the upside, since it uses subword information, it can handle misspellings or rare Japanese kanji compound words by deconstructing into n-grams. For long documents, averaging dilutes the importance of any one sentence, so it might miss the overall theme if not enough keywords overlap. In summary, this approach is fast but less semantically precise.
    • Scalability & usage: Extremely scalable and efficient. Loading the pre-trained Japanese vectors (a few hundred MB) is the biggest step. After that, computing an average vector for 2000 documents is trivial (milliseconds). Even for millions of documents, this would be very fast and lightweight (just summing vectors). It’s entirely on-premises and CPU-friendly. As your collection grows, you just compute the average for new documents and append the vector — no retraining needed (unless you choose to train your own embeddings). It’s also flexible: you could augment the embeddings by training on your specific corpus if desired (fastText can be trained on custom data to get domain-specific word vectors).
    • Similarity integration: Since the output is a dense vector (300-dim by default), you can use cosine similarity. Cosine on these average-of-words vectors corresponds to a sort of semantic overlap measure. It’s compatible with any vector search library or a simple dot product computation. Because the vectors are lower-dimensional, brute-force search is even faster. ANN algorithms also handle 300-d vectors easily if needed.
  • TF–IDF / BM25 (Sparse Vector Baseline) – Provider: Apache Lucene (foundation of Elasticsearch/OpenSearch) – Open Source.

    • Japanese support: Yes, with proper text analysis. For Japanese, this means using a morphological analyzer to tokenize text (e.g. Elastic’s Kuromoji plugin or MeCab) since Japanese is written without spaces. Lucene-based solutions and other libraries can then compute term frequencies. Elasticsearch, for example, can be configured to analyze Japanese text and build TF-IDF or BM25 indexes.
    • Accuracy & strengths: This is a lexical similarity approach. Each document is represented as a high-dimensional sparse vector of token frequencies. Similarity is computed by metrics like cosine similarity on the TF-IDF vectors or more commonly BM25 score (a slight variation on TF-IDF). The strength of this approach is in exact keyword matching: if two books share many specific terms or names, the similarity will be high. It’s very interpretable and has been the backbone of search engines for decades. However, it does not capture semantic meaning beyond exact or similar words. As the original Doc2Vec paper noted, bag-of-words models “ignore semantics of the words” ([1405.4053] Distributed Representations of Sentences and Documents) – for example, “強力な” vs “力強い” (“powerful” vs “strong”) would be seen as completely different unless a specific synonym list is provided. It also cannot inherently handle when similar concepts are described with different vocabulary. You can improve it slightly by adding synonym dictionaries or using techniques like latent semantic analysis, but out-of-the-box it’s limited. For long documents, TF-IDF can actually be quite effective at picking up topic overlap (since there will be many content words), but it might also overweight common terms. In summary, this approach is precise on word overlap but misses deeper semantic relations – it might fail to link two books that are about the same topic using different wording.
    • Scalability & usage: Extremely scalable. Search engines using TF-IDF/BM25 easily handle millions of documents. The indexing is efficient and query retrieval is sub-linear via inverted indices. For 2000 books, this is no problem at all – you could query similarity by retrieving each document’s vector and computing cosine, or more directly use Elasticsearch’s “More Like This” query or similarity scripts. Since this scenario is off-line/cached, one approach is to precompute pairwise similarities or use an approximate method like locality-sensitive hashing for speed, but given 2000 items, brute-force comparisons are also fine. Operationally, deploying something like Elasticsearch on-prem or using it in the cloud (Elastic Cloud) would give you a full-text search engine that can also do similarity queries. But if semantic accuracy is the priority, you’d likely use this method only as a baseline or in combination with semantic vectors (some systems use a hybrid of dense and sparse features).
    • Similarity integration: Cosine similarity can be applied to TF-IDF vectors (viewing each as a vector in the high-dimensional term space). This will yield a numerical similarity score. In practice, tools like Elastic implement BM25 scoring which is a form of weighted overlap measure. They also allow script scoring if one wanted to compute cosine explicitly. ANN algorithms are not typically used for sparse vectors because sparse vectors are better handled by inverted indices. Instead of an ANN library, you’d rely on the search engine’s built-in index for efficiency. This approach integrates well if you already plan to use a text search engine (you get free text querying and basic similarity), but it’s less ideal for pure semantic similarity recommendations.
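
To make the Sentence-Transformers option above concrete, here is a minimal sketch; the multilingual model name is one illustrative choice among those discussed, and `documents` is an assumed list of "title + summary" strings (mostly Japanese, a few English):

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")
embeddings = model.encode(documents, normalize_embeddings=True)

# Cosine similarity between every pair of books (dot product of unit vectors).
similarity_matrix = util.cos_sim(embeddings, embeddings)

# The 5 books most similar to book 0 (index 0 itself is excluded).
scores = similarity_matrix[0]
top5 = scores.argsort(descending=True)[1:6]
print(top5.tolist(), scores[top5].tolist())
```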

Similarity Search & Indexing Solutions

  • Exact Cosine Similarity (Brute-force) – The simplest approach is to compute cosine similarity between a given book’s vector and all other book vectors directly. This can be done with a few lines of code (e.g. using NumPy or PyTorch to do vector dot products).

    • Accuracy: This yields the exact similarity according to the embedding, so it will give the true top-N most similar books. There is no approximation error. Given that quality of recommendation is the top priority, using exact cosine is desirable especially at smaller scales.
    • Performance: With ~2000 items, computing cosine similarities is extremely fast (a 2000×2000 similarity matrix is trivial for a modern CPU). Even if the catalog grows to, say, 20,000, brute-force (20k dot products per query) is on the order of milliseconds. It’s only when you reach very large scales (millions of vectors or very high query frequency) that this becomes a bottleneck. Since the use-case here (book recommendations) can be handled with caching and offline computation, speed is not a big issue. One could precompute a similarity matrix for all pairs given the dataset is small, or compute on the fly per query – either is fine.
    • Scalability: If in the future the number of books becomes huge (say millions), exact brute-force does become heavy. At that point, one could switch to an approximate method or use GPU acceleration. But up to at least tens or a few hundred thousand, one can also optimize exact search using BLAS libraries or by packing vectors in matrix form for batch dot products. Many frameworks also allow moderately fast exact search. For now, this approach is the simplest and most straightforward to implement.
    • Cosine vs. other metrics: Cosine similarity is the recommended metric for comparing text embeddings (since the vector magnitude is less important than direction). All the vectorization methods above produce embeddings where cosine is meaningful. (Euclidean distance on unit-normalized vectors is monotonically related to cosine distance, so nearest-neighbor rankings agree.) So, sticking with cosine is standard. In practice, you would normalize all book vectors to unit length and then use a dot product to get cosine similarity (GitHub - facebookresearch/faiss: A library for efficient similarity search and clustering of dense vectors.).
  • FAISS (Facebook AI Similarity Search) – Provider: Meta AI (Open Source library, MIT license).

    • Overview: FAISS is a highly optimized library for efficient similarity search of dense vectors (GitHub - facebookresearch/faiss: A library for efficient similarity search and clustering of dense vectors.). It’s written in C++ with Python bindings, and supports a variety of indexing methods (both exact and approximate). It’s widely used in industry and academia for vector search.
    • Key features: FAISS can handle very large collections of vectors – even those that don’t fit in RAM by using on-disk indexes or compressed representations (GitHub - facebookresearch/faiss: A library for efficient similarity search and clustering of dense vectors.). It offers GPU acceleration for many algorithms, which can dramatically speed up both indexing and querying. FAISS includes algorithms for exact search (e.g. IndexFlatIP for brute-force inner product) as well as Approximate Nearest Neighbors (ANN) search (e.g. IVF inverted indexes, HNSW graphs, PQ quantization for compression). You can choose the index type based on your accuracy/speed needs. For example, for 2000 vectors, you might simply use an exact index (which will be very fast). If the dataset grows to millions, you could switch to an ANN index in FAISS for scalability.
    • Cosine similarity support: FAISS supports cosine similarity via inner product. If you normalize your vectors to unit length, an inner-product search is effectively cosine similarity search (GitHub - facebookresearch/faiss: A library for efficient similarity search and clustering of dense vectors.). FAISS also supports L2 distance; but for embeddings, inner product (cosine) is most common. (A minimal sketch of this exact-cosine setup appears after this list.)
    • Accuracy vs performance: FAISS with an exact index will give perfect accuracy (just like brute-force). Its ANN indexes allow trading some accuracy for speed if needed. For instance, FAISS’s HNSW implementation or IVF (inverted file) with refined vectors can recall ~99% of nearest neighbors with a fraction of the search time. You can tune parameters like the number of clusters or graph neighbors to balance this. Since your use prioritizes accuracy over speed, you could configure FAISS to operate in a near-exact mode (e.g. large HNSW ef parameter or exact search).
    • Scalability & integration: As an open library, you’d run FAISS in your application stack (it can be in-memory or you can persist indexes to disk). For on-prem deployment it’s a great choice. It’s also integrated under the hood in some managed services. FAISS can easily handle future scale: e.g. 100k or 1M vectors on a single server (especially with GPU) or more with multi-index setups. If you expect a major expansion, you could also consider sharding the data and using FAISS on multiple nodes (there are frameworks and even a FAISS GPU server project). For now, FAISS offers a robust, accurate solution – you could compute all book embeddings, store them in a FAISS index, and query by a book’s vector to get top similar books almost instantly.
  • Annoy (Approximate Nearest Neighbors Oh Yeah) – Provider: Spotify (Open Source, Apache 2.0).

    • Overview: Annoy is a lightweight ANN library that builds a forest of binary trees (using random projections) for quick nearest neighbor lookup (What is Annoy (Approximate Nearest Neighbors Oh Yeah)? - Zilliz Learn). It was developed at Spotify for music recommendation use-cases and is well-suited for scenarios where you want fast query times and are willing to do an offline index build.
    • Key features: It’s optimized for memory and disk efficiency. Annoy indexes can be memory-mapped from disk, meaning you can save the index to a file and load it quickly without reading everything into RAM at once. This makes it possible to handle fairly large datasets on disk. Annoy allows you to tune the trade-off between speed and accuracy by adjusting the number of trees it builds and the search depth. More trees give higher recall (closer to exact results) at the cost of a larger index and slightly slower queries. Fewer trees make it faster but with a higher chance of missing some nearest neighbors.
    • Accuracy: Annoy uses angular distance (cosine similarity) or Euclidean distance as metrics. For cosine, you normalize vectors and use the angular distance option. It tends to be quite fast for moderate dimensional vectors (e.g. 100-1000 dims). In terms of accuracy, Annoy can get very good recall if you use enough trees, but it might require more memory than graph-based methods for the same accuracy. It’s often used when you need fast, static indexes – one downside is that Annoy is static: once built, you cannot easily add new items (you’d rebuild the index with the new data). For a slowly growing collection of books, you could rebuild periodically if needed (2000 items is so small that rebuilds are negligible).
    • Scalability & usage: Annoy is simple to integrate (available as a Python library annoy). For 2000 items, honestly, Annoy is not necessary – brute force will outperform it because the overhead of the index isn’t worth it for such a small set. But if the number of books grew to, say, 100k or 1M, Annoy becomes useful. It’s very lightweight compared to deploying a whole database or GPU solution. It’s an on-prem solution (no cloud service required). Because it’s file-based, you can even share the index file between processes or load it in a web service without huge memory use.
    • Compatibility: It works with any vectors. You would normalize your book feature vectors and tell Annoy to use angular distance. Then querying by a book’s vector gives the top N closest vectors. It will return an approximate nearest neighbor list. If accuracy must be 100%, you could do Annoy to get a candidate list and then re-rank with exact cosine on those (though in many cases, properly tuned Annoy will already be very accurate for the top few results). In summary, Annoy is a good option for fast offline similarity search when dataset grows, though newer ANN methods (like HNSW) often give better speed/accuracy trade-offs at scale.
  • HNSWlib (Hierarchical Navigable Small World graphs) – Provider: Originally an algorithm by Yu. Malkov et al.; implemented in HNSWlib and also integrated in other libraries like nmslib. (Open Source, MIT license).

    • Overview: HNSW is a popular graph-based ANN approach. The library HNSWlib is a C++/Python implementation that is highly optimized. Many vector search systems use HNSW under the hood due to its excellent accuracy-speed balance.
    • Key features: HNSW builds a navigable small-world graph of the vectors in multiple layers. During search, it greedily traverses this graph to find nearest neighbors. It’s known for very high recall (accuracy) even at low search times. In fact, with enough search effort (a parameter ef), HNSW can get near-exact results in a fraction of the time of brute force. A distinctive aspect is that all data points are stored as nodes in memory, along with edges connecting to nearest neighbors. This means HNSW indexes are memory-heavy (you need to hold the whole graph in RAM). However, it yields extremely fast query speeds (sub-millisecond for thousands of points, and scales sub-linear with dataset size). It’s more efficient on high-dimensional data than tree-based methods.
    • Accuracy: HNSW is considered one of the top algorithms for ANN when you need very high accuracy. With appropriate tuning, it can return results virtually identical to brute-force cosine similarity. For example, HNSWlib can be configured with an M (connections per node) and ef (search effort) such that recall is 99% (What is Annoy (Approximate Nearest Neighbors Oh Yeah)? - Zilliz Learn). It outperforms Annoy in many cases, especially as dimensionality and dataset size grow, because the graph search adapts well.
    • Scalability & usage: HNSWlib can handle quite large datasets (millions of vectors) as long as you have RAM for them. In many open-source vector databases (Milvus, Weaviate, etc.), HNSW is the default index type because of its performance. For your case, if in the future the book database grows to tens or hundreds of thousands, HNSW is a solid choice to maintain accuracy with faster-than-brute-force queries. Building the graph is relatively fast (and can be done offline). Also, unlike Annoy, HNSW can dynamically add or remove vectors – you can insert new book vectors into an existing HNSW index (with some loss of optimality unless you rebuild periodically). This makes it more flexible for an evolving dataset. Integration-wise, you could use HNSWlib library directly in Python. You’d normalize vectors and use cosine distance. Querying will give an approximate nearest neighbor list.
    • Recommendation: Since speed is not a huge concern at 2k items, you might not need HNSW right now (brute force or basic FAISS is fine). But knowing that HNSW is available is useful if you anticipate scaling. It offers a great accuracy-to-speed ratio, essentially giving you peace of mind that you’re not sacrificing result quality even when using ANN (What is Annoy (Approximate Nearest Neighbors Oh Yeah)? - Zilliz Learn). It’s a strong match for use with high-quality embeddings (like those from BERT/OpenAI) because it can maintain their effectiveness in retrieval.
  • ScaNN (Scalable Nearest Neighbors) – Provider: Google Research (Open Source).

    • Overview: ScaNN is an ANN library from Google that focuses on optimizing vector search with learned quantization techniques. It’s designed for high recall and throughput, especially leveraging hardware accelerators.
    • Key features: ScaNN uses a combination of partitioning (like clustering) and quantization to prune the search space and compute distances faster. It can be seen as an advanced version of the traditional IVF+PQ (inverted file index with product quantization) approach, with some learned components to improve accuracy. Google’s implementation is highly optimized for TPU/GPU usage, but it also works on CPU. It often requires using TensorFlow or building from source to integrate.
    • Accuracy: It can achieve very high recall with proper tuning. The idea is to reduce computations while minimizing loss in precision. Benchmarks have shown ScaNN to be among the top performers for large-scale search when run on the appropriate hardware. It particularly shines when you have very large datasets and want to use quantization to compress vector memory footprint. For moderate sizes, its benefits might not be as pronounced over simpler methods, but it’s engineered for scale.
    • When to consider: If the book corpus grows into the hundreds of thousands or millions and you are possibly using Google’s cloud (or have a TPU/GPU environment), ScaNN could be a viable option. It’s perhaps overkill at 2k items. But it’s good to know that it exists as an OSS solution for ANN. Google’s focus with ScaNN is on extreme scale and performance. According to guidance, ScaNN works best in setups optimized for accelerators and can provide excellent scalability for heavy workloads (What is Annoy (Approximate Nearest Neighbors Oh Yeah)? - Zilliz Learn).
    • Integration: Using ScaNN might require more engineering effort (it’s not as plug-and-play as FAISS or Annoy). However, there are wrapper libraries and even integration in TensorFlow similarity libraries. If you anticipate leveraging Google’s Vertex AI Matching Engine (which is a managed ANN service that likely uses ScaNN or similar under the hood), knowing ScaNN is underlying can be useful. For a simpler route, FAISS or HNSW would suffice unless you specifically need what ScaNN offers.
  • Elasticsearch / OpenSearch with k-NN – Provider: Elastic NV (Elasticsearch X-Pack), Amazon (OpenSearch), partly OSS.

    • Overview: Elasticsearch (and its open-source fork OpenSearch) is a full-text search engine that in recent versions supports vector similarity search (k-NN) alongside traditional keyword search. This could be a convenient option if you want both text search and vector search in one system.
    • Japanese support: Yes. Elasticsearch has strong Japanese text analysis support (e.g. Kuromoji for tokenization). More importantly, as of version 8.x, it also supports semantic search with vectors for Japanese. In fact, Elastic 8.9 introduced proper vector search for Japanese text, allowing for semantic queries on Japanese content (Use a Japanese language NLP model in Elasticsearch to enable semantic searches | Elastic Blog). This means you can index your book summary embeddings in ES and perform cosine similarity queries directly, even on Japanese data, and it will work as expected.
    • k-NN implementation: Elasticsearch uses an ANN algorithm (in Elastic’s case, HNSW is used under the hood for the k-NN feature). OpenSearch (the OSS fork) has a k-NN plugin based on NMSLIB which also implements HNSW and other algorithms. So essentially, you’re getting a vector index (like HNSW) integrated into a search engine. You can configure the space (cosine, L2) and parameters. Queries can ask for the top N similar vectors to a query vector.
    • Hybrid capabilities: One notable feature is the ability to do hybrid searches – for example, you could combine a vector similarity query with a filter (find similar books but only within the same genre or only those published after 2020, etc.). The search engine can intersect vector results with metadata or do reranking that combines vector similarity with textual relevance. If your application might benefit from such combined criteria, ES is a good choice.
    • Scalability & ops: Elasticsearch is built for scale-out. You can run a cluster of nodes and it will distribute indices and queries. Storing 2000 vectors is trivial for it; even 2 million wouldn’t be an issue (aside from memory for the k-NN graphs). However, note that the vector search feature may require a certain license level in Elasticsearch (the open source OpenSearch provides it for free, Elastic’s version might require at least basic or platinum for machine learning features depending on exact usage). Operationally, running ES is heavier than a simple Python library; but if you already have an Elastic cluster or don’t mind the overhead, it gives you a lot of additional search functionality.
    • Use case: If your primary goal is just finding similar books, ES might be overkill in terms of setup. But if you also want to support text queries like “find books about space exploration” and then perhaps use vector similarity to refine results, ES allows you to store both the text and the embeddings. It’s a more integrated solution. It’s also accessible via a REST API which can be convenient. In short, Elasticsearch/OpenSearch k-NN is a solid choice to consider when you need enterprise-level search features combined with vector similarity. (For reference, Elastic’s documentation and blog show step-by-step usage of Japanese NLP models with vector search in Elasticsearch (Use a Japanese language NLP model in Elasticsearch to enable semantic searches | Elastic Blog).)
  • Milvus (Vector Database) – Provider: Open Source (LF AI & Data Foundation, originally by Zilliz Inc.).

    • Overview: Milvus is a purpose-built vector database system for managing and searching embedding vectors at scale. It provides a high-level service that wraps ANN algorithms (like HNSW, IVF, etc.) and adds database features such as persistence, replication, clustering, and a SQL-like query interface.
    • Scalability & performance: Milvus is designed for large-scale and production environments (What is Annoy (Approximate Nearest Neighbors Oh Yeah)? - Zilliz Learn). It can handle collections with billions of vectors by distributing data across nodes. It supports hybrid search (combining vector similarity with structured filters or even keyword conditions) and can perform real-time ingestion of new vectors. Because it’s built for scale, it can index data in the background, maintain multiple replicas for reliability, and use GPU acceleration if available.
    • Operations: With Milvus, you run it as a service (it can be deployed via Docker, Kubernetes, etc.). You interact with it via client SDKs or REST, inserting vectors and querying for nearest neighbors. It abstracts away the specifics of the ANN algorithm – you can choose an index type (like HNSW or IVF) and metric (cosine, IP, L2) when creating a collection, but Milvus handles the details. It’s a good choice if you want a managed solution on-premises (or self-hosted cloud) that scales as your data grows, without having to manually code with FAISS or tune ANN parameters each time. Milvus also has features like partitioning, time-travel (historical version queries), etc., which might be overkill here but useful in large applications.
    • Japanese support: Language is not directly relevant to Milvus since it stores vectors (the text processing to vectors happens outside). But it can store any metadata alongside, such as titles or IDs, and that metadata can certainly be Japanese. For example, you could store each book’s vector with fields like title, author, etc., and then query Milvus for similar vectors and retrieve those fields. For pure vector similarity, Milvus treats all vectors equally regardless of origin.
    • Compatibility: Milvus supports cosine similarity queries (technically you’d normalize vectors and use inner product or use Euclidean if normalized – Milvus allows specifying metric). It integrates well with upstream embedding generators; you would generate embeddings using one of the methods above and insert them. If you have an application that may eventually use a microservice architecture, you could have a Milvus instance as a vector search service. It’s also evolving: new versions integrate more with the ecosystem (for instance, there’s a tool called Attu – a GUI for Milvus, and integrations for common ORMs, etc.).
    • When to use: At 2000 items, Milvus is not necessary. But if you plan to significantly expand (tens of thousands or more) and want a robust, ready-to-use solution, Milvus is a strong candidate. It’s open source, so you won’t have license fees, but you will manage the infrastructure. If your use-case grows to need dynamic updates (e.g. users adding notes or new books frequently) or if you want to support many simultaneous queries, a vector DB like Milvus can handle that load. In essence, Milvus offers the features of a traditional DB (distributed indexing, CRUD, etc.) tailored to vector data (What is Annoy (Approximate Nearest Neighbors Oh Yeah)? - Zilliz Learn), which standalone libraries like FAISS/Annoy don’t provide out of the box.
  • Pinecone (Managed Vector DB Service) – Provider: Pinecone (Commercial Cloud Service).

    • Overview: Pinecone is a fully managed vector database in the cloud. It provides an API through which you can upsert vectors with associated metadata and perform similarity searches (with filtering support). It’s a popular choice for developers who want to avoid managing infrastructure and still get high-performance vector search.
    • Key features: Pinecone offers horizontal scaling, high availability, and low-latency search out of the box. You don’t worry about index types or memory – you choose a pod size (which corresponds to performance/capacity) and Pinecone handles the rest. Under the hood, it likely uses optimized ANN algorithms (the details are abstracted, but they mention using approaches like HNSW). Pinecone also supports metadata filtering (so you can store attributes like genre, and then query “similar books where genre = mystery”). It provides a straightforward API: e.g. query(vector) returns the IDs of the most similar stored vectors.
    • Japanese support: Like Milvus, Pinecone deals with vectors, so it’s agnostic to language. You’d generate the Japanese text embeddings using one of the NLP techniques above (possibly also using Pinecone’s own support – they have integrations where they host some embedding models, or you can use OpenAI through them (Key features - Pinecone Docs)). But essentially, any vector you provide (Japanese text or otherwise) will be stored. The service being hosted means you send data over the internet, which is fine for non-sensitive book info.
    • Scalability & ease of use: Pinecone shines in ease – for a modest number of vectors, you can use their free tier or a small pod and have results in minutes. As your data grows, you can scale up the pod or create a larger index without rearchitecting your app. They handle the ANN tuning, index building, and maintenance. It’s effectively outsourcing the similarity search problem. For production systems, this can save a lot of engineering effort. One trade-off is cost: beyond a certain scale, using a managed service might be more expensive than self-hosting an open solution, depending on how much optimization you can do on your own. But for many, the reliability and zero-maintenance is worth it.
    • Performance: Pinecone is optimized for speed at scale – they claim “ultra-low query latency, even with billions of items” and support for real-time updates (Everything you need to know about Pinecone – A Vector Database). In practice, for thousands of items, any method will be fast, but Pinecone ensures that even as you grow, the performance remains good without you needing to tune parameters. They also automatically handle things like vector indexing in the background, so inserts don’t block queries much.
    • Integration: Using Pinecone would involve sending your 2000 book vectors (perhaps after generating them with a model) to Pinecone’s index. Then for a given book, you query Pinecone for similar vectors; it will return a list of IDs or metadata of the nearest books. You’d then map those IDs back to book titles etc. Since Pinecone is a general vector DB, this workflow is straightforward. They also have client libraries for Python, etc., making development easier. In summary, Pinecone is a great option if you want a production-grade, scalable solution without managing servers, and you don’t mind using a commercial service for the vector search component (What is a Vector Database & How Does it Work? Use Cases + Examples | Pinecone).
  • Weaviate (Semantic Vector DB) – Provider: Semi Technologies (Open Source, GPL/BSL license; also offered as a cloud service).

    • Overview: Weaviate is another popular open-source vector database that comes with some unique features. It not only stores vectors and does similarity search, but also can connect to modules for data ingestion and even perform on-the-fly vectorization via plugins.
    • Features: Weaviate has a schema-based approach – you define classes (like “Book”) with properties (like title, author, vector, etc.). You can then import data and run queries in GraphQL. One very interesting aspect is built-in modules for embedding: for example, there’s a text2vec-transformers module and text2vec-cohere, text2vec-openai, etc. These allow Weaviate to generate embeddings for your text data at query or import time using pre-configured models (including multi-language models). In fact, Weaviate’s multilingual transformers module can embed text in Japanese using a model like Multilingual-MiniLM. Alternatively, you can generate the vectors yourself and push them in (like with Milvus). Weaviate uses HNSW under the hood for ANN, similar to Milvus.
    • Japanese support: Through the modules, yes. If you enable text2vec-transformers with a multilingual model, it will handle Japanese input. There are community models and one can even plug custom models. For instance, you could have Weaviate directly read the “summary” field of each book, pass it to a Japanese SBERT model, and index the resulting vector – all within the Weaviate service. This simplifies the pipeline (you don’t have to run a separate script to vectorize; Weaviate can do it). The accuracy of course depends on the chosen model. If using OpenAI or Cohere modules, it’s essentially calling those APIs internally.
    • Scalability & ops: Weaviate can be self-hosted (Docker Compose or Kubernetes deployments are common) or you can use their hybrid SaaS (they host, you manage through an account). It also supports sharding and replication, though these features are maturing. It’s generally good for mid-scale (tens of millions of vectors). Like other vector DBs, it supports filtering by metadata and hybrid queries. One advantage is the GraphQL interface, which some find intuitive for querying (“Get Book where vector nearText ... and genre=ScienceFiction”).
    • When to use: If you prefer an open-source solution but want something higher-level than FAISS, Weaviate is a candidate. It might be a bit heavier than Milvus for pure similarity search, but the built-in NLP capabilities could be a plus. For example, if in the future you want to add a feature: “Given a user query (in natural language), find relevant books”, Weaviate could directly vectorize the query and find nearest book vectors (this is asymmetric search, and Weaviate supports it via the same modules). Essentially it can serve as both your embedding generator and search index. In terms of accuracy, with the right model it will be on par with the embeddings that model provides (since underneath it’s still using cosine/HNSW).
    • Integration: Without modules, it’s similar to Milvus – you push vectors and query by vector. With modules, it can be as simple as providing the text and letting Weaviate handle vectorization. The choice depends on how much you trust a given module’s model vs. doing it yourself. Weaviate is under active development, and its community is growing, making it a solid open solution in the semantic search space.
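
As a minimal sketch of the exact-cosine FAISS setup referenced earlier in this list, assuming `embeddings` is the (n_books, dim) array produced by whichever vectorization method you choose:

```python
import faiss
import numpy as np

vectors = np.asarray(embeddings, dtype="float32")
faiss.normalize_L2(vectors)                   # unit length, so inner product = cosine

index = faiss.IndexFlatIP(vectors.shape[1])   # exact (brute-force) inner-product index
index.add(vectors)

# The 10 books most similar to book 0 (the first hit is the book itself).
scores, ids = index.search(vectors[0:1], 11)
print(ids[0][1:], scores[0][1:])
```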


Me

I have a list of books I want to read. For each book in the list, I have a document containing its title, author, publisher, and the publisher-provided table of contents / synopsis (hereafter, the "feature document"). Based on this, I am trying to build a system where, when I specify one book from the list, it presents other books in the list whose content is similar to that book, ranked by similarity.

Given this, please propose technologies (including the names of techniques and specific products) that can be used to accurately convert each book's "feature document" into numbers and, based on that, compute the similarity between books.


ChatGPT o3

That sounds like an interesting project! To make my proposal precise, could you tell me a few things?

  1. Roughly how many books do you have feature documents for?
  2. Is the text in Japanese?
  3. Do you prioritize accuracy or computation time? (For example, accuracy even if it takes somewhat longer, or real-time responsiveness.)
  4. Are there constraints on the platforms you can use? (For example, everything must run locally, cloud is OK, open source only, etc.)

Me

  1. Currently there are about 2,000 books. However, the list of "books I want to read" grows every week, so the number will increase over time.
  2. As a rule, almost all of them are in Japanese, but there are also a few English books.
  3. For the similarity computation I prioritize accuracy over time. I plan to cache the results and refer to that cache when actually searching for "similar books."
  4. There are no platform constraints. I will look at the options you propose and choose a platform flexibly, prioritizing accuracy. So, for my reference, please present as many options as possible, covering the widest possible range.