IPython notebook for co-occurrences of lexemes between books in the Hebrew Bible
{
"metadata": {
"name": ""
},
"nbformat": 3,
"nbformat_minor": 0,
"worksheets": [
{
"cells": [
{
"cell_type": "heading",
"level": 1,
"metadata": {},
"source": [
"Cooccurrences of lexemes between the books of the Hebrew Bible"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<img align=\"right\" src=\"files/logo.png\" width=\"200\"/>"
]
},
{
"cell_type": "heading",
"level": 1,
"metadata": {},
"source": [
"Research Question"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"What does linguistic variation between Bible books tell us about their origin, and about the evolution and transmission of their texts?"
]
},
{
"cell_type": "heading",
"level": 1,
"metadata": {},
"source": [
"Method"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We study the co-occurrences of lexemes across the books of the Hebrew Bible and represent the data in an undirected weighted graph, in which the books are the nodes.\n",
"There is an edge between every pair of books that share a lexeme occurrence.\n",
"Edges are weighted: the more lexemes a pair of books shares, the heavier the edge. The weight is, however, both corrected and normalized:\n",
"\n",
"* *correction*: frequent lexemes contribute less to the weight than rare lexemes,\n",
"* *normalization*: the weight contribution of a lexeme is divided by the number of distinct lexemes in the union of the two books.\n",
"\n",
"The initial plan was to consider only common nouns, but we are also experimenting with proper nouns, verbs, and all lexemes.\n",
"Moreover, we experiment with two measures of normalization:\n",
"\n",
"* *normal*: divide by the number of distinct lexemes in the concatenation of the two books,\n",
"* *quadratic*: as in *normal*, but divide by the square of that number.\n",
"\n",
"A small worked example of this weighting follows below."
]
},
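{
"cell_type": "markdown",
"metadata": {},
"source": [
"The next cell is a small toy sketch of this weighting (made-up books and frequencies, not the BHS data). It mirrors the computation in the edge-weight cell further down: every lexeme shared by two books contributes ``1000 / support`` (method *1*) or ``1000 / support / support`` (method *2*), where *support* is the number of books in which the lexeme occurs, and the sum is divided by the number of distinct lexemes in the combined vocabulary of the two books."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# Toy illustration of the edge weighting; the data below is made up\n",
"toy_lexemes = {\n",
"    'BookA': {'X': 4, 'Y': 1, 'Z': 2},  # lexeme -> frequency in BookA\n",
"    'BookB': {'X': 3, 'Z': 5, 'W': 1},  # lexeme -> frequency in BookB\n",
"}\n",
"toy_support = {'X': 2, 'Y': 1, 'Z': 2, 'W': 1}  # number of (toy) books each lexeme occurs in\n",
"\n",
"shared = set(toy_lexemes['BookA']) & set(toy_lexemes['BookB'])\n",
"combined_size = len(set(toy_lexemes['BookA']) | set(toy_lexemes['BookB']))\n",
"\n",
"weight_1 = sum(1000.0 / toy_support[lex] for lex in shared) / combined_size\n",
"weight_2 = sum(1000.0 / toy_support[lex] / toy_support[lex] for lex in shared) / combined_size\n",
"print('method 1: {:.3g}; method 2: {:.3g}'.format(weight_1, weight_2))"
],
"language": "python",
"metadata": {},
"outputs": []
},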
{
"cell_type": "heading",
"level": 1,
"metadata": {},
"source": [
"Compute"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Import the Python modules, the plotting modules, and the LAF-Fabric module (``graf``), and initialize the ``graf`` processor."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"import sys\n",
"import collections\n",
"import matplotlib.pyplot as plt\n",
"import graf\n",
"from graf.notebook import Notebook\n",
"%matplotlib inline\n",
"processor = Notebook()"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 1
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Load the data, especially the features we need.\n",
"Note that the task will be named *cooccurrences*.\n",
"After loading we retrieve the names by which we can access the various pieces of the LAF data."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"processor.init('bhs3.txt.hdr', '--', 'cooccurrences', {\n",
"    \"xmlids\": {\n",
"        \"node\": False,\n",
"        \"edge\": False,\n",
"    },\n",
"    \"features\": {\n",
"        \"shebanq\": {\n",
"            \"node\": [\n",
"                \"db.otype\",\n",
"                \"ft.part_of_speech,noun_type,lexeme_utf8\",\n",
"                \"sft.book\",\n",
"            ],\n",
"            \"edge\": [\n",
"            ],\n",
"        },\n",
"    },\n",
"})\n",
"(msg, P, NN, F, X) = processor.data()"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
" 0.00s COMPILING source: UP TO DATE\n"
]
},
{
"output_type": "stream",
"stream": "stdout",
"text": [
" 0.00s COMPILING annox: UP TO DATE\n"
]
},
{
"output_type": "stream",
"stream": "stdout",
"text": [
" 0.00s loading common: node_sort ... \n"
]
},
{
"output_type": "stream",
"stream": "stdout",
"text": [
" 0.14s loading common: node_out ... \n"
]
},
{
"output_type": "stream",
"stream": "stdout",
"text": [
" 0.44s loading common: node_in ... \n"
]
},
{
"output_type": "stream",
"stream": "stdout",
"text": [
" 0.74s loading common: edges_from ... \n"
]
},
{
"output_type": "stream",
"stream": "stdout",
"text": [
" 0.85s loading common: edges_to ... \n"
]
},
{
"output_type": "stream",
"stream": "stdout",
"text": [
" 0.97s clearing xmlids: xid ...\n"
]
},
{
"output_type": "stream",
"stream": "stdout",
"text": [
" 0.97s clearing feature: feature ...\n"
]
},
{
"output_type": "stream",
"stream": "stdout",
"text": [
" 3.14s clearing annox: xfeature ...\n"
]
},
{
"output_type": "stream",
"stream": "stdout",
"text": [
" 3.94s present feature: shebanq:db.otype (node) from source bhs3.txt.hdr\n"
]
},
{
"output_type": "stream",
"stream": "stdout",
"text": [
" 3.94s present feature: shebanq:ft.noun_type (node) from source bhs3.txt.hdr\n"
]
},
{
"output_type": "stream",
"stream": "stdout",
"text": [
" 3.94s present feature: shebanq:sft.book (node) from source bhs3.txt.hdr\n"
]
},
{
"output_type": "stream",
"stream": "stdout",
"text": [
" 3.94s present feature: shebanq:ft.part_of_speech (node) from source bhs3.txt.hdr\n"
]
},
{
"output_type": "stream",
"stream": "stdout",
"text": [
" 3.94s present feature: shebanq:ft.lexeme_utf8 (node) from source bhs3.txt.hdr\n"
]
},
{
"output_type": "stream",
"stream": "stdout",
"text": [
" 0.00s LOGFILE=/Users/dirk/Scratch/shebanq/results/db/bhs3.txt.hdr/cooccurrences/__log__cooccurrences.txt\n"
]
},
{
"output_type": "stream",
"stream": "stdout",
"text": [
" 0.00s BEGIN TASK=cooccurrences SOURCE=bhs3.txt.hdr\n"
]
}
],
"prompt_number": 2
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"For convenience, the following names are now available:\n",
"\n",
"* *msg*: for printing messages to the console and the log\n",
"* *P*: access to the primary data\n",
"* *NN*: iterator over the nodes in primary data order\n",
"* *F*: feature data\n",
"* *X*: original XML identifiers of the LAF resource\n",
"\n",
"You can inspect the API by giving commands such as ``msg?``, ``F.*?``, or ``X??``. A minimal usage example follows the two inspection cells below."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"F.*?"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 3
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"NN??"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 4
},
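{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a minimal sketch of how these names are combined (it uses only ``NN()`` and ``F.shebanq_db_otype``, which also appear in the collection loop below), the next cell prints the object type of the first few nodes."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# Peek at the first few nodes in primary data order and show their object type\n",
"for i, node in enumerate(NN()):\n",
"    if i >= 5:\n",
"        break\n",
"    print('{} {}'.format(node, F.shebanq_db_otype.v(node)))"
],
"language": "python",
"metadata": {},
"outputs": []
},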
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We are going to generate data files for [Gephi](https://gephi.org), in its native XML format (GEXF).\n",
"Here we specify the subtasks and weighting methods.\n",
"\n",
"* *Subtasks* correspond to the kind of lexemes we are counting.\n",
"* *Methods* correspond to the kind of normalization that we are applying: dividing by the sum or by the square of the sum.\n",
"\n",
"We also spell out the XML header of a Gephi file."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"tasks = {\n",
"    'noun_common': {\n",
"        '1': processor.out_file(\"noun_common_1.gexf\"),\n",
"        '2': processor.out_file(\"noun_common_2.gexf\"),\n",
"    },\n",
"    'noun_proper': {\n",
"        '1': processor.out_file(\"noun_proper_1.gexf\"),\n",
"        '2': processor.out_file(\"noun_proper_2.gexf\"),\n",
"    },\n",
"    'verb': {\n",
"        '1': processor.out_file(\"verb_1.gexf\"),\n",
"        '2': processor.out_file(\"verb_2.gexf\"),\n",
"    },\n",
"    'all': {\n",
"        '1': processor.out_file(\"all_1.gexf\"),\n",
"        '2': processor.out_file(\"all_2.gexf\"),\n",
"    },\n",
"}\n",
"\n",
"methods = {\n",
"    '1': lambda x, y: float(x) / y,\n",
"    '2': lambda x, y: float(x) / y / y,\n",
"}\n",
"\n",
"data_header = '''<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n",
"<gexf xmlns:viz=\"http://www.gexf.net/1.2draft/viz\" xmlns=\"http://www.gexf.net/1.1draft\" version=\"1.2\">\n",
"<meta>\n",
"<creator>LAF-Fabric</creator>\n",
"</meta>\n",
"<graph defaultedgetype=\"undirected\" idtype=\"string\" type=\"static\">\n",
"'''"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 6
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Initialization"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"book_name = None\n",
"books = []\n",
"lexemes = collections.defaultdict(lambda: collections.defaultdict(lambda: collections.defaultdict(lambda: 0)))\n",
"lexeme_support_book = collections.defaultdict(lambda: collections.defaultdict(lambda: {}))\n"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 9
},
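{
"cell_type": "markdown",
"metadata": {},
"source": [
"The nested ``defaultdict``s above let the collection loop below increment counts and set flags without first checking whether a key exists. A tiny standalone sketch of the pattern, with made-up keys:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# Standalone illustration of the nested defaultdict pattern (made-up keys)\n",
"demo = collections.defaultdict(lambda: collections.defaultdict(lambda: collections.defaultdict(lambda: 0)))\n",
"demo['all']['Genesis']['lexeme_a'] += 1  # missing levels are created on the fly\n",
"demo['all']['Genesis']['lexeme_a'] += 1\n",
"print(demo['all']['Genesis']['lexeme_a'])  # 2\n",
"print(demo['all']['Exodus']['lexeme_a'])   # 0: a new key simply yields the default"
],
"language": "python",
"metadata": {},
"outputs": []
},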
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Walk through the relevant nodes and collect the data:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"for node in NN():\n",
"    this_type = F.shebanq_db_otype.v(node)\n",
"    if this_type == \"word\":\n",
"        lexeme = F.shebanq_ft_lexeme_utf8.v(node)\n",
"\n",
"        lexemes['all'][book_name][lexeme] += 1\n",
"        lexeme_support_book['all'][lexeme][book_name] = 1\n",
"\n",
"        p_o_s = F.shebanq_ft_part_of_speech.v(node)\n",
"        if p_o_s == \"noun\":\n",
"            noun_type = F.shebanq_ft_noun_type.v(node)\n",
"            if noun_type == \"common\":\n",
"                lexemes['noun_common'][book_name][lexeme] += 1\n",
"                lexeme_support_book['noun_common'][lexeme][book_name] = 1\n",
"            elif noun_type == \"proper\":\n",
"                lexemes['noun_proper'][book_name][lexeme] += 1\n",
"                lexeme_support_book['noun_proper'][lexeme][book_name] = 1\n",
"        elif p_o_s == \"verb\":\n",
"            lexemes['verb'][book_name][lexeme] += 1\n",
"            lexeme_support_book['verb'][lexeme][book_name] = 1\n",
"\n",
"    elif this_type == \"book\":\n",
"        book_name = F.shebanq_sft_book.v(node)\n",
"        books.append(book_name)\n",
"        sys.stderr.write(\"{} \".format(book_name))\n",
"sys.stderr.write(\"\\n\")\n"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stderr",
"text": [
"Genesis Exodus "
]
},
{
"output_type": "stream",
"stream": "stderr",
"text": [
"Leviticus Numbers "
]
},
{
"output_type": "stream",
"stream": "stderr",
"text": [
"Deuteronomy Joshua "
]
},
{
"output_type": "stream",
"stream": "stderr",
"text": [
"Judges I_Samuel "
]
},
{
"output_type": "stream",
"stream": "stderr",
"text": [
"II_Samuel I_Kings "
]
},
{
"output_type": "stream",
"stream": "stderr",
"text": [
"II_Kings Isaiah "
]
},
{
"output_type": "stream",
"stream": "stderr",
"text": [
"Jeremiah Ezekiel "
]
},
{
"output_type": "stream",
"stream": "stderr",
"text": [
"Hosea Joel Amos Obadiah Jonah Micah "
]
},
{
"output_type": "stream",
"stream": "stderr",
"text": [
"Nahum Habakkuk Zephaniah Haggai Zechariah Malachi Psalms "
]
},
{
"output_type": "stream",
"stream": "stderr",
"text": [
"Job Proverbs "
]
},
{
"output_type": "stream",
"stream": "stderr",
"text": [
"Ruth Canticles Ecclesiastes Lamentations Esther "
]
},
{
"output_type": "stream",
"stream": "stderr",
"text": [
"Daniel Ezra Nehemiah "
]
},
{
"output_type": "stream",
"stream": "stderr",
"text": [
"I_Chronicles II_Chronicles "
]
},
{
"output_type": "stream",
"stream": "stderr",
"text": [
"\n"
]
}
],
"prompt_number": 10
},
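{
"cell_type": "markdown",
"metadata": {},
"source": [
"A quick look at what has been collected (a small added check; it only reports the sizes of the dictionaries filled above):"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# How many books were seen, and how many distinct lexemes per subtask\n",
"print('{} books'.format(len(books)))\n",
"for this_type in sorted(lexeme_support_book):\n",
"    print('{}: {} distinct lexemes'.format(this_type, len(lexeme_support_book[this_type])))"
],
"language": "python",
"metadata": {},
"outputs": []
},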
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Sort the data according to the various subtasks, and compute the edges with their weights."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"nodes_header = '''<nodes count=\"{}\">\\n'''.format(len(books))\n",
"\n",
"for this_type in tasks:\n",
"\n",
"    lexeme_support = {}\n",
"    for lexeme in lexeme_support_book[this_type]:\n",
"        lexeme_support[lexeme] = len(lexeme_support_book[this_type][lexeme])\n",
"\n",
"    book_size = collections.defaultdict(lambda: 0)\n",
"    for book in lexemes[this_type]:\n",
"        book_size[book] = len(lexemes[this_type][book])\n",
"\n",
"    node_data = []\n",
"    for node in range(len(books)):\n",
"        node_data.append('''<node id=\"{}\" label=\"{}\"/>\\n'''.format(node + 1, books[node]))\n",
"\n",
"    edge_id = 0\n",
"    edge_data = collections.defaultdict(lambda: [])\n",
"    for src in range(len(books)):\n",
"        for tgt in range(src + 1, len(books)):\n",
"            book_src = books[src]\n",
"            book_tgt = books[tgt]\n",
"            lexemes_src = {}\n",
"            lexemes_tgt = {}\n",
"            lexemes_src = lexemes[this_type][book_src]\n",
"            lexemes_tgt = lexemes[this_type][book_tgt]\n",
"            intersection_size = 0\n",
"            weights = collections.defaultdict(lambda: 0)\n",
"            for lexeme in lexemes_src:\n",
"                if lexeme not in lexemes_tgt:\n",
"                    continue\n",
"                pre_weight = lexeme_support[lexeme]\n",
"                for this_method in tasks[this_type]:\n",
"                    weights[this_method] += methods[this_method](1000, pre_weight)\n",
"                intersection_size += 1\n",
"            combined_size = book_size[book_src] + book_size[book_tgt] - intersection_size\n",
"            edge_id += 1\n",
"            for this_method in tasks[this_type]:\n",
"                edge_data[this_method].append('''<edge id=\"{}\" source=\"{}\" target=\"{}\" weight=\"{:.3g}\"/>\\n'''.\n",
"                    format(edge_id, src + 1, tgt + 1, weights[this_method] / combined_size))\n",
"\n",
"    for this_method in tasks[this_type]:\n",
"        edges_header = '''<edges count=\"{}\">\\n'''.format(len(edge_data[this_method]))\n",
"        out_file = tasks[this_type][this_method]\n",
"        out_file.write(data_header)\n",
"\n",
"        out_file.write(nodes_header)\n",
"        for node_line in node_data:\n",
"            out_file.write(node_line)\n",
"        out_file.write(\"</nodes>\\n\")\n",
"\n",
"        out_file.write(edges_header)\n",
"        for edge_line in edge_data[this_method]:\n",
"            out_file.write(edge_line)\n",
"        out_file.write(\"</edges>\\n\")\n",
"        out_file.write(\"</graph></gexf>\\n\")\n",
"\n",
"    sys.stderr.write(\"{}: nodes: {}; edges: {}\\n\".format(this_type, len(books), edge_id))\n"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stderr",
"text": [
"noun_common: nodes: 39; edges: 741\n",
"all: nodes: 39; edges: 741\n"
]
},
{
"output_type": "stream",
"stream": "stderr",
"text": [
"verb: nodes: 39; edges: 741\n",
"noun_proper: nodes: 39; edges: 741\n"
]
}
],
"prompt_number": 13
},
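{
"cell_type": "markdown",
"metadata": {},
"source": [
"Note that 741 = 39 × 38 / 2, one edge for every unordered pair of the 39 books: the loop above writes an edge for each pair, whether or not the books share a lexeme (in that case the weight is simply 0), so the interesting structure lies entirely in the edge weights."
]
},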
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Finish the task."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"processor.final()"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
" 8m 44s END TASK cooccurrences\n"
]
},
{
"output_type": "stream",
"stream": "stdout",
"text": [
" 8m 45s \n",
"total 2000\n",
"-rw-r--r-- 1 dirk staff 191B 14 jan 16:28 __log__cooccurrences.txt\n",
"-rw-r--r-- 1 dirk staff 122K 14 jan 16:27 all_1.gexf\n",
"-rw-r--r-- 1 dirk staff 124K 14 jan 16:27 all_2.gexf\n",
"-rw-r--r-- 1 dirk staff 122K 14 jan 16:27 noun_common_1.gexf\n",
"-rw-r--r-- 1 dirk staff 124K 14 jan 16:27 noun_common_2.gexf\n",
"-rw-r--r-- 1 dirk staff 123K 14 jan 16:27 noun_proper_1.gexf\n",
"-rw-r--r-- 1 dirk staff 125K 14 jan 16:27 noun_proper_2.gexf\n",
"-rw-r--r-- 1 dirk staff 122K 14 jan 16:27 verb_1.gexf\n",
"-rw-r--r-- 1 dirk staff 124K 14 jan 16:27 verb_2.gexf\n",
"\n"
]
},
{
"output_type": "stream",
"stream": "stdout",
"text": [
" 8m 45s \n",
"1000K\t/Users/dirk/Scratch/shebanq/results/db/bhs3.txt.hdr/cooccurrences\n",
"\n"
]
}
],
"prompt_number": 14
},
{
"cell_type": "heading",
"level": 1,
"metadata": {},
"source": [
"Visualization"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The output files can be loaded into Gephi and explored with its layout and clustering algorithms."
]
},
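{
"cell_type": "markdown",
"metadata": {},
"source": [
"Before switching to Gephi, a generated file can be sanity-checked from Python, for instance with the (separately installed) ``networkx`` package. This is an optional sketch, not part of the pipeline; adjust the file path to the results directory reported above."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# Optional sanity check: read one of the generated GEXF files back in\n",
"# (assumes networkx is installed and the path points into the task's results directory)\n",
"import networkx as nx\n",
"\n",
"G = nx.read_gexf('noun_common_1.gexf')\n",
"print('{} nodes, {} edges'.format(G.number_of_nodes(), G.number_of_edges()))"
],
"language": "python",
"metadata": {},
"outputs": []
},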
{
"cell_type": "code",
"collapsed": false,
"input": [],
"language": "python",
"metadata": {},
"outputs": []
}
],
"metadata": {}
}
]
} |