fast loading of a large dataset into leveldb
// data comes from here http://stat-computing.org/dataexpo/2009/the-data.html
// download 1994.csv.bz2 and unpack by running: cat 1994.csv.bz2 | bzip2 -d > 1994.csv
// 1994.csv should be ~5.2 million lines and 500MB
// importing all rows into leveldb took ~50 seconds on my machine
// there are two main techniques at work here:
// 1: never create JS objects, leave the data as binary the entire time (binary-split does this)
// 2: group lines into 16 MB batches, to take advantage of leveldb's batch API (byte-stream does this)
var level = require('level')
var byteStream = require('byte-stream')
var split = require('binary-split')
var fs = require('fs')

var count = 0
var wbs = 1024 * 1024 * 16 // 16 MB write buffer, reused as the batch size
var db = level('data.db', {writeBufferSize: wbs}, function () {
  var batcher = byteStream(wbs)
  fs.createReadStream('1994.csv')
    .pipe(split())   // split the byte stream into lines without parsing them
    .pipe(batcher)   // accumulate lines until ~16 MB are buffered
    .on('data', function (lines) {
      var batch = db.batch()
      for (var i = 0; i < lines.length; i++) {
        batch.put(count, lines[i])
        count++
      }
      // once the batch is written, let the batcher emit the next group (backpressure)
      batch.write(batcher.next.bind(batcher))
    })
})
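To sanity-check the import once it finishes, something like this should work: a minimal sketch that streams a few records back out with level's createReadStream, assuming the default utf8 encodings and the data.db created above.

// sketch: read the first few records back to verify the import
var level = require('level')
var db = level('data.db')

db.createReadStream({ limit: 3 })
  .on('data', function (entry) {
    console.log(entry.key, '->', entry.value.slice(0, 72))
  })
  .on('end', function () {
    db.close()
  })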
Silly optimization, but I bet you can squeeze more perf out of this by changing the for loop to:

for (var i = 0, l = lines.length; i < l; i++) {

@joeybaker: V8 already does that optimisation for you.
nice one. thanks for sharing. didn't know about byte-stream or binary-split.
@aheckmann I wrote them this week :D
Well done Max!
OS: Darwin 10.9
Memory: 4 GB 1600 MHz DDR3
Processor: 1.8 GHz Intel Core i5
time node gist.js
66.84 real 91.93 user 4.09 sys
I just did a bigger import, all of the 1990s data.
cat 1990.csv.bz2 1991.csv.bz2 1992.csv.bz2 1993.csv.bz2 1994.csv.bz2 1995.csv.bz2 1996.csv.bz2 1997.csv.bz2 1998.csv.bz2 1999.csv.bz2 > 1990s.csv.bz2
cat 1990s.csv.bz2 | bzip2 -d > 1990s.csv
It results in a 52,694,400-line file (5.18GB CSV) and takes 11m4.321s to run the above script, which produces a 2.33GB leveldb folder.
have you tested how this behaves in relation to key size? i'm going to test tomorrow but i was just wondering.
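One thing worth noting when experimenting with keys: the script stores the counter with level's default utf8 key encoding, so keys sort lexicographically ('10' sorts before '2'). A minimal sketch of zero-padding the counter so keys stay in numeric order (the pad width of 8 is an arbitrary assumption):

// sketch: zero-pad the numeric counter so utf8-encoded keys sort numerically
function padKey (n) {
  var s = String(n)
  while (s.length < 8) s = '0' + s // assumes fewer than 100 million rows
  return s
}

// in the import loop above, instead of batch.put(count, lines[i]):
batch.put(padKey(count), lines[i])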