posts: 084, draft
author Lucian Mogosanu <lucian.mogosanu@gmail.com>
Mon, 11 Feb 2019 22:05:07 +0000 (00:05 +0200)
committer Lucian Mogosanu <lucian.mogosanu@gmail.com>
Mon, 11 Feb 2019 22:05:07 +0000 (00:05 +0200)
posts/y05/084-gutenberg-ii.markdown [new file with mode: 0644]
uploads/2019/02/dissect-guten.bash [new file with mode: 0644]
uploads/2019/02/file-types.txt [new file with mode: 0644]

diff --git a/posts/y05/084-gutenberg-ii.markdown b/posts/y05/084-gutenberg-ii.markdown
new file mode 100644
index 0000000..8763525
--- /dev/null
@@ -0,0 +1,138 @@
+---
+postid: 084
+title: gutenberg.org, part zwei, a dissection
+date: February 12, 2019
+author: Lucian Mogoșanu
+tags: tmsr
+---
+
+Now that we, and by "we" I mean "I", have a full copy of the
+[Project Gutenberg archive][gutenberg-i], the first thing I can do
+before publishing it is to have a look at what's actually there, keeping
+in mind the objective of [separating][btcbase-1894708]
+[the wheat from the chaff][btcbase-1894768]. However, before finding out
+what's there, there's the legitimate question of what "finding out
+what's there" even means: ideally, one'd catalogue everything that's
+there, which is, I guess, not such a difficult task, given that the
+archive already contains some metadata; also, it's conceivable that one
+would want to verify the authenticity of (some of) the writings, which
+is hard, or maybe downright sisyphean if we're to consider every single
+text that's in there[^1].
+
+For now, I'm going to narrow down the scope of this "finding out what's
+there" to discovering what file types there are, how many of them and
+how much they weigh in bytes. To this end, I'm publishing a script
+called [dissect-guten.bash][dissect-guten.bash] that, for each file in
+the Gutenberg archive:
+
+1. looks at its extension, and
+2. if the extension exists, i.e. if the file is of the form
+   `$name.$ext`, then
+3. increments a counter `num[$ext]`, and
+4. adds the size of the file to a byte-counter `siz[$ext]`.
+
+The full code:
+
+~~~~ {.bash}
+#!/bin/bash
+
+declare -A num
+declare -A siz
+
+print_data() {
+       echo "--------------"
+       for key in "${!num[@]}"; do
+               echo "${key}:${num[${key}]}:${siz[${key}]}"
+       done
+
+       exit "${1:-0}"
+}
+
+# Print data (and exit with a non-zero status) if we press ctrl+c
+trap 'print_data 1' SIGINT
+
+# Can't use find guten | while read line ... because this will spawn a
+# sub-shell where all our updates will be lost.
+while IFS= read -r line; do
+       fn=$(basename -- "$line")
+       ext="${fn##*.}"
+       if [ ! \( \( "${fn}" = "${ext}" \) -o \( -z "${ext}" \) \) ]; then
+               fsize=$(stat -c "%s" "$line")
+               if [ -z "${num["${ext}"]}" ]; then
+                       num+=(["${ext}"]=1)
+                       siz+=(["${ext}"]="${fsize}")
+               else
+                       num+=(["${ext}"]=$(echo "${num["${ext}"]} + 1" \
+                               | BC_LINE_LENGTH=0 bc))
+                       siz+=(["${ext}"]=$(echo "${siz["${ext}"]} + ${fsize}" \
+                               | BC_LINE_LENGTH=0 bc))
+               fi
+               >&2 echo "${ext}:${num["${ext}"]}:${siz["${ext}"]}"
+       fi
+done < <(find guten/)
+print_data
+~~~~
+
+The script is unfortunately a set of shitty bash-isms, but otherwise it
+is fairly short and readable. Running this in a directory where `guten/`
+is present and redirecting standard output to, say,
+[file-types.txt][file-types.txt], yields a result that I won't recount
+in full in the post, but that, in short, consists of a set of lines of
+the form:
+
+~~~~
+ext:n:siz
+~~~~
+
+where `ext` is an extension, `n` is the number of files with that
+extension and `siz` is the total size in bytes. We can then e.g.:
+
+~~~~
+$ sort -t: -nk3 file-types.txt > ~/file-types-sorted.txt
+$ awk -F':' '{print $1,$2,$3,($3/(1024*1024*1024)),$3/$2/(1024*1024)}' \
+    ~/file-types-sorted.txt
+~~~~
+
+which reveals to us that the biggest offenders are zip archives
+(~300GB), mp3 and ogg files (~230GB total), images (~140GB total) and
+video files (a bit over 20GB), with the rest being pdf, epub, iso
+etc. and ~40GB representing actual text files.
+
+Of course, the text files can be further cut by removing e.g. copyright
+headers[^2], and the illustrations can be given a thorough review to
+establish what's actually worth keeping, but we'll leave these for the
+next episodes of this saga.
+
+[^1]: Which I'm not sure of either, given that much of the text might be
+    garbage. To be honest, I just don't know.
+
+[^2]: If you're not aware of it by now, let me remind you that there is
+    only one type of "[intellectual property][intellectual-ownership]":
+    that in which something is owned by being thoroughly examined and
+    deeply understood, [layer by layer][re-reading].  Sure, this is
+    further away from "copyright" (and other such fictions) than the
+    average orc mind can conceive; or, in other words, if you're not
+    aware of it by now, then there is very little hope left for you in
+    this world.
+
+    And now, assuming you've understood the previous paragraph and have
+    come to a realization regarding the value of actual intellectual
+    work, imagine the entirety of modern journalism and popular culture
+    produced in the last few decades, the music and movie "industries",
+    all the "best-sellers", the [academic wank][hogwash] and so on and
+    so forth. [Imagine][inchipuiti-va] that and how worthless a pile of
+    junk it is.
+
+    I'm sure Gutenberg himself would have been in awe looking at all the
+    multilaterally valuable intellectual valuables published in the late
+    20th/early 21st century!
+
+[gutenberg-i]: /posts/y05/083-gutenberg-rsync.html
+[btcbase-1894708]: http://btcbase.org/log/2019-02-10#1894708
+[btcbase-1894768]: http://btcbase.org/log/2019-02-10#1894768
+[dissect-guten.bash]: TODO
+[file-types.txt]: TODO
+[intellectual-ownership]: /posts/y04/069-on-intellectual-ownership.html
+[re-reading]: http://trilema.com/2017/re-reading-is-the-most-powerful-tool/#selection-21.114-21.433
+[hogwash]: /posts/y02/045-academic-hogwash.html
+[inchipuiti-va]: http://trilema.com/2009/inchipuiti-va/
diff --git a/uploads/2019/02/dissect-guten.bash b/uploads/2019/02/dissect-guten.bash
new file mode 100644
index 0000000..424a151
--- /dev/null
@@ -0,0 +1,37 @@
+#!/bin/bash
+
+declare -A num
+declare -A siz
+
+print_data() {
+       echo "--------------"
+       for key in "${!num[@]}"; do
+               echo "${key}:${num[${key}]}:${siz[${key}]}"
+       done
+
+       exit "${1:-0}"
+}
+
+# Print data (and exit with a non-zero status) if we press ctrl+c
+trap 'print_data 1' SIGINT
+
+# Can't use find guten | while read line ... because this will spawn a
+# sub-shell where all our updates will be lost.
+while IFS= read -r line; do
+       fn=$(basename -- "$line")
+       ext="${fn##*.}"
+       if [ ! \( \( "${fn}" = "${ext}" \) -o \( -z "${ext}" \) \) ]; then
+               fsize=$(stat -c "%s" "$line")
+               if [ -z "${num["${ext}"]}" ]; then
+                       num+=(["${ext}"]=1)
+                       siz+=(["${ext}"]="${fsize}")
+               else
+                       num+=(["${ext}"]=$(echo "${num["${ext}"]} + 1" \
+                               | BC_LINE_LENGTH=0 bc))
+                       siz+=(["${ext}"]=$(echo "${siz["${ext}"]} + ${fsize}" \
+                               | BC_LINE_LENGTH=0 bc))
+               fi
+               >&2 echo "${ext}:${num["${ext}"]}:${siz["${ext}"]}"
+       fi
+done < <(find guten/)
+print_data
diff --git a/uploads/2019/02/file-types.txt b/uploads/2019/02/file-types.txt
new file mode 100644
index 0000000..8d3ac7b
--- /dev/null
@@ -0,0 +1,547 @@
+aagc:2:30000000
+aadv:2:30000000
+aaim:2:30000000
+20090106:4:375489
+clo:1:8606
+aadw:2:30000000
+aail:2:30000000
+aagb:2:30000000
+ALL:1:7147188
+20090107:6:564851
+static:1:12
+aaga:2:30000000
+aadt:2:30000000
+aaio:2:30000000
+20090104:2:468396
+aain:2:30000000
+aadu:2:30000000
+STD:1:259
+tex:170:76690109
+spx:9478:7802768569
+aaii:2:30000000
+aagg:2:30000000
+aadr:2:30000000
+20110408:2:1569150
+20020512:1:677500
+gut:2:664306
+20090102:3:210821
+aads:2:30000000
+aagf:2:30000000
+aaih:2:30000000
+idx:1:0
+20090103:12:3458052
+aage:2:30000000
+aadp:2:30000000
+aaik:2:30000000
+aaij:2:30000000
+aadq:2:30000000
+aagd:2:30000000
+html:2808:943053927
+aaie:2:30000000
+aagk:2:30000000
+20011219:1:81385
+musicxml:1:37557
+20040301:3:2158977
+aaid:2:30000000
+aagj:2:30000000
+20050327:5:1677392
+19043-h:1:4096
+aagi:2:30000000
+aaig:2:30000000
+ocp:2:1812
+svg:237:1407571
+20050116:2:432066
+aaif:2:30000000
+aagh:2:30000000
+19970801:1:761814
+aaia:2:30000000
+aago:2:30000000
+aadz:2:30000000
+97:1:4138
+aagn:2:30000000
+clb:1:2725
+tiff:452:18980584
+jpg:760462:82699193930
+aadx:2:30000000
+aaic:2:30000000
+aagm:2:30000000
+template:1:8727
+bak:4:6454
+db:37:4446720
+aady:2:30000000
+aaib:2:30000000
+aagl:2:30000000
+jpe:1:36612
+notes:1:56
+20090109:4:297527
+aags:2:30000000
+aadf:2:30000000
+20021106:1:459075
+aagr:2:30000000
+aadg:2:30000000
+bat:7:1003
+mp4:2:353822523
+aadd:2:30000000
+aagq:2:30000000
+jbf:1:196522
+tei:412:282767733
+aagp:2:30000000
+aade:2:30000000
+bin:10:3380
+aadb:2:30000000
+aagw:2:30000000
+aaiy:2:30000000
+20091202:1:873429
+aaix:2:30000000
+aagv:2:30000000
+aadc:2:30000000
+MUS:138:20045383
+ogg~:1:5840527
+aagu:2:30000000
+MP3:23:190147563
+mp3:24032:204379929592
+ico:1:4286
+aada:2:30000000
+aaiz:2:30000000
+aagt:2:30000000
+aadn:2:30000000
+aaiu:2:30000000
+xp:427:398392
+txt:99553:41513156922
+aait:2:30000000
+aado:2:30000000
+aagz:2:30000000
+aagy:2:30000000
+aaiw:2:30000000
+aadl:2:30000000
+config:1:239
+20060824:3:1295690
+04012004:3:1545513
+aagx:2:30000000
+aaiv:2:30000000
+aadm:2:30000000
+otp:2:1487
+tmx:1:76467
+aadj:2:30000000
+aaiq:2:30000000
+cls:1:27378
+200110xx:4:3187261
+aaip:2:30000000
+aadk:2:30000000
+aadh:2:30000000
+aais:2:30000000
+mov:3:16664047
+cache:1:12
+aair:2:30000000
+aadi:2:30000000
+999:1:30235
+aafp:2:30000000
+aaau:2:30000000
+aakg:2:30000000
+aahz:2:30000000
+AUS:1:709631
+aaat:2:30000000
+aakf:2:30000000
+aafq:2:30000000
+aafr:2:30000000
+aaaw:2:30000000
+aake:2:30000000
+aahx:2:30000000
+20030313:4:1108858
+GIF:401:14255919
+2005-04-11:4:360453
+aakd:2:30000000
+aaav:2:30000000
+aafs:2:30000000
+aahy:2:30000000
+ps:46:19062520
+aakc:2:30000000
+aaaq:2:30000000
+aaft:2:30000000
+299:1:16752
+eng:1:1946
+aakb:2:30000000
+aafu:2:30000000
+aaap:2:30000000
+aafv:2:30000000
+aaas:2:30000000
+aaka:2:30000000
+aaar:2:30000000
+aafw:2:30000000
+mpeg:1:4127671704
+aako:2:30000000
+aafx:2:30000000
+aahr:2:30000000
+aafy:2:30000000
+aakn:2:30000000
+aahs:2:30000000
+HTM:53:4884353
+aafz:2:30000000
+aakm:2:30000000
+aahp:2:30000000
+wav:3:31938936
+new:2:12792
+279:1:11115
+aakl:2:30000000
+aahq:2:30000000
+ps2:1:170506
+rst:704:271503319
+aakk:2:30000000
+aahv:2:30000000
+aaay:2:30000000
+flv:1:195062870
+aahw:2:30000000
+aaax:2:30000000
+aakj:2:30000000
+JPG:468:28316807
+aaki:2:30000000
+aaht:2:30000000
+20041201:6:764953
+aakh:2:30000000
+aahu:2:30000000
+aaaz:2:30000000
+20081231:2:1539829
+aakw:1:15000000
+aaae:2:30000000
+aahj:2:30000000
+brl:1:5245
+aakv:1:15000000
+aafa:2:30000000
+aahk:2:30000000
+aaad:2:30000000
+1:3:1200093
+19990816:1:75694
+20041128:1:915233
+aaku:1:15000000
+aaag:2:30000000
+aafb:2:30000000
+aahh:2:30000000
+2:1:76622
+06092010:12:15660818
+19990815:1:75893
+aakt:1:15000000
+aaaf:2:30000000
+aahi:2:30000000
+aafc:2:30000000
+xml:1714:363212393
+aaks:1:15000000
+aaaa:2:30000000
+aahn:2:30000000
+aafd:2:30000000
+20080301:1:357253
+aakr:1:15000000
+aafe:2:30000000
+aaho:2:30000000
+jfif:9:140358
+mus:49:3822565
+lnk:2:3056
+aakq:1:15000000
+aaff:2:30000000
+aaac:2:30000000
+aahl:2:30000000
+denemo:24:159314
+LOG:1:686
+txt~:1:150472
+aafg:2:30000000
+aaab:2:30000000
+aakp:2:29646976
+aahm:2:30000000
+20050123:4:3283343
+19990810:1:356556
+aaam:2:30000000
+aafh:2:30000000
+aahb:2:30000000
+TXT:44:10358769
+aafi:2:30000000
+aahc:2:30000000
+aaal:2:30000000
+jigdo:1:18838
+aafj:2:30000000
+aaao:2:30000000
+cc:1:764
+pdf:3968:2393179034
+aafk:2:30000000
+aaha:2:30000000
+aaan:2:30000000
+fen:3:18858359
+aaai:2:30000000
+aafl:2:30000000
+aahf:2:30000000
+md5:8:32929
+20030625:1:676472
+dic:1:4447
+aakz:1:15000000
+aaah:2:30000000
+aafm:2:30000000
+aahg:2:30000000
+20090518:2:597429
+DS_Store:12:95280
+pdb:1:301709
+aaky:1:15000000
+aafn:2:30000000
+aahd:2:30000000
+aaak:2:30000000
+sty:2:63099
+mid:3228:8613599
+aakx:1:15000000
+aahe:2:30000000
+aaaj:2:30000000
+aafo:2:30000000
+20030626:1:568985
+2019:1:33247
+aajt:2:30000000
+aaco:2:30000000
+xlsm:2:11548954
+rtf:221:224109246
+aaju:2:30000000
+aacn:2:30000000
+2018:1:297865
+bmp:9:1078050
+aajv:2:30000000
+aacm:2:30000000
+20030915:1:759775
+raw:3:25378
+1997:1:40633
+sib:263:8625706
+mpg:12:952040757
+gif:17388:349344171
+aajw:2:30000000
+aacl:2:30000000
+1996:1:74658
+aajp:2:30000000
+aack:2:30000000
+aacj:2:30000000
+aajq:2:30000000
+aaci:2:30000000
+aajr:2:30000000
+499:1:17211
+20020818:4:2875132
+iso:2:8765898752
+aach:2:30000000
+aajs:2:30000000
+20100608:2:1158949
+rar:40:4318561394
+aacg:2:30000000
+2011:1:456196
+20100607:4:2191264
+aacf:2:30000000
+2010:1:498444
+ogg:9525:44598232311
+tif:223:26165688
+htm:50497:23061934768
+aace:2:30000000
+2013:1:362747
+aux:1:8
+NEW:1:1871
+eepic:787:5141206
+nfo:2:8445952
+sfv:2:312
+aacd:2:30000000
+2012:1:412312
+20120606:4:1569746
+1999:1:54721
+aajx:2:30000000
+aacc:2:30000000
+2015:1:396426
+2014:1:422008
+aacb:2:30000000
+aajy:2:30000000
+1998:1:69088
+2017:1:319872
+20041119:6:1349045
+aajz:2:30000000
+aaca:2:30000000
+20011226:1:569352
+2016:1:391232
+19981229:1:295433
+COM:2:18428
+2006:1:368186
+aajd:2:30000000
+n99:1:16843
+PAR2:2:74892
+aaje:2:30000000
+2007:1:467937
+mxl:8:40697
+opf:1:48933
+20121214:1:80053
+aajf:2:30000000
+2004:1:501038
+20041107:2:1205286
+aajg:2:30000000
+2005:1:397007
+2002:1:223830
+20020916:1:776273
+20060325:2:154961
+aacz:2:30000000
+aaja:2:30000000
+20140505:1:219384
+20121023:1:1458995
+2003:1:413042
+srt:1:63042
+zip:152121:323843807012
+20030902:2:96651
+aacy:2:30000000
+aajb:2:30000000
+2000:1:78085
+20090221:4:246584
+ini:1:170
+aajc:2:30000000
+aacx:2:30000000
+199:1:13572
+mobi:995:2158743095
+2001:1:122367
+aajl:2:30000000
+aacw:2:30000000
+699:1:17895
+aacv:2:30000000
+aajm:2:30000000
+20071106:2:1238399
+aajn:2:30000000
+aacu:2:30000000
+20020918:2:1560532
+epub:916:582549150
+doc:63:461211236
+aact:2:30000000
+aajo:2:30000000
+599:1:7717
+aacs:2:30000000
+aajh:2:30000000
+aacr:2:30000000
+aaji:2:30000000
+20041109:2:987489
+20050503:2:813324
+PNG:113:4320455
+aacq:2:30000000
+aajj:2:30000000
+2008:1:445272
+aajk:2:30000000
+aacp:2:30000000
+2009:1:407308
+gz:1:18409805
+aaln:1:15000000
+aaei:2:30000000
+exe:1:204955
+aalo:1:15000000
+20060621:3:1610307
+aaeh:2:30000000
+09162010:6:22106871
+sh:2:313
+aall:1:15000000
+aaek:2:30000000
+pgw:1:13193
+aalm:1:15000000
+20061116:1:631774
+aaej:2:30000000
+thumbs:1:12288
+lit:67:6759907
+aalj:1:15000000
+aaem:2:30000000
+aabx:2:30000000
+GUT:19:322557
+20081027:2:1569039
+aalk:1:15000000
+aael:2:30000000
+aaby:2:30000000
+bk2:1:284049
+prc:62:9041770
+aalh:1:15000000
+aabz:2:30000000
+aaeo:2:30000000
+bk1:3:1149089
+20031105:2:362591
+aali:1:15000000
+ISO:1:713037824
+aaen:2:30000000
+mscz:11:38164
+aalf:1:15000000
+aaea:2:30000000
+aabt:2:30000000
+h:3:6587
+csv:3:8343395
+aalg:1:15000000
+TIF:35:9546424
+aabu:2:30000000
+399:1:13520
+message~:1:580
+aald:1:15000000
+aaec:2:30000000
+aabv:2:30000000
+aale:1:15000000
+aabw:2:30000000
+aaeb:2:30000000
+aalb:1:15000000
+aaee:2:30000000
+aabp:2:30000000
+20130402:3:2155470
+aalc:1:15000000
+aaed:2:30000000
+aabq:2:30000000
+xsl:16:388999
+css:396:868795
+message:16:10195
+dcs:1:9
+aaeg:2:30000000
+aabr:2:30000000
+eps:177:13856840
+aala:1:15000000
+aaef:2:30000000
+aabs:2:30000000
+11xx1998:2:144544
+txt20020407:1:849861
+2006116:2:1272899
+aabl:2:30000000
+aaey:2:30000000
+aabm:2:30000000
+aaex:2:30000000
+PAR:2:348
+20090118:6:1216345
+aabn:2:30000000
+20010904:2:1079514
+aabo:2:30000000
+aaez:2:30000000
+md:1:1726
+midi:1260:1978171
+aabh:2:30000000
+aabi:2:30000000
+o99:1:24379
+avi:2:23038808
+aabj:2:30000000
+aabk:2:30000000
+20020407:1:301813
+aalv:1:15000000
+aabd:2:30000000
+aaeq:2:30000000
+20090111:1:29072
+m4a:368:2201764589
+aalw:1:6251776
+aabe:2:30000000
+aaep:2:30000000
+20090110:2:304175
+20010118:1:541990
+aalt:1:15000000
+aaes:2:30000000
+aabf:2:30000000
+899:1:16876
+jpeg:303:51389627
+aalu:1:15000000
+aaer:2:30000000
+aabg:2:30000000
+20090112:1:2161108
+m4b:9277:22453442288
+aalr:1:15000000
+aaeu:2:30000000
+aals:1:15000000
+aaba:2:30000000
+aaet:2:30000000
+799:1:29554
+psd:4:1763402
+20030523:1:356667
+ly:555:864002
+png:1006017:63393768509
+aalp:1:15000000
+aaew:2:30000000
+aabb:2:30000000
+20030828:2:548930
+aalq:1:15000000
+aaev:2:30000000
+aabc:2:30000000