From: Lucian Mogosanu
Date: Mon, 11 Feb 2019 22:05:07 +0000 (+0200)
Subject: posts: 084, draft
X-Git-Tag: v0.11~103
X-Git-Url: https://git.mogosanu.ro/?a=commitdiff_plain;h=d5dce0210a7e4265eb5b798d782a477455ae9eff;p=thetarpit.git

posts: 084, draft
---

diff --git a/posts/y05/084-gutenberg-ii.markdown b/posts/y05/084-gutenberg-ii.markdown
new file mode 100644
index 0000000..8763525
--- /dev/null
+++ b/posts/y05/084-gutenberg-ii.markdown
@@ -0,0 +1,138 @@
+---
+postid: 084
+title: gutenberg.org, part zwei, a dissection
+date: February 12, 2019
+author: Lucian Mogoșanu
+tags: tmsr
+---
+
+Now that we, and by "we" I mean "I", have a full copy of the
+[Project Gutenberg archive][gutenberg-i], the first thing I can do
+before publishing it is to have a look at what's actually there, keeping
+in mind the objective of [separating][btcbase-1894708]
+[the wheat from the chaff][btcbase-1894768]. However, before finding out
+what's there, there's the legitimate question of what "finding out
+what's there" even means: ideally, one'd catalogue everything that's
+there, which is, I guess, not such a difficult task, given that the
+archive already contains some metadata; also, it's conceivable that one
+would want to verify the authenticity of (some of) the writings, which
+is hard, or maybe downright sisyphean if we're to consider every single
+text that's in there[^1].
+
+For now, I'm going to narrow down the scope of this "finding out what's
+there" to discovering what file types there are, how many of them and
+how much they weigh in bytes. To this end, I'm publishing a script
+called [dissect-guten.bash][dissect-guten.bash] that, for each file in
+the gutenberg archive:
+
+1. looks at its extension, and
+2. if the extension exists, i.e. if the file is of the form
+   `$name.$ext`, then
+3. increments a counter `num[$ext]`, and
+4. adds the size of the file to a byte-counter `siz[$ext]`
+
+The full code:
+
+~~~~ {.bash}
+#!/bin/bash
+
+declare -A num
+declare -A siz
+
+print_data() {
+    echo "--------------"
+    for key in "${!num[@]}"; do
+        echo "${key}:${num[${key}]}:${siz[${key}]}"
+    done
+
+    exit 1
+}
+
+# Print data if we press ctrl+c
+trap print_data SIGINT
+
+# Can't use find guten | while read line ... because this will spawn a
+# sub-shell where all our updates will be lost.
+while IFS= read -r line; do
+    fn=$(basename -- "$line")
+    ext="${fn##*.}"
+    if [[ "${fn}" != "${ext}" && -n "${ext}" ]]; then
+        fsize=$(stat -c "%s" "$line")
+        if [ -z "${num[${ext}]}" ]; then
+            num+=(["${ext}"]=1)
+            siz+=(["${ext}"]="${fsize}")
+        else
+            num+=(["${ext}"]=$(echo ${num["${ext}"]} + 1 \
+                | BC_LINE_LENGTH=0 bc))
+            siz+=(["${ext}"]=$(echo ${siz["${ext}"]} + ${fsize} \
+                | BC_LINE_LENGTH=0 bc))
+        fi
+        >&2 echo "${ext}:${num["${ext}"]}:${siz["${ext}"]}"
+    fi
+done < <(find guten/)
+print_data
+~~~~
+
+The script is unfortunately a set of shitty bash-isms, but otherwise it
+is fairly short and readable. Running this in a directory where `guten/`
+is present and redirecting standard output to, say,
+[file-types.txt][file-types.txt], yields a result that I won't recount
+in full in the post, but that, in short, consists of a set of lines of
+the form:
+
+~~~~
+ext:n:siz
+~~~~
+
+where `ext` is an extension, `n` is the number of files with that
+extension and `siz` is the total size in bytes. We can then e.g.:
+
+~~~~
+$ sort -t: -nk3 file-types.txt > ~/file-types-sorted.txt
+$ awk -F':' '{print $1,$2,$3,($3/(1024*1024*1024)),$3/$2/(1024*1024)}' \
+    ~/file-types-sorted.txt
+~~~~
+
+which reveals to us that the biggest offenders are zip archives
+(~300GB), mp3 and ogg files (~230GB total), images (~140GB total) and
+video files (a bit over 20GB), with the rest being pdf, epub, iso
+etc. and ~40GB representing actual text files.
+
+Of course, the text files can be further cut by removing e.g. copyright
+headers[^2], the illustrations can be given a thorough review to
+establish what's actually worth keeping, but we'll leave these for the
+next episodes of this saga.
+
+[^1]: Which I'm not sure of either, given that much of the text might be
+    garbage. To be honest, I just don't know.
+
+[^2]: If you're not aware of it by now, let me remind you that there is
+    only one type of "[intellectual property][intellectual-ownership]":
+    that in which something is owned by being thoroughly examined and
+    deeply understood, [layer by layer][re-reading]. Sure, this is
+    further away from "copyright" (and other such fictions) than the
+    average orc mind can conceive; or, in other words, if you're not
+    aware of it by now, then there is very little hope left for you in
+    this world.
+
+    And now, assuming you've understood the previous paragraph and have
+    come to a realization regarding the value of actual intellectual
+    work, imagine the entirety of modern journalism and popular culture
+    produced in the last few decades, the music and movie "industries",
+    all the "best-sellers", the [academic wank][hogwash] and so on and
+    so forth. [Imagine][inchipuiti-va] that and how worthless a pile of
+    junk it is.
+
+    I'm sure Gutenberg himself would have been in awe looking at all the
+    multilaterally valuable intellectual valuables published in the late
+    20th/early 21st century!
+
+[gutenberg-i]: /posts/y05/083-gutenberg-rsync.html
+[btcbase-1894708]: http://btcbase.org/log/2019-02-10#1894708
+[btcbase-1894768]: http://btcbase.org/log/2019-02-10#1894768
+[dissect-guten.bash]: TODO
+[file-types.txt]: TODO
+[intellectual-ownership]: /posts/y04/069-on-intellectual-ownership.html
+[re-reading]: http://trilema.com/2017/re-reading-is-the-most-powerful-tool/#selection-21.114-21.433
+[hogwash]: /posts/y02/045-academic-hogwash.html
+[inchipuiti-va]: http://trilema.com/2009/inchipuiti-va/
diff --git a/uploads/2019/02/dissect-guten.bash b/uploads/2019/02/dissect-guten.bash
new file mode 100644
index 0000000..424a151
--- /dev/null
+++ b/uploads/2019/02/dissect-guten.bash
@@ -0,0 +1,37 @@
+#!/bin/bash
+
+declare -A num
+declare -A siz
+
+print_data() {
+    echo "--------------"
+    for key in "${!num[@]}"; do
+        echo "${key}:${num[${key}]}:${siz[${key}]}"
+    done
+
+    exit 1
+}
+
+# Print data if we press ctrl+c
+trap print_data SIGINT
+
+# Can't use find guten | while read line ... because this will spawn a
+# sub-shell where all our updates will be lost.
+while IFS= read -r line; do
+    fn=$(basename -- "$line")
+    ext="${fn##*.}"
+    if [[ "${fn}" != "${ext}" && -n "${ext}" ]]; then
+        fsize=$(stat -c "%s" "$line")
+        if [ -z "${num[${ext}]}" ]; then
+            num+=(["${ext}"]=1)
+            siz+=(["${ext}"]="${fsize}")
+        else
+            num+=(["${ext}"]=$(echo ${num["${ext}"]} + 1 \
+                | BC_LINE_LENGTH=0 bc))
+            siz+=(["${ext}"]=$(echo ${siz["${ext}"]} + ${fsize} \
+                | BC_LINE_LENGTH=0 bc))
+        fi
+        >&2 echo "${ext}:${num["${ext}"]}:${siz["${ext}"]}"
+    fi
+done < <(find guten/)
+print_data
diff --git a/uploads/2019/02/file-types.txt b/uploads/2019/02/file-types.txt
new file mode 100644
index 0000000..8d3ac7b
--- /dev/null
+++ b/uploads/2019/02/file-types.txt
@@ -0,0 +1,547 @@
+aagc:2:30000000
+aadv:2:30000000
+aaim:2:30000000
+20090106:4:375489
+clo:1:8606
+aadw:2:30000000
+aail:2:30000000
+aagb:2:30000000
+ALL:1:7147188
+20090107:6:564851
+static:1:12
+aaga:2:30000000
+aadt:2:30000000
+aaio:2:30000000
+20090104:2:468396
+aain:2:30000000
+aadu:2:30000000
+STD:1:259
+tex:170:76690109
+spx:9478:7802768569
+aaii:2:30000000
+aagg:2:30000000
+aadr:2:30000000
+20110408:2:1569150
+20020512:1:677500
+gut:2:664306
+20090102:3:210821
+aads:2:30000000
+aagf:2:30000000
+aaih:2:30000000
+idx:1:0
+20090103:12:3458052
+aage:2:30000000
+aadp:2:30000000
+aaik:2:30000000
+aaij:2:30000000
+aadq:2:30000000
+aagd:2:30000000
+html:2808:943053927
+aaie:2:30000000
+aagk:2:30000000
+20011219:1:81385
+musicxml:1:37557
+20040301:3:2158977
+aaid:2:30000000
+aagj:2:30000000
+20050327:5:1677392
+19043-h:1:4096
+aagi:2:30000000
+aaig:2:30000000
+ocp:2:1812
+svg:237:1407571
+20050116:2:432066
+aaif:2:30000000
+aagh:2:30000000
+19970801:1:761814
+aaia:2:30000000
+aago:2:30000000
+aadz:2:30000000
+97:1:4138
+aagn:2:30000000
+clb:1:2725
+tiff:452:18980584
+jpg:760462:82699193930
+aadx:2:30000000
+aaic:2:30000000
+aagm:2:30000000
+template:1:8727
+bak:4:6454
+db:37:4446720
+aady:2:30000000
+aaib:2:30000000
+aagl:2:30000000
+jpe:1:36612
+notes:1:56
+20090109:4:297527
+aags:2:30000000
+aadf:2:30000000
+20021106:1:459075
+aagr:2:30000000
+aadg:2:30000000
+bat:7:1003
+mp4:2:353822523
+aadd:2:30000000
+aagq:2:30000000
+jbf:1:196522
+tei:412:282767733
+aagp:2:30000000
+aade:2:30000000
+bin:10:3380
+aadb:2:30000000
+aagw:2:30000000
+aaiy:2:30000000
+20091202:1:873429
+aaix:2:30000000
+aagv:2:30000000
+aadc:2:30000000
+MUS:138:20045383
+ogg~:1:5840527
+aagu:2:30000000
+MP3:23:190147563
+mp3:24032:204379929592
+ico:1:4286
+aada:2:30000000
+aaiz:2:30000000
+aagt:2:30000000
+aadn:2:30000000
+aaiu:2:30000000
+xp:427:398392
+txt:99553:41513156922
+aait:2:30000000
+aado:2:30000000
+aagz:2:30000000
+aagy:2:30000000
+aaiw:2:30000000
+aadl:2:30000000
+config:1:239
+20060824:3:1295690
+04012004:3:1545513
+aagx:2:30000000
+aaiv:2:30000000
+aadm:2:30000000
+otp:2:1487
+tmx:1:76467
+aadj:2:30000000
+aaiq:2:30000000
+cls:1:27378
+200110xx:4:3187261
+aaip:2:30000000
+aadk:2:30000000
+aadh:2:30000000
+aais:2:30000000
+mov:3:16664047
+cache:1:12
+aair:2:30000000
+aadi:2:30000000
+999:1:30235
+aafp:2:30000000
+aaau:2:30000000
+aakg:2:30000000
+aahz:2:30000000
+AUS:1:709631
+aaat:2:30000000
+aakf:2:30000000
+aafq:2:30000000
+aafr:2:30000000
+aaaw:2:30000000
+aake:2:30000000
+aahx:2:30000000
+20030313:4:1108858
+GIF:401:14255919
+2005-04-11:4:360453
+aakd:2:30000000
+aaav:2:30000000
+aafs:2:30000000
+aahy:2:30000000
+ps:46:19062520
+aakc:2:30000000
+aaaq:2:30000000
+aaft:2:30000000
+299:1:16752
+eng:1:1946
+aakb:2:30000000
+aafu:2:30000000
+aaap:2:30000000
+aafv:2:30000000
+aaas:2:30000000
+aaka:2:30000000
+aaar:2:30000000
+aafw:2:30000000
+mpeg:1:4127671704
+aako:2:30000000
+aafx:2:30000000
+aahr:2:30000000
+aafy:2:30000000
+aakn:2:30000000
+aahs:2:30000000
+HTM:53:4884353
+aafz:2:30000000
+aakm:2:30000000
+aahp:2:30000000
+wav:3:31938936
+new:2:12792
+279:1:11115
+aakl:2:30000000
+aahq:2:30000000
+ps2:1:170506
+rst:704:271503319
+aakk:2:30000000
+aahv:2:30000000
+aaay:2:30000000
+flv:1:195062870
+aahw:2:30000000
+aaax:2:30000000
+aakj:2:30000000
+JPG:468:28316807
+aaki:2:30000000
+aaht:2:30000000
+20041201:6:764953
+aakh:2:30000000
+aahu:2:30000000
+aaaz:2:30000000
+20081231:2:1539829
+aakw:1:15000000
+aaae:2:30000000
+aahj:2:30000000
+brl:1:5245
+aakv:1:15000000
+aafa:2:30000000
+aahk:2:30000000
+aaad:2:30000000
+1:3:1200093
+19990816:1:75694
+20041128:1:915233
+aaku:1:15000000
+aaag:2:30000000
+aafb:2:30000000
+aahh:2:30000000
+2:1:76622
+06092010:12:15660818
+19990815:1:75893
+aakt:1:15000000
+aaaf:2:30000000
+aahi:2:30000000
+aafc:2:30000000
+xml:1714:363212393
+aaks:1:15000000
+aaaa:2:30000000
+aahn:2:30000000
+aafd:2:30000000
+20080301:1:357253
+aakr:1:15000000
+aafe:2:30000000
+aaho:2:30000000
+jfif:9:140358
+mus:49:3822565
+lnk:2:3056
+aakq:1:15000000
+aaff:2:30000000
+aaac:2:30000000
+aahl:2:30000000
+denemo:24:159314
+LOG:1:686
+txt~:1:150472
+aafg:2:30000000
+aaab:2:30000000
+aakp:2:29646976
+aahm:2:30000000
+20050123:4:3283343
+19990810:1:356556
+aaam:2:30000000
+aafh:2:30000000
+aahb:2:30000000
+TXT:44:10358769
+aafi:2:30000000
+aahc:2:30000000
+aaal:2:30000000
+jigdo:1:18838
+aafj:2:30000000
+aaao:2:30000000
+cc:1:764
+pdf:3968:2393179034
+aafk:2:30000000
+aaha:2:30000000
+aaan:2:30000000
+fen:3:18858359
+aaai:2:30000000
+aafl:2:30000000
+aahf:2:30000000
+md5:8:32929
+20030625:1:676472
+dic:1:4447
+aakz:1:15000000
+aaah:2:30000000
+aafm:2:30000000
+aahg:2:30000000
+20090518:2:597429
+DS_Store:12:95280
+pdb:1:301709
+aaky:1:15000000
+aafn:2:30000000
+aahd:2:30000000
+aaak:2:30000000
+sty:2:63099
+mid:3228:8613599
+aakx:1:15000000
+aahe:2:30000000
+aaaj:2:30000000
+aafo:2:30000000
+20030626:1:568985
+2019:1:33247
+aajt:2:30000000
+aaco:2:30000000
+xlsm:2:11548954
+rtf:221:224109246
+aaju:2:30000000
+aacn:2:30000000
+2018:1:297865
+bmp:9:1078050
+aajv:2:30000000
+aacm:2:30000000
+20030915:1:759775
+raw:3:25378
+1997:1:40633
+sib:263:8625706
+mpg:12:952040757
+gif:17388:349344171
+aajw:2:30000000
+aacl:2:30000000
+1996:1:74658
+aajp:2:30000000
+aack:2:30000000
+aacj:2:30000000
+aajq:2:30000000
+aaci:2:30000000
+aajr:2:30000000
+499:1:17211
+20020818:4:2875132
+iso:2:8765898752
+aach:2:30000000
+aajs:2:30000000
+20100608:2:1158949
+rar:40:4318561394
+aacg:2:30000000
+2011:1:456196
+20100607:4:2191264
+aacf:2:30000000
+2010:1:498444
+ogg:9525:44598232311
+tif:223:26165688
+htm:50497:23061934768
+aace:2:30000000
+2013:1:362747
+aux:1:8
+NEW:1:1871
+eepic:787:5141206
+nfo:2:8445952
+sfv:2:312
+aacd:2:30000000
+2012:1:412312
+20120606:4:1569746
+1999:1:54721
+aajx:2:30000000
+aacc:2:30000000
+2015:1:396426
+2014:1:422008
+aacb:2:30000000
+aajy:2:30000000
+1998:1:69088
+2017:1:319872
+20041119:6:1349045
+aajz:2:30000000
+aaca:2:30000000
+20011226:1:569352
+2016:1:391232
+19981229:1:295433
+COM:2:18428
+2006:1:368186
+aajd:2:30000000
+n99:1:16843
+PAR2:2:74892
+aaje:2:30000000
+2007:1:467937
+mxl:8:40697
+opf:1:48933
+20121214:1:80053
+aajf:2:30000000
+2004:1:501038
+20041107:2:1205286
+aajg:2:30000000
+2005:1:397007
+2002:1:223830
+20020916:1:776273
+20060325:2:154961
+aacz:2:30000000
+aaja:2:30000000
+20140505:1:219384
+20121023:1:1458995
+2003:1:413042
+srt:1:63042
+zip:152121:323843807012
+20030902:2:96651
+aacy:2:30000000
+aajb:2:30000000
+2000:1:78085
+20090221:4:246584
+ini:1:170
+aajc:2:30000000
+aacx:2:30000000
+199:1:13572
+mobi:995:2158743095
+2001:1:122367
+aajl:2:30000000
+aacw:2:30000000
+699:1:17895
+aacv:2:30000000
+aajm:2:30000000
+20071106:2:1238399
+aajn:2:30000000
+aacu:2:30000000
+20020918:2:1560532
+epub:916:582549150
+doc:63:461211236
+aact:2:30000000
+aajo:2:30000000
+599:1:7717
+aacs:2:30000000
+aajh:2:30000000
+aacr:2:30000000
+aaji:2:30000000
+20041109:2:987489
+20050503:2:813324
+PNG:113:4320455
+aacq:2:30000000
+aajj:2:30000000
+2008:1:445272
+aajk:2:30000000
+aacp:2:30000000
+2009:1:407308
+gz:1:18409805
+aaln:1:15000000
+aaei:2:30000000
+exe:1:204955
+aalo:1:15000000
+20060621:3:1610307
+aaeh:2:30000000
+09162010:6:22106871
+sh:2:313
+aall:1:15000000
+aaek:2:30000000
+pgw:1:13193
+aalm:1:15000000
+20061116:1:631774
+aaej:2:30000000
+thumbs:1:12288
+lit:67:6759907
+aalj:1:15000000
+aaem:2:30000000
+aabx:2:30000000
+GUT:19:322557
+20081027:2:1569039
+aalk:1:15000000
+aael:2:30000000
+aaby:2:30000000
+bk2:1:284049
+prc:62:9041770
+aalh:1:15000000
+aabz:2:30000000
+aaeo:2:30000000
+bk1:3:1149089
+20031105:2:362591
+aali:1:15000000
+ISO:1:713037824
+aaen:2:30000000
+mscz:11:38164
+aalf:1:15000000
+aaea:2:30000000
+aabt:2:30000000
+h:3:6587
+csv:3:8343395
+aalg:1:15000000
+TIF:35:9546424
+aabu:2:30000000
+399:1:13520
+message~:1:580
+aald:1:15000000
+aaec:2:30000000
+aabv:2:30000000
+aale:1:15000000
+aabw:2:30000000
+aaeb:2:30000000
+aalb:1:15000000
+aaee:2:30000000
+aabp:2:30000000
+20130402:3:2155470
+aalc:1:15000000
+aaed:2:30000000
+aabq:2:30000000
+xsl:16:388999
+css:396:868795
+message:16:10195
+dcs:1:9
+aaeg:2:30000000
+aabr:2:30000000
+eps:177:13856840
+aala:1:15000000
+aaef:2:30000000
+aabs:2:30000000
+11xx1998:2:144544
+txt20020407:1:849861
+2006116:2:1272899
+aabl:2:30000000
+aaey:2:30000000
+aabm:2:30000000
+aaex:2:30000000
+PAR:2:348
+20090118:6:1216345
+aabn:2:30000000
+20010904:2:1079514
+aabo:2:30000000
+aaez:2:30000000
+md:1:1726
+midi:1260:1978171
+aabh:2:30000000
+aabi:2:30000000
+o99:1:24379
+avi:2:23038808
+aabj:2:30000000
+aabk:2:30000000
+20020407:1:301813
+aalv:1:15000000
+aabd:2:30000000
+aaeq:2:30000000
+20090111:1:29072
+m4a:368:2201764589
+aalw:1:6251776
+aabe:2:30000000
+aaep:2:30000000
+20090110:2:304175
+20010118:1:541990
+aalt:1:15000000
+aaes:2:30000000
+aabf:2:30000000
+899:1:16876
+jpeg:303:51389627
+aalu:1:15000000
+aaer:2:30000000
+aabg:2:30000000
+20090112:1:2161108
+m4b:9277:22453442288
+aalr:1:15000000
+aaeu:2:30000000
+aals:1:15000000
+aaba:2:30000000
+aaet:2:30000000
+799:1:29554
+psd:4:1763402
+20030523:1:356667
+ly:555:864002
+png:1006017:63393768509
+aalp:1:15000000
+aaew:2:30000000
+aabb:2:30000000
+20030828:2:548930
+aalq:1:15000000
+aaev:2:30000000
+aabc:2:30000000
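
A note for readers of the patch, not part of the commit: the per-category totals quoted in the post (zip, mp3 plus ogg, images and so on) can be re-derived from `file-types.txt` by summing the `siz` column over a chosen set of extensions. The following is a minimal sketch under stated assumptions: `sum_types` is a hypothetical helper name, it expects the `ext:n:siz` format shown above, it treats extensions as case-sensitive (so `mp3` and `MP3` must both be listed), and it reports GiB, i.e. bytes divided by 1024^3, the same divisor as the `awk` one-liner in the post.

```bash
#!/bin/bash

# Hypothetical helper (not in the commit): sum the byte counts of the
# given extensions in an ext:n:siz file and print the total in GiB.
# Usage: sum_types FILE EXT [EXT...]
sum_types() {
    local file=$1
    shift
    local exts
    # Join the requested extensions into one space-separated string
    exts=$(printf '%s ' "$@")
    awk -F: -v exts="$exts" '
        BEGIN {
            # Build a lookup table of the requested extensions
            n = split(exts, a, " ")
            for (i = 1; i <= n; i++) want[a[i]] = 1
        }
        # Only well-formed ext:n:siz lines count; this skips the
        # "--------------" separator that print_data emits.
        NF == 3 && ($1 in want) { total += $3 }
        END { printf "%.1f\n", total / (1024 * 1024 * 1024) }
    ' "$file"
}
```

For example, `sum_types file-types.txt mp3 ogg MP3` should land close to the "~230GB total" quoted for mp3 and ogg files. Doing the sum inside a single `awk` process avoids spawning `bc` once per file, and the `NF == 3` guard means the separator line needs no special handling.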