--- /dev/null
+---
+postid: 084
+title: gutenberg.org, part zwei, a dissection
+date: February 12, 2019
+author: Lucian Mogoșanu
+tags: tmsr
+---
+
+Now that we, and by "we" I mean "I", have a full copy of the
+[Project Gutenberg archive][gutenberg-i], the first thing I can do
+before publishing it is to have a look at what's actually there, keeping
+in mind the objective of [separating][btcbase-1894708]
+[the wheat from the chaff][btcbase-1894768]. However, before finding out
+what's there, there's the legitimate question of what "finding out
+what's there" even means: ideally, one'd catalogue everything that's
+there, which is, I guess, not such a difficult task, given that the
+archive already contains some metadata; also, it's conceivable that one
+would want to verify the authenticity of (some of) the writings, which
+is hard, or maybe downright sisyphean if we're to consider every single
+text that's in there[^1].
+
+For now, I'm going to narrow down the scope of this "finding out what's
+there" to discovering what file types there are, how many of them and
+how much they weigh in bytes. To this end, I'm publishing a script
+called [dissect-guten.bash][dissect-guten.bash] which, for each file in
+the gutenberg archive:
+
+1. looks at its extension, and
+2. if the extension exists, i.e. if the file is of the form
+ `$name.$ext`, then
+3. increments a counter `num[$ext]`, and
+4. adds the size of the file to a byte-counter `siz[$ext]`
+
+The full code:
+
+~~~~ {.bash}
+#!/bin/bash
+
+declare -A num
+declare -A siz
+
+print_data() {
+    echo "--------------"
+    for key in "${!num[@]}"; do
+        echo "${key}:${num[${key}]}:${siz[${key}]}"
+    done
+
+    exit 1
+}
+
+# Print data if we press ctrl+c
+trap print_data SIGINT
+
+# Can't use find guten | while read line ... because this will spawn a
+# sub-shell where all our updates will be lost.
+while IFS= read -r line; do
+    fn=$(basename -- "${line}")
+    ext="${fn##*.}"
+    # Only count files that actually have an extension, i.e. files of
+    # the form $name.$ext.
+    if [ ! \( \( "${fn}" = "${ext}" \) -o \( -z "${ext}" \) \) ]; then
+        fsize=$(stat -c "%s" "${line}")
+        if [ -z "${num[${ext}]}" ]; then
+            num+=(["${ext}"]=1)
+            siz+=(["${ext}"]="${fsize}")
+        else
+            num+=(["${ext}"]=$(echo "${num[${ext}]} + 1" \
+                | BC_LINE_LENGTH=0 bc))
+            siz+=(["${ext}"]=$(echo "${siz[${ext}]} + ${fsize}" \
+                | BC_LINE_LENGTH=0 bc))
+        fi
+        >&2 echo "${ext}:${num[${ext}]}:${siz[${ext}]}"
+    fi
+done < <(find guten/)
+print_data
+~~~~
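+
+For the record, a minimal invocation, assuming the script is saved as
+`dissect-guten.bash` next to the `guten/` directory and marked
+executable, could look something like this (the per-file progress lines
+go to standard error, the final report to standard output):
+
+~~~~
+$ chmod +x dissect-guten.bash
+$ ./dissect-guten.bash > file-types.txt
+~~~~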
+
+The script is unfortunately a set of shitty bash-isms, but otherwise it
+is fairly short and readable. Running this in a directory where `guten/`
+is present and redirecting standard output to, say,
+[file-types.txt][file-types.txt], yields a result that I won't recount
+in full in the post, but that, in short, consists of a set of lines of
+the form:
+
+~~~~
+ext:n:siz
+~~~~
+
+where `ext` is an extension, `n` is the number of files with that
+extension and `siz` is the total size in bytes. We can then e.g.:
+
+~~~~
+$ sort -t: -nk3 file-types.txt > ~/file-types-sorted.txt
+$ awk -F':' '{print $1,$2,$3,($3/(1024*1024*1024)),$3/$2/(1024*1024)}' \
+ ~/file-types-sorted.txt
+~~~~
+
+which reveals to us that the biggest offenders are zip archives
+(~300GB), mp3 and ogg files (~230GB total), images (~140GB total) and
+video files (a bit over 20GB), with the rest being pdf, epub, iso
+etc. and ~40GB representing actual text files.
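+
+As a quick sanity check on that last figure, one could sum up the sizes
+of whichever extensions one considers "actual text"; the set below
+(txt, htm, html) is merely a guess at the obvious candidates, not an
+exhaustive list:
+
+~~~~
+$ awk -F':' '$1 ~ /^(txt|htm|html)$/ { total += $3 }
+    END { print total / (1024 * 1024 * 1024), "GB" }' file-types.txt
+~~~~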
+
+Of course, the text files can be further cut by removing e.g. copyright
+headers[^2], and the illustrations can be given a thorough review to
+establish what's actually worth keeping; but we'll leave these for the
+next episodes of this saga.
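+
+Just for a taste of what the header-stripping part might involve:
+assuming a given text uses the usual "*** START OF ..." / "*** END OF
+..." delimiters, which by no means all of them do, something along
+these lines would extract (roughly) the body between the markers; the
+file names are of course just placeholders:
+
+~~~~
+$ sed -n '/^\*\*\* START OF/,/^\*\*\* END OF/p' some-etext.txt > body.txt
+~~~~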
+
+[^1]: Which I'm not sure of either, given that much of the text might be
+ garbage. To be honest, I just don't know.
+
+[^2]: If you're not aware of it by now, let me remind you that there is
+ only one type of "[intellectual property][intellectual-ownership]":
+ that in which something is owned by being thoroughly examined and
+ deeply understood, [layer by layer][re-reading]. Sure, this is
+ further away from "copyright" (and other such fictions) than the
+ average orc mind can conceive; or, in other words, if you're not
+ aware of it by now, then there is very little hope left for you in
+ this world.
+
+ And now, assuming you've understood the previous paragraph and have
+ come to a realization regarding the value of actual intellectual
+ work, imagine the entirety of modern journalism and popular culture
+ produced in the last few decades, the music and movie "industries",
+ all the "best-sellers", the [academic wank][hogwash] and so on and
+ so forth. [Imagine][inchipuiti-va] that and how worthless a pile of
+ junk it is.
+
+ I'm sure Gutenberg himself would have been in awe looking at all the
+ multilaterally valuable intellectual valuables published in the late
+ 20th/early 21st century!
+
+[gutenberg-i]: /posts/y05/083-gutenberg-rsync.html
+[btcbase-1894708]: http://btcbase.org/log/2019-02-10#1894708
+[btcbase-1894768]: http://btcbase.org/log/2019-02-10#1894768
+[dissect-guten.bash]: TODO
+[file-types.txt]: TODO
+[intellectual-ownership]: /posts/y04/069-on-intellectual-ownership.html
+[re-reading]: http://trilema.com/2017/re-reading-is-the-most-powerful-tool/#selection-21.114-21.433
+[hogwash]: /posts/y02/045-academic-hogwash.html
+[inchipuiti-va]: http://trilema.com/2009/inchipuiti-va/