From: Lucian Mogosanu <lucian.mogosanu@gmail.com>
Date: Sun, 17 Feb 2019 14:44:01 +0000 (+0200)
Subject: posts: 084
X-Git-Tag: v0.11~102
X-Git-Url: https://git.mogosanu.ro/?a=commitdiff_plain;h=6be9638ae0dc6d21459eb9cb10ea54b5870335c6;p=thetarpit.git

posts: 084
---

diff --git a/posts/y05/084-gutenberg-ii.markdown b/posts/y05/084-gutenberg-ii.markdown
index 8763525..c5aadf8 100644
--- a/posts/y05/084-gutenberg-ii.markdown
+++ b/posts/y05/084-gutenberg-ii.markdown
@@ -1,7 +1,7 @@
 ---
 postid: 084
-title: gutenberg.org, part zwei, a dissection
-date: February 12, 2019
+title: gutenberg.org part zwei, a dissection
+date: February 17, 2019
 author: Lucian Mogoșanu
 tags: tmsr
 ---
@@ -11,19 +11,19 @@ Now that we, and by "we" I mean "I", have a full copy of the
 before publishing it is to have a look at what's actually there, keeping
 in mind the objective of [separating][btcbase-1894708]
 [the wheat from the chaff][btcbase-1894768]. However, before finding out
-what's there, there's the legitimate question of what "finding out
-what's there" even means: ideally, one'd catalogue everything that's
-there, which is, I guess, not such a difficult task, given that the
-archive already contains some metadata; also, it's conceivable that one
-would want to verify the authenticity of (some of) the writings, which
-is hard, or maybe downright sisyphean if we're to consider every single
-text that's in there[^1].
-
-For now, I'm going to narrow down the scope of this "finding out what's
-there" to discovering what file types there are, how many of them and
-how much they weigh in bytes. To this end, I'm publishing a script
+what's there, there's the question of what "finding out what's there"
+even means: ideally, one'd catalogue everything, which is, I guess, not
+such a difficult task, given that the archive already contains some
+metadata; also, it's conceivable that one would want to verify the
+authenticity of (some of) the writings, which is hard, or maybe
+downright sisyphean if we're to consider every single text on
+gutenberg.org[^1].
+
+For now, we're going to narrow down the scope of this "finding out
+what's there" to discovering what file types there are, how many of them
+and how much they weigh in bytes. To this end, I'm publishing a script
 called [dissect-guten.bash][dissect-guten.bash], that, for each file in
-the gutenberg archive, it:
+the gutenberg.org archive:
 
 1. looks at its extension, and
 2. if the extension exists, i.e. if the file is of the form
@@ -84,7 +84,7 @@ the form:
 ext:n:siz
 ~~~~
 
-where `ext` is an extension, `n` is the number of files with that
+where `ext` is a file extension, `n` is the number of files bearing said
 extension and `siz` is the total size in bytes. We can then e.g.:
 
 ~~~~
@@ -93,46 +93,62 @@ $ awk -F':' '{print $1,$2,$3,($3/(1024*1024*1024)),$3/$2/(1024*1024)}' \
     ~/file-types-sorted.txt
 ~~~~
 
-which reveals to us that the biggest offenders are zip archives
+and further massage the data set to obtain a fancy graph, such as:
+
+<img align="middle"
+class="thumb" src="/uploads/2019/02/file-types-dot.png">
+
+which reveals to us that the biggest offenders are zip archives[^2]
 (~300GB), mp3 and ogg files (~230GB total), images (~140GB total) and
 video files (a bit over 20GB), with the rest being pdf, epub, iso
-etc. and ~40GB representing actual text files.
+etc. and, *finally*, ~40GB representing actual text files.
 
 Of course, the text files can be further cut by removing e.g. copyright
-headers[^2], the illustrations can be given a thorough review to
+headers[^3], the illustrations can be given a thorough review to
 establish what's actually worth keeping, but we'll leave these for the
 next episodes of this saga.
 
-[^1]: Which I'm not sure of either, given that much of the text might be
-    garbage. To be honest, I just don't know.
-
-[^2]: If you're not aware of it by now, let me remind you that there is
-    only one type of "[intellectual property][intellectual-ownership]":
-    that in which something is owned by being thoroughly examined and
-    deeply understood, [layer by layer][re-reading].  Sure, this is
-    further away from "copyright" (and other such fictions) than the
-    average orc mind can conceive; or, in other words, if you're not
-    aware of it by now, then there is very little hope left for you in
-    this world.
+[^1]: Although I'm not sure that's worth the effort, given that much of
+    the writing might be garbage. To be honest, I have no idea.
+
+[^2]: Looking a bit deeper at this, the zip files contain various
+    "versions" of the book, e.g. the HTML files plus illustrations. This
+    yields a lot of duplicate data. For example:
+
+    ~~~~
+    $ ls -1 guten/2/2/2/2/22229/22229*.zip
+    guten/2/2/2/2/22229/22229-8.zip
+    guten/2/2/2/2/22229/22229-h.zip
+    guten/2/2/2/2/22229/22229.zip
+    ~~~~
+
+    Where the `-h.zip` file contains the HTML book, the `-8.zip` file
+    contains the same thing in some non-ASCII text format and finally
+    `22229.zip` contains the ASCII text file.
+
+[^3]: Let us remember that there is only one type of
+    "[intellectual property][intellectual-ownership]": that in which
+    something is owned by being thoroughly examined and understood in
+    depth, [layer by layer][re-reading]. Sure, this is further away from
+    "copyright" (and other such fictions) than the average orc mind can
+    conceive; but really, who cares? By now there is little hope left
+    for those who aren't aware of it.
 
     And now, assuming you've understood the previous paragraph and have
     come to a realization regarding the value of actual intellectual
     work, imagine the entirety of modern journalism and popular culture
     produced in the last few decades, the music and movie "industries",
     all the "best-sellers", the [academic wank][hogwash] and so on and
-    so forth. [Imagine][inchipuiti-va] that and how worthless a pile of
-    junk it is.
-
+    so forth. Imagine all that and how worthless a pile of junk it is --
     I'm sure Gutenberg himself would have been in awe looking at all the
-    multilaterally valuable intellectual valuables published in the late
+    multilaterally valuable intellectual stuff published in the late
     20th/early 21st century!
 
 [gutenberg-i]: /posts/y05/083-gutenberg-rsync.html
 [btcbase-1894708]: http://btcbase.org/log/2019-02-10#1894708
 [btcbase-1894768]: http://btcbase.org/log/2019-02-10#1894768
-[dissect-guten.bash]: TODO
-[file-types.txt]: TODO
+[dissect-guten.bash]: /uploads/2019/02/dissect-guten.bash
+[file-types.txt]: /uploads/2019/02/file-types.txt
 [intellectual-ownership]: /posts/y04/069-on-intellectual-ownership.html
 [re-reading]: http://trilema.com/2017/re-reading-is-the-most-powerful-tool/#selection-21.114-21.433
 [hogwash]: /posts/y02/045-academic-hogwash.html
-[inchipuiti-va]: http://trilema.com/2009/inchipuiti-va/
diff --git a/uploads/2019/02/file-types-dot.png b/uploads/2019/02/file-types-dot.png
new file mode 100644
index 0000000..cbf1449
Binary files /dev/null and b/uploads/2019/02/file-types-dot.png differ
diff --git a/uploads/2019/02/plot.R b/uploads/2019/02/plot.R
new file mode 100644
index 0000000..313bd4b
--- /dev/null
+++ b/uploads/2019/02/plot.R
@@ -0,0 +1,16 @@
+d <- read.table("file-types-aggregated.txt", sep=":", header=FALSE)
+
+total <- sum(d$V3)
+d2 <- data.frame(V1=d$V1, V2=d$V3 / total * 100)
+
+# Sort ascending by share, so dotchart() draws the largest types at the top
+d2 <- d2[order(d2$V2),]
+
+# Draw vertical lines for data above threshold
+d3 <- d2[d2$V2 > 5,]
+
+# Thumbnail
+png("file-types-dot.png", width=690, height=800, res=90)
+dotchart(d2$V2, labels=d2$V1, xlim=c(0,40), xlab="% of total guten/ size")
+abline(v=d3$V2, lty=3)
+text(x=d3$V2, y=match(d3$V2,d2$V2), labels=sprintf("%.01f%%", trunc(d3$V2 * 10)/10), pos=4)
diff --git a/uploads/2019/02/plot.sh b/uploads/2019/02/plot.sh
new file mode 100644
index 0000000..f1dcab7
--- /dev/null
+++ b/uploads/2019/02/plot.sh
@@ -0,0 +1,15 @@
+#!/bin/sh
+
+# Everything under the 1GB threshold goes into the "others" category
+cat file-types.txt | \
+    awk -F':' 'BEGIN {OFS=":"} \
+              { if ($3 >= (1024 * 1024 * 1024)) print $1,$2,$3 }' > bigger.txt
+cat file-types.txt | \
+    awk -F':' 'BEGIN { OFS=":"; sum2=0; sum3=0; } \
+               { if ($3 < (1024 * 1024 * 1024)) { sum2+=$2; sum3+=$3; } } \
+               END { print "others",sum2,sum3 }' > smaller.txt
+cat bigger.txt smaller.txt > file-types-aggregated.txt
+rm bigger.txt smaller.txt
+
+Rscript plot.R
+rm file-types-aggregated.txt