--- /dev/null
+---
+postid: 088
+title: Gutenberg ASCII archive updated, now with 0.4% less junk
+date: March 9, 2019
+author: Lucian Mogoșanu
+tags: tmsr
+---
+
+The updated ASCII text archive of Project Gutenberg is available at
+[lucian.mogosanu.ro][lmogo.xyz].
+
+~~~~
+TODO: signed ksum
+~~~~
+
+Read further for details, technical or otherwise.
+
+The differences between this version and the
+[previous one][gutenberg-iii] consist mostly in the removal of crap
+that bleeds the reader's eyes, or, to be [more precise][btcbase-1894785]:
+
+> **mircea_popescu**: BingoBoingo and it is VERY HARMFUL fucking
+> junk. having "All donations should be made to "Project Gutenberg/CMU":
+> and are tax deductible to the extent allowable by law. (CMU =
+> Carnegie- Mellon University)." or "Copyright laws are changing all
+> over the world, be sure to check the copyright laws for your country
+> before posting these files!!" in the lede of "The Merchant of Venice
+> by William Shakespeare" promotes a most harmful and in any case
+> uncountenable view whereby the fucktarded usgistan is at least more
+> important than fucking shakespeare.
+>
+> **mircea_popescu**: it very well fucking is not. it's not even
+> remotely as important. having usg.cmu or usg.anything-else spew on
+> actual literature is nothing short of vandalism. i don't want their
+> grafitti, and i don't care why they think they're owed it.
+>
+> **mircea_popescu**: this without even going into ridiculous nonsense a
+> la "We produce about two million dollars for each hour we work. The
+> time it takes us, a rather conservative estimate, is fifty hours to
+> get any etext selected, entered, proofread, edited, copyright searched
+> and analyzed, the copyright letters written, etc. This projected
+> audience is one hundred million readers. If our value per text is
+> nominally estimated at one dollar then we produce $2 million dollars
+> per hour" ; apparently nobody fucking there bothered to EVER confront
+> [http://btcbase.org/log/2017-05-15#1656097][btcbase-1656097] or
+> [http://btcbase.org/log/2017-07-15#1684170][btcbase-1684170] etcetera.
+
+Taking a cursory look at some of the books, one can immediately notice a
+message along the lines of:
+
+~~~~
+*** START OF THIS PROJECT GUTENBERG EBOOK yadda-yadda ***
+
+--- or:
+
+***START**THE SMALL PRINT!**FOR PUBLIC DOMAIN EBOOKS**START***
+... legalese, followed by:
+*END*THE SMALL PRINT! FOR PUBLIC DOMAIN ETEXTS*Ver.04.29.93*END*
+~~~~
+
+which brings to mind the name of one Guillotin, and of one device that
+sharply separates the head(er) from the body -- in our case, the
+embodiment of this device being ye olde text processing tools `grep`,
+`head` and `tail`. Thus we can get some preliminary results by running
+the following ugly but very effective bash snippet:
+
+~~~~ {.bash}
+marker='\*\*\* \?START OF \(THIS\|THE\) PROJECT GUTENBERG EBOOK'
+
+# For each file, attempt guillotine
+# For each file, attempt guillotine
+find "${GUTENDIR}" -name '*.txt' | while read -r f; do
+    fname=$(basename "${f}")
+    dirname=$(dirname "${f}" | sed "s|^${GUTENDIR}/||")
+
+    # Look for end-header marker
+    mloc=$(grep -n -m 1 "${marker}" "${f}")
+    if [ -n "${mloc}" ]; then
+        # If found, say something
+        >&2 echo "${dirname}/${fname} -- found:${mloc}"
+
+        # Copy guillotined file from source to target directory;
+        # comment the lines below to do a dry run.
+        linenum=$(echo "${mloc}" | cut -d":" -f1)
+
+        mkdir -p "${TARGETDIR}/${dirname}"
+
+        >&2 echo "Guillotining ${f} into ${TARGETDIR}/${dirname}/${fname}"
+        tail -n +$((linenum + 1)) "${f}" > "${TARGETDIR}/${dirname}/${fname}"
+    else
+        # If not found, say something else
+        >&2 echo "${dirname}/${fname} -- not-found"
+
+        # Copy file as-is; comment the lines below to do a dry run.
+        mkdir -p "${TARGETDIR}/${dirname}"
+        >&2 echo "Copying ${f} to ${TARGETDIR}/${dirname}/${fname}"
+        cp "${f}" "${TARGETDIR}/${dirname}/${fname}"
+    fi
+done
+~~~~
+
+After giving this a try, we observe that at some point our script fails,
+because for some files the output of `grep` is "Binary file ... matches"
+instead of what we'd expect. The reason for this is that some of the
+files are *not* actually ASCII, so we go and `sed` the letters to their
+ASCII equivalents (e.g. é to e), or, where orcograms cannot be easily
+disposed of, we use some encoding that our grep can recognize.
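That byte-mangling pass can be sketched as follows -- a minimal,
byte-level stand-in for the actual `sed` incantations, assuming latin-1
input; the two character sets below are merely illustrative, to be
grown with whatever other orcograms turn up:

```bash
# Map some latin-1 accented letters to their ASCII equivalents; bytes
# are given in octal (\351 = e-acute, \350 = e-grave, \347 = c-cedilla
# and so on).
deaccent() {
    tr '\351\350\340\342\364\347' 'eeaaoc' < "$1"
}
```

after which `grep` should stop mistaking the cleaned files for
binaries.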
+
+This first episode successfully concluded, we rerun the script, only
+to find that not all headers have been processed, judging by the files
+that land in the "not-found" set. We run once again, setting `$marker`
+to some other relevant value, upon which we observe that some output
+files in `$TARGETDIR` are now empty.
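For instance, the "other relevant value" might target the end of the
older "small print" notices quoted earlier -- the exact regex being my
guess at the variations found in the wild:

```bash
# End-of-"small print" marker as seen in the older etexts; the version
# string in the middle varies, hence the wildcard.
marker='\*END\*THE SMALL PRINT!.*\*END\*'
```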
+
+The reason for this is that not only is there no single "standard"
+copyrast header inserted into each book, but some "headers" are
+actually footers. Thus we use a heuristic to determine whether the end
+marker of a "small print" notice lies at the beginning or at the end
+of a file[^1] and cut accordingly, which leaves us with yet another
+batch of non-standard copyshit snippets.
+
+After repeated pruning, we're left with a few (96, more precisely)
+files that we -- and by "we" I mean I -- checked manually to make sure
+that they're clean, which they were. Hopefully I didn't miss anything;
+please complain if I did.
+
+Bottom line: a total of 46150 files were processed, of which 46028 were
+deheadered, leaving us with 96 books that had no headers to begin with
+(or readmes, addenda to musical scores, etc.) and 26 index files. The
+total size of the headers was a not-so-measly 88.9MB that are now
+forever gone into the void.
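The byte tally can be reproduced with something along these lines -- a
sketch, assuming the original and the guillotined trees both still sit
on disk:

```bash
# Total size of the headers removed: bytes in the original tree minus
# bytes in the guillotined tree.
reclaimed() {
    orig=$(find "$1" -type f -name '*.txt' -exec cat {} + | wc -c)
    clean=$(find "$2" -type f -name '*.txt' -exec cat {} + | wc -c)
    echo $((orig - clean))
}
```

e.g. `reclaimed "${GUTENDIR}" "${TARGETDIR}"`.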
+
+[^1]: For example:
+
+ ~~~~ {.bash}
+    mloc=$(grep -n -m 1 "${marker}" "${f}")
+    tnloc=$(wc -l < "${f}")
+    mnloc=$(echo "${mloc}" | cut -d":" -f1)
+    if [ $((tnloc - mnloc)) -le 10 ]; then
+        echo "at-end"
+    else
+        echo "at-beginning"
+    fi
+ ~~~~
+
+ where the magic value of 10 is conveniently chosen. And so on.
+
+[lmogo.xyz]: http://lmogo.xyz/randomio/gutentext.tar.xz
+[gutenberg-iii]: /posts/y05/085-gutenberg-iii.html
+[btcbase-1894785]: http://btcbase.org/log/2019-02-10#1894785
+[btcbase-1656097]: http://btcbase.org/log/2017-05-15#1656097
+[btcbase-1684170]: http://btcbase.org/log/2017-07-15#1684170