+++ /dev/null
----
-postid: 088
-title: Gutenberg ASCII archive updated, now with 0.4% less junk
-date: March 9, 2019
-author: Lucian Mogoșanu
-tags: tmsr
----
-
-The updated ASCII text archive of Project Gutenberg is available at
-[lucian.mogosanu.ro][lmogo.xyz].
-
-~~~~
-TODO: signed ksum
-~~~~
-
-Read further for details, technical or otherwise.
-
-The differences brought by this version over the
-[previous one][gutenberg-iii] have to do mostly with a lack of crap to
-bleed the reader's eyes, or, to be [more precise][btcbase-1894785]:
-
-> **mircea_popescu**: BingoBoingo and it is VERY HARMFUL fucking
-> junk. having "All donations should be made to "Project Gutenberg/CMU":
-> and are tax deductible to the extent allowable by law. (CMU =
-> Carnegie- Mellon University)." or "Copyright laws are changing all
-> over the world, be sure to check the copyright laws for your country
-> before posting these files!!" in the lede of "The Merchant of Venice
-> by William Shakespeare" promotes a most harmful and in any case
-> uncountenable view whereby the fucktarded usgistan is at least more
-> important than fucking shakespeare.
-> **mircea_popescu**: it very well fucking is not. it's not even
-> remotely as important. having usg.cmu or usg.anything-else spew on
-> actual literature is nothing short of vandalism. i don't want their
-> grafitti, and i don't care why they think they're owed it.
-> **mircea_popescu**: this without even going into ridiculous nonsense a
-> la "We produce about two million dollars for each hour we work. The
-> time it takes us, a rather conservative estimate, is fifty hours to
-> get any etext selected, entered, proofread, edited, copyright searched
-> and analyzed, the copyright letters written, etc. This projected
-> audience is one hundred million readers. If our value per text is
-> nominally estimated at one dollar then we produce $2 million dollars
-> per hour" ; apparently nobody fucking there bothered to EVER confront
-> [http://btcbase.org/log/2017-05-15#1656097][btcbase-1656097] or
-> [http://btcbase.org/log/2017-07-15#1684170][btcbase-1684170] etcetera.
-
-Taking a cursory look at some of the books, one can immediately notice a
-message along the lines of:
-
-~~~~
-*** START OF THIS PROJECT GUTENBERG EBOOK yadda-yadda ***
-
---- or:
-
-***START**THE SMALL PRINT!**FOR PUBLIC DOMAIN EBOOKS**START***
-... legalese, followed by:
-*END*THE SMALL PRINT! FOR PUBLIC DOMAIN ETEXTS*Ver.04.29.93*END*
-~~~~
-
-which brings into our mind the name of one Guillotin, and of one device
-that sharply separates the head(er) from the body -- in our case, the
-embodiment of this device being ye olde text processing tools `grep`,
-`head` and `tail`. Thus, we can get some preliminary results by running
-the following ugly but very effective bash snippet:
-
-~~~~ {.bash}
-marker='\*\*\* \?START OF \(THIS\|THE\) PROJECT GUTENBERG EBOOK'
-
-# For each file, attempt guillotine
-find ${GUTENDIR} -name '*.txt' | while read f; do
- fname=$(basename ${f})
- dirname=$(dirname ${f} | sed "s/^${GUTENDIR}\///g")
-
- # Look for end-header marker
- mloc=$(grep -n -m 1 "$marker" $f)
- if [ ! -z "${mloc}" ]; then
- # If found, say something
- >&2 echo "$dirname/$fname -- found:$mloc"
-
- # Copy guillotined file from source to target directory;
- # comment the lines below to do a dry run.
- linenum=$(echo $mloc | cut -d":" -f1)
-
- mkdir -p ${TARGETDIR}/${dirname}
-
- >&2 echo "Guillotining ${f} into ${TARGETDIR}/${dirname}/${fname}"
- tail -n +$(($linenum + 1)) ${f} > ${TARGETDIR}/${dirname}/${fname}
- else
- # If not found, say something else
- >&2 echo "$dirname/$fname -- not-found"
-
- # Copy file as-is; comment the lines below to do a dry run.
- mkdir -p ${TARGETDIR}/${dirname}
- >&2 echo "Copying ${f} to ${TARGETDIR}/${dirname}/${fname}"
- cp ${f} ${TARGETDIR}/${dirname}/${fname}
- fi
-done
-~~~~
-
-After giving this a try, we observe that at some point our script fails,
-because for some files the output of `grep` is "Binary file ... matches"
-instead of what we'd expect. The reason for this is that some of the
-files are *not* actually ASCII, so we go and `sed` the letters to their
-ASCII equivalents (e.g. é to e), or, where orcograms cannot be easily
-disposed of, we use some encoding that our grep can recognize.
-
-This first episode successfully ended, we rerun the script, finding out
-that not all headers have been processed, judging by the fact that
-they're in the "not-found" set. We run once again, setting `$marker` to
-some other relevant value, and upon this we observe that some output
-files in `$TARGETDIR` are now empty.
-
-The reason for this is that not only there isn't a single "standard"
-copyrast header inserted into each book, but that some "headers" are
-actually footers. Thus we use a heuristic to determine whether the end
-marker of a "small print" notice is at the beginning or the end of a
-file[^1] and we cut accordingly, leaving us with another batch of
-non-standard copyshit snippets.
-
-After repeated pruning, we're left with a few (96, more precisely)
-files that we -- by which "we" mean I -- checked manually to make sure
-that they're clean, which they were. Hopefully I didn't miss anything,
-please complain if I did.
-
-Bottom line: a total of 46150 files were processed, of which 46028 were
-deheadered, leaving us with 96 books that had no headers to begin with
-(or readmes, addenda to musical scores, etc.) and 26 index files. The
-total size of the headers was a not-so-measly 88.9MB that are now
-forever gone into the void.
-
-[^1]: For example:
-
- ~~~~ {.bash}
- mloc=$(grep -n -m 1 "$marker" $f)
- tnloc=$(wc -l $f | cut -d" " -f1)
- mnloc=$(echo $mloc | cut -d":" -f1)
- if [ $(($tnloc - $mnloc)) -le 10 ]; then
- echo "at-end"
- else
- echo "at-beginning"
- fi
- ~~~~
-
- where the magic value of 10 is conveniently chosen. And so on.
-
-[lmogo.xyz]: http://lmogo.xyz/randomio/gutentext.tar.xz
-[gutenberg-iii]: /posts/y05/085-gutenberg-iii.html
-[btcbase-1894785]: http://btcbase.org/log/2019-02-10#1894785
-[btcbase-1656097]: http://btcbase.org/log/2017-05-15#1656097
-[btcbase-1684170]: http://btcbase.org/log/2017-07-15#1684170
--- /dev/null
+---
+postid: 088
+title: Gutenberg ASCII archive updated, now with 0.5% less junk
+date: March 9, 2019
+author: Lucian Mogoșanu
+tags: tmsr
+---
+
+The updated ASCII text archive of Project Gutenberg is available at
+[lucian.mogosanu.ro][lmogo.xyz].
+
+~~~~
+-----BEGIN PGP SIGNED MESSAGE-----
+Hash: SHA512
+
+$ ksum gutentext.tar.xz > gutentext.tar.xz.ksum
+$ cat gutentext.tar.xz.ksum
+e0e3bbc7677365f8503a5a10c7e1a3ab28864dd7c8e87d1aace7b11c4985b4dd3383dfca616ee96ad0cc1f1d36312b0c7ba544a29189ed2d2ea36cb13a687df9 gutentext.tar.xz
+-----BEGIN PGP SIGNATURE-----
+
+iQIcBAEBCgAGBQJcg7CjAAoJEL2unQUaPTuVMwIQAJV7J8JeHuiU6ZnqXSAdO4aC
+n7nzs4mchEGEaGXttTMFnunvaZ5GgpejGB/puGwYKjhUXPcTdoGgPMowLJV4F4Y4
+Ispev9b6K2/7AIDTbAZEI+rqkf1aE4sJob68KjBQjrOFgBNbgEvCHIjdTY7x/zbp
+Jz19yo06/31E8TUUMDTW2BSwPC4gzAK15OBvSwjE6fUJVIt/ffMs1y/HX++09jO0
+H/1bYEdQ9WOSGxHkO7siSaQa3uKyW6K7Le3XPK+bp4XGJX4z0k7ZNAOC7Ard8Izl
+uLJi4ROtOV+UqLv7oR2cPXOgSCNEWnnqNxyRopHOesUB1rdboGylYmTC/z49qXS2
+nqCmzvu9xCxApkhv6oxf9swhSTpw/2c6ioP5Ze/LBWEoVUe3l8EWtc9TIAUXpXu7
+Bfk6XUhRFGpLC46Y7MG8Bj3bOmypu5lH3ksgo5QaUkcVecRpOYj6Mp2IWHlvNgvp
+cnd0iuiIaK24rIl74elEvi2xyN3W8IwGtYhv8CBwYI1rvnsDcJrWD7xvQGVwf/SD
+PiklJ/IM2Dev4AJubnT0U3N5xdo7mXVBhzu3Ky4qjJiRl1CYcdVNHxPRxT8XHqay
+KzLvY6NgfFrdpL5uGRop9F5qi0Ax1dNlg4u/6oqd7ryKA6g5X8nuQi7IHMc4B3J+
+pGobf0U4ao162tm9jyBV
+=Jy3B
+-----END PGP SIGNATURE-----
+~~~~
+
+Read further for details, technical or otherwise.
+
+The differences brought by this version over the
+[previous one][gutenberg-iii] have to do mostly with a lack of crap to
+bleed the reader's eyes, or, to be [more precise][btcbase-1894785]:
+
+> **mircea_popescu**: BingoBoingo and it is VERY HARMFUL fucking
+> junk. having "All donations should be made to "Project Gutenberg/CMU":
+> and are tax deductible to the extent allowable by law. (CMU =
+> Carnegie- Mellon University)." or "Copyright laws are changing all
+> over the world, be sure to check the copyright laws for your country
+> before posting these files!!" in the lede of "The Merchant of Venice
+> by William Shakespeare" promotes a most harmful and in any case
+> uncountenable view whereby the fucktarded usgistan is at least more
+> important than fucking shakespeare.
+> **mircea_popescu**: it very well fucking is not. it's not even
+> remotely as important. having usg.cmu or usg.anything-else spew on
+> actual literature is nothing short of vandalism. i don't want their
+> grafitti, and i don't care why they think they're owed it.
+> **mircea_popescu**: this without even going into ridiculous nonsense a
+> la "We produce about two million dollars for each hour we work. The
+> time it takes us, a rather conservative estimate, is fifty hours to
+> get any etext selected, entered, proofread, edited, copyright searched
+> and analyzed, the copyright letters written, etc. This projected
+> audience is one hundred million readers. If our value per text is
+> nominally estimated at one dollar then we produce $2 million dollars
+> per hour" ; apparently nobody fucking there bothered to EVER confront
+> [http://btcbase.org/log/2017-05-15#1656097][btcbase-1656097] or
+> [http://btcbase.org/log/2017-07-15#1684170][btcbase-1684170] etcetera.
+
+Taking a cursory look at some of the books, one can immediately notice a
+message along the lines of:
+
+~~~~
+*** START OF THIS PROJECT GUTENBERG EBOOK yadda-yadda ***
+~~~~
+
+or maybe:
+
+~~~~
+***START**THE SMALL PRINT!**FOR PUBLIC DOMAIN EBOOKS**START***
+... screens of legalese, followed by:
+*END*THE SMALL PRINT! FOR PUBLIC DOMAIN ETEXTS*Ver.04.29.93*END*
+~~~~
+
+which brings to mind the name of one Guillotin, and of one device
+that sharply separates the head(er) from the body -- in our case, the
+embodiment of this device being ye olde text processing tools `grep`,
+`head` and `tail`. Thus, we can get some preliminary results by running
+the following ugly but very effective bash snippet:
+
+~~~~ {.bash}
+marker='\*\*\* \?START OF \(THIS\|THE\) PROJECT GUTENBERG EBOOK'
+
+# For each file, attempt guillotine
+find ${GUTENDIR} -name '*.txt' | while read -r f; do
+ fname=$(basename ${f})
+ dirname=$(dirname ${f} | sed "s/^${GUTENDIR}\///g")
+
+ # Look for end-header marker
+ mloc=$(grep -n -m 1 "$marker" $f)
+ if [ ! -z "${mloc}" ]; then
+ # If found, say something
+ >&2 echo "$dirname/$fname -- found:$mloc"
+
+ # Copy guillotined file from source to target directory;
+ # comment the lines below to do a dry run.
+ linenum=$(echo $mloc | cut -d":" -f1)
+
+ mkdir -p ${TARGETDIR}/${dirname}
+
+ >&2 echo "Guillotining ${f} into ${TARGETDIR}/${dirname}/${fname}"
+ tail -n +$(($linenum + 1)) ${f} > ${TARGETDIR}/${dirname}/${fname}
+ else
+ # If not found, say something else
+ >&2 echo "$dirname/$fname -- not-found"
+
+ # Copy file as-is; comment the lines below to do a dry run.
+ mkdir -p ${TARGETDIR}/${dirname}
+ >&2 echo "Copying ${f} to ${TARGETDIR}/${dirname}/${fname}"
+ cp ${f} ${TARGETDIR}/${dirname}/${fname}
+ fi
+done
+~~~~
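The marker, note, is a basic regular expression relying on GNU grep's `\?` and `\|` extensions; exercising it on a tiny fabricated sample (file name and book title invented for illustration) shows the guillotine in action:

```shell
# Exercise the end-of-header marker on a fabricated three-line sample.
marker='\*\*\* \?START OF \(THIS\|THE\) PROJECT GUTENBERG EBOOK'
printf 'front matter\n*** START OF THE PROJECT GUTENBERG EBOOK Hamlet ***\nbody\n' > sample.txt
mloc=$(grep -n -m 1 "$marker" sample.txt)
linenum=$(echo $mloc | cut -d":" -f1)           # marker sits on line 2
tail -n +$(($linenum + 1)) sample.txt           # prints just "body"
```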
+
+After giving this a try, we observe that at some point our script fails,
+because for some files the output of `grep` is "Binary file ... matches"
+instead of what we'd expect. The reason for this is that some of the
+files are *not* actually ASCII, so we go and `sed` the letters to their
+ASCII equivalents (e.g. é to e), or, where orcograms cannot be easily
+disposed of, we use some encoding that our grep can recognize.
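The de-orcification itself is nothing fancy; a minimal sketch (file names and the letter list invented, the actual substitution table being rather longer):

```shell
# Replace a few non-ASCII letters with their ASCII equivalents by hand;
# where a sane iconv is available, ASCII//TRANSLIT does the job wholesale.
printf 'café détente naïve\n' > orcful.txt
sed -e 's/é/e/g' -e 's/ï/i/g' orcful.txt > deorced.txt
cat deorced.txt                                 # cafe detente naive
```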
+
+This first episode successfully concluded, we rerun the script, only to
+find that not all headers have been processed, judging by the files
+that land in the "not-found" set. So we run once again, setting
+`$marker` to some other relevant value, upon which we observe that some
+output files in `$TARGETDIR` are now empty.
+
+The reason for this is that not only is there no single "standard"
+copyrast header inserted into each book, but some "headers" are
+actually footers. Thus we use a heuristic to determine whether the end
+marker of a "small print" notice sits at the beginning or the end of a
+file[^1] and we cut accordingly, leaving us with another batch of
+non-standard copyshit snippets.
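In miniature, the failure mode and its fix look as follows (file name and marker fabricated for illustration): a footer matched as if it were a header makes `tail` discard the entire book, while for the at-end case the cut is the mirror image, with `head`:

```shell
# A footer mistaken for a header: tail past the marker leaves nothing.
printf 'body 1\nbody 2\n*END*THE SMALL PRINT!*END*\n' > footered.txt
mnloc=$(grep -n -m 1 'SMALL PRINT' footered.txt | cut -d":" -f1)
tail -n +$(($mnloc + 1)) footered.txt | wc -c   # 0 bytes: the book is gone

# The mirror-image cut keeps everything above the marker instead.
head -n $(($mnloc - 1)) footered.txt            # body 1, body 2
```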
+
+After repeated pruning, we're left with a few (96, more precisely)
+files that we -- by which "we" mean I -- checked manually to make sure
+that they're clean, which they were. Hopefully I didn't miss anything,
+please complain if I did.
+
+Bottom line: a total of 46150 files were processed, of which 46028 were
+deheadered, leaving us with 96 books that had no headers to begin with
+(or readmes, addenda to musical scores, etc.) and 26 index files. The
+total size of the headers was a not-so-measly 88.9MB that are now
+forever gone into the void.
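The arithmetic tallies, for the suspicious:

```shell
# 46028 guillotined + 96 headerless books + 26 index files = 46150 total
echo $((46028 + 96 + 26))                       # 46150
```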
+
+[^1]: For example:
+
+ ~~~~ {.bash}
+ mloc=$(grep -n -m 1 "$marker" $f)
+ tnloc=$(wc -l $f | cut -d" " -f1)
+ mnloc=$(echo $mloc | cut -d":" -f1)
+ if [ $(($tnloc - $mnloc)) -le 10 ]; then
+ echo "at-end"
+ else
+ echo "at-beginning"
+ fi
+ ~~~~
+
+ where the magic value of 10 is conveniently chosen. And so on.
+
+[lmogo.xyz]: http://lmogo.xyz/randomio/gutentext.tar.xz
+[gutenberg-iii]: /posts/y05/085-gutenberg-iii.html
+[btcbase-1894785]: http://btcbase.org/log/2019-02-10#1894785
+[btcbase-1656097]: http://btcbase.org/log/2017-05-15#1656097
+[btcbase-1684170]: http://btcbase.org/log/2017-07-15#1684170