From: Lucian Mogosanu Date: Sat, 9 Mar 2019 12:52:49 +0000 (+0200) Subject: posts: 088 X-Git-Tag: v0.11~83 X-Git-Url: https://git.mogosanu.ro/?a=commitdiff_plain;h=692befcfe6d4efb488ca55fa52c11108c4f71142;p=thetarpit.git posts: 088 --- diff --git a/drafts/088-gutenberg-iv.markdown b/drafts/088-gutenberg-iv.markdown deleted file mode 100644 index f5e8009..0000000 --- a/drafts/088-gutenberg-iv.markdown +++ /dev/null @@ -1,149 +0,0 @@ ---- -postid: 088 -title: Gutenberg ASCII archive updated, now with 0.4% less junk -date: March 9, 2019 -author: Lucian Mogoșanu -tags: tmsr ---- - -The updated ASCII text archive of Project Gutenberg is available at -[lucian.mogosanu.ro][lmogo.xyz]. - -~~~~ -TODO: signed ksum -~~~~ - -Read further for details, technical or otherwise. - -The differences brought by this version over the -[previous one][gutenberg-iii] have to do mostly with a lack of crap to -bleed the reader's eyes, or, to be [more precise][btcbase-1894785]: - -> **mircea_popescu**: BingoBoingo and it is VERY HARMFUL fucking -> junk. having "All donations should be made to "Project Gutenberg/CMU": -> and are tax deductible to the extent allowable by law. (CMU = -> Carnegie- Mellon University)." or "Copyright laws are changing all -> over the world, be sure to check the copyright laws for your country -> before posting these files!!" in the lede of "The Merchant of Venice -> by William Shakespeare" promotes a most harmful and in any case -> uncountenable view whereby the fucktarded usgistan is at least more -> important than fucking shakespeare. -> **mircea_popescu**: it very well fucking is not. it's not even -> remotely as important. having usg.cmu or usg.anything-else spew on -> actual literature is nothing short of vandalism. i don't want their -> grafitti, and i don't care why they think they're owed it. -> **mircea_popescu**: this without even going into ridiculous nonsense a -> la "We produce about two million dollars for each hour we work. 
The -> time it takes us, a rather conservative estimate, is fifty hours to -> get any etext selected, entered, proofread, edited, copyright searched -> and analyzed, the copyright letters written, etc. This projected -> audience is one hundred million readers. If our value per text is -> nominally estimated at one dollar then we produce $2 million dollars -> per hour" ; apparently nobody fucking there bothered to EVER confront -> [http://btcbase.org/log/2017-05-15#1656097][btcbase-1656097] or -> [http://btcbase.org/log/2017-07-15#1684170][btcbase-1684170] etcetera. - -Taking a cursory look at some of the books, one can immediately notice a -message along the lines of: - -~~~~ -*** START OF THIS PROJECT GUTENBERG EBOOK yadda-yadda *** - ---- or: - -***START**THE SMALL PRINT!**FOR PUBLIC DOMAIN EBOOKS**START*** -... legalese, followed by: -*END*THE SMALL PRINT! FOR PUBLIC DOMAIN ETEXTS*Ver.04.29.93*END* -~~~~ - -which brings into our mind the name of one Guillotin, and of one device -that sharply separates the head(er) from the body -- in our case, the -embodiment of this device being ye olde text processing tools `grep`, -`head` and `tail`. Thus, we can get some preliminary results by running -the following ugly but very effective bash snippet: - -~~~~ {.bash} -marker='\*\*\* \?START OF \(THIS\|THE\) PROJECT GUTENBERG EBOOK' - -# For each file, attempt guillotine -find ${GUTENDIR} -name '*.txt' | while read f; do - fname=$(basename ${f}) - dirname=$(dirname ${f} | sed "s/^${GUTENDIR}\///g") - - # Look for end-header marker - mloc=$(grep -n -m 1 "$marker" $f) - if [ ! -z "${mloc}" ]; then - # If found, say something - >&2 echo "$dirname/$fname -- found:$mloc" - - # Copy guillotined file from source to target directory; - # comment the lines below to do a dry run. 
- linenum=$(echo $mloc | cut -d":" -f1) - - mkdir -p ${TARGETDIR}/${dirname} - - >&2 echo "Guillotining ${f} into ${TARGETDIR}/${dirname}/${fname}" - tail -n +$(($linenum + 1)) ${f} > ${TARGETDIR}/${dirname}/${fname} - else - # If not found, say something else - >&2 echo "$dirname/$fname -- not-found" - - # Copy file as-is; comment the lines below to do a dry run. - mkdir -p ${TARGETDIR}/${dirname} - >&2 echo "Copying ${f} to ${TARGETDIR}/${dirname}/${fname}" - cp ${f} ${TARGETDIR}/${dirname}/${fname} - fi -done -~~~~ - -After giving this a try, we observe that at some point our script fails, -because for some files the output of `grep` is "Binary file ... matches" -instead of what we'd expect. The reason for this is that some of the -files are *not* actually ASCII, so we go and `sed` the letters to their -ASCII equivalents (e.g. é to e), or, where orcograms cannot be easily -disposed of, we use some encoding that our grep can recognize. - -This first episode successfully ended, we rerun the script, finding out -that not all headers have been processed, judging by the fact that -they're in the "not-found" set. We run once again, setting `$marker` to -some other relevant value, and upon this we observe that some output -files in `$TARGETDIR` are now empty. - -The reason for this is that not only there isn't a single "standard" -copyrast header inserted into each book, but that some "headers" are -actually footers. Thus we use a heuristic to determine whether the end -marker of a "small print" notice is at the beginning or the end of a -file[^1] and we cut accordingly, leaving us with another batch of -non-standard copyshit snippets. - -After repeated pruning, we're left with a few (96, more precisely) -files that we -- by which "we" mean I -- checked manually to make sure -that they're clean, which they were. Hopefully I didn't miss anything, -please complain if I did. 
- -Bottom line: a total of 46150 files were processed, of which 46028 were -deheadered, leaving us with 96 books that had no headers to begin with -(or readmes, addenda to musical scores, etc.) and 26 index files. The -total size of the headers was a not-so-measly 88.9MB that are now -forever gone into the void. - -[^1]: For example: - - ~~~~ {.bash} - mloc=$(grep -n -m 1 "$marker" $f) - tnloc=$(wc -l $f | cut -d" " -f1) - mnloc=$(echo $mloc | cut -d":" -f1) - if [ $(($tnloc - $mnloc)) -le 10 ]; then - echo "at-end" - else - echo "at-beginning" - fi - ~~~~ - - where the magic value of 10 is conveniently chosen. And so on. - -[lmogo.xyz]: http://lmogo.xyz/randomio/gutentext.tar.xz -[gutenberg-iii]: /posts/y05/085-gutenberg-iii.html -[btcbase-1894785]: http://btcbase.org/log/2019-02-10#1894785 -[btcbase-1656097]: http://btcbase.org/log/2017-05-15#1656097 -[btcbase-1684170]: http://btcbase.org/log/2017-07-15#1684170 diff --git a/posts/y05/088-gutenberg-iv.markdown b/posts/y05/088-gutenberg-iv.markdown new file mode 100644 index 0000000..c879a0d --- /dev/null +++ b/posts/y05/088-gutenberg-iv.markdown @@ -0,0 +1,172 @@ +--- +postid: 088 +title: Gutenberg ASCII archive updated, now with 0.5% less junk +date: March 9, 2019 +author: Lucian Mogoșanu +tags: tmsr +--- + +The updated ASCII text archive of Project Gutenberg is available at +[lucian.mogosanu.ro][lmogo.xyz]. 
+ +~~~~ +-----BEGIN PGP SIGNED MESSAGE----- +Hash: SHA512 + +$ ksum gutentext.tar.xz > gutentext.tar.xz.ksum +$ cat gutentext.tar.xz.ksum +e0e3bbc7677365f8503a5a10c7e1a3ab28864dd7c8e87d1aace7b11c4985b4dd3383dfca616ee96ad0cc1f1d36312b0c7ba544a29189ed2d2ea36cb13a687df9 gutentext.tar.xz +-----BEGIN PGP SIGNATURE----- + +iQIcBAEBCgAGBQJcg7CjAAoJEL2unQUaPTuVMwIQAJV7J8JeHuiU6ZnqXSAdO4aC +n7nzs4mchEGEaGXttTMFnunvaZ5GgpejGB/puGwYKjhUXPcTdoGgPMowLJV4F4Y4 +Ispev9b6K2/7AIDTbAZEI+rqkf1aE4sJob68KjBQjrOFgBNbgEvCHIjdTY7x/zbp +Jz19yo06/31E8TUUMDTW2BSwPC4gzAK15OBvSwjE6fUJVIt/ffMs1y/HX++09jO0 +H/1bYEdQ9WOSGxHkO7siSaQa3uKyW6K7Le3XPK+bp4XGJX4z0k7ZNAOC7Ard8Izl +uLJi4ROtOV+UqLv7oR2cPXOgSCNEWnnqNxyRopHOesUB1rdboGylYmTC/z49qXS2 +nqCmzvu9xCxApkhv6oxf9swhSTpw/2c6ioP5Ze/LBWEoVUe3l8EWtc9TIAUXpXu7 +Bfk6XUhRFGpLC46Y7MG8Bj3bOmypu5lH3ksgo5QaUkcVecRpOYj6Mp2IWHlvNgvp +cnd0iuiIaK24rIl74elEvi2xyN3W8IwGtYhv8CBwYI1rvnsDcJrWD7xvQGVwf/SD +PiklJ/IM2Dev4AJubnT0U3N5xdo7mXVBhzu3Ky4qjJiRl1CYcdVNHxPRxT8XHqay +KzLvY6NgfFrdpL5uGRop9F5qi0Ax1dNlg4u/6oqd7ryKA6g5X8nuQi7IHMc4B3J+ +pGobf0U4ao162tm9jyBV +=Jy3B +-----END PGP SIGNATURE----- +~~~~ + +Read further for details, technical or otherwise. + +The differences brought by this version over the +[previous one][gutenberg-iii] have to do mostly with a lack of crap to +bleed the reader's eyes, or, to be [more precise][btcbase-1894785]: + +> **mircea_popescu**: BingoBoingo and it is VERY HARMFUL fucking +> junk. having "All donations should be made to "Project Gutenberg/CMU": +> and are tax deductible to the extent allowable by law. (CMU = +> Carnegie- Mellon University)." or "Copyright laws are changing all +> over the world, be sure to check the copyright laws for your country +> before posting these files!!" in the lede of "The Merchant of Venice +> by William Shakespeare" promotes a most harmful and in any case +> uncountenable view whereby the fucktarded usgistan is at least more +> important than fucking shakespeare. 
+> **mircea_popescu**: it very well fucking is not. it's not even +> remotely as important. having usg.cmu or usg.anything-else spew on +> actual literature is nothing short of vandalism. i don't want their +> grafitti, and i don't care why they think they're owed it. +> **mircea_popescu**: this without even going into ridiculous nonsense a +> la "We produce about two million dollars for each hour we work. The +> time it takes us, a rather conservative estimate, is fifty hours to +> get any etext selected, entered, proofread, edited, copyright searched +> and analyzed, the copyright letters written, etc. This projected +> audience is one hundred million readers. If our value per text is +> nominally estimated at one dollar then we produce $2 million dollars +> per hour" ; apparently nobody fucking there bothered to EVER confront +> [http://btcbase.org/log/2017-05-15#1656097][btcbase-1656097] or +> [http://btcbase.org/log/2017-07-15#1684170][btcbase-1684170] etcetera. + +Taking a cursory look at some of the books, one can immediately notice a +message along the lines of: + +~~~~ +*** START OF THIS PROJECT GUTENBERG EBOOK yadda-yadda *** +~~~~ + +or maybe: + +~~~~ +***START**THE SMALL PRINT!**FOR PUBLIC DOMAIN EBOOKS**START*** +... screens of legalese, followed by: +*END*THE SMALL PRINT! FOR PUBLIC DOMAIN ETEXTS*Ver.04.29.93*END* +~~~~ + +which brings into our mind the name of one Guillotin, and of one device +that sharply separates the head(er) from the body -- in our case, the +embodiment of this device being ye olde text processing tools `grep`, +`head` and `tail`. 
We can thusly get some preliminary results by running +the following ugly but very effective bash snippet: + +~~~~ {.bash} +marker='\*\*\* \?START OF \(THIS\|THE\) PROJECT GUTENBERG EBOOK' + +# For each file, attempt guillotine +find ${GUTENDIR} -name '*.txt' | while read f; do + fname=$(basename ${f}) + dirname=$(dirname ${f} | sed "s/^${GUTENDIR}\///g") + + # Look for end-header marker + mloc=$(grep -n -m 1 "$marker" $f) + if [ ! -z "${mloc}" ]; then + # If found, say something + >&2 echo "$dirname/$fname -- found:$mloc" + + # Copy guillotined file from source to target directory; + # comment the lines below to do a dry run. + linenum=$(echo $mloc | cut -d":" -f1) + + mkdir -p ${TARGETDIR}/${dirname} + + >&2 echo "Guillotining ${f} into ${TARGETDIR}/${dirname}/${fname}" + tail -n +$(($linenum + 1)) ${f} > ${TARGETDIR}/${dirname}/${fname} + else + # If not found, say something else + >&2 echo "$dirname/$fname -- not-found" + + # Copy file as-is; comment the lines below to do a dry run. + mkdir -p ${TARGETDIR}/${dirname} + >&2 echo "Copying ${f} to ${TARGETDIR}/${dirname}/${fname}" + cp ${f} ${TARGETDIR}/${dirname}/${fname} + fi +done +~~~~ + +After giving this a try, we observe that at some point our script fails, +because for some files the output of `grep` is "Binary file ... matches" +instead of what we'd expect. The reason for this is that some of the +files are *not* actually ASCII, so we go and `sed` the letters to their +ASCII equivalents (e.g. é to e), or, where orcograms cannot be easily +disposed of, we use some encoding that our grep can recognize. + +This first episode successfully ended, we rerun the script, finding out +that not all headers have been processed, judging by the fact that +they're in the "not-found" set. We run once again, setting `$marker` to +some other relevant value, and upon this we observe that some output +files in `$TARGETDIR` are now empty. 
+ +The reason for this is that not only there isn't a single "standard" +copyrast header inserted into each book, but that some "headers" are +actually footers. Thus we use a heuristic to determine whether the end +marker of a "small print" notice is at the beginning or the end of a +file[^1] and we cut accordingly, leaving us with another batch of +non-standard copyshit snippets. + +After repeated pruning, we're left with a few (96, more precisely) +files that we -- by which "we" mean I -- checked manually to make sure +that they're clean, which they were. Hopefully I didn't miss anything, +please complain if I did. + +Bottom line: a total of 46150 files were processed, of which 46028 were +deheadered, leaving us with 96 books that had no headers to begin with +(or readmes, addenda to musical scores, etc.) and 26 index files. The +total size of the headers was a not-so-measly 88.9MB that are now +forever gone into the void. + +[^1]: For example: + + ~~~~ {.bash} + mloc=$(grep -n -m 1 "$marker" $f) + tnloc=$(wc -l $f | cut -d" " -f1) + mnloc=$(echo $mloc | cut -d":" -f1) + if [ $(($tnloc - $mnloc)) -le 10 ]; then + echo "at-end" + else + echo "at-beginning" + fi + ~~~~ + + where the magic value of 10 is conveniently chosen. And so on. + +[lmogo.xyz]: http://lmogo.xyz/randomio/gutentext.tar.xz +[gutenberg-iii]: /posts/y05/085-gutenberg-iii.html +[btcbase-1894785]: http://btcbase.org/log/2019-02-10#1894785 +[btcbase-1656097]: http://btcbase.org/log/2017-05-15#1656097 +[btcbase-1684170]: http://btcbase.org/log/2017-07-15#1684170
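
The "Binary file ... matches" failure described in the post can probably also be sidestepped without touching the source files. What follows is a minimal sketch, not the method the post actually used: it assumes GNU grep, whose `-a` option treats every input as text, and it wraps the cut in a hypothetical helper named `guillotine` that does not appear in the original script. Variables are quoted so that paths containing spaces survive.

```bash
#!/usr/bin/env bash
# Same marker as in the post; \? and \| are GNU basic-regex extensions.
marker='\*\*\* \?START OF \(THIS\|THE\) PROJECT GUTENBERG EBOOK'

# guillotine SRC DST (hypothetical helper, introduced only for illustration)
# Writes SRC to DST without the header, i.e. without everything up to and
# including the first line that matches $marker. If no marker is found,
# SRC is copied unchanged.
guillotine() {
    local src=$1 dst=$2 linenum

    # -a forces grep to treat the file as text, so a stray non-ASCII byte
    # gives the usual "N:line" output instead of "Binary file ... matches".
    linenum=$(grep -a -n -m 1 "$marker" "$src" | cut -d: -f1)

    if [ -n "$linenum" ]; then
        # Keep only what follows the marker line
        tail -n +"$((linenum + 1))" "$src" > "$dst"
    else
        # No header found: copy the file as-is
        cp "$src" "$dst"
    fi
}
```

Inside the loop from the post this would be called as `guillotine "$f" "${TARGETDIR}/${dirname}/${fname}"`, after the `mkdir -p`. Note that `-a` only changes how grep reports the match; the non-ASCII bytes stay in the output, so files that are meant to end up as pure ASCII still need the `sed` pass.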