[texhax] Radical Philosophy html file ref needs correction
Daniel Nemenyi
daniel at pompo.co
Mon Apr 30 11:41:43 CEST 2018
Hello Carlos!
Carlos writes:
> I was just browsing the texhax archives. There was a thread about a
> publication named Radical Philosophy.
I'm the person behind Radical Philosophy's migration to LaTeX. Nice to
see it discussed again on texhax :)
> There's a disclosure on the page on the article that "The following text
> has been automatically reproduced by an Optical Character Recognition
> (OCR) algorithm. It may not have been checked over by human eyes. For
> matters of precision please consult the original pdf."
>
> The article I'm referring to is at https://www.radicalphilosophy.com/article/a-monument-to-the-unknown-worker
>
> But even the source shows a
>
> <a href="#ref- -b" id="ref- -a" class="reflink body">[ ]</a>
>
> which throws off the footnotes. And don't we like seeing footnotes on
> all TeX produced materials? Eh? hehe.
Actually what happened was that we built a script to extract html from
the original PDFs of our 45-year-old archive, rather than recycling the
html we had already made for some of them. So the old html from the
archive site with the "#ref-" style footnotes was not the source.
The PDFs of Radical Philosophy were created in InDesign from the 1990s
until we moved to LaTeX, and before that by god knows what -- the early
ones from the 1970s were typewriter, scissors and glue jobs. For this
migration we used pdftohtml to extract html from the InDesign PDFs, and
pdftotext for the rest. We couldn't work out how to prevent the output
of pdftohtml from being filled with noise and excessive html tags so we
used... a lot of SED to clear things up! Probably should have taken a
structured XML parsing approach, but anyway. I enjoy the Rubik's Cube
quality of SED. And we added things like reference and Wayback Machine
links.
The quality of it all... varies... some of it is fine, some of it is
fine for a search engine but not a human. The pdftotext output
especially isn't very good. Maybe we should re-OCR the whole lot and try
again, but probably we'll correct things manually as we go along. So for
all of these items we put up that warning.
As for the new content produced by LaTeX, that of course converts really
nicely into html. Via a wrapper on the admin panel of our WordPress
site, we use pandoc to convert our tex files to html, and we use a bit
of SED to standardise our tex files before submitting them to pandoc,
since pandoc can be quite sensitive:
// In a PHP double-quoted string, \\\\ reaches sed as \\, i.e. one literal backslash.
exec("sed -i " . $filenametmp . " -e 's/\\\\includegraphics\[[htbH]*\]/\\\\includegraphics/g;' \
-e 's/\\\\begin{figure\*}\[[^]]*\]//g;' \
-e 's/\\\\end{figure\*}//g;' ");
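For anyone who wants to try these expressions outside PHP, here is a
minimal sketch of the same pre-processing as a plain shell function
reading from stdin. The function name normalise_tex and the sample
input are made up for illustration, and the bracket expressions are
written as [htbH]* and [^]]* on the assumption that the intent is to
drop a whole placement option such as [htb].

```shell
#!/bin/sh
# Hypothetical helper (not the site's actual code): tidy LaTeX on stdin
# before it is handed to pandoc.
normalise_tex() {
  # 1. strip a placement-style option from \includegraphics
  # 2. drop \begin{figure*}[...] entirely
  # 3. drop \end{figure*}
  sed -e 's/\\includegraphics\[[htbH]*\]/\\includegraphics/g' \
      -e 's/\\begin{figure\*}\[[^]]*\]//g' \
      -e 's/\\end{figure\*}//g'
}

# Made-up sample input, one LaTeX line per argument.
printf '%s\n' \
  '\begin{figure*}[htb]' \
  '\includegraphics[H]{cover.png}' \
  '\end{figure*}' | normalise_tex
```

The result could then be piped into something like
"pandoc -f latex -t html", though the exact invocation used on the site
is not given in this message.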
And some more SED once it's out the other end:
exec("sed -i " . $htmlfilenametmp . " -e 's/\[[ht]*\]//g;' \
-e 's/<hr \/>/<h2 class=\"notes\">Notes<\/h2>/g;' \
-e 's/<span>2<\/span>//g;' \
-e 's/<p><\/p>//g;' \
-e 's/<li id=\"fn/<li class=\"footnote\" id=\"fn/' \
-e 's/class=\"emoji\"/class=\"reflink reffoot\"/g;' \
-e 's/<a href=\"#fn/ <a href=\"#fn/g;' \
-e 's/↩/^/g;' \
-e 's/>^<\/a>/ class=\"reffoot footnoteLink\">^<\/a>/g' \
-e 's/><sup>/>/g;' \
-e 's/<\/sup></</g;' \
-e 's/class=\"footnoteRef\"/class=\"footnoteRef footnoteLink\"/g;' \
-e 's/<h2/<h3/g;' \
-e 's/<\/h2>/<\/h3>/g;' \
-e 's/<img /<!--<img /g;' \
-e 's/ height=\"[0-9]*\"//g;' \
-e 's/ width=\"[0-9]*\"//g;' \
-e 's/ alt=\"image\" \/>/\/>-->/g;' \
-e 's/<p>[ ]*<!--/<!--/g;' \
-e 's/-->[ ]*<\/p>/-->/g;' \
");
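To see what the footnote-related rules do, here is a small sketch that
runs a subset of them as a shell function on stdin. The function name
tidy_footnotes and the three sample lines are assumptions, written in
the style of the footnoteRef markup the rules target; they are not
taken from a real article.

```shell
#!/bin/sh
# Hypothetical helper (not the site's actual code): apply a subset of the
# post-pandoc clean-up rules to HTML on stdin.
tidy_footnotes() {
  sed -e 's/<hr \/>/<h2 class="notes">Notes<\/h2>/g' \
      -e 's/<li id="fn/<li class="footnote" id="fn/' \
      -e 's/<a href="#fn/ <a href="#fn/g' \
      -e 's/↩/^/g' \
      -e 's/>^<\/a>/ class="reffoot footnoteLink">^<\/a>/g' \
      -e 's/><sup>/>/g' \
      -e 's/<\/sup></</g' \
      -e 's/class="footnoteRef"/class="footnoteRef footnoteLink"/g' \
      -e 's/<h2/<h3/g' \
      -e 's/<\/h2>/<\/h3>/g'
}

# Made-up sample: an in-text reference, the rule before the notes,
# and one footnote entry with its back-link.
printf '%s\n' \
  'awry.<a href="#fn2" class="footnoteRef" id="fnref2"><sup>2</sup></a>' \
  '<hr />' \
  '<li id="fn2"><p>Note text.<a href="#fnref2">↩</a></p></li>' | tidy_footnotes
```

Order matters here: the <hr /> rule produces an <h2> which the later
<h2 to <h3 rules then demote, and the back-link rule only fires after
the arrow has been replaced by ^.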
We're in the slow process of putting our codebases up on
GitHub... Anyway, I don't know if any of this overkill of a reply will
be useful to anyone out there.
> Anyhow. Interesting article. Interesting. I enjoyed reading most of it.
Glad you enjoyed /most/ of the article ;)
Daniel
>
> Here's the git diff on the file
>
> --- a/bolano_original.html
> +++ b/Bolano/bolano_add_ref.html
> @@ -25,7 +25,8 @@ Argentine Ricardo Piglia) and postmagical realist (after, for example,
> the Paraguayan Augusto Roa Bastos) cognitive mapping. In doing so,
> <i>2666 </i>suggests, in a kind of high-modernist vein, an
> out-of-kilter realism re-presenting reality – that is, a capitalist
> -world – gone awry. Bolaño’s novel <i>2666</i> is an inorganic work
> +world – gone awry. <a href="#fn2" id="fnref2" class="footnoteRef
> +footnoteLink">[2]</a> Bolaño’s novel <i>2666</i> is an inorganic work
> written in five ‘parts’, a quintet that does not quite make a whole,
> and whose unity is given paradoxically in narrative proliferation and
> dispersal. <a href="#fn3" id="fnref3" class="footnoteRef
>
>
> Thanks a lot.
> _______________________________________________
> TeX FAQ: http://www.tex.ac.uk/faq
> Mailing list archives: http://tug.org/pipermail/texhax/
> More links: http://tug.org/begin.html
>
> Automated subscription management: http://tug.org/mailman/listinfo/texhax
> Human mailing list managers: postmaster at tug.org