Anyone used headless browsers for scraping BibTeX from webpages?
Mike Marchywka
marchywka at hotmail.com
Thu May 21 11:45:22 CEST 2020
On Wed, May 20, 2020 at 05:51:00PM -0400, John Scott wrote:
> I don't know about specifically for BibTeX, but for web scripting or doing
> basic forms cURL is pretty handy. For activating elements on a web page,
> you'll probably want to look at saving/using cookies with --cookie-jar and
> --cookie, and how to send POST requests.
>
> For example I recently wrote a script to allow me to do a form and complete a
> CAPTCHA all from the CLI. So I did
> curl --cookie-jar jar.txt http://foo.com/do.php
> to get it to save the cookie for my session. Then I'd recycle this cookie to
> get my CAPTCHA:
> curl --cookie jar.txt -o image.png http://foo.com/captcha.php
> and lastly after reading it, send the request (figure out the field names from
> Inspect Element in browser)
> curl --cookie jar.txt -X POST -F 'captcha_code=FfFfFf' http://foo.com/do.php
>
> For help with particular sites, please feel free to share details on or
> off-list.
The publisher finally reverted, which is what usually happens.
But I did find that the headless browser's output to a PDF file could then
be converted to text, and from that text I could get the DOI.
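
For the archive, a minimal sketch of that PDF route in Python. It is not the
exact commands used here: the binary names "chromium" and "pdftotext"
(poppler-utils), the output file name, and the loose DOI pattern are all
assumptions, so adjust them for your own setup.

```python
import re
import subprocess
import sys

# Assumed tool names; adjust to whatever is installed locally.
BROWSER = "chromium"      # Chromium/Chrome binary (assumption)
PDFTOTEXT = "pdftotext"   # from poppler-utils (assumption)

# Loose DOI pattern: "10.", a 4-9 digit registrant code, "/", then a suffix.
DOI_RE = re.compile(r"\b10\.\d{4,9}/[^\s\"<>]+")


def extract_dois(text):
    """Return the distinct DOI-like strings in text, in order of appearance."""
    found = []
    for match in DOI_RE.findall(text):
        doi = match.rstrip(".,;:)]}")  # drop trailing sentence punctuation
        if doi not in found:
            found.append(doi)
    return found


def page_to_text(url, pdf_path="page.pdf"):
    """Print url to a PDF with a headless browser, then convert it to text."""
    subprocess.run(
        [BROWSER, "--headless", "--disable-gpu", f"--print-to-pdf={pdf_path}", url],
        check=True,
    )
    # "-" tells pdftotext to write the extracted text to stdout.
    result = subprocess.run(
        [PDFTOTEXT, pdf_path, "-"], check=True, capture_output=True, text=True
    )
    return result.stdout


if __name__ == "__main__":
    # Usage: python get_doi.py <article URL>
    for doi in extract_dois(page_to_text(sys.argv[1])):
        print(doi)
```

The pattern is deliberately loose and will also strip a closing bracket or
period that genuinely belongs to a DOI, so check the result by eye before
relying on it.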
Thanks.
--
mike marchywka
306 charles cox
canton GA 30115
USA, Earth
marchywka at hotmail.com
404-788-1216
ORCID: 0000-0001-9237-455X