Finding a DOI in a document or page

悲&欢浪女 2021-01-29 21:43

The DOI system places basically no useful limitations on what constitutes a reasonable identifier. However, being able to pull DOIs out of PDFs, web pages, etc. is quite useful.

7 answers
  •  暖寄归人
    2021-01-29 22:03

    I'm sure it's not super-helpful for the OP at this point, but I figured I'd post what I am trying in case anyone else like me stumbles upon this:

    (10\.(\d)+/(\S)+)
    

    This matches: "10 dot number slash anything-not-whitespace"
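    As a rough illustration of that pattern in use, here is a minimal Python sketch. The helper name `find_dois_loose` and the sample strings are made up for this example, and the pattern is written without capture groups so that `re.findall` returns whole matches.

```python
import re

# Loose pattern: "10", a literal dot, digits, a slash, then any run of
# non-whitespace characters.
LOOSE_DOI = re.compile(r"10\.\d+/\S+")

def find_dois_loose(text):
    """Return every substring of `text` that matches the loose pattern."""
    return LOOSE_DOI.findall(text)

# Sample inputs (illustrative only).
plain = "See doi:10.1000/182 for details."
html = '<a href="https://doi.org/10.1000/182">link</a>'

print(find_dois_loose(plain))  # the DOI on its own
print(find_dois_loose(html))   # the DOI plus trailing markup, since \S+ keeps going
```

    On plain text this behaves as expected, but inside markup the `\S+` part swallows everything up to the next whitespace.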

    But for my use (scraping HTML), this was finding false positives, so I had to match the above while also excluding quotes and greater-than/less-than signs:

    (10\.(\d)+/([^\s>"<])+)
    

    I'm still testing these out, but I'm feeling hopeful thus far.
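    For anyone who wants to try the stricter version, a small Python sketch follows. The function name `find_dois`, the sample page, and the trailing-punctuation stripping are my own assumptions rather than part of the patterns above. The stripping is there because a DOI at the end of a sentence would otherwise pick up the final period.

```python
import re

# Stricter pattern: the suffix stops at whitespace, double quotes, and
# angle brackets, so it does not run into surrounding HTML.
STRICT_DOI = re.compile(r'10\.\d+/[^\s"<>]+')

def find_dois(html, strip_trailing=".,;:"):
    """Return DOI-like strings found in `html`.

    Trailing sentence punctuation is stripped (an assumption added for
    this sketch; a real DOI could in principle end in such a character).
    """
    return [match.rstrip(strip_trailing) for match in STRICT_DOI.findall(html)]

# Sample page fragment (illustrative only).
page = (
    '<p>See <a href="https://doi.org/10.1000/182">the handbook</a> '
    'or doi:10.5555/12345678.</p>'
)

print(find_dois(page))
```

    This is still a heuristic: because the DOI syntax allows almost any character in the suffix, the pattern can both over-match and cut off unusual identifiers.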
