Finding a DOI in a document or page

前端 未结 7 1833
悲&欢浪女
悲&欢浪女 2021-01-29 21:43

The DOI system places basically no useful limitations on what constitutes a reasonable identifier. However, being able to pull DOIs out of PDFs, web pages, etc. is quite useful

7条回答
  •  情歌与酒
    2021-01-29 22:00

    @Silas The sanity checking is a good idea. However, the regex doesn't cover all DOIs. The first element must (currently) be 10, and the second element must (currently) be numeric, but the third element is barely restricted at all:

    "Legal characters are the legal graphic characters of Unicode. This specifically excludes the control character ranges 0x00-0x1F and 0x80-0x9F..."

    and that's where the real problem lies. In practice, I've never seen whitespace used, but the spec specifically allows for it. Basically, there doesn't seem to be a sensible way of detecting the end of a DOI.

提交回复
热议问题