Finding a DOI in a document or page


悲&欢浪女 2021-01-29 21:43

The DOI system places basically no useful limitations on what constitutes a reasonable identifier. However, being able to pull DOIs out of PDFs, web pages, etc. is quite useful.

7 answers

情歌与酒 (original poster)

2021-01-29 22:00

@Silas The sanity checking is a good idea. However, the regex doesn't cover all DOIs. The first element must (currently) be 10, and the second element must (currently) be numeric, but the third element is barely restricted at all:

"Legal characters are the legal graphic characters of Unicode. This specifically excludes the control character ranges 0x00-0x1F and 0x80-0x9F..."

and that's where the real problem lies. In practice, I've never seen whitespace used, but the spec specifically allows for it. Basically, there doesn't seem to be a sensible way of detecting the end of a DOI.
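For what it's worth, here is a rough Python sketch of the kind of heuristic this usually ends up as. It is not a spec-complete matcher. It assumes (contrary to what the spec allows) that a DOI ends at the first whitespace, double quote, or angle bracket, and that trailing sentence punctuation is not part of the identifier. The names `DOI_PATTERN`, `TRAILING`, and `find_dois` are made up for illustration.

```python
import re

# Prefix: "10." followed by a numeric element (optionally dot-subdivided), then "/".
# Suffix: ASSUMPTION, not from the spec: runs until whitespace, a double quote,
# or an angle bracket. DOIs that legitimately contain those will be cut short.
DOI_PATTERN = re.compile(r'\b10\.\d+(?:\.\d+)*/[^\s"<>]+')

# Characters that more often belong to the surrounding sentence than to the DOI.
TRAILING = ".,;:)]}'"


def _trim(candidate):
    """Strip likely sentence punctuation from the end of a candidate DOI."""
    while candidate and candidate[-1] in TRAILING:
        # Keep a closing parenthesis if it balances one inside the candidate.
        if candidate[-1] == ")" and candidate.count("(") >= candidate.count(")"):
            break
        candidate = candidate[:-1]
    return candidate


def find_dois(text):
    """Return candidate DOIs found in text, in order of appearance."""
    return [_trim(m.group(0)) for m in DOI_PATTERN.finditer(text)]


if __name__ == "__main__":
    sample = (
        "See doi:10.1000/xyz123. Also "
        "(https://doi.org/10.1016/S0022-1694(03)00123-4), and more."
    )
    print(find_dois(sample))
```

Anything this returns is only a candidate, so the sanity checking mentioned above still applies. Loosening the prefix to any numeric element also means ordinary text such as a version number followed by a slash can match.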
