Finding a DOI in a document or page
The DOI system places basically no useful limitations on what constitutes a reasonable identifier. However, being able to pull DOIs out of PDFs, web pages, and similar sources is quite useful for gathering citation information. Is there a reliable way to identify a DOI in a block of text without assuming the 'doi:' prefix? (Any language is acceptable, regexes are preferred, and avoiding false positives is a must.)
Alix Axel
OK, I'm currently extracting thousands of DOIs from free-form text (XML), and I realized that my previous approach had a few problems, namely with encoded entities and trailing punctuation, so I