How do we get the document file url using the Watson Discovery Service?

|▌冷眼眸甩不掉的悲伤 提交于 2019-12-01 11:28:56

If you need to store the original source/file URL, you can include it as a field within your documents in the Discovery service, then you will be able to query that field back out when needed.

I also struggled with this request but ultimately got it working using Python bindings into Watson Discovery. The online documentation and API reference is very poor; here's what I used to get it working:

(Assume you have a Watson Discovery service and have a created collection):

# Programmatic upload and retrieval of documents and metadata with Watson Discovery

from watson_developer_cloud import DiscoveryV1
import os
import json

discovery = DiscoveryV1(
    version='2017-11-07',
    iam_apikey='xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx',
    url='https://gateway-syd.watsonplatform.net/discovery/api'
)

environments = discovery.list_environments().get_result()
print(json.dumps(environments, indent=2))

This gives you your environment ID. Now append to your code:

collections = discovery.list_collections('{environment-id}').get_result()
print(json.dumps(collections, indent=2))

This will show you the collection ID for uploading documents into programmatically. You should have a document to upload (in my case, an MS Word document), and its accompanying URL from your own source document system. I'll use a trivial fictitious example.

NOTE: the documentation DOES NOT tell you to append , 'rb' to the end of the open statement, but it is required when uploading a Word document, as in my example below. Raw text / HTML documents can be uploaded without the 'rb' parameter.

url = {"source_url":"http://mysite/dis030.docx"}
with open(os.path.join(os.getcwd(), '{path to your document folder with trailing / }', 'dis030.docx'), 'rb') as fileinfo:
    add_doc = discovery.add_document('{environment-id}', '{collections-id}', metadata=json.dumps(url), file=fileinfo).get_result()
    print(json.dumps(add_doc, indent=2))
    print(add_doc["document_id"])

Note the setting up of the metadata as a JSON dictionary, and then encoding it using json.dumps within the parameters. So far I've only wanted to store the original source URL but you could extend this with other parameters as your own use case requires.

This call to Discovery gives you the document ID.

You can now query the collection and extract the metadata using something like a Discovery query:

my_query = discovery.query('{environment-id}', '{collection-id}', natural_language_query="chlorine safety")
print(json.dumps(my_query.result["results"][0]["metadata"], indent=2))

Note - I'm extracting just the stored metadata here from within the overall returned results - if you instead just had: print(my_query) you'll get the full response from Discovery ... but ... there's a lot to go through to identify just your own custom metadata.

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!