Question
Need some guidance as I am new to Power BI and Redshift.
My raw JSON data is stored in an Amazon S3 bucket as .gz files (each .gz file contains multiple rows of JSON data). I want to connect Power BI to the Amazon S3 bucket. Based on my research so far, I have found three possible approaches:
- Amazon S3 is a web service and supports a REST API. We could try to use the web data source to get the data.
Question: Is it possible to unzip the .gz file (inside the S3 bucket or inside Power BI), extract the JSON data from S3, and connect it to Power BI?
- Import the data from Amazon S3 into Amazon Redshift. Do all data manipulation inside Redshift using SQL Workbench, then use the Amazon Redshift connector to get the data into Power BI.
Question 1: Does Redshift allow loading gzipped JSON data from the S3 bucket? If yes, is it possible directly, or do I have to write any code for it?
Question 2: I already have an S3 account; do I have to purchase Redshift account/space separately? What is the cost?
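For reference on Question 1: Redshift's COPY command can load gzipped JSON from S3 directly by combining the JSON and GZIP options, so no custom decompression code is needed. A minimal sketch of building such a statement in Python; the table name, S3 prefix, and IAM role ARN below are placeholders you would replace with your own:

```python
# Sketch only: builds the COPY statement Redshift expects for gzipped
# JSON in S3. You would run this statement through any Redshift client
# (e.g. SQL Workbench or a Python driver).
def build_copy_statement(table, s3_path, iam_role):
    """Build a Redshift COPY statement for gzipped JSON data in S3."""
    return (
        f"COPY {table} "
        f"FROM '{s3_path}' "
        f"IAM_ROLE '{iam_role}' "
        "JSON 'auto' GZIP;"  # JSON 'auto' maps keys to columns; GZIP decompresses
    )

stmt = build_copy_statement(
    "my_table",                                        # placeholder target table
    "s3://your_bucket/your_prefix/",                   # placeholder S3 prefix
    "arn:aws:iam::123456789012:role/MyRedshiftRole",   # placeholder IAM role
)
```

`JSON 'auto'` matches JSON keys to column names automatically; you can instead point it at a jsonpaths file for explicit mapping.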
- Move the data from the AWS S3 bucket to Azure Data Lake Store via Azure Data Factory, transform it with Azure Data Lake Analytics (U-SQL), and then output the data to Power BI.
U-SQL recognizes GZip-compressed files with the file extension .gz and automatically decompresses them as part of the extraction process. Is this process valid if my gzipped files contain JSON data rows?
Please let me know if there is any other method; your suggestions on this post are also welcome.
Thanks in advance.
Answer 1:
Regarding your first question: I faced a similar issue recently (extracting a CSV instead) and would like to share my solution.
Power BI still doesn't have a direct connector for S3 buckets, but you can do it with a Python script: Get data --> Python script.
P.S.: make sure the boto3 and pandas libraries are installed in the same folder (or subfolders) as the Python home directory you set in Power BI's options, or in the Anaconda library folder (c:\users\USERNAME\anaconda3\lib\site-packages).
(Screenshot: Power BI options window for Python scripting)
import boto3
import pandas as pd

bucket_name = 'your_bucket'
folder_name = 'the folder inside your bucket/'
file_name = r'file_name.csv'  # or .json in your case
key = folder_name + file_name

# Fill in your own credentials here (or rely on a configured AWS profile).
s3 = boto3.resource(
    service_name='s3',
    region_name='your_bucket_region',  # ex: 'us-east-2'
    aws_access_key_id=AWS_ACCESS_KEY_ID,
    aws_secret_access_key=AWS_SECRET_ACCESS_KEY,
)

# Fetch the object and read its body straight into a dataframe.
obj = s3.Bucket(bucket_name).Object(key).get()
df = pd.read_csv(obj['Body'])  # or pd.read_json(obj['Body']) in your case
The dataframe will be imported as a new query (named "df" in this example).
Apparently the pandas library can also read a compressed file (.gz, for example). See the following topic: How can I read tar.gz file using pandas read_csv with gzip compression option?
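To show this for the asker's exact case (gzipped files holding multiple JSON rows), here is a small self-contained sketch: it builds a gzipped newline-delimited JSON payload in memory, then lets pandas decompress and parse it in one call. The sample records are made up for illustration; with the S3 code above you would pass `obj['Body']` in place of the in-memory buffer.

```python
import gzip
import io
import pandas as pd

# Simulate one .gz file containing multiple rows of JSON data,
# one JSON object per line (as described in the question).
raw = b'{"id": 1, "name": "a"}\n{"id": 2, "name": "b"}\n'
buf = io.BytesIO(gzip.compress(raw))

# compression="gzip" tells pandas to decompress before parsing;
# lines=True parses one JSON object per line into one row each.
df = pd.read_json(buf, lines=True, compression="gzip")
```

If the file is instead a single JSON array rather than line-delimited objects, drop `lines=True`.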
来源:https://stackoverflow.com/questions/51801521/connecting-power-bi-to-s3-bucket