Question
Need some guidance as I am new to Power BI and Redshift.
My raw JSON data is stored in an Amazon S3 bucket as .gz files (each .gz file contains multiple rows of JSON data). I want to connect Power BI to the Amazon S3 bucket. Based on my research so far, I have found three possible approaches:
- Amazon S3 is a web service and supports a REST API. We could try to use the web data source to get the data.
Question: Is it possible to unzip the .gz file (inside the S3 bucket or inside Power BI), extract the JSON data from S3, and connect it to Power BI?
- Import the data from Amazon S3 into Amazon Redshift. Do all data manipulation inside Redshift using SQL Workbench, then use the Amazon Redshift connector to get the data into Power BI.
Question 1: Does Redshift allow loading gzipped JSON data from the S3 bucket? If yes, is it possible directly, or do I have to write any code for it?
Question 2: I already have an S3 account; do I have to purchase Redshift account/space separately? What is the cost?
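For reference on Question 1: Redshift's COPY command can load gzipped JSON from S3 directly by combining the JSON and GZIP options, so no custom decompression code is needed. A minimal sketch of building such a statement in Python; the table name, S3 prefix, and IAM role ARN below are placeholders you would replace with your own:

```python
# Sketch only: builds the COPY statement Redshift expects for gzipped
# JSON in S3. You would run this statement through any Redshift client
# (e.g. SQL Workbench or a Python driver).
def build_copy_statement(table, s3_path, iam_role):
    """Build a Redshift COPY statement for gzipped JSON data in S3."""
    return (
        f"COPY {table} "
        f"FROM '{s3_path}' "
        f"IAM_ROLE '{iam_role}' "
        "JSON 'auto' GZIP;"  # JSON 'auto' maps keys to columns; GZIP decompresses
    )

stmt = build_copy_statement(
    "my_table",                                        # placeholder target table
    "s3://your_bucket/your_prefix/",                   # placeholder S3 prefix
    "arn:aws:iam::123456789012:role/MyRedshiftRole",   # placeholder IAM role
)
```

`JSON 'auto'` matches JSON keys to column names automatically; you can instead point it at a jsonpaths file for explicit mapping.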
- Move the data from the AWS S3 bucket to Azure Data Lake Store via Azure Data Factory, transform it with Azure Data Lake Analytics (U-SQL), and then output the data to Power BI.
U-SQL recognizes GZip-compressed files with the file extension .gz and automatically decompresses them as part of the extraction process. Is this process valid if my gzipped files contain JSON data rows?
Please let me know if there is any other method; your suggestions on this post are also welcome.
Thanks in advance.
Answer 1:
Regarding your first question: I faced a similar issue recently (extracting a CSV instead) and would like to share my solution.
Power BI still doesn't have a direct connector for S3 buckets, but you can do it with a Python script: Get data --> Python script.
P.S.: make sure the boto3 and pandas libraries are installed in the same folder (or subfolders) as the Python home directory you set in Power BI's options, or in the Anaconda library folder (c:\users\USERNAME\anaconda3\lib\site-packages).
(Screenshot: Power BI options window for Python scripting)
import boto3
import pandas as pd

bucket_name = 'your_bucket'
folder_name = 'the folder inside your bucket/'
file_name = r'file_name.csv'  # or .json in your case
key = folder_name + file_name

# Fill in your own credentials here (or rely on a configured AWS profile).
s3 = boto3.resource(
    service_name='s3',
    region_name='your_bucket_region',  # ex: 'us-east-2'
    aws_access_key_id=AWS_ACCESS_KEY_ID,
    aws_secret_access_key=AWS_SECRET_ACCESS_KEY,
)

# Fetch the object and read its body straight into a dataframe.
obj = s3.Bucket(bucket_name).Object(key).get()
df = pd.read_csv(obj['Body'])  # or pd.read_json(obj['Body']) in your case
The dataframe will be imported as a new query (named "df" in this example).
Apparently the pandas library can also read a compressed file (.gz, for example). See the following topic: How can I read tar.gz file using pandas read_csv with gzip compression option?
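To show this for the asker's exact case (gzipped files holding multiple JSON rows), here is a small self-contained sketch: it builds a gzipped newline-delimited JSON payload in memory, then lets pandas decompress and parse it in one call. The sample records are made up for illustration; with the S3 code above you would pass `obj['Body']` in place of the in-memory buffer.

```python
import gzip
import io
import pandas as pd

# Simulate one .gz file containing multiple rows of JSON data,
# one JSON object per line (as described in the question).
raw = b'{"id": 1, "name": "a"}\n{"id": 2, "name": "b"}\n'
buf = io.BytesIO(gzip.compress(raw))

# compression="gzip" tells pandas to decompress before parsing;
# lines=True parses one JSON object per line into one row each.
df = pd.read_json(buf, lines=True, compression="gzip")
```

If the file is instead a single JSON array rather than line-delimited objects, drop `lines=True`.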
来源:https://stackoverflow.com/questions/51801521/connecting-power-bi-to-s3-bucket