Cloud Analytics is a modern, competitive approach that many enterprises have adopted for running data analytics in the cloud. It uses various cloud services to analyze massive datasets for corporate reporting and decision making.
Why Use Cloud Analytics?
Cloud Analytics is a modern approach to data warehousing that offers greater flexibility, power, and scalability.
INTRODUCTION TO CLOUD ANALYTICS TOOLS
1. AWS Cloud Analytics tools
AWS provides various analytics tools, such as Athena, Redshift, and EMR.
AWS EMR is used to deploy big data tools such as Spark, Hive, Pig, Sqoop, Impala, and many more for processing and analytics. Hive or Impala can serve as a data warehouse tool on an EMR cluster for running interactive analytics queries against data.
AWS Athena is a serverless, managed, pay-as-you-go service that data analysts can use to run interactive queries against data stored in Amazon S3. You simply define the schema for the data residing in your S3 buckets and start querying it with SQL. It supports various file formats such as CSV, JSON, ORC, Avro, and Parquet, and it can run complex analytics queries quickly. In this blog, we will take a deeper look at how to provision Athena to run queries.
Redshift is a modern data warehouse tool that can query petabytes of structured and semi-structured data across your data lake. It integrates easily with other AWS services such as Amazon Athena, Aurora, and RDS, so you can analyze data stored in different sources using simple SQL queries. It can also integrate with Amazon QuickSight for reporting. Redshift can store results back to Amazon S3 buckets, from where they can feed other big data tools such as Spark on EMR for further processing.
2. Google Cloud Analytics tools
Google BigQuery is a fully managed, serverless analytics service provided by Google Cloud for analyzing petabytes of data using SQL queries. It has built-in machine learning and access control capabilities.
Google Cloud Dataflow is a fast, fully managed, serverless service provided by Google Cloud for running batch and stream analytics jobs in a unified way using Apache Beam pipelines.
Google Cloud Dataproc can be used to spawn Spark, Hadoop, Presto, or other custom clusters to run analytics and processing jobs. The clusters are fully managed and billed on a pay-as-you-go model.
Google Data Studio can be used to unite and explore your data in one place. It can provide beautiful data visualizations to derive more insights from the data and can be easily integrated with spreadsheets, Google Ads, BigQuery and more.
3. Microsoft Cloud Analytics tools
HDInsight provides managed Hadoop clusters that can be used to run Apache Spark, Hive, Flume, Kafka and many other open-source big data frameworks for data processing and data analytics.
ML Studio is a GUI-based, drag-and-drop tool for running machine learning algorithms.
Data Lake Analytics is an analytics job service that simplifies big data work and runs massively parallel jobs over petabytes of data without incurring any local infrastructure cost. It offers a pay-as-you-go model, built-in query optimization, and security capabilities.
Building Your First Application Using AWS Athena
Step 1: Select Athena
Athena is listed under the Analytics section of the AWS services console. Click it to open the Athena console, where you can set up your data for analysis.
Step 2: Get Started with Athena
Using the Athena console, you can select the S3 bucket that holds the data you want to analyze. Athena can be integrated with AWS Glue to crawl your S3 bucket and define the schema automatically, or you can define the schema manually. Once the schema and the S3 bucket are defined, you can run SQL queries on top of them.
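The automatic option can also be scripted. The following is a minimal sketch, assuming the `glue` argument is a boto3 Glue client (created with `boto3.client("glue")`) and that an IAM role with access to the bucket already exists. The crawler name, role ARN, database name, and S3 path are hypothetical placeholders, not values from this walkthrough.

```python
def crawl_bucket(glue, crawler_name, role_arn, database, s3_path):
    """Create a Glue crawler over one S3 path and start it.

    `glue` is assumed to be a boto3 Glue client, e.g. boto3.client("glue").
    All other arguments are hypothetical placeholders chosen by the caller.
    """
    glue.create_crawler(
        Name=crawler_name,
        Role=role_arn,              # IAM role the crawler runs as
        DatabaseName=database,      # Data Catalog database that receives the tables
        Targets={"S3Targets": [{"Path": s3_path}]},
    )
    # The crawler runs asynchronously; the inferred tables appear in the
    # Data Catalog once the run finishes.
    glue.start_crawler(Name=crawler_name)
    return crawler_name

# Example call (requires boto3 and AWS credentials, so it is left commented out):
# import boto3
# crawl_bucket(boto3.client("glue"), "orders-crawler",
#              "arn:aws:iam::123456789012:role/ExampleGlueRole",
#              "sales_db", "s3://example-bucket/orders/")
```

Passing the client in as an argument keeps the function easy to try against a stub before pointing it at a real account.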
Step 3: Athena Editor
The Athena Editor provides a query editor for running interactive queries over your data stored in S3 using SQL or Hive DDL. To save the results and metadata of your queries, you must first set up an S3 location using the “set up query result location in Amazon S3” link shown in the editor.
Step 4: S3 Setup
After you click the “set up query result location in Amazon S3” link, you are redirected to a window where you can set up your S3 bucket. You can also choose to encrypt your query results.
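When queries are submitted programmatically rather than from the console, the same result location and encryption choice are passed as a `ResultConfiguration` structure. The helper below is a sketch of how that structure could be assembled. The function name, bucket name, and prefix are hypothetical; the dictionary keys follow the Athena API as exposed by boto3.

```python
def result_configuration(bucket, prefix="athena-results/", encrypt=True, kms_key_arn=None):
    """Build the ResultConfiguration dict accepted by Athena's StartQueryExecution.

    `bucket` and `prefix` are hypothetical placeholders for your own location.
    """
    path = prefix.strip("/")
    location = f"s3://{bucket}/{path}/" if path else f"s3://{bucket}/"
    config = {"OutputLocation": location}
    if encrypt:
        if kms_key_arn:
            # Server-side encryption with a customer-managed KMS key.
            config["EncryptionConfiguration"] = {
                "EncryptionOption": "SSE_KMS",
                "KmsKey": kms_key_arn,
            }
        else:
            # Server-side encryption with S3-managed keys.
            config["EncryptionConfiguration"] = {"EncryptionOption": "SSE_S3"}
    return config

print(result_configuration("example-bucket"))
```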
Step 5: Create Database/Table
You can upload your files to your S3 bucket in any format supported by Athena, such as Parquet, JSON, or CSV. Once the data is in the S3 bucket, you define the schema for it either manually or using AWS Glue. AWS Glue provides a unified metadata repository, the Data Catalog, that can be integrated with various AWS services such as Athena, Redshift, RDS, and more. AWS Glue crawlers can automatically scan S3 buckets, infer the schema, and store it in the Data Catalog. The schema can then be modified to match your files.
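For the manual route, the schema is declared with a Hive-style `CREATE EXTERNAL TABLE` statement that points at the S3 location. The sketch below generates such a statement for a delimited text file. The database, table, column names, and bucket path are made-up examples, and the header-skipping table property is only appropriate if your files actually start with a header row.

```python
def create_table_ddl(database, table, columns, s3_location, delimiter=",", skip_header=True):
    """Return a CREATE EXTERNAL TABLE statement for delimited text files in S3.

    `columns` is a list of (name, hive_type) pairs. All names here are
    hypothetical examples; adapt them to the files in your own bucket.
    """
    cols = ",\n".join(f"  `{name}` {dtype}" for name, dtype in columns)
    ddl = (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {database}.{table} (\n"
        f"{cols}\n"
        ")\n"
        f"ROW FORMAT DELIMITED FIELDS TERMINATED BY '{delimiter}'\n"
        f"LOCATION '{s3_location}'"
    )
    if skip_header:
        # Tell Athena to ignore the first line of each file.
        ddl += "\nTBLPROPERTIES ('skip.header.line.count'='1')"
    return ddl

print(create_table_ddl(
    "sales_db",
    "orders",
    [("order_id", "string"), ("customer_id", "string"), ("amount", "double")],
    "s3://example-bucket/orders/",
))
```

The printed statement can be pasted into the Athena query editor and run like any other query.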
Step 6: Query Data
Once the data and schema are defined, you can run SQL or HiveQL (HQL) on top of your data using the Athena query editor. You can perform complex operations on your data, such as aggregations, joins, and window functions.
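The same kind of query can be submitted outside the console. Below is a rough sketch, assuming `athena` is a boto3 Athena client (`boto3.client("athena")`) and that a hypothetical `orders` table with `customer_id` and `amount` columns exists. The sample query combines an aggregation with a window function; the helper starts the query, polls until it finishes, and reads the first page of results.

```python
import time

# Hypothetical query: total spend per customer, ranked with a window function.
SAMPLE_QUERY = """
SELECT customer_id,
       SUM(amount) AS total_spent,
       RANK() OVER (ORDER BY SUM(amount) DESC) AS spend_rank
FROM orders
GROUP BY customer_id
"""

def run_query(athena, sql, database, output_location, poll_seconds=1.0, max_polls=60):
    """Run one query and return the first page of results as a list of dicts."""
    started = athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_location},
    )
    query_id = started["QueryExecutionId"]

    # Athena runs queries asynchronously, so poll until a final state is reached.
    for _ in range(max_polls):
        status = athena.get_query_execution(QueryExecutionId=query_id)["QueryExecution"]["Status"]
        state = status["State"]
        if state == "SUCCEEDED":
            break
        if state in ("FAILED", "CANCELLED"):
            reason = status.get("StateChangeReason", "")
            raise RuntimeError(f"query {query_id} ended as {state}: {reason}")
        time.sleep(poll_seconds)
    else:
        raise TimeoutError(f"query {query_id} did not finish after {max_polls} polls")

    # Only the first page is read here; larger results need NextToken paging.
    page = athena.get_query_results(QueryExecutionId=query_id)
    rows = [[cell.get("VarCharValue") for cell in row["Data"]]
            for row in page["ResultSet"]["Rows"]]
    if not rows:
        return []
    header, data = rows[0], rows[1:]  # for SELECT queries the first row holds column names
    return [dict(zip(header, values)) for values in data]

# Example call (requires boto3 and AWS credentials, so it is left commented out):
# import boto3
# run_query(boto3.client("athena"), SAMPLE_QUERY, "sales_db",
#           "s3://example-bucket/athena-results/")
```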
Step 7: View History
Athena also lets you view the history of previously executed queries in the history tab. The tab shows whether each query failed or succeeded, along with its run time, submit time, data scanned, and other information.
Cloud providers keep releasing new, competitive services that simplify analytics and data processing. With Infrastructure as a Service, analysts can focus on the data and run massively parallel analytics jobs in less time, because they can provision more machines as needed. Analytics in the cloud is becoming more intuitive as cloud infrastructure becomes the new normal for the modern enterprise. Which cloud provider will win this race is still unclear, as each brings different value to the world of Big Data analytics and processing.
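The information shown in the history tab can also be retrieved through the API. The sketch below assumes `athena` is a boto3 Athena client and that queries ran in the default `primary` workgroup. It lists recent query IDs, fetches their details in one batch call, and keeps the same fields the history tab displays.

```python
def recent_history(athena, workgroup="primary", limit=10):
    """Summarize recent queries: state, submit time, run time, data scanned.

    `limit` should not exceed 50, the maximum accepted by both API calls used here.
    """
    ids = athena.list_query_executions(WorkGroup=workgroup, MaxResults=limit)["QueryExecutionIds"]
    if not ids:
        return []
    details = athena.batch_get_query_execution(QueryExecutionIds=ids)["QueryExecutions"]
    history = []
    for query in details:
        stats = query.get("Statistics", {})
        history.append({
            "id": query["QueryExecutionId"],
            "state": query["Status"]["State"],
            "submitted": query["Status"].get("SubmissionDateTime"),
            "run_time_ms": stats.get("EngineExecutionTimeInMillis"),
            "data_scanned_bytes": stats.get("DataScannedInBytes"),
        })
    return history

# Example call (requires boto3 and AWS credentials, so it is left commented out):
# import boto3
# for entry in recent_history(boto3.client("athena")):
#     print(entry)
```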
Author: Navdeep Kaur