aws notes

Summary of AWS Analytics services circa Feb 2019

by davidy 7 years ago 6 min read

In preparing to take the “AWS Certified Solutions Architect - Professional” exam, I found myself reading through the whitepapers and reference architectures.

The “Overview of Amazon Web Services” whitepaper describes the 140 (!) currently available AWS services, so I thought I’d note down below my own description of each service and how it’s priced.

This post covers only the “Analytics” services. See follow-up posts for more services ;)

Analytics (this post)
Application Integration
AR and VR
AWS Cost management
Blockchain
Business Applications
Compute Services
Customer Engagement
Database
Desktop and App Streaming
Developer Tools
Game Tech
Internet of Things (IoT)
Machine Learning
Management and Governance
Media Services
Migration and Transfer
Mobile Services
Networking and Content Delivery
Robotics
Satellite
Security, Identity, and Compliance
Storage

Table of Contents

Amazon Athena (Serverlessly query structured data with SQL)
Amazon EMR (Managed Big Data framework)
Amazon CloudSearch (Managed search service for your website/apps)
Amazon Elasticsearch Service (Managed Elasticsearch cluster)
Amazon Kinesis
Amazon Kinesis Data Streams (Low-level, flexible platform for streamed data processing)
Amazon Kinesis Data Firehose (Simple version of Data Streams for common use case)
Amazon Kinesis Data Analytics (Perform realtime analytics on data stream)
Amazon Kinesis Video Streams (Video-specific stream processing featuring machine learning and recognition)
Amazon Redshift (Managed data warehouse)
Amazon QuickSight (Managed business intelligence / reporting)
AWS Data Pipeline (Automate transfer / transform of data in/out of AWS services)
AWS Glue (Like Data Pipeline but vastly improved with ML, serverless, etc)
AWS Lake Formation (Automates and simplifies the setup of data lakes)
Amazon Managed Streaming for Kafka (MSK) (Managed Kafka cluster)

Amazon Athena (Serverlessly query structured data with SQL)

Amazon Athena makes structured data (CSV, log files) stored in S3 queryable via SQL. The use case might be analysing raw data (think millions of rows of CSV) without having to “massage” the data into a relational database first. You’re charged $5/TB of data scanned for your query, with a minimum of 10MB (so smallest charge would be $0.00005).
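To sanity-check that minimum charge, here's a rough sketch of the pricing rule in Python. The function name is my own, it assumes decimal units (1 TB = 10^12 bytes), and it ignores any rounding AWS applies to the scanned size, so treat it as an estimate rather than a billing calculator.

```python
# Rough estimate of Athena's per-query charge, using the figures quoted in this post.
PRICE_PER_TB_USD = 5.0              # $5 per TB of data scanned
MIN_BYTES_PER_QUERY = 10 * 10**6    # 10 MB minimum per query
BYTES_PER_TB = 10**12               # assumption: decimal units


def athena_query_cost(bytes_scanned: int) -> float:
    """Return the estimated charge in USD for one query (hypothetical helper)."""
    billable = max(bytes_scanned, MIN_BYTES_PER_QUERY)
    return billable / BYTES_PER_TB * PRICE_PER_TB_USD


if __name__ == "__main__":
    # A tiny query, exactly the minimum, and a 250 GB scan.
    for size in (1_000, 10 * 10**6, 250 * 10**9):
        print(f"{size:>15,} bytes scanned -> ${athena_query_cost(size):.5f}")
```

Anything under 10 MB is billed as 10 MB, which is where the $0.00005 floor comes from.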

Amazon EMR (Managed Big Data framework)

Amazon EMR (Elastic MapReduce) is a managed data analytics framework for performing big-data analysis of your S3 “data lake” (structured/unstructured data sprayed into an S3 bucket). It’s a managed service in that EMR handles the building/management/scaling of the Hadoop / Spark cluster (by spinning up/down EC2 instances), leaving you to play with the higher functions like plugging in 3rd-party analysis tools, running queries, and using the cluster to perform your big data analysis.

You’re charged per-second per cluster instance (there are a few instance variants), and you pay standard EC2 pricing for the EC2 instances required to run your cluster.

(TIL: Hadoop was named after co-creator Doug Cutting’s son’s toy elephant)

Amazon CloudSearch (Managed search service for your website/apps)

Amazon CloudSearch is a managed service for providing a “search” function in an app or website. It includes all the machinery to scrape your target data (say, 1,000,000 cat photos in S3, or your 10-year-old forum history), build indexes, execute queries, and scale up / down based on demand. You’re charged per hour that your CloudSearch instance is running (there are a few instance variants), and you pay “as you go” to upload data into the instance, rebuild indexes, and transfer data “in” and “out” of CloudSearch.

Amazon Elasticsearch Service (Managed Elasticsearch cluster)

Amazon Elasticsearch Service is a managed Elasticsearch cluster for performing analysis of data (e.g., logs via Logstash). You access your instance via the Elasticsearch API, so it “just works” with popular Elasticsearch companions like Kibana and Logstash. The service integrates with your other AWS services (Lambda, Firehose, CloudWatch, etc). You’re charged per hour per cluster instance (there are many instance variants), and you can run the t2.small.elasticsearch instance under the AWS Free Tier.

Amazon Kinesis

Amazon Kinesis is a managed service to process streaming data (e.g. telemetry from IoT devices, or video streams from traffic cams). There are 4 capabilities available:

Amazon Kinesis Data Streams (Low-level, flexible platform for streamed data processing)

Amazon Kinesis Data Streams is the base service which acts as the “building block” for the subsequent 3 capabilities. Data Streams receives your streaming data (scaling on demand), and provides the capability to perform realtime analytics, trigger Lambda functions, send it to Apache Spark in EMR, etc. You are charged based on your number of “shards” (each shard provides a fixed unit of throughput) as well as PUTs. You can also pay additional “shard hours” to increase your stream data retention from 24 hours (default) up to 168 hours (7 days, the maximum at the time of writing).
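Since shards drive the bill, it helps to estimate how many you need. Here's a minimal sketch, assuming the per-shard limits as I understand them from the Kinesis documentation (1 MB/s or 1,000 records/s in, 2 MB/s out). The function name and constants are my own, so check the current quotas before relying on the numbers.

```python
import math

# Assumed per-shard limits; verify against the current Kinesis quotas.
WRITE_MB_PER_SEC_PER_SHARD = 1.0
WRITE_RECORDS_PER_SEC_PER_SHARD = 1000
READ_MB_PER_SEC_PER_SHARD = 2.0


def shards_needed(write_mb_per_sec: float, records_per_sec: float,
                  read_mb_per_sec: float) -> int:
    """Estimate the shard count that satisfies all three throughput limits."""
    by_write_volume = math.ceil(write_mb_per_sec / WRITE_MB_PER_SEC_PER_SHARD)
    by_write_records = math.ceil(records_per_sec / WRITE_RECORDS_PER_SEC_PER_SHARD)
    by_read_volume = math.ceil(read_mb_per_sec / READ_MB_PER_SEC_PER_SHARD)
    # A stream always has at least one shard.
    return max(1, by_write_volume, by_write_records, by_read_volume)


if __name__ == "__main__":
    # 3.5 MB/s in, 2,000 records/s, 7 MB/s read by consumers.
    print(shards_needed(3.5, 2000, 7.0))
```

Whichever limit is hit first sets the shard count, which is why a stream of many tiny records can need more shards than its raw volume suggests.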

Amazon Kinesis Data Firehose (Simple version of Data Streams for common use case)

Amazon Kinesis Data Firehose is a “simplified” service built on Data Streams (above), intended to be easier to use for the most common use case: streaming data into S3, Redshift, Elasticsearch or Splunk. There’s no analytics in the pipeline - any analysis you want to do on the data has to be done after the data has been sent to its destination (you can send the same data to multiple destinations, e.g., S3 and Redshift). However, you can use Lambda to do lightweight transformation of the data before it’s delivered to S3 and friends. Pricing is simple - you’re charged for the amount of data ingested (S3, Redshift etc are priced separately).
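To give a feel for the Lambda transformation step, here's a minimal sketch of a handler. It assumes the record contract I understand Firehose to use (base64-encoded `data`, and a `recordId` plus a `result` of `Ok` or `ProcessingFailed` on each returned record). The `level` field and the upper-casing are made up purely for illustration.

```python
import base64
import json


def handler(event, context):
    """Upper-case a hypothetical 'level' field on each JSON record."""
    output = []
    for record in event["records"]:
        try:
            # Firehose hands each record to Lambda base64-encoded.
            payload = json.loads(base64.b64decode(record["data"]))
            payload["level"] = str(payload.get("level", "")).upper()
            # Re-encode, adding a newline so records are separated at the destination.
            data = base64.b64encode(
                (json.dumps(payload) + "\n").encode("utf-8")
            ).decode("ascii")
            output.append({"recordId": record["recordId"], "result": "Ok", "data": data})
        except (ValueError, AttributeError):
            # Not valid JSON (or not a JSON object): hand it back unchanged, marked as failed.
            output.append({
                "recordId": record["recordId"],
                "result": "ProcessingFailed",
                "data": record["data"],
            })
    return {"records": output}
```

Every incoming `recordId` has to appear in the response, which is why the failure branch returns the record rather than skipping it.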

Amazon Kinesis Data Analytics (Perform realtime analytics on data stream)

Amazon Kinesis Data Analytics is the analysis component of Kinesis, which can perform realtime analytics against an incoming data stream. Data can be analysed using SQL queries, or Java libraries which integrate with a suite of other AWS tools (S3, DynamoDB, etc). You’re charged per hour per Kinesis Processing Unit (KPU), being an instance with 1 vCPU and 4GB RAM. KPUs auto-scale based on demand, and you’re charged an additional KPU if you use Java, for application orchestration.
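The KPU arithmetic can be sketched like this. The hourly rate is a placeholder I've made up (it varies by region), and the function name is my own; the only rules taken from above are "per hour per KPU" and "one extra KPU for Java".

```python
def kinesis_analytics_cost(hours: float, avg_kpus: float,
                           price_per_kpu_hour: float,
                           uses_java: bool = False) -> float:
    """Estimate the charge in USD for a Kinesis Data Analytics application."""
    # Java applications are billed one additional KPU for orchestration.
    billable_kpus = avg_kpus + (1 if uses_java else 0)
    return hours * billable_kpus * price_per_kpu_hour


if __name__ == "__main__":
    HYPOTHETICAL_RATE = 0.10  # USD per KPU-hour, placeholder only
    print(f"SQL:  ${kinesis_analytics_cost(24, 2, HYPOTHETICAL_RATE):.2f}")
    print(f"Java: ${kinesis_analytics_cost(24, 2, HYPOTHETICAL_RATE, uses_java=True):.2f}")
```

Because KPUs auto-scale, `avg_kpus` is an average over the period rather than a fixed size you choose.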

Amazon Kinesis Video Streams (Video-specific stream processing featuring machine learning and recognition)

Amazon Kinesis Video Streams provides the ability to ingest video data for analysis and storage. It provides an SDK-based method for manufacturers to get data into AWS, and integrates with Amazon’s Machine Learning and video recognition services. You are charged simply based on data ingested, consumed, and stored.

Amazon Redshift (Managed data warehouse)

Amazon Redshift is a managed data warehouse service (“data warehouses” are for structured data, while “data lakes” are for raw data). It plugs into traditional BI tools and scales from a single node to a large cluster. Interestingly, “Redshift Spectrum” allows you to run queries against structured data stored in S3, alongside the data held on the cluster’s own storage. Snapshots of your data warehouse can also be backed up to S3. You’re charged per hour per node in your cluster (there are a few node variants), which covers the underlying compute and storage; Redshift Spectrum queries are charged separately, per TB of S3 data scanned.

Amazon QuickSight (Managed business intelligence / reporting)

Amazon QuickSight is a managed business intelligence (BI) service. You feed it your data (be it Twitter trends, on-premises spreadsheets, or existing data in AWS), and use it to “author” dashboards for your users to consume. There’s some machine learning (ML) involved, to augment your analytics. You’re charged per author on a monthly basis, and per session for each user who views the dashboards.

AWS Data Pipeline (Automate transfer / transform of data in/out of AWS services)

AWS Data Pipeline is a managed ETL (Extract-Transform-Load) service, which automates the movement of your data. Data Pipeline includes a “visual pipeline builder”, and you might use it, for example, to grab your CloudWatch logs from S3, transform them into a structured format, and then push them into Redshift for data warehousing. Pricing seems almost a giveaway, and obviously you pay for the related AWS services that Data Pipeline would utilise (S3, Redshift, etc).

AWS Glue (Like Data Pipeline but vastly improved with ML, serverless, etc)

AWS Glue is the “cool cousin” of Data Pipeline. It’s also used for ETL, but it can “discover” the structure of your semi-structured data by crawling it, and then generates code (which you can change) to perform transformations. For example, you could use Glue to ingest your webserver logs from S3, “enrich” the data by tying customer account ID to your CRM data, and then export the data in a structured form into Redshift. As a “serverless” service, you pay for the resources used to run your ETL jobs only while they’re running. You also pay minimally for crawling and storing schemas.

AWS Lake Formation (Automates and simplifies the setup of data lakes)

AWS Lake Formation (in preview) is a high-level service which abstracts much of the complexity of creating and managing a “data lake”. Lake Formation covers crawling your existing datasets for schemas, cleaning/de-duping/sorting and then migrating data into the data lake, plus all the security elements governing who may access what data.

Amazon Managed Streaming for Kafka (MSK) (Managed Kafka cluster)

Amazon Managed Streaming for Kafka (MSK) is a managed Kafka cluster service (Kafka supports traditional message brokering as well as “streams”). MSK provides a ready-to-go Kafka cluster configured to best practices. You pay per-broker, per hour, based on broker instance size.