Amazon Redshift is a fast, fully managed, and widely popular cloud data warehouse that allows you to process exabytes of data across your data warehouse, operational database, and data lake using standard SQL. It offers different node types to accommodate various workloads; you can choose from RA3, DC2, and DS2 depending on your requirements. RA3 is the newest instance type; it allows you to scale and pay for compute and storage independently, and it enables advanced features such as cross-cluster data sharing and cross-Availability Zone cluster relocation. For more information about our recommendations on node count and type when upgrading, see Upgrading to RA3 node types.
The posts Scale your cloud data warehouse and reduce costs with the new Amazon Redshift RA3 nodes with managed storage and Amazon Redshift Benchmarking: Comparison of RA3 vs. DS2 Instance Types cover the benefits of migrating from DS2 to RA3 in great detail. After learning about the benefits of RA3, many of our existing customers on DC2 have migrated to RA3 after successfully evaluating its performance. However, this evaluation is mostly done manually, which requires you to replicate your workload to evaluate performance on the new nodes.
You can use the Simple Replay tool to conduct a what-if analysis and evaluate how your workload performs in different scenarios. For example, you can use the tool to benchmark your actual workload on a new instance type like RA3, evaluate a new feature, or assess different cluster configurations. It also includes enhanced support for replaying data ingestion and export pipelines with COPY and UNLOAD statements. To get started and replay your workloads, download the tool from the Amazon Redshift GitHub repository.
In this post, we walk through the steps to automate the evaluation of Amazon Redshift RA3 instances through the Amazon Redshift Simple Replay utility. If you’re running production workloads in Amazon Redshift using older generation DS2 and DC2 node types, you can use this solution to automatically extract your workload logs from the source production cluster and replay them in an isolated environment, which allows you to perform a direct comparison between these two Amazon Redshift clusters seamlessly.
Prerequisites
This solution uses AWS CloudFormation to automatically provision all the required resources in your AWS accounts. For more information, see Getting started with AWS CloudFormation.
As a prerequisite for this solution, you need to complete the following steps. You may need administrator access on your AWS account to perform these steps and the subsequent deployment of this solution.
- On the Amazon Redshift console, turn on audit logging in your source Amazon Redshift cluster and specify an Amazon Simple Storage Service (Amazon S3) bucket location to save the log files. See Database audit logging for more information.
- Change the parameter group setting `enable_user_activity_logging` to `true`. See Managing parameter groups using the console for more information.
- Reboot the cluster.
- Create an Amazon Elastic Compute Cloud (Amazon EC2) key pair on the AWS account where you plan to deploy the CloudFormation template. For more information, see Create a key pair using Amazon EC2.
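The audit-logging prerequisites above can also be scripted. The following sketch assembles the corresponding Amazon Redshift API request parameters (cluster, parameter group, and bucket names are placeholders); you would run each operation with `boto3.client("redshift")` after filling in your own identifiers:

```python
def prerequisite_calls(cluster_id: str, parameter_group: str, log_bucket: str) -> list:
    """Return (operation, params) pairs for the prerequisite steps."""
    return [
        # Step 1: turn on audit logging to an S3 bucket
        ("enable_logging", {
            "ClusterIdentifier": cluster_id,
            "BucketName": log_bucket,
            "S3KeyPrefix": "audit-logs/",  # placeholder prefix
        }),
        # Step 2: set enable_user_activity_logging to true (a static parameter)
        ("modify_cluster_parameter_group", {
            "ParameterGroupName": parameter_group,
            "Parameters": [{
                "ParameterName": "enable_user_activity_logging",
                "ParameterValue": "true",
                "ApplyType": "static",
            }],
        }),
        # Step 3: reboot so the static parameter change takes effect
        ("reboot_cluster", {"ClusterIdentifier": cluster_id}),
    ]
```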
Solution overview
The solution comprises two CloudFormation templates to perform extract and replay of your workload. You can deploy both templates in the same account where your Amazon Redshift cluster is hosted, which is our recommended approach. Alternatively, you can run the extract template on your prod account and the replay template on an isolated dev account to perform this evaluation, as shown in the following diagram.
The process uses AWS Step Functions and AWS Lambda to orchestrate the end-to-end workflow for extract and replay. The first template deploys the extract application in your source account, which extracts the audit logs from the S3 bucket used for Amazon Redshift audit logging and stores them in a new S3 bucket created for the replay-extract process. It also creates a manual snapshot of your cluster and authorizes the replay account to restore it.
The second template deploys the replay application in your dev account (if you choose not to run the replay in your source account). It uses the files in Amazon S3 from the extract application and generates an automated evaluation summary of the replays.
End-to-end workflow: Extract process
The extract process starts when you input the time interval in which you want to run this process. It automatically extracts audit logs from the source cluster and stores these logs in a new S3 bucket in that account. It also deploys an Amazon EC2 instance of size m5.large with the Simple Replay utility installed. The following diagram illustrates the solution architecture.
The following diagram shows the AWS Step Functions State Machine for the extract process.
The state machine performs the following steps to extract the source cluster metadata into the extract S3 bucket:
- Wait for the source cluster to be in an available state.
- Create a manual snapshot from the source cluster using the identifier `ra3-migration-evaluation-snapshot-` concatenated with the cluster identifier string.
- Authorize the snapshot to the target account where you plan to run the replay process.
- Upload the source cluster configuration parameters to the extract S3 bucket.
- Run the extract process to get logs from the source cluster and put them in the extract S3 bucket.
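The snapshot steps in the state machine can be sketched in code. The snapshot identifier prefix matches the one described above; the `authorize_snapshot_access` parameters are the standard Amazon Redshift API fields, shown here as a plain dict rather than an executed call:

```python
def snapshot_identifier(cluster_identifier: str) -> str:
    # "ra3-migration-evaluation-snapshot-" concatenated with the cluster identifier
    return "ra3-migration-evaluation-snapshot-" + cluster_identifier

def authorize_snapshot_params(cluster_identifier: str, replay_account_id: str) -> dict:
    # Request parameters for redshift.authorize_snapshot_access(...),
    # which shares the snapshot with the replay account
    return {
        "SnapshotIdentifier": snapshot_identifier(cluster_identifier),
        "AccountWithRestoreAccess": replay_account_id,
    }
```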
The extract CloudFormation template automatically initiates the first iteration of the extract process, but you can rerun the process at any time by submitting the state machine with `start_time` and `end_time` input parameters. Replace the values for `start_date` and `end_date` with actual dates in ISO-8601 format (for example, `2021-03-05T12:30:00+00:00`). The following screenshot shows the state machine run inputs.
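If you script the rerun, you can validate the window before submitting it. The following is a minimal sketch; the key names mirror the input parameters above and may need adjusting to your deployed state machine, and the 24-hour limit reflects the extract-window guidance later in this post:

```python
import json
from datetime import datetime, timedelta

def build_extract_input(start_time: str, end_time: str) -> str:
    """Validate ISO-8601 timestamps and return the state machine input JSON."""
    start = datetime.fromisoformat(start_time)
    end = datetime.fromisoformat(end_time)
    if end <= start:
        raise ValueError("end_time must be after start_time")
    if end - start > timedelta(hours=24):
        raise ValueError("extract window must not exceed 24 hours")
    return json.dumps({"start_time": start_time, "end_time": end_time})

# The resulting JSON can then be submitted, for example with:
#   boto3.client("stepfunctions").start_execution(
#       stateMachineArn=extract_state_machine_arn,  # placeholder
#       input=build_extract_input(start, end))
```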
You need to deploy the CloudFormation template for the extract process in the same account where the source cluster is hosted. This template requires you to provide the following parameters:
- SourceRedshiftClusterEndpoint – The non-RA3 source cluster endpoint, including port number and database name.
- AccountIdForReplay – If you plan to run the replay process in a different account, enter the 12-digit AWS account ID in this parameter. Enter `N/A` if you’re running the extract and replay processes in the same account.
- SimpleReplayStartTime – The start date and time in ISO-8601 format (for example, `2021-01-20T21:41:16+00:00`) for the first iteration of the extract process from the source cluster. You can change it later in the input JSON of your extract state machine.
- SimpleReplayEndTime – The end date and time in ISO-8601 format for when you want to extract from the source cluster and replay in the target RA3 cluster. You can change it later in the input JSON of your extract state machine. Make sure the difference between start and end time doesn’t exceed 24 hours.
- ExtractSystemTables – An optional step if you want to extract the source cluster system tables for reference. We recommend setting this parameter to `No` because enabling it adds an AWS Identity and Access Management (IAM) role to the source cluster to unload system tables from the source cluster.
- EndUserIamRoleName – An existing IAM role name for the end user who runs the extract-replay evaluation. You can use this parameter to allow non-admin users to run the extract-replay state machine without any other permissions on the AWS resources. Enter `N/A` if you don’t want to grant any end-user permissions.
- EC2InstanceAMI – The AMI for the Amazon Linux 2 based EC2 instance. We recommend keeping the default AMI for this parameter unless you need to change it for compliance requirements.
After you deploy the template, navigate to the Outputs tab of the template, which lists some relevant parameters needed for the replay process deployment.
End-to-end workflow: Replay process
The second part of this solution is to deploy the replay process using the CloudFormation template either in the same account where the extract process was run or a different account in the same Region.
This process provisions two Amazon Redshift clusters: a replica cluster with the exact same configuration as your source cluster, and a target cluster with the RA3 configuration. It deploys two EC2 instances of the M5 family with the Simple Replay utility installed and replays the extracted workload simultaneously in these clusters. Because the replay process preserves the time interval between queries and transactions to mimic the exact workload from the source cluster, this process takes about the same time as the duration between the `start_time` and `end_time` that you provided while running the extract process. The following diagram illustrates the architecture of the solution.
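To illustrate why the replay takes roughly as long as the extract window, here is a minimal timing sketch (not the Simple Replay implementation): each statement waits until its original offset from the start of the workload before running.

```python
import time

def replay_with_original_timing(events, run_query):
    """events: list of (offset_seconds, sql) sorted by offset from workload start."""
    started = time.monotonic()
    for offset, sql in events:
        delay = offset - (time.monotonic() - started)
        if delay > 0:
            time.sleep(delay)  # wait until the statement's original offset
        run_query(sql)
```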
The following diagram shows the Step Functions state machine for the replay process.
The state machine performs the following steps to replay the extracted workload from the extract S3 bucket:
- Update the Amazon Redshift parameter group to the same configuration as the source cluster parameter group, which was saved in the extract S3 bucket as part of the extract process.
- Initiate the create cluster process for the replica and target clusters in parallel if they don’t exist. The replica cluster is created with the exact same configuration as the source cluster, and the target cluster is created with RA3 configuration if the source cluster is compatible for elastic resize for the RA3 configuration, which you specified when deploying the CloudFormation template. If the target RA3 configuration isn’t compatible for elastic resize, it creates the target cluster with the same configuration as the replica cluster.
- If the previous step created the target cluster with a non-RA3 configuration due to incompatibility with elastic resize, it performs a classic resize on that cluster when it becomes available.
- If the target or replica cluster is in a paused state, it resumes the cluster.
- If the target or replica cluster is in an available state and any restore operation (if applicable) has been completed on that cluster, it runs a SQL script to set up some Amazon Redshift objects in the public schema of the cluster to perform automated performance comparison between the clusters.
- When the setup process is complete for both target and replica clusters, it runs the replay process in both the clusters simultaneously, which runs all SQLs extracted from the source cluster, maintaining the same transaction order and time interval as the source cluster.
- When the replay process is complete, it unloads query statistics from the replica cluster and loads them to the target RA3 cluster, enabling direct performance comparison between the environments from within the RA3 cluster.
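The final step can be pictured as an UNLOAD from the replica followed by a COPY into the target RA3 cluster. The statements below are illustrative only; the S3 path and IAM role are placeholders, and the exact statements the automation runs may differ:

```python
def unload_copy_statements(s3_prefix: str, iam_role_arn: str):
    """Build an illustrative UNLOAD/COPY pair for moving replica query stats."""
    # Run on the replica cluster: export its query statistics to S3
    unload = (
        f"unload ('select * from public.replica_cluster_query_stats') "
        f"to '{s3_prefix}' iam_role '{iam_role_arn}' format as parquet"
    )
    # Run on the target RA3 cluster: load the exported statistics
    copy = (
        f"copy public.replica_cluster_query_stats from '{s3_prefix}' "
        f"iam_role '{iam_role_arn}' format as parquet"
    )
    return unload, copy
```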
The CloudFormation template for the replay process automatically initiates the first iteration of the replay process, but you can rerun the process any time by submitting the state machine without any parameters. This template requires you to provide the following parameters:
- SourceAccountNumber – The source account number where the extract process ran. You can find it on the Outputs tab of the extract stack.
- SourceAccountSimpleReplayS3Bucket – The extract S3 bucket, which was created by the extract template (available on the stack Outputs tab).
- SourceRedshiftClusterEndpoint – The non-RA3 source cluster endpoint, including port number and database name (available on the stack Outputs tab).
- SourceRedshiftClusterKMSKeyARN – The AWS Key Management Service (AWS KMS) key ARN (Amazon Resource Name) if your source Redshift cluster is encrypted (available on the stack Outputs tab). If your source cluster is encrypted, you need to run extract and replay in the same account.
- SourceRedshiftClusterMasterUsername – The user name that is associated with the primary user account for the source cluster (available on the stack Outputs tab).
- SourceRedshiftClusterPrimaryDatabase – The primary database name in the source cluster for which you want to replay the workload. Amazon Redshift automatically creates a default database named `dev`, which may not be your primary database. Enter the correct value based on your deployment. If you have multiple databases, you need to run extract and replay for these databases one at a time.
- TargetRedshiftClusterNodeType – The type of RA3 node to be provisioned. We recommend using the node type and node count as suggested in Upgrading to RA3 node types.
- TargetRedshiftClusterNumberOfNodes – The number of compute nodes in the cluster.
- EndUserIamRoleName – An existing IAM role name for the end user who runs the extract-replay evaluation. You can use this parameter to allow non-admin users to run the extract-replay state machine without any other permissions on the AWS resources. Enter `N/A` if you don’t want to grant any end-user permissions.
- GrantS3ReadOnlyAccessToRedshift – If you’re deploying the extract and replay processes in the same account, you may enter `Yes` for this parameter, which grants `AmazonS3ReadOnlyAccess` to the Amazon Redshift target and replica clusters to replay COPY statements from Amazon Redshift within the account. Otherwise, you need to manually copy the files, adjust the `copy_replacement.csv` file in the extract S3 bucket’s latest extract folder, and set the parameter for COPY statements to `true` in the replay S3 bucket’s `config/replay.yaml` file.
- VPC – An existing Amazon Virtual Private Cloud (Amazon VPC) where you want to deploy the clusters and EC2 instances.
- SubnetId – An existing subnet within the VPC in which you deploy the clusters and EC2 instances.
- KeyPairName – An existing key pair to allow SSH to the replay EC2 instance.
- OnPremisesCIDR – The IP range (CIDR notation) for your existing infrastructure to access the target and replica clusters from a SQL client. If unsure, enter your corporate desktop’s CIDR address. For instance, if your desktop’s IP address is `10.156.87.45`, enter `10.156.87.45/32`.
- EC2InstanceType – The EC2 instance type to host the Simple Replay utility codebase. You can use a large instance type if the data size in your cluster is less than 1 TB. We recommend larger instance types for larger workloads so that the EC2 instance doesn’t become a bottleneck when fetching query results from the clusters.
- EC2InstanceVolumeGiB – The size of the EC2 instance volume in GiBs. We recommend keeping it 30 GiB or more.
- EC2InstanceAMI – The AMI for the Amazon Linux 2 based EC2 instance. Don’t change this parameter unless you need to for compliance requirements.
Access permissions and security
To deploy this solution with AWS CloudFormation, you need administrator access on the AWS accounts where you plan to deploy the extract and replay processes. Both templates provide the input parameter `EndUserIamRoleName`, which you can use to allow non-admin users to run the processes without any extensive permissions on the system resources.
The CloudFormation templates provision all the required resources following security best practices based on the principle of least privilege and host all resources within your account’s VPC. The EC2 instances and Amazon Redshift clusters share the same security group, and SSH access isn’t allowed on the EC2 instances. Access to the Amazon Redshift target and replica clusters is controlled with the CloudFormation template parameter OnPremisesCIDR, which you provide to allow on-premises users to connect to the new clusters using their SQL clients on the Amazon Redshift port.
Access permissions for all the resources are controlled using IAM roles granting appropriate permissions to Amazon Redshift, Lambda, Step Functions, and Amazon EC2. Read and write access privileges are granted on the S3 bucket created by the extract process to the AWS account used for the replay process so that it can read and update the configurations from that bucket.
Evaluate RA3 performance
After you complete the first run of the replay state machine, you should be able to view the RA3 target and non-RA3 replica clusters on the Amazon Redshift console of your AWS account where the replay template was deployed. Each iteration of the replay automatically populates the following tables and views in the public schema of the target RA3 cluster, allowing you to directly compare performance between the clusters:
- source_target_comparison – Provides a comparison summary between the time taken by the two clusters to replay the workloads. It provides the column total_query_time_saved_seconds, grouped by Amazon Redshift queue and user name, which may be very useful in your final evaluation.
- source_target_comparison_raw – Provides a detailed comparison between the time taken by the two clusters for each query.
- replica_cluster_query_stats – Stores query-level metrics for the replay that ran on the replica cluster.
- target_cluster_query_stats – Stores query-level metrics for the replay that ran on the RA3 cluster.
- source_cluster_query_stats – Stores query-level metrics from the source cluster. This table may be empty because it relies on STL log views in the source cluster, which are retained only for 2–5 days. For more information, see STL views for logging.
- detailed_query_stats – The view that populates the query_stats tables; it contains the logic we use to gather these statistics from STL log views.
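As a starting point for your evaluation, you could run a query like the following against the summary view from the target RA3 cluster. Only `total_query_time_saved_seconds` is a column name confirmed above; the ordering is a suggestion, and the connection is left to your SQL client of choice:

```python
# Illustrative query against the comparison summary in the target RA3 cluster.
COMPARISON_SQL = (
    "select * "
    "from public.source_target_comparison "
    "order by total_query_time_saved_seconds desc"
)
# Run COMPARISON_SQL with your preferred driver, e.g. cursor.execute(COMPARISON_SQL)
```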
Cost and timeline considerations
Running this template in your AWS account has some cost implications because it provisions new Amazon Redshift clusters and three EC2 instances, which may be charged as on-demand instances if you don’t have Reserved Instances. When you complete your evaluations, we recommend deleting the CloudFormation stack. This deletes all associated resources except the two S3 buckets for extract and replay. We also recommend pausing your clusters when not in use. For more information, see Amazon Redshift Pricing and Amazon EC2 pricing.
Limitations
Simple Replay and this automation process have some known limitations:
- The extract process may fail if there is lag in delivery of audit logs to Amazon S3. In that case, you need to rerun the extract state machine by selecting a different time interval from the past.
- Dependent SQL queries across connections aren’t guaranteed to run in the original order.
- Redshift Spectrum queries aren’t replayed if the target cluster doesn’t have access to external tables.
- Queries with BIND variables aren’t replayed.
- Replay using JDBC isn’t supported.
- Concurrent large fetches might cause strain on the Amazon EC2 client. A larger EC2 instance might be required for these cases.
- Audit logs might contain SQLs that weren’t submitted to the production cluster. These SQLs will be replayed.
You can minimize the impact of these limitations by using the replica and target clusters for comparison rather than direct comparison with the production Amazon Redshift cluster.
Conclusion
Amazon Redshift RA3 instances offer many added benefits compared to their predecessors. If you’re trying to migrate to the RA3 instance type but are concerned about the evaluation effort, the Amazon Redshift Simple Replay utility can make it easy and seamless to perform this evaluation and successfully migrate to RA3 nodes.
If you’re satisfied with the performance of the RA3 instance type, you can perform a resize operation on your production cluster to move it to the RA3 platform. The time taken to resize your production cluster is close to the time it took to create the test RA3 cluster, depending on whether elastic or classic resize was used. We recommend creating a manual snapshot before performing the resize operation on your production cluster.
About the Authors
Manash Deb is a Senior Analytics Specialist Solutions Architect at AWS. He has worked on building end-to-end data driven solutions in different database and data warehousing technologies for over fifteen years. He loves to learn new technologies and solving, automating and simplifying customer problems with easy-to-use cloud data solutions on AWS.
Srinath Madabushi is a Senior Database Engineer at Amazon Redshift. He has worked in various data warehousing, BI and database technologies for over 16 years. He is an avid big data enthusiast who collaborates with customers around the globe to meet their data warehousing needs.