
Accelerate your data warehouse migration to Amazon Redshift – Part 2


This is the second post in a multi-part series. We’re excited to share dozens of new features that automate your schema conversion; preserve your investment in existing scripts, reports, and applications; accelerate query performance; and potentially reduce your overall cost to migrate to Amazon Redshift. Check out the first post, Accelerate your data warehouse migration to Amazon Redshift – Part 1, to learn more about automated macro conversion, case-insensitive string comparison, case-sensitive identifiers, and other exciting new features.

Amazon Redshift is the leading cloud data warehouse. No other data warehouse makes it as easy to gain new insights from your data. With Amazon Redshift, you can query exabytes of data across your data warehouse, operational data stores, and data lake using standard SQL. You can also integrate other services like Amazon EMR, Amazon Athena, and Amazon SageMaker to use all the analytic capabilities in the AWS Cloud.

Many customers have asked for help migrating their self-managed data warehouse engines to Amazon Redshift. In these cases, you may have terabytes (or petabytes) of historical data, a heavy reliance on proprietary features, and thousands of extract, transform, and load (ETL) processes and reports built over years (or decades) of use.

Until now, migrating a data warehouse to AWS was complex and involved a significant amount of manual effort.

Today, we’re happy to share additional enhancements to the AWS Schema Conversion Tool (AWS SCT) to automate your migrations to Amazon Redshift. These enhancements reduce the recoding needed for your data tables, and more importantly, the manual work needed for views, stored procedures, scripts, and other application code that use those tables.

In this post, we introduce automation for INTERVAL and PERIOD data types, automatic type casting, binary data support, and some other enhancements that have been requested by customers. We show you how to use AWS SCT to convert objects from a Teradata data warehouse and provide links to relevant documentation so you can continue exploring these new capabilities.

INTERVAL data types

An INTERVAL is an unanchored duration of time, like “1 year” or “2 hours,” that doesn’t have a specific start or end time. In Teradata, INTERVAL data is implemented as 13 distinct data types depending on the granularity of time being represented. The following table summarizes these types.

Year intervals: INTERVAL YEAR, INTERVAL YEAR TO MONTH
Month intervals: INTERVAL MONTH
Day intervals: INTERVAL DAY, INTERVAL DAY TO HOUR, INTERVAL DAY TO MINUTE, INTERVAL DAY TO SECOND
Hour intervals: INTERVAL HOUR, INTERVAL HOUR TO MINUTE, INTERVAL HOUR TO SECOND
Minute intervals: INTERVAL MINUTE, INTERVAL MINUTE TO SECOND
Second intervals: INTERVAL SECOND

Amazon Redshift doesn’t support INTERVAL data types natively. Previously, if you used INTERVAL types in your data warehouse, you had to develop custom code as part of the database conversion process.

Now, AWS SCT automatically converts INTERVAL data types for you. AWS SCT converts an INTERVAL column into a CHARACTER VARYING column in Amazon Redshift. Then AWS SCT converts your application code that uses the column to emulate the INTERVAL semantics.

For example, consider the following Teradata table, which has a MONTH interval column. This table stores the different types of leaves of absence and the allowable duration of each.

CREATE TABLE testschema.loa_durations (
  loa_type_id INTEGER
, loa_name VARCHAR(100) CHARACTER SET LATIN
, loa_duration INTERVAL MONTH(2)
)
PRIMARY INDEX (loa_type_id);

AWS SCT converts the table to Amazon Redshift as follows. Because Amazon Redshift doesn’t have a native INTERVAL data type, AWS SCT replaces it with a VARCHAR data type.

CREATE TABLE testschema.loa_durations (
  loa_type_id INTEGER
, loa_name VARCHAR(100)
, loa_duration VARCHAR(64)
)
DISTSTYLE KEY
DISTKEY (loa_type_id)
SORTKEY (loa_type_id);

Now, let’s suppose your application code uses the loa_duration column, like the following Teradata view. Here, the INTERVAL MONTH field is added to the current date to compute when a leave of absence ends if it starts today.

REPLACE VIEW testschema.loa_projected_end_date AS
SELECT loa_type_id loa_type_id
, loa_name loa_name
, loa_duration
, current_date AS today
, current_date + loa_duration AS end_date
FROM
testschema.loa_durations
;

Because the data is stored as CHARACTER VARYING, AWS SCT injects the proper type CAST into the Amazon Redshift code to interpret the string values as a MONTH interval. It then converts the arithmetic using Amazon Redshift date functions.

CREATE OR REPLACE VIEW testschema.loa_projected_end_date (loa_type_id, loa_name, loa_duration, today, end_date) AS
SELECT loa_type_id AS loa_type_id
, loa_name AS loa_name
, loa_duration
, CURRENT_DATE AS today
, dateadd(MONTH, CAST(loa_duration AS INTEGER), CURRENT_DATE)::DATE AS end_date
FROM testschema.loa_durations
;

Also, as a bonus, AWS SCT automatically converts any literal INTERVAL values that you might be using in your code.

For example, consider the following Teradata table. The table contains a DATE column, which records the last date when an employee was promoted.

CREATE TABLE TESTSCHEMA.employees (
  id INTEGER
, name VARCHAR(20) CHARACTER SET LATIN NOT CASESPECIFIC
, manager_id INTEGER
, last_promo_date DATE FORMAT 'YY/MM/DD'
)
UNIQUE PRIMARY INDEX (id);

Now, suppose the database contains a view that computes the next date an employee is eligible for a promotion. We implement a business rule that employees who have never been promoted are eligible for promotion in 1.5 years. All other employees become eligible 2.5 years after their last promotion. See the following code:

REPLACE VIEW testschema.eligible_for_promo AS
SELECT id
, name
, last_promo_date
, CASE WHEN last_promo_date IS NULL THEN current_date + INTERVAL '18' MONTH
       ELSE last_promo_date + INTERVAL '2-06' YEAR TO MONTH
  END eligible_date
FROM employees
;

AWS SCT converts the INTERVAL values used in the CASE statement and translates the date expressions accordingly:

CREATE OR REPLACE VIEW testschema.eligible_for_promo (id, name, last_promo_date, eligible_date) AS
SELECT id
, name
, last_promo_date
, CASE WHEN last_promo_date IS NULL THEN dateadd(MONTH, 18, CURRENT_DATE)::DATE
       ELSE dateadd(MONTH, 30, last_promo_date)::DATE
  END AS eligible_date
FROM testschema.employees
;

We’re excited about INTERVAL automation in AWS SCT and encourage you to give it a try. For more information about getting started with AWS SCT, see Installing, verifying, and updating AWS SCT.

PERIOD data type

A PERIOD data value represents a duration of time with a specified start and end. For example, the Teradata literal “(2021-01-01 to 2021-01-31)” is a period with a duration of 31 days that starts and ends on the first and last day of January 2021, respectively. PERIOD data types can have three different granularities: DATE, TIME, or TIMESTAMP. The following table provides some examples.

Period Type          Example
PERIOD(DATE)         (2021-01-01 to 2021-01-31)
PERIOD(TIME)         (12:00:00 to 13:00:00)
PERIOD(TIMESTAMP)    (2021-01-31 00:00:00 to 2021-01-31 23:59:59)

As with INTERVAL, the PERIOD data type isn’t natively supported by Amazon Redshift. Previously, if you used these data types in your tables, you had to write custom code as part of the database conversion process.

Now, AWS SCT automatically converts PERIOD data types for you. AWS SCT converts a PERIOD column into two DATE (or TIME or TIMESTAMP) columns as appropriate on Amazon Redshift. Then AWS SCT converts your application code that uses the column to emulate the source engine semantics.

For example, consider the following Teradata table:

CREATE SET TABLE testschema.period_table (
  id INTEGER
, period_col PERIOD(TIMESTAMP)
)
UNIQUE PRIMARY INDEX (id);

AWS SCT converts the PERIOD(TIMESTAMP) column into two TIMESTAMP columns in Amazon Redshift:

CREATE TABLE IF NOT EXISTS testschema.period_table (
  id INTEGER
, period_col_begin TIMESTAMP
, period_col_end TIMESTAMP
)
DISTSTYLE KEY
DISTKEY (id)
SORTKEY (id);

Now, let’s look at a simple example of how you can use AWS SCT to convert your application code. A common operation in Teradata is to extract the starting (or ending) timestamps in a PERIOD value using the BEGIN and END built-in functions:

REPLACE VIEW testschema.period_view_begin_end AS
SELECT BEGIN(period_col) AS period_start
, END(period_col) AS period_end
FROM testschema.period_table
;

AWS SCT converts the view to reference the transformed table columns:

CREATE OR REPLACE VIEW testschema.period_view_begin_end (period_start, period_end) AS
SELECT period_col_begin AS period_start
, period_col_end AS period_end
FROM testschema.period_table;
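
Other PERIOD operations map to the converted two-column layout just as directly. As a quick sketch of our own (not AWS SCT output), a Teradata containment check such as WHERE period_col CONTAINS TIMESTAMP '2021-06-15 12:00:00' can be emulated with a range predicate; note that Teradata treats the end bound of a period as exclusive, so we use >= and < rather than BETWEEN:

-- Illustrative emulation of a Teradata CONTAINS predicate on the
-- two TIMESTAMP columns produced by AWS SCT (end bound is exclusive).
SELECT id
FROM testschema.period_table
WHERE TIMESTAMP '2021-06-15 12:00:00' >= period_col_begin
  AND TIMESTAMP '2021-06-15 12:00:00' <  period_col_end;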

We’ll continue to build automation for PERIOD data conversion, so stay tuned for more improvements. In the meantime, you can try out the PERIOD data type conversion features in AWS SCT now. For more information, see Installing, verifying, and updating AWS SCT.

Type casting

Some data warehouse engines, like Teradata, provide an extensive set of rules to cast data values in expressions. These rules permit implicit casts, where the target data type is inferred from the expression, and explicit casts, which typically use a function to perform the type conversion.
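
To make the distinction concrete, here’s a small sketch of our own (the orders table and its columns are hypothetical). The first query leans on Teradata’s implicit casting rules to compare a numeric column with a string literal; the second writes the conversion out with an explicit CAST:

-- Implicit cast: Teradata converts the string literal '100' to match
-- the numeric type of order_total before comparing.
SELECT order_id FROM orders WHERE order_total > '100';

-- Explicit cast: the conversion is spelled out, so no implicit rule is needed.
SELECT order_id FROM orders WHERE order_total > CAST('100' AS DECIMAL(10,2));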

Previously, you had to manually convert implicit cast operations in your SQL code. Now, we’re happy to share that AWS SCT automatically converts implicit casts as needed. This feature is available now for the following set of high-impact Teradata data types.

Category   Source data type   Target data types
Numeric    CHAR               BIGINT, NUMBER, TIMESTAMP
Numeric    VARCHAR            NUMBER, NUMERIC, DEC, CHAR, GEOMETRY
Numeric    INTEGER            DATE, DEC
Numeric    BIGINT             DATE
Numeric    NUMBER             CHARACTER, VARCHAR, DEC
Numeric    DECIMAL            DATE, TIMESTAMP, SMALLINT, DOUBLE PRECISION
Numeric    FLOAT              DEC
Time       DATE               BIGINT, INTEGER, DECIMAL, FLOAT, NUMBER, CHARACTER, TIMESTAMP
Time       INTERVAL           NUMBER, BIGINT, INTEGER
Other      GEOMETRY           DECIMAL

Let’s look at how to cast numbers to DATE. Many Teradata applications treat numbers and DATE as equivalent values. Internally, Teradata stores DATE values as INTEGER. The rules to convert between an INTEGER and a DATE are well-known and developers have commonly exploited this information to perform date calculations using INTEGER arithmetic.
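
As background, Teradata encodes a DATE internally as the integer (year − 1900) × 10000 + month × 100 + day. A quick worked example (the query is illustrative and assumes a Teradata session):

-- 2041-03-30 encodes as (2041 - 1900) * 10000 + 3 * 100 + 30 = 1410330,
-- so casting the integer back to DATE reverses the encoding.
SELECT CAST(1410330 AS DATE);  -- 2041-03-30

Keep this encoding in mind; it explains the integer value used in the macro that follows.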

For example, consider the following Teradata table:

CREATE TABLE testschema.employees (
  id INTEGER
, name VARCHAR(20) CHARACTER SET LATIN
, manager_id INTEGER
, last_promo_date DATE FORMAT 'YY/MM/DD'
)
UNIQUE PRIMARY INDEX (id);

We insert a single row of data into the table:

select * from employees;

 *** Query completed. One row found. 4 columns returned.
 *** Total elapsed time was 1 second.

         id name                  manager_id last_promo_date
----------- -------------------- ----------- ---------------
        112 Britney                      201 ?

We use a macro to update the last_promo_date field for id = 112. The macro accepts a BIGINT parameter to populate the DATE field.

replace macro testschema.set_last_promo_date(emp_id integer, emp_promo_date bigint) AS (
update testschema.employees
set last_promo_date = :emp_promo_date
where id = :emp_id;
);

Now, we run the macro and check the value of the last_promo_date attribute:

exec testschema.set_last_promo_date(112, 1410330);

 *** Update completed. One row changed.
 *** Total elapsed time was 1 second.

select * from employees;

 *** Query completed. One row found. 4 columns returned.
 *** Total elapsed time was 1 second.

         id name                  manager_id last_promo_date
----------- -------------------- ----------- ---------------
        112 Britney                      201 41/03/30

You can see the last_promo_date attribute is set to the date March 30, 2041.

Now, let’s use AWS SCT to convert the table and macro to Amazon Redshift. As we saw in Part 1 of this series, AWS SCT converts the Teradata macro into an Amazon Redshift stored procedure:

CREATE TABLE IF NOT EXISTS testschema.employees (
  id INTEGER
, name CHARACTER VARYING(20)
, manager_id INTEGER
, last_promo_date DATE
)
DISTSTYLE KEY
DISTKEY (id)
SORTKEY (id);

CREATE OR REPLACE PROCEDURE testschema.set_last_promo_date(par_emp_id INTEGER, par_emp_promo_date BIGINT)
AS $BODY$
BEGIN
  UPDATE testschema.employees
  SET last_promo_date = TO_DATE((par_emp_promo_date + 19000000), 'YYYYMMDD')
  WHERE id = par_emp_id;
END;
$BODY$
LANGUAGE plpgsql;

Note that 20410330 = 1410330 + 19000000; so adding 19,000,000 to the input returns the correct date value 2041-03-30.

Now, when we run the stored procedure, it updates the last_promo_date as expected:

myredshift=# select * from testschema.employees;
 id  |  name   | manager_id | last_promo_date
-----+---------+------------+-----------------
 112 | Britney |        201 |
(1 row)

myredshift=# call testschema.set_last_promo_date(112, 1410330);
CALL

myredshift=# select * from testschema.employees;
 id  |  name   | manager_id | last_promo_date
-----+---------+------------+-----------------
 112 | Britney |        201 | 2041-03-30
(1 row)

Automatic data type casting is available in AWS SCT now. You can download the latest version and try it out.

BLOB data

Amazon Redshift doesn’t have native support for BLOB columns, which you use to store large binary objects like text or images.

Previously, if you were migrating a table with a BLOB column, you had to manually move the BLOB values to file storage, like Amazon Simple Storage Service (Amazon S3), then add a reference to the S3 file in the table. Using Amazon S3 as the storage target for binary objects is a best practice because these objects are large and typically have low analytic value.

We’re happy to share that AWS SCT now automates this process for you. AWS SCT replaces the BLOB column with a CHARACTER VARYING column on the target table. Then, when you use the AWS SCT data extractors to migrate your data, the extractors upload the BLOB value to Amazon S3 and insert a reference to the BLOB into the target table.

For example, let’s create a table in Teradata and populate it with some data:

CREATE SET TABLE TESTSCHEMA.blob_table (
  id INTEGER
, blob_col BLOB(10485760)
)
PRIMARY INDEX (id);

select * from blob_table;

 *** Query completed. 2 rows found. 2 columns returned.
 *** Total elapsed time was 1 second.

         id blob_col
----------- ---------------------------------------------------------------
          1 AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
          2 BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB

Now, we convert the table with AWS SCT and build it on Amazon Redshift:

myredshift=# \d testschema.blob_table
                    Table "testschema.blob_table"
  Column  |           Type           | Collation | Nullable | Default
----------+--------------------------+-----------+----------+---------
 id       | integer                  |           |          |
 blob_col | character varying(1300)  |           |          |

Then we use the AWS SCT data extractors to migrate the table data from Teradata to Amazon Redshift.

When we look at the table in Amazon Redshift, we can see the paths to the S3 files that contain the BLOB values:

myredshift=# select * from testschema.blob_table;
 id | blob_col
----+---------------------------------------------------------------------------------------------------------------------------------------
  2 | s3://<bucket name>/data/c12f53330dd3427a845a77f143d4a1a1/dbdee8e0485c481dad601fd6170fbfb4_lobs/2/308b6f0a902941e793212058570cdda5.dat
  1 | s3://<bucket name>/data/c12f53330dd3427a845a77f143d4a1a1/dbdee8e0485c481dad601fd6170fbfb4_lobs/2/a7686067af5549479b52d81e83c3871e.dat
(2 rows)

And on Amazon S3, you can see the actual data files. There are two, one for each BLOB value:

$ aws s3 ls s3://<bucket name>/data/c12f53330dd3427a845a77f143d4a1a1/dbdee8e0485c481dad601fd6170fbfb4_lobs/2/
2021-05-13 23:59:47 23 522fee54fda5472fbae790f43e36cba1.dat
2021-05-13 23:59:47 24 5de6c53831f741629476e2c2cbc6b226.dat

BLOB support is available now in AWS SCT and the AWS SCT data extractors. Download the latest version of the application and try it out today.

Multi-byte CHARACTER conversion

Teradata supports multibyte characters in CHARACTER data columns, which are fixed-length fields. Amazon Redshift supports multibyte characters in CHARACTER VARYING fields, but not in fixed-length CHARACTER columns.
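
The practical wrinkle is that Amazon Redshift declares character column lengths in bytes, and a multibyte UTF-8 character occupies more than one byte, so copying the character length over verbatim can truncate data. A quick check of our own (assuming a UTF-8 client) shows the difference between characters and bytes in Amazon Redshift:

-- LEN counts characters; OCTET_LENGTH counts bytes.
-- The five Japanese characters below take 15 bytes in UTF-8.
SELECT LEN('こんにちは') AS char_count
     , OCTET_LENGTH('こんにちは') AS byte_count;
-- char_count = 5, byte_count = 15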

Previously, if you had fixed-length CHARACTER columns, you had to determine if they contained multibyte character data, and increase the target column size as appropriate.

AWS SCT now bridges this gap for you. If your Teradata tables contain CHARACTER columns with multibyte characters, AWS SCT automatically converts these columns to Amazon Redshift CHARACTER VARYING fields and sets the column sizes accordingly. Consider the following example, which contains four columns: a LATIN column that contains only single-byte characters, and UNICODE, GRAPHIC, and KANJISJIS columns that can contain multibyte characters:

create table testschema.char_table (
  latin_col char(70) character set latin
, unicode_col char(70) character set unicode
, graphic_col char(70) character set graphic
, kanjisjis_col char(70) character set kanjisjis
);

AWS SCT translates the LATIN column to a fixed-length CHARACTER column. The multibyte columns are upsized and converted to CHARACTER VARYING:

CREATE TABLE IF NOT EXISTS testschema.char_table (
  latin_col CHARACTER(70)
, unicode_col CHARACTER VARYING(210)
, graphic_col CHARACTER VARYING(210)
, kanjisjis_col CHARACTER VARYING(210)
)
DISTSTYLE KEY
DISTKEY (col1)
SORTKEY (col1);

Automatic conversion for multibyte CHARACTER columns is available in AWS SCT now.

GEOMETRY data type size

Amazon Redshift has long supported geospatial data with a GEOMETRY data type and associated spatial functions.
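
If you haven’t worked with them before, here’s a small sketch of our own (the points table and coordinates are made up for illustration) showing GEOMETRY values being built from well-known text and queried with a spatial function:

-- Build two points and measure the spherical distance between them in meters.
CREATE TABLE testschema.points (id INTEGER, geom GEOMETRY);

INSERT INTO testschema.points VALUES
  (1, ST_GeomFromText('POINT(-122.3321 47.6062)'))
, (2, ST_GeomFromText('POINT(-73.9857 40.7484)'));

SELECT ST_DistanceSphere(a.geom, b.geom) AS distance_in_meters
FROM testschema.points a, testschema.points b
WHERE a.id = 1 AND b.id = 2;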

Previously, Amazon Redshift restricted the maximum size of a GEOMETRY column to 64 KB, which constrained some customers with large objects. Now, we’re happy to share that the maximum size of GEOMETRY objects has been increased to just under 1 MB (specifically, 1,048,447 bytes).

For example, consider the following Teradata table:

create table geometry_table (
  id INTEGER
, geometry_col1 ST_GEOMETRY
, geometry_col2 ST_GEOMETRY(1000)
, geometry_col3 ST_GEOMETRY(1048447)
, geometry_col4 ST_GEOMETRY(10484470)
, geometry_col5 ST_GEOMETRY INLINE LENGTH 1000
);

You can use AWS SCT to convert it to Amazon Redshift. The converted table definition is as follows. A size specification isn’t needed on the converted columns because Amazon Redshift implicitly sets the column size.

CREATE TABLE IF NOT EXISTS testschema.geometry_table(
id INTEGER,
geometry_col1 GEOMETRY,
geometry_col2 GEOMETRY,
geometry_col3 GEOMETRY,
geometry_col4 GEOMETRY,
geometry_col5 GEOMETRY
)
DISTSTYLE KEY
DISTKEY
(
id
)
SORTKEY
(
id
);
ALTER TABLE testschema.geometry_table ALTER DISTSTYLE AUTO;
ALTER TABLE testschema.geometry_table ALTER SORTKEY AUTO;

Large GEOMETRY columns are available in Amazon Redshift now. For more information, see Querying spatial data in Amazon Redshift.

Conclusion

We’re happy to share these new features with you. If you’re contemplating a migration to Amazon Redshift, these capabilities can help automate your schema conversion and preserve your investment in existing reports, applications, and ETL, as well as accelerate your query performance.

This post described a few of the dozens of new features we have recently introduced to automate your data warehouse migrations to Amazon Redshift. We will share more in upcoming posts. You’ll hear about additional SQL automation, a purpose-built scripting language for Amazon Redshift with BTEQ compatibility, and automated support for proprietary SQL features.

Check back soon for more information. Until then, you can learn more about Amazon Redshift and the AWS Schema Conversion Tool on the AWS website. Happy migrating!


About the Author

Michael Soo is a database engineer with the AWS DMS and AWS SCT team at Amazon Web Services.

Source: https://aws.amazon.com/blogs/big-data/part-2-accelerate-your-data-warehouse-migration-to-amazon-redshift/
