Query Snowflake Using Athena Federated Query And Join With Data In Your Amazon S3 Data Lake

If you use data lakes in Amazon Simple Storage Service (Amazon S3) and use Snowflake as your data warehouse solution, you may need to join the data in your data lake with the data in Snowflake. For example, you may want to build a dashboard that joins historical data in your Amazon S3 data lake with the latest data in your Snowflake data warehouse, or create consolidated reporting.

In such use cases, Amazon Athena Federated Query allows you to seamlessly access the data in Snowflake without building ETL pipelines to copy or unload the data to the S3 data lake or Snowflake. This removes the overhead of creating additional extract, transform, and load (ETL) processes and shortens the development cycle.

In this post, we walk you through a step-by-step configuration to set up Athena Federated Query using AWS Lambda to access data in a Snowflake data warehouse.

For this post, we are using the Snowflake connector for Amazon Athena developed by Trianz.

Let's start by discussing the solution and then detail the steps involved.

Solution overview

Data federation refers to the capability to query data in another data store using a single interface (Amazon Athena). The following diagram depicts how a single Amazon Athena federated query uses Lambda to query the underlying data source and parallelizes execution across many workers.

Athena is an interactive query service that makes it easy to analyze data in Amazon S3 using standard SQL. If you have data in sources other than Amazon S3, you can use Athena Federated Query to query the data in place or build pipelines that extract data from multiple data sources and store it in Amazon S3. With Athena Federated Query, you can run SQL queries across data stored in relational, non-relational, object, and custom data sources.

When a federated query runs, Athena identifies the parts of the query that should be routed to the data source connector and runs them with Lambda. The data source connector connects to the source, runs the query, and returns the results to Athena. If the data doesn't fit into the Lambda function's runtime memory, the connector spills the data to Amazon S3, where Athena accesses it later.

Athena uses data source connectors, which internally use Lambda to run federated queries. Data source connectors are pre-built and can be deployed from the Athena console or from the AWS Serverless Application Repository. Based on the user submitting the query, connectors can provide or restrict access to specific data elements.

To implement this solution, we complete the following steps:

Create a secret for the Snowflake instance using AWS Secrets Manager.
Create an S3 bucket and subfolder for Lambda to use.
Configure Athena federation with the Snowflake instance.
Run federated queries with Athena.

Prerequisites

Before getting started, make sure you have a Snowflake data warehouse up and running.

Create a secret for the Snowflake instance

Our first step is to create a secret for the Snowflake instance with a username and password using Secrets Manager.

On the Secrets Manager console, choose Secrets.
Choose Store a new secret.
For Secret type, select Other type of secrets.
Enter the credentials as key-value pairs (username, password) for your Snowflake instance.
For Secret name, enter a name for your secret. Use the prefix snowflake so it's easy to find.

Leave the remaining fields at their defaults and choose Next.
Complete your secret creation.
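
If you prefer to script this step, the following is a minimal sketch using boto3 rather than the console. It assumes boto3 is installed and AWS credentials are configured. The function names, the Region argument, and any secret name you pass in are illustrative and not part of the original walkthrough.

```python
import json


def build_secret_payload(username: str, password: str) -> str:
    """Return the username/password key-value pairs as a JSON string."""
    if not username or not password:
        raise ValueError("username and password must both be non-empty")
    return json.dumps({"username": username, "password": password})


def store_secret(secret_name: str, payload: str, region: str) -> str:
    """Create the secret in Secrets Manager and return its ARN.

    Assumption: boto3 is installed and credentials are available.
    """
    import boto3  # imported lazily so the payload helper works without boto3

    client = boto3.client("secretsmanager", region_name=region)
    response = client.create_secret(Name=secret_name, SecretString=payload)
    return response["ARN"]
```

For example, store_secret("trianz-snowflake-athena", build_secret_payload(user, password), "us-east-1") would create a secret whose name matches the SecretNamePrefix used later in this post. Avoid hard-coding real credentials in source files.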

Create an S3 bucket for Lambda

On the Amazon S3 console, create a new S3 bucket and subfolder for Lambda to use. For this post, we use athena-accelerator/snowflake.
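
The same step can be sketched with boto3. This is an illustration under assumptions: the helper names are hypothetical, the Region is whatever you deploy in, and the "subfolder" is created as a zero-byte object whose key ends in a slash, which is how the S3 console represents folders.

```python
from typing import Tuple


def split_spill_location(location: str) -> Tuple[str, str]:
    """Split 'bucket/folder' into (bucket, 'folder/')."""
    bucket, _, prefix = location.strip("/").partition("/")
    if not bucket:
        raise ValueError("location must start with a bucket name")
    key = prefix + "/" if prefix else ""
    return bucket, key


def create_bucket_and_folder(location: str, region: str) -> None:
    """Create the bucket and an empty folder object for Lambda to use.

    Assumption: boto3 is installed and credentials are available.
    """
    import boto3  # imported lazily so the parsing helper works without boto3

    bucket, key = split_spill_location(location)
    s3 = boto3.client("s3", region_name=region)
    if region == "us-east-1":
        # us-east-1 does not accept a LocationConstraint.
        s3.create_bucket(Bucket=bucket)
    else:
        s3.create_bucket(
            Bucket=bucket,
            CreateBucketConfiguration={"LocationConstraint": region},
        )
    if key:
        s3.put_object(Bucket=bucket, Key=key)
```

With the value used in this post, split_spill_location("athena-accelerator/snowflake") yields the bucket athena-accelerator and the folder key snowflake/.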

Configure Athena federation with the Snowflake instance

To configure Athena data source connector for Snowflake with your Snowflake instance, complete the following steps:

On the AWS Serverless Application Repository console, choose Available applications.
In the search field, enter TrianzSnowflakeAthenaJDBC.

For Application name, enter TrianzSnowflakeAthenaJDBC.
For SecretNamePrefix, enter trianz-snowflake-athena.
For SpillBucket, enter athena-accelerator/snowflake.
For JDBCConnectorConfig, use the format snowflake://jdbc:snowflake://{snowflake_instance_url}/?warehouse={warehousename}&db={databasename}&schema={schemaname}&${secretname}

For example, we enter snowflake://jdbc:snowflake://trianz.snowflakecomputing.com/?warehouse=ATHENA_WH&db=ATHENA_DEV&schema=ATHENA&${trianz-snowflake-athena}

For DisableSpillEncryption, enter False.
For LambdaFunctionName, enter trsnowflake.
For SecurityGroupID, enter the security group ID where the Snowflake instance is deployed.

Make sure to apply valid inbound and outbound rules based on your connection.

For SpillPrefix, create a folder under the S3 bucket you created and specify its name (for example, athena-spill).
For SubnetIds, enter the subnets where the Snowflake instance is running, separated by commas.

Make sure the subnet is in a VPC and has a NAT gateway and internet gateway attached.

Select the I acknowledge check box.
Choose Deploy.
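
The JDBCConnectorConfig value is easy to mistype, so it can help to assemble it programmatically. The following sketch only builds the string in the format shown in the steps above. The function name and its validation are illustrative assumptions, not part of the connector.

```python
def build_jdbc_connector_config(
    instance_url: str,
    warehouse: str,
    database: str,
    schema: str,
    secret_name: str,
) -> str:
    """Assemble the JDBCConnectorConfig string in the format used in this post."""
    parts = {
        "instance_url": instance_url,
        "warehouse": warehouse,
        "database": database,
        "schema": schema,
        "secret_name": secret_name,
    }
    missing = [name for name, value in parts.items() if not value]
    if missing:
        raise ValueError("missing values: " + ", ".join(missing))
    # The trailing ${...} is a literal placeholder naming the secret.
    return (
        f"snowflake://jdbc:snowflake://{instance_url}/"
        f"?warehouse={warehouse}&db={database}&schema={schema}"
        f"&${{{secret_name}}}"
    )
```

Calling it with trianz.snowflakecomputing.com, ATHENA_WH, ATHENA_DEV, ATHENA, and trianz-snowflake-athena reproduces the example value entered above.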

Make sure that the AWS Identity and Access Management (IAM) roles have permissions to access AWS Serverless Application Repository, AWS CloudFormation, Amazon S3, Amazon CloudWatch, AWS CloudTrail, Secrets Manager, Lambda, and Athena. For more information, see Example IAM Permissions Policies to Allow Athena Federated Query.

Run federated queries with Athena

Before running your federated query, make sure that you have selected Athena engine version 2. You can find the current engine version for any workgroup on the Athena console.
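
As an alternative to the console, the engine version can also be read through the Athena GetWorkGroup API. The sketch below assumes boto3 is installed and credentials are configured. The function names are illustrative, and the workgroup name and Region are whatever you use.

```python
from typing import Any, Dict, Optional


def effective_engine_version(response: Dict[str, Any]) -> Optional[str]:
    """Extract the effective engine version from a GetWorkGroup response."""
    config = response.get("WorkGroup", {}).get("Configuration", {})
    return config.get("EngineVersion", {}).get("EffectiveEngineVersion")


def fetch_engine_version(workgroup: str, region: str) -> Optional[str]:
    """Look up the effective engine version for a workgroup.

    Assumption: boto3 is installed and credentials are available.
    """
    import boto3  # imported lazily so the parsing helper works without boto3

    athena = boto3.client("athena", region_name=region)
    return effective_engine_version(athena.get_work_group(WorkGroup=workgroup))
```

For a workgroup on engine version 2, the returned string should read "Athena engine version 2".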

Run your federated queries against tables in the Snowflake database by using lambda:trsnowflake. This is the name of the Lambda function that we entered for LambdaFunctionName in the previous section.

lambda:trsnowflake is a reference to the data source connector Lambda function, using the format lambda:MyLambdaFunctionName. For more information, see Writing Federated Queries.

The following screenshot is an example of a UNION ALL query that combines data in Amazon S3, through a table in the AWS Glue Data Catalog, with a table in Snowflake.
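
A query of this shape can be sketched as follows. The database, schema, table, and column names are hypothetical placeholders rather than the ones in the screenshot, and how identifier case is handled depends on the connector. The helper that submits the query assumes boto3 is installed and that the output location is an S3 path you own.

```python
from typing import List


def build_union_all_query(
    lambda_function: str,
    snowflake_schema: str,
    snowflake_table: str,
    glue_database: str,
    glue_table: str,
    columns: List[str],
) -> str:
    """Combine a Glue Data Catalog table with a Snowflake table via UNION ALL."""
    if not columns:
        raise ValueError("at least one column is required")
    column_list = ", ".join(columns)
    return (
        f"SELECT {column_list} FROM {glue_database}.{glue_table}\n"
        "UNION ALL\n"
        f'SELECT {column_list} FROM "lambda:{lambda_function}".'
        f"{snowflake_schema}.{snowflake_table}"
    )


def run_query(query: str, workgroup: str, output_location: str, region: str) -> str:
    """Submit the query to Athena and return the query execution ID."""
    import boto3  # imported lazily so the query builder works without boto3

    athena = boto3.client("athena", region_name=region)
    response = athena.start_query_execution(
        QueryString=query,
        WorkGroup=workgroup,
        ResultConfiguration={"OutputLocation": output_location},
    )
    return response["QueryExecutionId"]
```

Both SELECT branches must return the same number of columns with compatible types, which is why the sketch reuses one column list for both sides.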

Key performance best practices

If you’re considering Athena Federated Query with Snowflake, we recommend the following best practices:

Athena Federated Query works well for queries with predicate filtering because the predicates are pushed down to the Snowflake database. Use filters and limited-range scans in your queries to avoid full table scans.
If your SQL query requires returning a large volume of data from Snowflake to Athena (which could lead to query timeouts or slow performance), you may consider copying data from Snowflake to your S3 data lake.
If your data in Snowflake is modeled as a snowflake schema (an extension of the star schema), unload your large fact tables into your S3 data lake and leave the dimension tables in Snowflake. If large dimension tables are contributing to slow performance or query timeouts, unload those tables to your S3 data lake as well.
When you run federated queries, Athena spins up multiple Lambda functions, which causes a spike in database connections. It's important to monitor query queuing on your Snowflake warehouse to make sure queries aren't left waiting. Additionally, you can use a multi-cluster warehouse so that Snowflake scales out to serve concurrent connections instead of queuing them.
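
To illustrate the first recommendation, the sketch below builds a federated query restricted to a half-open date range instead of scanning the whole table. The schema, table, and column names are hypothetical, and whether a given predicate is actually pushed down depends on the connector.

```python
from datetime import date
from typing import List


def build_filtered_query(
    lambda_function: str,
    schema: str,
    table: str,
    columns: List[str],
    date_column: str,
    start: date,
    end: date,
) -> str:
    """Build a federated query limited to the range start <= date_column < end."""
    if not columns:
        raise ValueError("at least one column is required")
    if start >= end:
        raise ValueError("start must be earlier than end")
    column_list = ", ".join(columns)
    return (
        f'SELECT {column_list} FROM "lambda:{lambda_function}".{schema}.{table}\n'
        f"WHERE {date_column} >= DATE '{start.isoformat()}'\n"
        f"  AND {date_column} < DATE '{end.isoformat()}'"
    )
```

Selecting only the columns you need, as the sketch does, also reduces the volume of data returned from Snowflake to Athena.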

Conclusion

In this post, you learned how to configure and use Athena Federated Query with Snowflake using Lambda. With Athena Federated Query, you can use all of your data to produce analytics and derive business value without building ETL pipelines to bring data from other data stores, such as Snowflake, into your data lake.

You can use the best practice considerations outlined in the post to help minimize the data transferred from Snowflake for better performance. When queries are well written for federation, the performance penalties are negligible.

For more information, see the Athena User Guide and Using Amazon Athena Federated Query.

About the author

Navnit Shukla is an AWS Specialist Solutions Architect in Analytics. He is passionate about helping customers uncover insights from their data. He has built solutions to help organizations make data-driven decisions.