Build a data quality score card using AWS Glue DataBrew, Amazon Athena, and Amazon QuickSight

Data quality plays an important role when building an extract, transform, and load (ETL) pipeline for sending data to downstream analytical applications and machine learning (ML) models. The adage “garbage in, garbage out” aptly describes why it’s important to filter out bad data before further processing. Continuously monitoring data quality and comparing it with predefined target metrics helps you comply with your governance frameworks.

In November 2020, AWS announced the general availability of AWS Glue DataBrew, a new visual data preparation tool that helps you clean and normalize data without writing code. This reduces the time it takes to prepare data for analytics and ML by up to 80% compared to traditional approaches to data preparation.

In this post, we walk through a solution in which we apply various business rules to determine the quality of incoming data and separate good and bad records. Furthermore, we publish a data quality score card using Amazon QuickSight and make records available for further analysis.

Use case overview

For our use case, we use a public dataset that is available for download at Synthetic Patient Records with COVID-19. It contains 100,000 synthetic patient records in CSV format. Data hosted within SyntheticMass has been generated by Synthea™, an open-source patient population simulation made available by the MITRE Corporation.

When we unzip the 100k_synthea_covid19_csv.zip file, we see the following CSV files:

  • Allergies.csv
  • Careplans.csv
  • Conditions.csv
  • Devices.csv
  • Encounters.csv
  • Imaging_studies.csv
  • Immunizations.csv
  • Medications.csv
  • Observations.csv
  • Organizations.csv
  • Patients.csv
  • Payer_transitions.csv
  • Payers.csv
  • Procedures.csv
  • Providers.csv
  • Supplies.csv

We perform the data quality checks categorized by the following data quality dimensions:

  • Completeness
  • Consistency
  • Integrity

For our use case, these CSV files are maintained by your organization’s data ingestion team, which uploads the updated CSV file to Amazon Simple Storage Service (Amazon S3) every week. The good and bad records are separated through a series of data preparation steps, and the business team uses the output data to create business intelligence (BI) reports.

Architecture overview

The following architecture uses DataBrew for data preparation and building key performance indicators (KPIs), Amazon Athena for data analysis with standard SQL, and QuickSight for building the data quality score card.

The workflow includes the following steps:

  1. The ingestion team receives CSV files in an S3 input bucket every week.
  2. A DataBrew schedule configured to run every week triggers the recipe job.
  3. DataBrew processes the input files and generates output files that contain additional fields depending on the recipe job logic.
  4. After the output data is written, we create an external table on top of it by creating and running an AWS Glue crawler.
  5. The good and bad records are separated by creating views on top of the external table.
  6. Data analysts can use Athena to analyze good and bad records.
  7. The records can also be separated directly using QuickSight calculated fields.
  8. We use QuickSight to create the data quality score card in the form of a dashboard, which fetches data through Athena.

Prerequisites

Before getting started with this walkthrough, make sure you have the required permissions to create the resources needed as part of the solution.

Additionally, create the S3 input and output buckets to capture the data, and upload the input data into the input bucket.
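
If you prefer to script this setup, the following is a minimal boto3 sketch; the bucket names and file path are placeholders, and the Region and list of files should be adjusted to your environment:

import boto3

s3 = boto3.client("s3", region_name="us-east-1")

# Placeholder bucket names -- replace with globally unique names of your own
input_bucket = "my-databrew-dq-input"
output_bucket = "my-databrew-dq-output"

# Create the input and output buckets (Regions other than us-east-1 need a LocationConstraint)
for bucket in (input_bucket, output_bucket):
    s3.create_bucket(Bucket=bucket)

# Upload the unzipped patients file into the input bucket
s3.upload_file("100k_synthea_covid19_csv/patients.csv", input_bucket, "input/patients.csv")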

Create DataBrew datasets

To create a DataBrew dataset for the patient data, complete the following steps:

  1. On the DataBrew console, choose Datasets.
  2. Choose Connect new dataset.
  3. For Dataset name, enter a name (for this post, Patients).
  4. For Enter your source from S3, enter the S3 path of the patients input CSV.
  5. Choose Create dataset.

Repeat these steps to create datasets for other CSV files, such as encounters, conditions, and so on.
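
If you prefer the API over the console, the following is a minimal boto3 sketch that registers the same dataset; the bucket name and object key are placeholders carried over from the upload step above:

import boto3

databrew = boto3.client("databrew")

# Register the patients CSV in the input bucket as a DataBrew dataset
databrew.create_dataset(
    Name="Patients",
    Format="CSV",
    Input={
        "S3InputDefinition": {
            "Bucket": "my-databrew-dq-input",  # placeholder input bucket
            "Key": "input/patients.csv",       # placeholder object key
        }
    },
)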

Create a DataBrew project

To create a DataBrew project for the patient data, complete the following steps:

  1. On the DataBrew console, choose Projects.
  2. Choose Create project.
  3. For Project name, enter a name (for this post, patients-data-quality).
  4. For Select a dataset, choose My datasets.
  5. Select the Patients dataset.
  6. Under Permissions, for Role name, choose an AWS Identity and Access Management (IAM) role that allows DataBrew to read from your Amazon S3 input location.

You can choose a role if you already created one, or create a new one. For more information, see Adding an IAM role with data resource permissions.

  1. Wait until the dataset is loaded (about 1–2 minutes).
  2. To create a consistency check, choose the BIRTHDATE column.
  3. On the Create menu, choose Flag column.
  4. Under Create column, for Values to flag, choose Custom value.
  5. For Source column, choose BIRTHDATE.
  6. For Values to flag, enter the regular expression (?:(?:18|19|20)[0-9]{2}).
  7. For Flag values as, choose Yes or no.
  8. For Destination column, enter BIRTHDATE_flagged.

The new column BIRTHDATE_flagged now displays Yes for a valid four-digit year within BIRTHDATE.
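
To see what this check does, the short Python sketch below applies the same regular expression to a few sample values and prints the flag each row would receive (the sample values are made up for illustration):

import re

# Same expression used in the DataBrew flag column
year_pattern = re.compile(r"(?:(?:18|19|20)[0-9]{2})")

for birthdate in ["1987-06-21", "87-06-21", "2104-01-01", ""]:
    flag = "Yes" if year_pattern.search(birthdate) else "No"
    print(birthdate or "<missing>", "->", flag)

# Output:
# 1987-06-21 -> Yes
# 87-06-21 -> No
# 2104-01-01 -> No
# <missing> -> No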

  1. To create a completeness check, repeat the preceding steps to create a DRIVERS_FLAGGED column by choosing the DRIVERS column to mark missing values.
  2. To create an integrity check, choose the Join transformation.
  3. Choose the encounters dataset and choose Next.
  4. For Select join type, choose Left join.
  5. For Join keys, choose Id for Table A and Patient for Table B.
  6. Under Column list, deselect all columns from Table B except for Patient.
  7. Choose Finish.
  8. Choose the Patient column and create another flag column, PATIENT_flagged, to mark missing values from the Patient column.

For our use case, we created three new columns to demonstrate data quality checks for data quality dimensions in scope (consistency, completeness, and integrity), but you can integrate additional transformations on the same or additional columns as needed.

  1. After you finish applying all your transformations, choose Publish on the recipe.
  2. Enter a description of the recipe version and choose Publish.
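
You can also publish the recipe through the API. A minimal boto3 sketch follows; it assumes the recipe created for the patients-data-quality project keeps the console's default name of <project-name>-recipe, so adjust the name if yours differs:

import boto3

databrew = boto3.client("databrew")

# Publish the latest working version of the project's recipe
databrew.publish_recipe(
    Name="patients-data-quality-recipe",  # assumed default recipe name
    Description="Consistency, completeness, and integrity flags for patient data",
)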

Create a DataBrew job

Now that our recipe is ready, we can create a job for it, which runs on the weekly schedule described in the architecture overview.

  1. On the DataBrew console, choose Jobs.
  2. Choose Create job.
  3. For Job name, enter a name (for example, patient-data-quality).

Your recipe is already linked to the job.

  1. Under Job output settings, for File type, choose the final storage format (for this post, we choose CSV).
  2. For S3 location, enter the final S3 output bucket path.
  3. For Compression, choose the compression type you want to apply (for this post, we choose None).
  4. For Output file storage, choose Replace output files for each job run.

We choose this option because our use case is to publish a data quality score card for every new set of data files.

  1. Under Permissions, for Role name, choose your IAM role.
  2. Choose Create and run job.
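
As a scripted alternative, the following boto3 sketch creates the same recipe job, attaches the weekly schedule mentioned in the architecture overview, and starts a first run. The bucket, role ARN, and cron expression are placeholders to adapt to your account:

import boto3

databrew = boto3.client("databrew")

# Recipe job that writes uncompressed CSV to the output bucket, replacing previous output
databrew.create_recipe_job(
    Name="patient-data-quality",
    DatasetName="Patients",
    RecipeReference={"Name": "patients-data-quality-recipe"},  # published recipe
    RoleArn="arn:aws:iam::111122223333:role/DataBrewRole",     # placeholder role
    Outputs=[
        {
            "Location": {"Bucket": "my-databrew-dq-output", "Key": "blog_output/"},
            "Format": "CSV",
            "Overwrite": True,
        }
    ],
)

# Weekly schedule (every Monday at 06:00 UTC) for the job
databrew.create_schedule(
    Name="patient-data-quality-weekly",
    JobNames=["patient-data-quality"],
    CronExpression="Cron(0 6 ? * MON *)",
)

# Start an immediate run
databrew.start_job_run(Name="patient-data-quality")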

Create an Athena table

If you’re familiar with Apache Hive, creating tables on Athena will feel familiar. You can create tables by writing a DDL statement in the query editor, or by using the wizard or the JDBC driver. To use the query editor, enter the following DDL statement to create a table:

CREATE EXTERNAL TABLE `blog_output`(
  `id` string, `birthdate` string, `birthdate_flagged` string, `deathdate` string,
  `ssn` string, `drivers` string, `drivers_flagged` string, `passport` string,
  `prefix` string, `first` string, `last` string, `suffix` string, `maiden` string,
  `marital` string, `race` string, `ethnicity` string, `gender` string,
  `birthplace` string, `address` string, `city` string, `state` string,
  `county` string, `zip` bigint, `lat` double, `lon` double,
  `healthcare_expenses` double, `healthcare_coverage` double,
  `patient` string, `patient_flagged` string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION 's3://<your-bucket>/blog_output/';
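
Alternatively, as shown in step 4 of the architecture, you can let an AWS Glue crawler infer this table from the job output instead of writing the DDL by hand. The following is a minimal boto3 sketch; the crawler name, role, and S3 path are placeholders:

import boto3

glue = boto3.client("glue")

# Database that holds the crawled table (referenced by the views below)
glue.create_database(DatabaseInput={"Name": "databrew_blog"})

# Crawler pointed at the DataBrew job output prefix
glue.create_crawler(
    Name="blog-output-crawler",
    Role="arn:aws:iam::111122223333:role/GlueCrawlerRole",  # placeholder role
    DatabaseName="databrew_blog",
    Targets={"S3Targets": [{"Path": "s3://my-databrew-dq-output/blog_output/"}]},
)

glue.start_crawler(Name="blog-output-crawler")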

Let’s validate the table output in Athena by running a simple SELECT query. The following screenshot shows the output.

Create views to filter good and bad records (optional)

To create a good records view, enter the following code:

CREATE OR REPLACE VIEW good_records AS
SELECT * FROM "databrew_blog"."blog_output"
where birthdate_flagged = 'Yes' AND
drivers_flagged = 'No' AND
patient_flagged = 'No'

To create a bad records view, enter the following code:

CREATE OR REPLACE VIEW bad_records AS
SELECT * FROM "databrew_blog"."blog_output"
where birthdate_flagged = 'No' OR
drivers_flagged = 'Yes' OR patient_flagged = 'Yes'

Now you have the ability to query the good and bad records in Athena using these views.
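
You can also run these checks programmatically. The following boto3 sketch submits a query that counts good versus bad records; the database name and results location are placeholders:

import boto3

athena = boto3.client("athena")

query = """
SELECT 'good' AS bucket, count(*) AS records FROM good_records
UNION ALL
SELECT 'bad' AS bucket, count(*) AS records FROM bad_records
"""

# Results land in the given S3 prefix; retrieve them with get_query_results
response = athena.start_query_execution(
    QueryString=query,
    QueryExecutionContext={"Database": "databrew_blog"},
    ResultConfiguration={"OutputLocation": "s3://my-databrew-dq-output/athena-results/"},
)
print(response["QueryExecutionId"])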

Create a score card using QuickSight

Now let’s complete our final step of the architecture, which is creating a data quality score card through QuickSight by connecting to the Athena table.

  1. On the QuickSight console, choose Athena as your data source.
  2. For Data source name, enter a name.
  3. Choose Create data source.
  4. Choose your catalog and database.
  5. Select the table you have in Athena.
  6. Choose Select.

Now you have created a dataset.

To build the score card, you add calculated fields by editing the dataset blog_output.

  1. Locate your dataset.
  2. Choose Edit dataset.
  3. Choose Add calculated field.
  4. Add the field DQ_Flag with the value ifelse({birthdate_flagged} = 'No' OR {drivers_flagged} = 'Yes' OR {patient_flagged} = 'Yes', 'Invalid', 'Valid').

Similarly, add other calculated fields.

  1. Add the field % Birthdate Invalid Year with the value countIf({birthdate_flagged}, {birthdate_flagged} = 'No')/count({birthdate_flagged}).
  2. Add the field % Drivers Missing with the value countIf({drivers_flagged}, {drivers_flagged} = 'Yes')/count({drivers_flagged}).
  3. Add the field % Patients missing encounters with the value countIf({patient_flagged}, {patient_flagged} = 'Yes')/count({patient_flagged}).
  4. Add the field % Bad records with the value countIf({DQ_Flag}, {DQ_Flag} = 'Invalid')/count({DQ_Flag}).

Now we create the analysis blog_output_analysis.

  1. Change the format of the calculated fields to display the Percent format.
  2. Start adding visuals by choosing Add visual on the + Add menu.

Now you can create a quick report to visualize your data quality score card, as shown in the following screenshot.

If QuickSight is using SPICE storage, you need to refresh the dataset in QuickSight after you receive notification about the completion of the data refresh. If the QuickSight report is running an Athena query for every request, you might see a “table not found” error when data refresh is in progress. We recommend using SPICE storage to get better performance.
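
If you do use SPICE, you can trigger the dataset refresh after every weekly DataBrew run, either with a QuickSight refresh schedule or through the API. A minimal boto3 sketch follows; the account ID and dataset ID are placeholders:

import boto3
import uuid

quicksight = boto3.client("quicksight")

# Start a SPICE ingestion (refresh) for the blog_output dataset
quicksight.create_ingestion(
    AwsAccountId="111122223333",               # placeholder account ID
    DataSetId="<your-quicksight-dataset-id>",  # placeholder dataset ID
    IngestionId=str(uuid.uuid4()),             # unique ID for this refresh
)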

Clean up

To avoid incurring additional charges, delete the resources created during this walkthrough.

Conclusion

This post explains how to create a data quality score card using DataBrew, Athena queries, and QuickSight.

This gives you a great starting point for using this solution with your datasets and applying business rules to build a complete data quality framework to monitor issues within your datasets. We encourage you to use various built-in transformations to get the maximum value for your project.


About the Authors

Nitin Aggarwal is a Senior Solutions Architect at AWS, where he helps digital native customers architect data analytics solutions and provides technical guidance on various AWS services. He brings more than 16 years of experience in software engineering and architecture roles for various large-scale enterprises.

Gaurav Sharma is a Solutions Architect at AWS. He works with digital native business customers, providing architectural guidance on AWS services.

Vivek Kumar is a Solutions Architect at AWS. He works with digital native business customers, providing architectural guidance on AWS services.

Source: https://aws.amazon.com/blogs/big-data/build-a-data-quality-score-card-using-aws-glue-databrew-amazon-athena-and-amazon-quicksight/
