Data Engineering Technologies 2021
Emerging technologies supporting the field of data engineering are growing at a rapid clip. This curated list includes the most important offerings available in 2021.
By Tech Ninja, @techninjathere, OpenSource, Analytics & Cloud enthusiast.
A partial list of top engineering technologies, image created by KDnuggets.
Complete curated list of emerging technologies in Data Engineering
- Abacus AI, enterprise AI with AutoML, similar space to DataRobot.
- Algorithmia, enterprise MLOps.
- Amundsen, an open-sourced data discovery and metadata engine.
- Anodot, monitors all your data in real-time for lightning-fast detection of incidents.
- Apache Arrow, essential for its non-JVM, in-memory columnar format and vectorized processing.
- Apache Calcite, framework for building SQL databases and data management systems without owning data. Hive, Flink, and others use Calcite.
- Apache Hop, facilitates all aspects of data and metadata orchestration.
- Apache Iceberg is an open table format for massive analytic datasets.
- Apache Pinot, real-time distributed OLAP datastore. Its growth is impressive, and it sits in a similar space to Druid, though the two are not identical.
- Apache Superset, open source BI with many connectors available.
- Apache Beam, for implementing batch and streaming data processing jobs that run on any supported execution engine.
- Cnvrg, enterprise MLOps.
- Confluent, Apache Kafka and its surrounding ecosystem.
- Dagster, a data orchestrator for machine learning, very programming-based and in a similar space to Airflow, but emphasizes state flow.
- Dask, scalable Data Science purely in Python.
- DataRobot, solid ML platform with a strong focus in enterprise MLOps.
- Databricks, with its new SQL Analytics and lakehouse paper; expect more amazing OSS.
- DataFrame Whale is a straightforward data discovery tool.
- Dataiku, enterprise AI/MLOps platform.
- Delta Lake, ACID on Apache Spark.
- DVC, open-source version control system for ML projects and well suited to MLOps.
- Feast, open-source feature store, now with Tecton.
- Fiddler, enterprise explainable AI.
- Fivetran, data integration pipeline.
- Getdbt (dbt), hitting the sweet spot of Apache Spark by bringing a simplified SQL-based pipeline.
- Great Expectations, Data Science testing framework; it’s already amazing!
- Hopsworks, open-sourced MLOps feature store.
- Hudi brings transactions, record-level updates/deletes, and change streams to data lakes.
- Koalas, Pandas on Apache Spark.
- The Kubeflow project is dedicated to making machine learning workflows on Kubernetes simple, portable, and scalable.
- lakeFS enables you to manage your data lake the way you manage your code. Run parallel pipelines for experimentation and CI/CD for your data.
- maiot-ZenML, open-sourced MLOps framework with a bit of everything.
- Marquez, open-source metadata with a fantastic UI.
- Metabase, an open-source BI with excellent visualization.
- MLflow, a machine learning platform.
- Monte Carlo (montecarlodata), data governance, data discovery, and data observability.
- Nextflow, data-driven computational pipelines designed for BioInformatics, but can go beyond.
- Pachyderm, MLOps platform, in the space of MLflow.
- Papermill, parameterizes notebooks, making Data Science more exciting and more accessible.
- Prefect, designed to make workflow management easier than with Apache Airflow.
- RAPIDS, Data Science on GPUs.
- Ray, distributed machine learning and now streaming.
- Starburst, unlock the value of distributed data by making it fast and easy to access.
- Tecton, enterprise feature store.
- Trino, formerly PrestoSQL. Now clearly separated from Presto, Trino can focus heavily on features.
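Several entries above (Apache Arrow in particular) rest on the idea of a columnar in-memory layout. The following is a minimal sketch of that idea using only the Python standard library; it does not use Arrow's actual API or memory format, and the sample records, field names, and helper functions are all hypothetical, invented for illustration.

```python
from array import array

# Hypothetical sample records, invented for illustration only.
rows = [
    {"user_id": 1, "amount": 9.5},
    {"user_id": 2, "amount": 20.0},
    {"user_id": 3, "amount": 0.5},
]


def to_columns(records):
    """Pivot row-oriented records into one typed, contiguous array per field."""
    return {
        "user_id": array("q", (r["user_id"] for r in records)),  # signed 64-bit ints
        "amount": array("d", (r["amount"] for r in records)),    # 64-bit floats
    }


def total_amount_rows(records):
    """Row-oriented scan: visits every record object to read a single field."""
    return sum(r["amount"] for r in records)


def total_amount_columns(columns):
    """Column-oriented scan: reads one contiguous buffer and nothing else."""
    return sum(columns["amount"])


columns = to_columns(rows)
print(total_amount_rows(rows), total_amount_columns(columns))
```

Both scans return the same total; the difference is that the columnar version keeps each field in a single typed buffer, which is the property that makes vectorized processing and cheap data exchange between tools practical.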
Reordered alphabetically, based on this original. Reposted with permission.
Source: https://www.kdnuggets.com/2021/09/data-engineering-technologies-2021.html