Data governance matters. Traditional data governance, managed through documents, can no longer meet the needs of big data, so governance suited to the Hadoop big data ecosystem has become very important.
Data governance under big data is a huge problem for many enterprises. Few ready-made solutions can be found, but in recent years many companies have built their own and open-sourced them. This article analyzes these data discovery platforms in detail; abroad, there are more than ten implementations.
Problems a data discovery platform can solve
Why is a data discovery platform needed?
In the process of data governance, we often run into these questions: Where is the data stored? How should this data be used? What is the data for? How is the data created? How is the data updated?
The purpose of a data discovery platform is to answer these questions and help people find, understand, and use data more effectively.
For example, Facebook's Nemo uses full-text retrieval, so target data can be found quickly.
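To make the idea of full-text retrieval over a catalog concrete, here is a minimal sketch using an inverted index. It is not how Nemo is implemented; the table names, descriptions, and the function names `tokenize`, `build_index`, and `search` are all made up for illustration.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(tables):
    """Map each token to the set of table names whose name or description contains it."""
    index = defaultdict(set)
    for name, description in tables.items():
        for token in tokenize(name + " " + description):
            index[token].add(name)
    return index


def search(index, query):
    """Return the tables that contain every token of the query, sorted by name."""
    tokens = tokenize(query)
    if not tokens:
        return []
    matches = set.intersection(*(index.get(t, set()) for t in tokens))
    return sorted(matches)


# Hypothetical catalog entries, for illustration only.
TABLES = {
    "dw.orders": "One row per customer order, loaded daily",
    "dw.order_items": "Line items for each order",
    "dw.users": "Registered customer accounts",
}

INDEX = build_index(TABLES)

print(search(INDEX, "customer"))
print(search(INDEX, "customer order"))
```

A real platform would typically delegate this to a search engine and add ranking, but the lookup idea is the same: tokens point to tables, and a query intersects those sets.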
When a user browses a data table, how can they quickly understand the data? The usual approach is to show the column names, data types, and descriptions; if the user has permission, they can also preview the data.
Amundsen, for example, offers this kind of column display.
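The table page described above can be sketched as a small data structure plus a render function. This is only an illustration under assumed names: `Column`, `TableInfo`, `render`, and the sample table are hypothetical and are not taken from Amundsen.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Column:
    name: str
    data_type: str
    description: str = ""


@dataclass
class TableInfo:
    name: str
    columns: List[Column]
    sample_rows: List[tuple] = field(default_factory=list)


def render(table, can_preview=False, max_rows=3):
    """Return the text of a table page: columns always, preview only if permitted."""
    lines = [f"Table: {table.name}"]
    for col in table.columns:
        lines.append(f"  {col.name} ({col.data_type}): {col.description}")
    if can_preview:
        lines.append("Preview:")
        for row in table.sample_rows[:max_rows]:
            lines.append("  " + ", ".join(str(value) for value in row))
    return "\n".join(lines)


# Hypothetical table, for illustration only.
ORDERS = TableInfo(
    name="dw.orders",
    columns=[
        Column("order_id", "bigint", "Unique order identifier"),
        Column("user_id", "bigint", "Customer who placed the order"),
        Column("amount", "decimal(10,2)", "Order total"),
    ],
    sample_rows=[(1001, 42, 19.99), (1002, 7, 5.5)],
)

print(render(ORDERS))
print(render(ORDERS, can_preview=True))
```

Keeping the permission check in one place (`can_preview`) means the column metadata stays visible to everyone while row data is shown only to authorized users.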
ETL is a big problem for data, and showing it clearly is especially hard. In practice, the ETL of data can be represented as a data flow chart. Many platforms support this feature, such as Databook and Metacat.
Amundsen also works very well with the Airflow scheduling platform.
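The flow-chart view of ETL mentioned above is essentially a directed graph of tables. The sketch below shows one way to store and walk such a graph; the table names and the helpers `all_upstream` and `downstream_of` are hypothetical and do not come from any of the platforms discussed.

```python
from collections import deque

# Hypothetical lineage edges: each table maps to the tables it is built from.
UPSTREAM = {
    "report.daily_revenue": ["dw.orders", "dw.users"],
    "dw.orders": ["raw.orders"],
    "dw.users": ["raw.users"],
    "raw.orders": [],
    "raw.users": [],
}


def all_upstream(table, upstream=UPSTREAM):
    """Breadth-first walk returning every table that feeds into `table`."""
    seen = []
    queue = deque(upstream.get(table, []))
    while queue:
        current = queue.popleft()
        if current in seen:
            continue  # already visited through another path
        seen.append(current)
        queue.extend(upstream.get(current, []))
    return seen


def downstream_of(table, upstream=UPSTREAM):
    """Return the tables built directly from `table`, sorted by name."""
    return sorted(t for t, sources in upstream.items() if table in sources)


print(all_upstream("report.daily_revenue"))
print(downstream_of("raw.orders"))
```

Walking upstream answers "how was this data created?", while walking downstream shows what would be affected if a source table changed.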
Data discovery platform comparison
The following table compares the major platforms' support for the features above.
Columns compared: search, recommendation, table description, data preview, statistics, space usage metrics, permissions, ranking, data lineage, change notification, open source, documentation, supported data sources.
|Platform|Features marked Todo|Supported data sources|
|Amundsen (Lyft)|1|Hive, Redshift, Druid, RDBMS, Presto, Snowflake, etc.|
|Datahub (LinkedIn)||Hive, Kafka, RDBMS|
|Metacat (Netflix)|2|Hive, RDS, Teradata, Redshift, S3, Cassandra|
|Atlas (Apache)||HBase, Hive, Sqoop, Kafka, Storm|
|Marquez (WeWork)||S3, Kafka|
|Databook (Uber)||Hive, Vertica, MySQL, Postgres, Cassandra|
|Data Access Layer (Twitter)||HDFS, Vertica, MySQL|
Here are the five open source solutions.
Datahub was open-sourced by LinkedIn and was originally called WhereHows. After a period of development, Datahub was released on GitHub in February 2020.
It is a very active project, offering table schemas, search, data lineage, and other features, as well as users and groups.
Official documentation is also available. The open source version supports metadata from Hive, Kafka, and relational databases.
As a result, Datahub is widely used.
Lyft developed Amundsen in April 2019 and open-sourced it in October.
Amundsen provides search and ranking features that make data tables easier to find.
It supports a rich set of data sources, more than 15 including Hive and Druid. It also integrates with the Airflow task scheduler and describes how to integrate with BI tools such as Superset.
Data lineage is also under development.
Netflix open-sourced Metacat in June 2018.
Metacat supports integration with Hive, Teradata, Redshift, S3, Cassandra, and RDS.
However, although Metacat is open source, it has no official documentation, and very little information about it is available.
WeWork open-sourced Marquez in October 2018.
Marquez also has good support for Airflow.
Marquez is still being updated and is worth keeping an eye on.
As part of a data governance initiative, Atlas began incubation at Hortonworks in July 2015.
Atlas 1.0 was released in June 2018, and the current version is 2.1.
Atlas's main goal is data governance, and it supports integration with HBase, Hive, and Kafka.
How to choose
First, my own choice: although I was very interested in Datahub and Amundsen, I finally chose Atlas.
Open source status, documentation richness, and features are compared in detail in the table above; which one to choose still depends on your actual situation.
Five are open source: Amundsen, Datahub, Metacat, Marquez, Atlas
Three have documentation: Amundsen, Datahub, Atlas
Strong search: Amundsen
Data lineage: Datahub, Atlas
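The summary above can be turned into a small shortlist helper. This is only a sketch: the trait names and the `shortlist` function are made up here, and `False` simply means the platform is not listed under that trait in the summary, not that the feature is necessarily absent.

```python
# Traits taken from the summary above. False means "not listed under this trait
# in the summary", which is an assumption, not a statement about the product.
PLATFORMS = {
    "Amundsen": {"open_source": True, "documentation": True, "strong_search": True, "lineage": False},
    "Datahub": {"open_source": True, "documentation": True, "strong_search": False, "lineage": True},
    "Metacat": {"open_source": True, "documentation": False, "strong_search": False, "lineage": False},
    "Marquez": {"open_source": True, "documentation": False, "strong_search": False, "lineage": False},
    "Atlas": {"open_source": True, "documentation": True, "strong_search": False, "lineage": True},
}


def shortlist(required, platforms=PLATFORMS):
    """Return the platforms that have every required trait, sorted by name."""
    return sorted(
        name
        for name, traits in platforms.items()
        if all(traits.get(trait, False) for trait in required)
    )


print(shortlist(["documentation", "lineage"]))
print(shortlist(["strong_search"]))
```

Listing your own must-have traits first and filtering on them keeps the decision tied to your actual situation rather than to the longest feature list.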
Considering project timelines, implementation effort, and so on, I suggest starting with Atlas as your entry point into data governance.
Of course, some companies have adopted both Atlas and Amundsen, with Atlas handling metadata management and Amundsen's powerful search handling data discovery. That is also a good choice.
In the future, "Real time streaming" will publish a series of articles on Atlas 2.1 deployment and practice, opening the door to data governance.