
Atlas vs DataHub vs Amundsen

2020-11-11 08:12:59 Dugu Feng

Data governance matters. Traditional data governance, managed through documents, can no longer keep up with the demands of big data, so governance that fits the Hadoop big data ecosystem has become very important.

Data governance under big data is a huge problem for many enterprises, and for a long time few solutions were available. In recent years, however, many companies have built their own platforms and open-sourced them; more than ten such implementations exist outside China. This article looks at these data discovery platforms in detail.

Problems a data discovery platform can solve

Why is a data discovery platform needed?

During data governance we keep running into the same questions: Where does the data live? How should it be used? What is it for? How was it created? How is it updated? And so on.

A data discovery platform exists to answer these questions and to help people find, understand, and use data.

For example, Facebook's Nemo uses full-text search so that users can quickly find the data they are looking for.
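To make the idea of full-text search over a catalog concrete, here is a minimal Python sketch of an inverted index built from table names and descriptions. It is not how Nemo or any of the platforms below is implemented; the function names and the sample tables are hypothetical and exist only for illustration.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(tables):
    """Map each token to the set of table names whose name or description contains it."""
    index = defaultdict(set)
    for name, description in tables.items():
        for token in tokenize(name + " " + description):
            index[token].add(name)
    return index


def search(index, query):
    """Return the sorted table names that contain every token of the query."""
    tokens = tokenize(query)
    if not tokens:
        return []
    matches = set.intersection(*(index.get(t, set()) for t in tokens))
    return sorted(matches)


# Hypothetical catalog entries, made up for illustration.
tables = {
    "orders_daily": "Daily aggregated customer orders",
    "user_profile": "One row per registered user",
    "orders_raw": "Raw order events from Kafka",
}
index = build_index(tables)
```

Calling `search(index, "daily orders")` returns only the tables that contain both words. A real platform would typically delegate this to a dedicated search engine and add ranking on top.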

When a user browses a table, how can they understand the data quickly? The usual approach is to show the column names, data types, and descriptions; if the user has permission, they can also preview the data.

Amundsen, for example, offers this kind of column view.
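The column view described above boils down to a small metadata record per table. The following Python sketch shows one possible shape for it; the class and field names are assumptions chosen for illustration and do not correspond to Amundsen's actual data model.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Column:
    """One column of a table: name, data type, and a human-readable description."""
    name: str
    data_type: str
    description: str = ""


@dataclass
class TableMetadata:
    """A table together with its columns (hypothetical structure)."""
    name: str
    columns: List[Column] = field(default_factory=list)

    def describe(self) -> str:
        """Render the table as plain text, one line per column."""
        lines = [f"Table: {self.name}"]
        for col in self.columns:
            lines.append(f"  {col.name} ({col.data_type}): {col.description}")
        return "\n".join(lines)


# Hypothetical table, made up for illustration.
orders = TableMetadata(
    name="orders_daily",
    columns=[
        Column("order_date", "date", "Day the orders were placed"),
        Column("order_count", "bigint", "Number of orders on that day"),
    ],
)
```

`print(orders.describe())` lists each column with its type and description, which is the minimum a user needs to judge whether a table is relevant.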

Presenting ETL is a big problem, and showing it clearly is especially hard. In practice, an ETL process can be represented as a data flow graph. Many platforms support this, for example Databook and Metacat.
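A data flow graph of this kind can be sketched as a directed graph in which each ETL job links its input datasets to its output dataset. The Python below is a minimal illustration under that assumption; the class, its methods, and the dataset names are hypothetical and are not taken from any of the platforms mentioned.

```python
from collections import defaultdict, deque


class LineageGraph:
    """Directed graph of datasets; an edge points from a source to a derived dataset."""

    def __init__(self):
        # dataset name -> set of datasets it is directly built from
        self.upstream = defaultdict(set)

    def add_job(self, inputs, output):
        """Record that an ETL job reads `inputs` and writes `output`."""
        for src in inputs:
            self.upstream[output].add(src)

    def all_upstream(self, dataset):
        """Return every direct and indirect source of `dataset`, sorted by name."""
        seen = set()
        queue = deque([dataset])
        while queue:
            current = queue.popleft()
            for src in self.upstream.get(current, ()):
                if src not in seen:
                    seen.add(src)
                    queue.append(src)
        return sorted(seen)


# Hypothetical two-step pipeline, made up for illustration.
lineage = LineageGraph()
lineage.add_job(["raw_orders", "raw_users"], "orders_enriched")
lineage.add_job(["orders_enriched"], "orders_daily")
```

`lineage.all_upstream("orders_daily")` walks the graph breadth-first and collects every dataset the table ultimately depends on, which is the information a lineage view draws as a flow chart.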

Amundsen also integrates very well with the scheduling platform Airflow.

Data discovery platform comparison

The table below compares how each major platform supports the features above.

Search | Recommendations | Table description | Data preview | Statistics | Space usage metrics | Permissions | Ranking | Data lineage | Change notifications | Open source | Documentation | Supported data sources
Amundsen (Lyft) Todo Hive, Redshift, Druid, RDBMS, Presto, Snowflake, etc.
Datahub (LinkedIn) Hive, Kafka, RDBMS
Metacat (Netflix) Todo Todo Hive, RDS, Teradata, Redshift, S3, Cassandra
Atlas (Apache) HBase, Hive, Sqoop, Kafka, Storm
Marquez (WeWork) S3, Kafka
Databook (Uber) Hive, Vertica, MySQL, Postgres, Cassandra
Dataportal (Airbnb) Unknown
Data Access Layer (Twitter) HDFS, Vertica, MySQL
Lexikon (Spotify) Unknown
The five open-source solutions are described below.

DataHub (LinkedIn)

DataHub is open-sourced by LinkedIn; its predecessor was called WhereHows. After a period of development, DataHub was released as open source on GitHub in February 2020.

https://github.com/linkedin/datahub

It is a very active project. It offers table schemas, search, data lineage, and other features, along with user and group management.

Official documentation is available. The open-source version supports metadata from Hive, Kafka, and relational databases.

As a result, DataHub is quite widely used.

Amundsen (Lyft)

Lyft developed Amundsen in April 2019 and open-sourced it in October of that year.

https://github.com/amundsen-io/amundsen

Amundsen provides search and ranking features that make it easier to find tables.

It supports a rich set of data sources, more than 15 in total, including Hive and Druid. It also integrates with the task scheduler Airflow and provides ways to integrate with BI tools such as Superset.

Data lineage is also under development.

Metacat (Netflix)

Netflix open-sourced Metacat in June 2018.

Metacat supports integration with Hive, Teradata, Redshift, S3, Cassandra, and RDS.

However, although Metacat is open source, it has no official documentation, and very little information about it is available.

Marquez (WeWork)

WeWork open-sourced Marquez in October 2018.

Marquez also has good support for Airflow.

Marquez is still being updated and is worth keeping an eye on.

Apache Atlas (Hortonworks)

As part of a data governance initiative, Atlas began incubation at Hortonworks in July 2015.

Atlas 1.0 was released in June 2018; the current version is 2.1.

The main goal of Atlas is data governance. It supports integration with HBase, Hive, and Kafka.

GitHub address:

https://github.com/apache/atlas

The documentation is extensive.

How to choose

First, the author's own choice: although I was very interested in DataHub and Amundsen, I ended up choosing Atlas.

Open-source status, documentation, and features are compared in detail in the table above; the right choice still depends on your actual situation.

Open source (five): Amundsen, DataHub, Metacat, Marquez, Atlas

With documentation (three): Amundsen, DataHub, Atlas

Strong search: Amundsen

Data lineage: DataHub, Atlas
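The summary above can be turned into a simple shortlist filter. The Python sketch below encodes only the four criteria listed in this article, as stated by the author; the feature keys and the `shortlist` function are hypothetical names chosen for illustration.

```python
# Feature support as summarized in this article (not an exhaustive or current survey).
FEATURES = {
    "open_source": {"Amundsen", "DataHub", "Metacat", "Marquez", "Atlas"},
    "documentation": {"Amundsen", "DataHub", "Atlas"},
    "strong_search": {"Amundsen"},
    "data_lineage": {"DataHub", "Atlas"},
}


def shortlist(required):
    """Return the open-source platforms that have every required feature, sorted by name."""
    candidates = set(FEATURES["open_source"])
    for feature in required:
        candidates &= FEATURES[feature]
    return sorted(candidates)
```

For instance, `shortlist(["documentation", "data_lineage"])` keeps only the platforms that appear in both lists, while requiring strong search and lineage at the same time leaves no single platform, which matches the suggestion of combining two tools.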

Considering project timelines, implementation effort, and similar factors, I suggest starting with Atlas as your entry point into data governance.

Of course, some companies use Atlas and Amundsen together: Atlas handles metadata management, while Amundsen's powerful search handles data discovery. That is also a good choice.

