
Spark -- What exactly is the RDD everyone keeps talking about?

2020-12-06 11:39:51 osc_nk8pyo7o






This is the second article in our Spark series. Today let's look at a very important concept in Spark: the RDD.

In the last article we installed Spark locally. Even though we only have a local cluster, that doesn't stop us from experimenting. Spark's biggest strength is that the computation code stays the same no matter what resources the cluster has; Spark handles the distributed scheduling for us automatically.


RDD Concept


You can't introduce Spark without RDDs; they are a central part of the framework. Yet many beginners don't know what an RDD actually is. I was the same: before studying Spark systematically I had already written plenty of code, but my idea of an RDD was still foggy.

The full name of RDD is Resilient Distributed Dataset, which already becomes much clearer once spelled out. Even if you don't know the first word, it is at the very least a distributed dataset. The first word means resilient (or elastic), so a literal translation is "resilient distributed dataset". That still doesn't explain everything, but it is a lot clearer than just knowing the acronym.

An RDD is an immutable, distributed collection of objects. Every RDD is divided into partitions, and these partitions run on different nodes of the cluster.

Many materials offer only this shallow explanation; it sounds like a lot, but we don't really get much out of it, and on reflection plenty of questions remain. I finally found a detailed explanation in a blog post whose author had dug through the Spark source code and found the definition of an RDD there. An RDD consists of the following:

  • A list of partitions

  • A function for computing each split

  • A list of dependencies on other RDDs

  • Optionally, a Partitioner for key-value RDDs (e.g. to say that the RDD is hash-partitioned)

  • Optionally, a list of preferred locations to compute each split on (e.g. block locations for an HDFS file)

Let's go through them one by one:

  1. A list of partitions. A partition is the smallest unit of a dataset in Spark. In other words, Spark stores data partition by partition, and different partitions live on different nodes. This is also the foundation of distributed computing.

  2. A function for computing each partition. In Spark, the data and the operations performed on it are kept separate, and Spark evaluates lazily: until an action that actually triggers computation appears, Spark only records which computations are to be performed on which data. This mapping between data and computation is stored in the RDD.

  3. A list of dependencies on other RDDs. RDDs are related by transformations: one RDD can be turned into another RDD by a transformation operation, and these transformations are recorded. When part of the data is lost, Spark can recompute just the lost portion from the recorded dependencies instead of recomputing everything.

  4. Optionally, a Partitioner for key-value RDDs, i.e. the function that decides which partition each record goes to. Spark supports hash-based partitioning and range-based partitioning (a short sketch of this appears below).

  5. Optionally, a list of preferred locations, i.e. the preferred node on which to compute each partition (for example, the block locations of an HDFS file).

These five points reveal an important idea in Spark: moving data is worse than moving computation. When Spark schedules work, it prefers to ship the computation to the nodes that hold the data rather than pull the data to one node to compute on it. RDDs are designed around exactly this idea.
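
To make the fourth point a bit more concrete, here is a minimal sketch of hash-partitioning a key-value RDD in PySpark. The keys and the partition count are invented for illustration, and sc is the SparkContext entry point introduced in the next section:

# A hypothetical key-value RDD, hash-partitioned into 4 partitions.
pairs = sc.parallelize([('a', 1), ('b', 2), ('a', 3), ('c', 4)])
hashed = pairs.partitionBy(4)         # partitionBy uses hash partitioning by default
print(hashed.getNumPartitions())      # -> 4

Range partitioning, by contrast, is what sortByKey relies on to keep nearby keys in the same partition.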


Creating an RDD


Spark offers two ways to create an RDD: reading an external dataset, or parallelizing a collection that is already in memory.

Let's look at them one by one. The easiest, of course, is parallelizing, because it needs no external dataset and can be done directly.

Before that, let's look at the concept of a SparkContext. The SparkContext is the entry point of a whole Spark application, roughly the equivalent of a program's main function. When we start the Spark shell, Spark has already created a SparkContext instance for us, named sc, which we can use directly.
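
The sc variable only appears automatically in the interactive shell. In a standalone script we would have to create it ourselves; here is a minimal sketch, with the application name and master URL chosen purely for a local experiment:

from pyspark import SparkConf, SparkContext

# App name and master URL are placeholders for a local run.
conf = SparkConf().setAppName('rdd-demo').setMaster('local[*]')
sc = SparkContext(conf=conf)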


Creating an RDD also goes through sc. For example, to create an RDD of strings:

texts = sc.parallelize(['now test', 'spark rdd'])

The texts we get back is an RDD:

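We can also peek at how the small list was split up. A minimal sketch follows; the partition count depends on the parallelism of the local master, so the exact numbers may differ:

print(texts.getNumPartitions())   # how many partitions the two strings were split across
print(texts.glom().collect())     # the elements grouped by partition (collect is an action, covered below)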

Besides parallelize, we can also generate an RDD from external data. For example, to read data in from a file, we can use the textFile method on sc:

text = sc.textFile('/path/path/data.txt')

Generally speaking, outside of local debugging we rarely use parallelize to create RDDs, because it requires the data to be loaded into memory first. Memory limits make it hard to bring Spark's real strengths into play that way.


Transformations and Actions


When introducing RDDs above we already mentioned that an RDD supports two kinds of operations: transformations and actions.

As the name suggests, a transformation turns one RDD into another RDD. The RDD records the transformation we applied but does not execute it, so what we get back is still an RDD rather than an execution result.

For example, after creating the texts RDD, suppose we want to filter its contents and keep only the strings longer than 8 characters. We can use filter to do the transformation:

textAfterFilter = texts.filter(lambda x: len(x) > 8)

The call returns another RDD. As we just said, because filter is a transformation, Spark merely records it and does not actually run it.

Transformations can produce any number of RDDs. For example, running the following code produces four RDDs:

inputRDD = sc.textFile('path/path/log.txt')           # read the log file
lengthRDD = inputRDD.filter(lambda x: len(x) > 10)    # keep lines longer than 10 characters
errorRDD = inputRDD.filter(lambda x: 'error' in x)    # keep lines containing 'error'
unionRDD = errorRDD.union(lengthRDD)                  # merge the two filtered RDDs


The final union combines the results of the two RDDs. If we execute the code above, Spark records the information about these RDDs. Drawing this dependency information out gives us a dependency (lineage) graph:

(Figure: the lineage graph, with inputRDD feeding lengthRDD and errorRDD through filter, and those two merging into unionRDD through union.)
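
We don't have to draw the graph by hand: every RDD can describe its own lineage through toDebugString. A minimal sketch; the exact formatting of the output, and whether it comes back as bytes or a string, varies with the PySpark version:

lineage = unionRDD.toDebugString()
# Some PySpark versions return bytes rather than str, so decode if necessary.
if isinstance(lineage, bytes):
    lineage = lineage.decode('utf-8')
print(lineage)   # shows unionRDD and the filter / textFile RDDs it depends on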

No matter how many transformations we chain, Spark does not actually execute them. Only when we perform an action does Spark run the recorded transformations. first(), take(), and count() are all actions; at that point Spark actually computes and hands us the results.


Among them, first returns the first element; take needs a parameter specifying how many elements to return; count returns the number of elements. Just as we expect, when we call these, Spark returns the results to us.
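
Putting it together with the unionRDD from the example above, here is a minimal sketch of calling these actions; the contents of the log file are made up, so the real outputs will of course differ:

print(unionRDD.count())    # number of lines kept by either filter (union does not deduplicate)
print(unionRDD.first())    # the first such line
print(unionRDD.take(3))    # a list of up to three such lines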

This article has focused on the concept of RDDs. The next article will take a deeper look at transformations and actions. Interested readers, stay tuned!



Copyright notice
This article was written by [osc_nk8pyo7o]. Please include a link to the original when reposting. Thanks.
https://chowdera.com/2020/12/20201206113446172b.html