
Spark common errors and problem solving methods

2022-09-23 08:42:37 · Book Recalls Jiangnan

For how to read Spark logs and locate the causes of errors, see: https://blog.csdn.net/qq_33588730/article/details/109353336

1. org.apache.spark.SparkException: Kryo serialization failed: Buffer overflow

Cause: the Kryo serialization buffer ran out of space.

Solution: add the parameter --conf spark.kryoserializer.buffer.max=2047m.
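As a hedged sketch (the class and jar names are placeholders, not from the original), the submit command with this parameter looks like:

```shell
# Raise Kryo's max serialization buffer; the hard upper limit is just under 2048m.
spark-submit \
  --conf spark.kryoserializer.buffer.max=2047m \
  --class com.example.MyJob \
  my-job.jar
```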

2. org.elasticsearch.hadoop.rest.EsHadoopNoNodesLeftException: Connection error

Cause: es.port is probably set to 9300. Apart from the Java client, which connects to an ElasticSearch cluster over TCP, clients in other languages basically use HTTP; the default TCP port of the ES client is 9300, while the default HTTP port is 9200, and elasticsearch-hadoop connects to the ES cluster over HTTP.

Solution: set es.port to 9200.

3. Error in query: nondeterministic expressions are only allowed in Project, Filter, Aggregate or Window, found

Solution: in a SparkSQL script, nondeterministic functions such as rand() must not appear after join ... on.
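A minimal sketch of the workaround (the table and column names are hypothetical): materialize the rand() value in a subquery first, so the join condition itself stays deterministic:

```sql
-- Rejected: nondeterministic expression in the join condition
-- select * from table_a a join table_b b on a.id = b.id and rand() < 0.5;

-- Accepted: compute rand() in a project first, then filter after the join
select a.id, b.val
from (select id, rand() as r from table_a) a
join table_b b
  on a.id = b.id
where a.r < 0.5;
```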

4. The driver log repeatedly prints: Application report for application_xxx_xxx (state: ACCEPTED)

Solution: on the "Scheduler" page in the left-hand menu of the yarn UI, find the yarn queue your task was submitted to and check whether its resources are exhausted; coordinate reasonable resource usage with colleagues in the same queue, and optimize tasks whose resource usage is unreasonable.

5. A Spark task with too much data (billions of records) cannot get through

Cause: the data volume is too large for executor memory to hold.

Solution: add the parameter --conf spark.shuffle.spill.numElementsForceSpillThreshold=2000000 so that excess data is spilled to disk.

6. user class threw exception: ml.dmlc.xgboost4j.java.XGBoostError: XGBoostModel trained failed, caused by Values to assemble cannot be null

Cause: the machine-learning training data contains nulls.

Solution: remove the null rows from the data, or preprocess the null values before training.

7. Caused by: org.apache.spark.sql.catalyst.parser.ParseException: Datatype void is not supported

Cause: Spark does not support the void column type in Hive tables. If the code creates a temporary Hive table and one of the fields selected from the source table is a null value, that field's type in the temporary table becomes void at create table time.

Solution: in that case, run the task with Hive instead, or change the Hive table's field type to something other than void, or cast the null to string, etc.

8. ERROR SparkUI: Failed to bind SparkUI java.net.BindException: Address already in use: Service failed after 16 retries

Cause: the Spark UI tried to bind 16 consecutive ports and all of them were occupied.

Solution: increase the spark.port.maxRetries parameter, e.g. to 128.

9. Error in Query: Cannot create a table having a column whose name contains commas in Hive metastore

Solution: check whether the SparkSQL script contains statements such as "round(t1.sim_score, 5)" that use a function result as a field value; leaving out the "as" alias after them causes this error.

10. Failed to send RPC to ...

Cause: memory allocation for a large amount of data exceeded the configured threshold and the container was killed by yarn; when other nodes then pull data from this node, they cannot connect.

Solution: optimize the code by splitting long chains of consecutive joins (more than 5): write the result of roughly every 3 consecutive joins to a temporary table, combine that temporary table with the next two or three joins to produce the next temporary table, and filter unneeded data out of the temporary tables as early as possible so redundant data takes no part in later computation. Only once the code logic and parameters are reasonable should resources such as --executor-memory and --driver-memory be increased as a last step; don't put the cart before the horse.

11. ERROR shuffle.RetryingBlockFetcher: Failed to fetch block shuffle_7_18444_7412, and will not retry

Cause: the executor was killed, so its blocks can no longer be pulled. This may be caused by data skew: the other executors have finished their work and been reclaimed, only the skewed executor is still working, and pulling data from a reclaimed executor may fail.

Solution: if there really is data skew, it can be handled with the methods in these two links: (1) http://www.jasongj.com/spark/skew/; (2) https://www.cnblogs.com/hd-zg/p/6089220.html.

You can also refer to the SQL-level handling methods around page 266 in Chapter 11 of 《SparkSQL内核剖析》, for example:

select /*+ BROADCAST (table1) */ * from table1 join table2 on table1.id = table2.id

-- Separating out the skewed data (when the skewed keys are known)
select * from table1_1 join table2 on table1_1.id = table2.id
union all
select /*+ BROADCAST (table1_2) */ * from table1_2 join table2 on table1_2.id = table2.id

select id, value, concat(id, (rand() * 10000) % 3) as new_id from A

select id, value, concat(id, suffix) as new_id
from (
      select id, value, suffix
      from B lateral view explode(array(0, 1, 2)) tmp as suffix
     ) t

select t1.id, t1.id_rand, t2.name
  from (
        select id ,
               case when id is null then concat('SkewData_', cast(rand() as string))
                    else id end as id_rand
          from test1
          where statis_date = '20200703') t1
  left join test2 t2
    on t1.id_rand = t2.id 

Also, moderately increasing the spark.sql.shuffle.partitions parameter to raise concurrency can alleviate data skew as well.
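The salting trick from the book excerpt above can be sketched in plain Python (the tables and the bucket count N are made-up toy data, not Spark APIs): salt the skewed side with a random suffix, explode the other side once per suffix, and the join result is unchanged while the hot key's work spreads over N buckets.

```python
import random
from collections import defaultdict

# Toy data: "hot" is the skewed key; 'right' plays the small dimension table.
left = [("hot", v) for v in range(6)] + [("cold", 99)]
right = [("hot", "H"), ("cold", "C")]
N = 3  # salt buckets per key, like concat(id, (rand() * 10000) % 3)

# Salt the skewed side with a random bucket suffix.
salted_left = [((k, random.randrange(N)), v) for k, v in left]
# Explode the other side once per bucket (the explode(array(0, 1, 2)) trick).
salted_right = [((k, s), w) for k, w in right for s in range(N)]

# Hash-join on the salted key; each (key, bucket) is a smaller unit of work.
index = defaultdict(list)
for key, w in salted_right:
    index[key].append(w)
joined = [(k, v, w) for (k, s), v in salted_left for w in index[(k, s)]]

# Same rows as the unsalted join, just computed in N-way smaller pieces.
plain = [(k, v, w) for k, v in left for rk, w in right if rk == k]
assert sorted(joined) == sorted(plain)
```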

12. org.apache.spark.memory.SparkOutOfMemoryError: Unable to acquire 65536 bytes of memory, got 0

Cause: OOM caused by unreasonable code logic or task parameters, data skew, and so on; it comes in two kinds, driver OOM and executor OOM.

Solution: (1) Check whether the code uses coalesce() and similar functions. Unlike repartition(), coalesce() does not shuffle, so handling large partitions with it easily causes OOM; if present, replace it with repartition() and minimize the use of coalesce().

(2) Check whether collect(), which pulls all data from the executors back to the driver, is used; avoid or use collect() and cache() with caution, and let the executors hold data and execute in a distributed way so that multiple nodes share the massive data load.

(3) Check the code for unreasonable logic such as more than 5 consecutive joins or multiple levels of nested for loops. Code parsing and the serialization of related objects both happen on the driver, and overly redundant, complex code that is not split up will cause driver OOM; see point 10 for how to optimize.

(4) Check whether an oversized table is being broadcast; set the spark.sql.adaptiveBroadcastJoinThreshold parameter reasonably (in bytes, default 10485760, i.e. 10MB). If the business logic is very complex, the number of scanned files and the data volume are huge, the task count is extremely large (say hundreds of thousands), and OOM still occurs no matter how small the broadcast threshold is set, set it to -1 to disable broadcast join.

(5) Check whether the ratio of --executor-cores to --executor-memory in the submit parameters is at least 1:4. All cores of an executor share that executor's memory; configurations like 2core4G easily OOM on large data volumes, so use at least 1core4G or 2core8G.

(6) OOM on individual executors is also one of the symptoms of data skew; if that is the case, refer to point 11.

(7) With reasonable code logic and parameters, the final step is to moderately increase resources, e.g. --executor-memory=8g and --driver-memory=4g.
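As a hedged sketch of point (5)'s core-to-memory ratio (the jar name and the counts are placeholders to be adapted to your queue):

```shell
# Keep at least 1 core : 4 GB per executor; all cores share one executor heap.
spark-submit \
  --num-executors 50 \
  --executor-cores 2 \
  --executor-memory 8g \
  --driver-memory 4g \
  my-job.jar
```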

13. Caused by: java.lang.ClassCastException: org.apache.hadoop.io.IntWritable cannot be cast to org.apache.hadoop.io.DoubleWritable

Cause: when the downstream Hive table selects from the upstream Hive table, a field with the same name has a different type in the two tables.

Solution: alter the downstream Hive table so the field type matches the upstream table: alter table xxx change <old_column_name> <new_column_name (may be unchanged)> <type>;

14. The driver log frequently shows "Full GC" or content such as "connection refused".

Cause: similar to the OOM in point 12; the driver is under heavy memory pressure and cannot keep up with work such as executor network-connection heartbeats.

Solution: (1) Check for code logic whose complexity has not been reduced by splitting (too many consecutive joins); refer to point 10 to split unreasonable code and filter excess data early in temporary tables.

(2) If the data volume really is huge and the code is not the problem, reduce SparkSQL's broadcast join small-table threshold or even disable the feature by adding set spark.sql.autoBroadcastJoinThreshold=2048000 or -1 (default 10M; adjust to the actual data volume).

(3) The driver also maintains the spark ui web page state for the application; too many small files lead to a large task count, which also puts great memory pressure on tracking job execution progress. This blog can help judge whether you have a small-file problem: https://blog.csdn.net/qq_33588730/article/details/109353336

(4) Once the code has been fixed and the parameters tuned reasonably, finally increase driver memory, e.g. --driver-memory=4g.

15. Caused by: org.apache.spark.SparkException: This RDD lacks a SparkContext. It could happen in the following cases:(1) RDD transformations and actions are NOT invoked by the driver, but inside of other transformations; for example, rdd1.map(x => rdd2.values.count() * x) is invalid because the values transformation and count action cannot be performed inside of the rdd1.map transformation. For more information, see SPARK-5063.(2) When a Spark Streaming job recovers from checkpoint, this exception will be hit if a reference to an RDD not defined by the streaming job is used in DStream operations. For more information, See SPARK-13758.

Cause: as the English message above indicates, nesting a transformation or action inside another RDD transformation makes the computation fail.

Solution: find the nested transformation or action near the line that reported the error, and move the nested logic out on its own.

16. removing executor 38 because it has been idle for 60 seconds

Cause: usually caused by data skew; with the AE (adaptive execution) feature enabled, executors that go idle are reclaimed.

Solution: similar to the solutions for points 11 and 12; the log phenomena of points 10, 11, 12, 16 and 17 may appear at the same time.

17. Container killed by YARN for exceeding memory limits. 12.4 GB of 11 GB physical memory used.

Cause: (1) Data skew: individual executors use far more memory than the limit.

(2) The task has too many small files plus a large data volume, so executor memory runs out.

(3) Unreasonable task parameters: too few executors concentrate the load on a small number of executors.

(4) Unreasonable code, e.g. logic such as repartition(1).

Solution: (1) For data skew, refer to points 11 and 12.

(2) For a very large data volume, refer to points 10 and 14.

(3) Check whether the task parameters are unreasonable, e.g. executor-memory set large while --num-executors is only a few dozen. Increase the executor count reasonably according to the cluster and the business volume; a quantitative rule of thumb is that one executor CPU core should process roughly one HDFS block of data (e.g. 128 or 256M) at a time. When --executor-cores and related parameters are not set, one executor contains one CPU core by default.

(4) Check the code for obviously unreasonable logic such as repartition(1).

(5) With reasonable code performance and logic, and reasonable parameters, increase resources; off-heap memory can be raised with --conf spark.yarn.executor.memoryOverhead=4096 (in MB; set according to the business volume).

18. Found unrecoverable error returned Bad Request - failed to parse; Bailing out

Cause: ES has a historical index that was not deleted.

Solution: delete the corresponding historical index in ES.

19. org.apache.spark.shuffle.FetchFailedException: Too large frame

Cause: during shuffle, the amount of data an executor pulls for a single partition exceeds the limit.

Solution: (1) Based on the business, determine whether excess data that should have been filtered out early in a temporary table is still participating in later, unnecessary computation.

(2) Determine whether there is data skew; if so, refer to points 11 and 12, or repartition sensibly with repartition() to avoid an oversized partition.

(3) Determine whether --num-executors, the executor count, is too low; raising concurrency reasonably keeps the data load from concentrating on a few executors and relieves the pressure.

20. Reading or writing a Hive table reports "Table or view not found", but the database and table exist in both the Hive metastore and on HDFS.

Solution: after ruling out causes such as the cluster connection address configuration, check whether the SparkSession is built with enableHiveSupport(); without it, Hive tables may not be recognized.

21. java.io.FileNotFoundException

Cause: (1) Besides the file genuinely not existing at the corresponding HDFS path, it may be that a view was created earlier whose data comes from the target table of the last insert, and a later insert overwrite into that target table selects from this view. Because Spark has lazy-execution mechanisms such as predicate pushdown, by the time the create view transformation actually executes, the earlier insert overwrite has already deleted the files of the target table; this amounts to querying yourself while writing to yourself, so the files to be read no longer exist.

(2) Due to Spark's in-memory caching, the files in the directory changed within a short time but the cached metadata was not synchronized in time and still says the files exist; Spark reads the cached file metadata first, and if it does not match the actual files in the directory, an error is raised.

Solution: (1) For the code logic above, skip create view and instead create a temporary table persisted to disk, then insert into the target table from that temporary table.

(2) Add refresh table <db>.<table> before the reading/writing code; this discards the cached metadata and reads the actual state of the files on disk.
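A minimal sketch of fix (2) (the database and table names are placeholders):

```sql
-- Drop Spark's cached file listing before touching the table again.
refresh table my_db.my_table;
insert overwrite table my_db.target_table
select * from my_db.my_table;
```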

22. org.apache.spark.shuffle.FetchFailedException: Connect from xxx closed

Cause: usually caused by data skew; the other executors finished and were reclaimed for inactivity, and the few heavily loaded executors cannot connect when they pull data from the reclaimed ones.

Solution: refer to point 11, add the parameters set spark.sql.adaptive.join.enabled=true and set spark.sql.adaptive.enabled=true, and set spark.sql.adaptiveBroadcastJoinThreshold, the broadcast join small-table size threshold, reasonably according to the business data volume.

23. Caused by: org.apache.hadoop.hbase.client.RetriesExhaustedException: Can't get the locations

Cause: if the address configuration and the like are correct, it is generally that the big-data platform limits the number of concurrent connections to the HBase component, so when a large number of SparkSQL tasks connect to HBase, some tasks' connections time out.

Solution: check whether the HBase connection configuration and properties in the code and the task are correct; if they are, consult the platform's HBase component developers directly.

24. A SparkSQL stage runs for a very long time even though it has few tasks, the data volume is small, and the code logic is just a simple join.

Cause: clicking into the stage's detail link shows that each task's shuffle read volume is small while its shuffle spill volume is much larger.

It can be concluded that the join operation may be producing a Cartesian product: both fields in join ... on contain many duplicate, non-unique values, which leads to this situation.

Solution: adding the parameter set spark.sql.adaptive.shuffle.targetPostShuffleInputSize=64000000 can alleviate the phenomenon; fundamentally, deduplicate the field values according to the business logic and keep duplicate field values out of the join.

25. ERROR RecoverableZooKeeper: ZooKeeper exists failed after 4 attempts baseZNode=/hbase Unable to set watcher on znode (/hbase/...)

Cause: the Spark task cannot connect to HBase. If it is not a problem with the connection parameters and properties configured in the task, the HBase component is limiting the number of concurrent connections.

Solution: refer to the solution for point 23.

26. Parquet record is malformed: empty fields are illegal, the field should be ommited completely instead


Solution: add logic to handle records whose key is null (e.g. convert the key to a random value, or simply discard those records), or use the ORC format.
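The null-key handling can be sketched in plain Python (the rows and the helper are hypothetical; in a real job the same logic would run inside a map over the DataFrame/RDD): either drop entries whose key is null, or rewrite the key to a random placeholder before writing Parquet.

```python
import random

# Toy map-column rows; Parquet rejects the None key in the first row.
rows = [{"a": 1, None: 2}, {"b": 3}]

def clean_keys(m, drop=True):
    """Drop None keys, or replace them with a random placeholder key."""
    if drop:
        return {k: v for k, v in m.items() if k is not None}
    return {k if k is not None else "k_%06d" % random.randrange(10**6): v
            for k, v in m.items()}

dropped = [clean_keys(r) for r in rows]
assert dropped == [{"a": 1}, {"b": 3}]

replaced = clean_keys(rows[0], drop=False)
assert None not in replaced and len(replaced) == 2
```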

27. java.io.IOException: Could not read footer for file

Cause: this error covers two cases: (1) the hive table's metadata says the table is in parquet format, but the files actually written into the corresponding directory are not parquet;

(2) the file being read is an empty file.

Solution: (1) If the file on HDFS turns out not to be in parquet format, rebuild the table in the matching format and move the files into the new table's directory, or fix the code configuration and rerun the task so the files are overwritten;

(2) if it is an empty file, simply delete it.

28. com.mysql.jdbc.exceptions.jdbc4.CommunicationsException:Communications link failure

Cause: the log of the failing executor shows Full GC; a full GC pauses all other threads, including the thread keeping the MySQL connection alive, and MySQL closes a connection after it is unresponsive for a while, so the connection fails.

Solution: refer to the solution for point 14.

29. java.util.concurrent.TimeoutException: Futures timed out after [300 seconds]

Cause: the data volume is fairly large and every executor is under heavy load, so they time out communicating with each other; sometimes accompanied by "Executor heartbeat timed out after xxx ms".

Solution: (1) Check for data skew, or whether the configured concurrency is too small; increase the spark.default.parallelism and spark.sql.shuffle.partitions parameters appropriately for the business volume, setting both to the same value.

(2) Look at the stack trace at the error location; if it shows a broadcast join call, set the spark.sql.autoBroadcastJoinThreshold parameter smaller (the value is in bytes), or set it to -1 to disable broadcast join.

(3) If the job still fails after the code is optimized and the previous parameters adjusted, you can also try increasing the spark.network.timeout parameter, e.g. to 600s, but not too far.

30. Error communicating with MapOutputTracker

Cause: the case encountered so far is short-term service unavailability caused by a NameNode active-standby failover.

Solution: check whether HDFS's NameNode is in the middle of an active-standby failover; once the failover has completed, rerun the task.

31. org.apache.spark.shuffle.FetchFailedException: Connection from xxx closed

Cause: the data volume is large and too many block fetch operations bring down the shuffle server, accompanied by task failures and retries within the stage; the corresponding task's executor also logs errors such as "ERROR shuffle-client-7-2 OneForOneBlockFetcher: Failed while starting block fetches" or "Unable to create executor due to Unable to register with external shuffle server due to : java.util.concurrent.TimeoutException: Timeout waiting for task.".

Solution: add the --conf spark.reducer.maxReqsInFlight=10 and --conf spark.reducer.maxBlocksInFlightPerAddress=10 parameters (choose concrete values according to the cluster and business volume) to limit block fetch concurrency, or port in the source of the remote shuffle service feature from Spark 3.
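As a hedged sketch (the jar name is a placeholder; tune the two limits to the cluster):

```shell
# Throttle concurrent shuffle fetches so the shuffle server isn't overwhelmed.
spark-submit \
  --conf spark.reducer.maxReqsInFlight=10 \
  --conf spark.reducer.maxBlocksInFlightPerAddress=10 \
  my-job.jar
```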

32. Error in query: org.apache.hadoop.hive.ql.metadata.Table.ValidationFailureSemanticException:Partition spec {statis_date=,STATIS_DATE=xxx} contains non-partition column

Cause: SparkSQL converts uppercase letters in SQL statements to lowercase; if the partition field name was uppercase when the table was created with hive, Spark errors after reading the uppercase partition field name.

Solution: change the hive table's partition field name to lowercase, or modify the spark source logic (AstBuilder.scala) to automatically convert uppercase partition field names to lowercase before processing them.

33. Error: Could not find or load main class org.apache.spark.deploy.yarn.ApplicationMaster

Cause: the Spark package of the matching version has not been uploaded to HDFS. After a Spark application is submitted from a client machine (with a full spark installation) to the cluster, the cluster's compute machines, which do not have the compute components installed, download the Spark jars the application needs from HDFS and then start running; if there is no spark directory on HDFS, the job fails because the required jars are missing.

Solution: use the hadoop fs -put command to upload the matching version's Spark directory under the corresponding parent directory on HDFS.

34. ERROR Driver ApplicationMaster: User class threw exception: org.apache.spark.sql.catalyst.analysis.NoSuchDatabaseException: Database 'xxx' not found

Cause: there are two cases: (1) the SparkSession object was built without .enableHiveSupport();

(2) starting with Spark 2, the SparkSession wrapper replaces SparkContext as the entry point of an application. Only one SparkSession object should be initialized, with SparkContext obtained via sparkSession.sparkContext; without initializing a SparkSession object, only one SparkContext object can be initialized. Here the likely problem is repeated initialization creating multiple SparkContexts.

Solution: (1) Add .enableHiveSupport() before .getOrCreate() where the SparkSession object is initialized.

(2) Delete the other SparkContext initialization statements in the code and obtain SparkContext only via sparkSession.sparkContext.

35. Caused by: java.lang.RuntimeException: Unsupported data type NullType.

Cause: when reading and writing data in Parquet format, if every value of a column in the table is null, then creating the table with a create table xxx using parquet as select ... statement makes the all-null column be inferred as the void type, and the statement errors; see this link: https://stackoverflow.com/questions/40194578/getting-java-lang-runtimeexception-unsupported-data-type-nulltype-when-turning

Solution: try to avoid a column being entirely null, e.g. by adding null-detection logic to the code and assigning random values; if the business requires the column to be all null, use a using orc as or stored as orc as statement when creating the table.
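A minimal sketch of a cast-based workaround (the table and column names are hypothetical): giving the all-null column an explicit type keeps Parquet from seeing NullType:

```sql
-- Without the cast, the all-null column is inferred as void/NullType.
create table t_out using parquet as
select id, cast(null as string) as note
from t_in;
```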

36. Failed to create local dir in /xxxx

Cause: usually the disk is full or damaged.

Solution: contact operations.


37. A single file larger than 1G cannot be read with wholeTextFiles

Cause: wholeTextFiles does not support reading a single file larger than 1G in one go, because it turns the entire file content into one Text object, and Text objects have a length limit.

Solution: split the single large file into multiple smaller files and read those.

38. Total size of serialized results of 2000 tasks (2048MB) is bigger than spark.maxResultSize (1024.0 MB)

Cause: the data that the tasks on the executors return to the driver exceeds the default limit.

Solution: increase the spark.driver.maxResultSize parameter appropriately, keeping it smaller than the --driver-memory value.

39. Caused by: java.util.concurrent.ExecutionException: java.lang.IndexOutOfBoundsException: Index: 0

Cause: the queried partition of the Hive table, or the corresponding HDFS directory, contains empty files.

Solution: add the parameter set spark.sql.hive.convertMetastoreOrc=true;.

40. Caused by: java.io.NotSerializableException: org.apache.kafka.clients.producer.KafkaProducer

Cause: the KafkaProducer object's initialization code runs on the driver; the driver serializes the related code objects and sends them to each executor, but the KafkaProducer object does not support serialization.

Solution: move the object's initialization from the driver to a place where the executors initialize it; for example, with foreachPartition() the KafkaProducer initialization code can be moved inside that function. See this link: https://stackoverflow.com/questions/40501046/spark-kafka-producer-serializable
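The failure mode can be demonstrated with Python's pickle (the lock is a stand-in for KafkaProducer, which likewise holds live sockets and threads; this is an analogy, not Kafka code): objects wrapping OS-level resources refuse to serialize, so they must be built on the executor side, e.g. inside the function passed to foreachPartition().

```python
import pickle
import threading

# Stand-in for a driver-side KafkaProducer: it wraps an OS-level resource.
producer_like = {"conn": threading.Lock()}

try:
    pickle.dumps(producer_like)   # what Spark would do before shipping it
    serializable = True
except TypeError:
    serializable = False

# serializable is False, which is why the producer must instead be
# constructed inside the per-partition function, so each executor builds
# its own instance locally and nothing needs to be shipped.
assert serializable is False
```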

41. Data in some columns of a Hive table is misaligned, field values display abnormally, or the row count is wrong, yet the fields in the SQL select are in the correct order and the syntax is correct.

Cause: when the table was created with HQL or SparkSQL using create table xxx as select xxx syntax without specifying a table format, the default is the Text format, which has the worst read/write performance and compression efficiency. Some field values in the data may themselves contain content identical to the Text table's row or column delimiters (such as commas), causing false matches on the delimiters and hence misaligned columns and abnormally displayed field values.

Solution: create the hive table in ORC format. Its encoding prevents the delimiter-induced misalignment that Text suffers from, its compression efficiency is higher than Text and RCFile, and its read/write performance is also better: create table xxx stored as orc as select xxx

42. java.lang.IllegalArgumentException: Can't zip RDDs with unequal numbers of partitions

Cause: (1) The code uses functions such as zip(), which require the two RDDs to have the same number of partitions and error when they differ.

(2) The parameter spark.sql.join.preferSortMergeJoin=false was used to enable Shuffled Hash Join. This kind of join first hashes each field value and distributes it across partitions; if the data volumes of the two joined tables differ greatly, the partition counts produced after hashing differ as well, which also errors.

Solution: (1) Based on the specific business logic and data, use repartition() and similar functions to implement the corresponding logic, or partition the data itself in a targeted way.


43. org.apache.hadoop.hive.ql.parse.SemanticException: No partition predicate found for table

Cause: a query on a partitioned table does not specify a partition.

Solution: add a where filter after the select statement to specify which partitions to scan.
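A minimal sketch (the names and the partition value are placeholders):

```sql
-- The strict-partition check passes once the partition column is constrained.
select count(*)
from my_db.my_part_table
where statis_date = '20200703';
```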

44. java.io.InvalidClassException: org.apache.spark.sql.execution.exchange.ShuffleExchangeExec$; local class incompatible: stream classdesc serialVersionUID = -1370062776194295619, local class serialVersionUID = -1798326439004742901

Cause: a jar on the local Spark client machine and the jar of the same name on HDFS are not actually the same.

Solution: compare the same-named jars on the Spark client and on HDFS (timestamps, md5 values, etc.) and replace them with the same version.

45. org.apache.orc.FileFormatException: Malformed ORC file

Cause: the hive table was created specifying the ORC format, but the files in the corresponding HDFS directory are not ORC; they may not have been produced by a normal MR or Spark task, but rather were files of another format imported directly into the table's HDFS directory.

Solution: use the Hive or Spark engine to write ORC-format data into the corresponding HDFS directory normally.


This article was written by [Book Recalls Jiangnan]; please include a link to the original when reposting. Thanks.