
Reading and Writing Compressed Files with Spark and Deploying It on K8s

I recently spent some time looking into Spark on K8s and ran into a few issues along the way. This post summarizes them for anyone facing the same problems.

Environment

Hadoop cluster: deployed on physical machines

Spark: running on K8s

Requirement

The goal is to have Spark read LZO files that live on a remote Hadoop cluster.
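As a point of reference, here is a minimal sketch of what such a job can look like, using the hadoop-lzo input format and codec (the class names match the ones that appear in the stack trace below); the HDFS paths and application name are placeholders, not values from the original setup:

import com.hadoop.compression.lzo.LzopCodec
import com.hadoop.mapreduce.LzoTextInputFormat
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.spark.sql.SparkSession

object LzoReadWriteSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("lzo-read-write-sketch").getOrCreate()
    val sc = spark.sparkContext

    // Hypothetical input path on the remote Hadoop cluster.
    val input = "hdfs://namenode:8020/data/events/*.lzo"

    // LzoTextInputFormat comes from hadoop-lzo and needs the native
    // liblzo2 / libgplcompression libraries at runtime -- exactly what the
    // stock alpine-based Spark image is missing.
    val lines = sc
      .newAPIHadoopFile(input, classOf[LzoTextInputFormat],
        classOf[LongWritable], classOf[Text])
      .map { case (_, line) => line.toString }

    println(s"line count: ${lines.count()}")

    // Writing LZO-compressed output relies on the same native libraries.
    lines.saveAsTextFile("hdfs://namenode:8020/tmp/events-copy", classOf[LzopCodec])

    spark.stop()
  }
}

Both the read and the write path go through the native GPL compression libraries, which is why the image changes and submit-time settings described below are needed.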

Problem

Building the image from the Dockerfile that Spark officially provides works fine, but because the data in my test environment is stored as LZO-compressed files, Spark throws a native-library error when reading it:

19/09/20 06:02:46 WARN LzoCompressor: java.lang.UnsatisfiedLinkError: Cannot load liblzo2.so.2 (liblzo2.so.2: cannot open shared object file: No such file or directory)!
19/09/20 06:02:46 ERROR LzoCodec: Failed to load/initialize native-lzo library
19/09/20 06:02:46 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
java.lang.RuntimeException: native-lzo library not available
 at com.hadoop.compression.lzo.LzopCodec.createDecompressor(LzopCodec.java:104)
 at com.hadoop.compression.lzo.LzopCodec.createInputStream(LzopCodec.java:89)
 at com.hadoop.mapreduce.LzoLineRecordReader.initialize(LzoLineRecordReader.java:104)
 at org.apache.spark.rdd.NewHadoopRDD$$anon$1.<init>(NewHadoopRDD.scala:168)
 at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:133)
 at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:65)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
 at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
 

Solution

Spark has supported running in Kubernetes containers since version 2.3.0. The Dockerfile shipped in the Spark source uses openjdk:8-alpine as its base image. Because the image is Alpine-based and the Dockerfile installs none of the native libraries needed for LZO (and similar codecs), the job fails as soon as it tries to read an LZO file.

The fix:

1. Install the lzo dependency in Alpine

apk add lzo --no-cache
 

2. Recompile the hadoop-lzo native library inside an Alpine container

Download the hadoop-lzo source code, copy it into the Alpine container, and run:

mvn clean package -Dmaven.test.skip=true
 

If the build fails, point /etc/apk/repositories at a suitable apk mirror (the echo lines below) and install the build tools first:

echo "" > /etc/apk/repositories 
echo "" >> /etc/apk/repositories 
echo "" >> /etc/apk/repositories 
apk update --no-cache
apk add gcc --no-cache 
apk add g++ --no-cache 
apk add lzo --no-cache 
apk add lzo-dev --no-cache 
apk add make --no-cache 
 

Then rebuild. Once the build succeeds, the native libraries can be found under the target/native directory.

Built against Alpine v3.9, this produces:

the hadoop-lzo.jar dependency package

the lzo native library files

Copy these files out of the container first; they will be needed shortly when submitting the job.

3. Point the job at the lzo native library and the hadoop-lzo.jar dependency when submitting.

Below is the modified Dockerfile; it expects hadoop-lzo-0.4.21-SNAPSHOT.jar and the hadoop-lzo native libraries laid out like this:

.
├── gplnative
│ ├── libgplcompression.a
│ ├── libgplcompression.la
│ ├── libgplcompression.so
│ ├── libgplcompression.so.0
│ └── libgplcompression.so.0.0.0
└── hadoop-lzo-0.4.21-SNAPSHOT.jar
 

The Dockerfile is modified as follows:

FROM openjdk:8-alpine
ARG spark_jars=jars
ARG gpl_libs=glplib
ARG img_path=kubernetes/dockerfiles
# Before building the docker image, first build and make a Spark distribution following
# the instructions in 
# If this docker file is being used in the context of building your images from a Spark
# distribution, the docker build command should be invoked from the top level directory
# of the Spark distribution. E.g.:
# docker build -t spark:latest -f kubernetes/dockerfiles/spark/Dockerfile .
RUN set -ex && \
 echo "" > /etc/apk/repositories && \
 echo "" >> /etc/apk/repositories && \
 echo "" >> /etc/apk/repositories && \
 apk upgrade --no-cache && \
 apk add --no-cache bash tini libc6-compat && \
 apk add --no-cache lzo && \
 mkdir -p /opt/spark && \
 mkdir -p /opt/spark/work-dir && \
 touch /opt/spark/RELEASE && \
 rm /bin/sh && \
 ln -sv /bin/bash /bin/sh && \
 chgrp root /etc/passwd && chmod ug+rw /etc/passwd
COPY ${spark_jars} /opt/spark/jars
COPY ${gpl_libs}/hadoop-lzo-0.4.21-SNAPSHOT.jar /opt/spark/jars
COPY ${gpl_libs}/gplnative /opt/gplnative
COPY bin /opt/spark/bin
COPY sbin /opt/spark/sbin
COPY conf /opt/spark/conf
COPY ${img_path}/spark/entrypoint.sh /opt/
COPY examples /opt/spark/examples
COPY data /opt/spark/data
ENV SPARK_HOME /opt/spark
WORKDIR /opt/spark/work-dir
ENTRYPOINT [ "/opt/entrypoint.sh" ]
 

Add these parameters when submitting the job:

--conf spark.executor.extraLibraryPath=/opt/gplnative  
--conf spark.driver.extraLibraryPath=/opt/gplnative  
 

With that, Spark on K8s can read LZO files!

If you found this useful, feel free to follow! Let's learn and grow together!
