Adding a Spark SQL Command Line to a CDH Cluster


We currently need to migrate from Hive SQL to Spark SQL, but the CDH distribution of the cluster does not ship a Spark SQL command line. This article walks through how Cloudera's Spark commands are launched, and then builds on that to add Spark SQL.

How Cloudera Launches Spark

Our CDH version is currently 5.12, whose bundled Spark version is 1.6. I have already added Spark 2 to the cluster, so the Spark SQL support described below targets Spark 2 only. If your Spark version number differs, the same principles still apply.

Let's take the spark2-submit command as an example and trace how it starts up.

First, locate the command. Following the symlinks shows that the final target is /opt/cloudera/parcels/SPARK2-2.2.0.cloudera3-1.cdh5.13.3.p0.556753/bin/spark2-submit.

[root@master01 ~]# which spark2-submit
/bin/spark2-submit

[root@master01 ~]# ll /bin/spark2-submit
lrwxrwxrwx 1 root root 31 Feb 15 10:44 /bin/spark2-submit -> /etc/alternatives/spark2-submit

[root@master01 ~]# ll /etc/alternatives/spark2-submit
lrwxrwxrwx 1 root root 84 Feb 15 10:44 /etc/alternatives/spark2-submit -> /opt/cloudera/parcels/SPARK2-2.2.0.cloudera3-1.cdh5.13.3.p0.556753/bin/spark2-submit

Next, let's look at the contents of the spark2-submit file:

#!/bin/bash
  # Reference: http://stackoverflow.com/questions/59895/can-a-bash-script-tell-what-directory-its-stored-in
  SOURCE="${BASH_SOURCE[0]}" # path of this shell script
  BIN_DIR="$( dirname "$SOURCE" )" # directory containing this script
  # As we saw above, the command is installed as a chain of symlinks; the -h test checks whether the file is a symlink
  while [ -h "$SOURCE" ]
  do
    SOURCE="$(readlink "$SOURCE")" # resolve one level of the symlink
    [[ $SOURCE != /* ]] && SOURCE="$DIR/$SOURCE" # if the link target is a relative path, resolve it against the directory
    BIN_DIR="$( cd -P "$( dirname "$SOURCE"  )" && pwd )" # physical directory of the resolved target
  done
  BIN_DIR="$( cd -P "$( dirname "$SOURCE" )" && pwd )" # the script's real directory; here /opt/cloudera/parcels/SPARK2-2.2.0.cloudera3-1.cdh5.13.3.p0.556753/bin
  CDH_LIB_DIR=$BIN_DIR/../../CDH/lib # resolves to /opt/cloudera/parcels/CDH/lib
  LIB_DIR=$BIN_DIR/../lib # resolves to /opt/cloudera/parcels/SPARK2-2.2.0.cloudera3-1.cdh5.13.3.p0.556753/lib
export HADOOP_HOME=$CDH_LIB_DIR/hadoop # set HADOOP_HOME to /opt/cloudera/parcels/CDH/lib/hadoop

# Autodetect JAVA_HOME if it is not already set
. $CDH_LIB_DIR/bigtop-utils/bigtop-detect-javahome

exec $LIB_DIR/spark2/bin/spark-submit "$@" # hand off to the bundled spark-submit

So the spark2-submit command sets up the environment variables and resolves the real installation directory, then finally executes /opt/cloudera/parcels/SPARK2-2.2.0.cloudera3-1.cdh5.13.3.p0.556753/lib/spark2/bin/spark-submit. Let's look at that file next:

#!/usr/bin/env bash
# Determine SPARK_HOME if it is not already set
if [ -z "${SPARK_HOME}" ]; then
  source "$(dirname "$0")"/find-spark-home
fi

# disable randomized hash for string in Python 3.3+
export PYTHONHASHSEED=0
# Launch the JVM via spark-class, with main class org.apache.spark.deploy.SparkSubmit
exec "${SPARK_HOME}"/bin/spark-class org.apache.spark.deploy.SparkSubmit "$@"

The script above sources the find-spark-home helper to determine SPARK_HOME. If SPARK_HOME is already set, the helper returns immediately. Otherwise, it checks whether a find_spark_home.py file exists in its own directory: if it does, it invokes Python on that script to compute the value; if not, it derives SPARK_HOME from its own location.

In Cloudera's layout, this ends up using /opt/cloudera/parcels/SPARK2/lib/spark2 as the value of SPARK_HOME.
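A minimal sketch of that decision logic, paraphrased from the upstream find-spark-home script (not the verbatim file):

#!/usr/bin/env bash
# Paraphrased sketch of find-spark-home; the real file is sourced by spark-submit
FIND_SPARK_HOME_PYTHON_SCRIPT="$(cd "$(dirname "$0")" && pwd)/find_spark_home.py"

if [ -n "${SPARK_HOME}" ]; then
  : # already set by the caller: nothing to do
elif [ ! -f "$FIND_SPARK_HOME_PYTHON_SCRIPT" ]; then
  # normal layout: SPARK_HOME is one level up from the bin directory
  export SPARK_HOME="$(cd "$(dirname "$0")/.." && pwd)"
else
  # pip-style layout: let the Python helper locate the installation
  export SPARK_HOME="$(python "$FIND_SPARK_HOME_PYTHON_SCRIPT")"
fi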

Adding the Spark SQL Command

First, add spark2-sql symlinks under /bin, mirroring the alternatives chain used by spark2-submit, so that the spark2-sql command can be found:

ln -s  /etc/alternatives/spark2-sql /bin/spark2-sql
ln -s  /opt/cloudera/parcels/SPARK2/bin/spark2-sql  /etc/alternatives/spark2-sql
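You can verify that the chain resolves correctly (readlink -f follows every link; the final target file may not exist yet, which is fine):

which spark2-sql
readlink -f /bin/spark2-sql  # should resolve through /etc/alternatives to the parcel's bin/spark2-sql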

Then create the file /opt/cloudera/parcels/SPARK2/bin/spark2-sql with the following contents:

#!/bin/bash
  # Same bootstrap as the spark2-submit wrapper above; only the final exec line differs.
  # Reference: http://stackoverflow.com/questions/59895/can-a-bash-script-tell-what-directory-its-stored-in
  SOURCE="${BASH_SOURCE[0]}"
  BIN_DIR="$( dirname "$SOURCE" )"
  while [ -h "$SOURCE" ]
  do
    SOURCE="$(readlink "$SOURCE")"
    [[ $SOURCE != /* ]] && SOURCE="$DIR/$SOURCE"
    BIN_DIR="$( cd -P "$( dirname "$SOURCE"  )" && pwd )"
  done
  BIN_DIR="$( cd -P "$( dirname "$SOURCE" )" && pwd )"
  CDH_LIB_DIR=$BIN_DIR/../../CDH/lib
  LIB_DIR=$BIN_DIR/../lib
export HADOOP_HOME=$CDH_LIB_DIR/hadoop

# Autodetect JAVA_HOME if not defined
. $CDH_LIB_DIR/bigtop-utils/bigtop-detect-javahome

exec $LIB_DIR/spark2/bin/spark-sql "$@"

Give the file execute permission: chmod a+x /opt/cloudera/parcels/SPARK2/bin/spark2-sql

Next, create the file /opt/cloudera/parcels/SPARK2-2.2.0.cloudera3-1.cdh5.13.3.p0.556753/lib/spark2/bin/spark-sql with the following contents:

#!/usr/bin/env bash

#
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements.  See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License.  You may obtain a copy of the License at
#
#    http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#

if [ -z "${SPARK_HOME}" ]; then
  source "$(dirname "$0")"/find-spark-home
fi

export _SPARK_CMD_USAGE="Usage: ./bin/spark-sql [options] [cli option]"
exec "${SPARK_HOME}"/bin/spark-submit --class org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver "$@"

Likewise, give this file execute permission: chmod a+x /opt/cloudera/parcels/SPARK2-2.2.0.cloudera3-1.cdh5.13.3.p0.556753/lib/spark2/bin/spark-sql
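Both new files are thin wrappers: the net effect is a single spark-submit invocation of the SparkSQLCLIDriver class. For illustration, once the missing jars from the next step are installed, the same shell can in principle be launched directly (a sketch, not needed for the setup):

sudo -u hive spark2-submit \
  --class org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver \
  -e "show databases;"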

With that, all the command files are in place. However, Cloudera's bundled Spark libraries are incomplete, so the missing dependency jars must be added. We can download a prebuilt Spark release from http://archive.apache.org/dist/spark/ ; the version you pick must match the Spark version installed on the Cloudera cluster.
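For example, for the Spark 2.2.0 parcel used here (the file name is illustrative; pick the build matching your cluster and Hadoop version):

wget http://archive.apache.org/dist/spark/spark-2.2.0/spark-2.2.0-bin-hadoop2.7.tgz
tar -xzf spark-2.2.0-bin-hadoop2.7.tgz
cd spark-2.2.0-bin-hadoop2.7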

After downloading and extracting the release, there is a jars directory inside. Copy the jars that Cloudera's distribution lacks into the corresponding Cloudera jars directory:

cp jars/spark-hive-thriftserver_2.11-2.2.0.jar /opt/cloudera/parcels/SPARK2/lib/spark2/jars/
cp -n jars/hive-* /opt/cloudera/parcels/SPARK2/lib/spark2/jars/ # -n: do not overwrite the jars Cloudera already ships
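If you want to preview which hive-* jars are actually missing before copying, a quick check like the following works (a sketch using the paths above):

# List the hive-* jars from the Apache release that Cloudera's parcel does not already have
for f in jars/hive-*; do
  [ -e "/opt/cloudera/parcels/SPARK2/lib/spark2/jars/$(basename "$f")" ] || echo "missing: $f"
done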

That completes the setup. You can now use the spark2-sql command directly to enter the client shell:

sudo -u hive spark2-sql
spark-sql> use test;
spark-sql> show tables;
.....

Finally, a brief overview of commonly used spark-sql command-line options:

# Run a SQL statement directly
sudo -u hive spark2-sql -e "select count(*) from test.my_table;"

# Run a SQL file
sudo -u hive spark2-sql -f my_sql.sql
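
Beyond -e and -f, Hive-style CLI options and spark-submit options are also accepted, since the CLI runs through spark-submit. For example (illustrative values):

# Pass a Hive configuration property for the session
sudo -u hive spark2-sql --hiveconf hive.cli.print.header=true -e "select * from test.my_table limit 10;"

# spark-submit resource options work too
sudo -u hive spark2-sql --master yarn --num-executors 4 -f my_sql.sql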