3.4.8 Cluster Management
The cluster management module provides centralized management of the configuration of every cluster on the platform. Only the platform-administrator role can access this module.
3.4.8.1 Adding a Cluster
Select a cluster type as shown in the figure below. A Spark cluster handles Spark jobs, i.e. scheduled and periodic offline jobs as well as Spark Streaming jobs; a Flink cluster handles real-time Flink jobs. Compared with a Flink cluster, a Spark cluster has one extra configuration item, spark.conf.
3.4.8.2 Configuring a Cluster
- Basic information
3.4.8.3 Detailed Cluster Configuration
A sample is shown below:
# Service port exposed in conf/server.xml of jobserver-tomcat
server.port = 7002
spark.home = /home/admin/spark-2.4.0-bin-2.6.0-cdh5.15.0
spark.yarn.jars = /user/yarn_jars/spark_2.4.0_2.0.0/*
spark.jobserver.jar = /user/yarn_jars/spark_2.4.0_2.0.0/jobserver-yarn-2.4.0.jar
# datacompute address
datacompute.addr = http://10.57.26.5:8181
# Kerberos principal used by jobserver-control and jobserver-yarn
#hadoop.kerberos.user=tdkj1@HADOOP.COM
hadoop.yarn.webui = http://cdh173:8088
spark.create.table.enabled = true
jobserver.profile = dev
jobserver.console.url = http://cdh173:8088/proxy/
spark.jobserver.control.url = http://10.57.30.218:7002
datacompute.git = https://gitlab.fraudmetrix.cn
jobserver.parquet.write.users = jian.tang
spark.dc.column.authorization.enabled = true
spark.executor.extraJavaOptions = -Dfile.encoding=UTF-8
spark.driver.extraJavaOptions = -Dfile.encoding=UTF-8
# Cluster submission limits
spark.jobserver.maxAccept = 8
spark.jobserver.maxNum = 20
spark.jobserver.yarn.limitMemory = 1024
spark.jobserver.yarn.limitCore = 2
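Since the block above is a plain key = value properties file, it can be sanity-checked offline before being saved in the UI. A minimal sketch in Python (the file name cluster.properties is a placeholder; the meaning of the limit keys is inferred from their names):

import configparser

# The cluster config is "key = value" lines with "#" comments, so it can be
# read with configparser after prepending a dummy section header.
with open("cluster.properties", encoding="utf-8") as f:
    text = "[cluster]\n" + f.read()

parser = configparser.ConfigParser(delimiters=("=",), interpolation=None)
parser.read_string(text)
conf = parser["cluster"]

# Submission limits from the sample above (semantics assumed from the names).
print("max accepted submissions:", conf.getint("spark.jobserver.maxAccept"))
print("max concurrent jobs     :", conf.getint("spark.jobserver.maxNum"))
print("yarn memory limit       :", conf.getint("spark.jobserver.yarn.limitMemory"))
print("yarn core limit         :", conf.getint("spark.jobserver.yarn.limitCore"))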
3.4.8.4 spark.conf Configuration
A sample is shown below:
spark.yarn.queue default
spark.master yarn
#spark.dc.url http://cdh174:8181
spark.yarn.dist.jars hdfs://tdhdfs/user/yarn_jars/aspectjweaver-1.8.10.jar,hdfs://tdhdfs/user/yarn_jars/jvm_profiler/jvm-profiler-0.0.9.jar,hdfs://tdhdfs/user/yarn_jars/sql_extensions/spark-sql-extension-1.0.1.jar
# Spark scheduling parameters
spark.speculation false
spark.speculation.interval 1000
spark.speculation.multiplier 1.5
spark.speculation.quantile 0.75
spark.task.cpus 1
# Spark execution parameters
spark.broadcast.blockSize 4m
spark.default.parallelism 8
spark.files.useFetchCache true
spark.files.maxPartitionBytes 134217728
spark.storage.memoryMapThreshold 10m
spark.files.overwrite true
spark.eventLog.logStageExecutorMetrics.enabled true
spark.eventLog.logStageExecutorProcessTreeMetrics.enabled true
# Spark launch parameters
spark.executor.instances 1
spark.executor.memory 1G
spark.driver.memory 1G
spark.driver.cores 1
spark.executor.memoryOverhead 512m
spark.driver.memoryOverhead 512m
spark.yarn.queue root.user.admin
spark.hive.init true
spark.tispark.pd.addresses 10.58.10.33:2379
spark.tispark.plan.allow_index_read true
# SQL extensions (e.g. column-level authorization)
#spark.sql.extensions cn.tongdun.sql.TDExtensions
# Spark SQL parameters
spark.sql.codegen false
spark.sql.shuffle.partitions 200
spark.sql.parquet.cacheMetadata true
spark.sql.inMemoryColumnarStorage.compressed true
spark.sql.inMemoryColumnarStorage.batchSize 10000
spark.sql.catalogImplementation hive
#spark.sql.parquet.compression.codec zstd
# Spark dynamic allocation parameters
spark.dynamicAllocation.enabled false
spark.shuffle.service.enabled true
spark.dynamicAllocation.executorIdleTimeout 300s
spark.dynamicAllocation.minExecutors 6
spark.dynamicAllocation.initialExecutors 0
spark.dynamicAllocation.maxExecutors 30
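# Output compression parameters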
spark.hadoop.hive.exec.compress.output true
spark.hadoop.mapreduce.output.fileoutputformat.compress.codec org.apache.hadoop.io.compress.SnappyCodec
spark.hadoop.hive.output.file.extension .snappy.parquet
spark.hadoop.parquet.metadata.read.parallelism 8
spark.hadoop.parquet.compress SNAPPY
spark.parquet.column.index.access true
spark.sql.parquet.mergeSchema false
spark.submit.tasks.threshold.enabled true
spark.submit.tasks.threshold 10000
spark.driver.extraLibraryPath /opt/cloudera/parcels/CDH/lib/hadoop/lib/native
spark.executor.extraLibraryPath /opt/cloudera/parcels/CDH/lib/hadoop/lib/native
spark.eventLog.enabled true
spark.eventLog.dir hdfs://tdhdfs/tmp/spark
spark.serializer org.apache.spark.serializer.KryoSerializer
spark.updatejar.enabled true
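Unlike the cluster config in 3.4.8.3, spark.conf uses the space-separated key value layout of Spark's spark-defaults.conf. A minimal sketch (the file name spark.conf and the app.jar argument are placeholders) that converts such a file into --conf flags for spark-submit:

import shlex

def load_spark_conf(path):
    """Parse space-separated 'key value' lines, skipping blanks and # comments."""
    conf = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.strip().split(None, 1)
            if len(parts) != 2 or parts[0].startswith("#"):
                continue
            conf[parts[0]] = parts[1]
    return conf

conf = load_spark_conf("spark.conf")
flags = " ".join("--conf " + shlex.quote(f"{k}={v}") for k, v in sorted(conf.items()))
print(f"spark-submit {flags} app.jar")

Spark can also consume this file format directly through spark-submit --properties-file.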
3.4.8.5 The xxx-site.xml Configuration Files
Download them from the CDH Hive admin page, then copy each one in:
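A minimal sketch that checks each downloaded file is well-formed XML before it is pasted in (the file list assumes the standard Hadoop/Hive client names; adjust it to what your CDH download actually contains):

import xml.etree.ElementTree as ET

# Standard Hadoop/Hive client config files downloaded from CDH (assumed set).
for name in ("core-site.xml", "hdfs-site.xml", "yarn-site.xml", "hive-site.xml"):
    root = ET.parse(name).getroot()
    props = {p.findtext("name"): p.findtext("value") for p in root.iter("property")}
    print(f"{name}: {len(props)} properties")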
3.4.8.6 kerberos.conf Configuration
If Kerberos is enabled, copy the contents of krb5.conf here:
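Before pasting, it can be worth smoke-testing the krb5.conf on a client machine. A minimal sketch (the keytab path is a placeholder; the principal follows the commented-out hadoop.kerberos.user from the sample in 3.4.8.3):

import subprocess

# Obtain a ticket for the platform principal, then list it; both commands
# fail loudly if krb5.conf (realm, KDC address) is wrong.
principal = "tdkj1@HADOOP.COM"
keytab = "/etc/security/keytabs/tdkj1.keytab"  # placeholder path
subprocess.run(["kinit", "-kt", keytab, principal], check=True)
subprocess.run(["klist"], check=True)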