Installing Hadoop on WSL2 Ubuntu 20.04 and Writing a WordCount

Of all the tutorials I have followed, getting Hadoop's Hello World to run was by far the most painful. I went through, among other things, Hadoop: The Definitive Guide, Zhihu posts, and CSDN write-ups, and only after adding the official documentation on top of all that did I barely get the install working and a WordCount result out of it... exhausting.

Prerequisites

  • GNU/Linux. Here: WSL Ubuntu-20.04
  • Java. Here: openjdk version "11.0.13"
  • ssh.
  • The Hadoop tarball, available from the Apache Download Mirrors. Here: hadoop-3.3.2.tar.gz (one way to get all of these is shown below)
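
On Ubuntu, the Java and ssh prerequisites can be installed with apt, and the Hadoop tarball fetched with wget. This is just one way to do it; the archive URL below is an example, and any mirror from the Apache download page works:

sudo apt update
sudo apt install -y openjdk-11-jdk openssh-server
wget https://archive.apache.org/dist/hadoop/common/hadoop-3.3.2/hadoop-3.3.2.tar.gz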

JDK Environment Setting

It is best to set up the Java environment first; these exports can go in ~/.bashrc, the same file used below for the Hadoop variables:

export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64
export JRE_HOME=${JAVA_HOME}/jre
export CLASSPATH=.:${JAVA_HOME}/lib:${JRE_HOME}/lib
export PATH=${JAVA_HOME}/bin:$PATH
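
The JAVA_HOME path above assumes Ubuntu's openjdk-11 package location; if you are unsure where the JDK actually landed, have a look under /usr/lib/jvm:

ls /usr/lib/jvm/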

Check that it works:

java -version

Output:

openjdk version "11.0.13" 2021-10-19
OpenJDK Runtime Environment (build 11.0.13+8-Ubuntu-0ubuntu1.20.04)
OpenJDK 64-Bit Server VM (build 11.0.13+8-Ubuntu-0ubuntu1.20.04, mixed mode, sharing)

Hadoop Installation

Extract hadoop-3.3.2.tar.gz to a directory of your choice (here, ~/apps):

mkdir -p ~/apps
sudo tar -zxvf hadoop*.tar.gz -C ~/apps
cd ~/apps
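
Since the archive was unpacked with sudo, the extracted files may be owned by root; it is usually convenient to hand them back to your own user (adjust the user and group if yours differ):

sudo chown -R $USER:$USER ~/apps/hadoop-3.3.2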

Add the environment variables:

vim ~/.bashrc

Append the Hadoop environment to the end of the file and save:

# Hadoop environment
export HADOOP_HOME=~/apps/hadoop-3.3.2
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
export JAVA_LIBRARY_PATH=$HADOOP_HOME/lib/native
export HADOOP_CLASSPATH=${JAVA_HOME}/lib/tools.jar
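
Reload the shell configuration so the new variables take effect in the current session (assuming a Bash shell):

source ~/.bashrc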

Now we can verify that the Hadoop environment is set up correctly:

hadoop version

Output:

Hadoop 3.3.2
...
This command was run using /home/username/apps/hadoop-3.3.2/share/hadoop/common/hadoop-common-3.3.2.jar

Hadoop Configuration

Hadoop has three running modes: Local (Standalone) Mode, Pseudo-Distributed Mode, and Fully-Distributed Mode. For demonstration and learning purposes, we will set Hadoop up in pseudo-distributed mode.

Go into $HADOOP_HOME/etc/hadoop (here, ~/apps/hadoop-3.3.2/etc/hadoop) and edit hadoop-env.sh:

export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64
export HADOOP_OPTS="-Djava.library.path=${HADOOP_HOME}/lib/native"

Before editing the config files, create $HADOOP_HOME/tmp/data, $HADOOP_HOME/tmp/name and $HADOOP_HOME/logs, and give ownership of the tmp and logs directories to your own user.
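
One way to do this, with placeholders for your own user and group:

mkdir -p $HADOOP_HOME/tmp/data $HADOOP_HOME/tmp/name $HADOOP_HOME/logs
sudo chown -R <username>:<group> $HADOOP_HOME/tmp $HADOOP_HOME/logs

Then edit core-site.xml: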

<configuration>
    <property>
        <name>hadoop.tmp.dir</name>
        <value>file:/home/username/apps/hadoop-3.3.2/tmp</value>
        <description>A base for other temporary directories.</description>
    </property>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://localhost:9000</value>
    </property>
</configuration>

Edit hdfs-site.xml:

<configuration>
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>
    <property>
        <name>dfs.namenode.name.dir</name>
        <value>file:/home/username/apps/hadoop-3.3.2/tmp/name</value>
    </property>
    <property>
        <name>dfs.datanode.data.dir</name>
        <value>file:/home/username/apps/hadoop-3.3.2/tmp/data</value>
    </property>
</configuration>

We set dfs.replication to 1 so that HDFS does not keep the default of 3 replicas per file system block.
Next, configure YARN in single-node mode. Edit mapred-site.xml:

<configuration>
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
    <property>
        <name>mapreduce.application.classpath</name>
        <value>$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/*:$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/lib/*</value>
    </property>
</configuration>

Then edit yarn-site.xml:

<configuration>
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
    <property>
        <name>yarn.nodemanager.env-whitelist</name>
        <value>JAVA_HOME,HADOOP_COMMON_HOME,HADOOP_HDFS_HOME,HADOOP_CONF_DIR,CLASSPATH_PREPEND_DISTCACHE,HADOOP_YARN_HOME,HADOOP_HOME,PATH,LANG,TZ,HADOOP_MAPRED_HOME</value>
    </property>
</configuration>

That completes the Hadoop configuration. Let's test it. First, start ssh:

sudo service ssh start
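
start-all.sh connects to localhost over ssh, so passwordless ssh must work. If it is not set up yet, the key setup from the official single-node guide is roughly as follows (assuming there is no existing key you would overwrite):

ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 0600 ~/.ssh/authorized_keys
ssh localhost    # should now log in without a password prompt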

On first startup, the NameNode must be formatted:

hdfs namenode -format

Then start the Hadoop cluster:

start-all.sh

Starting namenodes on [localhost]
Starting datanodes
Starting secondary namenodes [<hostname>]
Starting resourcemanager
Starting nodemanagers

Check all the Java processes currently running:

jps

14388 ResourceManager
14173 SecondaryNameNode
13742 NameNode
13902 DataNode
14542 NodeManager
14943 Jps

If the NameNode or DataNode does not start properly, you can refer to the notes here.

The Hadoop web UI is now available at http://localhost:9870

The resource manager UI is at http://localhost:8088/

WordCount

Finally we can put Hadoop to work, for example by writing a WordCount.

First, prepare the data. Suppose:

  • /wordcount/input is the input directory in HDFS
  • /wordcount/output is the output directory in HDFS

Create the input path in HDFS. Do not create the output path yet; otherwise the MapReduce job will fail later because the path already exists:

hadoop fs -mkdir -p /wordcount/input

We will prepare two files as input:

## file01
Hello World Bye World

## file02
Hello Hadoop Goodbye Hadoop
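
One way to create them locally (the ~/wordcount path is just an example location):

mkdir -p ~/wordcount && cd ~/wordcount
echo "Hello World Bye World" > file01
echo "Hello Hadoop Goodbye Hadoop" > file02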

With both files under ~/wordcount/, import them into HDFS:

hadoop fs -copyFromLocal ./file0* /wordcount/input
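
To confirm that both files landed in HDFS:

hadoop fs -ls /wordcount/input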

Now we can write the Mapper and Reducer code. This is copied straight from the latest official tutorial:

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  public static class TokenizerMapper
       extends Mapper<Object, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context
                    ) throws IOException, InterruptedException {
      // Split the line into tokens and emit (word, 1) for each one
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  public static class IntSumReducer
       extends Reducer<Text, IntWritable, Text, IntWritable> {
    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values,
                       Context context
                       ) throws IOException, InterruptedException {
      // Sum the counts for one word and emit (word, total)
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);   // combiner = local reduce on the map side
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Save it as ~/java_file/wordcount/WordCount.java, then compile the Java file and create a jar:

cd ~/java_file/wordcount/
hadoop com.sun.tools.javac.Main WordCount.java
jar cf wc.jar WordCount*.class
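
If com.sun.tools.javac.Main gives trouble on a newer JDK, compiling with plain javac against the Hadoop classpath should work just as well (assuming javac from the same JDK is on the PATH):

javac -classpath "$(hadoop classpath)" WordCount.java
jar cf wc.jar WordCount*.class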

Then run the job:

hadoop jar wc.jar WordCount /wordcount/input /wordcount/output

If it runs to completion, the results will be in /wordcount/output; let's take a look:

hadoop fs -cat /wordcount/output/part-r-00000

Bye 1
Goodbye 1
Hadoop 2
Hello 2
World 2
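
To re-run the job, remove the output directory first; otherwise the job fails because the path already exists, as noted earlier. When you are done experimenting, the whole pseudo-cluster can be shut down with stop-all.sh:

hadoop fs -rm -r /wordcount/output
stop-all.sh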