Deep Learning at Alibaba Cloud with Alluxio: How To Run PyTorch on HDFS

Bin Fan (@bin-fan)

VP of Open Source and Founding Member @Alluxio

Google's TensorFlow and Facebook's PyTorch are two deep learning frameworks that have been popular with the open source community. Although PyTorch is still a relatively new framework, many developers have successfully adopted it because of its ease of use.

By default, PyTorch does not support training deep learning models directly on HDFS, which creates challenges for users who store data sets in HDFS. These users need to either export the HDFS data at the start of each training job or modify the PyTorch source code to support reading from HDFS. Neither approach is ideal, because both require additional manual work that may introduce additional uncertainties into the training job.

To avoid this problem, we chose to use Alluxio as an interface for accessing HDFS through a POSIX file system interface. This approach greatly improved development efficiency at Alibaba Cloud.
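To make the benefit of a POSIX interface concrete, here is a minimal Python sketch. It assumes the data set is visible as an ordinary directory (for example through a FUSE mount); the function name `list_dataset_files` is made up for this illustration. Nothing in it is specific to Alluxio or HDFS, which is the point: the training code only needs standard file APIs.

```python
import os


def list_dataset_files(mount_root):
    """Return (relative_path, size_in_bytes) for every file under mount_root.

    mount_root is assumed to be a directory backed by a POSIX-style mount.
    Only standard library calls are used; no HDFS client is involved.
    """
    found = []
    for dirpath, _dirnames, filenames in os.walk(mount_root):
        for name in filenames:
            full_path = os.path.join(dirpath, name)
            rel_path = os.path.relpath(full_path, mount_root)
            found.append((rel_path, os.path.getsize(full_path)))
    return sorted(found)
```

Inside a training container, calling `list_dataset_files("/data")` would list whatever the mount exposes under `/data`, the container path used later in this article.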

This article demonstrates how this work was accomplished within a Kubernetes environment.

Prepare HDFS 2.7.2 environment

For this tutorial, we used a Helm chart to install HDFS to mock an existing HDFS cluster.

1. Install the Helm chart of Hadoop 2.7.2

git clone
kubectl label nodes cn-huhehaote. hdfs-namenode-selector=hdfs-namenode-0
#helm install -f values.yaml hdfs charts/hdfs-k8s
helm dependency build charts/hdfs-k8s
helm install hdfs charts/hdfs-k8s \
  --set tags.ha=false \
  --set tags.simple=true \
  --set global.namenodeHAEnabled=false \
  --set hdfs-simple-namenode-k8s.nodeSelector.hdfs-namenode-selector=hdfs-namenode-0

2. Check the status of the Helm chart

kubectl get all -l release=hdfs

3. Access HDFS from the client

kubectl exec -it hdfs-client-f5bc448dd-rc28d bash
[email protected]:/# hdfs dfsadmin -report
Configured Capacity: 422481862656 (393.47 GB)
Present Capacity: 355748564992 (331.32 GB)
DFS Remaining: 355748515840 (331.32 GB)
DFS Used: 49152 (48 KB)
DFS Used%: 0.00%
Under replicated blocks: 0
Blocks with corrupt replicas: 0
Missing blocks: 0
Missing blocks (with replication factor 1): 0

-------------------------------------------------
Live datanodes (2):

Name: (172-31-136-180.node-exporter.arms-prom.svc.cluster.local)
Hostname: iZj6c7rzs9xaeczn47omzcZ
Decommission Status : Normal
Configured Capacity: 211240931328 (196.73 GB)
DFS Used: 24576 (24 KB)
Non DFS Used: 32051716096 (29.85 GB)
DFS Remaining: 179189190656 (166.88 GB)
DFS Used%: 0.00%
DFS Remaining%: 84.83%
Configured Cache Capacity: 0 (0 B)
Cache Used: 0 (0 B)
Cache Remaining: 0 (0 B)
Cache Used%: 100.00%
Cache Remaining%: 0.00%
Xceivers: 1
Last contact: Tue Mar 31 16:48:52 UTC 2020
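If you want to check the report from a script instead of reading it by eye, the cluster-wide counters can be pulled out with a few lines of Python. This is only a sketch: it assumes the report uses the stock Hadoop labels (such as "Configured Capacity" and "DFS Remaining"), and the function name `parse_cluster_summary` is hypothetical.

```python
import re

# Matches lines such as "DFS Used: 49152 (48 KB)" and captures the label
# and the raw byte count. Percentage lines and plain counters do not match.
_BYTES_LINE = re.compile(r"^(?P<label>[A-Za-z ]+?)\s*:\s*(?P<bytes>\d+)\s+\(")


def parse_cluster_summary(report_text):
    """Collect byte counters from the cluster-wide part of a dfsadmin report.

    Parsing stops at the first "Live datanodes" line, so per-datanode
    values do not overwrite the cluster totals.
    """
    summary = {}
    for line in report_text.splitlines():
        line = line.strip()
        if line.startswith("Live datanodes"):
            break
        match = _BYTES_LINE.match(line)
        if match:
            summary[match.group("label")] = int(match.group("bytes"))
    return summary
```

With the result in hand, a ratio such as `summary["DFS Remaining"] / summary["Configured Capacity"]` gives a quick view of how much of the cluster is still free.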

4. HDFS client configuration

[[email protected] kubernetes-HDFS]# kubectl exec -it hdfs-client-f5bc448dd-rc28d bash
[email protected]:/# cat /etc/hadoop-custom-conf
cat: /etc/hadoop-custom-conf: Is a directory
[email protected]:/# cd /etc/hadoop-custom-conf
[email protected]:/etc/hadoop-custom-conf# ls
core-site.xml hdfs-site.xml
[email protected]:/etc/hadoop-custom-conf# cat core-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://hdfs-namenode-0.hdfs-namenode.default.svc.cluster.local:8020</value>
  </property>
</configuration>
[email protected]:/etc/hadoop-custom-conf# cat hdfs-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name></name>
    <value>file:///hadoop/dfs/name</value>
  </property>
  <property>
    <name>dfs.namenode.datanode.registration.ip-hostname-check</name>
    <value>false</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>/hadoop/dfs/data/0</value>
  </property>
</configuration>
[email protected]:/etc/hadoop-custom-conf# hadoop --version
Error: No command named `--version' was found. Perhaps you meant `hadoop -version'
[email protected]:/etc/hadoop-custom-conf# hadoop -version
Error: No command named `-version' was found. Perhaps you meant `hadoop version'
[email protected]:/etc/hadoop-custom-conf# hadoop version
Hadoop 2.7.2
Subversion -r b165c4fe8a74265c792ce23f546c64604acf0e41
Compiled by jenkins on 2019-01-26T00:08Z
Compiled with protoc 2.5.0
From source with checksum d0fda26633fa762bff87ec759ebe689c
This command was run using /opt/hadoop-2.7.2/share/hadoop/common/hadoop-common-2.7.2.jar
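The NameNode address in core-site.xml is the value that Alluxio later needs as its root under-storage address, so it can be handy to read it programmatically rather than copy it by hand. The sketch below uses only the Python standard library; the helper name `read_hadoop_property` is hypothetical, and it assumes the file is a well-formed Hadoop `*-site.xml` document.

```python
import xml.etree.ElementTree as ET


def read_hadoop_property(xml_text, property_name):
    """Return the <value> of the named <property>, or None if it is absent.

    xml_text is the full text of a Hadoop *-site.xml file.
    """
    root = ET.fromstring(xml_text)
    for prop in root.iter("property"):
        if prop.findtext("name") == property_name:
            return prop.findtext("value")
    return None
```

For example, `read_hadoop_property(open("core-site.xml").read(), "fs.defaultFS")` would return the `hdfs://...` address configured for the cluster.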

5. Try basic HDFS file operations

# hdfs dfs -ls /
Found 1 items
drwxr-xr-x - root supergroup 0 2020-03-31 16:51 /test
# hdfs dfs -mkdir /mytest
# hdfs dfs -copyFromLocal /etc/hadoop/hadoop-env.cmd /test/
# hdfs dfs -ls /test
Found 2 items
-rw-r--r-- 3 root supergroup 3670 2020-04-20 08:51 /test/hadoop-env.cmd

6. Download data

mkdir -p /data/MNIST/raw/
cd /data/MNIST/raw/
hdfs dfs -mkdir -p /data/MNIST/raw
hdfs dfs -copyFromLocal *.gz /data/MNIST/raw
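Before uploading, it can be worth confirming that the `.gz` archives are intact. The sketch below assumes they are the standard MNIST files in IDX format, where the header starts with a big-endian magic number (2051 for image files, 2049 for label files) followed by one 32-bit size per dimension. The function name `read_idx_header` is hypothetical.

```python
import gzip
import struct


def read_idx_header(path):
    """Read the header of a gzip-compressed IDX file.

    Returns (magic, dims), where dims is a tuple of dimension sizes.
    """
    with gzip.open(path, "rb") as f:
        magic = struct.unpack(">I", f.read(4))[0]
        # The lowest byte of the magic number is the number of dimensions.
        ndim = magic & 0xFF
        dims = struct.unpack(">" + "I" * ndim, f.read(4 * ndim))
    return magic, dims
```

For a standard MNIST training image archive, the expected header is magic 2051 with dimensions (60000, 28, 28); anything else suggests a truncated or wrong file.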

Deploy Alluxio

1. First, select the designated nodes, which can be one or more

kubectl label nodes cn-huhehaote. dataset=mnist

2. Create config.yaml, in which you need to configure the node selector to specify the node

cat << EOF > config.yaml
imageTag: "2.2.0-SNAPSHOT-b2c7e50"
nodeSelector:
  dataset: mnist
properties:
  alluxio.fuse.debug.enabled: "false"
  alluxio.user.file.writetype.default: MUST_CACHE
  alluxio.master.journal.folder: /journal
  alluxio.master.journal.type: UFS
  alluxio.master.mount.table.root.ufs: "hdfs://hdfs-namenode-0.hdfs-namenode.default.svc.cluster.local:8020"
worker:
  jvmOptions: " -Xmx4G "
master:
  jvmOptions: " -Xmx4G "
tieredstore:
  levels:
  - alias: MEM
    level: 0
    quota: 20GB
    type: hostPath
    path: /dev/shm
    high: 0.99
    low: 0.8
fuse:
  image:
  imageTag: "2.2.0-SNAPSHOT-b2c7e50"
  jvmOptions: " -Xmx4G -Xms4G "
  args:
  - fuse
  - --fuse-opts=direct_io
EOF

It should be noted that the HDFS version needs to be specified at compilation time.
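The tiered store settings in config.yaml give the memory tier a 20GB quota with a high ratio of 0.99 and a low ratio of 0.8. As a rough sketch of what those numbers mean in bytes, the snippet below assumes two things that are not stated in this article: that the ratios act as usage watermarks (space is reclaimed once usage passes the high mark, until it falls to the low mark), and that 1 GB is counted as 1024^3 bytes. The function name `watermark_bytes` is hypothetical.

```python
from fractions import Fraction

GB = 1024 ** 3  # assumption: binary gigabytes


def watermark_bytes(quota_bytes, high_ratio, low_ratio):
    """Return (high_mark, low_mark) in whole bytes for a tier quota.

    Ratios are converted through their decimal text so that values such
    as 0.8 are treated exactly instead of as binary floating point.
    """
    high = Fraction(str(high_ratio))
    low = Fraction(str(low_ratio))
    if not 0 <= low <= high <= 1:
        raise ValueError("expected 0 <= low <= high <= 1")
    return int(quota_bytes * high), int(quota_bytes * low)
```

Under these assumptions, `watermark_bytes(20 * GB, 0.99, 0.8)` puts the low mark at exactly 16 GB and the high mark just under 20 GB, so roughly 4 GB of the tier can be cycled without touching the rest of the cached data.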

3. Deploy Alluxio

tar -xvf alluxio-0.12.0.tgz
helm install alluxio -f config.yaml alluxio

4. Check the status of Alluxio and wait until all components are ready

helm get manifest alluxio | kubectl get -f -
service/alluxio-master ClusterIP None <none> 19998/TCP,19999/TCP,20001/TCP,20002/TCP 14h

NAME DESIRED CURRENT READY UP-TO-DATE AVAILABLE NODE SELECTOR AGE
daemonset.apps/alluxio-fuse 4 4 4 4 4 <none> 14h
daemonset.apps/alluxio-worker 4 4 4 4 4 <none> 14h

NAME READY AGE
statefulset.apps/alluxio-master 1/1 14h
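Waiting "until all components are ready" can be automated by comparing the desired and ready counts in this output. The following is a minimal sketch, assuming the column order shown above (NAME DESIRED CURRENT READY ... for DaemonSets, and NAME READY AGE with a "ready/desired" pair for StatefulSets); the function name `workloads_ready` is hypothetical.

```python
def workloads_ready(kubectl_output):
    """Map each DaemonSet/StatefulSet line to True when it is fully ready.

    Lines that describe other resource kinds, headers, and blank lines
    are ignored.
    """
    status = {}
    for line in kubectl_output.splitlines():
        fields = line.split()
        if not fields:
            continue
        name = fields[0]
        if name.startswith("daemonset.apps/") and len(fields) >= 4:
            desired, ready = int(fields[1]), int(fields[3])
            status[name] = desired == ready
        elif name.startswith("statefulset.apps/") and len(fields) >= 2:
            ready, _, desired = fields[1].partition("/")
            status[name] = ready == desired
    return status
```

A caller could poll the `kubectl get` command and proceed once `all(workloads_ready(output).values())` is true.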

Prepare PyTorch container image

1. Create a Dockerfile

mkdir pytorch-mnist
cd pytorch-mnist
vim Dockerfile

Populate the Dockerfile with the following content:

FROM pytorch/pytorch:1.4-cuda10.1-cudnn7-devel
# pytorch/pytorch:1.4-cuda10.1-cudnn7-devel
ADD /
CMD ["python", "/"]

2. Create the PyTorch Python file

cd pytorch-mnist

Populate the Python file with the following content:

# -*- coding: utf-8 -*-
# @Author: cheyang
# @Date: 2020-04-18 22:41:12
# @Last Modified by: cheyang
# @Last Modified time: 2020-04-18 22:44:06
from __future__ import print_function
import argparse
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torchvision import datasets, transforms
from torch.optim.lr_scheduler import StepLR


class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(1, 32, 3, 1)
        self.conv2 = nn.Conv2d(32, 64, 3, 1)
        self.dropout1 = nn.Dropout2d(0.25)
        self.dropout2 = nn.Dropout2d(0.5)
        self.fc1 = nn.Linear(9216, 128)
        self.fc2 = nn.Linear(128, 10)

    def forward(self, x):
        x = self.conv1(x)
        x = F.relu(x)
        x = self.conv2(x)
        x = F.relu(x)
        x = F.max_pool2d(x, 2)
        x = self.dropout1(x)
        x = torch.flatten(x, 1)
        x = self.fc1(x)
        x = F.relu(x)
        x = self.dropout2(x)
        x = self.fc2(x)
        output = F.log_softmax(x, dim=1)
        return output


def train(args, model, device, train_loader, optimizer, epoch):
    model.train()
    for batch_idx, (data, target) in enumerate(train_loader):
        # Move the batch to the same device as the model
        data, target = data.to(device), target.to(device)
        optimizer.zero_grad()
        output = model(data)
        loss = F.nll_loss(output, target)
        loss.backward()
        optimizer.step()
        if batch_idx % args.log_interval == 0:
            print('Train Epoch: {} [{}/{} ({:.0f}%)]\tLoss: {:.6f}'.format(
                epoch, batch_idx * len(data), len(train_loader.dataset),
                100. * batch_idx / len(train_loader), loss.item()))


def test(model, device, test_loader):
    model.eval()
    test_loss = 0
    correct = 0
    with torch.no_grad():
        for data, target in test_loader:
            data, target = data.to(device), target.to(device)
            output = model(data)
            # Sum up the batch loss
            test_loss += F.nll_loss(output, target, reduction='sum').item()
            # Index of the highest log-probability is the predicted class
            pred = output.argmax(dim=1, keepdim=True)
            correct += pred.eq(target.view_as(pred)).sum().item()

    test_loss /= len(test_loader.dataset)

    print('\nTest set: Average loss: {:.4f}, Accuracy: {}/{} ({:.0f}%)\n'.format(
        test_loss, correct, len(test_loader.dataset),
        100. * correct / len(test_loader.dataset)))


def main():
    # Training settings
    parser = argparse.ArgumentParser(description='PyTorch MNIST Example')
    parser.add_argument('--batch-size', type=int, default=64, metavar='N',
                        help='input batch size for training (default: 64)')
    parser.add_argument('--test-batch-size', type=int, default=1000, metavar='N',
                        help='input batch size for testing (default: 1000)')
    parser.add_argument('--epochs', type=int, default=14, metavar='N',
                        help='number of epochs to train (default: 14)')
    parser.add_argument('--lr', type=float, default=1.0, metavar='LR',
                        help='learning rate (default: 1.0)')
    parser.add_argument('--gamma', type=float, default=0.7, metavar='M',
                        help='Learning rate step gamma (default: 0.7)')
    parser.add_argument('--no-cuda', action='store_true', default=False,
                        help='disables CUDA training')
    parser.add_argument('--seed', type=int, default=1, metavar='S',
                        help='random seed (default: 1)')
    parser.add_argument('--log-interval', type=int, default=10, metavar='N',
                        help='how many batches to wait before logging training status')
    parser.add_argument('--save-model', action='store_true', default=False,
                        help='For Saving the current Model')
    args = parser.parse_args()
    use_cuda = not args.no_cuda and torch.cuda.is_available()

    torch.manual_seed(args.seed)

    device = torch.device("cuda" if use_cuda else "cpu")

    kwargs = {'num_workers': 1, 'pin_memory': True} if use_cuda else {}
    train_loader = torch.utils.data.DataLoader(
        datasets.MNIST('../data', train=True, download=True,
                       transform=transforms.Compose([
                           transforms.ToTensor(),
                           transforms.Normalize((0.1307,), (0.3081,))
                       ])),
        batch_size=args.batch_size, shuffle=True, **kwargs)
    test_loader = torch.utils.data.DataLoader(
        datasets.MNIST('../data', train=False,
                       transform=transforms.Compose([
                           transforms.ToTensor(),
                           transforms.Normalize((0.1307,), (0.3081,))
                       ])),
        batch_size=args.test_batch_size, shuffle=True, **kwargs)

    model = Net().to(device)
    optimizer = optim.Adadelta(model.parameters(), lr=args.lr)

    scheduler = StepLR(optimizer, step_size=1, gamma=args.gamma)
    for epoch in range(1, args.epochs + 1):
        train(args, model, device, train_loader, optimizer, epoch)
        test(model, device, test_loader)
        scheduler.step()

    if args.save_model:
        torch.save(model.state_dict(), "")


if __name__ == '__main__':
    main()
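One number in the model that is easy to get wrong is the 9216 input features of `fc1`. It follows from the 28x28 MNIST image size and the layer settings above: two unpadded 3x3 convolutions with stride 1, then a 2x2 max pool, with 64 channels coming out of `conv2`. The short sketch below reproduces the arithmetic without PyTorch; the helper names are made up for this illustration.

```python
def window_out(size, kernel, stride=1):
    """Output width of an unpadded convolution or pooling window."""
    return (size - kernel) // stride + 1


def flattened_features(image_size=28, out_channels=64):
    """Number of values per sample after conv1, conv2, and max_pool2d."""
    size = window_out(image_size, kernel=3)      # conv1: 28 -> 26
    size = window_out(size, kernel=3)            # conv2: 26 -> 24
    size = window_out(size, kernel=2, stride=2)  # max_pool2d(x, 2): 24 -> 12
    return out_channels * size * size            # 64 * 12 * 12


print(flattened_features())  # 9216
```

If the input resolution or kernel sizes change, the first argument of `nn.Linear` for `fc1` has to change with them, and this kind of calculation gives the new value.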

3. Build the image

Build a custom image from the same directory level, tagging it with the target container image name for this example:
docker build -t .

4. Push the built image

Push the image to the image repository created in the East China 1 region, for users who are in the Greater China region (Alibaba Cloud). You can refer to the basic operations for images.

Submit PyTorch training tasks

1. Install arena

$ wget
$ tar -xvf arena-installer-0.3.3-332fcde-linux-amd64.tar.gz
$ cd arena-installer/
$ ./install.
$ yum install bash-completion -y
$ echo "source <(arena completion bash)" >> ~/.bashrc
$ chmod u+x ~/.bashrc

2. Use arena to submit the training task; remember to set the selector to dataset=mnist

arena submit tf \
  --name=alluxio-pytorch \
  --selector=dataset=mnist \
  --data-dir=/alluxio-fuse/data:/data \
  --gpus=1 \
  "python /"
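When the same job is submitted repeatedly with different names or GPU counts, it can help to assemble the command from parameters. The sketch below only uses the flags that appear in the command above; the function name `build_arena_command` and the script path in the usage note are hypothetical, and the reading of `--data-dir` as "host path, colon, container path" is inferred from how the flag is used in this article.

```python
import shlex


def build_arena_command(job_name, selector, host_dir, container_dir, gpus, train_command):
    """Return the arena submit command as an argument list."""
    return [
        "arena", "submit", "tf",
        "--name=" + job_name,
        "--selector=" + selector,
        "--data-dir=" + host_dir + ":" + container_dir,
        "--gpus=" + str(gpus),
        train_command,
    ]


def to_shell(args):
    """Quote each argument so the command can be pasted into a shell."""
    return " ".join(shlex.quote(arg) for arg in args)
```

For example, `to_shell(build_arena_command("alluxio-pytorch", "dataset=mnist", "/alluxio-fuse/data", "/data", 1, "python /example.py"))` yields a single line in which the training command is quoted as one argument.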

3. View the training log through arena

# arena logs --tail=20 alluxio-pytorch
Train Epoch: 12 [49280/60000 (82%)] Loss: 0.021669
Train Epoch: 12 [49920/60000 (83%)] Loss: 0.008180
Train Epoch: 12 [50560/60000 (84%)] Loss: 0.009288
Train Epoch: 12 [51200/60000 (85%)] Loss: 0.035657
Train Epoch: 12 [51840/60000 (86%)] Loss: 0.006190
Train Epoch: 12 [52480/60000 (87%)] Loss: 0.007776
Train Epoch: 12 [53120/60000 (88%)] Loss: 0.001990
Train Epoch: 12 [53760/60000 (90%)] Loss: 0.003609
Train Epoch: 12 [54400/60000 (91%)] Loss: 0.001943
Train Epoch: 12 [55040/60000 (92%)] Loss: 0.078825
Train Epoch: 12 [55680/60000 (93%)] Loss: 0.000925
Train Epoch: 12 [56320/60000 (94%)] Loss: 0.018071
Train Epoch: 12 [56960/60000 (95%)] Loss: 0.031451
Train Epoch: 12 [57600/60000 (96%)] Loss: 0.031353
Train Epoch: 12 [58240/60000 (97%)] Loss: 0.075761
Train Epoch: 12 [58880/60000 (98%)] Loss: 0.003975
Train Epoch: 12 [59520/60000 (99%)] Loss: 0.085389

Test set: Average loss: 0.0256, Accuracy: 9921/10000 (99%)
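Because the per-batch loss jumps around from line to line, it is often easier to judge progress from parsed numbers than from the raw log. Here is a minimal sketch that extracts the fields from log text, assuming the lines follow the "Train Epoch: E [seen/total (P%)] Loss: L" layout printed by the training script; the function name `parse_train_log` is hypothetical.

```python
import re

_TRAIN_LINE = re.compile(
    r"Train Epoch: (\d+) \[(\d+)/(\d+) \((\d+)%\)\]\s+Loss: ([0-9.]+)"
)


def parse_train_log(log_text):
    """Return one dict per training log line found in log_text."""
    records = []
    for match in _TRAIN_LINE.finditer(log_text):
        epoch, seen, total, _percent, loss = match.groups()
        records.append({
            "epoch": int(epoch),
            "seen": int(seen),
            "total": int(total),
            "loss": float(loss),
        })
    return records
```

From the records, something like `sum(r["loss"] for r in records) / len(records)` gives the mean loss over the tail of the log, which is a steadier signal than any single line.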


Previously, running the PyTorch program required users to modify the PyTorch adapter code in order to access data in HDFS. Using Alluxio, we were able to quickly develop and train models without any additional work to modify PyTorch code or manually copy HDFS data. This approach is further simplified by setting up the entire environment within Kubernetes.

