HDFS-Distributed File System

Posted by 大兔子大兔子 on 2020-03-05 05:51:36

HDFS

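The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity hardware. It has much in common with existing distributed file systems, but the differences are also significant. HDFS is highly fault-tolerant and is designed to be deployed on low-cost machines. It provides high-throughput access to data, which makes it well suited to applications with large data sets.
Design idea: divide and conquer. Large files, and large numbers of files, are spread across many servers so that massive data sets can be processed and analyzed in a divide-and-conquer fashion.
Use: HDFS provides data storage for distributed computing frameworks such as MapReduce, Spark, and Tez.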

HDFS Architecture

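HDFS has a master/slave architecture. Each HDFS cluster has a single NameNode, the master server that manages the file system namespace and regulates client access to files. The cluster also contains a number of DataNodes (at least one, usually one per machine), which manage the storage attached to the nodes they run on. HDFS exposes a file system namespace and lets user data be stored in files. Internally, a file is split into one or more blocks, and these blocks are stored across a set of DataNodes. The NameNode executes namespace operations such as opening, closing, and renaming files and directories, and it determines the mapping of blocks to DataNodes. The DataNodes serve read and write requests from clients, and they create, delete, and replicate blocks on instruction from the NameNode.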
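HDFS is written in Java, so it can be deployed across platforms. In the extreme case, the NameNode and the DataNodes can even run on different operating systems.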
(HDFS has a master/slave architecture. An HDFS cluster consists of a single NameNode, a master server that manages the file system namespace and regulates access to files by clients. In addition, there are a number of DataNodes, usually one per node in the cluster, which manage storage attached to the nodes that they run on. HDFS exposes a file system namespace and allows user data to be stored in files. Internally, a file is split into one or more blocks and these blocks are stored in a set of DataNodes. The NameNode executes file system namespace operations like opening, closing, and renaming files and directories. It also determines the mapping of blocks to DataNodes. The DataNodes are responsible for serving read and write requests from the file system’s clients. The DataNodes also perform block creation, deletion, and replication upon instruction from the NameNode.
The NameNode and DataNode are pieces of software designed to run on commodity machines. These machines typically run a GNU/Linux operating system (OS). HDFS is built using the Java language; any machine that supports Java can run the NameNode or the DataNode software. Usage of the highly portable Java language means that HDFS can be deployed on a wide range of machines. A typical deployment has a dedicated machine that runs only the NameNode software. Each of the other machines in the cluster runs one instance of the DataNode software. The architecture does not preclude running multiple DataNodes on the same machine but in a real deployment that is rarely the case)
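The split of responsibilities can be illustrated with a toy model. This is only a sketch, not HDFS code: the class `ToyNameNode`, its methods, and the block and DataNode names are all hypothetical. It shows the NameNode holding metadata only (which blocks make up a file and which DataNodes hold each block), so that a client can ask where to read.

```python
from dataclasses import dataclass, field


@dataclass
class ToyNameNode:
    """Hypothetical model: holds metadata only, never file contents."""
    namespace: dict = field(default_factory=dict)        # path -> ordered block ids
    block_locations: dict = field(default_factory=dict)  # block id -> set of DataNode names

    def add_file(self, path, placements):
        """Register a file; placements is a list of (block_id, [datanode, ...])."""
        self.namespace[path] = [block_id for block_id, _ in placements]
        for block_id, datanodes in placements:
            self.block_locations[block_id] = set(datanodes)

    def locate(self, path):
        """Return [(block_id, sorted DataNode names)] in file order."""
        return [(b, sorted(self.block_locations[b])) for b in self.namespace[path]]


nn = ToyNameNode()
nn.add_file("/logs/app.log", [("blk_1", ["dn1", "dn2", "dn3"]),
                              ("blk_2", ["dn2", "dn3", "dn4"])])
locations = nn.locate("/logs/app.log")
```

In this model the client would then contact the listed DataNodes directly for the block data; the NameNode itself never serves file contents.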

The File System Namespace

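HDFS supports a traditional hierarchical file organization: users can create directories and store files inside them, and the namespace hierarchy resembles that of most existing file systems. Files can be created, removed, moved between directories, and renamed. HDFS supports user quotas and access permissions. It does not support hard links or soft links, although the architecture does not preclude implementing them.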
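HDFS follows the naming conventions of the FileSystem, but some paths and names (for example /.reserved and .snapshot) are reserved. Features such as transparent encryption and snapshots use these reserved paths.
The NameNode maintains the file system namespace, and any change to the namespace or its properties is recorded by the NameNode. An application can specify how many replicas of a file HDFS should maintain; this number is the file's replication factor, and it is stored by the NameNode.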
(HDFS supports a traditional hierarchical file organization. A user or an application can create directories and store files inside these directories. The file system namespace hierarchy is similar to most other existing file systems; one can create and remove files, move a file from one directory to another, or rename a file. HDFS supports user quotas and access permissions. HDFS does not support hard links or soft links. However, the HDFS architecture does not preclude implementing these features.
While HDFS follows naming convention of the FileSystem, some paths and names (e.g. /.reserved and .snapshot ) are reserved. Features such as transparent encryption and snapshot use reserved paths.
The NameNode maintains the file system namespace. Any change to the file system namespace or its properties is recorded by the NameNode. An application can specify the number of replicas of a file that should be maintained by HDFS. The number of copies of a file is called the replication factor of that file. This information is stored by the NameNode.)
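As a small illustration of reserved names, the sketch below checks whether a path uses one of the two names mentioned above. The exact rules are an assumption made for the example: it treats /.reserved as reserved only at the root and .snapshot as reserved anywhere in a path. The function name is hypothetical, and real HDFS validation may differ in detail.

```python
RESERVED_ROOT = ".reserved"   # assumed reserved only as the first component (/.reserved)
RESERVED_NAME = ".snapshot"   # assumed reserved as any path component


def uses_reserved_name(path):
    """Return True if the path touches one of the reserved names (toy rules)."""
    parts = [p for p in path.split("/") if p]
    if parts and parts[0] == RESERVED_ROOT:
        return True
    return RESERVED_NAME in parts


checks = {
    "/user/alice/data.csv": uses_reserved_name("/user/alice/data.csv"),
    "/.reserved/raw/x": uses_reserved_name("/.reserved/raw/x"),
    "/user/alice/.snapshot/s1": uses_reserved_name("/user/alice/.snapshot/s1"),
}
```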

Data Replication

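HDFS is designed to store very large files reliably across the machines of a large cluster. It stores each file as a sequence of blocks, and the blocks are replicated for fault tolerance. Block size and replication factor are configurable per file.
All blocks of a file except the last one are the same size. Since support for variable-length blocks was added to append and hsync, a writer can also start a new block without first filling the last block up to the configured block size.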
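The fixed-size case can be sketched in a few lines. The function name is hypothetical, and the 128 MB block size is only an illustrative value chosen for the example, not something stated in the text. The sketch ignores the variable-length blocks that append and hsync can produce.

```python
def split_into_blocks(file_size, block_size):
    """Return the sizes of the blocks a file of file_size bytes occupies.

    Every block is block_size bytes except possibly the last one.
    """
    if file_size < 0 or block_size <= 0:
        raise ValueError("file_size must be >= 0 and block_size must be > 0")
    full_blocks, remainder = divmod(file_size, block_size)
    return [block_size] * full_blocks + ([remainder] if remainder else [])


MB = 1024 * 1024
blocks = split_into_blocks(300 * MB, 128 * MB)  # two full blocks plus a 44 MB tail
```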
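An application can specify the number of replicas of a file. The replication factor is set when the file is created and can be changed later. Files in HDFS are write-once (except for appends and truncates) and have strictly one writer at any time.
The NameNode makes all decisions about block replication. It periodically receives a Heartbeat and a Blockreport from each DataNode in the cluster. Receiving a Heartbeat implies that the DataNode is functioning properly; a Blockreport lists all the blocks on that DataNode.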
(HDFS is designed to reliably store very large files across machines in a large cluster. It stores each file as a sequence of blocks. The blocks of a file are replicated for fault tolerance. The block size and replication factor are configurable per file.
All blocks in a file except the last block are the same size, while users can start a new block without filling out the last block to the configured block size after the support for variable length block was added to append and hsync.
An application can specify the number of replicas of a file. The replication factor can be specified at file creation time and can be changed later. Files in HDFS are write-once (except for appends and truncates) and have strictly one writer at any time.
The NameNode makes all decisions regarding replication of blocks. It periodically receives a Heartbeat and a Blockreport from each of the DataNodes in the cluster. Receipt of a Heartbeat implies that the DataNode is functioning properly. A Blockreport contains a list of all blocks on a DataNode.)
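The Heartbeat and Blockreport bookkeeping might look roughly like the following toy tracker. Everything here is hypothetical (the class `HeartbeatTracker`, its method names, and the 30-second timeout); it only illustrates how a NameNode could combine the two message types to count live replicas of a block.

```python
class HeartbeatTracker:
    """Toy model of the NameNode's view of DataNode liveness and block holdings."""

    def __init__(self, heartbeat_timeout):
        self.heartbeat_timeout = heartbeat_timeout  # seconds; hypothetical value
        self.last_heartbeat = {}    # DataNode name -> time of last Heartbeat
        self.reported_blocks = {}   # DataNode name -> set of block ids it reported

    def heartbeat(self, datanode, now):
        self.last_heartbeat[datanode] = now

    def block_report(self, datanode, block_ids):
        self.reported_blocks[datanode] = set(block_ids)

    def live_datanodes(self, now):
        """DataNodes whose last Heartbeat is recent enough."""
        return sorted(dn for dn, seen in self.last_heartbeat.items()
                      if now - seen <= self.heartbeat_timeout)

    def replica_count(self, block_id, now):
        """Count replicas of a block that sit on live DataNodes only."""
        return sum(1 for dn in self.live_datanodes(now)
                   if block_id in self.reported_blocks.get(dn, set()))


tracker = HeartbeatTracker(heartbeat_timeout=30)
tracker.heartbeat("dn1", now=0)
tracker.heartbeat("dn2", now=0)
tracker.block_report("dn1", ["blk_1"])
tracker.block_report("dn2", ["blk_1", "blk_2"])
tracker.heartbeat("dn2", now=40)            # dn1 stays silent after time 0
live = tracker.live_datanodes(now=50)
replicas_of_blk_1 = tracker.replica_count("blk_1", now=50)
```

Once dn1 stops sending Heartbeats, its copy of blk_1 no longer counts, which is the kind of situation in which the NameNode would schedule re-replication.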

Replica Placement: The First Baby Steps

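Replica placement is critical to HDFS reliability and performance, and optimizing it is what distinguishes HDFS from most other distributed file systems. It is a feature that needs a lot of tuning and experience. The purpose of a rack-aware placement policy is to improve data reliability, availability, and network bandwidth utilization. The current implementation is a first effort in this direction. Its short-term goals are to validate the policy on production systems, learn more about its behavior, and build a foundation for testing and researching more sophisticated policies.
Large HDFS instances run on clusters of machines that commonly spread across many racks. Communication between two nodes in different racks has to go through switches. In most cases, network bandwidth between machines in the same rack is greater than bandwidth between machines in different racks.
The NameNode determines the rack ID of each DataNode through the process outlined in Hadoop Rack Awareness. A simple but non-optimal policy is to place each replica on a unique rack. This prevents data loss when an entire rack fails and allows reads to use bandwidth from multiple racks. It also distributes replicas evenly across the cluster, which makes it easy to balance load when a component fails. However, it increases the cost of writes, because every write has to transfer blocks to multiple racks.
In the common case, with a replication factor of three, HDFS puts one replica on the local machine if the writer is on a DataNode, and otherwise on a random DataNode in the same rack as the writer. It puts a second replica on a node in a different (remote) rack, and the last replica on a different node in that same remote rack.
Rack awareness works by determining whether any two nodes are in the same rack or in different racks, which lets HDFS preserve data reliability while reducing bandwidth consumption and improving read and write performance. The replica placement policy in the following figure is a concrete example of rack awareness:
[Figure: rack-aware replica placement]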
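This policy cuts inter-rack write traffic, which generally improves write performance. The chance of a rack failure is far less than that of a node failure, so the policy does not affect data reliability and availability guarantees. It does, however, reduce the aggregate network bandwidth available when reading data, because a block is placed in only two unique racks rather than three. With this policy, the replicas of a file are not evenly distributed across racks: one third of the replicas are on one node, two thirds are on one rack, and the other third are evenly distributed across the remaining racks. The policy improves write performance without compromising data reliability or read performance.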
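The replication-factor-three rule can be sketched as follows. This is a simplified illustration, not the HDFS implementation: the function name, the `topology` dictionary (rack name to DataNode names), and the error raised when no suitable remote rack exists are all assumptions made for the example, and real HDFS handles that last case with fallbacks rather than an error.

```python
import random


def place_three_replicas(topology, writer_rack, writer_node=None, rng=random):
    """Pick three target DataNodes following the rule described above.

    topology: dict mapping rack name -> list of DataNode names (hypothetical input).
    """
    local_nodes = topology[writer_rack]
    # Replica 1: the writer's own machine if it is a DataNode,
    # otherwise a random DataNode in the writer's rack.
    first = writer_node if writer_node in local_nodes else rng.choice(local_nodes)

    # Replicas 2 and 3: two different nodes on one remote rack.
    remote_racks = [rack for rack, nodes in topology.items()
                    if rack != writer_rack and len(nodes) >= 2]
    if not remote_racks:
        raise ValueError("sketch needs a remote rack with at least two DataNodes")
    remote_rack = rng.choice(remote_racks)
    second, third = rng.sample(topology[remote_rack], 2)
    return [first, second, third]


topology = {
    "rackA": ["dn1", "dn2"],
    "rackB": ["dn3", "dn4", "dn5"],
    "rackC": ["dn6"],
}
targets = place_three_replicas(topology, "rackA", "dn1", random.Random(0))
```

With this topology the first target is always dn1 and the other two are distinct nodes on rackB, so the block ends up on exactly two racks and only one copy crosses the rack boundary from the writer's side.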
If the replication factor is greater than 3, the 4th and subsequent replicas are placed randomly, while the number of replicas per rack is kept below an upper limit (basically (replicas - 1) / racks + 2).
Because the NameNode does not allow a DataNode to hold multiple replicas of the same block, the maximum number of replicas that can be created is the total number of DataNodes at that time.
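A quick way to get a feel for these two limits is to compute them. The sketch below assumes the division in the formula is integer division, which the text does not state explicitly; the function names are hypothetical.

```python
def max_replicas_per_rack(replicas, racks):
    """Upper limit from the text: (replicas - 1) / racks + 2, with integer division assumed."""
    return (replicas - 1) // racks + 2


def achievable_replicas(requested, num_datanodes):
    """At most one replica of a block per DataNode, so the cluster size caps the count."""
    return min(requested, num_datanodes)


limit_3_on_2_racks = max_replicas_per_rack(3, 2)    # (3 - 1) // 2 + 2
limit_10_on_4_racks = max_replicas_per_rack(10, 4)  # (10 - 1) // 4 + 2
capped = achievable_replicas(requested=5, num_datanodes=4)
```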
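Since support for Storage Types and Storage Policies was added to HDFS, the NameNode also takes the storage policy into account when placing replicas, in addition to the rack awareness described above. The NameNode first chooses nodes based on rack awareness, then checks that each candidate node has the storage type required by the policy associated with the file. If a candidate node does not have that storage type, the NameNode looks for another node. If it cannot find enough nodes in the first path, it looks for nodes with fallback storage types in the second path.
The current default replica placement policy described here is a work in progress.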
(The placement of replicas is critical to HDFS reliability and performance. Optimizing replica placement distinguishes HDFS from most other distributed file systems. This is a feature that needs lots of tuning and experience. The purpose of a rack-aware replica placement policy is to improve data reliability, availability, and network bandwidth utilization. The current implementation for the replica placement policy is a first effort in this direction. The short-term goals of implementing this policy are to validate it on production systems, learn more about its behavior, and build a foundation to test and research more sophisticated policies.
Large HDFS instances run on a cluster of computers that commonly spread across many racks. Communication between two nodes in different racks has to go through switches. In most cases, network bandwidth between machines in the same rack is greater than network bandwidth between machines in different racks.
The NameNode determines the rack id each DataNode belongs to via the process outlined in Hadoop Rack Awareness. A simple but non-optimal policy is to place replicas on unique racks. This prevents losing data when an entire rack fails and allows use of bandwidth from multiple racks when reading data. This policy evenly distributes replicas in the cluster which makes it easy to balance load on component failure. However, this policy increases the cost of writes because a write needs to transfer blocks to multiple racks.
For the common case, when the replication factor is three, HDFS’s placement policy is to put one replica on the local machine if the writer is on a datanode, otherwise on a random datanode in the same rack as that of the writer, another replica on a node in a different (remote) rack, and the last on a different node in the same remote rack. This policy cuts the inter-rack write traffic which generally improves write performance. The chance of rack failure is far less than that of node failure; this policy does not impact data reliability and availability guarantees. However, it does reduce the aggregate network bandwidth used when reading data since a block is placed in only two unique racks rather than three. With this policy, the replicas of a file do not evenly distribute across the racks. One third of replicas are on one node, two thirds of replicas are on one rack, and the other third are evenly distributed across the remaining racks. This policy improves write performance without compromising data reliability or read performance.
If the replication factor is greater than 3, the placement of the 4th and following replicas are determined randomly while keeping the number of replicas per rack below the upper limit (which is basically (replicas - 1) / racks + 2).
Because the NameNode does not allow DataNodes to have multiple replicas of the same block, maximum number of replicas created is the total number of DataNodes at that time.
After the support for Storage Types and Storage Policies was added to HDFS, the NameNode takes the policy into account for replica placement in addition to the rack awareness described above. The NameNode chooses nodes based on rack awareness at first, then checks that the candidate node have storage required by the policy associated with the file. If the candidate node does not have the storage type, the NameNode looks for another node. If enough nodes to place replicas can not be found in the first path, the NameNode looks for nodes having fallback storage types in the second path.
The current, default replica placement policy described here is a work in progress.)

HDFS-Write

Write flow

HDFS-Read

Read flow

HDFS-Metadata Management

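[Figure: metadata managed with conventional file management]
If metadata were managed like ordinary files, as in the figure above, every metadata change caused by a file operation would put heavy pressure on disk I/O. HDFS therefore manages metadata with the following strategy:
[Figure: HDFS metadata management]
The directory structure in the metadata is held in memory and periodically serialized to disk. Operations that have not yet been serialized are recorded in a log so that they are not lost. The SecondaryNameNode then merges the log with the serialized image and writes the result back to disk.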
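The image-plus-log scheme can be modeled in miniature. The sketch below is an assumption-laden toy, not the real on-disk format: `ToyMetadataStore`, its `create`/`delete` operations, and the use of JSON as the serialized image are all hypothetical. It shows that replaying the log on top of the last image reproduces the in-memory state, and that a checkpoint folds the log into a new image.

```python
import json


class ToyMetadataStore:
    """In-memory namespace + serialized image + log of not-yet-serialized operations."""

    def __init__(self):
        self.namespace = {}   # in memory: path -> replication factor
        self.image = "{}"     # last serialized snapshot (stands in for the on-disk image)
        self.edit_log = []    # operations recorded since the last snapshot

    @staticmethod
    def _apply_to(namespace, op, path, value):
        if op == "create":
            namespace[path] = value
        elif op == "delete":
            namespace.pop(path, None)

    def apply(self, op, path, value=None):
        if op not in ("create", "delete"):
            raise ValueError(f"unknown operation: {op}")
        self.edit_log.append((op, path, value))   # record the operation first
        self._apply_to(self.namespace, op, path, value)

    def checkpoint(self):
        """Merge the log into the image, as the SecondaryNameNode is described as doing."""
        merged = self.recover()
        self.image = json.dumps(merged, sort_keys=True)
        self.edit_log = []

    def recover(self):
        """Rebuild the namespace from the image plus the remaining log entries."""
        namespace = json.loads(self.image)
        for op, path, value in self.edit_log:
            self._apply_to(namespace, op, path, value)
        return namespace


store = ToyMetadataStore()
store.apply("create", "/a", 3)
store.apply("create", "/b", 2)
store.checkpoint()                 # image now holds /a and /b, log is empty
store.apply("delete", "/a")        # only in memory and in the log
recovered = store.recover()
```

The design point is that each change costs one appended log entry instead of a rewrite of the whole image, while the periodic merge keeps the log short.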

Replica Selection

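To minimize global bandwidth consumption and read latency, HDFS tries to satisfy a read request from the replica that is closest to the reader. If a replica exists on the same rack as the reader node, that replica is preferred. If the HDFS cluster spans multiple data centers, a replica in the local data center is preferred over any remote replica.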
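One way to express "closest replica" is to rank replicas by a simple distance. In the sketch below, locations are (data center, rack, node) tuples and the weights 0, 2, 4, and 6 are illustrative assumptions, as are the function names; only the ordering (same node, then same rack, then same data center, then remote) comes from the description above.

```python
def distance(reader, replica):
    """Toy network distance between two (datacenter, rack, node) locations."""
    if reader == replica:
        return 0          # same node
    if reader[:2] == replica[:2]:
        return 2          # same rack
    if reader[0] == replica[0]:
        return 4          # same data center, different rack
    return 6              # different data center


def pick_replica(reader, replicas):
    """Choose the replica with the smallest distance to the reader."""
    return min(replicas, key=lambda location: distance(reader, location))


reader = ("dc1", "rack1", "node1")
replicas = [
    ("dc2", "rack9", "node7"),
    ("dc1", "rack2", "node4"),
    ("dc1", "rack1", "node3"),
]
chosen = pick_replica(reader, replicas)
```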

SafeMode

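On startup, the NameNode enters a special state called Safemode. While the NameNode is in Safemode, data blocks are not replicated. The NameNode receives Heartbeat and Blockreport messages from the DataNodes; a Blockreport contains the list of data blocks that a DataNode is hosting. Each block has a specified minimum number of replicas, and a block is considered safely replicated once that minimum number of replicas has checked in with the NameNode. After a configurable percentage of safely replicated blocks has checked in with the NameNode (plus an additional 30 seconds), the NameNode exits Safemode. It then determines the list of data blocks, if any, that still have fewer than the specified number of replicas, and replicates those blocks to other DataNodes.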
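The exit condition can be written down as a small check. The sketch below is a simplified model under stated assumptions: the function names, the shape of `replica_counts` (block id to number of replicas reported so far), and the 0.75 threshold used in the example are all hypothetical; only the 30-second extension comes from the text.

```python
def safely_replicated_fraction(replica_counts, min_replicas):
    """Fraction of blocks that have at least min_replicas reported replicas."""
    if not replica_counts:
        return 1.0
    safe = sum(1 for count in replica_counts.values() if count >= min_replicas)
    return safe / len(replica_counts)


def can_leave_safemode(replica_counts, min_replicas, threshold,
                       seconds_since_threshold, extension_seconds=30):
    """True once the threshold is met and the extra waiting period has passed."""
    reached = safely_replicated_fraction(replica_counts, min_replicas) >= threshold
    return reached and seconds_since_threshold >= extension_seconds


def under_replicated(replica_counts, target_replicas):
    """Blocks that still need more copies after Safemode ends."""
    return sorted(block for block, count in replica_counts.items()
                  if count < target_replicas)


counts = {"blk_1": 3, "blk_2": 1, "blk_3": 0, "blk_4": 2}
fraction = safely_replicated_fraction(counts, min_replicas=1)      # 3 of 4 blocks
too_early = can_leave_safemode(counts, 1, 0.75, seconds_since_threshold=10)
ready = can_leave_safemode(counts, 1, 0.75, seconds_since_threshold=30)
to_replicate = under_replicated(counts, target_replicas=3)
```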
