
When a dataset outgrows the storage capacity of a single physical machine, it becomes necessary to partition it across a number of separate machines.

Filesystems that manage the storage across a network of machines are called distributed filesystems.

Since they are network-based, all the complications of network programming kick in, making distributed filesystems more complex than regular disk filesystems.

For example, one of the biggest challenges is making the filesystem tolerate node failure without suffering data loss.
Hadoop comes with a distributed filesystem called HDFS, which stands for Hadoop Distributed Filesystem.

1 The Design of HDFS

HDFS is a filesystem designed for storing very large files with streaming data access patterns, running on clusters of commodity hardware.

(1) Very large files

"Very large" in this context means files that are hundreds of MB, GB, or TB in size. There are Hadoop clusters running today that store PB of data.

(2) Streaming data access

HDFS is built around the idea that the most efficient data processing pattern is a write-once, read-many-times pattern. The time to read the whole dataset is more important than the latency in reading the first record.
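To make the streaming pattern concrete, here is a minimal sketch of reading a whole file sequentially through Hadoop's Java FileSystem API. The namenode address (hdfs://namenode:8020) and the file path are hypothetical placeholders; substitute your own cluster's values.

import java.io.InputStream;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class StreamingRead {
    public static void main(String[] args) throws Exception {
        // Hypothetical URI; point this at a real namenode and file.
        String uri = "hdfs://namenode:8020/user/alice/data.txt";
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(uri), conf);
        InputStream in = null;
        try {
            // open() returns a stream; copying it end to end is the
            // sequential, whole-dataset access pattern HDFS favors.
            in = fs.open(new Path(uri));
            IOUtils.copyBytes(in, System.out, 4096, false);
        } finally {
            IOUtils.closeStream(in);
        }
    }
}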

(3) Commodity hardware

Hadoop doesn't require expensive, highly reliable hardware. It is designed to run on clusters of commodity hardware (ordinary hardware available from any vendor), for which the chance of node failure across the cluster is high, at least for large clusters. HDFS is designed to carry on working without a noticeable interruption to the user in the face of such failure.
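This fault tolerance rests largely on block replication: each block of a file is stored on several datanodes (three by default), so the blocks on a failed node can still be read, and re-replicated, from surviving copies. The sketch below, again with a hypothetical namenode URI and file path, inspects and raises a file's replication factor through the same FileSystem API.

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationCheck {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020/"), conf);
        Path file = new Path("/user/alice/data.txt"); // hypothetical file

        // Each block is replicated across datanodes, so the loss of a
        // single node does not lose data.
        FileStatus status = fs.getFileStatus(file);
        System.out.println("replication factor: " + status.getReplication());

        // Ask for extra copies of an especially important file.
        fs.setReplication(file, (short) 5);
    }
}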

These are areas where HDFS is not a good fit today:

(1) Low-latency data access

HDFS is optimized for delivering a high throughput of data, and this may be at the expense of latency. HBase is currently a better choice for low-latency access.

(2) Lots of small files

Because the namenode holds filesystem metadata in memory, the number of files a filesystem can hold is limited by the amount of memory on the namenode. Each file, directory, and block takes about 150 bytes.

If you had one million files, each taking one block, you would need at least 300 MB of memory. Although storing millions of files is feasible, billions is beyond the capability of current hardware: with a huge number of small files, the memory needed just to hold the file metadata exceeds what the namenode can provide.
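The arithmetic behind the 300 MB figure is simple enough to spell out. Assuming roughly 150 bytes per namenode object, one million single-block files amount to two million objects (one file object plus one block object apiece):

public class NamenodeMemoryEstimate {
    public static void main(String[] args) {
        long files = 1_000_000L;      // one million files
        long blocksPerFile = 1L;      // each file fits in one block
        long bytesPerObject = 150L;   // rough per-object namenode cost

        // Every file contributes a file object plus its block objects.
        long objects = files * (1L + blocksPerFile);
        long bytes = objects * bytesPerObject;

        // 2,000,000 objects * 150 bytes = 300,000,000 bytes, i.e. 300 MB.
        System.out.println(bytes / 1_000_000 + " MB of namenode memory");
    }
}

Scale the same estimate to a billion such files and the metadata alone reaches roughly 300 GB of namenode heap, which is why vast numbers of small files are a poor fit.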

(3) Multiple writers, arbitrary file modifications

Files in HDFS may be written to by a single writer. Writes are always made at the end of the file, in append-only fashion. There is no support for multiple writers or for modifications at arbitrary offsets in the file.
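Accordingly, the only way to modify an existing HDFS file is to append to it, and only one client may do so at a time. A minimal sketch using the FileSystem append API follows; the namenode URI and file path are hypothetical, and the cluster must support append (modern HDFS releases do).

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class AppendOnly {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020/"), conf);
        Path log = new Path("/user/alice/events.log"); // hypothetical, must already exist

        // append() is the only mutation HDFS offers: writes land at the
        // end of the file; there is no seek-and-overwrite at an offset.
        try (FSDataOutputStream out = fs.append(log)) {
            out.writeBytes("new record\n");
        }
    }
}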

Extension 1: What is a file?

Extension 2: What is a file system?

Extension 3: How to design a file system

Extension 4: What is metadata?

The following passage, from Chekhov's story "The Man in a Case", describes a woman named Varenka:

(She) is not so young — about thirty — but tall and well-proportioned, with black eyebrows and rosy cheeks; in a word, not a girl but a fruit jelly. She is so lively and noisy, forever singing Little Russian songs and laughing out loud, bursting into peals of ringing laughter: ha-ha-ha!

This paragraph provides several pieces of information: age (around thirty), height (tall), appearance (well-proportioned, black eyebrows, rosy cheeks), and character (lively, noisy, always singing Little Russian songs and laughing loudly). With this information, we can roughly imagine what kind of person Varenka is. By extension, given the same kinds of information about anyone else, we can likewise form a picture of what they are like.

In this example, "age", "height", "appearance", and "character" are metadata, because they are data used to describe other data.

Of course, these pieces of metadata are not precise enough to fully describe a person. Everyone has at some point filled in something like a personal registration form, with name, gender, ethnicity, political affiliation, a one-inch photo, educational background, professional title, and so on; that is a relatively complete set of metadata.

In everyday life, metadata is ubiquitous: for every class of things, a corresponding set of metadata can be defined to describe it.

(The metadata explanation above is quoted from a blog post by Yu Yifeng.)

2 HDFS Concepts

2.1 Blocks
