1, Secondary storage

"Secondary storage" is a definition with respect to the "Primary storage". Just image what you will think at the first glance of "storage", which might be a lot of Servers, disks, cables, etc. If these items are entitled into a single unit, they will be called as a "storage".

The "Primary storage" is the same as "storage" under most scenarios. The "Secondary storage" refers to storage with infreqently accessed data.

For primary storage, the top metrics are IOPS, capacity, and throughput; for secondary storage, these metrics are still important, but they are not the most important ones.

Let's talk about dedupfs, our product, which is a secondary storage. dedupfs has the capability to identify the duplicated parts of the data it stores and to keep only a single copy of each duplicated part, in order to reduce storage usage. To make this happen, a lot of CPU cycles are spent on the algorithm that identifies duplicated parts, and a lot of IOPS are spent maintaining the metadata about those parts. In other words, dedupfs trades a large amount of CPU cycles and IOPS for savings in storage usage.

You can simply treat this process as an encoding/decoding process: dedupfs encodes the data to reduce its volume before storing it, and decodes the data before returning it to the caller.
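To make the encode/decode idea concrete, here is a minimal sketch in Python of a content-hash deduplicating store. The fixed-size chunking and the names (DedupStore, CHUNK_SIZE) are illustrative assumptions, not dedupfs's actual design:

    import hashlib

    CHUNK_SIZE = 4096  # assumed fixed-size chunking, for illustration only

    class DedupStore:
        def __init__(self):
            self.chunks = {}  # hash -> chunk bytes (one copy per unique chunk)
            self.files = {}   # name -> list of chunk hashes (the "recipe")

        def write(self, name, data):
            # "Encode": keep one copy of each unique chunk and record only
            # the sequence of hashes for this file.
            recipe = []
            for i in range(0, len(data), CHUNK_SIZE):
                chunk = data[i:i + CHUNK_SIZE]
                digest = hashlib.sha256(chunk).hexdigest()  # the CPU cost
                self.chunks.setdefault(digest, chunk)       # the IOPS cost
                recipe.append(digest)
            self.files[name] = recipe

        def read(self, name):
            # "Decode": reassemble the file from its recipe.
            return b"".join(self.chunks[h] for h in self.files[name])

    store = DedupStore()
    store.write("a.img", b"hello world" * 1000)
    store.write("b.img", b"hello world" * 1000)  # identical content
    assert store.read("b.img") == b"hello world" * 1000
    print(len(store.chunks), "unique chunks stored for two identical files")

The second write adds no new chunks, which is exactly where the storage savings come from.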

2, Data Deduplication

Data deduplication is a technique for reducing data volume, and there are different variants of this technique: deduplicate data at the source, or deduplicate data at the target/storage.

Just imagine such a scenario: you have 10 virtual machines running Red Hat Linux, and each virtual machine has 1 virtual disk of 20 GB. Now you have 10 virtual disks, so the total storage usage will be 200 GB; let's suppose the data shrinks to 40 GB after deduplication.
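As a quick sanity check on those numbers (the 40 GB figure is just the assumption made in the scenario above):

    vm_count = 10
    disk_gb = 20
    logical_gb = vm_count * disk_gb   # 200 GB written by the VMs
    physical_gb = 40                  # supposed size after deduplication
    print(logical_gb / physical_gb, ": 1 dedup ratio")              # 5.0 : 1
    print(1 - physical_gb / logical_gb, "fraction of space saved")  # 0.8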

If deduplication happens on the source side, CPU cycles are spent on the source side and only 40 GB is transferred over the network to the secondary storage (a sketch of this flow follows the lists below).

Advantage:
Network traffic is reduced from 200 GB to 40 GB;

Disadvantages:
The secondary storage works in client/server mode: a client/agent/daemon must be installed and configured on the source side before use;
Identifying duplicated data is a CPU-intensive task, so the client's computing resources will be occupied;
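A minimal sketch of the source-side variant, assuming a hypothetical has_chunk/put_chunk/put_recipe protocol on the storage server (not dedupfs's actual wire protocol):

    import hashlib

    CHUNK_SIZE = 4096

    class InMemoryServer:
        # Stand-in for the secondary storage; has_chunk/put_chunk/put_recipe
        # are hypothetical names, for illustration only.
        def __init__(self):
            self.chunks, self.recipes, self.bytes_received = {}, [], 0

        def has_chunk(self, digest):
            return digest in self.chunks

        def put_chunk(self, digest, chunk):
            self.chunks[digest] = chunk
            self.bytes_received += len(chunk)

        def put_recipe(self, recipe):
            self.recipes.append(recipe)

    def backup_source_side(data, server):
        # Source-side dedup: hash each chunk locally (client CPU), ask the
        # server whether it already has it, and upload only new chunks, so
        # duplicate data never crosses the network.
        recipe = []
        for i in range(0, len(data), CHUNK_SIZE):
            chunk = data[i:i + CHUNK_SIZE]
            digest = hashlib.sha256(chunk).hexdigest()
            if not server.has_chunk(digest):
                server.put_chunk(digest, chunk)  # only unique bytes travel
            recipe.append(digest)
        server.put_recipe(recipe)                # tiny compared to the data

    server = InMemoryServer()
    backup_source_side(b"same disk image" * 5000, server)
    backup_source_side(b"same disk image" * 5000, server)  # a second, identical VM
    print(server.bytes_received, "bytes crossed the 'network' for two backups")

The second backup transfers no chunk data at all, which is how 200 GB of logical data can shrink to 40 GB on the wire.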

If deduplication happens on the secondary storage side, CPU cycles are spent on the storage side and the full 200 GB is transferred over the network to the secondary storage (a sketch follows the lists below).

Advantages:
The secondary storage acts like a remote file system, or a remote disk in the cloud; no client/agent/daemon/application is required to be installed and configured on the source side;
No computing resources are consumed on the source side; everything is done on the secondary storage side;

Disadvantage:
Network traffic is 200 GB.
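For contrast, a minimal sketch of the target-side variant; backup_target_side and the write call are illustrative, and the storage object could be the DedupStore sketched in section 1:

    def backup_target_side(data, storage):
        # Target-side dedup: the source streams every raw byte over the
        # network (all 200 GB in the example above); the storage side does
        # the chunking and hashing itself, so no agent or CPU is needed on
        # the source.
        storage.write("vm-disk.img", data)  # full traffic, zero source-side CPU

    # Usage, reusing the DedupStore sketched in section 1:
    #   store = DedupStore()
    #   backup_target_side(b"same disk image" * 5000, store)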