rvarun11/google-file-system

google-file-system

This project is based on the paper "The Google File System".

Project Demo - https://youtu.be/LDqfd4PvoiQ

Index - Changelog | Dependency | Commands | Architecture

Changelog

  • Added Remote Procedure Calls with rpyc
  • Added replication factor
  • Added Create, Read, Append, Delete & List operations

Dependency

  • rpyc

Commands

  • Create: python client.py create <file_name>

  • Read: python client.py read <file_name> <data>

  • Append: python client.py append <file_name> <string>

  • Delete: python client.py delete <file_name>

  • List: python client.py list
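The commands above map naturally onto subcommands of a single script. As a rough sketch only (this is not the project's actual client.py; `build_parser` and the argument names are hypothetical and simply mirror the list above), the command line could be parsed with `argparse` like this:

```python
import argparse


def build_parser():
    """Build a parser with one subcommand per operation (hypothetical layout)."""
    parser = argparse.ArgumentParser(prog="client.py")
    commands = parser.add_subparsers(dest="command", required=True)

    create = commands.add_parser("create")
    create.add_argument("file_name")

    # Arguments mirror the README entry: read <file_name> <data>
    read = commands.add_parser("read")
    read.add_argument("file_name")
    read.add_argument("data")

    append = commands.add_parser("append")
    append.add_argument("file_name")
    append.add_argument("string")

    delete = commands.add_parser("delete")
    delete.add_argument("file_name")

    commands.add_parser("list")
    return parser


# Parse an explicit argument list instead of sys.argv so the sketch is easy to try.
args = build_parser().parse_args(["append", "notes.txt", "hello"])
print(args.command, args.file_name, args.string)
```

Each parsed command would then be dispatched to the matching client operation.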

Architecture

The following architecture is based on the design goals mentioned below.

GFS Architecture

Why is Big Storage hard?

  1. To improve performance, large systems require sharding (splitting data across many servers).
  2. With many servers, faults become frequent.
  3. To improve fault tolerance, we need replication.
  4. Replication leads to inconsistencies between copies.
  5. To restore consistency, we often need a clever design in which the clients and servers do more work, which lowers performance.

Goal for the project

To understand:

  1. How distributed file systems work, by building a simple fault-tolerant GFS.
  2. How RPCs work.

Design Goals & Assumptions

  1. To build a client, a master server and three chunk servers with a design similar to GFS.

  2. Master

    1. It will only store the metadata (the mapping of files to their respective chunks), and it has to be persistent.
    2. The chunk size will be 8 bytes and the replication factor is set to 2.
    3. Instead of having only two tables, I'll be dividing the same logic into 3 tables, namely file_table, handle_table and chunk_servers, for simplicity.
    4. I'm also assuming that the master server will always be available, since the system cannot function without it. To make the system more fault tolerant and deal with master server failure, a shadow copy of the master would have to be created, which is beyond the scope of this project.
  3. Client

    1. Provide Create, Read, Append, Delete and List operations.
    2. All the heavy lifting and logic will reside in the client (creating chunks, managing connections, etc.)
    3. Since the chunk size is only 8 bytes, we'll be reading an entire chunk at a time instead of specifying a byte range to be read, as is done in the actual GFS.
  4. Chunk Server

    1. The chunk servers will be naive (no periodic heartbeats) and they'll only be used for reading data from and writing data to disk.
    2. Data will not be cached.
    3. Their location will be stored beforehand and will be accessible by the master.
    4. All replicas are given equal priority.
    5. Ability to handle faults by completing the operation in progress even if one chunk server goes down.
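To make the design goals above concrete, here is a minimal in-memory sketch of how the pieces could fit together. It is an illustration under stated assumptions, not the project's code: the `Master` and `ChunkServer` classes, the round-robin replica placement, and the `append`/`read` helpers are all hypothetical, there is no RPC or disk I/O, and every append allocates fresh chunks instead of filling the last partial one. Only the table names (file_table, handle_table, chunk_servers), the 8-byte chunk size and the replication factor of 2 come from the goals above.

```python
import itertools
import uuid

CHUNK_SIZE = 8          # bytes, from the design goals
REPLICATION_FACTOR = 2  # copies of each chunk, from the design goals


class Master:
    """Holds metadata only; chunk data never passes through the master."""

    def __init__(self, chunk_server_ids):
        self.file_table = {}    # file name -> ordered list of chunk handles
        self.handle_table = {}  # chunk handle -> ids of servers holding a replica
        self.chunk_servers = list(chunk_server_ids)
        # Assumption: simple round-robin placement over the known servers.
        self._placement = itertools.cycle(self.chunk_servers)

    def allocate_chunk(self, file_name):
        handle = uuid.uuid4().hex
        replicas = [next(self._placement) for _ in range(REPLICATION_FACTOR)]
        self.file_table.setdefault(file_name, []).append(handle)
        self.handle_table[handle] = replicas
        return handle, replicas


class ChunkServer:
    """Naive store: keeps chunks in a dict and can be switched off."""

    def __init__(self):
        self.chunks = {}
        self.alive = True

    def write(self, handle, data):
        if not self.alive:
            raise ConnectionError("chunk server is down")
        self.chunks[handle] = data

    def read(self, handle):
        if not self.alive:
            raise ConnectionError("chunk server is down")
        return self.chunks[handle]


def split_into_chunks(data):
    """Client-side chunking: cut the payload into CHUNK_SIZE-byte pieces."""
    return [data[i:i + CHUNK_SIZE] for i in range(0, len(data), CHUNK_SIZE)]


def append(master, servers, file_name, data):
    for piece in split_into_chunks(data):
        handle, replicas = master.allocate_chunk(file_name)
        written = 0
        for server_id in replicas:
            try:
                servers[server_id].write(handle, piece)
                written += 1
            except ConnectionError:
                continue  # keep going as long as another replica succeeds
        if written == 0:
            raise IOError(f"no replica accepted chunk {handle}")


def read(master, servers, file_name):
    result = b""
    for handle in master.file_table[file_name]:
        for server_id in master.handle_table[handle]:
            try:
                result += servers[server_id].read(handle)  # whole chunk at once
                break
            except ConnectionError:
                continue  # all replicas are equal, so try the next one
        else:
            raise IOError(f"all replicas of chunk {handle} are unavailable")
    return result


servers = {name: ChunkServer() for name in ("cs1", "cs2", "cs3")}
master = Master(servers)
append(master, servers, "demo.txt", b"hello distributed world")
servers["cs1"].alive = False  # simulate one chunk server going down
print(read(master, servers, "demo.txt"))
```

With three servers and a replication factor of 2, every chunk has a replica on a surviving server when any single server fails, so the read still completes.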

Language Specifications

Due to time constraints, simplicity was favoured over a complex and more apt design, as in the actual GFS. So, I decided to go with Python and RPyC instead of using something like C++ and gRPC which would've been better for learning.

Future Work

Dynamic Chunk Servers: Right now the chunk servers are naive and have to be manually configured before they can be used. This can be improved by having each chunk server establish a connection with the master, which then adds the server's URI to the chunk server table. This would allow additional features, such as a heartbeat monitor, to be built.
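A rough sketch of what this could look like on the master side is shown below. Nothing here exists in the project yet: `ChunkServerRegistry`, its methods, the timeout value and the example URIs are all hypothetical, and a real version would receive these calls over RPC rather than as local method calls.

```python
import time


class ChunkServerRegistry:
    """Tracks chunk servers that registered themselves and when they were last seen."""

    def __init__(self, timeout_seconds=5.0, clock=time.monotonic):
        self.timeout_seconds = timeout_seconds
        self._clock = clock          # injectable so the sketch can be tried without sleeping
        self.chunk_servers = {}      # URI -> time of the last registration or heartbeat

    def register(self, uri):
        """Called by a chunk server when it starts up."""
        self.chunk_servers[uri] = self._clock()

    def heartbeat(self, uri):
        """Called periodically by a registered chunk server."""
        if uri not in self.chunk_servers:
            raise KeyError(f"unknown chunk server: {uri}")
        self.chunk_servers[uri] = self._clock()

    def live_servers(self):
        """Servers heard from within the timeout window."""
        now = self._clock()
        return sorted(
            uri for uri, last_seen in self.chunk_servers.items()
            if now - last_seen <= self.timeout_seconds
        )


# Drive the registry with a fake clock instead of real time.
fake_now = [0.0]
registry = ChunkServerRegistry(timeout_seconds=5.0, clock=lambda: fake_now[0])
registry.register("localhost:8888")  # hypothetical URIs
registry.register("localhost:8889")

fake_now[0] = 4.0
registry.heartbeat("localhost:8888")

fake_now[0] = 7.0
print(registry.live_servers())
```

In this run only the first server sent a heartbeat inside the timeout window, so it is the only one the master would still hand out to clients.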
