Skip to content

Latest commit

 

History

History
680 lines (471 loc) · 24.2 KB

README.md

File metadata and controls

680 lines (471 loc) · 24.2 KB

HamsterDB

Introduction

HamsterDB is a column-store database built as a capstone project for the Harvard Undergraduate data systems course CS 165. It was designed to efficiently store and query large amounts of data.

As a column store database, it is optimized for aggrigate operations over large rows. And it has the following features.

  1. Basic database operations:
  • Inserting, Bulk loading, Selecting, and Fetching data
  • Aggregate operations like sum, add, sub, avg
  1. Threaded search:
  • Improved performance of select and fetch operations through the use of
    • shared scans and
    • parallelization
  1. Indexing:
  • Btrees and External sorts for improved searches
  • Multiclusterd Indexs, with data duplication
  • Unclustered Indexs
  1. Joins:
  • Nested loop joins
  • Hash joins

File organization

The client.c file contains the source code for the client side of the database application. This is responsible for handling socket connections with the server.

The server.c file contains the source code for the server side of the database application.

The include directory contains header files for the project, including declarations of functions and data structures used throughout the project.

The Join directory contains files related to the implementation of the join operation in the database.

The Parser directory contains files related to the parsing of queries and commands for the database. There are multiple parsers realted for each sql queries. The names of each file is indicative of the type of parser.

The Serializer directory contains files related to the serialization and deserialization of data for storage and retrieval in the database.

The Create directory contains files related to the implementation of the create operation in the database.

The Indexing directory contains files related to the implementation of indexing in the database.

The Makefile is a file used to build the project. It contains instructions for the build system on how to compile and link the various source files and libraries into the final executable.

The primes.long file contains a list of prime numbers that may be used by the database for various purposes.

The Datastructures directory contains files defining the data structures used in the database, such as HashTables, Queues, LinkedLists, and BTrees.

The Others directory contains miscellaneous files that do not fit into any of the other categories.

The Parallelization directory contains files related to the implementation of parallelization in the database.

The Printer directory contains files related to the printing of data from the database.

The Utils directory contains utility functions used throughout the project.

The Engine directory contains files related to the database engine, including command execution.

The Insert directory contains files related to the implementation of the insert operation in the database, including insert and bulk loading.

The Select directory contains files related to the implementation of the select operation in the database.

The Varpool directory contains files related to the management of variables in the database. The varpool uses a hashtable to keep track of intermidiary variables that are assigned by the client.

Installation and Usage

First download this repository. This project is tested on a linux environment with the above docker file. But as long us you have the development tools setup any linux system will do.

git clone [email protected]:hileamlakB/HamsterDB.git

You can the go to the src directory with

cd src

Run the make file with

make

You will then have two main executables. client and server.

You can then frist run server and then on another terminal client. Or server on the background and client in the foreground.

Queries

You can play with the database by loading data and running different queries. The query uses a domain specfic langauge, taken from the CS 165 webiste, described below.

HamsterDB Query(taken from CS165)
  • Keywords are unqualified text and symbols. For example: create, tbl, col etc. (These are static words that you can use to parse the instructions. They will appear in the same order and location in the string).
  • Items that appear in brackets are required but indicate good opportunities for extensions, or relate to one of the extra features found in the project description. For example, 'create(idx,<col_name>,[btree])' means that you must support creating at least B-tree indexes, but you may want to also support additional indexes like zone maps or hash maps.
  • Variables appear in-between angle brackets. They are strings that appear in 165L and are either identifiers like the name of a table or are labels that the system must carry through the execution of the commands (this will become more clear through the examples).
  • End of line indicates next instruction (but your design can buffer or parse multiple lines at a time as you see fit).
  • Comments are marked with '--' and continue until the end of the line.


  • Creating new database objects

    create(<object_type>,<object_name>,<parameter1>,<parameter2>,...)

    The create function creates new structures in the system. The possible structures are databases, tables, columns, and indexes. It does not return anything. Below you can see all possible instances.

    create(db,<db_name>)
    create(tbl,<t_name>,<db_name>,<col_cnt>)
    create(col,<col_name>,<tbl_var)
    create(idx,<col_name>,[btree, sorted], [clustered, unclustered])
    

    Usage

    create(db,"awesomebase") -- create db with name "awesomebase"
    create(tbl,"grades",awesomebase,6) -- create table "grades" with 6 columns in the "awesomebase"
    create(col,"project",awesomebase.grades) -- create column 1 with name "project"
    create(col,"midterm1",awesomebase.grades) -- create column 2 with name "midterm1"
    create(col,"midterm2",awesomebase.grades) -- create column 3 with name "midterm2"
    create(col,"class",awesomebase.grades) -- create column 4 with name "class"
    create(col,"quizzes",awesomebase.grades) -- create column 5 with name "quizzes"
    create(col,"student_id",awesomebase.grades) -- create column 6 with name "student_id"
    

    SQL Example

    CREATE DATABASE awesomebase
    CREATE TABLE grades (grades int, project int, midterm1 int, midterm2 int, class int, quizzes int, student_id int)
    

    In the create table statement, the first value of a parameter is the column name and the second parameter is its type. VARCHAR(n), BINARY(n), BIGINT, and TIMESTAMP are examples of other SQL data types.


    Loading

    load(<filename>)

    This function loads values from a file. Both absolute and relative paths should be supported. The columns within the file are assigned names that correspond to already created database objects. This filename should be a file on the client's side and the client should pass the data in this file to the server for loading.

    Parameters

    <filename>: The name of the file to load the database from.

    Return Value

    None.

    Usage


    load("/path/to/myfile.txt")
    -- or relative path
    load("./data/myfile.txt")
    

    File format

    Input data will be provided as ASCII-encoded CSV files. For example:
    foo.t1.a,foo.t1.b
    10,-23
    -22,910

    This file would insert two rows into columns 'a' and 'b' of table 't1' in database 'foo'.

    SQL Example

    There is no direct correlate in SQL to the load command. That being said, almost all vendors have commands to load a file into a table. The MySQL version would be:
     LOAD DATA INFILE myfile.txt 

    Inserting rows into a table

    The system will support relational, that is, row-wise (one row at a time) inserts:



    relational_insert(<tbl_var>,[INT1],[INT2],...)

    Parameters

    <col_var>: A fully qualified column name.
    <tbl_var>: A fully qualified table name.
    INT/INTk: The value to be inserted (32 bit signed).

    Return Value

    None.

    Usage


    relational_insert(awesomebase.grades,107,80,75,95,93,1) 
    

    SQL Example

    There are two different insert statements in SQL. In the first statement below, the column names are omitted and the values are inserted into the columns of the table in the order those columns were declared in table creation. In the second statement, column names are included and the values in the insert statement are put in the corresponding given column. The two statements below perform the same action.
    INSERT INTO grades VALUES (107,80,75,95,93,1)
    INSERT INTO grades (midterm1, project, midterm2, class, quizzes, student_id) VALUES (80,107,75,95,93,1)
    

    Selecting values

    There are two kinds of select commands.


    Select from within a column:

    <vec_pos>=select(<col_name>,<low>,<high>)

    Parameters

    <col_name>: A fully qualified column name or an intermediate vector of values
    <low>: The lowest qualifying value in the range.
    <high>: The exclusive upper bound.
    null: specifies an infinite upper or lower bound.


    Select from pre-selected positions of a vector of values:

    <vec_pos>=select(<posn_vec>,<val_vec>,<low>,<high>)

    Parameters

    <posn_vec>: A vector of positions
    <val_vec>: A vector of values.
    <low>: The lowest qualifying value in the range.
    <high>: The exclusive upper bound.
    null: specifies an infinite upper or lower bound.

    Return Value

    <vec_pos>: A vector of qualifying positions.

    Usage


    -- select
    pos_1=select(awesomebase.grades.project,90,100) -- Find the rows with project score between 90 and 99
    pos_2=select(awesomebase.grades.project,90,null) -- Find the rows with project greater or equal to 90

    SQL Example

    SELECT student_id FROM grades WHERE midterm1 > 90 AND midterm2 > 90

    In the statement above, we might select on midterm1 using the first select, then select on midterm2 using the second type of select.

    Fetching values

    This function collects the values from a column at given positions.



    <vec_val>=fetch(<col_var>,<vec_pos>)

    Parameters

    <col_var>: A fully qualified column name.
    <vec_pos>: A vector of positions that qualify (as returned by select or join).

    Return Value

    <vec_val>: A vector of qualifying values.

    Usage


    a_plus=select(awesomebase.grades.project,100,null) -- Find the rows with project greater or equal to 100
    ids_of_top_students=fetch(awesomebase.grades.student_id,a_plus) -- Return student id of the qualifying rows
    

    SQL Example

    The fetch command would be an internal operation at the end of a SQL query. For example, using our last query:
     SELECT student_id FROM grades WHERE midterm1 > 90 AND midterm2 > 90 

    The last part of this query after the two WHERE clauses had been evaluated would use a fetch on column student_id.

    Deleting rows

    Row deletions happen using the relational_delete operation. It will internally issue multiple separate column deletes.



    relational_delete(<tbl_var>,<vec_pos>)

    Parameters

    <tbl_var>: A fully qualified table name.
    <vec_pos>: A vector of positions.

    Return Value

    None.

    Usage


    low_project=select(awesomebase.grades.project,0,10) -- Find the rows with project less than 10
    relational_delete(awesomebase.grades,low_project) -- Clearly this is a mistake!!
    

    SQL Example

    DELETE FROM grades WHERE midterm1 < 40 AND midterm2 < 40 

    Joining columns

    This function performs a join between two inputs, given both the values and respective positions of each input. We expect at least a hash and nested-loop join to be implemented, but implementing others (such as sort-merge) is a possibility.



    <vec_pos1_out>,<vec_pos2_out>=join(<vec_val1>,<vec_pos1>,<vec_val2>,<vec_pos2>, [hash,nested-loop,...])

    Parameters

    <vec_val_1>: The vector of values 1.
    <vec_pos_1>: The vector of positions 1.
    <vec_val_2>: The vector of values 2.
    <vec_pos_2>: The vector of positions 2.
    <type>: The type of join (i.e. hash, sort-merge)

    NOTE: There is no explicit indication which is the smaller relation. Why this matters will become apparent when we discuss joins.

    Return Value

    <vec_pos1_out>,<vec_pos2_out>: The concatenation of the positions in each input table of the resulting join.

    Usage


    positions1=select(awesomebase.cs165.project_score,100,null) -- select positions where project score >= 100 in cs165
    positions2=select(awesomebase.cs265.project_score,100,null) -- select positions where project score >= 100 in cs265
    values1=fetch(awesomebase.cs165.student_id,positions1)
    values2=fetch(awesomebase.cs265.student_id,positions2)
    r1, r2 = join(values1,positions1,values2,positions2,hash) -- positions of students who have project score >= 100 in both classes
    student_ids = fetch(awesomebase.cs165.student_id, r1)
    print(student_ids)
    

    SQL Example

    SELECT student_id FROM cs165_grades JOIN cs265_grades 
    WHERE cs165_grades.project > 100 
    AND cs165_grades.project > 100 
    AND cs165_grades.student_id = cs265_grades.student_id

    Min aggregate

    There are two kinds of min aggregate commands.

    <min_val>=min(<vec_val>)

    The first min aggregation signature returns the minimum value of the values held in <vec_val>.

    Parameters

    <vec_val>: A vector of values to search for the min OR a fully qualified name.

    Return Value

    <min_val>: The minimum value of the input <vec_val>.




    The second min aggregation signature returns the minimum value and the corresponding position(s) (as held in <vec_pos>) from the values in <vec_val>.

    <min_pos>,<min_val>=min(<vec_pos>,<vec_val>)

    Parameters

    <vec_pos>: A vector of positions corresponding to the values in <vec_val>.
    <vec_val>: A vector of values to search for the min OR a fully qualified name.
    Note: When null is specified as the first input of the function, it returns the position of the min from the <vec_val> array.

    Return Value

    <min_pos>: The position (as defined in <vec_pos>) of the min.
    <min_val>: The minimum value of the input <vec_val>.

    Usage


    positions1=select(awesomebase.grades.project,100,null) -- select students with project more than or equal to 100
    values1=fetch(awesomebase.grades.midterm1,positions1)
    -- used here
    min1=min(values1) -- the lowest midterm1 grade for students who got 100 or more in their project
    

    SQL Example

    SELECT min(midterm1) FROM grades WHERE project >= 100 

    Max aggregate

    There are two kinds of max aggregate commands.

    <max_val>=max(<vec_val>)

    The first max aggregation signature returns the maximum value of the values held in <vec_val>.

    Parameters

    <vec_val>: A vector of values to search for the max OR a fully qualified name.

    Return Value

    <max_val>: The maximum value of the input <vec_val>.




    The second max aggregation signaturereturns the maximum value and the corresponding position(s) (as held in <vec_pos>) from the values in <vec_val>. <max_pos>,<max_vals>=max(<vec_pos>,<vec_val>)

    Parameters

    <vec_pos>: A vector of positions corresponding to the values in <vec_val>.
    <vec_val>: A vector of values to search for the max OR a fully qualified name.
    Note: When null is specified as the first input of the function, it returns the position of the max from the <vec_val> array.

    Return Value

    <max_pos>: The position (as defined in <vec_pos>) of the max.
    <max_val>: The maximum value of the input <vec_val>.

    Usage


    positions1=select(awesomebase.grades.midterm1,null,90) -- select students with midterm less than 90
    values1=fetch(awesomebase.grades.project,positions1)
    -- used here
    max1=max(values1) -- get the maximum project grade for students with midterm less than 90
    

    SQL Example

    SELECT MAX(project) FROM grades WHERE midterm1 < 90 

    Sum aggregate

    <scl_val>=sum(<vec_val>)

    This is the aggregation function sum. It returns the sum of the values in <vec_val>.

    Parameters

    <vec_val>: A vector of values.

    Return Value

    <scl_val>: The scalar value of the sum.

    Usage


    positions1=select(awesomebase.grades.project,100,null) -- select students with project more than or equal to 100
    values1=fetch(awesomebase.grades.quizzes,positions1)
    -- used here
    sum_quizzes=sum(values1) -- get the sum of the quizzes grade for students with project more than or equal to 100
    

    SQL Example

    SELECT SUM(quizzes) FROM grades WHERE project>=100

    Average aggregate

    <scl_val>=avg(<vec_val>)

    This is the aggregation function average. It returns the arithmetic mean of the values in <vec_val>.

    Parameters

    <vec_val>: A vector of values.

    Return Value

    <scl_val>: The scalar value of the average. For the average operator, in staff automated grading we expect your system to provide 2 places of decimal precision (e.g. 0.00).

    Usage


    positions1=select(awesomebase.grades.project,100,null) -- select students with project more than or equal to 100
    values1=fetch(awesomebase.grades.quizzes,positions1)
    -- used here
    avg_quizzes=avg(values1) -- get the average quizzes grade for students with project more than or equal to 100
    

    SQL Example

    SELECT AVG(quizzes) FROM grades WHERE project>=100

    Adding two vectors

    <vec_val>=add(<vec_val1>,<vec_val2>)

    This function adds two vectors of values.

    Parameters

    <vec_val1>: The vector of values 1.
    <vec_val2>: The vector of values 2.

    Return Value

    <vec_val>: A vector of values equal to the component-wise addition of the two inputs.

    Usage


    midterms=add(awesomebase.grades.midterm1,awesomebase.grades.midterm2)
    

    SQL Example

    SELECT midterm1 + midterm2 FROM grades

    Subtracting two vectors

    <vec_val>=sub(<vec_val1>,<vec_val2>)

    This function subtracts two vectors of values.

    Parameters

    <vec_val1>: The vector of values 1.
    <vec_val2>: The vector of values 2.

    Return Value

    <vec_val>: A vector of values equal to the component-wise addition of the two inputs.

    Usage


    -- used here
    score=sub(awesomebase.grades.project,awesomebase.grades.penalty)
    

    SQL Example

    SELECT AVG(midterm2 - midterm1) FROM grades 

    Updating values

    This function updates values from a column at given positions with a given value.



    relational_update(<col_var>,<vec_pos>,[INT])

    Parameters

    <col_var>: A variable that indicates the column to update.
    <vec_pos>: A vector of positions.
    INT: The new value.

    Return Value

    None.

    Usage


    project_to_update=select(awesomebase.grades.project,0,100) -- ...it should obviously be over 100!
    -- used here
    relational_update(awesomebase.grades.project,project_to_update,113)

    SQL Example

    UPDATE grades SET midterm1 = 100 WHERE midterm2 = 100

    Printing results

    print(<vec_val1>,...)

    The print command prints one or more vector in a tabular format.

    Parameters

    <vec_val1>: One or more vectors of values to be combined and printed.

    Return Value

    None.

    Usage


    -- used here
    print(awesomebase.grades.project,awesomebase.grades.quizzes) -- print project grades and quiz grades
    --OR--
    pos_high_project=select(awesomebase.grades.project,80,null) -- select project more than or equal to 80
    val_project=fetch(awesomebase.grades.project,pos1) -- fetch project grades
    val_studentid=fetch(awesomebase.grades.student_id,pos1) -- fetch student id
    val_quizzes=fetch(awesomebase.grades.quizzes,pos1) -- fetch quizz grades
    print(val_studentid,val_project,val_quizzes) -- print student_id, project grades and quiz grades for projects more than or equal to 80
    

    Then, the result should be:

    1,107,93
    2,92,85
    3,110,95
    4,88,95

    SQL Example

    This instruction is used to print out the results of a query. As such, this command is used in every query in a database which returns values.

    Batching Commands

    Batching consists of two commands. The first command, batch_queries, tells the server to hold the execution of the subsequent requests. The second command, batch_execute, then tells the server to execute these queries.



    batch_queries()

    batch_execute()

    Return Value

    batch_queries: None.
    batch_execute: No explicit return value, but the server must work out with the client when it is done sending results of the batched queries.

    Usage

    batch_queries()
    a_plus=select(awesomebase.grades.project,90,100) -- Find the students (rows) with project grade between 90 and 100
    a=select(awesomebase.grades.project,80,90) -- Find the students (rows) with project grade between 80 and 90
    super_awesome_peeps=select(awesomebase.grades.project,95,105)
    ids_of_students_with_top_project=fetch(awesomebase.grades.student_id,a_plus) -- Find the student id of the a_plus students
    batch_execute() -- The three selects should run as a shared scan
    

    SQL Example

    There is no batching command in the SQL syntax. However, almost all commercial databases have a command to submit a batch of queries.

    Shutting down

    This command shuts down the server. Data relating to databases, tables, and columns should be persisted so that they are available again when the server is restarted. Intermediate results and the variable pool should not be persisted.

    shutdown

    Return Value

    None.

    Usage

    shutdown
    

    Here is a sample tests on insertion, select, fetch, sum, and print.

    create(db,"db1")
    create(tbl,"tbl1",db1,2)
    create(col,"col1",db1.tbl1)
    create(col,"col2",db1.tbl1)
    load("test.csv")
    relational_insert(db1tbl1,-1,-11)
    relational_insert(db1tbl1,-2,-22)
    relational_insert(db1tbl1,-3,-33)
    relational_insert(db1tbl1,-4,-44)
    relational_insert(db1tbl1,-5,-55)
    relational_insert(db1tbl1,-6,-66)
    relational_insert(db1tbl1,-7,-77)
    relational_insert(db1tbl1,-8,-88)
    relational_insert(db1tbl1,-9,-99)
    s1=select(db1.tbl1.col1,-45869,34131)
    f1=fetch(db1.tbl1.col2,s1)
    a1=sum(f1)
    print(a1)
    
    shutdown
    

    Todo

    [ ] Implement updates [ ] Implement Delets [ ] Implement Grace Join [ ] Improve design based on Inputs [ ] Implement SIMDS

    Credit

    Hileamlak Yitayew

    My professors (Stratos Idreos) TAs (Hao Jiang, Utku Sirin, Subarna Chatterjee, Sanket Purandare)