II. Code Generation
Given the scope of the project, the mapper and reducer code that needs to be generated is quite simple and follows a fairly well-defined structure. For this reason, we keep a few "state variables" (lists, dictionaries, etc.) that encode all the information we need from the query to generate the correct code. We divide the required functionality into blocks of code and, given a query, generate the code for each block independently, then tie the blocks together with minimal processing. This is greatly helped by the fact that code generation in our approach is bottom-up (as opposed to a more top-down approach such as using templates or plugging values into prewritten code).
The changes we can make to the mapper's behaviour are mostly driven by the where clauses.
We encode the where clauses (there can be several) and how each one relates to the next (and/or conjunctions). While we do not handle nested where clauses, the script can generate code for an arbitrarily long where clause, and its behaviour matches that of the equivalent expression in plain Python. We simply generate an if block and have the mapper write to stdout only inside that block. We split each input line on "," since the expected input is .csv files. Column datatypes also pose a problem; as a workaround, we generate code to typecast the data of any non-string column (we get the type information from the metastore). This also allows us to forbid certain operations on strings, whether in where clauses or in aggregations.
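As an illustration of the mapper generation described above, here is a minimal sketch, not the project's actual generator. The function name `generate_mapper`, its parameters, and the shape of the "state variables" (a list of condition tuples, a list of conjunctions, and two dictionaries standing in for the metastore lookup) are all assumptions made for this example.

```python
def generate_mapper(conditions, conjunctions, column_index, column_types):
    """Return Python source for a streaming mapper that filters CSV rows.

    All names and data shapes here are hypothetical:
    conditions   -- list of (column, operator, literal) tuples, one per where clause
    conjunctions -- list of "and"/"or" strings joining consecutive conditions
    column_index -- dict mapping column name -> position in the CSV row
    column_types -- dict mapping column name -> "int", "float" or "str"
    """
    allowed_ops = {"==", "!=", "<", "<=", ">", ">="}
    string_ops = {"==", "!="}  # assumed restriction on string columns

    terms = []
    for column, op, literal in conditions:
        if op not in allowed_ops:
            raise ValueError("unsupported operator: %s" % op)
        ctype = column_types[column]
        if ctype == "str" and op not in string_ops:
            raise ValueError("operator %s is not allowed on string column %s" % (op, column))
        field = "fields[%d]" % column_index[column]
        if ctype != "str":
            # Typecast non-string columns before comparing.
            field = "%s(%s)" % (ctype, field)
        terms.append("%s %s %r" % (field, op, literal))

    # Chain the terms with their conjunctions; Python's own operator
    # precedence then decides how the flat (non-nested) clause evaluates.
    if terms:
        parts = [terms[0]]
        for conjunction, term in zip(conjunctions, terms[1:]):
            if conjunction not in ("and", "or"):
                raise ValueError("unsupported conjunction: %s" % conjunction)
            parts.extend([conjunction, term])
        predicate = " ".join(parts)
    else:
        predicate = "True"  # no where clause: every row passes

    lines = [
        "import sys",
        "",
        "for line in sys.stdin:",
        "    line = line.rstrip('\\n')",
        "    fields = line.split(',')",
        "    if %s:" % predicate,
        "        print(line)",  # the whole input line goes to stdout
    ]
    return "\n".join(lines) + "\n"
```

For example, with `age` as an int column at index 1 and `city` as a string column at index 2, the clauses `age > 30 and city == 'Pune'` produce the predicate `int(fields[1]) > 30 and fields[2] == 'Pune'` in the generated if block.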
One downside to our approach is that we send the entire input line to the reducer, even if we do not need all that data on the reducer side. This is a tradeoff we made for the sake of simplicity in the code generation algorithm, since sending only parts of the line would mean recalculation of indices and sanity checks to make sure we are sending all the data we actually need to the reducer.
The reducer also follows a well-defined structure; the only two things that change in it from query to query are the aggregations and the project statements.
Project statements are simply output by column index (we use dictionaries to map column names to indices). Aggregations are slightly more involved: they need global variables that are initialized before we read from stdin and updated every time a new row of data is seen, and the column values must be typecast again, since the reducer receives each row as text. To achieve this, we divide aggregation code generation into three blocks: a global variable initialization block, a global variable update block, and a global variable print block. The blocks operate independently of each other, so it is very simple to generate this code for an arbitrary number of aggregations. To avoid conflicts and ensure independence between aggregations, we make each variable name a function of the aggregation and the column it operates on.
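The three-block scheme for aggregations can be sketched roughly as follows. This is an illustrative sketch under stated assumptions, not the project's code: the function names, the supported aggregations (`count`, `sum`, `min`, `max`), the output format of the print block, and the rule that only `count` is allowed on string columns are all hypothetical. It also assumes column names are valid Python identifiers, since they become part of variable names.

```python
def variable_name(aggregation, column):
    # The name depends on both the aggregation and the column, so two
    # aggregations never share a variable (e.g. "sum_age" vs "max_age").
    return "%s_%s" % (aggregation, column)


def generate_reducer(aggregations, column_index, column_types):
    """Return Python source for a reducer computing the given aggregations.

    All names and data shapes here are hypothetical:
    aggregations -- list of (aggregation, column) pairs
    column_index -- dict mapping column name -> position in the CSV row
    column_types -- dict mapping column name -> "int", "float" or "str"
    """
    init_block, update_block, print_block = [], [], []

    for aggregation, column in aggregations:
        if column_types[column] == "str" and aggregation != "count":
            raise ValueError("%s is not allowed on string column %s" % (aggregation, column))
        var = variable_name(aggregation, column)
        # Rows arrive as text, so the value is typecast again on the reducer side.
        value = "%s(fields[%d])" % (column_types[column], column_index[column])

        if aggregation == "count":
            init_block.append("%s = 0" % var)
            update_block.append("%s += 1" % var)
        elif aggregation == "sum":
            init_block.append("%s = 0" % var)
            update_block.append("%s += %s" % (var, value))
        elif aggregation in ("min", "max"):
            init_block.append("%s = None" % var)
            update_block.append(
                "%s = %s if %s is None else %s(%s, %s)"
                % (var, value, var, aggregation, var, value)
            )
        else:
            raise ValueError("unsupported aggregation: %s" % aggregation)
        print_block.append("print('%s', %s)" % (var, var))

    # Tie the three independently generated blocks together.
    lines = ["import sys", ""]
    lines += init_block
    lines += ["", "for line in sys.stdin:", "    fields = line.rstrip('\\n').split(',')"]
    lines += ["    " + statement for statement in update_block]
    lines += [""] + print_block
    return "\n".join(lines) + "\n"
```

Because each aggregation only appends to the three lists, adding another aggregation never requires touching the code generated for the others.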