Onramp Minimal C (omC) is a tiny subset of C. It is implemented by Onramp's first stage compiler (cci/0-omc) and second stage preprocessor (cpp/1-omc).
It contains the minimum number of C features required to reasonably implement the second stage compiler (cci/1-opc) and other Onramp tools. It's designed to be simple enough for its compiler to be implemented in assembly, but powerful enough to be used to implement most of ANSI C (i.e. Onramp Practical C.) It is a critical step in the jump from assembly to a full implementation of C.
Its features are, in brief:
void
,int
andchar
types, and pointers to themif
,while
,continue
,break
andreturn
for control flowtypedef
type names, andextern
andstatic
global variable declarations- String, character, and integer constants
- All of the arithmetic and comparison operators in C
- Functions with up to four parameters
- Local variables with initializers and simple assignments (
=
) - Arbitrary nesting of blocks and expressions
- Mixed declarations and code
#include
,#ifdef
/#ifndef
, and object-like#define
/*C*/
and//C++
style comments
omC supports enough of C to define libc-compatible headers. This requires void
and void*
, extern
for global variables, and typedef
for FILE
and size_t
(it's actually harder to bootstrap these with #define
.) The support for typedef
and file inclusion in particular allows for the construction of large scale projects with idiomatic code organization in omC.
If you've ever been interested in writing a compiler or interpreter, omC is a great place to start. There is a full grammar specification below; a heavily documented reference implementation; a large battery of tests with scripts to run them; a set of C-compatible standard library headers; and some interesting programs such as an eight queens solver, not to mention many of the stages of Onramp itself. (You don't even need to write the preprocessor either because, once your compiler works, you could compile and use Onramp's preprocessor!)
omC is quite powerful, yet it is simple enough to be implemented in about a thousand lines of your favorite programming language (or a few thousand lines of assembly.) It is also easy to extend with additional features from C.
omC is analogous to the short-lived NB or "New B", an extension of B. NB essentially just added the int
and char
types to the typeless B. This paved the way for struct support and other features which eventually became K&R C. omC is inspired by the way C was originally bootstrapped by Dennis Ritchie more than five decades ago.
Unfortunately NB was lost. It went through several iterations, probably never having any concrete specification, and it existed so briefly that no version of it was preserved. This time, we are keeping the intermediate language around! We are implementing it in assembly and we will attempt to specify it here.
The main difference between NB and omC is, rather than making it an extension of B, omC is instead a proper subset of C. omC files can be compiled with any ordinary C99 or later compiler. This allows us to use modern tools to test and debug our code.
Compared to C, omC has:
- Only
void
,int
andchar
types and pointers to them.- No
struct
/union
/enum
; - No arrays;
- No
unsigned
,long
,short
; - No function pointers;
const
is ignored;- Limited
typedef
.
- No
- Only
if
,while
,continue
,break
andreturn
for control flow.- No
else
; - No
for
ordo
loops; - No labels or
goto
; - No
switch
/case
; - Braces are required for
if
andwhile
.
- No
- Limited operators: only simple arithmetic and logic.
- No compound assignment (
+=
,&=
, etc.) - No
++
/--
; - No logical
&&
or||
(bitwise&
and|
must be used instead); - No ternary conditional
?:
, comma operator, and more; sizeof
only with parenthesized types, not expressions.
- No compound assignment (
- No binary operator precedence. All binary expressions must be fully parenthesized!
- No variadic functions, and at most four function arguments.
- Only
#include
,#ifdef
/#ifndef
, and object-like#define
preprocessor directives.
Despite these limitations, omC is a proper subset of C. It can be compiled (and tested!) with any C99 or later compiler.
We also support a few C99 features that were not in C89, namely mixed declarations and code and C++-style comments.
The omC preprocessor supports only a subset of the preprocessing directives in standard C:
#include
#define
for non-recursive object-like macros#ifdef
,#ifndef
and#endif
conditionals#line
The above features are enough to support header files with include guards and to do platform-based conditionals and substitutions (to test and debug on Linux for example.)
The following macros must be predefined by the preprocessor to identify the supported language when compiling omC:
__onramp_cpp_omc__
, declaring the supported preprocessor dialect__onramp_cci_omc__
, declaring the supported compiler dialect
Additionally, the Onramp compiler driver defines the following macros:
__onramp__
, declaring the target platform__onramp_cpp__
, declaring this as a stage of the Onramp preprocessor__onramp_cci__
, declaring this as a stage of the Onramp compiler
Any unrecognized preprocessor directives are ignored in untaken conditional branches. You should therefore wrap any conventional preprocessor directives (like #if
) in #ifndef __onramp_cpp_omc__
. Most final-stage Onramp programs use this to support bootstrapping with the omC preprocessor and the opC compiler.
The omC preprocessor does not support function-like macros (i.e. #define
with arguments); #if
and friends; #else
; #undef
; and more. It also supports no built-in macros (like __FILE__
or __LINE__
), no modern features (like __has_builtin
) and no extensions (like __COUNTER__
or #include_next
.)
The omC preprocessor, like the compiler, only supports the ASCII-encoded POSIX portable character set, excluding null, alert and backspace. Any characters outside of this range cause an error.
The omC preprocessor does not support recursive macros. For example:
#define FOO BAR
#define BAR FOO
FOO
Unlike standard C (which preprocesses this to FOO
), in the omC preprocessor this is malformed. The omC preprocessor may diagnose it, or it may crash, go into an infinite loop or exhibit other undefined behaviour. This also means that you cannot define a macro to itself, which is otherwise legal C.
The #
character can only appear at the start of a line, after optional whitespace and comments. Since #
cannot appear anywhere else outside of a comment, the preprocessor is not required to detect and diagnose this. It may treat a #
anywhere outside of a comment as the start of a preprocessor directive. Such code is nevertheless malformed and this behaviour should not be relied upon.
The below is a description of the grammar as parsed by the Onramp's omC preprocessor (cpp/1) and compiler (cci/0).
The syntax is similar to the style of parsing expression grammars (PEG):
- bold is used for token categories
monospace
is used for literal tokens or regular expressions- italics is used for the names of parsing expressions (i.e. non-terminals)
- parentheses are for grouping
- '/' is used for alternatives (the "choice" operator in PEG)
- * means repetition
- ? means optional
omC supports C-style /*comments*/
and C++-style // line comments
. The preprocessor removes comments.
A preprocessor directive starts at the beginning of a line and runs until the end of the line. Whitespace and comments are allowed before and after all tokens (except in a file-specifier; see below.). The following directives are recognized:
include-directive: #
include
file-specifier
define-directive: #
define
alphanumeric token*
ifdef-directive: #
ifdef
alphanumeric
ifndef-directive: #
ifndef
alphanumeric
endif-directive: #
endif
line-directive: #
line
decimal file-specifier?
empty-directive: #
The include directive includes the content of the given file in place of the directive. A file specifier in #include
and #line
takes one of two forms:
file-specifier: ( <
filename >
) / ( "
filename "
)
The form with "
searches the current directory first; the form with <
does not. They otherwise search the same set of include paths and behave identically. Note that whitespace and comments are not allowed in and around the filename.
filename is a special kind of token that is only parsed in the context of an include directive. It is essentially any non-whitespace character until the closing "
or >
. (This can be difficult to parse with a lexer as it is context-sensitive. Onramp's omC preprocessor is scannerless so this is not an issue.)
The following tokens are recognized by the omC lexer. They are given here as regular expressions.
alphanumeric: [A-Za-z_][0-9A-Za-z_]*
number: decimal / hexadecimal
decimal: [1-9][0-9]*
hexadecimal: 0[xX][0-9A-Fa-f]*
string: "[^"]*"
character: '[^']'
punctuation: <<
/ >>
/ ==
/ !=
/ <=
/ >=
/ (any other non-whitespace character)
An alphanumeric token is either a keyword, an identifier or a type name. (Onramp does not use a "lexer hack"; it leaves it up to the parser to determine what an alphanumeric token is.)
Note that numbers are strictly decimal and hexadecimal integers. There is no support for numerical suffixes, floating point, or octal or binary representations.
file: global*
global: specifier type alphanumeric ( ;
/ function)
specifier: typedef
/ static
/ extern
A file is composed of a sequence of globals. Each global is either a typedef, a variable or a function.
At most one specifier can be provided. If a specifier is provided, the global cannot be a function.
typedef
makes the name an alias for the type.
extern
declares a variable without emitting a definition. static
declares a variable and emits a definition with internal linkage. If no specifier for a variable is provided, it is declared and emitted with external linkage. (There is no initializer for global variables; they are initialized to zero.)
function: (
( void
/ parameter-list ) )
( ;
/ compound-statement )
parameter-list: parameter ( ,
parameter )*
parameter: type alphanumeric?
A parameter list cannot be empty. If a function takes no parameters, it must be declared (void)
. Functions can have at most four parameters.
Parameter names are optional. If a parameter name is omitted, the parameter's value is inaccessible.
If a function's parameter list is followed by a semicolon, it is a function declaration. Otherwise, it is a function definition, and the compound statement is the body of the function.
type: const
? base const
? ( *
const
? )*
base: void
/ int
/ char
A type consists of a base type followed by 0 to 15 pointer declarators. const
is accepted and ignored anywhere before, after and in between.
Note that types are always parsed separately from identifiers. There is no support for any postfix or parenthesized declarators.
Types are simple enough to be represented in just six bits. cci/0 uses an additional bit as an l-value flag; see cci/0 for details.
statement: ;
/ compound-statement / control-statement / selection-statement / expression-statement
compound-statement: {
( local-declaration / statement )* }
control-statement: ( if
/ while
) (
expression )
compound-statement
selection-statement: ( break
/ continue
/ ( return
expression? ) ) ;
expression-statement: expression ;
Statements are fairly self-explanatory. Compound statements create a scope for variables, and can have mixed declarations and statements. The value of an expression statement is ignored. break
and continue
statements jump to the end or to the condition check of the nearest containing while
statement respectively.
The return
statement exits the function returning the expression's value if provided. If the containing function returns void
, a return
statement must omit the expression; otherwise, it must provide an expression of the same type as the function return type.
All functions with a non-void
return type must return via a return
statement, except for main()
which returns 0 if control reaches the end of the function.
local-declaration: type alphanumeric ( =
expression )? ;
A local declaration declares and defines a variable in the current compound statement. The variable ceases to exist at the end of the containing compound statement. While it exists, the variable shadows any previously defined variables or parameters with the same name.
An optional initializing expression can be provided. If it is, it must have the same type as the variable.
expression: binary-expression
binary-expression: unary-expression ( binary-operator unary-expression )?
unary-expression: ( unary-operator unary-expression ) / primary-expression
primary-expression: alphanumeric / number / string / character / sizeof-expression / parenthesized-expression
sizeof-epxression: sizeof
(
type )
parenthesized-expression: (
expression )
A string expression has type char*
. A character expression has type int
. An alphanumeric expression has the type of and evaluates to the value of the most recently defined extant variable with a matching name.
The sizeof
expression evaluates to the size of the given type. Note that it cannot take an expression as argument.
Note that multiple binary operators cannot be specified in one expression. They must be parenthesized.
unary-operator: cast-operator / +
/ -
/ !
/ ~
/ *
/ &
cast-operator: (
type )
These operators have the same meaning as they do in C.
binary-operator: assignment-operator / comparison-operator / arithmetic-operator
assignment-operator: =
comparison-operator: ==
/ !=
/ <
/ >
/ <=
/ >=
arithmetic-operator: +
/ -
/ *
/ /
/ ^
/ <<
/ >>
/ &
/ |
/ ^
These operators have the same meaning as they do in C. The comparison operators evaluate to 0
or 1
.
Note that there is no binary operator precedence. All nested binary expressions must be fully parenthesized.
The +
operator, when adding a value to a pointer, multiplies the value by the size of the pointed-to type. The -
operator, when taking the difference of two pointers, divides the result by the size of the pointed-to type.