This directory contains scripts that aid in accessing and manipulating the Project CodeNet dataset.
Each script has the usual -h or --help option, which briefly explains
its purpose and the expected command-line arguments with their
defaults.
project_codenet_submissions.sh: selects all submissions for a
particular problem, language, status and code size range. Generates a
list of source code file names, one per line.
project_codenet_aggregate.sh: uses project_codenet_submissions.sh
repeatedly to obtain all submissions for a set of problems, a set of
languages, a set of statuses and a code size range, and acts upon each
source file with a user-defined action (by default, a symbolic link is
created in a user-selected output directory).
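
The default action of project_codenet_aggregate.sh (a symbolic link per
source file) can be pictured with the minimal sketch below. It is not
the script's actual code: the function name link_submissions and the
choice to read paths from standard input are assumptions made for
illustration only.

```shell
# Hypothetical helper, not part of the CodeNet scripts: reads source file
# paths from standard input (one per line) and creates a symbolic link to
# each one in the output directory given as $1.
link_submissions() {
    outdir=$1
    mkdir -p "$outdir" || return 1
    while IFS= read -r src; do
        # Skip blank lines and paths that do not exist.
        [ -n "$src" ] && [ -e "$src" ] || continue
        # Resolve to an absolute path so the link stays valid from outdir.
        abs=$(cd "$(dirname "$src")" && pwd)/$(basename "$src")
        ln -sf "$abs" "$outdir/$(basename "$src")"
    done
}
```

Under these assumptions, a list of file names such as the one produced
by project_codenet_submissions.sh could be piped into
link_submissions some_output_dir.
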
project_codenet.conf: a sample configuration file for
project_codenet_aggregate.sh. It specifies the set of problems,
languages, statuses and minimum and maximum code sizes of interest and
defines the action to take upon each submission source file.
dataset_verify.sh: checks the integrity of the dataset in both
directions: whether all submissions mentioned in the metadata do
indeed exist and reside in the expected location in the file system,
and conversely, whether all source files are correctly covered by the
metadata records.
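
The idea of a check "in both directions", as done by dataset_verify.sh,
can be sketched as a comparison of two path lists. This is only an
illustration, not how the script is implemented: the function name
verify_both_directions and the assumption that both sides are already
available as plain lists of paths are hypothetical.

```shell
# Hypothetical sketch: $1 lists the paths named in the metadata, $2 lists
# the paths found on disk, one path per line in both files.
verify_both_directions() {
    expected=$(mktemp) && actual=$(mktemp) || return 2
    sort -u "$1" > "$expected"
    sort -u "$2" > "$actual"
    # Lines only in the metadata list: submissions missing from the file system.
    comm -23 "$expected" "$actual" | sed 's/^/missing on disk: /'
    # Lines only in the disk list: source files without a metadata record.
    comm -13 "$expected" "$actual" | sed 's/^/not in metadata: /'
    # Succeed only when the two sorted lists are identical.
    cmp -s "$expected" "$actual"
    status=$?
    rm -f "$expected" "$actual"
    return $status
}
```

The function prints one line per discrepancy and returns a non-zero
status if either direction finds a mismatch.
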
post_fdupes.sh: postprocesses a file generated by the fdupes (or
jdupes) utility program. Collects various statistics about the
duplicates, such as how many file sets there are, whether each set is
of a single language, and whether there are duplicates among the
Accepted submissions.
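
As a rough illustration of the kind of statistics post_fdupes.sh
collects, the sketch below counts duplicate sets in fdupes output. It
is not the script's code. It assumes the default fdupes output format
(one path per line, sets separated by a blank line) and approximates
"same language" by comparing file extensions; the function name
fdupes_stats is hypothetical.

```shell
# Hypothetical sketch: summarise fdupes output stored in the file $1.
fdupes_stats() {
    # Paragraph mode (RS = ""): each blank-line-separated set is one
    # record, and each line (path) within it is one field.
    awk 'BEGIN { RS = ""; FS = "\n" }
    {
        sets++
        files += NF
        same = 1
        # Extension of the first path: strip everything up to the last dot.
        ext0 = $1; sub(/.*\./, "", ext0)
        for (i = 2; i <= NF; i++) {
            ext = $i; sub(/.*\./, "", ext)
            if (ext != ext0) same = 0
        }
        if (same) same_ext++
    }
    END {
        printf "sets: %d\nfiles: %d\nsets with a single extension: %d\n", sets, files, same_ext
    }' "$1"
}
```
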
callgraph.sh: explores the call graph of a C, C++, or Java source
file. By default starts from main and creates a JSON-Graph of all
reachable functions.
callees.sh: expects a C, C++, or Java source file as input and
extracts all function definitions reachable from a given start
function name (by default main).
callgraph_aux.sh: shared auxiliary functionality for the other
callgraph scripts. Uses srcml and xmlstarlet to explore the call graph
of a C, C++, or Java source file.