Please see the set of transform project conventions for details on general project conventions, transform configuration, testing and IDE setup.
This module captures code-specific metrics of the input data. The implementation is borrowed from the work done in the CodeParrot and StarCoder projects. In the current implementation, the module computes the following metrics and reports each metric in an individual column:
- line-specific metrics, including mean and max line length
- character-to-token ratio: uses the input tokenizer to tokenize the input data and measures the ratio between characters and tokens
- identifies a high occurrence of the keywords "test" or "config" and tags such samples as config or test samples
- tags samples as autogenerated if the sample contains keywords such as `auto-generated`, `autogenerated` or `automatically generated`
- programming language specific identification, where:
  - if the input sample is written in the `python` programming language and has no reference to constructs like `def` or `class`, it is highlighted as `has_no_keywords`
  - if the input sample is
This module adds the following fields into the output file:
- line_mean
- line_max
- total_num_lines
- avg_longest_lines
- alphanum_frac
- char_token_ratio
- autogenerated
- config_or_test
- has_no_keywords
- has_few_assignments
- is_xml
- is_html
The transform uses a tokenizer to collect the token-ratio metrics. If the specified tokenizer is not found in the local cache, it is downloaded from Hugging Face. By default, the `codeparrot/codeparrot` tokenizer is used.
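As a rough sketch of how a character-to-token ratio can be derived with a Hugging Face tokenizer (assuming the `transformers` library is installed; this is not the transform's exact code):

```python
from transformers import AutoTokenizer

# Downloads codeparrot/codeparrot from the Hugging Face Hub on first use,
# then reuses the local cache on subsequent runs.
tokenizer = AutoTokenizer.from_pretrained("codeparrot/codeparrot")


def char_token_ratio(contents: str) -> float:
    """Number of characters divided by the number of tokens the tokenizer produces."""
    tokens = tokenizer.tokenize(contents)
    return len(contents) / max(len(tokens), 1)


print(char_token_ratio("def add(a, b):\n    return a + b\n"))
```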
The following command line arguments are available in addition to the options provided by the ray launcher and the python launcher.
- "--contents_column_name" - input a column name which contains data to process. The default column name:
contents
- "--language_column_name" - input a column name which contains programming language details. The default column name:
language
- "--tokenizer" - input a tokenizer to convert the data into tokens. The default tokenizer is
codeparrot/codeparrot
- "--hf_token" - input the Hugging Face auth token to download the tokenizer. This option is only required for the tokenizer's whose access is restricted in Hugging Face.
To run the samples, use the following `make` targets:

- `run-cli-sample` - runs src/code_quality_transform_python.py using command line args
- `run-local-sample` - runs src/code_quality_local_python.py
These targets will activate the virtual environment and set up any configuration needed. Use the `-n` option of `make` to see the detail of what is done to run the sample.
For example,

    make run-cli-sample
    ...

Then

    ls output

to see the results of the transform.
To use the transform image to transform your data, please refer to the running images quickstart, substituting the name of this transform image and runtime as appropriate.