
Decision tree C code exported by porter has wrong datatype for features array it should be float #43

Open
vijaykilledar opened this issue Jan 7, 2019 · 4 comments
Labels
1.0.0 Issue covered in release 1.0.0

Comments

@vijaykilledar

vijaykilledar commented Jan 7, 2019

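The C code exported by porter declares the feature values as double, which is the wrong data type and reduces the prediction accuracy. It should be float, because scikit-learn converts the input to float32 internally.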

scikit-learn code

def predict(self, X, check_input=True):
        """Predict class or regression value for X.
        For a classification model, the predicted class for each sample in X is
        returned. For a regression model, the predicted value based on X is
        returned.
        Parameters
        ----------
        X : array-like or sparse matrix of shape = [n_samples, n_features]
            The input samples. Internally, it will be converted to
            ``dtype=np.float32`` and if a sparse matrix is provided
            to a sparse ``csr_matrix``.
        check_input : boolean, (default=True)
            Allow to bypass several input checking.
            Don't use this parameter unless you know what you do.
        Returns
        -------
        y : array of shape = [n_samples] or [n_samples, n_outputs]
            The predicted classes, or the predict values.
        """

porter C Code:

int main(int argc, const char * argv[]) {{
    /* Features: */
    double features[argc-1];
    int i;
    for (i = 1; i < argc; i++) {{
        features[i-1] = atof(argv[i]);
    }}

    /* Prediction: */
    printf("%d", {method_name}(features, 0));
    return 0;

}}
@vijaykilledar vijaykilledar changed the title C code exported by porter has wrong datatype for features array it should be float Decision tree C code exported by porter has wrong datatype for features array it should be float Jan 7, 2019
@nok
Owner

nok commented Jan 7, 2019

Can you please provide some data and code for comparison?

(There is a bigger difference between the internal and textual representation of values in Python I guess.)

@vijaykilledar
Author

OK, I will provide a detailed example and data tomorrow.

@vijaykilledar
Author

The attached zip files contain:

  1. C program trained on 10000 records that accepts the features as float
  2. C program trained on 10000 records that accepts the features as double
  3. Shell script used to count the matched prediction records for the above program binaries
  4. Test data set file
  5. Expected prediction data file
    porter_attachments.zip
  6. CSV file used for training (the first column is the target class, the remaining columns are the test data set)
    train_10000.zip

Test script output on my end:

./test_prediction.sh ./train_10000 ./train_10000_target ./porter_train_10000_double 
test data file - test_data/train_10000
expected prediction data file - test_data/train_10000_target
testing output binray by feeding training data .......
Total records - 10000
Matched prediction records - 9878

./test_prediction.sh ./train_10000 ./train_10000_target ./porter_train_10000_float
test data file - test_data/train_10000
expected prediction data file - test_data/train_10000_target
testing output binray by feeding training data .......
Total records - 10000
Matched prediction records - 9992

@nok
Owner

nok commented Jan 19, 2019

Okay, thanks. Can you please validate the data type of your training data?

print(type(X[0]))  # <type 'numpy.float32'> or <type 'numpy.float64'>

For load_digits it's numpy.float64, which corresponds to double in C, and the integrity check finished without mismatches. I then converted the data to floats with X.astype(np.float32), and the integrity check again finished without errors.

Nevertheless, it depends on the data. In general, I see the problem of differing floating-point precision between data types and programming languages. It could make sense to add an option to change the features data type in the transpiled output by using a new argument temp_dtype='float'.

Furthermore, atof() converts a string to double in C. If you want to use floats, you should use strtof() to convert the strings to float instead.

Can you test it?

@nok nok added the 1.0.0 Issue covered in release 1.0.0 label Aug 10, 2019