exportModel encounters NullPointerException #817

Open
knguyen1 opened this issue Apr 12, 2024 · 7 comments

@knguyen1

Describe the bug
Running the exportModel phase fails with a NullPointerException, so I cannot generate a CSV of the model. The generateDocs phase works fine. I followed the documentation: https://docs.zingg.ai/zingg/stepbystep/createtrainingdata/exportlabeleddata

To Reproduce
Steps to reproduce the behavior:
Run: (.venv) spark@496208741a60:/workspaces/foo-zingg-entity-resolution $ ~/zingg-0.4.0/scripts/zingg.sh --phase exportModel --conf /workspaces/foo-zingg-entity-resolution/datasets/trader/conf_no_bdid.json --location tmp --properties-file /workspaces/foo-zingg-entity-resolution/zingg.conf

Expected behavior
Should be able to export a csv of the model.

Screenshots

24/04/12 15:35:03 INFO ClientOptions: --phase
24/04/12 15:35:03 INFO ClientOptions: exportModel
24/04/12 15:35:03 INFO ClientOptions: --conf
24/04/12 15:35:03 INFO ClientOptions: /workspaces/foo-zingg-entity-resolution/datasets/trader/conf_no_bdid.json
24/04/12 15:35:03 INFO ClientOptions: --location
24/04/12 15:35:03 INFO ClientOptions: tmp
24/04/12 15:35:03 INFO ClientOptions: --email
24/04/12 15:35:03 INFO ClientOptions: [email protected]
24/04/12 15:35:03 INFO ClientOptions: --license
24/04/12 15:35:03 INFO ClientOptions: zinggLicense.txt
24/04/12 15:35:03 WARN ArgumentsUtil: Config Argument is /workspaces/foo-zingg-entity-resolution/datasets/trader/conf_no_bdid.json
24/04/12 15:35:03 WARN ArgumentsUtil: phase is exportModel
24/04/12 15:35:03 INFO Client: 
24/04/12 15:35:03 INFO Client: **************************************************************************
24/04/12 15:35:03 INFO Client: *            ** Note about analytics collection by Zingg AI **           *
24/04/12 15:35:03 INFO Client: *                                                                        *
24/04/12 15:35:03 INFO Client: *  Please note that Zingg captures a few metrics about application's     *
24/04/12 15:35:03 INFO Client: *  runtime parameters. However, no user's personal data or application   *
24/04/12 15:35:03 INFO Client: *  data is captured. If you want to switch off this feature, please      *
24/04/12 15:35:03 INFO Client: *  set the flag collectMetrics to false in config. For details, please   *
24/04/12 15:35:03 INFO Client: *  refer to the Zingg docs (https://docs.zingg.ai/docs/security.html)    *
24/04/12 15:35:03 INFO Client: **************************************************************************
24/04/12 15:35:03 INFO Client: 
java.lang.NullPointerException
        at java.base/java.lang.Class.forName0(Native Method)
        at java.base/java.lang.Class.forName(Unknown Source)
        at zingg.spark.core.executor.SparkZFactory.get(SparkZFactory.java:40)
        at zingg.common.client.Client.setZingg(Client.java:68)
        at zingg.common.client.Client.<init>(Client.java:46)
        at zingg.spark.client.SparkClient.<init>(SparkClient.java:29)
        at zingg.spark.client.SparkClient.getClient(SparkClient.java:68)
        at zingg.common.client.Client.mainMethod(Client.java:185)
        at zingg.spark.client.SparkClient.main(SparkClient.java:76)
        at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
        at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
        at java.base/java.lang.reflect.Method.invoke(Unknown Source)
        at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
        at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:1029)
        at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:194)
        at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:217)
        at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:91)
        at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1120)
        at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1129)
        at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

Desktop (please complete the following information):

  • OS:
(.venv) spark@496208741a60:/workspaces/foo-zingg-entity-resolution $ cat /etc/os-release 
NAME="Ubuntu"
VERSION="20.04.6 LTS (Focal Fossa)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 20.04.6 LTS"
VERSION_ID="20.04"
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
VERSION_CODENAME=focal
UBUNTU_CODENAME=focal

Smartphone (please complete the following information):
N/A

Additional context

{
    "fieldDefinition": [
        {
            "fieldName": "data_source",
            "fields": "data_source",
            "dataType": "string",
            "matchType": "DONT_USE"
        },
        // other fields...
    ],
    "output": [
        {
            "name": "output",
            "format": "csv",
            "props": {
                "location": "/tmp/zinggOutput",
                "delimiter": ",",
                "header": true
            }
        }
    ],
    "data": [{
        "name": "salesforce",
        "format": "jdbc",
        "props": {
            "url": "jdbc:redshift://my-redshift-server:5439/my-redshift-db",
            "dbtable": "my_schema.my_table",
            "driver": "com.amazon.redshift.jdbc42.Driver",
            "user": "test",
            "password": "password123"
        }
    }],
    "labelDataSampleSize" : 0.15,
    "numPartitions": 50,
    "modelId": 101,
    "zinggDir": "/workspaces/foo-zingg-entity-resolution/models"
}

@sonalgoyal sonalgoyal self-assigned this Apr 14, 2024
@sonalgoyal
Member

Thanks for reporting this. If you are stuck, you can read the model folder at zinggDir/modelId/trainingData/marked using pyspark. This location holds your labeled data in parquet format.

@vikasgupta78
Contributor

This will be handled alongside the SparkConnect change; putting it on hold for now.

@havardox

havardox commented Aug 18, 2024

For anyone who just wants to get their training data:

from pathlib import Path
from pyspark.sql import SparkSession

MODEL_PATH: str = "{your model folder}/{your model ID}"
OUTPUT_PATH: str = "output.csv"

context: SparkSession = SparkSession.builder.getOrCreate()

# Labeled pairs are stored as parquet under <model folder>/<model ID>/trainingData/marked
df = context.read.parquet(str((Path(MODEL_PATH) / "trainingData/marked").absolute()))

# toPandas() collects all rows to the driver, so convert once and reuse the result
pdf = df.toPandas()
print(pdf)

# Save to CSV
pdf.to_csv(Path(OUTPUT_PATH), header=True, index=False)
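
A variant of the snippet above, in case the labeled data is too large to collect with toPandas(). This is only a sketch: the helper names marked_data_path and export_marked_to_csv are made up for illustration and are not part of Zingg. It assumes the folder layout zinggDir/modelId/trainingData/marked mentioned earlier in this thread, and that the marked data has only flat columns that Spark's CSV writer can handle. Note that Spark writes a directory of part files rather than a single CSV file.

```python
from pathlib import Path


def marked_data_path(zingg_dir: str, model_id) -> Path:
    """Build <zinggDir>/<modelId>/trainingData/marked (layout as described in this thread)."""
    return Path(zingg_dir) / str(model_id) / "trainingData" / "marked"


def export_marked_to_csv(spark, zingg_dir: str, model_id, output_dir: str) -> None:
    """Read the labeled parquet data and write it out as CSV using Spark's own writer.

    `spark` is an existing pyspark.sql.SparkSession. The output is a directory
    containing one or more part-*.csv files, not a single file.
    """
    df = spark.read.parquet(str(marked_data_path(zingg_dir, model_id)))
    df.write.mode("overwrite").option("header", True).csv(output_dir)


# Example call from a pyspark shell, where `spark` is already defined
# (values taken from the config posted in this issue; "marked_csv" is a placeholder):
# export_marked_to_csv(spark, "/workspaces/foo-zingg-entity-resolution/models", 101, "marked_csv")
```

The path helper is kept separate so it can be checked without starting Spark.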

@iqoOopi

iqoOopi commented Sep 15, 2024

Same NullPointerException with zingg:0.4.0 from the Docker image.

@iqoOopi

iqoOopi commented Sep 15, 2024

Thanks havardox.

I'm running Zingg from Docker and am new to Spark. How can I export the model from Docker?

@sonalgoyal
Member

Can you try running pyspark inside the Docker container and executing the commands shared above by @havardox?

@Nitish1814
Contributor

@sonalgoyal sonalgoyal added this to the 0.5.0 milestone Oct 16, 2024