Alternative schema specification API using variadic generics #82
nielsneerhoff
started this conversation in
Ideas
Replies: 2 comments 2 replies
-
Hi Niels!
Thank you! Happy to hear that!

    dataset = DataSet['fruit': str, 'sales': int, 'average_price': float]

Could you elaborate on when you'd use this form? When I define my schema classes, I usually refer to them many times, for example:

    from pyspark.sql.types import IntegerType, StringType
    from typedspark import Column, Schema, DataSet, transform_to_schema

    class Person(Schema):
        name: Column[StringType]
        age: Column[IntegerType]
        job_id: Column[IntegerType]

    class Job(Schema):
        id: Column[IntegerType]
        base_salary: Column[IntegerType]

    # Joined schema: inherits the columns of both Person and Job.
    class PersonWithJob(Person, Job):
        effective_salary: Column[IntegerType]

    def foo(persons: DataSet[Person], jobs: DataSet[Job]) -> DataSet[PersonWithJob]:
        return transform_to_schema(
            persons.join(jobs, Person.job_id == Job.id),
            PersonWithJob,
            {
                PersonWithJob.effective_salary: Job.base_salary + Person.age * 100,
            },
        )

With the variadic generics, we can't do this, right? Is there a use case I'm missing?
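To illustrate the reuse point with a toy model in plain Python: attributes declared on a class become named handles that any function can refer to, whereas an inline DataSet[...] expression leaves no such name behind. The Column class below is hypothetical and is not how typedspark implements its Column; it is only a sketch of the idea.

```python
class Column:
    """Hypothetical column handle; a toy model, not typedspark's Column."""

    def __set_name__(self, owner: type, name: str) -> None:
        # Python calls this once per attribute when the owning class is created.
        self.owner = owner.__name__
        self.name = name

    def __repr__(self) -> str:
        return f"{self.owner}.{self.name}"


class Person:
    name = Column()
    job_id = Column()


class Job:
    id = Column()
    base_salary = Column()


def join_condition() -> str:
    # The same named handles can be reused from any function or module.
    return f"{Person.job_id!r} == {Job.id!r}"


print(join_condition())
```

With an inline definition there is no Person or Job object to hang these references on, so each use would have to fall back to plain column-name strings.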
2 replies
-
Edited my post above for tone; my initial response was a bit dry :)
0 replies
-
Hi there!
Kudos! I have just started working with Spark in Python, and this library addresses an issue I've been struggling with from the beginning (especially coming from Java).
Python 3.11.4 was just released, and it includes some interesting new features, among which are variadic generics. In short, these could theoretically allow us to define a DataSet as:

    dataset = DataSet['fruit': str, 'sales': int, 'average_price': float]

instead of defining a separate Schema class (or equivalently, using PySpark types for the attribute type definitions).
What are your thoughts on adding support for this new type of schema specification to this library? Perhaps I can assist in implementing it.
There are several benefits of using the first schema definition approach:
dataset = DataSet['fruit': str, ...] if one does not know all attributes of the schema

I'm quite new to open-source contributing, so please feel free to give feedback on this post and the idea :)
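For what it's worth, the 'name': type subscript form is already valid Python at runtime, because each pair is parsed as a slice object and a trailing ... arrives as Ellipsis. A minimal sketch of how a class could read it, assuming a hypothetical InlineSchema stand-in (not part of typedspark). As far as I know, static type checkers do not accept slices as type arguments, so this only shows the runtime side of the idea.

```python
from typing import Any, Dict, Tuple


class InlineSchema:
    """Hypothetical stand-in for DataSet; not part of typedspark."""

    def __class_getitem__(cls, items: Any) -> Tuple[Dict[str, type], bool]:
        # A single item arrives bare; several items arrive as a tuple.
        if not isinstance(items, tuple):
            items = (items,)

        columns: Dict[str, type] = {}
        is_open = False  # True when `...` marks further, unknown columns
        for item in items:
            if item is Ellipsis:
                is_open = True
            elif isinstance(item, slice) and isinstance(item.start, str):
                # 'fruit': str is parsed as slice('fruit', str, None).
                columns[item.start] = item.stop
            else:
                raise TypeError(f"expected 'name': type or ..., got {item!r}")
        return columns, is_open


columns, is_open = InlineSchema["fruit": str, "sales": int, ...]
print(columns, is_open)
```

The returned mapping only carries names and types; it does not give the column handles that a class-based schema provides.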