diff --git a/docs/_build/overrides/404.html b/docs/_build/overrides/404.html index ee9b8faa2aba..986222099a22 100644 --- a/docs/_build/overrides/404.html +++ b/docs/_build/overrides/404.html @@ -1,4 +1,4 @@ -{% extends "main.html" %} +{% extends "base.html" %} {% block content %}
404 - You're lost. How you got here is a mystery. But you can click the button below to go back to the homepage or use the search bar in the navigation menu to find what you are looking for.

- Home + Home
{% endblock %} diff --git a/docs/api/index.md b/docs/api/index.md index 004799cae1b4..485b59923ad1 100644 --- a/docs/api/index.md +++ b/docs/api/index.md @@ -11,7 +11,7 @@ It's the best place to look if you need information on a specific function. ## Python The Python API reference is built using Sphinx. -It's available on [GitHub Pages](https://docs.pola.rs/py-polars/html/reference/index.html). +It's available in [our docs](https://docs.pola.rs/py-polars/html/reference/index.html). ## Rust diff --git a/docs/index.md b/docs/index.md index 2c72f776edbb..16ec4a31e4ae 100644 --- a/docs/index.md +++ b/docs/index.md @@ -1,19 +1,12 @@ ---- -hide: - - navigation ---- - -# Polars - ![logo](https://raw.githubusercontent.com/pola-rs/polars-static/master/logos/polars_github_logo_rect_dark_name.svg)

Blazingly Fast DataFrame Library

- rust docs + Rust docs latest - + Rust crates Latest Release PyPI Latest Release @@ -23,26 +16,42 @@ hide:
-Polars is a highly performant DataFrame library for manipulating structured data. The core is written in Rust, but the library is also available in Python. Its key features are: +Polars is a blazingly fast DataFrame library for manipulating structured data. The core is written in Rust, and it is available for Python, R and NodeJS. -- **Fast**: Polars is written from the ground up, designed close to the machine and without external dependencies. +## Key features + +- **Fast**: Written from scratch in Rust, designed close to the machine and without external dependencies. - **I/O**: First class support for all common data storage layers: local, cloud storage & databases. -- **Easy to use**: Write your queries the way they were intended. Polars, internally, will determine the most efficient way to execute using its query optimizer. -- **Out of Core**: Polars supports out of core data transformation with its streaming API. Allowing you to process your results without requiring all your data to be in memory at the same time -- **Parallel**: Polars fully utilises the power of your machine by dividing the workload among the available CPU cores without any additional configuration. -- **Vectorized Query Engine**: Polars uses [Apache Arrow](https://arrow.apache.org/), a columnar data format, to process your queries in a vectorized manner. It uses [SIMD](https://en.wikipedia.org/wiki/Single_instruction,_multiple_data) to optimize CPU usage. +- **Intuitive API**: Write your queries the way they were intended. Polars, internally, will determine the most efficient way to execute using its query optimizer. +- **Out of Core**: The streaming API allows you to process your results without requiring all your data to be in memory at the same time. +- **Parallel**: Utilises the power of your machine by dividing the workload among the available CPU cores without any additional configuration. 
+- **Vectorized Query Engine**: Uses [Apache Arrow](https://arrow.apache.org/), a columnar data format, to process your queries in a vectorized manner, and SIMD to optimize CPU usage. + + -## Performance :rocket: :rocket: +!!! info "Users new to DataFrames" + A DataFrame is a 2-dimensional data structure that is useful for data manipulation and analysis. With labeled axes for rows and columns, each column can contain different data types, making complex data operations such as merging and aggregation much easier. Due to their flexibility and intuitive way of storing and working with data, DataFrames have become increasingly popular in modern data analytics and engineering. -Polars is very fast, and in fact is one of the best performing solutions available. -See the results in h2oai's [db-benchmark](https://duckdblabs.github.io/db-benchmark/), revived by the DuckDB project. + -Polars [TPC-H Benchmark results](https://www.pola.rs/benchmarks.html) are now available on the official website. +## Philosophy + +The goal of Polars is to provide a lightning fast DataFrame library that: + +- Utilizes all available cores on your machine. +- Optimizes queries to reduce unneeded work/memory allocations. +- Handles datasets much larger than your available RAM. +- Offers a consistent and predictable API. +- Adheres to a strict schema (data-types should be known before running the query). + +Polars is written in Rust, which gives it C/C++ performance and allows it to fully control performance critical parts in a query engine. ## Example {{code_block('home/example','example',['scan_csv','filter','group_by','collect'])}} +A more extensive introduction can be found in the [next chapter](user-guide/getting-started.md). + ## Community Polars has a very active community with frequent releases (approximately weekly). 
Below are some of the top contributors to the project: diff --git a/docs/requirements.txt b/docs/requirements.txt index e0416d67440b..d0a5a5d8193f 100644 --- a/docs/requirements.txt +++ b/docs/requirements.txt @@ -5,6 +5,7 @@ matplotlib mkdocs-material==9.5.2 mkdocs-macros-plugin==1.0.4 +mkdocs-redirects==1.2.1 material-plausible-plugin==0.2.0 markdown-exec[ansi]==1.7.0 PyGithub==2.1.1 diff --git a/docs/src/python/user-guide/basics/expressions.py b/docs/src/python/user-guide/basics/expressions.py index 041b023f27c4..12c6ea2170ec 100644 --- a/docs/src/python/user-guide/basics/expressions.py +++ b/docs/src/python/user-guide/basics/expressions.py @@ -6,19 +6,16 @@ df = pl.DataFrame( { - "a": range(8), - "b": np.random.rand(8), + "a": range(5), + "b": np.random.rand(5), "c": [ - datetime(2022, 12, 1), - datetime(2022, 12, 2), - datetime(2022, 12, 3), - datetime(2022, 12, 4), - datetime(2022, 12, 5), - datetime(2022, 12, 6), - datetime(2022, 12, 7), - datetime(2022, 12, 8), + datetime(2025, 12, 1), + datetime(2025, 12, 2), + datetime(2025, 12, 3), + datetime(2025, 12, 4), + datetime(2025, 12, 5), ], - "d": [1, 2.0, float("nan"), float("nan"), 0, -5, -42, None], + "d": [1, 2.0, float("nan"), -42, None], } ) # --8<-- [end:setup] @@ -36,12 +33,12 @@ # --8<-- [end:select3] # --8<-- [start:exclude] -df.select(pl.exclude("a")) +df.select(pl.exclude(["a", "c"])) # --8<-- [end:exclude] # --8<-- [start:filter] df.filter( - pl.col("c").is_between(datetime(2022, 12, 2), datetime(2022, 12, 8)), + pl.col("c").is_between(datetime(2025, 12, 2), datetime(2025, 12, 3)), ) # --8<-- [end:filter] diff --git a/docs/src/python/user-guide/basics/reading-writing.py b/docs/src/python/user-guide/basics/reading-writing.py index dc8a54ebd18f..68c0ab235fd1 100644 --- a/docs/src/python/user-guide/basics/reading-writing.py +++ b/docs/src/python/user-guide/basics/reading-writing.py @@ -6,11 +6,12 @@ { "integer": [1, 2, 3], "date": [ - datetime(2022, 1, 1), - datetime(2022, 1, 2), - datetime(2022, 1, 
3), + datetime(2025, 1, 1), + datetime(2025, 1, 2), + datetime(2025, 1, 3), ], "float": [4.0, 5.0, 6.0], + "string": ["a", "b", "c"], } ) diff --git a/docs/src/rust/user-guide/basics/expressions.rs b/docs/src/rust/user-guide/basics/expressions.rs index ea6cae3c84af..757c52e3939f 100644 --- a/docs/src/rust/user-guide/basics/expressions.rs +++ b/docs/src/rust/user-guide/basics/expressions.rs @@ -6,19 +6,16 @@ fn main() -> Result<(), Box> { let mut rng = rand::thread_rng(); let df: DataFrame = df!( - "a" => 0..8, - "b"=> (0..8).map(|_| rng.gen::()).collect::>(), + "a" => 0..5, + "b"=> (0..5).map(|_| rng.gen::()).collect::>(), "c"=> [ - NaiveDate::from_ymd_opt(2022, 12, 1).unwrap().and_hms_opt(0, 0, 0).unwrap(), - NaiveDate::from_ymd_opt(2022, 12, 2).unwrap().and_hms_opt(0, 0, 0).unwrap(), - NaiveDate::from_ymd_opt(2022, 12, 3).unwrap().and_hms_opt(0, 0, 0).unwrap(), - NaiveDate::from_ymd_opt(2022, 12, 4).unwrap().and_hms_opt(0, 0, 0).unwrap(), - NaiveDate::from_ymd_opt(2022, 12, 5).unwrap().and_hms_opt(0, 0, 0).unwrap(), - NaiveDate::from_ymd_opt(2022, 12, 6).unwrap().and_hms_opt(0, 0, 0).unwrap(), - NaiveDate::from_ymd_opt(2022, 12, 7).unwrap().and_hms_opt(0, 0, 0).unwrap(), - NaiveDate::from_ymd_opt(2022, 12, 8).unwrap().and_hms_opt(0, 0, 0).unwrap(), + NaiveDate::from_ymd_opt(2025, 12, 1).unwrap().and_hms_opt(0, 0, 0).unwrap(), + NaiveDate::from_ymd_opt(2025, 12, 2).unwrap().and_hms_opt(0, 0, 0).unwrap(), + NaiveDate::from_ymd_opt(2025, 12, 3).unwrap().and_hms_opt(0, 0, 0).unwrap(), + NaiveDate::from_ymd_opt(2025, 12, 4).unwrap().and_hms_opt(0, 0, 0).unwrap(), + NaiveDate::from_ymd_opt(2025, 12, 5).unwrap().and_hms_opt(0, 0, 0).unwrap(), ], - "d"=> [Some(1.0), Some(2.0), None, None, Some(0.0), Some(-5.0), Some(-42.), None] + "d"=> [Some(1.0), Some(2.0), None, Some(-42.), None] ) .unwrap(); @@ -46,17 +43,17 @@ fn main() -> Result<(), Box> { let out = df .clone() .lazy() - .select([col("*").exclude(["a"])]) + .select([col("*").exclude(["a", "c"])]) .collect()?; 
println!("{}", out); // --8<-- [end:exclude] // --8<-- [start:filter] - let start_date = NaiveDate::from_ymd_opt(2022, 12, 2) + let start_date = NaiveDate::from_ymd_opt(2025, 12, 2) .unwrap() .and_hms_opt(0, 0, 0) .unwrap(); - let end_date = NaiveDate::from_ymd_opt(2022, 12, 8) + let end_date = NaiveDate::from_ymd_opt(2025, 12, 3) .unwrap() .and_hms_opt(0, 0, 0) .unwrap(); diff --git a/docs/src/rust/user-guide/basics/reading-writing.rs b/docs/src/rust/user-guide/basics/reading-writing.rs index 44c1a335428d..dad5e8713d24 100644 --- a/docs/src/rust/user-guide/basics/reading-writing.rs +++ b/docs/src/rust/user-guide/basics/reading-writing.rs @@ -9,9 +9,9 @@ fn main() -> Result<(), Box> { let mut df: DataFrame = df!( "integer" => &[1, 2, 3], "date" => &[ - NaiveDate::from_ymd_opt(2022, 1, 1).unwrap().and_hms_opt(0, 0, 0).unwrap(), - NaiveDate::from_ymd_opt(2022, 1, 2).unwrap().and_hms_opt(0, 0, 0).unwrap(), - NaiveDate::from_ymd_opt(2022, 1, 3).unwrap().and_hms_opt(0, 0, 0).unwrap(), + NaiveDate::from_ymd_opt(2025, 1, 1).unwrap().and_hms_opt(0, 0, 0).unwrap(), + NaiveDate::from_ymd_opt(2025, 1, 2).unwrap().and_hms_opt(0, 0, 0).unwrap(), + NaiveDate::from_ymd_opt(2025, 1, 3).unwrap().and_hms_opt(0, 0, 0).unwrap(), ], "float" => &[4.0, 5.0, 6.0] ) diff --git a/docs/user-guide/basics/expressions.md b/docs/user-guide/basics/expressions.md deleted file mode 100644 index 0277d3da72f6..000000000000 --- a/docs/user-guide/basics/expressions.md +++ /dev/null @@ -1,130 +0,0 @@ -# Expressions - -`Expressions` are the core strength of Polars. The `expressions` offer a versatile structure that both solves easy queries and is easily extended to complex ones. 
Below we will cover the basic components that serve as building block (or in Polars terminology contexts) for all your queries: - -- `select` -- `filter` -- `with_columns` -- `group_by` - -To learn more about expressions and the context in which they operate, see the User Guide sections: [Contexts](../concepts/contexts.md) and [Expressions](../concepts/expressions.md). - -### Select statement - -To select a column we need to do two things. Define the `DataFrame` we want the data from. And second, select the data that we need. In the example below you see that we select `col('*')`. The asterisk stands for all columns. - -{{code_block('user-guide/basics/expressions','select',['select'])}} - -```python exec="on" result="text" session="getting-started/expressions" ---8<-- "python/user-guide/basics/expressions.py:setup" -print( - --8<-- "python/user-guide/basics/expressions.py:select" -) -``` - -You can also specify the specific columns that you want to return. There are two ways to do this. The first option is to pass the column names, as seen below. - -{{code_block('user-guide/basics/expressions','select2',['select'])}} - -```python exec="on" result="text" session="getting-started/expressions" -print( - --8<-- "python/user-guide/basics/expressions.py:select2" -) -``` - -The second option is to specify each column using `pl.col`. This option is shown below. - -{{code_block('user-guide/basics/expressions','select3',['select'])}} - -```python exec="on" result="text" session="getting-started/expressions" -print( - --8<-- "python/user-guide/basics/expressions.py:select3" -) -``` - -If you want to exclude an entire column from your view, you can simply use `exclude` in your `select` statement. 
- -{{code_block('user-guide/basics/expressions','exclude',['select'])}} - -```python exec="on" result="text" session="getting-started/expressions" -print( - --8<-- "python/user-guide/basics/expressions.py:exclude" -) -``` - -### Filter - -The `filter` option allows us to create a subset of the `DataFrame`. We use the same `DataFrame` as earlier and we filter between two specified dates. - -{{code_block('user-guide/basics/expressions','filter',['filter'])}} - -```python exec="on" result="text" session="getting-started/expressions" -print( - --8<-- "python/user-guide/basics/expressions.py:filter" -) -``` - -With `filter` you can also create more complex filters that include multiple columns. - -{{code_block('user-guide/basics/expressions','filter2',['filter'])}} - -```python exec="on" result="text" session="getting-started/expressions" -print( - --8<-- "python/user-guide/basics/expressions.py:filter2" -) -``` - -### With_columns - -`with_columns` allows you to create new columns for your analyses. We create two new columns `e` and `b+42`. First we sum all values from column `b` and store the results in column `e`. After that we add `42` to the values of `b`. Creating a new column `b+42` to store these results. - -{{code_block('user-guide/basics/expressions','with_columns',['with_columns'])}} - -```python exec="on" result="text" session="getting-started/expressions" -print( - --8<-- "python/user-guide/basics/expressions.py:with_columns" -) -``` - -### Group by - -We will create a new `DataFrame` for the Group by functionality. This new `DataFrame` will include several 'groups' that we want to group by. 
- -{{code_block('user-guide/basics/expressions','dataframe2',['DataFrame'])}} - -```python exec="on" result="text" session="getting-started/expressions" ---8<-- "python/user-guide/basics/expressions.py:dataframe2" -print(df2) -``` - -{{code_block('user-guide/basics/expressions','group_by',['group_by'])}} - -```python exec="on" result="text" session="getting-started/expressions" -print( - --8<-- "python/user-guide/basics/expressions.py:group_by" -) -``` - -{{code_block('user-guide/basics/expressions','group_by2',['group_by'])}} - -```python exec="on" result="text" session="getting-started/expressions" -print( - --8<-- "python/user-guide/basics/expressions.py:group_by2" -) -``` - -### Combining operations - -Below are some examples on how to combine operations to create the `DataFrame` you require. - -{{code_block('user-guide/basics/expressions','combine',['select','with_columns'])}} - -```python exec="on" result="text" session="getting-started/expressions" ---8<-- "python/user-guide/basics/expressions.py:combine" -``` - -{{code_block('user-guide/basics/expressions','combine2',['select','with_columns'])}} - -```python exec="on" result="text" session="getting-started/expressions" ---8<-- "python/user-guide/basics/expressions.py:combine2" -``` diff --git a/docs/user-guide/basics/index.md b/docs/user-guide/basics/index.md deleted file mode 100644 index af73c7967574..000000000000 --- a/docs/user-guide/basics/index.md +++ /dev/null @@ -1,18 +0,0 @@ -# Introduction - -This chapter is intended for new Polars users. -The goal is to provide a quick overview of the most common functionality. -Feel free to skip ahead to the [next chapter](../concepts/data-types/overview.md) to dive into the details. - -!!! rust "Rust Users Only" - - Due to historical reasons, the eager API in Rust is outdated. In the future, we would like to redesign it as a small wrapper around the lazy API (as is the design in Python / NodeJS). 
In the examples, we will use the lazy API instead with `.lazy()` and `.collect()`. For now you can ignore these two functions. If you want to know more about the lazy and eager API, go [here](../concepts/lazy-vs-eager.md). - - To enable the Lazy API ensure you have the feature flag `lazy` configured when installing Polars - ``` - # Cargo.toml - [dependencies] - polars = { version = "x", features = ["lazy", ...]} - ``` - - Because of the ownership ruling in Rust, we can not reuse the same `DataFrame` multiple times in the examples. For simplicity reasons we call `clone()` to overcome this issue. Note that this does not duplicate the data but just increments a pointer (`Arc`). diff --git a/docs/user-guide/basics/joins.md b/docs/user-guide/basics/joins.md deleted file mode 100644 index 21cb927164a9..000000000000 --- a/docs/user-guide/basics/joins.md +++ /dev/null @@ -1,26 +0,0 @@ -# Combining DataFrames - -There are two ways `DataFrame`s can be combined depending on the use case: join and concat. - -## Join - -Polars supports all types of join (e.g. left, right, inner, outer). Let's have a closer look on how to `join` two `DataFrames` into a single `DataFrame`. Our two `DataFrames` both have an 'id'-like column: `a` and `x`. We can use those columns to `join` the `DataFrames` in this example. - -{{code_block('user-guide/basics/joins','join',['join'])}} - -```python exec="on" result="text" session="getting-started/joins" ---8<-- "python/user-guide/basics/joins.py:setup" ---8<-- "python/user-guide/basics/joins.py:join" -``` - -To see more examples with other types of joins, go the [User Guide](../transformations/joins.md). - -## Concat - -We can also `concatenate` two `DataFrames`. Vertical concatenation will make the `DataFrame` longer. Horizontal concatenation will make the `DataFrame` wider. Below you can see the result of an horizontal concatenation of our two `DataFrames`. 
- -{{code_block('user-guide/basics/joins','hstack',['hstack'])}} - -```python exec="on" result="text" session="getting-started/joins" ---8<-- "python/user-guide/basics/joins.py:hstack" -``` diff --git a/docs/user-guide/basics/reading-writing.md b/docs/user-guide/basics/reading-writing.md deleted file mode 100644 index 8999f601e823..000000000000 --- a/docs/user-guide/basics/reading-writing.md +++ /dev/null @@ -1,45 +0,0 @@ -# Reading & writing - -Polars supports reading and writing to all common files (e.g. csv, json, parquet), cloud storage (S3, Azure Blob, BigQuery) and databases (e.g. postgres, mysql). In the following examples we will show how to operate on most common file formats. For the following dataframe - -{{code_block('user-guide/basics/reading-writing','dataframe',['DataFrame'])}} - -```python exec="on" result="text" session="getting-started/reading" ---8<-- "python/user-guide/basics/reading-writing.py:dataframe" -``` - -#### CSV - -Polars has its own fast implementation for csv reading with many flexible configuration options. - -{{code_block('user-guide/basics/reading-writing','csv',['read_csv','write_csv'])}} - -```python exec="on" result="text" session="getting-started/reading" ---8<-- "python/user-guide/basics/reading-writing.py:csv" -``` - -As we can see above, Polars made the datetimes a `string`. We can tell Polars to parse dates, when reading the csv, to ensure the date becomes a datetime. 
The example can be found below: - -{{code_block('user-guide/basics/reading-writing','csv2',['read_csv'])}} - -```python exec="on" result="text" session="getting-started/reading" ---8<-- "python/user-guide/basics/reading-writing.py:csv2" -``` - -#### JSON - -{{code_block('user-guide/basics/reading-writing','json',['read_json','write_json'])}} - -```python exec="on" result="text" session="getting-started/reading" ---8<-- "python/user-guide/basics/reading-writing.py:json" -``` - -#### Parquet - -{{code_block('user-guide/basics/reading-writing','parquet',['read_parquet','write_parquet'])}} - -```python exec="on" result="text" session="getting-started/reading" ---8<-- "python/user-guide/basics/reading-writing.py:parquet" -``` - -To see more examples and other data formats go to the [User Guide](../io/csv.md), section IO. diff --git a/docs/user-guide/concepts/index.md b/docs/user-guide/concepts/index.md new file mode 100644 index 000000000000..63a2ebeabe44 --- /dev/null +++ b/docs/user-guide/concepts/index.md @@ -0,0 +1,11 @@ +# Concepts + +The `Concepts` chapter describes the core concepts of the Polars API. Understanding these will help you optimise your queries on a daily basis. We will cover the following topics: + +- [Data Types: Overview](data-types/overview.md) +- [Data Types: Categoricals](data-types/categoricals.md) +- [Data structures](data-structures.md) +- [Contexts](contexts.md) +- [Expressions](expressions.md) +- [Lazy vs eager](lazy-vs-eager.md) +- [Streaming](streaming.md) diff --git a/docs/user-guide/expressions/index.md b/docs/user-guide/expressions/index.md new file mode 100644 index 000000000000..3724e09ce15e --- /dev/null +++ b/docs/user-guide/expressions/index.md @@ -0,0 +1,18 @@ +# Expressions + +In the `Contexts` sections we outlined what `Expressions` are and how they are invaluable. In this section we will focus on the `Expressions` themselves. Each section gives an overview of what they do and provide additional examples. 
+ +- [Operators](operators.md) +- [Column selections](column-selections.md) +- [Functions](functions.md) +- [Casting](casting.md) +- [Strings](strings.md) +- [Aggregation](aggregation.md) +- [Null](null.md) +- [Window](window.md) +- [Folds](folds.md) +- [Lists](lists.md) +- [Plugins](plugins.md) +- [User-defined functions](user-defined-functions.md) +- [Structs](structs.md) +- [Numpy](numpy.md) diff --git a/docs/user-guide/getting-started.md b/docs/user-guide/getting-started.md new file mode 100644 index 000000000000..3ae743114cf8 --- /dev/null +++ b/docs/user-guide/getting-started.md @@ -0,0 +1,186 @@ +# Getting started + +This chapter is here to help you get started with Polars. It covers all the fundamental features and functionalities of the library, making it easy for new users to familiarise themselves with the basics from initial installation and setup to core functionalities. If you're already an advanced user or familiar with DataFrames, feel free to skip ahead to the [next chapter about installation options](installation.md). + +## Installing Polars + +=== ":fontawesome-brands-python: Python" + + ``` bash + pip install polars + ``` + +=== ":fontawesome-brands-rust: Rust" + + ``` shell + cargo add polars -F lazy + + # Or Cargo.toml + [dependencies] + polars = { version = "x", features = ["lazy", ...]} + ``` + +## Reading & writing + +Polars supports reading and writing for common file formats (e.g. csv, json, parquet), cloud storage (S3, Azure Blob, BigQuery) and databases (e.g. postgres, mysql). Below we show the concept of reading and writing to disk. + +{{code_block('user-guide/basics/reading-writing','dataframe',['DataFrame'])}} + +```python exec="on" result="text" session="getting-started/reading" +--8<-- "python/user-guide/basics/reading-writing.py:dataframe" +``` + +In the example below we write the DataFrame to a csv file called `output.csv`. After that, we read it back with `read_csv` and `print` the result for inspection. 
+ +{{code_block('user-guide/basics/reading-writing','csv',['read_csv','write_csv'])}} + +```python exec="on" result="text" session="getting-started/reading" +--8<-- "python/user-guide/basics/reading-writing.py:csv" +``` + +For more examples on the CSV file format and other data formats, start with the [IO section](io/index.md) of the User Guide. + +## Expressions + +`Expressions` are the core strength of Polars. The `expressions` offer a modular structure that allows you to combine simple concepts into complex queries. Below we cover the basic components that serve as building blocks (or, in Polars terminology, contexts) for all your queries: + +- `select` +- `filter` +- `with_columns` +- `group_by` + +To learn more about expressions and the context in which they operate, see the User Guide sections: [Contexts](concepts/contexts.md) and [Expressions](concepts/expressions.md). + +### Select + +To select a column we need to do two things: + +1. Define the `DataFrame` we want the data from. +2. Select the data that we need. + +In the example below you see that we select `col('*')`. The asterisk stands for all columns. + +{{code_block('user-guide/basics/expressions','select',['select'])}} + +```python exec="on" result="text" session="getting-started/expressions" +--8<-- "python/user-guide/basics/expressions.py:setup" +print( + --8<-- "python/user-guide/basics/expressions.py:select" +) +``` + +You can also specify the specific columns that you want to return by passing the column names, as seen below. + +{{code_block('user-guide/basics/expressions','select2',['select'])}} + +```python exec="on" result="text" session="getting-started/expressions" +print( + --8<-- "python/user-guide/basics/expressions.py:select2" +) +``` + +Follow these links to other parts of the User Guide to learn more about [basic operations](expressions/operators.md) or [column selections](expressions/column-selections.md). 
+ +### Filter + +The `filter` option allows us to create a subset of the `DataFrame`. We use the same `DataFrame` as earlier and we filter between two specified dates. + +{{code_block('user-guide/basics/expressions','filter',['filter'])}} + +```python exec="on" result="text" session="getting-started/expressions" +print( + --8<-- "python/user-guide/basics/expressions.py:filter" +) +``` + +With `filter` you can also create more complex filters that include multiple columns. + +{{code_block('user-guide/basics/expressions','filter2',['filter'])}} + +```python exec="on" result="text" session="getting-started/expressions" +print( + --8<-- "python/user-guide/basics/expressions.py:filter2" +) +``` + +### Add columns + +`with_columns` allows you to create new columns for your analyses. We create two new columns `e` and `b+42`. First we sum all values from column `b` and store the results in column `e`. After that we add `42` to the values of `b`, creating a new column `b+42` to store these results. + +{{code_block('user-guide/basics/expressions','with_columns',['with_columns'])}} + +```python exec="on" result="text" session="getting-started/expressions" +print( + --8<-- "python/user-guide/basics/expressions.py:with_columns" +) +``` + +### Group_by + +We will create a new `DataFrame` for the Group by functionality. This new `DataFrame` will include several 'groups' that we want to group by. 
+ +{{code_block('user-guide/basics/expressions','dataframe2',['DataFrame'])}} + +```python exec="on" result="text" session="getting-started/expressions" +--8<-- "python/user-guide/basics/expressions.py:dataframe2" +print(df2) +``` + +{{code_block('user-guide/basics/expressions','group_by',['group_by'])}} + +```python exec="on" result="text" session="getting-started/expressions" +print( + --8<-- "python/user-guide/basics/expressions.py:group_by" +) +``` + +{{code_block('user-guide/basics/expressions','group_by2',['group_by'])}} + +```python exec="on" result="text" session="getting-started/expressions" +print( + --8<-- "python/user-guide/basics/expressions.py:group_by2" +) +``` + +### Combination + +Below are some examples on how to combine operations to create the `DataFrame` you require. + +{{code_block('user-guide/basics/expressions','combine',['select','with_columns'])}} + +```python exec="on" result="text" session="getting-started/expressions" +--8<-- "python/user-guide/basics/expressions.py:combine" +``` + +{{code_block('user-guide/basics/expressions','combine2',['select','with_columns'])}} + +```python exec="on" result="text" session="getting-started/expressions" +--8<-- "python/user-guide/basics/expressions.py:combine2" +``` + +## Combining DataFrames + +There are two ways `DataFrame`s can be combined depending on the use case: join and concat. + +### Join + +Polars supports all types of join (e.g. left, right, inner, outer). Let's have a closer look on how to `join` two `DataFrames` into a single `DataFrame`. Our two `DataFrames` both have an 'id'-like column: `a` and `x`. We can use those columns to `join` the `DataFrames` in this example. 
+ +{{code_block('user-guide/basics/joins','join',['join'])}} + +```python exec="on" result="text" session="getting-started/joins" +--8<-- "python/user-guide/basics/joins.py:setup" +--8<-- "python/user-guide/basics/joins.py:join" +``` + +To see more examples with other types of joins, see the [Transformations section](transformations/joins.md) in the user guide. + +### Concat + +We can also `concatenate` two `DataFrames`. Vertical concatenation will make the `DataFrame` longer. Horizontal concatenation will make the `DataFrame` wider. Below you can see the result of an horizontal concatenation of our two `DataFrames`. + +{{code_block('user-guide/basics/joins','hstack',['hstack'])}} + +```python exec="on" result="text" session="getting-started/joins" +--8<-- "python/user-guide/basics/joins.py:hstack" +``` diff --git a/docs/user-guide/index.md b/docs/user-guide/index.md deleted file mode 100644 index 442029472d80..000000000000 --- a/docs/user-guide/index.md +++ /dev/null @@ -1,39 +0,0 @@ -# Introduction - -This user guide is an introduction to the [Polars DataFrame library](https://github.com/pola-rs/polars). -Its goal is to introduce you to Polars by going through examples and comparing it to other solutions. -Some design choices are introduced here. The guide will also introduce you to optimal usage of Polars. - -The Polars user guide is intended to live alongside the API documentation ([Python](https://docs.pola.rs/py-polars/html/reference/index.html) / [Rust](https://docs.rs/polars/latest/polars/)), which offers detailed descriptions of specific objects and functions. - -Even though Polars is completely written in [Rust](https://www.rust-lang.org/) (no runtime overhead!) and uses [Arrow](https://arrow.apache.org/) -- the [native arrow2 Rust implementation](https://github.com/jorgecarleitao/arrow2) -- as its foundation, the examples presented in this guide will be mostly using its higher-level language bindings. 
-Higher-level bindings only serve as a thin wrapper for functionality implemented in the core library. - -For [pandas](https://pandas.pydata.org/) users, our [Python package](https://pypi.org/project/polars/) will offer the easiest way to get started with Polars. - -### Philosophy - -The goal of Polars is to provide a lightning fast `DataFrame` library that: - -- Utilizes all available cores on your machine. -- Optimizes queries to reduce unneeded work/memory allocations. -- Handles datasets much larger than your available RAM. -- Has an API that is consistent and predictable. -- Has a strict schema (data-types should be known before running the query). - -Polars is written in Rust which gives it C/C++ performance and allows it to fully control performance critical parts -in a query engine. - -As such Polars goes to great lengths to: - -- Reduce redundant copies. -- Traverse memory cache efficiently. -- Minimize contention in parallelism. -- Process data in chunks. -- Reuse memory allocations. - -!!! rust "Note" - - The Rust examples in this guide are synchronized with the main branch of the Polars repository, rather than the latest Rust release. - You may not be able to copy-paste code examples and use them with the latest release. - We aim to solve this in the future. diff --git a/docs/user-guide/io/index.md b/docs/user-guide/io/index.md new file mode 100644 index 000000000000..5a3548871e8a --- /dev/null +++ b/docs/user-guide/io/index.md @@ -0,0 +1,12 @@ +# IO + +Reading and writing your data is crucial for a DataFrame library. In this chapter you will learn more on how to read and write to different file formats that are supported by Polars. 
+ +- [CSV](csv.md) +- [Excel](excel.md) +- [Parquet](parquet.md) +- [JSON](json.md) +- [Multiple](multiple.md) +- [Database](database.md) +- [Cloud storage](cloud-storage.md) +- [Google Big Query](bigquery.md) diff --git a/docs/user-guide/lazy/index.md b/docs/user-guide/lazy/index.md new file mode 100644 index 000000000000..be731390f09c --- /dev/null +++ b/docs/user-guide/lazy/index.md @@ -0,0 +1,10 @@ +# Lazy + +The Lazy chapter is a guide for working with `LazyFrames`. It covers how to use the lazy API and how to optimise your queries. You can also find more information about the query plan and gain more insight into the streaming capabilities. + +- [Using lazy API](using.md) +- [Optimisations](optimizations.md) +- [Schemas](schemas.md) +- [Query plan](query-plan.md) +- [Execution](execution.md) +- [Streaming](streaming.md) diff --git a/docs/user-guide/transformations/index.md b/docs/user-guide/transformations/index.md new file mode 100644 index 000000000000..cd673786643c --- /dev/null +++ b/docs/user-guide/transformations/index.md @@ -0,0 +1,8 @@ +# Transformations + +The focus of this section is to describe different types of data transformations and provide some examples on how to use them. 
+ +- [Joins](joins.md) +- [Concatenation](concatenation.md) +- [Pivot](pivot.md) +- [Melt](melt.md) diff --git a/mkdocs.yml b/mkdocs.yml index 9918d5c2e8f3..c26fdd20902e 100644 --- a/mkdocs.yml +++ b/mkdocs.yml @@ -1,24 +1,19 @@ # https://www.mkdocs.org/user-guide/configuration/ # Project information -site_name: Polars -site_url: https://docs.pola.rs +site_name: Polars User Guide +site_url: https://docs.pola.rs/ repo_url: https://github.com/pola-rs/polars repo_name: pola-rs/polars # Documentation layout nav: - - Home: index.md - - - User guide: - - user-guide/index.md + - User Guide: + - index.md + - user-guide/getting-started.md - user-guide/installation.md - - Basics: - - user-guide/basics/index.md - - user-guide/basics/reading-writing.md - - user-guide/basics/expressions.md - - user-guide/basics/joins.md - Concepts: + - user-guide/concepts/index.md - Data types: - user-guide/concepts/data-types/overview.md - user-guide/concepts/data-types/categoricals.md @@ -28,6 +23,7 @@ nav: - user-guide/concepts/lazy-vs-eager.md - user-guide/concepts/streaming.md - Expressions: + - user-guide/expressions/index.md - user-guide/expressions/operators.md - user-guide/expressions/column-selections.md - user-guide/expressions/functions.md @@ -43,6 +39,7 @@ nav: - user-guide/expressions/structs.md - user-guide/expressions/numpy.md - Transformations: + - user-guide/transformations/index.md - user-guide/transformations/joins.md - user-guide/transformations/concatenation.md - user-guide/transformations/pivot.md @@ -54,6 +51,7 @@ nav: - user-guide/transformations/time-series/resampling.md - user-guide/transformations/time-series/timezones.md - Lazy API: + - user-guide/lazy/index.md - user-guide/lazy/using.md - user-guide/lazy/optimizations.md - user-guide/lazy/schemas.md @@ -61,6 +59,7 @@ nav: - user-guide/lazy/execution.md - user-guide/lazy/streaming.md - IO: + - user-guide/io/index.md - user-guide/io/csv.md - user-guide/io/excel.md - user-guide/io/parquet.md @@ -134,6 +133,7 @@ theme: 
- navigation.tabs - navigation.tabs.sticky - navigation.footer + - navigation.indexes - content.tabs.link icon: repo: fontawesome/brands/github @@ -175,3 +175,10 @@ plugins: - material-plausible - macros: module_name: docs/_build/scripts/macro + - redirects: + redirect_maps: + 'user-guide/index.md': 'index.md' + 'user-guide/basics/index.md': 'user-guide/getting-started.md' + 'user-guide/basics/reading-writing.md': 'user-guide/getting-started.md' + 'user-guide/basics/expressions.md': 'user-guide/getting-started.md' + 'user-guide/basics/joins.md': 'user-guide/getting-started.md' \ No newline at end of file
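
Review note: the `redirect_maps` block added to `mkdocs.yml` can be sanity-checked with a small script. The sketch below is a hypothetical helper and not part of the repository. The two data structures are copied by hand from the hunk above (only the nav entries relevant to the redirects are listed), and the function name `find_problems` is an assumption made for illustration. It checks that every redirect target is still a nav page, that no redirect source remains in the nav, and that no target is itself redirected.

```python
# Hypothetical review helper; not part of the Polars repository.
# Both structures are copied by hand from the mkdocs.yml hunk in this diff.

# Nav pages that act as redirect targets after this change.
NAV_PAGES = {
    "index.md",
    "user-guide/getting-started.md",
    "user-guide/installation.md",
}

# Entries added under the `redirects` plugin (old page -> new page).
REDIRECT_MAP = {
    "user-guide/index.md": "index.md",
    "user-guide/basics/index.md": "user-guide/getting-started.md",
    "user-guide/basics/reading-writing.md": "user-guide/getting-started.md",
    "user-guide/basics/expressions.md": "user-guide/getting-started.md",
    "user-guide/basics/joins.md": "user-guide/getting-started.md",
}


def find_problems(redirects, nav_pages):
    """Return a list of human-readable problems found in a redirect map."""
    problems = []
    for old, new in redirects.items():
        if new not in nav_pages:
            problems.append(f"{old}: target {new} is not in the nav")
        if old in nav_pages:
            problems.append(f"{old}: source is still listed in the nav")
        if new in redirects:
            problems.append(f"{old}: target {new} is itself redirected")
    return problems


if __name__ == "__main__":
    issues = find_problems(REDIRECT_MAP, NAV_PAGES)
    for issue in issues:
        print(issue)
    print("redirect map looks consistent" if not issues else "redirect map has problems")
```

A check like this only covers the entries listed in the script; it does not read `mkdocs.yml` or confirm that the target files exist on disk.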