docs(blog): needle haystack post #8824

Merged Apr 12, 2024 (6 commits)
149 changes: 149 additions & 0 deletions docs/posts/varchar-in-a-haystack/index.qmd
@@ -0,0 +1,149 @@
---
title: "Varchar in a haystack"
author: "Tyler White"
error: false
date: "2024-03-28"
image: thumbnail.png
categories:
- blog
- data analysis
- puzzle
---

## The scenario

You're a data analyst, and a new ticket landed in your queue.

> Subject: Urgent: Data Discovery Needed for Critical Analysis
>
> Hi Data Team,
>
> I hope this message finds you well. I'm reaching out with an urgent request
> that directly impacts the company's most critical project. We need to locate
> a specific value within our database but do not know which column it's in.
> Unfortunately, we don't have documentation for this particular table. We are
> looking for the value "NEEDLE" in the table.
>
> We think it is in the X database, Y schema, and Z table. We appreciate your
> help with this urgent matter!

Whelp, let's give this a try.

## The table

To set up this particular problem, we can use pandas to create a table with 5
columns and 100 rows of random strings. We can then use the `at` accessor to
overwrite a single cell with the value "NEEDLE" to simulate what we need to find.

```{python}
#| code-fold: true
import pandas as pd
import random
import string
from ibis.interactive import *


def random_string(length=10):
    return "".join(
        random.choice(string.ascii_letters + string.digits) for _ in range(length)
    )


data = [[random_string() for _ in range(5)] for _ in range(100)]
column_names = [f"col{i+1}" for i in range(5)]
df = pd.DataFrame(data, columns=column_names)
df.at[42, 'col4'] = "NEEDLE"
t = ibis.memtable(df, name="Z")
```

```{python}
t
```

## The solution(s)

There are a few ways we could solve this.

#### Option 1: write SQL

We could always spell it out with SQL, including each column that we want to
check in the `WHERE` clause. In this scenario, we know each column is a varchar,
so we can check each one for the value "NEEDLE".

```sql
SELECT *
FROM Z
WHERE col1 = 'NEEDLE'
OR col2 = 'NEEDLE'
OR col3 = 'NEEDLE'
OR col4 = 'NEEDLE'
OR col5 = 'NEEDLE';
```

This can be time-consuming. You might want something a little more dynamic.

#### Option 2: write dynamic SQL

Dynamically constructing the SQL query at runtime can be more complex, but it
offers more flexibility, especially if we have more than five columns.

```sql
DO $$
DECLARE
    sql text;
    where_clause text := '';
BEGIN
    -- Build "col = 'NEEDLE' OR ..." from every string column of the table.
    SELECT INTO where_clause
        string_agg(quote_ident(column_name) || ' = ''NEEDLE''', ' OR ')
    FROM information_schema.columns
    WHERE table_name = 'Z'
      AND table_schema = 'public'
      AND data_type IN ('character varying', 'varchar', 'text', 'char');

    sql := 'SELECT *
            FROM Z
            WHERE ' || where_clause;

    -- Note: EXECUTE without an INTO clause discards the query results.
    EXECUTE sql;
END $$;
```

This can be difficult to troubleshoot, and it is easy to get lost in the quote
characters.
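
For comparison, here is a minimal sketch of the same "build the `WHERE` clause at runtime" idea driven from Python. It is an illustration under stated assumptions, not the approach the post uses: it runs against an in-memory SQLite table rather than PostgreSQL, the sample rows are made up, and `find_value` is a hypothetical helper name. Binding the search value as a parameter sidesteps most of the quote juggling.

```python
import sqlite3

# Assumption: an in-memory SQLite table stands in for the real "Z" table.
conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE "Z" (col1 TEXT, col2 TEXT, col3 TEXT, n INTEGER)')
conn.executemany(
    'INSERT INTO "Z" VALUES (?, ?, ?, ?)',
    [("a", "b", "c", 1), ("d", "NEEDLE", "f", 2), ("g", "h", "i", 3)],
)


def quote_ident(name):
    """Quote an identifier for SQLite by doubling embedded double quotes."""
    return '"' + name.replace('"', '""') + '"'


def find_value(conn, table, value):
    """Return rows where any TEXT column of `table` equals `value` (hypothetical helper)."""
    # PRAGMA table_info yields (cid, name, type, notnull, dflt_value, pk) per column.
    info = conn.execute(f"PRAGMA table_info({quote_ident(table)})").fetchall()
    text_columns = [row[1] for row in info if row[2].upper() == "TEXT"]
    if not text_columns:
        return []
    # One "col = ?" test per string column, joined with OR.
    where = " OR ".join(f"{quote_ident(c)} = ?" for c in text_columns)
    sql = f"SELECT * FROM {quote_ident(table)} WHERE {where}"
    # The value is bound once per placeholder instead of being spliced into the SQL.
    return conn.execute(sql, [value] * len(text_columns)).fetchall()


matches = find_value(conn, "Z", "NEEDLE")
print(matches)
```

The column list still has to be interpolated into the SQL text, which is why the identifiers are quoted explicitly; only the search value can be passed as a bound parameter.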

#### Option 3: use Ibis

We can make use of [`selectors`](../../reference/selectors.qmd)!

```{python}
expr = t.filter(s.if_any(s.of_type("string"), _ == "NEEDLE"))

expr
```

We can see the **NEEDLE** value hiding in `col4`.

## The explanation

`s.of_type("string")` selects the string columns, and `s.if_any()` combines
the per-column checks with `OR`. The `_ == "NEEDLE"` part is the condition
itself, applied to each selected column in turn.
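
To make the "select columns, then OR the checks together" idea concrete, here is a plain-Python sketch of the same logic over a list of dictionaries. It only illustrates the concept and says nothing about how Ibis implements selectors: `if_any` and `string_columns` below are hypothetical stand-ins written for this example, and the rows are made up.

```python
from functools import reduce
from operator import or_

# Hypothetical rows standing in for the table; each dict is one row.
rows = [
    {"col1": "abc", "col2": "def", "n": 1},
    {"col1": "ghi", "col2": "NEEDLE", "n": 2},
]


def string_columns(rows):
    """Pick the columns whose value in the first row is a string."""
    return [name for name, value in rows[0].items() if isinstance(value, str)]


def if_any(row, columns, predicate):
    """OR together predicate(value) over the selected columns of one row."""
    return reduce(or_, (predicate(row[c]) for c in columns), False)


cols = string_columns(rows)
# Keep a row when at least one selected column satisfies the condition.
found = [row for row in rows if if_any(row, cols, lambda v: v == "NEEDLE")]
print(found)
```

The integer column `n` is never compared against the string, which mirrors why the post restricts the check to string columns before applying the condition.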

Here's the SQL that was generated to help us find it, which is quite similar
to what we would have had to write if we had gone with [Option 1](#option-1-write-sql).

```{python}
#| echo: false
ibis.to_sql(expr)
```

## The conclusion

Now that we've found the `"NEEDLE"` value, we can provide the information to the
requester. Urgent requests like this require quick and precise responses.

This example shows how Ibis can simplify searching large datasets, even
undocumented ones.

Please get in touch with us on [GitHub](https://github.com/ibis-project) or
[Zulip](https://ibis-project.zulipchat.com/). We'd love to hear from you!
Binary file added docs/posts/varchar-in-a-haystack/thumbnail.png