From ff5e3eae2ee3ed97301b4f978ec1fbcdca46c723 Mon Sep 17 00:00:00 2001 From: Charles Frye Date: Fri, 24 Nov 2023 14:20:31 -0800 Subject: [PATCH] updates gantry df queries to match new tag API --- notebooks/lab08_monitoring.ipynb | 2632 +++++++++++++++--------------- 1 file changed, 1327 insertions(+), 1305 deletions(-) diff --git a/notebooks/lab08_monitoring.ipynb b/notebooks/lab08_monitoring.ipynb index 54c64d0..70beff2 100644 --- a/notebooks/lab08_monitoring.ipynb +++ b/notebooks/lab08_monitoring.ipynb @@ -1,1306 +1,1328 @@ { - "cells": [ - { - "cell_type": "markdown", - "metadata": { - "id": "7yQQTA9IGDt8" - }, - "source": [ - "" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "MX9n-Zed8G_T" - }, - "source": [ - "# Lab 08: Monitoring" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## What You Will Learn\n", - "\n", - "- How to add user feedback and model monitoring to a Gradio-based app\n", - "- How to analyze this logged information to uncover and debug model issues\n", - "- Just how large the gap between benchmark data and data from users can be, and what to do about it" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "45D6GuSwvT7d" - }, - "outputs": [], - "source": [ - "lab_idx = 8\n", - "\n", - "\n", - "if \"bootstrap\" not in locals() or bootstrap.run:\n", - " # path management for Python\n", - " pythonpath, = !echo $PYTHONPATH\n", - " if \".\" not in pythonpath.split(\":\"):\n", - " pythonpath = \".:\" + pythonpath\n", - " %env PYTHONPATH={pythonpath}\n", - " !echo $PYTHONPATH\n", - "\n", - " # get both Colab and local notebooks into the same state\n", - " !wget --quiet https://fsdl.me/gist-bootstrap -O bootstrap.py\n", - " import bootstrap\n", - " \n", - " %matplotlib inline\n", - "\n", - " # change into the lab directory\n", - " bootstrap.change_to_lab_dir(lab_idx=lab_idx)\n", - "\n", - " bootstrap.run = False # change to True to re-run setup\n", - " \n", - "!pwd\n", - "%ls"
- ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Follow along with a video walkthrough on YouTube:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "from IPython.display import IFrame\n", - "\n", - "\n", - "IFrame(src=\"https://fsdl.me/2022-lab-08-video-embed\", width=\"100%\", height=720)" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "Zvi49122ho0r" - }, - "source": [ - "# Basic user feedback with `gradio`" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "56y2r9IYkY7A" - }, - "source": [ - "On top of the basic health check and event logging\n", - "necessary for any distributed system\n", - "(provided for our application by\n", - "[AWS CloudWatch](https://aws.amazon.com/cloudwatch/),\n", - "which collects logs from EC2 and Lambda instances),\n", - "ML-powered applications need specialized monitoring solutions.\n", - "\n", - "In particular, we want to give users a way\n", - "to report issues or indicate their level of satisfaction\n", - "with the model.\n", - "\n", - "The UI-building framework we're using, `gradio`,\n", - "comes with user feedback collection built in, under the name \"flagging\"." - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "wXq4jcjCkNap" - }, - "source": [ - "To see how this works, we first spin up our front end,\n", - "pointed at the AWS Lambda backend,\n", - "as in\n", - "[the previous lab](https://fsdl.me/lab07-colab)."
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "rAZrYRnSiMER" - }, - "outputs": [], - "source": [ - "from app_gradio import app\n", - "\n", - "\n", - "lambda_url = \"https://3akxma777p53w57mmdika3sflu0fvazm.lambda-url.us-west-1.on.aws/\"\n", - "\n", - "backend = app.PredictorBackend(url=lambda_url)" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "STXn1XaHkU42" - }, - "source": [ - "And adding user feedback collection\n", - "is as easy as passing `flagging=True`.\n", - "\n", - "> The `flagging` argument is here being given to\n", - "code from the FSDL codebase, `app.make_frontend`,\n", - "which turns flagging on\n", - "in the underlying\n", - "`gradio.Interface`.\n", - "In between, our code has\n", - "a bit of extra logic\n", - "so that we can support\n", - "multiple different storage backends for logging flagged data.\n", - "" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Run the cell below to create a frontend\n", - "(accessible on a public Gradio URL and inside the notebook)\n", - "and observe the new \"flagging\" buttons underneath the outputs." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "Kgygx8d5ip9V" - }, - "outputs": [], - "source": [ - "frontend = app.make_frontend(fn=backend.run, flagging=True)\n", - "frontend.launch(share=True)" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "zV2tu8HTk242" - }, - "source": [ - "Click one of the buttons to trigger flagging.\n", - "\n", - "It doesn't need to be a legitimate issue with the model's outputs.\n", - "\n", - "Instead of just submitting one of the example images,\n", - "you might additionally use the image editor\n", - "(pencil button on uploaded images)\n", - "to crop it."
- ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "gJV79PDIk-4S" - }, - "source": [ - "Flagged data is stored on the server's local filesystem,\n", - "by default in the `flagged/` directory\n", - "as a `.csv` file:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "RbCcCxvHi2jh" - }, - "outputs": [], - "source": [ - "!ls flagged" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "Koh1SP9NlA6y" - }, - "source": [ - "We can load the `.csv` with `pandas`,\n", - "the Python library for handling tabular data." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "OJCnIsfEjC05" - }, - "outputs": [], - "source": [ - "from pathlib import Path\n", - "\n", - "import pandas as pd\n", - "\n", - "\n", - "log_path = Path(\"flagged\") / \"log.csv\"\n", - "\n", - "flagged_df = None\n", - "if log_path.exists():\n", - " flagged_df = pd.read_csv(log_path, quotechar=\"'\") # quoting can be painful for natural text data\n", - " flagged_df = flagged_df.dropna(subset=[\"Handwritten Text\"]) # drop any flags without an image\n", - "\n", - "flagged_df" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "KZieT-FgldKa" - }, - "source": [ - "Notice that richer data, like images, is stored with references --\n", - "here, the names of local files.\n", - "\n", - "This is a common pattern:\n", - "binary data doesn't go in the database,\n", - "only pointers to binary data.\n", - "\n", - "We can then read the data back to analyze our model." 
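The "pointer" pattern above can be sketched in a few lines of plain Python. This is a hypothetical illustration of what a CSV flagging logger does under the hood -- not gradio's actual implementation -- and the file names and fields are invented:

```python
import csv
import tempfile
from pathlib import Path

# Sketch of the "pointer" pattern: binary blobs go to files on disk,
# and the CSV log records only their paths plus the lightweight fields.
flag_dir = Path(tempfile.mkdtemp()) / "flagged"
flag_dir.mkdir()

def log_flag(image_bytes, output_text, flag):
    # write the binary payload to its own file...
    image_path = flag_dir / f"image_{flag}.png"
    image_path.write_bytes(image_bytes)
    # ...and append only a pointer to it, never the bytes themselves
    with (flag_dir / "log.csv").open("a", newline="") as f:
        csv.writer(f).writerow([str(image_path), output_text, flag])

log_flag(b"\x89PNG...", "three lines of text", "incorrect")
rows = list(csv.reader((flag_dir / "log.csv").open()))
print(rows[0][1])  # -> three lines of text
```

Reading the flagged example back then means following the pointer in the first column to the image file, as the next cell does with `read_image_pil`.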
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "gWG3T3Qql_99" - }, - "outputs": [], - "source": [ - "from IPython.display import display\n", - "\n", - "from text_recognizer.util import read_image_pil\n", - "\n", - "\n", - "if flagged_df is not None:\n", - " row = flagged_df.iloc[-1]\n", - " print(row[\"output\"])\n", - " display(read_image_pil(Path(\"flagged\") / row[\"Handwritten Text\"]))" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "0gIpfRMFl9_D" - }, - "source": [ - "We encourage you to play around with the model for a bit,\n", - "uploading your own images.\n", - "\n", - "This is an important step in understanding your model\n", - "and your domain --\n", - "especially when you're not yet familiar with the data types involved.\n", - "\n", - "But even when you are,\n", - "we expect you'll quickly find\n", - "that you run out of ideas\n", - "for different ways to probe your model.\n", - "\n", - "To really learn more about your model,\n", - "you'll need some actual users.\n", - "\n", - "In small projects,\n", - "these can be other team members who are less enmeshed\n", - "in the details of model development and data munging.\n", - "\n", - "But to create something that can appeal to a broader set of users,\n", - "you'll want to collect feedback from your potential userbase."
- ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "RHArpXNyRtg7" - }, - "source": [ - "# Debugging production models with `gantry`" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "hbGCYG0BmvdE" - }, - "source": [ - "Unfortunately, this aspect of model development\n", - "is particularly challenging to replicate in\n", - "a course setting, especially a MOOC --\n", - "where do these users come from?\n", - "\n", - "As part of the 2022 edition of the course, we've\n", - "[been running a text recognizer application](https://fsdl-text-recognizer.ngrok.io)\n", - "and collecting user feedback on it.\n", - "\n", - "Rather than saving user feedback data locally,\n", - "as with the CSV logger above,\n", - "we've been sending that data to\n", - "[Gantry](https://gantry.io/),\n", - "a model monitoring and continual learning tool.\n", - "\n", - "That's because local logging is a very bad idea:\n", - "as logs grow, the storage needs and read/write time grow,\n", - "which unduly burdens the frontend server.\n", - "\n", - "The `gradio` library supports logging of user-flagged data\n", - "to arbitrary backends via\n", - "`FlaggingCallback`s.\n", - "\n", - "So there are some new elements in the codebase:\n", - "most importantly, a `GantryImageToTextLogger`\n", - "that inherits from `gradio.FlaggingCallback`." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "pptT76DWmlB0" - }, - "outputs": [], - "source": [ - "from app_gradio import flagging\n", - "\n", - "\n", - "print(flagging.GantryImageToTextLogger.__init__.__doc__)" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "-3HevRM2YkbZ" - }, - "source": [ - "If we add this `Callback` to our setup --\n", - "and add a Gantry API key to our environment --\n", - "then we can start sending data to Gantry's service."
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "UHnIV0e_a9o6" - }, - "outputs": [], - "source": [ - "app.make_frontend??" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "jJcfaWNpRzJF" - }, - "source": [ - "The short version of how the logging works:\n", - "we upload flagged images to S3 for storage (`GantryImageToTextLogger._to_s3`)\n", - "and send the URL to Gantry along with the outputs (`GantryImageToTextLogger._to_gantry`)." - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "uviSZDTma1RT" - }, - "source": [ - "Below, we'll download that data\n", - "and look through it in the notebook,\n", - "using typical Python data analysis tools,\n", - "like `pandas` and `seaborn`.\n", - "\n", - "By analogy to\n", - "[EDA](https://en.wikipedia.org/wiki/Exploratory_data_analysis),\n", - "consider this an \"exploratory model analysis\"." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "LFxypmESXESL" - }, - "outputs": [], - "source": [ - "import gantry.query as gq\n", - "\n", - "\n", - "read_only_key = \"VpPfHPDSk9e9KKAgbiHBh7mqF_8\"\n", - "gq.init(api_key=read_only_key)\n", - "\n", - "gdf = gq.query( # we query Gantry's service with the following parameters:\n", - " application=\"fsdl-text-recognizer\", # which tracked application should we draw from?\n", - " tags={\"env\": \"dev\"}, # which tags (here, logging environment) should we filter to?\n", - " # what time period should we pull data from? 
here, the first two months the app was up\n", - " start_time=\"2022-07-01T07:00:00.000Z\",\n", - " end_time=\"2022-09-01T06:59:00.000Z\",\n", - ")\n", - "\n", - "raw_df = gdf.fetch()\n", - "df = raw_df.dropna(axis=\"columns\", how=\"all\") # remove any irrelevant columns\n", - "print(\"number of rows:\", len(df))\n", - "df = df.drop_duplicates(keep=\"first\", subset=\"inputs.image\") # remove repeated reports, eg of example images\n", - "print(\"number of unique rows:\", len(df))\n", - "\n", - "print(\"\\ncolumns:\")\n", - "df.columns" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "We'll walk through what each of these columns means,\n", - "but the three most important are the ones we logged directly from the application:\n", - "`flag`s, `input.image`s, and `output_text`." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "main_columns = [column for column in df.columns if \"(\" not in column] # derived columns have a \"function call\" in the name\n", - "main_columns" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "If you're interested in playing\n", - "around with the data yourself\n", - "in Gantry's UI,\n", - "as we do in the\n", - "[video walkthrough for the lab](https://fsdl.me/2022-lab-08-video),\n", - "you'll need a Gantry account.\n", - "\n", - "Gantry is currently in closed beta.\n", - "Unlike training or experiment management,\n", - "model monitoring and continual learning\n", - "is at the frontier of applied ML,\n", - "so tooling is just starting to roll out.\n", - "\n", - "FSDL students are invited to this beta and\n", - "[can create a \"read-only\" account here](https://gantry.io/fsdl-signup)\n", - "so they can view the data in the UI\n", - "and explore it themselves.\n", - "\n", - "As an early startup,\n", - "Gantry is very interested in feedback\n", - "from practitioners!\n", - "So if you do try out the Gantry UI,\n", - "send any 
impressions, bug reports, or ideas to\n", - "`support@gantry.io`\n", - "\n", - "This is also a chance for you\n", - "to influence the development\n", - "of a new tool that could one day\n", - "end up at the center of continual learning\n", - "workflows --\n", - "as when\n", - "[FSDL students in spring 2019 got a chance to be early users of W&B](https://www.youtube.com/watch?t=1468&v=Eiz1zcqrqw0&feature=youtu.be&ab_channel=FullStackDeepLearning)." - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "RmTFHvxHi4el" - }, - "source": [ - "## Basic stats and behavioral monitoring" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "We start by just getting some basic statistics.\n", - "\n", - "For example, we can get descriptive statistics for\n", - "the information we've logged." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "Fb3BMn7gfQRI" - }, - "outputs": [], - "source": [ - "df[\"feedback.flag\"].describe()" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "T9OseYhc1Q8i" - }, - "source": [ - "Note that the format we're working with is the `pandas.DataFrame` --\n", - "a standard format for tables in Python.\n", - "\n", - "`pandas` can be\n", - "[very tricky](https://github.com/chiphuyen/just-pandas-things).\n", - "\n", - "It's not so bad when doing exploratory analysis like this,\n", - "but take care when using it in production settings!\n", - "\n", - "If you'd like to learn more `pandas`,\n", - "[Brandon Rhodes's `pandas` tutorial from PyCon 2015](https://www.youtube.com/watch?v=5JnMutdy6Fw&ab_channel=PyCon2015)\n", - "is still one of the best introductions,\n", - "even though it's nearly a decade old." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "`pandas` objects support sampling with `.sample`,\n", - "which is useful for quick \"spot-checking\" of data." 
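The cleaning steps from the query cell above -- dropping all-null columns, then deduplicating on the image column -- can be tried on a toy frame. The rows and column values below are invented for illustration:

```python
import pandas as pd

# Toy stand-in for a fetched monitoring DataFrame (values invented).
toy_df = pd.DataFrame({
    "inputs.image": ["s3://bucket/a.png", "s3://bucket/a.png", "s3://bucket/b.png"],
    "outputs.output_text": ["hello", "hello", "world"],
    "unused.column": [None, None, None],  # an all-null column, as from an unused projection
})

toy_df = toy_df.dropna(axis="columns", how="all")       # drop columns with no data at all
toy_df = toy_df.drop_duplicates(subset="inputs.image")  # keep one report per unique image
print(len(toy_df), list(toy_df.columns))  # -> 2 ['inputs.image', 'outputs.output_text']
```

Deduplicating on `inputs.image` collapses repeated flags of the same picture, like the example images, into one row each.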
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "FZ5BRRqjc1Of" - }, - "outputs": [], - "source": [ - "df[\"feedback.flag\"].sample(10)" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "w3rZaYwSzu-D" - }, - "source": [ - "Unlike in many other kinds of applications,\n", - "toxic and offensive behavior is\n", - "one of the most critical potential issues with\n", - "many ML models,\n", - "from\n", - "[generative models like GPT-3](https://www.middlebury.edu/institute/sites/www.middlebury.edu.institute/files/2020-09/gpt3-article.pdf)\n", - "to even humble\n", - "[image labeling models](https://archive.nytimes.com/bits.blogs.nytimes.com/2015/07/01/google-photos-mistakenly-labels-black-people-gorillas/).\n", - "\n", - "So ML models, especially when newly deployed\n", - "or when encountering new user bases,\n", - "need careful supervision." - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "-CbdSz0hzze7" - }, - "source": [ - "We use a\n", - "[Gantry tool called Projections](https://docs.gantry.io/en/stable/guides/projections.html)\n", - "to apply the NLP models from the\n", - "[`detoxify` suite](https://github.com/unitaryai/detoxify),\n", - "which score text for features like obscenity and identity attacks,\n", - "to our model's outputs." - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "1Z4lsgRcpQql" - }, - "source": [ - "To get a quick plot of the resulting values,\n", - "we can use the `pandas` built-in interface\n", - "to `matplotlib`:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "9UbBg947fAsh" - }, - "outputs": [], - "source": [ - "df.plot(y=\"detoxify.obscene(outputs.output_text)\", kind=\"hist\");" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "qxiIXGf0pVd5" - }, - "source": [ - "Without context, this chart isn't super useful --\n", - "is a score of `obscene=0.12` bad?\n", - "\n", - "We need a baseline!"
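As a toy illustration of why a baseline matters -- with invented scores, not real values from the app -- a raw score only becomes interpretable next to the same statistic on a reference sample:

```python
import statistics

# Invented obscenity scores, for illustration only.
prod_scores = [0.01, 0.02, 0.12, 0.03, 0.02]  # production outputs
test_scores = [0.05, 0.20, 0.15, 0.10, 0.25]  # test-set baseline

# 0.12 looks alarming in isolation, but the production average sits
# below the baseline average, so there's no evidence of a regression.
print(statistics.mean(prod_scores) < statistics.mean(test_scores))  # -> True
```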
- ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "UbOeOkzQgBDE" - }, - "source": [ - "Once the model is stable in production,\n", - "we can compare the values across time --\n", - "grouping or filtering production data by timestamp.\n", - "\n", - "Here, for this first version of the model,\n", - "we compare the results here with the results on the test data,\n", - "which was also ingested with `gantry`." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "ooa-Al48f_au" - }, - "outputs": [], - "source": [ - "gdf = gq.query(\n", - " application=\"fsdl-text-recognizer\",\n", - " tags={\"env\": \"test\"}, # picks out the \"test\" environment\n", - " start_time=\"2022-08-12T02:15:00.000Z\",\n", - " end_time=\"2022-08-12T03:00:00.000Z\"\n", - ")\n", - "\n", - "raw_test_df = gdf.fetch()\n", - "test_df = raw_test_df.dropna(axis=\"columns\", how=\"all\") # remove any irrelevant columns\n", - "\n", - "test_df.sample(10) # show a sample" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "TssF7sSX1Q8k" - }, - "source": [ - "To compare the two `DataFrame`s,\n", - "we `concat`enate them together\n", - "and add in some metadata\n", - "identifying where the observations came from.\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "oXWqfOdfgi4o" - }, - "outputs": [], - "source": [ - "test_df[\"environment\"] = \"test\"\n", - "df[\"environment\"] = \"prod\"\n", - "\n", - "comparison_df = pd.concat([df, test_df])" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "From there, we can use grouping to calculate statistics of interest:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "stats = comparison_df.groupby(\"environment\").describe()\n", - "\n", - "stats[\"detoxify.obscene(outputs.output_text)\"]" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "2G2tVhhY1Q8k" - }, - 
"source": [ - "These descriptive statistics are helpful,\n", - "but as with our simple plot above,\n", - "we want to _look_ at the data.\n", - "\n", - "Exploratory data analysis is typically very visual --\n", - "the goal is to find phenomena so obvious\n", - "that statistical testing is an afterthought --\n", - "and so is exploratory model analysis.\n", - "\n", - "`matplotlib` is based on plotting arrays,\n", - "rather than `DataFrame`s or other tabular data,\n", - "so it's not a great fit on its own here,\n", - "unless we want to tolerate a lot of boilerplate.\n", - "\n", - "`pandas` has basic built-in plotting\n", - "that interfaces with `matplotlib`,\n", - "but it's not that ergonomic for comparisons or flexible\n", - "without just dropping back to matplotlib.\n", - "\n", - "There are a number of other Python plotting libraries,\n", - "many with an emphasis on share-ability and interaction\n", - "([Vega-Altair](https://altair-viz.github.io/),\n", - "[`bokeh`](http://bokeh.org/),\n", - "and\n", - "[Plotly](https://plotly.com/),\n", - "to name a few)\n", - "and others with an emphasis on usability\n", - "(e.g. [`ggplot`](https://realpython.com/ggplot-python/)).\n", - "\n", - "The one that we like for in-notebook analysis\n", - "that balances ease of use\n", - "on tabular data with flexibility is\n", - "[`seaborn`](https://seaborn.pydata.org/)." 
- ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "7nZV8uoG1Q8k" - }, - "source": [ - "Comparing the distributions of the `detoxify.obscene` metric\n", - "is a single function call:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "WnGxCz1f1Q8k" - }, - "outputs": [], - "source": [ - "import seaborn as sns\n", - "\n", - "\n", - "sns.displot( # plot the dis-tribution\n", - " data=comparison_df, # of data from this df\n", - " # specifically, this column, along the x-axis\n", - " x=\"detoxify.obscene(outputs.output_text)\",\n", - " # and split it up (in color/hue) by this column\n", - " hue=\"environment\"\n", - ");" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "We can quickly see that the obscenity scores according to `detoxify`\n", - "are generally lower in our `prod`uction environment,\n", - "so we don't have a reason to suspect\n", - "our model is behaving too badly in production\n", - "-- though see the exercises for more on this!\n", - "\n", - "We can see the same thing\n", - "without having to write query, cleaning, and plotting code\n", - "[in the Gantry UI here](https://app.gantry.io/applications/fsdl-text-recognizer/distribution?view=2022-class&compare=test-ingest) --\n", - "note that viewing the dashboard requires a Gantry account,\n", - "which you can sign up for\n", - "[here](https://gantry.io/fsdl-signup)." - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "iKZ0l2MCjlDn" - }, - "source": [ - "## Debugging the Text Recognizer" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "ovp8fZ1GpUet" - }, - "source": [ - "In our application,\n", - "we don't have user corrections or labels from annotators,\n", - "so we can't calculate an accuracy, a loss, or a character error rate.\n", - "\n", - "We instead look for signals that are correlated with\n", - "those values.\n", - "\n", - "This approach has limits\n", - "(see, e.g. 
the analysis in the\n", - "[MLDeMon paper](https://arxiv.org/abs/2104.13621))\n", - "and setting alerts or test failures on things that are only correlated with,\n", - "rather than directly caused by, poor performance is a bad idea.\n", - "\n", - "But it's very useful to have this information logged\n", - "to catch large errors at a glance\n", - "or to provide tools for slicing, filtering, and grouping data\n", - "while doing exploratory model analysis or debugging." - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "0YauDrY51Q8l" - }, - "source": [ - "We can also compute these signals with Gantry Projections.\n", - "\n", - "Low entropy (e.g. repetition) is a failure mode of language models,\n", - "as is excessively high entropy (e.g. uniformly random text).\n", - "\n", - "We can review the output text entropy distributions in\n", - "production and during testing\n", - "by plotting them against one another\n", - "(here or\n", - "[in the Gantry UI](https://app.gantry.io/applications/fsdl-text-recognizer/distribution?view=2022-class&compare=test-ingest))." 
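Presumably the `text_stats.basics.entropy` projection computes something like the Shannon entropy of the character distribution -- that's an assumption on our part, but a minimal stdlib sketch of that quantity builds the right intuition:

```python
import math
from collections import Counter

def char_entropy(text):
    """Shannon entropy (in bits) of the character distribution of `text`."""
    counts = Counter(text)
    total = len(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

# Repetitive strings score low; varied English scores higher.
print(char_entropy("aaaaaaaaaa") == 0.0)  # -> True
print(char_entropy("abcd"))               # -> 2.0
print(char_entropy("the quick brown fox") > char_entropy("aaaaaaaaaa"))  # -> True
```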
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "czepR9o7l2FO" - }, - "outputs": [], - "source": [ - "sns.displot(\n", - " data=comparison_df,\n", - " x=\"text_stats.basics.entropy(outputs.output_text)\",\n", - " hue=\"environment\"\n", - ");" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "8LiFvkoR1Q8l" - }, - "source": [ - "It appears there are more low-entropy strings in the model's outputs in production.\n", - "\n", - "With models that operate on human-relevant data,\n", - "like text and images,\n", - "it's important to look at the raw data,\n", - "not just projections.\n", - "\n", - "Let's take a look at a sample of outputs from the model running on test data:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "FQ9kTz2ZmOwR" - }, - "outputs": [], - "source": [ - "test_df[\"outputs.output_text\"].sample(10)" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "BpZ_35uD1Q8l" - }, - "source": [ - "The results are not incredible, but they are recognizably \"English with typos\"." - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "NVlj3vYf1Q8l" - }, - "source": [ - "Let's look specifically at low entropy examples from production\n", - "(we can also view this\n", - "[filtered data in the Gantry UI](https://app.gantry.io/applications/fsdl-text-recognizer/data?view=2022-class-low-entropy&compare=test-ingest))." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "p0dkx1VzoJ9C" - }, - "outputs": [], - "source": [ - "df.loc[df[\"text_stats.basics.entropy(outputs.output_text)\"] < 5][\"outputs.output_text\"].sample(10)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Yikes! Lots of repetitive gibberish." 
- ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "stStBoCZ1Q8m" - }, - "source": [ - "Knowing the outputs are bad,\n", - "there are two culprits:\n", - "the input-output mapping (aka the model)\n", - "or the inputs." - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "nFaGYnjcmKf6" - }, - "source": [ - "We ran the same model in a similar environment\n", - "to get those outputs,\n", - "so it's most likely due to some difference in the inputs.\n", - "\n", - "Let's check them!\n", - "\n", - "We added Gantry Projections to look at the distribution of pixel values as well." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "uSwnexFRlaIV" - }, - "outputs": [], - "source": [ - "sns.displot(\n", - " data=comparison_df,\n", - " x=\"image.greyscale_image_mean(inputs.image)\",\n", - " hue=\"environment\"\n", - ");" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "iqkWkM45yMgV" - }, - "source": [ - "There's a huge difference in mean pixel values --\n", - "almost all images have mean intensities that are very dark in the testing environment,\n", - "but we see both dark and light images in production.\n", - "\n", - "Reviewing the\n", - "[raw data in Gantry](https://app.gantry.io/applications/fsdl-text-recognizer/data?view=2022-class-low-entropy&compare=test-ingest)\n", - "confirms that we are getting images with very different brightnesses in production\n", - "and whiffing the predictions\n", - "-- along with images that reveal a number of other interesting failure modes." - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "X5uWeR6n1Q8m" - }, - "source": [ - "To take a look locally,\n", - "we'll need to pull the images down from S3,\n", - "where they are stored." 
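For intuition, `image.greyscale_image_mean` presumably reduces each image to its average pixel intensity -- again an assumption about the projection. A pure-Python sketch on toy 0-255 arrays (pixel values invented):

```python
def greyscale_mean(image):
    """Average pixel intensity of an image given as rows of 0-255 values."""
    pixels = [p for row in image for p in row]
    return sum(pixels) / len(pixels)

dark_background = [[10, 20], [0, 30]]        # light-on-dark, like our training data
light_background = [[245, 235], [255, 225]]  # dark-on-light, common in production

print(greyscale_mean(dark_background))   # -> 15.0
print(greyscale_mean(light_background))  # -> 240.0
```

A low mean flags a mostly dark image, a high mean a mostly light one -- exactly the split we see between the `test` and `prod` histograms.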
- ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "NbNMlevz1Q8m" - }, - "source": [ - "The cell below defines a quick utility for\n", - "reading from S3 without authentication.\n", - "\n", - "It is based on the `smart_open` and `boto3` libraries,\n", - "which we briefly saw in the\n", - "[model deployment lab](https://fsdl.me/lab07-colab)\n", - "and the\n", - "[data annotation lab](https://fsdl.me/lab06-colab)." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "-FNIm0MOovtu" - }, - "outputs": [], - "source": [ - "import boto3\n", - "from botocore import UNSIGNED\n", - "from botocore.config import Config\n", - "import smart_open\n", - "\n", - "from text_recognizer.util import read_image_pil_file\n", - "\n", - "# spin up a client for communicating with s3 without authenticating (\"UNSIGNED\" activity)\n", - "s3 = boto3.client('s3', config=Config(signature_version=UNSIGNED))\n", - "unsigned_params = {\"client\": s3}\n", - "\n", - "def read_image_unsigned(image_uri, grayscale=False):\n", - " with smart_open.open(image_uri, \"rb\", transport_params=unsigned_params) as image_file:\n", - " return read_image_pil_file(image_file, grayscale)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Run the cell below to repeatedly sample a random input/output pair\n", - "flagged in production." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "Xy90rzcWobuk" - }, - "outputs": [], - "source": [ - "row = df.sample().iloc[0]\n", - "print(\"image url:\", row[\"inputs.image\"])\n", - "print(\"prediction:\", row[\"outputs.output_text\"])\n", - "read_image_unsigned(row[\"inputs.image\"])" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "oFdT2W2xtOGx" - }, - "source": [ - "### Take-aways for developing models\n", - "\n", - "The most immediate take-away from reviewing just a few examples is that\n", - "user data is way more heterogeneous than train/val/test data!\n", - "\n", - "This is a\n", - "[fairly](https://browsee.io/blog/a-guide-to-session-replays-for-product-managers/)\n", - "[universal](https://medium.com/@beasles/edge-case-responsive-design-9b610138ddbd)\n", - "[finding](https://quoteinvestigator.com/2021/05/04/no-plan/).\n", - "\n", - "Let's also consider some specific failure modes in our case\n", - "and how we might resolve them:\n", - "\n", - "- Failure mode: Users mostly provide images with dark text on light background, but we train on dark background.\n", - " - Resolution: We could check image brightness and invert the image if needed,\n", - " but this feels like a cop-out -- most text is dark on a light background!
\n", - " - Resolution: We add image brightness inversion to our train-time augmentations.\n", - "- Failure mode: Users expect our \"handwritten text recognition\" tool to work with printed and digital text.\n", - " - Resolution: We could try better sign-posting and user education,\n", - " but this is also something of a cop-out.\n", - " Users expect the tool to work on all text,\n", - " so we shouldn't violate that expectation.\n", - " - Resolution: We synthesize digital text data --\n", - " text rendering is a feature of just about any mature programming language.\n", - "- Failure mode: Users provide text on heterogeneous backgrounds.\n", - " - Resolution: We collect or synthesize more heterogeneous data,\n", - " e.g. placing text (with or without background coloring)\n", - " on top of random image backgrounds.\n", - "- Failure mode: Users provide text with characters and symbols outside of our dictionary.\n", - " - Resolution: We can expand the model outputs and collect more heterogeneous data.\n", - "- Failure mode: Users provide images with multiple blocks of text.\n", - " - Resolution: We develop an architecture/task definition that can handle multiple regions.\n", - " We'll need to collect and/or synthesize data to support it." - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "9rQH6zI8u7WN" - }, - "source": [ - "Notice: these are almost entirely changes to data,\n", - "and most of them involve collecting more or synthesizing it.\n", - "\n", - "This is entirely typical!\n", - "\n", - "Data drives improvements to models,\n", - "[even at scale](https://www.lesswrong.com/posts/6Fpvch8RR29qLEWNH/chinchilla-s-wild-implications)."
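The brightness-inversion augmentation mentioned in the failure modes above might look like this as a pure-Python sketch. A real pipeline would more likely use a library transform such as torchvision's `RandomInvert`; the function here is hypothetical:

```python
import random

def random_invert(image, p=0.5, rng=random):
    """With probability p, flip dark-on-light pixels to light-on-dark
    (and vice versa) by reflecting each 0-255 value around the midpoint."""
    if rng.random() < p:
        return [[255 - pixel for pixel in row] for row in image]
    return image

image = [[0, 128], [255, 64]]
augmented = random_invert(image, p=1.0)  # p=1.0 forces the inversion
print(augmented)  # -> [[255, 127], [0, 191]]
```

Applied with some probability during training, this exposes the model to both polarities without changing the labels.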
- ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Take-aways for exploratory model analysis" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "mfMf1wwR1Q8n" - }, - "source": [ - "Notice that we had to write a lot of code,\n", - "developed and run in a\n", - "tight interactive loop.\n", - "\n", - "This type of code is very hard to turn into scripts --\n", - "how do you trigger an alert on a plot? --\n", - "which makes it brittle and hard to version and share.\n", - "\n", - "It's also based on possibly very large-scale data artifacts.\n", - "\n", - "The right tool for this job is a UI\n", - "on top of a database.\n", - "\n", - "In the\n", - "[video walkthrough for this lab](https://fsdl.me/2022-lab-08-video),\n", - "we do effectively the same analysis,\n", - "but inside Gantry,\n", - "which makes the process more fluid.\n", - "\n", - "Gantry is still in closed beta,\n", - "but if you're interested in applying it to your own applications, you can\n", - "[join the waitlist](https://gantry.io/waitlist/)." - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "M73gui0XhgCl" - }, - "source": [ - "# Exercises" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "mWWrmGiThhMw" - }, - "source": [ - "### 🌟 Examine the test data strings, both output and ground truth."
- ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "km0nv0Mghmd_" - }, - "source": [ - "We compared our production obscenity metric to the test-time values of that same metric\n", - "and determined that we had not gotten worse,\n", - "so things were fine.\n", - "\n", - "But what if the test-time baseline is bad?\n", - "\n", - "Review the raw test ground truth data\n", - "[here](https://app.gantry.io/applications/fsdl-text-recognizer/data?view=test-ingest),\n", - "if you\n", - "[signed up a Gantry account](https://gantry.io/fsdl-signup),\n", - "or by looking at the contents of `test_df` above.\n", - "\n", - "Sort by `detoxify.identity_attack(feedback.ground_truth_string)`\n", - "or filter to only high values of that metric.\n", - "\n", - "Review the example `feedback.ground_truth_string` texts and consider:\n", - "is this the subset of English\n", - "we want the model to be training on?\n", - "what objections might be raised to the contents?\n", - "\n", - "You might also look for cases where the `detoxify` models misunderstood meaning --\n", - "e.g. an innocuous use of a word that's often used objectionably." - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "1Q6mWRwS1Q8t" - }, - "source": [ - "### 🌟🌟 Start building \"regression testing suites\" by doing error analysis on these examples." 
- ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "jfsCnjCg1Q8t" - }, - "source": [ - "Do this by going through feedback data one image/text pair at a time --\n", - "[in Gantry](https://app.gantry.io/applications/fsdl-text-recognizer/data?view=2022-class-low-entrop)\n", - "or inside this notebook.\n", - "\n", - "Start by just taking notes on each example\n", - "(anywhere -- Google Sheets/Excel/Notion, or just a sheet of paper).\n", - "\n", - "The primary question you should ask is:\n", - "how does this example differ from the data shown in training?\n", - "\n", - "Check\n", - "[this W&B Artifact page](https://wandb.ai/cfrye59/fsdl-text-recognizer-2021-training/artifacts/run_table/run-1vrnrd8p-trainpredictions/v194/files/train/predictions.table.json#f5854c9c18f6c24a4e99)\n", - "to see what training data\n", - "(including augmentation)\n", - "looks like.\n", - "\n", - "Once you have some notes,\n", - "try and formalize them into a small number of \"failure modes\" --\n", - "you can choose to align them with the failure modes described in the section\n", - "on take-aways for model development or not.\n", - "\n", - "If you want to finish the loop,\n", - "you might set up Label Studio, as in\n", - "[the data annotation lab](https://fsdl.me/lab06-colab).\n", - "An annotator should add at least a\n", - "\"label\" that gives the type of issue\n", - "and perhaps also add a text annotation\n", - "while they are at it." 
- ] - } - ], - "metadata": { - "colab": { - "collapsed_sections": [], - "private_outputs": true, - "provenance": [], - "toc_visible": true - }, - "gpuClass": "standard", - "kernelspec": { - "display_name": "Python 3", - "language": "python", - "name": "python3" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.7.13" - }, - "vscode": { - "interpreter": { - "hash": "0f056848cf5d2396a4970b625f23716aa539c2ff5334414c1b5d98d7daae66f6" - } - } - }, - "nbformat": 4, - "nbformat_minor": 1 -} + "cells": [ + { + "cell_type": "markdown", + "metadata": { + "id": "7yQQTA9IGDt8" + }, + "source": [ + "" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "MX9n-Zed8G_T" + }, + "source": [ + "# Lab 08: Monitoring" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "tv8O0V0EV09z" + }, + "source": [ + "## What You Will Learn\n", + "\n", + "- How to add user feedback and model monitoring to a Gradio-based app\n", + "- How to analyze this logged information to uncover and debug model issues\n", + "- Just how large the gap between benchmark data and data from users can be, and what to do about it" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "45D6GuSwvT7d" + }, + "outputs": [], + "source": [ + "lab_idx = 8\n", + "\n", + "\n", + "if \"bootstrap\" not in locals() or bootstrap.run:\n", + " # path management for Python\n", + " pythonpath, = !echo $PYTHONPATH\n", + " if \".\" not in pythonpath.split(\":\"):\n", + " pythonpath = \".:\" + pythonpath\n", + " %env PYTHONPATH={pythonpath}\n", + " !echo $PYTHONPATH\n", + "\n", + " # get both Colab and local notebooks into the same state\n", + " !wget --quiet https://fsdl.me/gist-bootstrap -O bootstrap.py\n", + " import bootstrap\n", + "\n", + " %matplotlib inline\n", + "\n", + " # 
change into the lab directory\n", + " bootstrap.change_to_lab_dir(lab_idx=lab_idx)\n", + "\n", + " bootstrap.run = False # change to True to re-run setup\n", + "\n", + "!pwd\n", + "%ls" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "cUdTJE54V09z" + }, + "source": [ + "### Follow along with a video walkthrough on YouTube:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "4J9hDxNsV09z" + }, + "outputs": [], + "source": [ + "from IPython.display import IFrame\n", + "\n", + "\n", + "IFrame(src=\"https://fsdl.me/2022-lab-08-video-embed\", width=\"100%\", height=720)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Zvi49122ho0r" + }, + "source": [ + "# Basic user feedback with `gradio`" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "56y2r9IYkY7A" + }, + "source": [ + "On top of the basic health check and event logging\n", + "necessary for any distributed system\n", + "(provided for our application by\n", + "[AWS CloudWatch](https://aws.amazon.com/cloudwatch/),\n", + "which collects logs from our EC2 instances and Lambda functions),\n", + "ML-powered applications need specialized monitoring solutions.\n", + "\n", + "In particular, we want to give users a way\n", + "to report issues or indicate their level of satisfaction\n", + "with the model.\n", + "\n", + "The UI-building framework we're using, `gradio`,\n", + "comes with user feedback built in, under the name \"flagging\"." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "wXq4jcjCkNap" + }, + "source": [ + "To see how this works, we first spin up our front end,\n", + "pointed at the AWS Lambda backend,\n", + "as in\n", + "[the previous lab](https://fsdl.me/lab07-colab)."
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "rAZrYRnSiMER" + }, + "outputs": [], + "source": [ + "from app_gradio import app\n", + "\n", + "\n", + "lambda_url = \"https://3akxma777p53w57mmdika3sflu0fvazm.lambda-url.us-west-1.on.aws/\"\n", + "\n", + "backend = app.PredictorBackend(url=lambda_url)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "STXn1XaHkU42" + }, + "source": [ + "And adding user feedback collection\n", + "is as easy as passing `flagging=True`.\n", + "\n", + "> Here, the `flagging` argument is passed to\n", + "code from the FSDL codebase, `app.make_frontend`,\n", + "but `gradio.Interface` accepts\n", + "a similar argument directly\n", + "(named `allow_flagging` in recent versions).\n", + "In between, our code adds\n", + "a bit of extra logic\n", + "so that we can support\n", + "multiple different storage backends for logging flagged data.\n", + "" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "mxZQRklXV091" + }, + "source": [ + "Run the cell below to create a frontend\n", + "(accessible on a public Gradio URL and inside the notebook)\n", + "and observe the new \"flagging\" buttons underneath the outputs." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "Kgygx8d5ip9V" + }, + "outputs": [], + "source": [ + "frontend = app.make_frontend(fn=backend.run, flagging=True)\n", + "frontend.launch(share=True)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "zV2tu8HTk242" + }, + "source": [ + "Click one of the buttons to trigger flagging.\n", + "\n", + "It doesn't need to be a legitimate issue with the model's outputs.\n", + "\n", + "Instead of just submitting one of the example images,\n", + "you might additionally use the image editor\n", + "(pencil button on uploaded images)\n", + "to crop it."
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "gJV79PDIk-4S" + }, + "source": [ + "Flagged data is stored on the server's local filesystem,\n", + "by default in the `flagged/` directory\n", + "as a `.csv` file:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "RbCcCxvHi2jh" + }, + "outputs": [], + "source": [ + "!ls flagged" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Koh1SP9NlA6y" + }, + "source": [ + "We can load the `.csv` with `pandas`,\n", + "the Python library for handling tabular data." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "OJCnIsfEjC05" + }, + "outputs": [], + "source": [ + "from pathlib import Path\n", + "\n", + "import pandas as pd\n", + "\n", + "\n", + "log_path = Path(\"flagged\") / \"log.csv\"\n", + "\n", + "flagged_df = None\n", + "if log_path.exists():\n", + " flagged_df = pd.read_csv(log_path, quotechar=\"'\") # quoting can be painful for natural text data\n", + " flagged_df = flagged_df.dropna(subset=[\"Handwritten Text\"]) # drop any flags without an image\n", + "\n", + "flagged_df" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "KZieT-FgldKa" + }, + "source": [ + "Notice that richer data, like images, is stored with references --\n", + "here, the names of local files.\n", + "\n", + "This is a common pattern:\n", + "binary data doesn't go in the database,\n", + "only pointers to binary data.\n", + "\n", + "We can then read the data back to analyze our model." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "gWG3T3Qql_99" + }, + "outputs": [], + "source": [ + "from IPython.display import display\n", + "\n", + "from text_recognizer.util import read_image_pil\n", + "\n", + "\n", + "if flagged_df is not None:\n", + " row = flagged_df.iloc[-1]\n", + " print(row[\"output\"])\n", + " display(read_image_pil(Path(\"flagged\") / row[\"Handwritten Text\"]))" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "0gIpfRMFl9_D" + }, + "source": [ + "We encourage you to play around with the model for a bit,\n", + "uploading your own images.\n", + "\n", + "This is an important step in understanding your model\n", + "and your domain --\n", + "especially when you're not yet familiar with the data types involved.\n", + "\n", + "But even when you are,\n", + "we expect you'll quickly find\n", + "that you run out of ideas\n", + "for different ways to probe your model.\n", + "\n", + "To really learn more about your model,\n", + "you'll need some actual users.\n", + "\n", + "In small projects,\n", + "these can be other team members who are less enmeshed\n", + "in the details of model development and data munging.\n", + "\n", + "But to create something that can appeal to a broader set of users,\n", + "you'll want to collect feedback from your potential userbase."
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "RHArpXNyRtg7" + }, + "source": [ + "# Debugging production models with `gantry`" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "hbGCYG0BmvdE" + }, + "source": [ + "Unfortunately, this aspect of model development\n", + "is particularly challenging to replicate in\n", + "a course setting, especially a MOOC --\n", + "where do these users come from?\n", + "\n", + "As part of the 2022 edition of the course, we've\n", + "[been running a text recognizer application](https://fsdl-text-recognizer.ngrok.io)\n", + "and collecting user feedback on it.\n", + "\n", + "Rather than saving user feedback data locally,\n", + "as with the CSV logger above,\n", + "we've been sending that data to\n", + "[Gantry](https://gantry.io/),\n", + "a model monitoring and continual learning tool.\n", + "\n", + "That's because local logging is a very bad idea:\n", + "as logs grow, the storage needs and read/write time grow,\n", + "which unduly burdens the frontend server.\n", + "\n", + "The `gradio` library supports logging of user-flagged data\n", + "to arbitrary backends via\n", + "`FlaggingCallback`s.\n", + "\n", + "So there are some new elements to the codebase:\n", + "most importantly here, a `GantryImageToTextLogger`\n", + "that inherits from `gradio.FlaggingCallback`." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "pptT76DWmlB0" + }, + "outputs": [], + "source": [ + "from app_gradio import flagging\n", + "\n", + "\n", + "print(flagging.GantryImageToTextLogger.__init__.__doc__)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "-3HevRM2YkbZ" + }, + "source": [ + "If we add this `Callback` to our setup --\n", + "and add a Gantry API key to our environment --\n", + "then we can start sending data to Gantry's service."
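For intuition about what such a callback involves, here is a minimal stand-alone sketch of a flagging logger that appends one JSON record per flag to a `.jsonl` file. The method names (`setup`, `flag`) mirror `gradio`'s `FlaggingCallback` interface, but exact signatures vary across `gradio` versions -- treat this as an illustration of the pattern, not a drop-in replacement for `GantryImageToTextLogger`:

```python
import json
import time
from pathlib import Path


class JSONLFlaggingLogger:
    """Appends one JSON record per flagged sample -- a stand-in for a real backend."""

    def setup(self, components, flagging_dir):
        # called once at launch; `components` are the UI elements being logged
        self.path = Path(flagging_dir) / "flags.jsonl"
        self.path.parent.mkdir(parents=True, exist_ok=True)
        # store component labels so each flagged value gets a readable key
        self.labels = [getattr(c, "label", f"component_{i}") for i, c in enumerate(components)]

    def flag(self, flag_data, flag_option=None, flag_index=None, username=None):
        # called on each flag-button press; `flag_data` holds the input and output values
        record = dict(zip(self.labels, flag_data))
        record["flag"] = flag_option
        record["timestamp"] = time.time()
        with self.path.open("a") as f:
            f.write(json.dumps(record) + "\n")
        return len(self.path.read_text().splitlines())  # number of flags logged so far
```

A networked backend like Gantry replaces the file append with an API call (plus an S3 upload for the image), but the control flow is the same.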
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "UHnIV0e_a9o6" + }, + "outputs": [], + "source": [ + "app.make_frontend??" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "jJcfaWNpRzJF" + }, + "source": [ + "The short version of how the logging works:\n", + "we upload flagged images to S3 for storage (`GantryImageToTextLogger._to_s3`)\n", + "and send the URL to Gantry along with the outputs (`GantryImageToTextLogger._to_gantry`)." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "uviSZDTma1RT" + }, + "source": [ + "Below, we'll download that data\n", + "and look through it in the notebook,\n", + "using typical Python data analysis tools,\n", + "like `pandas` and `seaborn`.\n", + "\n", + "By analogy to\n", + "[EDA](https://en.wikipedia.org/wiki/Exploratory_data_analysis),\n", + "consider this an \"exploratory model analysis\"." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "LFxypmESXESL" + }, + "outputs": [], + "source": [ + "import gantry.query as gq\n", + "\n", + "\n", + "read_only_key = \"VpPfHPDSk9e9KKAgbiHBh7mqF_8\"\n", + "gq.init(api_key=read_only_key)\n", + "\n", + "gdf = gq.query( # we query Gantry's service with the following parameters:\n", + " application=\"fsdl-text-recognizer\", # which tracked application should we draw from?\n", + " # what time period should we pull data from? 
here, the first two months the app was up\n", + " start_time=\"2022-07-01T07:00:00.000Z\",\n", + " end_time=\"2022-09-01T06:59:00.000Z\",\n", + ")\n", + "\n", + "raw_df = gdf.fetch()\n", + "df = raw_df.dropna(axis=\"columns\", how=\"all\") # remove any irrelevant columns\n", + "df = df[df[\"tags.env\"] == \"dev\"] # filter down to info logged from the development environment\n", + "print(\"number of rows:\", len(df))\n", + "df = df.drop_duplicates(keep=\"first\", subset=\"inputs.image\") # remove repeated reports, eg of example images\n", + "print(\"number of unique rows:\", len(df))\n", + "\n", + "print(\"\\ncolumns:\")\n", + "df.columns" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "bN6YNmnCV094" + }, + "source": [ + "We'll walk through what each of these columns means,\n", + "but the three most important are the ones we logged directly from the application:\n", + "`flag`s, `input.image`s, and `output_text`." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "c8SEwiAXV094" + }, + "outputs": [], + "source": [ + "main_columns = [column for column in df.columns if \"(\" not in column] # derived columns have a \"function call\" in the name\n", + "main_columns" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "i8HfH-BIV094" + }, + "source": [ + "If you're interested in playing\n", + "around with the data yourself\n", + "in Gantry's UI,\n", + "as we do in the\n", + "[video walkthrough for the lab](https://fsdl.me/2022-lab-08-video),\n", + "you'll need a Gantry account.\n", + "\n", + "Gantry is currently in closed beta.\n", + "Unlike training or experiment management,\n", + "model monitoring and continual learning\n", + "is at the frontier of applied ML,\n", + "so tooling is just starting to roll out.\n", + "\n", + "FSDL students are invited to this beta and\n", + "[can create a \"read-only\" account here](https://gantry.io/fsdl-signup)\n", + "so they can view the data in the UI\n", + "and explore it 
themselves.\n", + "\n", + "As an early startup,\n", + "Gantry is very interested in feedback\n", + "from practitioners!\n", + "So if you do try out the Gantry UI,\n", + "send any impressions, bug reports, or ideas to\n", + "`support@gantry.io`.\n", + "\n", + "This is also a chance for you\n", + "to influence the development\n", + "of a new tool that could one day\n", + "end up at the center of continual learning\n", + "workflows --\n", + "as when\n", + "[FSDL students in spring 2019 got a chance to be early users of W&B](https://www.youtube.com/watch?t=1468&v=Eiz1zcqrqw0&feature=youtu.be&ab_channel=FullStackDeepLearning)." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "RmTFHvxHi4el" + }, + "source": [ + "## Basic stats and behavioral monitoring" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "hYSQ0r7eV094" + }, + "source": [ + "We start by just getting some basic statistics.\n", + "\n", + "For example, we can get descriptive statistics for\n", + "the information we've logged." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "Fb3BMn7gfQRI" + }, + "outputs": [], + "source": [ + "df[\"feedback.flag\"].describe()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "T9OseYhc1Q8i" + }, + "source": [ + "Note that the format we're working with is the `pandas.DataFrame` --\n", + "a standard format for tables in Python.\n", + "\n", + "`pandas` can be\n", + "[very tricky](https://github.com/chiphuyen/just-pandas-things).\n", + "\n", + "It's not so bad when doing exploratory analysis like this,\n", + "but take care when using it in production settings!\n", + "\n", + "If you'd like to learn more `pandas`,\n", + "[Brandon Rhodes's `pandas` tutorial from PyCon 2015](https://www.youtube.com/watch?v=5JnMutdy6Fw&ab_channel=PyCon2015)\n", + "is still one of the best introductions,\n", + "even though it's nearly a decade old."
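As one concrete example of the trickiness (a well-known `pandas` gotcha, illustrated here with toy data rather than data from this lab): chained indexing may silently write to a copy, so prefer a single `.loc` call when modifying a selection.

```python
import pandas as pd

# toy stand-in for logged feedback data
df = pd.DataFrame({"feedback.flag": ["incorrect", "offensive", "other"],
                   "score": [0.1, 0.9, 0.5]})

# risky: df[df["score"] > 0.4] can be a copy, so a chained write like
#   df[df["score"] > 0.4]["feedback.flag"] = "high"
# may be silently lost (pandas emits a SettingWithCopyWarning).

# safe: select rows and column in one .loc call, which writes in place
df.loc[df["score"] > 0.4, "feedback.flag"] = "high"
print(df["feedback.flag"].tolist())  # ['incorrect', 'high', 'high']
```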
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "eG15SMkgV095" + }, + "source": [ + "`pandas` objects support sampling with `.sample`,\n", + "which is useful for quick \"spot-checking\" of data." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "FZ5BRRqjc1Of" + }, + "outputs": [], + "source": [ + "df[\"feedback.flag\"].sample(10)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "w3rZaYwSzu-D" + }, + "source": [ + "Unlike in many other kinds of applications,\n", + "toxic and offensive behavior is\n", + "one of the most critical potential issues with\n", + "many ML models,\n", + "from\n", + "[generative models like GPT-3](https://www.middlebury.edu/institute/sites/www.middlebury.edu.institute/files/2020-09/gpt3-article.pdf)\n", + "to even humble\n", + "[image labeling models](https://archive.nytimes.com/bits.blogs.nytimes.com/2015/07/01/google-photos-mistakenly-labels-black-people-gorillas/).\n", + "\n", + "So ML models, especially when newly deployed\n", + "or when encountering new user bases,\n", + "need careful supervision." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "-CbdSz0hzze7" + }, + "source": [ + "We use a\n", + "[Gantry tool called Projections](https://docs.gantry.io/en/stable/guides/projections.html)\n", + "to apply the NLP models from the\n", + "[`detoxify` suite](https://github.com/unitaryai/detoxify),\n", + "which score text for features like obscenity and identity attacks,\n", + "to our model's outputs."
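Under the hood, a projection is just a derived column: a function applied row-by-row to logged data. The sketch below uses a crude word-blocklist scorer as a stand-in for the real `detoxify` neural networks; the column-naming convention mimics Gantry's `function(column)` style, but the scorer and blocklist are purely illustrative:

```python
import pandas as pd

# hypothetical blocklist; the real detoxify models are neural networks,
# not word lists -- this only illustrates the derived-column pattern
BLOCKLIST = {"darn", "heck"}


def toy_obscenity_score(text: str) -> float:
    """Fraction of whitespace-separated words that appear on the blocklist."""
    words = text.lower().split()
    if not words:
        return 0.0
    return sum(w in BLOCKLIST for w in words) / len(words)


logged = pd.DataFrame({"outputs.output_text": ["hello there", "darn heck darn", ""]})

# the "projection": a new column derived from a logged column
logged["toy.obscene(outputs.output_text)"] = logged["outputs.output_text"].map(toy_obscenity_score)

print(logged["toy.obscene(outputs.output_text)"].describe())
```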
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "1Z4lsgRcpQql" + }, + "source": [ + "To get a quick plot of the resulting values,\n", + "we can use the `pandas` built-in interface\n", + "to `matplotlib`:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "9UbBg947fAsh" + }, + "outputs": [], + "source": [ + "df.plot(y=\"detoxify.obscene(outputs.output_text)\", kind=\"hist\");" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "qxiIXGf0pVd5" + }, + "source": [ + "Without context, this chart isn't super useful --\n", + "is a score of `obscene=0.12` bad?\n", + "\n", + "We need a baseline!" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "UbOeOkzQgBDE" + }, + "source": [ + "Once the model is stable in production,\n", + "we can compare the values across time --\n", + "grouping or filtering production data by timestamp.\n", + "\n", + "Here, for this first version of the model,\n", + "we compare these results with the results on the test data,\n", + "which was also ingested with `gantry`."
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "ooa-Al48f_au" + }, + "outputs": [], + "source": [ + "test_df = raw_df.dropna(axis=\"columns\", how=\"all\") # remove any irrelevant columns\n", + "test_df = test_df[test_df[\"tags.env\"] == \"test\"] # filter down to info logged from the test environment\n", + "\n", + "test_df.sample(10) # show a sample" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "TssF7sSX1Q8k" + }, + "source": [ + "To compare the two `DataFrame`s,\n", + "we `concat`enate them together\n", + "and add in some metadata\n", + "identifying where the observations came from.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "oXWqfOdfgi4o" + }, + "outputs": [], + "source": [ + "test_df[\"environment\"] = \"test\"\n", + "df[\"environment\"] = \"prod\"\n", + "\n", + "comparison_df = pd.concat([df, test_df])" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "5fp9gAX_V09_" + }, + "source": [ + "From there, we can use grouping to calculate statistics of interest:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "NIGBxyZIV09_" + }, + "outputs": [], + "source": [ + "stats = comparison_df.groupby(\"environment\").describe()\n", + "\n", + "stats[\"detoxify.obscene(outputs.output_text)\"]" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "2G2tVhhY1Q8k" + }, + "source": [ + "These descriptive statistics are helpful,\n", + "but as with our simple plot above,\n", + "we want to _look_ at the data.\n", + "\n", + "Exploratory data analysis is typically very visual --\n", + "the goal is to find phenomena so obvious\n", + "that statistical testing is an afterthought --\n", + "and so is exploratory model analysis.\n", + "\n", + "`matplotlib` is based on plotting arrays,\n", + "rather than `DataFrame`s or other tabular data,\n", + "so it's not a great fit on its own here,\n", + "unless we want to tolerate a lot 
of boilerplate.\n", + "\n", + "`pandas` has basic built-in plotting\n", + "that interfaces with `matplotlib`,\n", + "but it's not that ergonomic for comparisons,\n", + "nor very flexible, without just dropping back to `matplotlib`.\n", + "\n", + "There are a number of other Python plotting libraries,\n", + "many with an emphasis on share-ability and interaction\n", + "([Vega-Altair](https://altair-viz.github.io/),\n", + "[`bokeh`](http://bokeh.org/),\n", + "and\n", + "[Plotly](https://plotly.com/),\n", + "to name a few)\n", + "and others with an emphasis on usability\n", + "(e.g. [`ggplot`](https://realpython.com/ggplot-python/)).\n", + "\n", + "The one we like for in-notebook analysis,\n", + "balancing ease of use\n", + "on tabular data with flexibility, is\n", + "[`seaborn`](https://seaborn.pydata.org/)." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "7nZV8uoG1Q8k" + }, + "source": [ + "Comparing the distributions of the `detoxify.obscene` metric\n", + "is a single function call:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "WnGxCz1f1Q8k" + }, + "outputs": [], + "source": [ + "import seaborn as sns\n", + "\n", + "\n", + "sns.displot( # plot the dis-tribution\n", + " data=comparison_df, # of data from this df\n", + " # specifically, this column, along the x-axis\n", + " x=\"detoxify.obscene(outputs.output_text)\",\n", + " # and split it up (in color/hue) by this column\n", + " hue=\"environment\"\n", + ");" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "jO6FuRCQV0-A" + }, + "source": [ + "We can quickly see that the obscenity scores according to `detoxify`\n", + "are generally lower in our `prod`uction environment,\n", + "so we don't have a reason to suspect\n", + "our model is behaving too badly in production\n", + "-- though see the exercises for more on this!\n", + "\n", + "We can see the same thing\n", + "without having to write query, cleaning, and plotting code\n", + "[in the Gantry
UI here](https://app.gantry.io/applications/fsdl-text-recognizer/distribution?view=2022-class&compare=test-ingest) --\n", + "note that viewing the dashboard requires a Gantry account,\n", + "which you can sign up for\n", + "[here](https://gantry.io/fsdl-signup)." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "iKZ0l2MCjlDn" + }, + "source": [ + "## Debugging the Text Recognizer" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "ovp8fZ1GpUet" + }, + "source": [ + "In our application,\n", + "we don't have user corrections or labels from annotators,\n", + "so we can't calculate an accuracy, a loss, or a character error rate.\n", + "\n", + "We instead look for signals that are correlated with\n", + "those values.\n", + "\n", + "This approach has limits\n", + "(see, e.g. the analysis in the\n", + "[MLDeMon paper](https://arxiv.org/abs/2104.13621))\n", + "and setting alerts or test failures on things that are only correlated with,\n", + "rather than directly caused by, poor performance is a bad idea.\n", + "\n", + "But it's very useful to have this information logged\n", + "to catch large errors at a glance\n", + "or to provide tools for slicing, filtering, and grouping data\n", + "while doing exploratory model analysis or debugging." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "0YauDrY51Q8l" + }, + "source": [ + "We can also compute these signals with Gantry Projections.\n", + "\n", + "Low entropy (e.g. repetition) is a failure mode of language models,\n", + "as is excessively high entropy (e.g. uniformly random text).\n", + "\n", + "We can review the output text entropy distributions in\n", + "production and during testing\n", + "by plotting them against one another\n", + "(here or\n", + "[in the Gantry UI](https://app.gantry.io/applications/fsdl-text-recognizer/distribution?view=2022-class&compare=test-ingest))." 
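The `text_stats.basics.entropy` projection condenses each output string to a single number. Assuming it is (roughly) the character-level Shannon entropy -- Gantry's exact definition may differ -- a local sketch looks like:

```python
import math
from collections import Counter


def char_entropy(text: str) -> float:
    """Shannon entropy, in bits, of the character distribution of `text`."""
    if not text:
        return 0.0
    counts = Counter(text)
    total = len(text)
    # H = sum_c p(c) * log2(1 / p(c)), summed over distinct characters c
    return sum((n / total) * math.log2(total / n) for n in counts.values())


print(char_entropy("ooooooooo"))  # 0.0 -- repetitive output scores low
print(char_entropy("a quick brown fox jumps over the lazy dog"))  # higher: varied characters
```

Both tails matter for monitoring: near-zero entropy suggests repetition loops, while unusually high entropy suggests close-to-random characters.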
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "czepR9o7l2FO" + }, + "outputs": [], + "source": [ + "sns.displot(\n", + " data=comparison_df,\n", + " x=\"text_stats.basics.entropy(outputs.output_text)\",\n", + " hue=\"environment\"\n", + ");" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "8LiFvkoR1Q8l" + }, + "source": [ + "It appears there are more low-entropy strings in the model's outputs in production.\n", + "\n", + "With models that operate on human-relevant data,\n", + "like text and images,\n", + "it's important to look at the raw data,\n", + "not just projections.\n", + "\n", + "Let's take a look at a sample of outputs from the model running on test data:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "FQ9kTz2ZmOwR" + }, + "outputs": [], + "source": [ + "test_df[\"outputs.output_text\"].sample(10)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "BpZ_35uD1Q8l" + }, + "source": [ + "The results are not incredible, but they are recognizably \"English with typos\"." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "NVlj3vYf1Q8l" + }, + "source": [ + "Let's look specifically at low entropy examples from production\n", + "(we can also view this\n", + "[filtered data in the Gantry UI](https://app.gantry.io/applications/fsdl-text-recognizer/data?view=2022-class-low-entropy&compare=test-ingest))." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "p0dkx1VzoJ9C" + }, + "outputs": [], + "source": [ + "df.loc[df[\"text_stats.basics.entropy(outputs.output_text)\"] < 5][\"outputs.output_text\"].sample(10)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "iMmcPuynV0-C" + }, + "source": [ + "Yikes! Lots of repetitive gibberish." 
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "stStBoCZ1Q8m" + }, + "source": [ + "Knowing the outputs are bad,\n", + "we have two possible culprits:\n", + "the input-output mapping (aka the model)\n", + "or the inputs." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "nFaGYnjcmKf6" + }, + "source": [ + "We ran the same model in a similar environment\n", + "to get those outputs,\n", + "so it's most likely due to some difference in the inputs.\n", + "\n", + "Let's check them!\n", + "\n", + "We added Gantry Projections to look at the distribution of pixel values as well." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "uSwnexFRlaIV" + }, + "outputs": [], + "source": [ + "sns.displot(\n", + " data=comparison_df,\n", + " x=\"image.greyscale_image_mean(inputs.image)\",\n", + " hue=\"environment\"\n", + ");" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "iqkWkM45yMgV" + }, + "source": [ + "There's a huge difference in mean pixel values --\n", + "almost all images in the testing environment have very dark mean intensities,\n", + "but we see both dark and light images in production.\n", + "\n", + "Reviewing the\n", + "[raw data in Gantry](https://app.gantry.io/applications/fsdl-text-recognizer/data?view=2022-class-low-entropy&compare=test-ingest)\n", + "confirms that we are getting images with very different brightnesses in production\n", + "and whiffing the predictions\n", + "-- along with images that reveal a number of other interesting failure modes." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "X5uWeR6n1Q8m" + }, + "source": [ + "To take a look locally,\n", + "we'll need to pull the images down from S3,\n", + "where they are stored."
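Before pulling raw images down, note that the `image.greyscale_image_mean` projection is easy to reproduce locally. Here is a sketch of (our assumption of) the computation, plus the crude "check brightness and invert" guard that comes up as one candidate resolution in the take-aways; the synthetic images and function names are illustrative:

```python
import numpy as np


def greyscale_mean(image: np.ndarray) -> float:
    """Mean pixel intensity in [0, 1] for a uint8 greyscale image (0 = black, 255 = white)."""
    return float(image.mean()) / 255.0


def match_training_brightness(image: np.ndarray) -> np.ndarray:
    # if the image is mostly light (light background, as users tend to submit),
    # invert it so it resembles the dark-background training distribution
    return 255 - image if greyscale_mean(image) > 0.5 else image


# synthetic stand-ins for a training-style image and a user-style image
dark_bg = np.zeros((28, 28), dtype=np.uint8)       # all black
light_bg = np.full((28, 28), 255, dtype=np.uint8)  # all white

print(greyscale_mean(dark_bg))                              # 0.0
print(greyscale_mean(match_training_brightness(light_bg)))  # 0.0 after inversion
```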
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "NbNMlevz1Q8m" + }, + "source": [ + "The cell below defines a quick utility for\n", + "reading from S3 without authentication.\n", + "\n", + "It is based on the `smart_open` and `boto3` libraries,\n", + "which we briefly saw in the\n", + "[model deployment lab](https://fsdl.me/lab07-colab)\n", + "and the\n", + "[data annotation lab](https://fsdl.me/lab06-colab)." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "-FNIm0MOovtu" + }, + "outputs": [], + "source": [ + "import boto3\n", + "from botocore import UNSIGNED\n", + "from botocore.config import Config\n", + "import smart_open\n", + "\n", + "from text_recognizer.util import read_image_pil_file\n", + "\n", + "# spin up a client for communicating with s3 without authenticating (\"UNSIGNED\" activity)\n", + "s3 = boto3.client('s3', config=Config(signature_version=UNSIGNED))\n", + "unsigned_params = {\"client\": s3}\n", + "\n", + "def read_image_unsigned(image_uri, grayscale=False):\n", + " with smart_open.open(image_uri, \"rb\", transport_params=unsigned_params) as image_file:\n", + " return read_image_pil_file(image_file, grayscale)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "SxBpmPYrV0-F" + }, + "source": [ + "Run the cell below to repeatedly sample a random input/output pair\n", + "flagged in production." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "Xy90rzcWobuk" + }, + "outputs": [], + "source": [ + "row = df.sample().iloc[0]\n", + "print(\"image url:\", row[\"inputs.image\"])\n", + "print(\"prediction:\", row[\"outputs.output_text\"])\n", + "read_image_unsigned(row[\"inputs.image\"])" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "oFdT2W2xtOGx" + }, + "source": [ + "### Take-aways for developing models\n", + "\n", + "The most immediate take-away from reviewing just a few examples is that\n", + "user data is way more heterogeneous than train/val/test data!\n", + "\n", + "This is a\n", + "[fairly](https://browsee.io/blog/a-guide-to-session-replays-for-product-managers/)\n", + "[universal](https://medium.com/@beasles/edge-case-responsive-design-9b610138ddbd)\n", + "[finding](https://quoteinvestigator.com/2021/05/04/no-plan/).\n", + "\n", + "Let's also consider some specific failure modes in our case\n", + "and how we might resolve them:\n", + "\n", + "- Failure mode: Users mostly provide images with dark text on light background, but we train on dark background.\n", + " - Resolution: We could check image brightness and flip if needed,\n", + " but this feels like a cop-out -- most text is dark on a light background!\n", + " - Resolution: We add image brightness inversion to our train-time augmentations.\n", + "- Failure mode: Users expect our \"handwritten text recognition\" tool to work with printed and digital text.\n", + " - Resolution: We could try better sign-posting and user education,\n", + " but this is also something of a cop-out.\n", + " Users expect the tool to work on all text,\n", + " so we shouldn't violate that expectation.\n", + " - Resolution: We synthesize digital text data --\n", + " text rendering is a feature of just about any mature programming language.\n", + "- Failure mode: Users provide text on heterogeneous backgrounds.\n", + " - Resolution: We collect or synthesize more heterogeneous
data,\n", + " e.g. placing text (with or without background coloring)\n", + " on top of random image backgrounds.\n", + "- Failure mode: Users provide text with characters and symbols outside of our dictionary.\n", + " - Resolution: We can expand the model outputs and collect more heterogeneous data.\n", + "- Failure mode: Users provide images with multiple blocks of text.\n", + " - Resolution: We develop an architecture/task definition that can handle multiple regions.\n", + " We'll need to collect and/or synthesize data to support it." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "9rQH6zI8u7WN" + }, + "source": [ + "Notice: these are almost entirely changes to data,\n", + "and most of them involve collecting more or synthesizing it.\n", + "\n", + "This is very typical!\n", + "\n", + "Data drives improvements to models,\n", + "[even at scale](https://www.lesswrong.com/posts/6Fpvch8RR29qLEWNH/chinchilla-s-wild-implications)." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "2P5MrIW5V0-F" + }, + "source": [ + "### Take-aways for exploratory model analysis" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "mfMf1wwR1Q8n" + }, + "source": [ + "Notice that we had to write a lot of code,\n", + "which we developed and ran in a\n", + "tight interactive loop.\n", + "\n", + "This type of code is very hard to turn into scripts --\n", + "how do you trigger an alert on a plot? 
--\n", + "which makes it brittle and hard to version and share.\n", + "\n", + "It's also based on possibly very large-scale data artifacts.\n", + "\n", + "The right tool for this job is a UI\n", + "on top of a database.\n", + "\n", + "In the\n", + "[video walkthrough for this lab](https://fsdl.me/2022-lab-08-video),\n", + "we do effectively the same analysis,\n", + "but inside Gantry,\n", + "which makes the process more fluid.\n", + "\n", + "Gantry is still in closed beta,\n", + "but if you're interested in applying it to your own applications, you can\n", + "[join the waitlist](https://gantry.io/waitlist/)." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "M73gui0XhgCl" + }, + "source": [ + "# Exercises" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "mWWrmGiThhMw" + }, + "source": [ + "### 🌟 Examine the test data strings, both output and ground truth." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "km0nv0Mghmd_" + }, + "source": [ + "We compared our production obscenity metric to the test-time values of that same metric\n", + "and determined that we had not gotten worse,\n", + "so things were fine.\n", + "\n", + "But what if the test-time baseline is bad?\n", + "\n", + "Review the raw test ground truth data\n", + "[here](https://app.gantry.io/applications/fsdl-text-recognizer/data?view=test-ingest),\n", + "if you\n", + "[signed up for a Gantry account](https://gantry.io/fsdl-signup),\n", + "or by looking at the contents of `test_df` above.\n", + "\n", + "Sort by `detoxify.identity_attack(feedback.ground_truth_string)`\n", + "or filter to only high values of that metric.\n", + "\n", + "Review the example `feedback.ground_truth_string` texts and consider:\n", + "Is this the subset of English\n", + "we want the model to be trained on?\n", + "What objections might be raised to the contents?\n", + "\n", + "You might also look for cases where the `detoxify` models misunderstood meaning --\n", + "e.g. 
an innocuous use of a word that's often used objectionably." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "1Q6mWRwS1Q8t" + }, + "source": [ + "### 🌟🌟 Start building \"regression testing suites\" by doing error analysis on these examples." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "jfsCnjCg1Q8t" + }, + "source": [ + "Do this by going through feedback data one image/text pair at a time --\n", + "[in Gantry](https://app.gantry.io/applications/fsdl-text-recognizer/data?view=2022-class-low-entrop)\n", + "or inside this notebook.\n", + "\n", + "Start by just taking notes on each example\n", + "(anywhere -- Google Sheets/Excel/Notion, or just a sheet of paper).\n", + "\n", + "The primary question you should ask is:\n", + "how does this example differ from the data shown in training?\n", + "\n", + "Check\n", + "[this W&B Artifact page](https://wandb.ai/cfrye59/fsdl-text-recognizer-2021-training/artifacts/run_table/run-1vrnrd8p-trainpredictions/v194/files/train/predictions.table.json#f5854c9c18f6c24a4e99)\n", + "to see what training data\n", + "(including augmentation)\n", + "looks like.\n", + "\n", + "Once you have some notes,\n", + "try to formalize them into a small number of \"failure modes\" --\n", + "you can choose to align them with the failure modes described in the section\n", + "on take-aways for model development, or not.\n", + "\n", + "If you want to finish the loop,\n", + "you might set up Label Studio, as in\n", + "[the data annotation lab](https://fsdl.me/lab06-colab).\n", + "An annotator should add at least a\n", + "\"label\" that gives the type of issue\n", + "and perhaps also add a text annotation\n", + "while they are at it."
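, + "\n", + "\n", + "If you'd rather keep your notes inside the notebook than in a spreadsheet, here is a minimal sketch of tallying them with pandas -- the example IDs, failure-mode labels, and notes below are hypothetical placeholders, not real feedback data:\n", + "\n", + "```python\n", + "import pandas as pd\n", + "\n", + "# Hypothetical notes from reviewing feedback examples one at a time.\n", + "notes = pd.DataFrame([\n", + "    {'example_id': 'abc123', 'failure_mode': 'printed text', 'note': 'screenshot of an email'},\n", + "    {'example_id': 'def456', 'failure_mode': 'multiple text blocks', 'note': 'two columns of text'},\n", + "    {'example_id': 'ghi789', 'failure_mode': 'printed text', 'note': 'photo of a book page'},\n", + "])\n", + "\n", + "# Tally failure modes to see which regression suite to build first.\n", + "print(notes['failure_mode'].value_counts())\n", + "```"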
+ ] + } + ], + "metadata": { + "colab": { + "private_outputs": true, + "provenance": [], + "toc_visible": true + }, + "gpuClass": "standard", + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.7.13" + }, + "vscode": { + "interpreter": { + "hash": "0f056848cf5d2396a4970b625f23716aa539c2ff5334414c1b5d98d7daae66f6" + } + } + }, + "nbformat": 4, + "nbformat_minor": 0 +} \ No newline at end of file