Exploring & Explaining Data in the Wild¶

March 13 2023¶

Cait Harrigan¶

cait.harrigan@mail.utoronto.ca

UofT CS logo DSI logo Vector logo

Today:¶

  1. Context-aware data analysis
  2. EDA checklist
  3. Practical example
  4. Story building, extracting data insights

Download this notebook & data:¶

https://github.com/harrig12/hummingbird_eda

Where does your data come from?¶

Data in the wild | "Domesticated" data
lion | cat
Messy | Clean! (relatively)
Domain expertise needed | Well studied benchmarks
May require special handling (ethics approval, privacy, IP) | May be simulated, or otherwise not representative of the real world

Data analysis should be context-aware¶

Depending on your context, the standards for how to treat your data will vary widely.

Some examples:

  • A statistical theory researcher using linear regression as a toy model
  • An Uber intern training a model to keep an autonomous vehicle within road lines
  • In my research, I look for underlying data structure which tells us something new about cancer biology

The nature of the data and problem setting shapes what approaches are suitable to the task. Our methods have to be selected with respect to potential risks and benefits, budget (time/compute/monetary), and standards of the field.

The stats theorist might be less interested in querying the data itself than in how their method behaves across all possible datasets.

The Uber intern probably wants high coverage of the data and few failure modes, or at least failure modes that are easy to detect. The risk is a car crash! Generalization to new environments may be critical in this setting.

For my data, finding something useful for even a subset of patients is a good result! It doesn't have to apply to the whole dataset. Interpolation is more acceptable. Data access is tightly controlled for health privacy.

Your lion tamers: domain experts¶

Datasets tend to have hidden complexities that only reveal themselves once you’ve spent some time with them.

The domain expert is the person who knows about the technicalities of data collection, and is aware of the interests of your stakeholders.

(Sometimes you have to act as both the data scientist and the domain expert!)

The domain expert is the person you bring your questions to. Usually you have to look at the data first, before any suitable questions come up. They also parameterize the expectations and outcomes of an analysis. Maybe we run a linear regression to find the coefficient of correlation between variables, but it's the outside knowledge of the conditions and limitations of the setting that actually makes this actionable.

How to make effective use of experts?¶

Challenges:¶

  • Querying the expert is expensive
    • limited time to discuss
    • minimal instructions
  • Getting on the same page
    • establish the goals of the analysis (come back to these at the storytelling stage)
    • build a shared language
    • check your assumptions on the data
  • Asking the right questions
    • critical to spend some time looking at the data before going to your expert
    • Most often, I find myself asking: "I saw xyz in the data. Is this expected?"

https://www.mrdbourke.com/a-gentle-introduction-to-exploratory-data-analysis/

EDA is a separate activity from data storytelling¶

"never trust summary statistics alone; always visualize your data" - Alberto Cairo¶

Datasaurus Dozen http://www.thefunctionalart.com/2016/08/download-datasaurus-never-trust-summary.html
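A minimal sketch of the same point, using Anscombe's quartet rather than the Datasaurus Dozen because it is small enough to type in by hand. The `pearson` helper and the `anscombe` and `summary` names are introduced here for illustration only. The four datasets have nearly identical mean and correlation (mean of y about 7.5, r about 0.82), yet look completely different when plotted.

```python
from math import sqrt
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Anscombe's quartet: datasets I-III share the same x values
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
anscombe = {
    "I": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

# summary statistics per dataset: (mean of y, correlation of x and y)
summary = {name: (mean(ys), pearson(xs, ys)) for name, (xs, ys) in anscombe.items()}

for name, (m, r) in summary.items():
    print(f"{name:>3}: mean(y) = {m:.2f}, r = {r:.2f}")
```

Scatter-plotting each dataset (for example with `plt.scatter(xs, ys)`) shows a noisy line, a curve, a line with one outlier, and a vertical cluster with one leverage point. None of that is visible in the numbers printed above.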

Simpson's paradox: be on the lookout for hidden structure in your data!¶

palmer penguins bill diagram

Artwork by @allison_horst

In [3]:
# load plotting libraries
import matplotlib.pyplot as plt
import seaborn as sns

# load penguin data
from palmerpenguins import load_penguins
penguins = load_penguins()
In [4]:
# fit one regression line to all penguins pooled together
sns.lmplot(data=penguins, x='bill_length_mm', y='bill_depth_mm', aspect=2)
plt.title('Penguin bill depth is negatively correlated with length?', fontsize=15);
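To see how a pooled trend can point the wrong way, here is a small sketch of Simpson's paradox on made-up numbers (this is not the penguin data). The group names, the `wiggle` offsets, and the `ols_slope` helper are all hypothetical and chosen only for illustration. Within each group y increases with x, but the groups are arranged so that a single line fitted to everything slopes downward.

```python
from statistics import mean


def ols_slope(xs, ys):
    """Slope of the least-squares line y = a + b*x."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    return cov / var_x


# synthetic data: three groups, each trending upward,
# but groups with larger x start from a lower y
wiggle = [0.2, -0.1, 0.0, 0.1, -0.2]  # small deterministic noise
groups = {}
for name, x_start, y_start in [("A", 0, 10), ("B", 5, 6), ("C", 10, 2)]:
    xs = [x_start + i for i in range(5)]
    ys = [y_start + 0.5 * i + w for i, w in zip(range(5), wiggle)]
    groups[name] = (xs, ys)

# slope within each group
group_slopes = {name: ols_slope(xs, ys) for name, (xs, ys) in groups.items()}

# slope when the group labels are ignored
pooled_x = [x for xs, _ in groups.values() for x in xs]
pooled_y = [y for _, ys in groups.values() for y in ys]
pooled_slope = ols_slope(pooled_x, pooled_y)

print("within-group slopes:", {k: round(v, 2) for k, v in group_slopes.items()})
print("pooled slope:", round(pooled_slope, 2))
```

On the penguin data, one way to look for this kind of hidden structure would be to pass a grouping column to the same plot, for example `hue='species'` in `sns.lmplot`, and compare the per-group lines with the pooled one.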