https://github.com/harrig12/hummingbird_eda
Data in the wild | "Domesticated" data |
---|---|
Messy | Clean! (relatively) |
Domain expertise needed | Well studied benchmarks |
May require special handling (ethics approval, privacy, IP) | May be simulated, or otherwise not representative of the real world |
Depending on your context, the standards for how to treat your data will vary widely...
Some examples:
Datasets tend to have hidden complexities that only reveal themselves once you’ve spent some time with them.
The domain expert is the person who knows about the technicalities of data collection, and is aware of the interests of your stakeholders.
(Sometimes you have to act as both the data scientist and the domain expert!)
Datasaurus Dozen http://www.thefunctionalart.com/2016/08/download-datasaurus-never-trust-summary.html
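The Datasaurus Dozen makes the point that matching summary statistics do not imply matching data. A minimal sketch of the same idea, using two hypothetical toy samples (not the Datasaurus data itself) and a helper named `standardize` that is introduced here only for illustration:

```python
import statistics


def standardize(values):
    """Rescale values to mean 0 and population standard deviation 1."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return [(v - mean) / sd for v in values]


def summary(values):
    """Return (mean, population standard deviation), rounded for comparison."""
    return (round(statistics.fmean(values), 6),
            round(statistics.pstdev(values), 6))


def share_near_centre(values, width=0.5):
    """Fraction of points lying within `width` of zero."""
    return sum(abs(v) < width for v in values) / len(values)


# Hypothetical toy samples: one evenly spread, one split into two clusters.
evenly_spread = standardize([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
two_clusters = standardize([1, 1, 1, 1, 1, 10, 10, 10, 10, 10])

# The summaries agree, but the shapes do not.
print(summary(evenly_spread), summary(two_clusters))
print(share_near_centre(evenly_spread), share_near_centre(two_clusters))
```

Both samples report the same mean and standard deviation, yet only one of them has any points near its centre. A histogram or scatter plot would show the difference immediately, which is the argument for plotting before trusting a summary.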
# load plotting libraries
import matplotlib.pyplot as plt
import seaborn as sns
# load penguin data
from palmerpenguins import load_penguins
penguins = load_penguins()
# fit and plot a linear regression of bill depth on bill length (all penguins pooled)
sns.lmplot(data=penguins, x='bill_length_mm', y='bill_depth_mm', aspect=2)
plt.title('Penguin bill depth is negatively correlated with length?', fontsize=15);
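One reason a pooled trend can deserve the question mark in that title is that the table mixes several groups (the penguins data has a `species` column, for instance). The sketch below uses hypothetical toy numbers, not the penguin measurements, and a hand-written `pearson` helper, to show how a pooled correlation can be negative while every group on its own trends upward:

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical toy groups: within each group, y rises with x.
groups = {
    "group_a": ([1, 2, 3], [5, 6, 7]),
    "group_b": ([5, 6, 7], [1, 2, 3]),
}

# Correlation within each group separately.
within = {name: pearson(xs, ys) for name, (xs, ys) in groups.items()}

# Correlation after pooling all points and ignoring the group label.
pooled_x = [x for xs, _ in groups.values() for x in xs]
pooled_y = [y for _, ys in groups.values() for y in ys]
pooled = pearson(pooled_x, pooled_y)

print(within)
print(pooled)
```

With real data, a quick way to check for this is to colour or facet the plot by the grouping column before reading anything into the pooled regression line.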