|Data in the wild||"Domesticated" data|
|Domain expertise needed||Well studied benchmarks|
|May require special handling (ethics approval, privacy, IP)||May be simulated, or otherwise not representative of the real world|
Depending on your context, standards for how to treat your data will vary widely...
Datasets tend to have hidden complexities that only reveal themselves once you’ve spent some time with them.
The domain expert is the person who knows about the technicalities of data collection, and is aware of the interests of your stakeholders.
(Sometimes you have to act as both the data scientist and the domain expert!)
Datasaurus Dozen http://www.thefunctionalart.com/2016/08/download-datasaurus-never-trust-summary.html
# load plotting libraries
import matplotlib.pyplot as plt
import seaborn as sns

# load penguin data
from palmerpenguins import load_penguins
penguins = load_penguins()

# plot bill depth against bill length with a fitted regression line
sns.lmplot(data=penguins, x='bill_length_mm', y='bill_depth_mm', aspect=2)
plt.title('Penguin bill depth is negatively correlated with length?', fontsize=15);