ELEC-D7010 Engineering for Humans

Lecture 9 - Experimental Methods

Aurélien Nioche

Aalto University

 Learning objectives


Learn, through a series of examples, the elementary rules of experimentation and the main threats to validity

Define a research problem & the hypothesis

The research problem constitutes the question(s) you aim to answer.

The hypothesis constitutes the possible answer(s) to this question (or these questions).

 Important


Strong restriction: the hypothesis has to be a falsifiable statement (Popper, 1959, as a response to Hume's problem of induction, 1739-40)

For a contextualized explanation, you can refer to the Stanford Encyclopedia: Andersen, Hanne and Brian Hepburn, "Scientific Method", The Stanford Encyclopedia of Philosophy (Summer 2016 Edition), Edward N. Zalta (ed.), URL = https://plato.stanford.edu/archives/sum2016/entries/scientific-method/.

H: "All swans are white"


H: "Most people do not really want freedom, because freedom involves responsibility, and most people are frightened of responsibility."

Is this statement falsifiable?


Examples of non-falsifiable statements:

  • tautologies
  • vague open-ended statements

Note: do not confuse a *working* hypothesis with a *testing* hypothesis

 Important


Be aware of the confirmation bias

...but even when using a scientific method, we can be tempted...

Operationalize

Operationalize: express or define (something) in terms of the operations used to determine or prove it.

Differentiate:
  • Dependent variable: what we measure
  • Independent variable: what we manipulate

Let's take an example:


Let's assume that I want to test the hypothesis that the gain in speed obtained with a swipe-typing keyboard depends on the age of the user.

Several operationalizations are possible but here is one:

  • Dependent variable: words per minute (WPM) on a typing test.
  • Independent variable: Age group (15-20, 20-25, etc.).

Note: there can be several dependent and independent variables; a sketch of the resulting data layout is given below.
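As a minimal sketch (with hypothetical column names and values), the data collected under this operationalization could be laid out with one row per participant, the independent variable in one column and the dependent variable in another:

import pandas as pd

# Hypothetical layout: one row per participant
df_example = pd.DataFrame({
    "participant": [1, 2, 3, 4],
    "age_group": ["15-20", "20-25", "20-25", "25-30"],  # independent variable (manipulated/selected)
    "wpm_gain": [12.3, 9.8, 11.1, 7.4],                 # dependent variable (measured), hypothetical values
})
print(df_example)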


An experiment is a comparison, so...

 Important


Choose your baseline and your control carefully, in order to obtain interpretable results.

Why is it important? Let's take an example...

Example
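A hypothetical illustration (simulated data, not an actual study): suppose typing speed improves with practice regardless of the keyboard used. Without a control group that keeps the old keyboard, this practice effect would be mistaken for an effect of the new keyboard.

import numpy as np

np.random.seed(0)
n = 30

# Everyone gets faster with practice (+5 WPM on average); the new keyboard changes nothing
practice_effect = 5.0
keyboard_effect = 0.0

# Treatment group: measured before and after switching to the new keyboard
before_treatment = np.random.normal(40, 5, size=n)
after_treatment = before_treatment + practice_effect + keyboard_effect + np.random.normal(0, 2, size=n)

# Control group: measured before and after, but keeps the old keyboard
before_control = np.random.normal(40, 5, size=n)
after_control = before_control + practice_effect + np.random.normal(0, 2, size=n)

print("Improvement (treatment):", np.mean(after_treatment - before_treatment))
print("Improvement (control):  ", np.mean(after_control - before_control))
# Comparing the treatment group only to its own 'before' measurement would wrongly credit
# the new keyboard; the control group shows that the improvement is due to practice alone.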

Choose the experimental design

 Important


Differentiate:
  • between subjects design
  • within subjects design

To make it short...
Do you expect large variability in your subjects?

  • No: between subjects
  • Yes: within subjects

If you choose "yes", be aware of possible confounding factors, such as the order effect (one common remedy is sketched below)
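A minimal sketch (with hypothetical condition names) of counterbalancing, i.e., varying the order of conditions across participants so that the order effect does not systematically favor one condition:

# Counterbalance the order of two hypothetical conditions:
# A = old keyboard, B = swipe-typing keyboard
n_participants = 8

for p in range(n_participants):
    # Even-numbered participants do A then B; odd-numbered participants do B then A
    order = ("A", "B") if p % 2 == 0 else ("B", "A")
    print(f"Participant {p}: {order[0]} first, then {order[1]}")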


Collect the data

 Important


Choose your sample carefully, i.e., be aware of selection bias

Even where you would not have expected them, many differences can stem from (in arbitrary order):

  1. age
  2. gender
  3. cultural background / ethnicity
  4. education
  5. level of expertise in X or Y that can interfere with your experimental task

Why is it important? Let's take an example...

Example
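A hypothetical illustration (simulated data): typing speed depends on age, but we only recruit on a university campus, so the sample over-represents 20-25 year-olds and the estimated speed is biased upwards.

import numpy as np

np.random.seed(0)

# Simulated population: typing speed (WPM) decreases with age
age = np.random.uniform(15, 75, size=10_000)
wpm = 60 - 0.4 * (age - 20) + np.random.normal(0, 5, size=age.size)

# A random sample vs. a convenience sample recruited on campus (ages 20-25 only)
random_sample = np.random.choice(wpm, size=100)
campus_sample = wpm[(age >= 20) & (age <= 25)]

print(f"Population mean WPM:    {np.mean(wpm):.1f}")
print(f"Random sample mean WPM: {np.mean(random_sample):.1f}")
# The campus (convenience) sample over-represents young, fast typists:
print(f"Campus sample mean WPM: {np.mean(campus_sample):.1f}")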

Analyze and interpret the data

How to go from observations to interpretations, in order to answer (or contribute to the understanding of) your research problem

Look at the raw data

 Important


Look at your raw data first

Why is it important? Let's take an example...

Example

Dataset 1

Let's load the data from circle-data.csv

Code
In [83]:
# Import the libraries
import pandas as pd
import os
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
import scipy.stats as stats
import scipy.special as sps  # For gamma function
In [84]:
# Load the data
df = pd.read_csv(
    os.path.join("data", "circle-data.csv"),
    index_col=[0])

# Print the top of the file
df
Out[84]:
x y
0 53.500480 82.351329
1 50.629577 84.715953
2 49.932717 79.284252
3 48.788095 81.244959
4 49.496214 78.992123
... ... ...
137 39.580304 21.216219
138 85.303704 41.078480
139 51.310546 84.395317
140 50.594653 81.292113
141 48.743542 82.405670

142 rows × 2 columns

You could be tempted to begin by computing descriptive statistics such as the mean, instead of looking at your data...

In [85]:
# For both variables
for var in "x", "y":
    
    # Compute the mean and standard deviation, printing only 2 digits after the decimal point
    print(f"Mean '{var}': {np.mean(df[var]):.2f} +/- {np.std(df[var]):.2f} STD")
Mean 'x': 54.27 +/- 16.70 STD
Mean 'y': 47.83 +/- 26.84 STD

And still without looking at the raw data, let's do a barplot:

Visualize with a simple barplot
In [86]:
# Reshape the dataframe to long format (one 'variable' column and one 'value' column)
df_flipped = df.melt()
# Do a barplot
sns.barplot(x="variable", y="value", data=df_flipped, ci="sd")
plt.title("Dataset 1")
plt.show()

Dataset 2

Let's consider a second dataset...

Let's load the data from dino-data.csv

Code
In [87]:
# Load the data
df_other = pd.read_csv(
    os.path.join("data", "dino-data.csv"),
    index_col=[0])

# Look at the top of the file
df_other
Out[87]:
x y
0 55.384600 97.179500
1 51.538500 96.025600
2 46.153800 94.487200
3 42.820500 91.410300
4 40.769200 88.333300
... ... ...
137 39.487200 25.384600
138 91.282100 41.538500
139 50.000000 95.769200
140 47.948700 95.000000
141 44.168231 92.657053

142 rows × 2 columns

In [90]:
# For both variables...
for var in ("x", "y"):
    
    # Print the means and standard deviations for the first dataset
    print(f"Dataset 1 - Mean '{var}': {np.mean(df[var]):.1f} +/- {np.std(df[var]):.2f} STD")
print()

# For both variables...
for var in ("x", "y"):
    
    # Print the means and standard deviations for the second dataset
    print(f"Dataset 2 - Mean '{var}': {np.mean(df_other[var]):.1f} +/- {np.std(df_other[var]):.2f} STD")
Dataset 1 - Mean 'x': 54.3 +/- 16.70 STD
Dataset 1 - Mean 'y': 47.8 +/- 26.84 STD

Dataset 2 - Mean 'x': 54.3 +/- 16.71 STD
Dataset 2 - Mean 'y': 47.8 +/- 26.84 STD
Visualize with a simple barplot
In [25]:
# Do a barplot
sns.barplot(x="variable", y="value", data=df_other.melt(), ci="sd")
plt.title("Dataset 2")
plt.show()

Compare by looking at the raw data

They look quite alike, don't they?

In [105]:
# Create figure and axes
fig, axes = plt.subplots(ncols=2)

# Do the left barplot
sns.barplot(x="variable", y="value", data=df.melt(), ax=axes[0], ci="sd")
# Set the title
axes[0].set_title("Original dataset")

# Do the right barplot
sns.barplot(x="variable", y="value", data=df_other.melt(), ax=axes[1], ci="sd")
# Set the title
axes[1].set_title("Other dataset")

plt.tight_layout()
plt.show()

However...

In [23]:
# Create figure and axes
fig, axes = plt.subplots(ncols=2, figsize=(12, 9))


# For both datasets
for i, (label, data) in enumerate((("Dataset 1", df), ("Dataset 2", df_other))):
    
    # Do a scatter plot
    ax = axes[i]
    sns.scatterplot(x="x", y="y", data=data, ax=ax)
    
    # Set the title
    ax.set_title(label)
    
    # Set the limits of the axes
    ax.set_xlim(0, 100)
    ax.set_ylim(0, 100)
    
    # Make it look square
    ax.set_aspect(1)

plt.tight_layout()
plt.show()

The descriptive statistics are almost identical, but the distributions are very different. Look at your raw data first!

A few more like this:

Compute descriptive statistics

Descriptive statistics (as opposed to inferential statistics) allow you to summarize the observations in your sample.

 Important


Always represent a dispersion measure of your data (e.g., standard deviation)

Why is it important? Let's take an example!

Example

Generate data

In [33]:
# Seed the random number generator
np.random.seed(4)

# Set the parameters
mean_1 = 150.0
mean_2 = 200.0

small_std = 10.0
large_std = 50.0

n = 100
In [34]:
# Create the samples
val1_large_std = np.random.normal(mean_1, scale=large_std, size=n)
val2_large_std = np.random.normal(mean_2, scale=large_std, size=n)

val1_small_std = np.random.normal(mean_1, scale=small_std, size=n)
val2_small_std = np.random.normal(mean_2, scale=small_std, size=n)

# Print a few values
print("val1_large_std (3 first values):", val1_large_std[:3])
print("val2_large_std (3 first values):", val2_large_std[:3])
print("val1_small_std (3 first values):", val1_small_std[:3])
print("val2_small_std (3 first values):", val2_small_std[:3])
val1_large_std (3 first values): [152.52808536 174.99756666 100.20455344]
val2_large_std (3 first values): [236.91235556 232.53766155 230.73703148]
val1_small_std (3 first values): [145.27337184 150.96267944 162.29213395]
val2_small_std (3 first values): [211.9451626  195.01279275 199.19260386]

Visualize the distribution

In [36]:
# Create figure and axes
fig, axes = plt.subplots(ncols=2, nrows=2, figsize=(16, 9))

# For each dataset (each containing two samples)
for i, (val1, val2) in enumerate(((val1_large_std, val2_large_std),
                                  (val1_small_std, val2_small_std))):

    # Create histograms
    ax = axes[i, 0]
    sns.histplot(x=val1, ax=ax, color="C0", kde=False, alpha=0.5, lw=0)
    sns.histplot(x=val2, ax=ax, color="C1", kde=False, alpha=0.5, lw=0)
    
    # Plot the theoretical mean
    ax.axvline(mean_1, ls='--', color='black', alpha=0.1, lw=2)
    ax.axvline(mean_2, ls='--', color='black', alpha=0.1, lw=2)
    
    # Set the axis labels
    ax.set_ylabel("Count")
    ax.set_xlabel("value")

    # Create a barplot
    ax = axes[i, 1]
    df = pd.DataFrame({"x": val1, "y": val2}).melt()
    sns.barplot(x="variable", y="value", ax=ax, data=df, ci="sd")
    
    # Add horizontal lines representing the means
    ax.axhline(mean_1, ls='--', color='black', alpha=0.1, lw=2)
    ax.axhline(mean_2, ls='--', color='black', alpha=0.1, lw=2)
    
    # Set the y limits
    ax.set_ylim(0, max(mean_1, mean_2) + large_std * 1.25)

plt.tight_layout()
plt.show()

The differences between the means are identical, but the dispersions are different. In one case, it seems adequate to consider that there is a difference between $X$ and $Y$, while it is not that evident in the other. Always look at the dispersion (STD/variance)!

Compute inferential statistics

What for? To reply to the question:

Can we generalize what we observe in our sample to the parent population?

Several tests exist, depending on the distribution of your data, the experimental design, and more generally the conditions of application of each test:

Each of these tests gives a probability (p-value) of rejecting the null hypothesis by mistake.

 Important


A statistical test has conditions of application
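As a minimal sketch (simulated data; normality of the samples is the only condition checked here), one can verify the conditions before choosing the test, e.g., check normality with a Shapiro-Wilk test and fall back to a non-parametric Mann-Whitney U test when normality is doubtful:

import numpy as np
import scipy.stats as stats

np.random.seed(0)

# Two clearly non-normal (skewed) samples
a = np.random.exponential(scale=10, size=40)
b = np.random.exponential(scale=15, size=40)

# Check the normality condition for each sample (Shapiro-Wilk test)
_, p_norm_a = stats.shapiro(a)
_, p_norm_b = stats.shapiro(b)

if p_norm_a > 0.05 and p_norm_b > 0.05:
    # Normality looks acceptable: use a t-test
    stat, p = stats.ttest_ind(a, b)
    print(f"t-test: stat = {stat:.2f}, p = {p:.4f}")
else:
    # Normality is doubtful: fall back to a non-parametric test
    stat, p = stats.mannwhitneyu(a, b, alternative="two-sided")
    print(f"Mann-Whitney U: stat = {stat:.2f}, p = {p:.4f}")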

Example of application

Let's re-use the data from the previous example...

Visualize the data

In [48]:
# Create figure and axes
fig, axes = plt.subplots(ncols=2, nrows=2, figsize=(16, 9))

# For each dataset (each containing two samples)
for i, (val1, val2) in enumerate(((val1_large_std, val2_large_std),
                                  (val1_small_std, val2_small_std))):

    # Create histograms
    ax = axes[i, 0]
    sns.histplot(x=val1, ax=ax, color="C0", kde=False, alpha=0.5, lw=0)
    sns.histplot(x=val2, ax=ax, color="C1", kde=False, alpha=0.5, lw=0)
    
    # Plot the theoretical mean
    ax.axvline(mean_1, ls='--', color='black', alpha=0.1, lw=2)
    ax.axvline(mean_2, ls='--', color='black', alpha=0.1, lw=2)
    
    # Set the axis labels
    ax.set_ylabel("Count")
    ax.set_xlabel("value")

    # Create a barplot
    ax = axes[i, 1]
    df = pd.DataFrame({"x": val1, "y": val2}).melt()
    sns.barplot(x="variable", y="value", ax=ax, data=df, ci="sd")
    
    # Add horizontal lines representing the means
    ax.axhline(mean_1, ls='--', color='black', alpha=0.1, lw=2)
    ax.axhline(mean_2, ls='--', color='black', alpha=0.1, lw=2)
    
    # Set the y limits
    ax.set_ylim(0, max(mean_1, mean_2) + large_std * 1.25)

plt.tight_layout()
plt.show()

Run test

In [49]:
# Run a Student's t-test
t, p = stats.ttest_ind(val1_large_std, val2_large_std)

# Print the results
print(f"t={t}, p={p}")
t=-7.277602743357337, p=7.789522376440221e-12
In [50]:
# Run a Student's t-test
t, p = stats.ttest_ind(val1_small_std, val2_small_std)

# Print the results
print(f"t={t}, p={p}")
t=-34.11619742828222, p=7.512953246606366e-85

It turns out that $n$ is so large that the difference is statistically significant in both cases. Inferential statistics are a good tool to know whether we can generalize what we observe in our sample!

Intuition about the meaning of the p-value

Intuitively, the p-value can be understood as the probability of rejecting the null hypothesis by mistake.

So let's simulate and see how often I could be wrong...

Generate data

In [68]:
# Seed the random number generator
np.random.seed(1234)

# Set the parameters
n = 100
mu1, mu2 = 100, 110
sigma1, sigma2 = 30, 60

# Generate two samples
x1 = np.random.normal(mu1, scale=sigma1, size=n)
x2 = np.random.normal(mu2, scale=sigma2, size=n)

# Look at the data
pd.DataFrame(data={"x1": x1, "x2": x2})
Out[68]:
x1 x2
0 114.143055 127.472322
1 64.270729 143.992022
2 142.981209 140.215506
3 90.620443 127.117741
4 78.382338 139.057287
... ... ...
95 97.541588 84.107029
96 89.657020 100.331786
97 115.848644 163.349450
98 67.930336 127.302611
99 84.643561 46.907664

100 rows × 2 columns

Visualize data

In [69]:
# Create figure and axes
fig, (ax1, ax2) = plt.subplots(ncols=2, figsize=(16, 4))

# Create the histograms
for (x, color) in ((x1, "C0"), (x2, "C1")):
    sns.histplot(x, ax=ax1, color=color, linewidth=0, alpha=0.5, 
                 stat="density")  # y-axis is density instead of counts

# Put labels
ax1.set_ylabel("Density")
ax1.set_xlabel("value")

# Plot the barplot
df = pd.DataFrame({"x1": x1, "x2": x2}).melt(var_name="group")
sns.barplot(x="group", y="value", ax=ax2, data=df, ci="sd")

plt.tight_layout()
plt.show()

Run test

In [71]:
# Run a Welch t-test
t, p = stats.ttest_ind(x1, x2, equal_var=False)

# Print the results
print(f"t = {t}, p = {p}")
t = -0.8967917151340026, p = 0.37129608235308287

Reproduce test result

...by looking at how frequently the 'inverse' difference is observed in simulated data

In [72]:
# Compute the observed means for the two samples
obs_mu1 = np.mean(x1)
obs_mu2 = np.mean(x2)

# Compute the observed standard deviations for the two samples
obs_sigma1 = np.std(x1)  
obs_sigma2 = np.std(x2)
In [73]:
# Set the parameters
n_dataset = 10000

# Container for the results 
mu1_sup_mu2 = np.zeros(n_dataset)

# Create data for each dataset
for i in range(n_dataset):
    
    # Create two samples of the same size, using the parameters observed in the original samples
    x1 = np.random.normal(obs_mu1, scale=obs_sigma1, size=n)  
    x2 = np.random.normal(obs_mu2, scale=obs_sigma2, size=n)
    
    # Check whether the observed mean of sample 1
    # is greater than the mean of sample 2
    r = np.mean(x1) > np.mean(x2)
    
    # Store the result 
    mu1_sup_mu2[i] = r

# Compute the frequency with which
# the 'inverse' difference is observed
err_freq = np.mean(mu1_sup_mu2)

Visualize the results

In [75]:
# Create figure and axis
fig, ax = plt.subplots()

# Define labels
labels = [r'$\mu_1 > \mu_2$ (freq)', "p-value / 2"]

# Define the values
values = [err_freq, p/2]

# Create barplot
ax.bar(labels, values)

# Set the limit of the y-axis
ax.set_ylim(0, 0.5)

plt.show()
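If everything went as expected, the two bars are roughly the same height: the frequency with which the simulated datasets show the 'inverse' difference is close to half of the (two-sided) p-value returned by the t-test. This is one way to build an intuition about what the p-value measures.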

Check for validity threats

Differentiate internal and external validity
  • Internal: relative to your data; are the manipulations that you did the (main) cause of what you observe?
  • External: relative to the relation between your data and the outside world; can what you observe be generalized?

Correlation is not causation

 Important


Remember that correlation does not imply causation

Why is it important? Let's take an example...

Example

In [45]:
# Load the data
data = pd.read_csv(os.path.join("data", "corr.csv"))

# Display the top of the file
data
Out[45]:
Per capita consumption of cheddar cheese (US) Pounds (USDA) People killed by immunosuppressive agents Deaths (US) (CDC) year
0 9.7 7 2000
1 9.9 8 2001
2 9.6 3 2002
3 9.2 1 2003
4 10.3 24 2004
5 10.3 16 2005
6 10.4 14 2006
7 10.0 13 2007
8 10.1 15 2008
9 10.1 13 2009
In [115]:
# Create shortcuts for these very long labels
lab_cheese = "Per capita consumption of cheddar cheese (US) Pounds (USDA)"
lab_death = "People killed by immunosuppressive agents Deaths (US) (CDC)"

# Check for misspellings
assert lab_cheese in data.columns
assert lab_death in data.columns

# Create the figure and axis
fig, ax = plt.subplots(figsize=(12, 6))

# Create a line for the cheese consumption
sns.lineplot(x="year", y=lab_cheese, data=data, marker="o", color="C0",
             label="cheese consumption", ax=ax, legend=False)

# Create a duplicate of the axis for having a second y-axis
ax = ax.twinx()

# Create a line for the death number
sns.lineplot(x="year", y=lab_death, data=data, marker="P", color="C1",
             label="death number", ax=ax, legend=False)

# Make the legend 
ax.figure.legend()

# Manually set the placement of the x-axis ticks
plt.xticks(range(min(data["year"]), max(data["year"])+1, 3))

plt.show()
In [116]:
# Create figure and axis
fig, ax = plt.subplots(figsize=(12, 6))

# Plot the linear regression
sns.regplot(x=lab_cheese, y=lab_death, data=data, ax=ax)

plt.show()
In [117]:
# Compute the correlation coefficient
r, p = stats.pearsonr(data[lab_cheese], data[lab_death])

# Print the results
print(f"r = {r}, p = {p}")
r = 0.8835087706585109, p = 0.0006984113903627763

You can find a lot of surprising spurious correlations here (and also create your own): http://www.tylervigen.com/spurious-correlations

Confounding factors

 Important


Check for confounding factors

Example
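A minimal sketch (simulated data, hypothetical variables): a third variable that influences both the independent and the dependent variable can create an apparent relation between them.

import numpy as np
import scipy.stats as stats

np.random.seed(0)
n = 500

# Hypothetical confound: age influences both daily typing practice (hours) and typing speed (WPM)
age = np.random.uniform(15, 75, size=n)
practice_hours = 5 - 0.05 * age + np.random.normal(0, 0.5, size=n)
wpm = 80 - 0.5 * age + np.random.normal(0, 5, size=n)

# Practice and speed appear strongly correlated...
r, p = stats.pearsonr(practice_hours, wpm)
print(f"practice vs. WPM: r = {r:.2f}, p = {p:.2g}")

# ...but within a narrow age band (i.e., controlling for the confound), the relation largely vanishes
young = (age >= 20) & (age <= 25)
r, p = stats.pearsonr(practice_hours[young], wpm[young])
print(f"practice vs. WPM (ages 20-25 only): r = {r:.2f}, p = {p:.2g}")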

Misrepresentation = misinterpretation

 Important


Choose a representation adapted to the type of your data: misrepresentations lead to misinterpretations

Why is it important? Let's take an example...

Example

In [77]:
# Import the data
df = pd.read_csv(os.path.join("data", "rr.csv"))

# Display the top of the file
df
Out[77]:
Country Year Debt Growth
0 Australia 1946 190.419080 -3.557951
1 Australia 1947 177.321371 2.459475
2 Australia 1948 148.929811 6.437534
3 Australia 1949 125.828699 6.611994
4 Australia 1950 109.809398 6.920201
... ... ... ... ...
1170 US 2005 62.766724 3.054518
1171 US 2006 63.489973 2.672807
1172 US 2007 63.985488 2.141613
1173 US 2008 74.091061 0.438166
1174 US 2009 83.482835 -2.730170

1175 rows × 4 columns

In [81]:
# Create bins
df['DebtBin'] = pd.cut(df.Debt, bins=range(0, 250, 40), include_lowest=False)

# Compute the mean of each bin
y = df.groupby('DebtBin').Growth.mean()

# For the x-axis, compute the middle value of each bin
x = [i.left + (i.right - i.left)/2 for i in y.index.values]

# Create the barplot
fig, ax = plt.subplots(figsize=(10, 6))
sns.barplot(x=x, y=y.values, palette="Blues_d", ax=ax)

# Set the axis labels
ax.set_xlabel("Debt")
ax.set_ylabel("Growth");

However, here is what the raw data look like:

In [78]:
# Create the figure and axis
fig, ax = plt.subplots(figsize=(12, 9))

# Plot a scatter instead
sns.scatterplot(x="Debt", y="Growth", data=df, ax=ax);

The 'step' effect is an artefact due to the misrepresentation of the data. So: (i) look at your raw data, and (ii) choose a representation adapted to the structure of your data.

One figure from the original paper:

It is probably possible to do better: this representation leads to misinterpretation!

Occam's razor

 Important


In case of a tie, applying Occam's razor is often a good strategy

For a contextualized explanation, you can refer to the Stanford Encyclopedia: Baker, Alan, "Simplicity", The Stanford Encyclopedia of Philosophy (Winter 2016 Edition), Edward N. Zalta (ed.), URL = https://plato.stanford.edu/archives/win2016/entries/simplicity/.


Laplace went in state to Napoleon to present a copy of his work [...]. Someone had told Napoleon that the book contained no mention of the name of God; Napoleon [...] received it with the remark, 'M. Laplace, they tell me you have written this large book on the system of the universe, and have never even mentioned its Creator.' Laplace [...] answered bluntly, [...] "I had no need of that hypothesis."[...] Napoleon, greatly amused, told this reply to Lagrange, who exclaimed [...] "Ah, it is a fine hypothesis; it explains many things."

From W.W. Rouse Ball A Short Account of the History of Mathematics, 4th edition, 1908.

If all of chemistry can be explained in a satisfactory manner without the help of phlogiston, that is enough to render it infinitely likely that the principle does not exist, that it is a hypothetical substance, a gratuitous supposition. It is, after all, a principle of logic not to multiply entities unnecessarily (Lavoisier 1862, pp. 623–4).

[T]he grand aim of all science…is to cover the greatest possible number of empirical facts by logical deductions from the smallest possible number of hypotheses or axioms (Einstein, quoted in Nash 1963, p. 173).

Both quotations from the Stanford Encyclopedia.
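In a data-analysis setting, this principle often translates into preferring the simpler of two models that explain the data about equally well. A minimal sketch (simulated data; the BIC is used here as one possible criterion that penalizes complexity):

import numpy as np

np.random.seed(0)
n = 50

# Simulated data: the true relation is linear
x = np.linspace(0, 10, n)
y = 2.0 * x + 1.0 + np.random.normal(0, 1.0, size=n)

def bic(y_obs, y_pred, k):
    # Bayesian Information Criterion (lower is better); k = number of fitted parameters
    n_obs = len(y_obs)
    rss = np.sum((y_obs - y_pred) ** 2)
    return n_obs * np.log(rss / n_obs) + k * np.log(n_obs)

# Fit a simple model (degree-1 polynomial) and a more complex one (degree-5 polynomial)
for degree in (1, 5):
    coefs = np.polyfit(x, y, deg=degree)
    y_pred = np.polyval(coefs, x)
    print(f"degree {degree}: BIC = {bic(y, y_pred, k=degree + 1):.1f}")

# The degree-5 polynomial fits the sample slightly better, but the BIC favors the simpler model.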