{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Introduction to mathematical statistics \n", "\n", "Welcome to the lecture 10 in 02403\n", "\n", "During the lectures we will present both slides and notebooks. \n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import matplotlib.pyplot as plt\n", "import pandas as pd\n", "import scipy.stats as stats\n", "import statsmodels.api as sm\n", "import statsmodels.formula.api as smf\n", "import statsmodels.stats.power as smp\n", "import statsmodels.stats.proportion as smprop" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Example: Intro to ANOVA\n", "\n", "Is there a difference (in mean) between groups A, B, and C?\n", "Analysis of variance (ANOVA) can be used for the analysis,
\n", "provided **the observations in each group can be assumed to be normally distributed.**\n", "\n", "| Group A | Group B | Group C |\n", "|---------|---------|---------|\n", "| 2.8 | 5.5 | 5.8 |\n", "| 3.6 | 6.3 | 8.3 |\n", "| 3.4 | 6.1 | 6.9 |\n", "| 2.3 | 5.7 | 6.1 |\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Make pandas dataframe with grouped data:\n", "data = pd.DataFrame({\n", " 'value': [2.8, 3.6, 3.4, 2.3, 5.5, 6.3, 6.1, 5.7, 5.8, 8.3, 6.9, 6.1], \n", " 'group': [\"A\", \"A\", \"A\", \"A\", \"B\", \"B\", \"B\", \"B\", \"C\", \"C\", \"C\", \"C\"]})\n", "data" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data.plot.scatter('group', 'value') # Pandas scatter plot" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data.boxplot(by=\"group\", showmeans=True) # Pandas boxplot" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The three groups have different means - indicated by the green triangles" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Example: Estimate parameters $\\mu$, $\\alpha_i$ and $\\sigma^2$" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Compute the overall mean and add to dataframe:\n", "data['overall_mean'] = data[\"value\"].mean()\n", "data" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# compute the mean within each group and add to dataframe:\n", "data['group_mean'] = data.groupby(\"group\")['value'].transform('mean')\n", "data" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# compute the \"alpha\" for each group and add to dataframe:\n", "data[\"alpha\"] = data[\"group_mean\"] - data[\"overall_mean\"]\n", "data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Variation within groups (Residual variation left after the model):\n", "\n", "$\n", "SSE = \\sum_{i=1}^{k} \\sum_{j=1}^{n_i} \\left( y_{ij} - \\bar{y}_i \\right)^2\n", "$" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# calculate the individual contribution to SSE and add to dataframe:\n", "data['sse_contribution'] = (data['value']-data['group_mean'])**2\n", "data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "$\n", "\\hat{\\sigma}^2 = MSE = \\frac{SSE}{n-k}\n", "$" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# calculate SSE and MSE:\n", "SSE = data[\"sse_contribution\"].sum()\n", "MSE = SSE / (12-3)\n", "print([SSE, MSE])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "LM Method" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "one = np.repeat(1,4)\n", "zero = np.repeat(0,4)\n", "col1 = np.append(np.append(one,zero),zero)\n", "col2 = np.append(np.append(zero,one),zero)\n", "col3 = np.append(np.append(zero,zero),one)\n", "\n", "X1 =np.array([col1,col2,col3]).T\n", "\n", "col1 = np.append(np.append(one,one),one)\n", "X2 =np.array([col1,col2,col3]).T\n", "\n", "col2 = np.append(np.append(one,zero),-one)\n", "col3 = np.append(np.append(zero,one),-one)\n", "\n", "\n", "X3 =np.array([col1,col2,col3]).T\n", "X3" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Try to calculate H for all design matrices to check that results are the same" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "H = X1@np.linalg.inv(X1.T@X1)@X1.T\n", "X0 = np.array([np.repeat(1,12)]).T\n", "H0 = (1/(X0.T@X0))*(X0@X0.T)\n", "I = np.identity(12)\n", "y = data[\"value\"]\n", "SST = y.T @ (I-H0) @ y\n", "SSE = y.T @ (I-H) @ y\n", "SSTr = y.T @ (H-H0) @ y\n", "\n", "dfSSE = np.sum(np.diag(I-H))\n", "dfTr = np.sum(np.diag(H-H0))\n", "dfTot = np.sum(np.diag(I-H0))\n", "## Coefficient table\n", "aov_Tab = np.array([[dfTr,dfSSE,dfTot],[SSTr,SSE,SST],\n", " [SSTr/dfTr,SSE/dfSSE,SST/dfTot]]).T\n", "col_names = [\"df\",\"SS\",\"ME\"]\n", "row_names = [\"Treatment\",\"Residual\",\"Total\"]\n", "aov_Tab = pd.DataFrame(aov_Tab,columns = col_names, index = row_names)\n", "aov_Tab" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Example: ANOVA table with python" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Make the ANOVA table:\n", "fit = smf.ols(\"value ~ group\", data=data).fit()\n", "anova_table = sm.stats.anova_lm(fit)\n", "print(anova_table)\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(fit.summary(fit,slim=True))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "np.linalg.inv(X2.T@X2)@X2.T@y" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data.boxplot(column=\"value\", by=\"group\", showmeans=True) # Pandas boxplot" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As in chapter 8" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "np.linalg.inv(X3.T@X3)@X3.T@y" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# compare with values in ANOVA table:\n", "print(anova_table)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Skive Fjord" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Is the an effect of \"vandmiljø\"-plans? Use the September. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "SkiveAvg = pd.read_csv(\"../week1/skiveAvg.csv\", sep=';')\n", "SkiveAvg[\"vmp\"] = pd.Categorical(SkiveAvg[\"vmp\"])\n", "SkiveAvg" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "dat_september = SkiveAvg[SkiveAvg[\"month\"]==9] ## Only September\n", "fit = smf.ols(\"Nload ~ vmp\", data=dat_september).fit()\n", "anova_table = sm.stats.anova_lm(fit)\n", "print(anova_table)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Residual analysis" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "dat_september.boxplot(\"Nload\",by=\"vmp\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "vmp = SkiveAvg[\"vmp\"]\n", "vmp = vmp[SkiveAvg[\"month\"]==9]\n", "\n", "rstudent = fit.outlier_test()\n", "rstudent[\"vmp\"] = vmp\n", "rstudent.boxplot(\"student_resid\",by=\"vmp\")\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "e = rstudent[\"student_resid\"]\n", "n = len(e)\n", "e1 = e[1:(n-1)]\n", "e2 = np.roll(e,-1)[1:(n-1)]\n", "print(np.corrcoef(e1,e2)[0,1])\n", "rstudent[\"student_resid\"]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "n = len(rstudent[\"student_resid\"])\n", "fig = sm.qqplot(rstudent[\"student_resid\"], stats.t, distargs=(n-4,),loc=0,scale=1,line=\"q\",a=1/2)\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Post hoc comparison" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "M = 4 * 3 / 2\n", "alpha_bon = 0.05 / M\n", "n0 = np.sum(vmp==0)\n", "n1 = np.sum(vmp==1)\n", "n2 = np.sum(vmp==2)\n", "n3 = np.sum(vmp==3)\n", "dat0 = dat_september[vmp==0]\n", "dat1 = dat_september[vmp==1]\n", "dat2 = dat_september[vmp==2]\n", "dat3 = dat_september[vmp==3]\n", "N_load0 = np.mean(dat0[\"Nload\"])\n", "N_load1 = np.mean(dat1[\"Nload\"])\n", "N_load2 = np.mean(dat2[\"Nload\"])\n", "N_load3 = np.mean(dat3[\"Nload\"])\n", "crit_vals = np.array([-1,1]) * stats.t.ppf(1-alpha_bon,df=n-4) \n", "sig = np.sqrt(fit.scale)\n", "\n", "CI01 = N_load0 - N_load1 + crit_vals * sig * np.sqrt(1/n0+1/n1)\n", "CI02 = N_load0 - N_load2 + crit_vals * sig * np.sqrt(1/n0+1/n2)\n", "CI03 = N_load0 - N_load3 + crit_vals * sig * np.sqrt(1/n0+1/n3)\n", "CI12 = N_load1 - N_load2 + crit_vals * sig * np.sqrt(1/n1+1/n2)\n", "CI13 = N_load1 - N_load3 + crit_vals * sig * np.sqrt(1/n1+1/n3)\n", "CI23 = N_load2 - N_load3 + crit_vals * sig * np.sqrt(1/n2+1/n3)\n", "\n", "np.array([[CI01],[CI02],[CI03],[CI12],[CI13],[CI23]])" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "np.array([n0,n1,n2,n3])" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.16" } }, "nbformat": 4, "nbformat_minor": 2 }