science-notes/source/statistics/tests/parametric/ztest.rst
2023-10-19 10:01:43 +02:00


Z-Test
-------
The z-test is used to assess whether the mean :math:`\overline{x}` of a sample :math:`X` differs from the mean of a known population.
The *significance level* of this difference is set by a *p-value* threshold chosen prior to doing the test.
Conditions for using a z-test:

#. The population is normally distributed
#. The population :math:`\mu` and :math:`\sigma` are known
#. The sample size is greater than 30 (see note below)

.. note::
   According to the central limit theorem, the distribution of the sample mean is well approximated by a normal distribution once the sample reaches about 30 observations.
   See `here <https://statisticsbyjim.com/basics/central-limit-theorem/>`__ for more information.

To perform a z-test, compute the *standard score* (or *z-score*) of your sample :math:`X`.
The *z-score*, noted :math:`Z`, characterizes how far your sample mean :math:`\overline{x}` is from the population mean :math:`\mu`, in units of standard deviation :math:`\sigma`.
It is computed as follows:

.. math::

   Z=\frac{\overline{x}-\mu}{\sigma}

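As a minimal sketch, the z-score formula of this section can be written as a small R function. The helper name ``z_score`` and the numeric values are hypothetical and only serve as an illustration.

```r
# Z-score as defined in this section: distance between the sample mean and
# the population mean, expressed in units of the standard deviation sigma.
# `z_score` is a hypothetical helper name, not a base R function.
z_score <- function(x_bar, mu, sigma) {
  stopifnot(sigma > 0)
  (x_bar - mu) / sigma
}

# Illustrative values only
z <- z_score(x_bar = 52, mu = 50, sigma = 4)
cat("Z =", z, "\n")
```
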
.. note::
   The following formula can also be seen when the population :math:`\sigma` is unknown:

   .. math::

      Z=\frac{\overline{x}-\mu}{\mathrm{SEM}}=\frac{\overline{x}-\mu}{\frac{s}{\sqrt{n}}}

   In this case, :math:`Z` technically follows a Student's t-distribution (Student's t-test).
   However, if :math:`n` is sufficiently large, the distribution followed by :math:`Z` is very close to a normal one.
   So close that using a z-test in place of Student's t-test to compute *p-values* leads to negligible differences (`source <https://stats.stackexchange.com/questions/625578/why-is-the-sample-standard-deviation-used-in-the-z-test>`__).

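To get a feeling for how close the two distributions are, the sketch below compares the left-tail probability given by the normal CDF (``pnorm``) with the one given by the Student's t CDF (``pt``) with :math:`n-1` degrees of freedom. The statistic value ``-1.8`` and the sample sizes are assumptions chosen for illustration.

```r
# Left-tail probability of the same statistic under N(0,1) and under a
# Student's t-distribution with n - 1 degrees of freedom.
# The statistic and the sample sizes are illustrative assumptions.
z <- -1.8
sample_sizes <- c(10L, 30L, 100L, 1000L)

p_normal <- pnorm(z)                       # standard normal CDF
p_student <- pt(z, df = sample_sizes - 1)  # vectorized over df
gap <- p_student - p_normal                # shrinks as n grows

print(data.frame(n = sample_sizes, p_normal = p_normal,
                 p_student = p_student, gap = gap))
```

The ``gap`` column should shrink as ``n`` grows, since the t-distribution has heavier tails than the normal one and converges to it.
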
From :math:`Z`, a *p-value* can be derived using the :math:`\mathcal{N}(0,1)` :ref:`CDF <CDF>` noted :math:`\Phi_{0,1}(x)`:

* Left "tail" of the :math:`\mathcal{N}(0,1)` distribution:

  .. math::

     \alpha &= P(\mathcal{N}(0,1)<Z\sigma) \\
     &=P(\mathcal{N}(0,1)<Z\times 1) \\
     &=P(\mathcal{N}(0,1)<Z)=\Phi_{0,1}(Z)

* Right "tail" of the :math:`\mathcal{N}(0,1)` distribution:

  .. math::

     \alpha &= 1-P(\mathcal{N}(0,1)<Z\sigma) \\
     &=1-P(\mathcal{N}(0,1)<Z\times 1) \\
     &=1-P(\mathcal{N}(0,1)<Z)=1-\Phi_{0,1}(Z)

.. image:: ../../figures/normal_law_tails.svg
   :align: center

|
| If the test is done over one tail (left OR right), it is called a **one-tailed** z-test.
| If the test is done over both tails (left AND right), it is called a **two-tailed** z-test.
| A *one-tailed* z-test checks whether :math:`\overline{x} < \mu` or :math:`\overline{x} > \mu`.
| A *two-tailed* z-test checks whether :math:`\overline{x} \ne \mu`.

The following code shows how to obtain the *p-value* in R:

.. literalinclude:: ../../code/ztest_pvalue.R
   :language: R

Output example:

.. code-block:: console

   Alpha approximated is 0.0359588035958804
   Alpha from built-in CDF 0.0359303191129258

If the :math:`\alpha` value given by the test is lower than or equal to the *p-value* threshold chosen initially,
:math:`H_0` is rejected and :math:`H_1` is considered accepted.

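The tail formulas and the decision rule can be combined in a short sketch. The helper ``z_test_p_value``, the z-score ``-1.8`` and the threshold ``0.05`` are hypothetical choices for illustration, not the content of the included script.

```r
# Hypothetical helper: probability associated with a z-score for a given tail.
z_test_p_value <- function(z, tail = c("left", "right", "two")) {
  tail <- match.arg(tail)
  switch(tail,
    left  = pnorm(z),            # Phi(Z)
    right = 1 - pnorm(z),        # 1 - Phi(Z)
    two   = 2 * pnorm(-abs(z))   # both tails, using the symmetry of N(0,1)
  )
}

threshold <- 0.05  # assumed threshold, chosen before the test
z <- -1.8          # illustrative z-score

p <- z_test_p_value(z, "left")
cat("Left-tail value:", p, "\n")
cat("Reject H0:", p <= threshold, "\n")
```
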
An alternative way of doing the z-test is to build a **rejection region** from the *p-value* threshold.
This is done by using the inverse :ref:`CDF <CDF>` (quantile function) :math:`\Phi_{0,1}^{-1}(x)` as in the following code:

.. literalinclude:: ../../code/ztest_rejection_region.R
   :language: R

Output:

.. code-block:: console

   Rejection region for left tail: Z in ]-inf,-3.43161440362327]
   Rejection region for right tail: Z in [3.4316144036233,+inf[

Thus, if the z-score is part of one of the rejection regions, :math:`H_0` is rejected and :math:`H_1` is considered accepted.

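As a rough sketch of the same idea, the bounds of the one-tailed rejection regions can be obtained with ``qnorm`` and a z-score checked against them. The threshold ``0.05`` and the z-score ``-1.8`` are assumptions for illustration and differ from the values behind the output shown above.

```r
# Bounds of the one-tailed rejection regions for an assumed threshold.
threshold <- 0.05
left_bound <- qnorm(threshold)        # left region:  ]-inf, left_bound]
right_bound <- qnorm(1 - threshold)   # right region: [right_bound, +inf[

z <- -1.8  # illustrative z-score
in_left_region <- z <= left_bound
in_right_region <- z >= right_bound

cat("Left bound:", left_bound, " Right bound:", right_bound, "\n")
cat("Z in left rejection region:", in_left_region, "\n")
cat("Z in right rejection region:", in_right_region, "\n")
```
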
Examples
========
One-tailed
^^^^^^^^^^^
This exercise is inspired by `this video <https://www.youtube.com/results?search_query=ztest>`__ *(be careful, the video uses a wrong formula)*.
A complaint was registered stating that the boys in the municipal school are underfed.
The average weight of boys of age 10 is 32 kg with a standard deviation of 9 kg.
A sample of 25 boys of age 10 from the school is selected. Their average weight is 29.5 kg.
We want to check whether the complaint is true or not with a significance level of :math:`\alpha=0.05`.

**--- Solution ---**

Hypothesis:

* :math:`H_0` : No significant difference (:math:`\overline{x} \ge 32`), the boys from the school are not underfed
* :math:`H_1` : There is a significant difference (:math:`\overline{x} < 32`), the boys from the school are underfed

.. math::

   Z=\frac{29.5-32}{9}=-0.2777778

From this z-score, the *p-value* is 0.3905915. As it is greater than 0.05, we cannot reject :math:`H_0`.
Thus, there is no significant evidence that the boys from the school are underfed.

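The numbers of this exercise can be reproduced with a few lines of R. This is only a sketch following the z-score formula used in this section. The variable names are arbitrary.

```r
# One-tailed (left) z-test for the school example.
x_bar <- 29.5      # sample mean weight (kg)
mu <- 32           # population mean weight (kg)
sigma <- 9         # population standard deviation (kg)
threshold <- 0.05  # threshold chosen before the test

z <- (x_bar - mu) / sigma
p_value <- pnorm(z)  # left tail, since H1 is "x_bar < mu"

cat("Z =", z, "\n")
cat("p-value =", p_value, "\n")
cat("Reject H0:", p_value <= threshold, "\n")
```
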
Two-tailed
^^^^^^^^^^^
This exercise is inspired by `this website <https://www.mathandstatistics.com/learn-stats/hypothesis-testing/two-tailed-z-test-hypothesis-test-by-hand>`__.
The mean yearly public school funding in the USA is $6800 per student, with a standard deviation of $400.
We want to assess whether a certain state in the USA, Michigan, receives a significantly different amount of public school funding (per student) than the USA average,
with :math:`\alpha=0.05`. A sample of 1000 students reveals that, on average, they received $6873.

.. note::
   Notice that we are not saying **"significantly lower amount"** or **"significantly higher amount"** but **"significantly different amount"**.
   This is a sign that a two-tailed z-test is required, since we should check for both lower and higher.

**--- Solution ---**

Hypothesis:

* :math:`H_0` : No significant difference (:math:`\overline{x} = 6800`), Michigan receives the same amount of public school funding per student
* :math:`H_1` : There is a significant difference (:math:`\overline{x} \ne 6800`), Michigan does not receive the same amount of public school funding per student

.. math::

   Z=\frac{6873-6800}{400}=0.1825

| The *p-value* associated with the left tail (using :math:`-Z` with the CDF) is 0.4275952.
| Thus, as we are doing a *two-tailed* z-test, the *p-value* is :math:`2\times 0.4275952 = 0.8551904`.
| We multiply by two as the two tails of the normal law are symmetric.

Since :math:`0.8551904 \gg 0.05`, we cannot reject the null hypothesis :math:`H_0`.
Thus, there is no significant evidence that Michigan receives a different amount of public school funding per student.
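
The two-tailed computation can be sketched in R as follows, again following the z-score formula used in this section. The variable names are arbitrary.

```r
# Two-tailed z-test for the school funding example.
x_bar <- 6873      # sample mean funding per student ($)
mu <- 6800         # USA mean funding per student ($)
sigma <- 400       # USA standard deviation ($)
threshold <- 0.05  # threshold chosen before the test

z <- (x_bar - mu) / sigma
p_left <- pnorm(-abs(z))  # left tail, evaluated at -|Z|
p_value <- 2 * p_left     # both tails, by symmetry of N(0,1)

cat("Z =", z, "\n")
cat("p-value =", p_value, "\n")
cat("Reject H0:", p_value <= threshold, "\n")
```
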