# Annex B. Technical notes on analyses in this report

This report explores the teacher and school characteristics and practices that matter for student outcomes. The outcome indicators included in the analyses are sourced from the Programme for International Student Assessment (PISA) 2018 student data. These student outcomes are:

- Student achievement, as measured by PISA scores in reading, mathematics and science (Chapters 2 and 4).

- Student social and emotional outcomes, such as the indices of classroom disciplinary climate, teacher enthusiasm and student perception of difficulty of the PISA test, as well as the dummy variable of student expectation of completing at least a tertiary degree (Chapter 3).

- School-level disparities in outcomes between girls and boys, defined as the average school-level PISA score for girls minus the average school-level PISA score for boys (Chapter 4). Differences are positive when they are in favour of girls and negative when they are in favour of boys.
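As an illustration of the sign convention for the school-level gender gap, the following minimal sketch computes the gap for a few hypothetical schools. The record layout, the function name `school_gender_gaps` and the scores are assumptions made for this example; survey weights and plausible values are deliberately ignored here.

```python
from collections import defaultdict
from statistics import mean


def school_gender_gaps(students):
    """Return {school_id: mean score of girls - mean score of boys}."""
    by_school = defaultdict(lambda: {"girl": [], "boy": []})
    for school_id, gender, score in students:
        by_school[school_id][gender].append(score)
    gaps = {}
    for school_id, groups in by_school.items():
        # The gap is only defined for schools with both girls and boys.
        if groups["girl"] and groups["boy"]:
            gaps[school_id] = mean(groups["girl"]) - mean(groups["boy"])
    return gaps


# Hypothetical records: (school id, gender, score)
students = [
    ("S1", "girl", 520.0), ("S1", "girl", 500.0), ("S1", "boy", 490.0),
    ("S2", "girl", 450.0), ("S2", "boy", 480.0), ("S2", "boy", 460.0),
    ("S3", "girl", 510.0),  # no boys sampled: no gap can be computed
]
gaps = school_gender_gaps(students)
```

In this toy data the gap is positive for school S1 (in favour of girls) and negative for school S2 (in favour of boys).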

In order to make the most of the TALIS-PISA link,1 a broad range of variables were included in the analyses aiming to identify the teacher and school characteristics and practices that matter for student outcomes. The teacher and school factors included in the analyses represent variables from the TALIS components (both teacher and principal datasets) of the TALIS-PISA link 2018 data. The selection of teacher and school variables was guided by theory and previous research findings. Table A B.1 provides an overview of the 18 teacher and school dimensions and the almost 150 variables within these dimensions that are included in the analyses. These variables are jointly introduced in lasso regressions (see below) and are introduced separately by blocks for each dimension in standard regression analyses (see below).

However, apart from relying on theory and previous research findings, other considerations were also taken into account to guide the initial selection of TALIS indicators used for the analyses. Notably, it was deemed important to:

- Limit the loss of observations, due to missing values, given the large number of predictors included in the analyses. This was ensured mainly by opting for the inclusion of teacher questionnaire variables instead of principal questionnaire variables whenever it was possible. Indeed, because teacher data are aggregated at the school level, variables derived from the teacher questionnaire are less likely to have missing values.

- Include only those variables that were administered in all TALIS-PISA link participating countries and economies.

- Prioritise the use of complex scales that were available in the TALIS-PISA link datasets over the individual items contributing to these scales.

Due to the survey design of the TALIS-PISA link, teachers and students can be linked only at the school and not at the classroom level. In other words, the data do not allow matching teachers with their students; rather, the data only allow matching a sample of teachers teaching 15-year-old students in a school with a sample of 15-year-old students of that same school. Therefore, information that is based on teachers’ responses is always averaged at the school level within this report. Depending on the analysis, variables based on teachers’ responses are averaged either for all teachers within the school, or only for subject domain teachers (i.e. reading, mathematics or science teachers) within the school. For more detail on the share of schools sampled within the TALIS-PISA link by the number of subject domain teachers, see Table A B.2.
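The school-level linkage can be sketched as follows. This is a simplified illustration rather than the procedure used to build the TALIS-PISA link datasets: the field names (`school`, `subject`, `weight`, `value`), the function `school_means` and all values are hypothetical. Teacher responses are averaged with teacher weights, either over all teachers or over teachers of one subject, and every student in a school then receives the same school-level value.

```python
from collections import defaultdict


def school_means(teachers, subject=None):
    """Weighted school-level mean of one teacher variable.

    If `subject` is given, only teachers of that subject are averaged.
    """
    num = defaultdict(float)
    den = defaultdict(float)
    for t in teachers:
        if subject is not None and t["subject"] != subject:
            continue
        num[t["school"]] += t["weight"] * t["value"]
        den[t["school"]] += t["weight"]
    return {s: num[s] / den[s] for s in num if den[s] > 0}


# Hypothetical teacher and student records
teachers = [
    {"school": "S1", "subject": "mathematics", "weight": 1.0, "value": 2.0},
    {"school": "S1", "subject": "reading", "weight": 3.0, "value": 4.0},
    {"school": "S2", "subject": "reading", "weight": 2.0, "value": 3.0},
]
students = [{"id": 1, "school": "S1"}, {"id": 2, "school": "S2"}]

all_teachers = school_means(teachers)
math_teachers = school_means(teachers, subject="mathematics")

# Students are matched to teachers through the school identifier only.
linked = [{**s, "teacher_var": all_teachers.get(s["school"])} for s in students]
```

School S2 has no mathematics teacher in this toy sample, so it drops out of the subject-specific average. This mirrors how the analytical sample can vary with the way teacher variables are averaged.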

Most of the analyses presented in this report also include controls for student characteristics, such as student gender, migrant background and socio-economic status, based on data available from the PISA student dataset. In the case of the school-level analyses of Chapter 4, these student characteristics are averaged at the school level and introduced as controls.

The least absolute shrinkage and selection operator (also known as lasso), which is a machine learning technique within the family of supervised statistical learning methods, is applied throughout this report as a compass to guide the selection of key teacher and school factors related to student achievement, social-emotional skills and gaps in student performance within schools. In this report, lasso is used for model selection, although it can be used for prediction and inference as well. Lasso has several attributes that make it an attractive tool for selecting, among the many variables collected through the TALIS questionnaires, those that are potentially key predictors of student outcomes. These attributes are:

- Lasso is designed to select variables that are important and should be included in the model.

- The outcome variable guides the model selection process (i.e. supervised statistical learning method).

- Lasso can handle high-dimensional models where the number of variables is high relative to the number of observations.

Lasso is most useful when only a few out of many potential variables affect the outcome (Hastie, Tibshirani and Friedman, 2017[1]; Hastie, Tibshirani and Wainwright, 2015[2]; Tibshirani, 1996[3]). The assumption that the number of coefficients that are non-zero (i.e. correlated with the outcome variable) in the true model is small relative to the sample size is known as a sparsity assumption. The approximate sparsity assumption requires that the number of non-zero coefficients in the model that best approximates the true model be small relative to the sample size.

Lasso estimates coefficients in a model. It selects variables that correlate well with the outcome in one dataset (training sample) and then tests whether the selected variables predict the outcome well in another dataset (validation sample). Lasso proceeds with model selection by estimating model coefficients in such a way that some of the coefficient estimates are exactly zero and, hence, excluded from the model, while others are not (Hastie, Tibshirani and Friedman, 2017[1]; Hastie, Tibshirani and Wainwright, 2015[2]; Tibshirani, 1996[3]). In the context of model selection, lasso may not always be able to distinguish an irrelevant predictor that is highly correlated with the predictors in the true model from the true predictors (Wang et al., 2019[4]; Zhao and Yu, 2006[5]).

Lasso for linear models solves an optimisation problem. The lasso estimate is defined as:

${\widehat{\beta}}^{lasso}=\underset{\beta}{\mathrm{arg\,min}}\left\{\frac{1}{2N}\sum _{i=1}^{N}{\left({y}_{i}-{\beta}_{0}-\sum _{j=1}^{p}{x}_{ij}{\beta}_{j}\right)}^{2}+\lambda \sum _{j=1}^{p}{\omega}_{j}\left|{\beta}_{j}\right|\right\}$

where $y$ is the outcome variable, $x$ refers to the potential covariates, $\beta $ is the vector of coefficients on $x$, $\lambda $ is the lasso penalty parameter, $\omega $ refers to the parameter-level weights known as penalty loadings and $\sum _{j=1}^{p}\left|{\beta}_{j}\right|$ is the ${L}_{1}$ lasso penalty. As the lasso penalty term is not scale invariant, one needs to standardise the variables included in the model before solving the optimisation problem.

Thus, the optimisation problem contains two parts: the least-squares fit measure,

$\frac{1}{2N}\sum _{i=1}^{N}{\left({y}_{i}-{\beta}_{0}-\sum _{j=1}^{p}{x}_{ij}{\beta}_{j}\right)}^{2}$

and the penalty term,

$\lambda \sum _{j=1}^{p}{\omega}_{j}\left|{\beta}_{j}\right|$

The $\lambda $ and $\omega $ parameters (also called “tuning” parameters) specify the weight applied to the penalty term. When $\lambda $ is large, the penalty term is also large, which results in lasso selecting few or no variables. As $\lambda $ decreases, the penalty associated with each non-zero $\beta $ decreases, which results in an increase in the number of coefficient estimates kept by lasso. When $\lambda =0$, then lasso reduces to the ordinary least squares (OLS) estimator without any coefficient estimates being excluded from the model.
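The role of $\lambda $ can be illustrated with a minimal sketch that minimises the objective above by cyclic coordinate descent with soft-thresholding. This is one common way of solving the problem and is offered only as an illustration under stated assumptions, not as a description of the software used for this report. The function names (`soft_threshold`, `standardise`, `lasso_cd`) and the toy data are hypothetical; the two covariates are constructed to be uncorrelated so that the result is easy to check by hand.

```python
def soft_threshold(z, t):
    """Shrink z towards zero by t; values within [-t, t] become exactly 0."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def standardise(col):
    """Centre a column and scale it to unit (population) variance."""
    n = len(col)
    m = sum(col) / n
    sd = (sum((v - m) ** 2 for v in col) / n) ** 0.5
    return [(v - m) / sd for v in col]


def lasso_cd(columns, y, lam, loadings=None, n_iter=200):
    """Minimise (1/2N) * RSS + lam * sum_j loadings[j] * |beta_j|.

    `columns` is a list of standardised covariate columns.
    """
    n, p = len(y), len(columns)
    w = loadings if loadings is not None else [1.0] * p
    b0 = sum(y) / n  # with centred covariates the intercept is the mean of y
    beta = [0.0] * p
    resid = [yi - b0 for yi in y]
    for _ in range(n_iter):
        for j, xj in enumerate(columns):
            # Covariance of x_j with the residual that excludes x_j's own fit
            rho = sum(x * (r + x * beta[j]) for x, r in zip(xj, resid)) / n
            scale = sum(x * x for x in xj) / n
            new = soft_threshold(rho, lam * w[j]) / scale
            delta = new - beta[j]
            if delta != 0.0:
                resid = [r - x * delta for x, r in zip(xj, resid)]
                beta[j] = new
    return b0, beta


# Toy data: two uncorrelated standardised covariates, no noise
z1 = standardise([1.0, 2.0, 3.0, 4.0])
z2 = standardise([1.0, -1.0, -1.0, 1.0])
y = [10.0 + 2.0 * a + 0.3 * b for a, b in zip(z1, z2)]

_, beta_ols = lasso_cd([z1, z2], y, lam=0.0)   # no penalty: OLS solution
_, beta_mid = lasso_cd([z1, z2], y, lam=0.5)   # small coefficient is dropped
_, beta_big = lasso_cd([z1, z2], y, lam=5.0)   # large penalty: nothing selected
```

With $\lambda =0$ the sketch recovers the least-squares coefficients (2 and 0.3). With $\lambda =0.5$ the larger coefficient is shrunk to 1.5 and the smaller one is set exactly to zero, and with $\lambda =5$ no covariate is kept.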

Two commonly used methods to select the so-called “tuning” parameters are cross-validation (CV) and the adaptive lasso. CV finds the ${\lambda}^{*}$ that minimises the out-of-sample prediction error. Although CV works well for prediction, it tends to include covariates whose coefficients are zero in the true model that best approximates the data. The adaptive lasso, which consists of two CVs, is more parsimonious when it comes to model selection. After finding a CV solution for ${\lambda}^{*}$, it does another CV among the covariates selected in the first step by using weights ($\omega =1/\left|\widehat{\beta}\right|$, where $\widehat{\beta}$ are the penalised estimates from the first CV) on the coefficients in the penalty function. Covariates with smaller coefficients are more likely to be excluded in the second step (Drukker and Liu, 2019[6]).
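The weighting step of the adaptive lasso can be sketched as follows. The first-step estimates, the value of the penalty and the function name `adaptive_weights` are hypothetical and serve only to show why covariates with small first-step coefficients face a heavier penalty in the second step.

```python
def adaptive_weights(first_step_beta):
    """Penalty loadings 1/|beta_j| for covariates kept in the first step."""
    return {name: 1.0 / abs(b) for name, b in first_step_beta.items() if b != 0.0}


# Hypothetical penalised estimates from the first cross-validated lasso
first_step = {"x1": 2.0, "x2": 0.25, "x3": 0.0}
weights = adaptive_weights(first_step)

lam = 0.1  # hypothetical second-step penalty parameter
effective_penalty = {name: lam * w for name, w in weights.items()}
```

Here `x3` was already excluded in the first step, and `x2` receives a penalty eight times larger than `x1`, which makes it the more likely candidate for exclusion in the second step.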

The third commonly used method is the plugin lasso. Among the three methods, the plugin lasso is the most parsimonious and also the fastest one in terms of computational time. Instead of minimising a CV function as presented above for the CV and adaptive lasso methods, the plugin function uses an iterative formula to find the smallest value of $\lambda $ that is large enough to dominate the estimation error in the coefficients. The plugin lasso selects the penalty loadings to normalise the scores of the (unpenalised) fit measure for each parameter and then it chooses a value for $\lambda $ that is greater than the largest normalised score with a probability that is close to 1 (Drukker and Liu, 2019[6]). For more detail on the plugin lasso, see Belloni et al. (2012[7]) and Drukker and Liu (2019[8]). The plugin lasso tends to select the most important variables and it is good at not including covariates that do not belong to the true model. However, unlike the adaptive method, the plugin lasso can overlook some covariates with large coefficients and select covariates with small coefficients (Drukker and Liu, 2019[6]). Given its favourable model selection attributes, the plugin method is applied in Chapters 2 and 3.

Yet, in the school-level analysis of Chapter 4, which explores the teacher and school factors that could play a role in mitigating within-school disparities in performance between girls and boys, the adaptive lasso is used. Due to the analysis being conducted at the school level rather than at the student level, sample sizes decrease considerably, leading to the selection of fewer variables by lasso. Therefore, the use of adaptive lasso is preferred, as it results in more covariates being selected as compared to the more parsimonious plugin lasso. It is also important to note that the model selection properties of lasso have limitations. Notably, irrespective of the way in which the tuning parameters are selected, lasso may not always be able to distinguish an irrelevant predictor that is highly correlated with the predictors in the true model from the true predictors.

Dividing the sample into training and validation sub-samples allows for validating the performance of the lasso estimator (or estimators, if different methods to select the tuning parameters are tested).2 In this report, the training and validation samples are generated by randomly splitting the overall TALIS-PISA link sample into two sub-samples, with 85% of the observations allocated to the training sample and 15% kept for the validation sample. While the proportion of observations allocated to each sub-sample could be considered somewhat arbitrary, sensitivity analysis shows that the lasso regression results reported herein are fairly robust to how the TALIS-PISA link sample is split into training and validation sub-samples.3 It is also important to note that sample split is performed after creating a balanced sample that includes only those observations that have full information (i.e. observations with missing information are excluded) for all variables included in the model.4
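The order of operations described above (first restrict to complete cases, then split at random) can be sketched as follows. The function names, the variable names and the random seed are assumptions made for illustration; the report does not specify the seed or the software routine used for the split.

```python
import random


def complete_cases(rows, variables):
    """Keep only observations with no missing value on any model variable."""
    return [r for r in rows if all(r.get(v) is not None for v in variables)]


def split_sample(rows, train_share=0.85, seed=12345):
    """Randomly allocate observations to training and validation samples."""
    rng = random.Random(seed)  # hypothetical seed, for reproducibility only
    shuffled = rows[:]
    rng.shuffle(shuffled)
    cut = round(train_share * len(shuffled))
    return shuffled[:cut], shuffled[cut:]


# Hypothetical data: 111 observations, 11 of them with a missing covariate
rows = [
    {"id": i, "y": float(i), "x": None if i % 10 == 0 else 1.0}
    for i in range(1, 112)
]
balanced = complete_cases(rows, ["y", "x"])
train, valid = split_sample(balanced)
```

Changing `train_share` or `seed` and re-running the model selection is one simple way to carry out the kind of sensitivity check mentioned in the text.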

Applying lasso for model selection means finding a model that fits the data, not finding a model that allows for interpreting estimated coefficients as effects. Thus, when used for model selection, lasso selects variables and estimates coefficients, but it does not provide the standard errors required for performing statistical inference. Indeed, lasso’s covariate-selection ability makes it a non-standard estimator and prevents the estimation of standard errors.

In this report, lasso is applied for model selection based on the overall population of 15-year-old students (Chapters 2 and 3) and schools (Chapter 4) surveyed within the TALIS-PISA link (i.e. the pooled sample across all participating countries and economies). As the model selection is based on the pooled sample, country fixed effects are imposed on lasso to ensure they are always included among the selected covariates. In addition, the controls for student characteristics – such as student gender, migrant background and socio-economic status – are also imposed on lasso to ensure they are always included among the selected covariates. Moreover, sampling weights are not used in the lasso regression analysis. This may be a limitation; hence, caution is warranted while interpreting the results of lasso regression analyses within this report.

Lasso regressions are estimated using the Stata (version 16.1) (StataCorp, 2019[9]) “lasso” module; see Drukker and Liu (2019[6]) for an introduction.

Standard regression analyses are conducted to examine the teacher and school dimensions that explain most of the differences in school average performance (i.e. variance decomposition analysis) and also to explore the country-level relationships between student outcomes, or within-school gender gaps in student achievement (in the case of Chapter 4), and teacher and school dimensions (taken separately). In comparison to lasso, standard regressions provide the confidence intervals of the coefficient estimates that, in turn, allow for drawing inferences about the overall population. Moreover, they lead to more accurate coefficient estimates through the introduction of final and balanced repeated replicate weights and the use of plausible values of student performance. Multiple linear regression is used in those cases where the dependent (or outcome) variable is considered continuous. Binary logistic regression is employed when the dependent (or outcome) variable is a binary categorical variable. Regression analyses are carried out for each country separately. The TALIS-PISA link average refers to the arithmetic mean of country-level estimates.

Control variables are included in the standard regression models, with the exception of the variance decomposition analyses. The control variables are selected based on theoretical reasoning and, preferably, limited to the most objective measures or those that do not change over time. Controls for student characteristics include student’s gender, migrant background and socio-economic status. Controls for the average classmates’ characteristics within the school include: the share of students whose first language is different from the language(s) of instruction, low academic achievers, students with special needs, students with behavioural problems, students from socio-economically disadvantaged homes, academically gifted students, students who are immigrants or with a migrant background and students who are refugees. Controls for classmates’ characteristics are only included in the analyses presented in Chapters 2 and 3. Controls for classmates’ characteristics are excluded from the analyses featured in Chapter 4.5

The controls for the characteristics of students and classmates are introduced into the models in steps. This approach also requires that the models at each step be based on the same sample. Each regression model for each teacher or school dimension is estimated based on the same restricted and balanced sample with full information (i.e. observations with missing information are excluded) for all variables included in the analyses. The sample only varies according to how teacher variables are averaged at the school level, i.e. if the focus is on all teachers or teachers of a given subject.

### Multiple linear regression analysis

Multiple linear regression analysis provides insights into how the value of the continuous outcome variable changes when any one of the explanatory variables varies while all other explanatory variables are held constant. In general, and with everything else held constant, a one-unit increase in the explanatory variable (${x}_{i}$) is associated, on average, with a change in the outcome variable ($Y$) of ${\beta}_{i}$ units, where ${\beta}_{i}$ is the corresponding regression coefficient:

$Y={\beta}_{0}+{\beta}_{1}{x}_{1}+\dots +{\beta}_{i}{x}_{i}+\epsilon $

When interpreting multiple regression coefficients, it is important to keep in mind that each coefficient is influenced by the other explanatory variables in a regression model. The influence depends on the extent to which explanatory variables are correlated. Therefore, each regression coefficient does not capture the total effect of explanatory variables on the outcome variable. Rather, each coefficient represents the additional effect of adding that variable to the model, considering that the effects of all other variables in the model are already accounted for. It is also important to note that, because cross-sectional survey data are used in these analyses, no causal conclusions can be drawn.
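The dependence of a coefficient on the other explanatory variables in the model can be seen in a small worked example. The data and function names below are hypothetical, and the two-predictor least-squares solution is written out by hand only to keep the sketch self-contained.

```python
from statistics import mean


def centre(values):
    m = mean(values)
    return [v - m for v in values]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def simple_slope(y, x):
    """Slope from regressing y on a single explanatory variable."""
    yc, xc = centre(y), centre(x)
    return dot(xc, yc) / dot(xc, xc)


def two_predictor_slopes(y, x1, x2):
    """Least-squares slopes from regressing y on x1 and x2 jointly."""
    yc, a, b = centre(y), centre(x1), centre(x2)
    s11, s22, s12 = dot(a, a), dot(b, b), dot(a, b)
    s1y, s2y = dot(a, yc), dot(b, yc)
    det = s11 * s22 - s12 * s12
    return (s22 * s1y - s12 * s2y) / det, (s11 * s2y - s12 * s1y) / det


# Hypothetical data: x1 and x2 are positively correlated
x1 = [1.0, 2.0, 3.0, 4.0, 5.0]
x2 = [1.0, 1.0, 2.0, 3.0, 3.0]
y = [2.0 * a + 3.0 * b for a, b in zip(x1, x2)]

alone = simple_slope(y, x1)                 # x2 omitted from the model
b1, b2 = two_predictor_slopes(y, x1, x2)    # x2 held constant
```

When `x2` is omitted, the coefficient on `x1` is 3.8 because it also picks up the part of `x2` that moves together with `x1`. Once `x2` is included, the coefficient on `x1` falls to 2.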

Regression coefficients in bold in the data tables presenting the results of regression analysis (included in Annex C) are statistically significantly different from 0 at the 95% confidence level.

### Binary logistic regression analysis

Binary logistic regression analysis enables the estimation of the relationship between one or more explanatory variables and an outcome variable with two categories. The regression coefficient ($\beta $) of a logistic regression is the estimated increase in the log odds of the outcome per unit increase in the value of the predictor variable.

More formally, let $Y$ be the binary outcome variable indicating no/yes with 0/1, and let $p$ be the probability that $Y$ equals 1, so that $p=\Pr(Y=1)$. Let ${x}_{1},\dots ,{x}_{k}$ be a set of explanatory variables. Then, the logistic regression of $Y$ on ${x}_{1},\dots ,{x}_{k}$ estimates parameter values for ${\beta}_{0},{\beta}_{1},\dots ,{\beta}_{k}$ via the maximum likelihood method for the following equation:

$Logit\left(p\right)=\mathrm{log}\left(p/\left(1-p\right)\right)={\beta}_{0}+{\beta}_{1}{x}_{1}+\dots +{\beta}_{k}{x}_{k}$
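To make the log-odds interpretation concrete, the sketch below converts hypothetical coefficient values into probabilities and an odds ratio. The coefficient values and function names are assumptions chosen only for illustration.

```python
import math


def logit(p):
    """Log odds of a probability p."""
    return math.log(p / (1.0 - p))


def inv_logit(z):
    """Probability implied by a value z on the log-odds scale."""
    return 1.0 / (1.0 + math.exp(-z))


b0, b1 = -0.5, 0.4  # hypothetical intercept and slope

p_at_0 = inv_logit(b0)        # predicted probability when x = 0
p_at_1 = inv_logit(b0 + b1)   # predicted probability when x = 1

# A one-unit increase in x adds b1 to the log odds,
# i.e. it multiplies the odds by exp(b1).
odds_ratio = math.exp(b1)
observed_ratio = (p_at_1 / (1.0 - p_at_1)) / (p_at_0 / (1.0 - p_at_0))
```

The change in the predicted probability depends on the starting point, whereas the change in the log odds is the same for every one-unit increase in the predictor.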

### Variance decomposition analysis

Variance decomposition analysis is applied to complement the findings from the lasso regressions as it can reveal the relative importance of each teacher and school dimension in explaining the average differences in student performances across schools (and ultimately, variance in student achievement). The share of between-school variance explained by a teacher or school dimension $j$ is estimated as the ratio of the variance explained by dimension $j$ (${R}_{Dim.j}^{2}$) to the total variance in student outcomes explained at the school level (${R}_{Total}^{2}$), hence:

${\text{Between-school VAR}}_{Dim.j}={R}_{Dim.j}^{2}/{R}_{Total}^{2}$

where the total variance in student outcomes explained at the school level (${R}_{Total}^{2}$) is estimated as the R² of a linear regression model with the outcome variable ($Y$) regressed on school fixed effects. R² represents the proportion of the observed variation in the outcome variable that can be explained by the explanatory variables. Similarly, the share of variance explained by dimension $j$ (${R}_{Dim.j}^{2}$) is estimated as the R² of a linear regression model with the outcome variable ($Y$) regressed on the variables included in dimension $j$.
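The ratio can be illustrated with a minimal sketch. It relies on two simplifications that are assumptions of this example rather than features of the report's analysis: the dimension contains a single school-level variable, so that its R² equals the squared correlation with the outcome, and survey weights and plausible values are ignored. All data and function names are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def r2_school_fixed_effects(y, school):
    """R-squared of y regressed on school dummies: 1 - SS_within / SS_total."""
    grand = mean(y)
    groups = defaultdict(list)
    for yi, s in zip(y, school):
        groups[s].append(yi)
    school_mean = {s: mean(v) for s, v in groups.items()}
    ss_total = sum((yi - grand) ** 2 for yi in y)
    ss_within = sum((yi - school_mean[s]) ** 2 for yi, s in zip(y, school))
    return 1.0 - ss_within / ss_total


def r2_single_predictor(y, x):
    """R-squared of y regressed on one variable: the squared correlation."""
    my, mx = mean(y), mean(x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy * sxy / (sxx * syy)


# Hypothetical student scores in three schools and one school-level variable
y = [400.0, 420.0, 500.0, 520.0, 600.0, 620.0]
school = ["A", "A", "B", "B", "C", "C"]
x = [1.0, 1.0, 2.0, 2.0, 4.0, 4.0]

r2_total = r2_school_fixed_effects(y, school)
r2_dim = r2_single_predictor(y, x)
share = r2_dim / r2_total
```

Because the variable in this toy dimension only varies between schools, it cannot explain more variance than the school fixed effects do, so the share lies between 0 and 1.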

Chapter 3 of this report features a binary outcome variable, students’ educational expectations, as a measure of student interest in school. PISA measures educational expectations by asking students which educational level they expect to complete. Their responses are used to create a dummy variable that equals 1 if the student expects to complete at least a tertiary degree and, otherwise, equals 0. Since student educational expectation is coded as a binary variable, the share of variance cannot be estimated as it can be for the continuous variables. Indeed, binary logistic regressions cannot provide a goodness-of-fit measure that would be equivalent to the R². Unlike linear regressions with normally distributed residuals, it is not possible to find a closed-form expression for the coefficient values that maximise the likelihood function of logistic regressions; thus, an iterative process must be used instead. Yet, the goodness-of-fit of binary logistic models can be evaluated by the pseudo-R².6 Similarly to the R², the pseudo-R² also ranges from 0 to 1, with higher values indicating better model fit. Nevertheless, pseudo-R² cannot be interpreted as one would interpret the R². In Chapter 3, the pseudo-R² of the logistic regression of student educational expectation on school fixed effects is used as a proxy for the percentage of total variance in student expectation of completing at least a tertiary degree, explained at the school level. Then, the share of variance explained by dimension $j$ (${R}_{Dim.j}^{2}$) is approximated with the pseudo-R² of a logistic regression model with the outcome variable ($Y$) regressed on the variables included in dimension $j$.

It has to be noted that the variance decomposition analysis presented within this report has a limitation, as the shares of between-school variance explained by each dimension may be artificially driven by the number of variables included in a given dimension. Indeed, the dimensions that have the lowest explanatory power tend to include few variables, while the number of variables included in the dimensions that explain the largest shares of the differences in school average performance is high in comparison to other dimensions. Thus, caution is warranted when interpreting these results.

In contrast with standard linear regression, which estimates the conditional mean of the outcome variable given a set of explanatory variables, quantile regression provides information about the association between the outcome variable and the explanatory variables at the different points in the conditional distribution of the outcome variable (Koenker, 2017[10]; Koenker, 2005[11]; Koenker and Bassett, 1978[12]). Quantile regression estimates an equation expressing a quantile (or percentile) of the conditional distribution of the outcome variable as a linear function of the explanatory variables. At the $q$th quantile, the quantile regression estimator, $\widehat{{\beta}_{q}}$, minimises over ${\beta}_{q}$ the objective function (Cameron and Trivedi, 2009[13]):

$Q\left({\beta}_{q}\right)=\sum _{i:{y}_{i}\ge {x}_{i}^{\prime}{\beta}_{q}}^{N}q\left|{y}_{i}-{x}_{i}^{\prime}{\beta}_{q}\right|+\sum _{i:{y}_{i}<{x}_{i}^{\prime}{\beta}_{q}}^{N}\left(1-q\right)\left|{y}_{i}-{x}_{i}^{\prime}{\beta}_{q}\right|$

where $0<q<1$ and different choices of $q$ estimate different values of $\beta $. The higher the value of $q$, the more weight is placed on prediction for observations with $y\ge {x}^{\prime}\beta $ than for observations with $y<{x}^{\prime}\beta $. The objective function is optimised using linear programming methods. The estimator that minimises $Q\left({\beta}_{q}\right)$ has well-established asymptotic properties.
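The asymmetric weighting in the objective function can be illustrated with an intercept-only model, for which the minimiser is a sample quantile. The sketch below is a brute-force illustration under that assumption; it does not use the linear programming methods mentioned in the text, and the data and function names are hypothetical.

```python
def check_loss(y, fitted, q):
    """Quantile regression objective for outcomes y and fitted values."""
    total = 0.0
    for yi, fi in zip(y, fitted):
        resid = yi - fi
        # Observations at or above the fit are weighted by q, those below by 1 - q.
        total += q * abs(resid) if resid >= 0 else (1.0 - q) * abs(resid)
    return total


def best_constant(y, q):
    """Intercept-only fit: search the observed values for the minimiser."""
    candidates = sorted(set(y))
    return min(candidates, key=lambda c: check_loss(y, [c] * len(y), q))


# Hypothetical outcome with one large outlier
y = [1.0, 2.0, 3.0, 4.0, 100.0]

median_fit = best_constant(y, 0.5)   # q = 0.5 gives the median
upper_fit = best_constant(y, 0.9)    # a high q moves the fit to the upper tail
sample_mean = sum(y) / len(y)        # the least-squares fit, for comparison
```

The median fit stays at 3 despite the outlier, while the mean is pulled up to 22, which illustrates the robustness to outliers discussed in the next paragraph.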

Quantile regression has several attributes that make its use attractive as compared to standard linear regression (Cameron and Trivedi, 2009[13]). For instance, it provides a richer characterisation of the relationship between the outcome variable and the explanatory variables by allowing the effects of the explanatory variables to vary over different quantiles of the conditional distribution. Notably, it allows for examining the impact of explanatory variables on both the location and scale parameters of the model. Moreover, quantile regression is more robust to outliers and also in terms of the assumptions about the distribution of regression errors. Namely, it does not require assumptions about the parametric distribution of regression errors. Hence, quantile regression is a suitable tool to analyse models characterised by a change in the variance of the error terms (i.e. heteroskedasticity).

However, it is important to note that quantile regression estimates tend to be more precise at the centre of the distribution as compared to upper and lower quantiles (meaning that standard errors tend to be smaller at the centre of the distribution) (Cameron and Trivedi, 2009[13]). Thus, for those relationships where the variation in the effects of the explanatory variables over the different quantiles of the conditional distribution of the outcome variable is limited, quantile regression tends to be less likely to find significant regression coefficients at the tails of a distribution than at the centre.

The TALIS-PISA link and PISA samples were collected following a stratified two-stage probability sampling design. This means that teachers and students (second stage units, or secondary sampling units) were randomly selected from the list of in-scope teachers and students in each of the randomly selected schools (first stage units, or primary sampling units). For the statistics in this report to be meaningful for a country, they need to reflect the whole population from which the sample was drawn and not merely the sample itself. Thus, survey weights must be used in order to obtain design-unbiased estimates of population or model parameters. Except for the lasso regression analysis, survey weights are used in all other analyses presented in this report.

The analyses presented in Chapters 2 and 3, as well as the quantile regression analysis included in Chapter 4, are based on the student-level merged TALIS-PISA dataset (i.e. student data merged with principal data and teacher data aggregated at the school level). The statistics resulting from the standard regression analyses are estimated using the final TALIS-PISA link student weight (estimation weight), as well as the TALIS-PISA link student-level balanced repeated replicate weights (for more detail, see Annex A).

In Chapter 4, the analysis conducted at the school level that explores the teacher and school factors that could play a role in mitigating within-school disparities in performance between girls and boys is based on the school-level merged TALIS-PISA dataset (i.e. student data aggregated at the school level merged with principal data and teacher data aggregated at the school level). The statistics resulting from the standard regression analysis are estimated using the final TALIS-PISA link school weight (estimation weight), as well as the TALIS-PISA link school-level balanced repeated replicate weights (for more detail, see Annex A).

The teacher-level merged TALIS-PISA dataset (i.e. teacher data merged with principal data and student data aggregated at the school level) is not used for this report. Nevertheless, the final TALIS-PISA link teacher weights are used for averaging teachers’ responses at the school level.

The statistics in this report represent estimates based on samples of teachers and principals, rather than values that could be calculated if every teacher and principal in every country had answered every question. Consequently, it is important to measure the degree of uncertainty of the estimates. Hence, each estimate presented in this report, with the exception of lasso regression results, has an associated degree of uncertainty that is expressed through a standard sampling error. When used for model selection, lasso selects variables and estimates coefficients, but it does not provide the standard errors required for performing statistical inference. Yet, standard errors are computed and presented for the estimates of all other analyses. The use of confidence intervals provides a way to make inferences about the population means and proportions in a manner that reflects the uncertainty associated with the sample estimates. From an observed sample statistic and assuming a normal distribution, it can be inferred that the corresponding population result would lie within the confidence interval in 95 out of 100 replications of the measurement on different samples drawn from the same population. The reported standard errors were computed with a balanced repeated replication (BRR) methodology.
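A replication-based standard error and the corresponding 95% confidence interval can be sketched as follows. The sketch assumes Fay's variant of BRR with a factor of 0.5, which is an assumption of this example; the actual number of replicates and the factor that apply to the TALIS-PISA link weights should be taken from the technical documentation. The estimates and function names are hypothetical.

```python
def brr_standard_error(full_estimate, replicate_estimates, fay_k=0.5):
    """Standard error from replicate estimates (Fay's BRR, assumed factor k)."""
    g = len(replicate_estimates)
    sum_sq = sum((r - full_estimate) ** 2 for r in replicate_estimates)
    return (sum_sq / (g * (1.0 - fay_k) ** 2)) ** 0.5


def confidence_interval(estimate, se, z=1.96):
    """Normal-approximation confidence interval (95% for z = 1.96)."""
    return estimate - z * se, estimate + z * se


# Hypothetical values: one estimate with the final weight and four estimates
# obtained by re-running the same analysis with each replicate weight.
theta = 10.0
replicates = [9.0, 11.0, 10.5, 9.5]

se = brr_standard_error(theta, replicates)
low, high = confidence_interval(theta, se)
```

In practice the same statistic is re-estimated once per replicate weight, so the list of replicate estimates is as long as the number of replicate weights in the dataset.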

The analyses within this report that focus on student achievement (Chapters 2 and 4) are based on plausible values. PISA reports student performance through plausible values in order to account for measurement error (OECD, 2009[14]). This error results from the fact that no test can perfectly measure proficiency in broad subjects, such as reading, mathematics and science. Hence, plausible values can be considered as a representation of the range of abilities that a student might reasonably have.7 In turn, the standard errors reported for statistics based on plausible values include not only sampling error but also measurement error.
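The way sampling error and measurement error are combined can be sketched with the usual multiple-imputation rules, in which the analysis is run once per plausible value and the results are then pooled. The number of plausible values, the estimates and the function name below are hypothetical and chosen for illustration.

```python
def combine_plausible_values(estimates, sampling_variances):
    """Pool estimates obtained separately with each plausible value.

    Returns the final estimate and its standard error, which combines the
    average sampling variance with the variance between plausible values.
    """
    m = len(estimates)
    point = sum(estimates) / m
    sampling = sum(sampling_variances) / m
    imputation = sum((e - point) ** 2 for e in estimates) / (m - 1)
    total = sampling + (1.0 + 1.0 / m) * imputation
    return point, total ** 0.5


# Hypothetical regression coefficients, one per plausible value, each with
# the sampling variance obtained from the replicate weights.
estimates = [2.0, 2.2, 1.8, 2.1, 1.9]
variances = [0.04, 0.04, 0.04, 0.04, 0.04]

coef, se = combine_plausible_values(estimates, variances)
```

The pooled standard error (about 0.26) is larger than the sampling-only standard error (0.20) because it also reflects how much the estimate varies across plausible values.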

The TALIS-PISA link average corresponds to the arithmetic mean of the respective country estimates with available data. In Chapter 2, the TALIS-PISA link average covers all participating countries and economies, excluding Viet Nam.8 In Chapter 3, the TALIS-PISA link average covers all participating countries and economies. In Chapter 4, the TALIS-PISA link average excludes Viet Nam in the case of student-level analyses and it excludes Malta9 and Viet Nam in the case of school-level analyses.

In the case of some countries, data may not be available for specific indicators, or specific categories may not apply. Therefore, readers should keep in mind that the term “TALIS-PISA link average” refers to the countries included in the respective averages. Each of these averages may not necessarily be consistent across all columns of a table.
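The following minimal sketch shows how such an average depends on which countries have data. The country labels, estimates and the function name `link_average` are hypothetical.

```python
def link_average(country_estimates, exclude=()):
    """Arithmetic mean of the country estimates that are available."""
    values = [
        value
        for country, value in country_estimates.items()
        if value is not None and country not in exclude
    ]
    return sum(values) / len(values)


# Hypothetical country-level estimates; None marks unavailable data
estimates = {"A": 4.0, "B": 6.0, "C": None, "D": 8.0}

average_all = link_average(estimates)                      # A, B and D
average_without_d = link_average(estimates, exclude=("D",))  # A and B only
```

Two columns of the same table can therefore report averages over different sets of countries, which is why the averages are not always directly comparable.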

Grade repetition can potentially introduce bias into the estimates of school performance, estimated as the student performance averaged at the school level. Indeed, if grade repetition is common and grade repeaters tend to attend different grade levels than their peers who did not repeat a grade, and if those different grade levels belong to different schools, then estimates of school performance can be biased. In that case, school performance would be overestimated in schools attended by non-grade repeaters and underestimated in schools attended by grade repeaters. Among the countries and economies participating in the TALIS-PISA link, the retention rate seems especially high in Ciudad Autónoma de Buenos Aires (Argentina) (henceforth CABA [Argentina]) and Colombia, where more than 75% of students enrolled at ISCED level 2 and fewer than 15% of students enrolled at ISCED level 3 are grade repeaters (Table A B.3). However, these two countries/economies also happen to be the ones with the highest share of sampled schools (greater than 80%) that enrol students at both ISCED levels 2 and 3 (Table A B.4). Hence, students who have repeated a grade are often enrolled in the same school as students who have not. Thus, students sampled in a school can be considered as being representative of the school, even in CABA (Argentina) and Colombia.

## References

[7] Belloni, A. et al. (2012), “Sparse models and methods for optimal instruments with an application to eminent domain”, *Econometrica: Journal of the Econometric Society*, Vol. 80/6, pp. 2369-2429, http://dx.doi.org/10.3982/ecta9626.

[13] Cameron, A. and P. Trivedi (2009), *Microeconometrics Using Stata*, Stata Press, College Station, TX.

[8] Drukker, D. and D. Liu (2019), *A plug-in for Poisson lasso and a comparison of partialing-out Poisson estimators that use different methods for selecting the lasso tuning parameters*.

[6] Drukker, D. and D. Liu (2019), “An introduction to the lasso in Stata”, *The Stata Blog*, https://blog.stata.com/2019/09/09/an-introduction-to-the-lasso-in-stata/.

[1] Hastie, T., R. Tibshirani and J. Friedman (2017), “The Elements of Statistical Learning: Data Mining, Inference, and Prediction”, in *Springer Series in Statistics*, Springer, New York, https://web.stanford.edu/~hastie/ElemStatLearn//printings/ESLII_print12.pdf.

[2] Hastie, T., R. Tibshirani and M. Wainwright (2015), “Statistical learning with sparsity: The lasso and generalizations”, *Monographs on Statistics and Applied Probability*, No. 143, https://web.stanford.edu/~hastie/StatLearnSparsity_files/SLS.pdf.

[10] Koenker, R. (2017), “Quantile regression: 40 years on”, *Annual Review of Economics*, Vol. 9/1, pp. 155-176, http://dx.doi.org/10.1146/annurev-economics-063016-103651.

[11] Koenker, R. (2005), *Quantile Regression*, Cambridge University Press, Cambridge, http://dx.doi.org/10.1017/CBO9780511754098.

[12] Koenker, R. and G. Bassett (1978), “Regression quantiles”, *Econometrica: The Journal of the Econometric Society*, Vol. 46/1, pp. 33-50, http://dx.doi.org/10.2307/1913643.

[14] OECD (2009), *PISA Data Analysis Manual: SAS, Second Edition*, PISA, OECD Publishing, Paris, https://dx.doi.org/10.1787/9789264056251-en.

[9] StataCorp (2019), *Stata Statistical Software: Release 16.1*, StataCorp LLC, College Station, TX.

[3] Tibshirani, R. (1996), “Regression shrinkage and selection via the lasso”, *Journal of the Royal Statistical Society: Series B (Methodological)*, Vol. 58/1, pp. 267-288, http://dx.doi.org/10.1111/j.2517-6161.1996.tb02080.x.

[4] Wang, H. et al. (2019), “Precision lasso: Accounting for correlations and linear dependencies in high-dimensional genomic data”, *Bioinformatics*, Vol. 35/7, pp. 1181-1187, http://dx.doi.org/10.1093/bioinformatics/bty750.

[5] Zhao, P. and B. Yu (2006), “On model selection consistency of lasso”, *Journal of Machine Learning Research*, Vol. 7/90, pp. 2541-2563, https://www.jmlr.org/papers/volume7/zhao06a/zhao06a.pdf.

## Notes

← 1. TALIS-PISA link: Teaching and Learning International Survey (TALIS) and Programme for International Student Assessment (PISA) link covers schools that participated in both TALIS and PISA.

← 2. For linear models, the goodness-of-fit measures that can be used to assess the performance of an estimator include the mean squared error (MSE), the R^{2} and the Bayesian information criterion (BIC), while for non-linear models, such as logit, probit and Poisson models, such measures include the deviance and the deviance ratio.

← 3. Apart from the 85% (training sample) versus 15% (validation sample) split, lasso regressions were estimated using 75% versus 25%, 95% versus 5% and even a no-split scenario with all observations used for the training sample. The sensitivity analysis showed that covariates selected by lasso were robust to the different sample splits: the lists of covariates selected under each split were almost identical.
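The sample splits described in note 3 can be sketched as follows. This is a minimal illustration of a random training/validation split, not the report's Stata procedure; the function name `split_sample`, the seed and the number of observations are assumptions.

```python
import random


def split_sample(n_observations, training_share, seed=0):
    """Randomly split observation indices into training and validation sets."""
    if not 0 < training_share <= 1:
        raise ValueError("training_share must be in (0, 1]")
    indices = list(range(n_observations))
    random.Random(seed).shuffle(indices)  # seeded so the split is reproducible
    n_training = round(n_observations * training_share)
    return indices[:n_training], indices[n_training:]


# The splits mentioned in the note; a share of 1.0 is the no-split scenario.
splits = {
    share: split_sample(n_observations=200, training_share=share)
    for share in (0.85, 0.75, 0.95, 1.0)
}
```

A robustness check of the kind described in the note would then compare the covariates selected by the lasso on each training set.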

← 4. Depending on how teacher variables are aggregated for a given model, there are four different balanced samples used for the analyses within this report – one each for: teacher variables averaged for all teachers, teacher variables averaged only for reading teachers, teacher variables averaged only for mathematics teachers and teacher variables averaged only for science teachers.

← 5. In the case of quantile regressions conducted in Chapter 4, these controls are excluded due to the computational process (i.e. non-convergence of the quantile regression model). Indeed, the estimates for the models including controls for classmates’ characteristics were missing for certain countries. Therefore, the models that include controls for classmates’ characteristics are neither reported nor commented on. In the case of the analysis of within-school gender gaps in student performance, controls for classmates’ characteristics are excluded due to the inclusion of school-level average student characteristics.

← 6. Among the various types of pseudo-R^{2}, this report applies McFadden’s pseudo-R^{2}.
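McFadden’s pseudo-R^{2} is defined as one minus the ratio of the log-likelihood of the fitted model to the log-likelihood of an intercept-only model. The sketch below computes it for a binary outcome; it is an unweighted illustration, the function names and example values are hypothetical, and the report's own estimates come from survey-weighted models that are not reproduced here.

```python
import math


def bernoulli_log_likelihood(outcomes, probabilities):
    """Log-likelihood of 0/1 outcomes given predicted probabilities of a 1."""
    return sum(
        math.log(p) if y == 1 else math.log(1 - p)
        for y, p in zip(outcomes, probabilities)
    )


def mcfadden_pseudo_r2(outcomes, fitted_probabilities):
    """McFadden's pseudo-R2: 1 - lnL(fitted model) / lnL(intercept-only model).

    The intercept-only model predicts the sample share of 1s for everyone,
    so the outcomes must contain both 0s and 1s.
    """
    base_rate = sum(outcomes) / len(outcomes)
    ll_null = bernoulli_log_likelihood(outcomes, [base_rate] * len(outcomes))
    ll_model = bernoulli_log_likelihood(outcomes, fitted_probabilities)
    return 1 - ll_model / ll_null


# Hypothetical outcomes (e.g. expects to complete tertiary education: 1/0).
outcomes = [1, 0, 1, 0]
uninformative = mcfadden_pseudo_r2(outcomes, [0.5, 0.5, 0.5, 0.5])
informative = mcfadden_pseudo_r2(outcomes, [0.9, 0.1, 0.9, 0.1])
```

A model that predicts no better than the sample share yields a value of zero, and the value rises towards one as the fitted probabilities move closer to the observed outcomes.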

← 7. Generating plausible values on an education test consists of drawing random numbers from the posterior distributions. For more detail on plausible values in general and on how to perform analyses with plausible values, see the *PISA Data Analysis Manual* (OECD, 2009[14]).
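As a rough sketch of how an analysis with plausible values is usually combined into a single result, the code below averages the estimates obtained from each plausible value and adds the between-plausible-value (imputation) variance to the average sampling variance. It follows the general approach documented for plausible values; the function name and the numbers are hypothetical, and the sampling variances are taken as given rather than computed from replicate weights.

```python
from statistics import mean, variance


def combine_plausible_values(estimates, sampling_variances):
    """Combine one estimate per plausible value into an estimate and a standard error.

    estimates: the statistic computed separately on each plausible value.
    sampling_variances: the sampling variance of each of those estimates.
    """
    m = len(estimates)
    if m < 2 or m != len(sampling_variances):
        raise ValueError("need at least two estimates and matching variances")
    final_estimate = mean(estimates)
    sampling_variance = mean(sampling_variances)
    imputation_variance = variance(estimates)  # sample variance, divisor m - 1
    total_variance = sampling_variance + (1 + 1 / m) * imputation_variance
    return final_estimate, total_variance ** 0.5


# Hypothetical results from three plausible values.
estimate, standard_error = combine_plausible_values(
    estimates=[500.0, 502.0, 498.0],
    sampling_variances=[4.0, 4.0, 4.0],
)
```

The standard error is larger than the square root of the average sampling variance because it also reflects the uncertainty between plausible values.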

← 8. Since Viet Nam does not have data on PISA test scores, it is not included in the analyses presented in Chapters 2 and 4.

← 9. In Malta, only 17 of the 44 schools are not single-gender schools (i.e. schools in which all surveyed students are of the same gender), and it is only in these 17 schools that within-school differences in performance between girls and boys can be computed. Thus, Malta is not included in the school-level analysis presented in Chapter 4.
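The constraint described in note 9 can be seen in a minimal sketch of the school-level gender gap, defined in this annex as the average score of girls minus the average score of boys in a school. The sketch is unweighted and uses a single score per student; the function name, record layout and values are hypothetical, and the report's analyses additionally rely on plausible values and survey weights.

```python
from collections import defaultdict
from statistics import mean


def school_gender_gaps(records):
    """Return {school_id: mean girls' score - mean boys' score}.

    records: iterable of (school_id, gender, score) with gender "girl" or "boy".
    Schools where only one gender was surveyed are skipped, because the
    within-school gap cannot be computed for them.
    """
    scores = defaultdict(lambda: {"girl": [], "boy": []})
    for school_id, gender, score in records:
        scores[school_id][gender].append(score)
    gaps = {}
    for school_id, by_gender in scores.items():
        if by_gender["girl"] and by_gender["boy"]:
            gaps[school_id] = mean(by_gender["girl"]) - mean(by_gender["boy"])
    return gaps


# Hypothetical data: S1 is mixed, S2 is a single-gender school.
records = [
    ("S1", "girl", 520.0),
    ("S1", "girl", 500.0),
    ("S1", "boy", 490.0),
    ("S2", "girl", 530.0),
    ("S2", "girl", 510.0),
]
gaps = school_gender_gaps(records)
```

A positive gap is in favour of girls and a negative gap in favour of boys; single-gender schools simply drop out of the result, which is what leaves too few schools for the analysis in some systems.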