-
1. Population and Sampling Distributions
- Density function
- Cumulative distribution function
-
Sample Statistics
- Sample mean (divide by n) and unbiased sample variance (divide by n−1)
-
Covariance
- Measure of the linear relationship between pairs of random variables
-
Distribution
-
Normal Distribution
- All mutually independent
- The t-distribution
- The F-distribution
-
The χ²-distribution
- The sum of squared independent standard normally distributed variables
-
2. Bivariate Regression Analysis
-
Box-Cox Transformation
- Motivated by extreme observations (reduces their influence)
- Only works for observations greater than zero
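- A minimal R sketch (assuming a data frame df with a strictly positive response y and regressor x; all names are placeholders):
```r
library(MASS)                          # boxcox() ships with MASS
fit <- lm(y ~ x, data = df)            # plain OLS fit
bc  <- boxcox(fit)                     # profile log-likelihood over lambda
lambda <- bc$x[which.max(bc$y)]        # most likely lambda
df$y_bc <- (df$y^lambda - 1) / lambda  # transformed response (lambda != 0)
```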
-
LOESS Smoother
- A sliding window moves over the value range of X
- In each window a local regression line is estimated
- These local regression lines are "splined" together
- Bivariate regression analysis cannot control for confounding effects
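- A LOESS sketch in R (df, x, y are placeholder names; span controls the window width):
```r
lo   <- loess(y ~ x, data = df, span = 0.75, degree = 1)  # local linear fits in a sliding window
grid <- data.frame(x = seq(min(df$x), max(df$x), length.out = 100))
grid$yhat <- predict(lo, newdata = grid)                  # the "splined" local regressions
plot(df$x, df$y); lines(grid$x, grid$yhat, col = "red")
```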
-
KEY ASSUMPTIONS OF REGRESSION ANALYSIS
- The disturbances are required to be normally i.i.d. (independently and identically distributed)
- Residuals are uncorrelated with the predicted values and the independent variables
- The regression line goes through the point of means (x̄, ȳ)
- Regression coefficients have their own distribution
-
Statistical Tests
-
Two-sided test
-
T-test
- H₀: β₁ = 0 (no baseline level)
- t = β̂₁ / standard error
- H₀: β₁ = C (baseline level with constant C)
- t = (β̂₁ − C) / standard error
-
F-test
- F = [ESS / (K−1)] / [RSS / (n−K)]
- More informative for multiple regression
- The p-value of the model as a whole is calculated from this statistic
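- Both tests are reported by summary() of an lm fit in R; a sketch with placeholder names (df, y, x) and a hypothetical baseline C:
```r
fit <- lm(y ~ x, data = df)
summary(fit)   # per-coefficient t-tests (H0: beta = 0) and the global F-test
# t-test against a baseline constant C:
C  <- 1
se <- coef(summary(fit))["x", "Std. Error"]
t  <- (coef(fit)["x"] - C) / se
2 * pt(abs(t), df = fit$df.residual, lower.tail = FALSE)  # two-sided p-value
```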
-
One-sided test
-
H₀: β₁ ≤ 0 against H₁: β₁ > 0
H₀: β₁ ≥ 0 against H₁: β₁ < 0
- The two-sided p-value is twice the one-sided p-value
-
Confidence Intervals
- If 0 lies in the confidence interval, we cannot reject the null hypothesis.
- The closer to the mean of x, the more accurate the prediction.
- The interval for a prediction at a single point is much wider than the interval for the entire line, since it uses only the information around that point instead of the whole sample
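- In R, predict() gives both interval types (the new x values are hypothetical):
```r
new <- data.frame(x = c(5, 10))
predict(fit, newdata = new, interval = "confidence")  # mean response; narrowest near mean(x)
predict(fit, newdata = new, interval = "prediction")  # single observation; much wider
```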
-
Elasticity
- Transform the bivariate model into the log-log form, and the estimated regression coefficient is interpreted as a relative rate of change
- β₁ > 1 means y changes relatively faster than x
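- A log-log sketch in R (placeholder names; requires positive y and x):
```r
fit_ll <- lm(log(y) ~ log(x), data = df)
coef(fit_ll)["log(x)"]  # elasticity: % change in y per 1% change in x
```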
-
3. Multiple Regression Analysis
-
Additional variables always reduce the stochastic error in the dependent variable
- That is why R² never decreases when you add additional variables
-
Partial regression coefficients
- The set of estimated parameters may change as new variables are added to the model
- Partial Effects
-
Standardized Regression Coefficients
- Measures how many standard deviations the dependent variable changes when an independent variable changes by one standard deviation
- The larger the absolute value, the more influence the independent variable has on the variation of the dependent variable.
- The intercept is always zero since the mean of standardized variables is always zero
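- A sketch with scale(), which z-standardizes each variable (placeholder names):
```r
fit_std <- lm(scale(y) ~ scale(x1) + scale(x2), data = df)
coef(fit_std)  # SD change in y per one-SD change in each regressor; intercept ~ 0
```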
-
Statistic Test
-
Global F-test
- Null hypothesis: all parameters (except the intercept) equal zero
- The global F-test as special case of the partial F-test
-
Partial F-test
- Allows comparison across (nested) models
- anova( )
- To test whether these independent variables add significantly to the model
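- A partial F-test sketch with anova() (placeholder models; x2 and x3 are the added variables):
```r
m0 <- lm(y ~ x1, data = df)            # restricted model
m1 <- lm(y ~ x1 + x2 + x3, data = df)  # full model
anova(m0, m1)                          # F-test on the jointly added variables
```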
-
Interaction effects
- Exogenous variables do not act independently on the dependent variable.
- Products of the independent variables lead to interaction effects.
- Interaction effects increase the risk of multicollinearity because the product shares information with both parent variables
- Polynomial terms (e.g., a quadratic term) produce the same situation
- Use conditional effect plots to see the influence of independent variables
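- An interaction sketch in R with a simple conditional-effect printout (placeholder names):
```r
fit_int <- lm(y ~ x1 * x2, data = df)        # expands to x1 + x2 + x1:x2
b <- coef(fit_int)
for (z in quantile(df$x2, c(.25, .5, .75)))  # conditional effect of x1 at fixed x2
  cat("x2 =", z, "-> slope of x1 =", b["x1"] + b["x1:x2"] * z, "\n")
```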
-
Factors (Categorical Variables) in Regression Analysis
- Indicator variables allow us to model different regression regimes simultaneously and to cope with model heterogeneity
- Each regime has its own intercept, slope or even both.
-
One-hot encoding
- There are as many dummy variables for a categorical variable as there are categorical levels
-
Dummy encoding
- Since one category in one-hot encoding is redundant, this method drops one category (the reference level) to avoid multicollinearity
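- R's default factor handling is exactly this dummy encoding (region is a hypothetical categorical column):
```r
df$region <- factor(df$region)
lm(y ~ region, data = df)           # k-1 dummies; the first level is the reference
model.matrix(~ region, data = df)   # inspect the dummy columns R builds
# one-hot (all k columns, no intercept): model.matrix(~ region - 1, data = df)
```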
-
4. Instrumental Variable Regression
-
Endogenous Variable
- If the independence assumption between x and the disturbances ε breaks down [Cov(x, ε) ≠ 0], the variable x becomes an endogenous regressor
-
Instrumental Variable
- Not related to the disturbances ε
- Related to the endogenous variable
- The instrumental estimator becomes
β_IV = Cov(y, z) / Cov(x, z)
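- A sketch of the bivariate IV estimator by hand and via AER::ivreg (placeholder names; z is the instrument):
```r
cov(df$y, df$z) / cov(df$x, df$z)       # beta_IV = Cov(y,z) / Cov(x,z)
library(AER)
iv_fit <- ivreg(y ~ x | z, data = df)   # regressors | instruments
coef(iv_fit)
```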
-
Exogenous Variable
- Regular regressors that are uncorrelated with the disturbances
-
Statistic Test
-
Partial F-test
- Tests whether additional instrumental variables improve the model substantially
-
Modified Hausman test
- Tests the exogeneity of the endogenous variable
- H₀: the endogenous variable is uncorrelated with the disturbances ε
-
Sargan test
- Tests for instrument (z_IV) validity
- H₀: R² = 0, i.e. all instrumental variables are exogenous.
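- With AER, summary(..., diagnostics = TRUE) reports the weak-instrument F-test, the Wu-Hausman test and the Sargan test together; the Sargan test requires more instruments than endogenous regressors (z1, z2 are placeholders):
```r
iv_fit2 <- ivreg(y ~ x | z1 + z2, data = df)  # over-identified: two instruments, one endogenous x
summary(iv_fit2, diagnostics = TRUE)
```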
-
5. Regression Diagnostics
-
Assumptions in Linear Regression
- The regressors X are uncorrelated with the model's disturbances
- The expectation of ε equals 0 in the population
-
Homoscedasticity: Var(ε) = constant
-
Heteroscedasticity
- The diagonal of the covariance matrix of the disturbances is no longer constant
-
Autocorrelation
- The off-diagonal elements are no longer zero
- This violates the assumption of independence among the disturbances (no spatial autocorrelation)
-
The disturbances are normally distributed
-
Instrumental Regression
- Disturbances are correlated with endogenous variables
-
Identifying and Overcoming Assumption Violations
-
Identifying
-
Perform bivariate scatterplots
- The pairwise relationships in scatterplots are more informative
- Deal with non-linearity
- Box-Cox transformation
- Quadratic terms
- Interaction terms
- X-X combinations
- potential multicollinearity
-
Box-plot
- Used to investigate factor (categorical) variables
-
Overcoming
-
Feasible Generalized Least Squares estimator
- Accommodates heteroscedasticity
-
Remove outliers
- Non-normality: t- and F-tests become unreliable for small sample sizes
- Observations in the heavy tails (outliers) have a substantially stronger impact on the estimates
- Transformations to pull outlying observations in
-
Influential combinations of Y-X
-
Standardized residuals & studentized residuals
- The confidence interval around a regression line gets wider as we move away from the center
- For standardized residuals, the exact distribution is unknown (no longer a normal distribution)
- Studentized residuals follow a t-distribution
- Outlying observations are therefore best investigated using studentized residuals
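- Both residual types come with base R (fit is an lm object):
```r
rstandard(fit)                 # standardized residuals; exact distribution unknown
rstudent(fit)                  # studentized (leave-one-out) residuals; t-distributed
which(abs(rstudent(fit)) > 2)  # flag candidate outliers
```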
-
Leverage
- Measures the distance of the i-th observation from the center of the X-space
- Distant observations can have a larger impact on the estimates
-
DFBETAS
- Underlying idea: re-estimate the regression coefficients with the i-th case deleted
- The larger the score, the higher the impact
-
Cookβs Distance
- It measures the influence of the i-th case on the model as a whole rather than on individual partial regression coefficients.
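- Leverage, DFBETAS and Cook's distance for an lm fit in base R:
```r
hatvalues(fit)          # leverage: distance of case i from the center of the X-space
dfbetas(fit)            # standardized coefficient change when case i is deleted
cooks.distance(fit)     # influence of case i on the fit as a whole
influence.measures(fit) # all of the above, with influential cases flagged
```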
-
Tukey Test
- The Tukey test evaluates whether adding a squared term of an independent variable improves the model
- A t-test evaluates whether the quadratic term is significant
- The global Tukey test adds the squared predicted values to the model and performs a t-test on whether this term is significant
-
Multicollinearity
- Strong bivariate correlation among the independent variables is only a first indicator for multicollinearity
- The correlation among the estimated regression parameters is a better indicator
-
VIF (variance inflation factor)
- A high inflation factor indicates that an independent variable is highly redundant given the other regressors
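- VIFs via the car package; a common rule of thumb flags values above 10 (placeholder model):
```r
library(car)
vif(lm(y ~ x1 + x2 + x3, data = df))
```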
-
Autocorrelation
-
First-order component
- The expected value of the dependent variable can be modeled with the use of exogenous information
-
Second-order component
- The covariance in the random error terms leads to a stochastic random process.
-
6. Spatial Autocorrelation
-
Maximum Likelihood
- Given the observed data {x₁, …, xₙ}, which is the most likely population parameter θ that has generated the observations?
-
Requires
- The underlying distribution of each random variable is explicitly known, and the sample data match this distribution
- The observations are statistically independent (auto-correlated observations can be transformed into independent ones)
-
Transforming into logarithmic form
- The logarithm is monotonically increasing
- Turns the product into a summation
- Makes the maximum easy to find (derivatives of summations avoid the product rule)
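- A minimal ML sketch in R: maximize the normal log-likelihood (a sum of log densities) numerically with optim():
```r
negloglik <- function(par, x)                # par = (mu, log sd)
  -sum(dnorm(x, mean = par[1], sd = exp(par[2]), log = TRUE))
x   <- rnorm(100, mean = 5, sd = 2)          # simulated sample
opt <- optim(c(0, 0), negloglik, x = x)
c(mu = opt$par[1], sigma = exp(opt$par[2]))  # should be close to 5 and 2
```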
-
Properties
- ML estimators are asymptotically unbiased
- The ML estimator is asymptotically consistent
- ML estimators are asymptotically normally distributed
- Comparable test statistics to the t-test, F-test and partial F-test can be calculated
-
Test
-
The Wald test
- The full model with all parameters is estimated
- A substitute for the single t-test (deviation over the standard error)
-
Likelihood Ratio Test (LR)
- χ² = −2 · (ln L[k−h] − ln L[k]) with h degrees of freedom
- Analogous to the partial F-test
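- An LR test by hand in R (m0 nested in m1; h = 1 added parameter here):
```r
m0 <- lm(y ~ x1, data = df)
m1 <- lm(y ~ x1 + x2, data = df)
chi2 <- as.numeric(-2 * (logLik(m0) - logLik(m1)))
pchisq(chi2, df = 1, lower.tail = FALSE)  # p-value with h degrees of freedom
```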
-
Heteroscedasticity
- no constant variance
-
For the simple model with just one weights variable z₁:
σᵢ² = exp(γ₀ · 1 + γ₁ · ln(zᵢ₁))
- γ₁ > 0: σᵢ² increases with increasing zᵢ₁
- γ₁ ≈ 0: σᵢ² is not affected by zᵢ₁ → homoscedasticity
- γ₁ < 0: σᵢ² decreases with increasing zᵢ₁
-
A likelihood ratio test can be performed
- −2 · (ln L(OLS model) − ln L(weighted model)) ~ χ² with P−1 degrees of freedom
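- A two-step FGLS sketch for this variance model (placeholder names; the auxiliary regression estimates the variance function from the log squared OLS residuals):
```r
ols  <- lm(y ~ x, data = df)
aux  <- lm(log(resid(ols)^2) ~ log(z1), data = df)  # gamma0 + gamma1 * ln(z1)
w    <- 1 / exp(fitted(aux))                        # weights = 1 / sigma_i^2
fgls <- lm(y ~ x, data = df, weights = w)           # weighted least squares refit
```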
-
Residual Spatial Autocorrelation Test
-
Spatial Link Matrix
- The spatial connectivity matrix operationalizes the underlying structure of the potential spatial relationships among the observations
-
Potential distance relationships
- Distance matrix
-
Potential neighborhood relationships
- Binary spatial connectivity matrix
- Row-sum standardized link matrix W
-
Reasons for autocorrelation in regression residuals
-
Misspecification Rationale
- Caused by an unknown variable that has a spatial pattern
-
Spatial Process Rationale
- The spatial objects exhibit spatial exchange relationships
- Interaction flows
- Competition effects
- Agglomerative advantages
-
Spatial Aggregation Rationale
- If areal objects are split into parts and the split parts are merged with adjacent areal objects, then the aggregated objects share parts of the information inherited from the split objects.
-
Moran's I
-
Standard deviate
- Standard deviate = (Moran's I − E[I]) / sqrt(Var[I])
- The expectation and variance of Moran's I depend on the regression matrix X
- The observed value of Moran's I depends on the exogenous variables X
-
Moran's plot
- The observed residuals against the average values of their neighboring observations
- Residuals are best mapped by a bipolar map theme
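- With the spdep package, assuming a neighbours list nb has already been built (e.g. from poly2nb()):
```r
library(spdep)
lw <- nb2listw(nb, style = "W")  # row-sum standardized link matrix
lm.morantest(fit, lw)            # Moran's I of the lm residuals, with E[I] and Var[I]
moran.plot(resid(fit), lw)       # residuals vs. their neighbourhood averages
```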
-
Feasible generalized least squares
-
Gaussian spatial processes
- Simultaneous autoregressive spatial process
- Conditional autoregressive spatial process
- Moving average spatial process
- Transforms dependent disturbances into independent ones that follow a Gaussian distribution
-
7. Logistic Regression Analysis
-
The dependent variable is categorical
-
Observations can be either individual or grouped records
-
Individual Observations
- Two categories (Bernoulli distribution)
- Multiple categories (categorical/multinomial distribution)
-
Grouped Observations
- Two Categories (Binomial distribution)
- Multiple Categories (Multinomial distribution)
-
Basic logistic regression
- Focuses on individual observations with only two categories for the dependent variable.
- The variance of the population disturbance is given by: Var(ε) = μ · (1 − μ)
- The spread of the disturbances depends on the predicted value (probability); therefore their variances are heteroscedastic
-
Solutions to the heteroscedasticity
- Maximum likelihood estimation
- Iteratively re-weighted least squares
-
Why logistic regression
- The predicted values can only fall into the feasible range of probabilities [0, 1]
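- A basic logistic fit in R; glm() uses iteratively re-weighted least squares internally (y coded 0/1, placeholder regressors):
```r
logit_fit <- glm(y ~ x1 + x2, family = binomial, data = df)
predict(logit_fit, type = "response")  # fitted probabilities, always in [0, 1]
```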
-
8. The Generalized Linear Model
-
All based on the exponential distribution family
- Normal
- Exponential
- Binomial
- Gamma
- Poisson
-
Potential reasons for observing excess dispersion
- This is a classic case of model misspecification.
- Observations are correlated with other counts
- An incorrect assumption about the distributional model
- The choice of the link function is incorrect.
- There are outliers in the data.
- Estimated regression coefficients remain unbiased, but their standard errors become incorrect
-
The Offset Term
(baseline expectation)
- How the exogenous variables influence the variation of the individual expectations around their baseline expectations
- The regression coefficient of the offset is fixed (usually set to 1), and the offset enters in logarithmic form.
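- A Poisson sketch with an offset (count and exposure are hypothetical columns):
```r
pois <- glm(count ~ x1 + x2 + offset(log(exposure)),
            family = poisson, data = df)  # offset coefficient fixed at 1
summary(pois)
```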