
Published by Ch. Mahmood Anwar On Date 2016-11-15 11:43:27

Author : Ch. Mahmood Anwar (Scholars Index)

This illustrative study revives the philosophical assumption of Methodology for science, alongside other assumptions such as Epistemology and Ontology, in the context of the social and behavioral sciences. Based on a literature review, the study divides the overlapping and often perplexing Gaussian Linear Regression Model (GLRM) assumptions into two comprehensive groups. It then models straightforward diagnostics for violations of the GLRM assumptions using data collected from 150 postgraduate university students. Finally, the study provides remedial directions for addressing the problems created by violations of the GLRM assumptions.

Introduction

Unlike the natural sciences, in which scientific theories are rooted in hard facts about the world, the social sciences raise a few definitional issues, because social theories are built on personal opinions or tentative thoughts (Kangai, 2012) and on people’s interpretation of the world (the phenomenological view) (Moustakas, 1994), while some are socially constructed realities (Searle, 1995). Generally, the social sciences study human behavior, social groups and institutions, and are subdivided into many areas such as Anthropology, Commerce, Behavioural Science, Economics, Political Science, Education, Management, Psychology and Public Administration. Owing to the complexity and unpredictability of the human element, social studies can only be said to be scientific if all observations leading to theories are carried out carefully and impartially, so as to attain an objective and secure footing for science (see Chalmers, 1999). At this point it can easily be observed that, in the context of social knowledge, the Methodology assumption (Cohen, Manion & Morrison, 2001) is more important than other philosophical assumptions for science such as Epistemology and Ontology (Bryman, 2001), whether one works within the positivist or the constructivist paradigm. Although social scientists nowadays contest the Epistemological assumption (Ghoshal, 2005), the Epistemological and Ontological assumptions still hold the arena of the social sciences firmly. This brief introduction to the philosophy of science thus brings out the fact that, beyond the vitality of the Epistemological and Ontological assumptions, the Methodological assumption plays an important role in labelling social knowledge as “science”. This point is consistent with the explanation by Chalmers (1999) of what constitutes science.

Having established my position on the true value of Methodology, I now move inside the assumption. In the positivist paradigm, methodology consists of research design, data collection, measurement and data analysis methods (Tharenou, Donohue & Cooper, 2007). Among these ingredients of methodology, data are the soul that allows the body of social knowledge to be labelled “social sciences”. If the health of the data is good, then the social theories built on the data will reflect good science, and vice versa. Therefore, social scientists must investigate and report the health and unbiasedness of their data to satisfy the data health assumptions and diagnostics (Gujarati, 2004).

I was motivated to write this article while studying the “Instructions for authors” issued by top-ranked management, behavioral and social science journals. I carefully examined the author resources of high-quality journals such as the Academy of Management Journal, Journal of Management, Journal of Organizational Behavior and Journal of Economic Literature, and found no specific, detailed instruction endorsing or encouraging authors to report the health of their data in terms of statistical assumptions and diagnostics. Only a few journals (e.g. the American Psychological Association journals and the International Journal of Management, Economics and Social Sciences) recognize the significance of data health and encourage authors to report it in submitted manuscripts. I personally feel that a journal’s instructions for authors are the most influential guide for authors submitting manuscripts for publication. Therefore, this study aims to highlight the importance of investigating the health of the data (the soul) that lead to inferences and estimations in the social sciences (the body). This article is significant because it allows researchers and students of the social and behavioral sciences to understand the Gaussian Linear Regression Model (GLRM) assumptions and their diagnostics under one umbrella.

Gaussian Linear Regression Model (GLRM) Assumptions

Cuthbertson, Hall and Taylor (1995) state a valuable truth: applying statistics to data is not a mechanical procedure but requires deep knowledge, intuition and skill. We know that the ordinary least squares (OLS) method is sufficient to estimate the parameters of the population regression function (PRF), but in the social sciences we are more interested in drawing inferences than in mere mathematical estimation, and hence need accurate estimators. In addition, the PRF also depends on the disturbances, which make the model fragile if the underlying assumptions are violated. Accuracy can only be achieved by taking certain assumptions into account. These assumptions are eleven in number and are known as the Gaussian Linear Regression Model (GLRM) assumptions (Gujarati, 2004). To keep matters simple, I review only six assumptions, because I consider them the most important in social science research.

The diagnostic tests relate to the Gaussian Linear Regression Model (GLRM) assumptions, which I will touch on in the methodology later. A wide class of diagnostic tests has been reported in the literature, including examination of residuals, the Durbin–Watson d-test, the RESET test (Ramsey, 1969), the Lagrange Multiplier test (Engle, 1982), discrimination and discerning approaches (Harvey, 1990), the J test (Davidson and MacKinnon, 1981), the JA test, the Cox test, the Mizon–Richard test, the P test (Baltagi, 1998), checks for outliers, leverage and overly influential cases, recursive least squares, and Chow’s prediction failure test.

The topic under consideration is so broad that many specialized books are required to cover it. However, following Peter Kennedy’s “keep it stochastically simple” principle, I discuss only those diagnostic statistics that students and researchers can calculate or examine easily with conventional statistical software such as SPSS.

The question now arises: among these eleven assumptions, how much attention should one pay to some while neglecting others? For instance, Gujarati (2004) weighted all assumptions equally, but Tabachnick and Fidell (2001) discussed linearity, normality, multicollinearity, homoscedasticity and outliers. Many textbooks amalgamate these assumptions, which often confuses readers and students. I feel that researchers should understand these assumptions literally, not allegorically. In fact, there are two types of assumptions: one concerns model specification and the disturbances, while the other concerns the data. The linearity, homoscedasticity, independence, model specification and Gaussianity assumptions belong to the first type, whereas singularity (multicollinearity) and the model bias part of the model specification assumption belong to the second type (see Wetherill, 1986). I now explain each assumption concisely, using a to-the-point approach.

-Linearity

This assumption states that the regression model should be linear in its parameters. Many people take linearity to mean that the conditional expectation function (CEF) of the regressand is a linear function of the regressors, i.e. a straight line. In fact, the conditional expectation function should be a linear function of the model parameters. This simply means that each parameter must appear with a power of one and must not be multiplied or divided by another parameter. It is this parametric linearity that is essential: the regression function may be linear or non-linear in the regressand and regressors, but it must be linear in the parameters to satisfy this assumption (Gujarati, 2014).
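To make the distinction concrete, the following short illustration uses assumed notation (regressand Y, regressor X, parameters β and disturbance u) that is not taken from the study itself. The first model satisfies the linearity assumption even though its graph is a curve; the other two violate it.

```latex
\begin{align*}
Y_i &= \beta_1 + \beta_2 X_i^2 + u_i
  && \text{linear in the parameters, non-linear in the regressor} \\
Y_i &= \beta_1 + \beta_2^2 X_i + u_i
  && \text{non-linear in the parameters: } \beta_2 \text{ is squared} \\
Y_i &= \beta_1 + \frac{\beta_2}{\beta_3} X_i + u_i
  && \text{non-linear in the parameters: one parameter divides another}
\end{align*}
```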

-Homoscedasticity

According to this assumption, the conditional variance of the stochastic disturbances should be the same at every value of the regressors (i.e. equal variance). Linguistically, homoscedasticity means equal spread (homogeneity of variance); the word is derived from the Greek homo (same) and skedasis (dispersion). The opposite situation, in which the variance of the disturbances changes with the values of the regressors, is known as heteroscedasticity. Homoscedasticity simply means that the variation around the regression line should be identical across the values of the regressors. Heteroscedasticity is more likely in cross-sectional data (Gujarati, 2014).
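As a minimal sketch of how unequal spread might be spotted numerically, the example below fits a simple regression and compares the mean squared residual in the upper half of the regressor values with that in the lower half. The data, the function names `ols_simple` and `spread_ratio`, and the half-split itself are illustrative assumptions, not taken from the study. It is a rough check loosely in the spirit of the Goldfeld–Quandt idea, not the formal test.

```python
from statistics import mean


def ols_simple(x, y):
    """Return (intercept, slope) of the least-squares line y = b1 + b2 * x."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b2 = sxy / sxx
    return y_bar - b2 * x_bar, b2


def spread_ratio(x, y):
    """Mean squared residual for the upper half of x divided by the lower half."""
    b1, b2 = ols_simple(x, y)
    pairs = sorted(zip(x, y))  # order the cases by the regressor
    resid = [yi - (b1 + b2 * xi) for xi, yi in pairs]
    half = len(resid) // 2
    low = mean([e ** 2 for e in resid[:half]])
    high = mean([e ** 2 for e in resid[-half:]])
    return high / low


# Hypothetical data: the same line with two different error patterns.
x = list(range(1, 21))
sign = [1 if xi % 2 else -1 for xi in x]
y_even = [2 + 3 * xi + s for xi, s in zip(x, sign)]            # constant spread
y_fan = [2 + 3 * xi + 0.5 * xi * s for xi, s in zip(x, sign)]  # spread grows with x

print("constant spread:", round(spread_ratio(x, y_even), 2))
print("fan-shaped spread:", round(spread_ratio(x, y_fan), 2))
```

A ratio close to one is what equal spread would produce; a ratio far from one suggests that the variation around the line changes with the regressor and that a residual plot deserves a closer look.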

-Independence

This assumption states that the disturbances should be independent of each other. In simple words, the correlation between the disturbances of any two observations must be zero (no autocorrelation). For cross-sectional data, the chance of dependence among the disturbances is lower with random sampling and higher with convenience sampling. Non-randomly sampled data may sometimes show spatial dependence, but autocorrelation is a more serious problem in time series data, especially when the interval between data collection points is short.

Nowadays, many researchers use the terms autocorrelation and serial correlation synonymously. Strictly, however, autocorrelation is the lagged correlation of a data series with itself, whereas the lagged correlation between two different data series is known as serial correlation (Tintner, 1965).
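The Durbin–Watson d statistic listed among the diagnostic tests earlier can be computed by hand from a series of residuals: it is the sum of squared differences between successive residuals divided by the sum of squared residuals. The sketch below assumes the residuals are already ordered in time; the two short series and the function name `durbin_watson` are hypothetical. Values of d near 2 are consistent with no first-order autocorrelation, values towards 0 point to positive autocorrelation, and values towards 4 point to negative autocorrelation.

```python
def durbin_watson(residuals):
    """Durbin-Watson d: squared successive differences over squared residuals."""
    num = sum(
        (residuals[t] - residuals[t - 1]) ** 2 for t in range(1, len(residuals))
    )
    den = sum(e ** 2 for e in residuals)
    return num / den


# Hypothetical residual series, ordered in time.
runs = [1.0, 1.0, -1.0, -1.0]         # neighbours tend to share a sign
alternating = [1.0, -1.0, 1.0, -1.0]  # neighbours tend to flip sign

d_runs = durbin_watson(runs)
d_alt = durbin_watson(alternating)
print(d_runs, d_alt)
```

The first series gives a d below 2 and the second a d above 2, which matches the direction of dependence built into each series.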

-Singularity

According to this assumption, there should be no perfect linear relationship among the regressors. The concept was introduced by Ragnar Frisch in 1934 and originally meant a perfect linear relation among some or all regressors in the regression model. Because nothing in real life is perfect, researchers nowadays use the broader term multicollinearity, which also covers less than perfect relationships. In the case of perfect multicollinearity (singularity), the model estimators are indeterminate and their standard errors are infinite. With less than perfect multicollinearity, the estimators can be determined but have large standard errors, which makes them imprecise. Under near multicollinearity, or with a small number of observations, the OLS estimators are still BLUE (best linear unbiased estimators) but have large variances and covariances, leading to imprecise estimates and wrong statistical inferences. Goldberger introduced the term micronumerosity for the effects of a small sample size on estimation. Montgomery and Peck (1982) identified many sources of multicollinearity, such as data collection errors, model specification, an overdetermined model, model constraints, and regressors sharing a common trend in time series data.
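One way to put a number on the degree of multicollinearity is the variance inflation factor (VIF). The VIF is not named in the text above; it is brought in here as a commonly used indicator. In the special case of exactly two regressors it reduces to 1 / (1 - r²), where r is their correlation. The sketch below covers only that two-regressor case, and the data and function names are hypothetical. A frequently quoted rule of thumb treats a VIF above 10 as a warning sign, although such cut-offs are conventions rather than tests.

```python
from math import sqrt


def pearson_r(a, b):
    """Pearson correlation between two equally long sequences."""
    a_bar = sum(a) / len(a)
    b_bar = sum(b) / len(b)
    sab = sum((ai - a_bar) * (bi - b_bar) for ai, bi in zip(a, b))
    saa = sum((ai - a_bar) ** 2 for ai in a)
    sbb = sum((bi - b_bar) ** 2 for bi in b)
    return sab / sqrt(saa * sbb)


def vif_two_regressors(x1, x2, tol=1e-12):
    """VIF when the model has exactly two regressors: 1 / (1 - r^2)."""
    r2 = pearson_r(x1, x2) ** 2
    if 1.0 - r2 <= tol:
        return float("inf")  # perfect collinearity (singularity)
    return 1.0 / (1.0 - r2)


# Hypothetical regressors.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0]
x2_related = [2.0, 1.0, 4.0, 3.0, 5.0]   # correlated with x1, not perfectly
x2_exact = [2.0, 4.0, 6.0, 8.0, 10.0]    # exactly 2 * x1

vif_related = vif_two_regressors(x1, x2_related)
vif_exact = vif_two_regressors(x1, x2_exact)
print(vif_related, vif_exact)
```

The second pair is perfectly collinear, so the function returns infinity, which mirrors the statement that the standard errors are infinite under singularity.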

-Gaussianity

The GLRM also assumes that the disturbances follow a Gaussian (i.e. normal) distribution. The theoretical justification of Gaussianity is rooted in the famous central limit theorem (CLT) (Fischer, 2011). According to the theorem, the distribution of the sum of a large number of independent and identically distributed (i.i.d.) random variables tends to the Gaussian distribution as the number of variables increases indefinitely. This matters because the disturbance represents the combined influence on the dependent variable of a large number of independent variables that are not explicitly included in the regression model. In addition, because a linear function of Gaussian random variables is itself Gaussian, the probability distributions of the model estimators can easily be derived.

The assumption of Gaussianity is not important if the objective is only estimation, because the OLS estimators are BLUE even if the disturbances are non-Gaussian. Mostly, however, researchers in the social sciences aim to test hypotheses and make inferences with small or medium-sized samples, and in that case the assumption of Gaussianity becomes critical. Tharenou, Donohue & Cooper (2007) explained that multivariate Gaussianity is more difficult to test; researchers should therefore ensure univariate Gaussianity, which reduces the chances of violating multivariate Gaussianity.
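For a univariate check, the two shape indexes most often reported are skewness and excess kurtosis, both of which are zero for a Gaussian distribution. The sketch below computes the simple moment-based versions; statistical packages usually apply small-sample corrections, so their figures can differ slightly. The function name and the two tiny data sets are hypothetical, and such indexes complement a histogram or normal probability plot rather than replace it.

```python
def shape_indexes(values):
    """Return (skewness, excess kurtosis) from central moments."""
    n = len(values)
    m = sum(values) / n
    m2 = sum((v - m) ** 2 for v in values) / n
    m3 = sum((v - m) ** 3 for v in values) / n
    m4 = sum((v - m) ** 4 for v in values) / n
    skewness = m3 / m2 ** 1.5
    excess_kurtosis = m4 / m2 ** 2 - 3.0
    return skewness, excess_kurtosis


# Hypothetical residuals.
symmetric = [-2.0, -1.0, 0.0, 1.0, 2.0]
right_skewed = [0.0, 0.0, 0.0, 0.0, 10.0]

print(shape_indexes(symmetric))
print(shape_indexes(right_skewed))
```

A positive skewness indicates a longer right tail; a negative excess kurtosis indicates tails lighter than those of the Gaussian distribution.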

-Model Specification and Bias

The assumption here is that the regression model selected for analysis is correctly specified, and diagnostic tests are the sub-procedures used to check it; violation of the assumption leads to model specification errors (under- or over-fitting) or bias (Gujarati, 2004). The presence of these errors in the regression model can be examined with the help of regression fishing.

We know that the conditional expectation function (CEF) is defined in terms of unknown population parameters and must therefore be estimated from sample statistics by the stochastic sample regression function (SRF). The SRF shows that the differences between the actual and estimated values of the dependent variable are important; these differences are termed residuals and can be positive or negative. The question of interest now is how these residuals influence the regression model. The next section answers it.
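In symbols, and assuming the two-variable case with notation that is not taken from the study itself, the relationship between the population function, its sample counterpart and the residuals can be written as follows.

```latex
\begin{align*}
Y_i &= \beta_1 + \beta_2 X_i + u_i
  && \text{population regression function with disturbance } u_i \\
\hat{Y}_i &= \hat{\beta}_1 + \hat{\beta}_2 X_i
  && \text{sample regression function (SRF)} \\
e_i &= Y_i - \hat{Y}_i
  && \text{residual: actual minus estimated value}
\end{align*}
```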

-Outliers and Overly Influential Cases

Outliers and overly influential cases are simply data points in the regression model. To understand them precisely, recall a few basic concepts of regression estimation. One could try to fit the SRF by making the sum of the residuals as small as possible, but under that criterion all residuals receive equal weight no matter how far the observations lie from the fitted line, and positive and negative residuals can cancel out. The ordinary least squares (OLS) method, attributed to Gauss, resolves this problem by minimizing the sum of squared residuals, so that large residuals receive more weight than small ones (Gujarati, 2004). For the same reason, a few extreme cases can pull the fitted line towards themselves. Hence, outliers can be understood as cases from a different population, or as cases having a greater effect than the majority of the other sample cases. Both outliers and overly influential cases distort the regression line and reduce the generalizability of the regression model.

There are two types of outliers, simple and multivariate. Simple outliers are cases with an extreme value on one variable, whereas multivariate outliers are cases with extreme values on several variables (Garson, 2012).
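Leverage, mentioned earlier among the diagnostics, measures how far a case lies from the centre of the regressor values and therefore how strongly it can pull the fitted line. For a regression with one regressor and an intercept, the leverage of case i is 1/n + (x_i - mean)² / Sxx. The sketch below uses hypothetical data and flags cases above 2p/n, with p = 2 estimated parameters; that cut-off is a common rule of thumb assumed here for illustration, not a threshold taken from the study.

```python
def leverages(x):
    """Hat values for a simple regression with an intercept."""
    n = len(x)
    x_bar = sum(x) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    return [1.0 / n + (xi - x_bar) ** 2 / sxx for xi in x]


# Hypothetical regressor values; the last case sits far from the rest.
x = [1.0, 2.0, 3.0, 4.0, 20.0]
h = leverages(x)

p = 2                   # intercept and slope
cutoff = 2 * p / len(x)  # rule-of-thumb threshold (an assumption)
flagged = [i for i, hi in enumerate(h) if hi > cutoff]
print(h, flagged)
```

The hat values always sum to the number of estimated parameters, which gives a quick sanity check on the calculation.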

Discussion

A good study or teaching aid should make several contributions. First, this illustrative study established the true value and importance of the Methodology assumption (among other philosophical assumptions for science) in labelling social knowledge as “science”. Second, the study established that the most important issue in the social sciences is the diagnosis of data health, which researchers should carry out painstakingly. Meticulous analysis of the data determines whether the data will support true inferences in hypothesis testing. Third, the study clarified the confusion of researchers and students by introducing two groups of GLRM assumptions. Fourth, the study provided simple visual and numerical diagnostic techniques to detect possible data health problems. Fifth, the study recommended that researchers build their skills to the level at which they can analyze violations of the assumptions graphically, because distributional tests and statistical shape indexes are not a preferred substitute for graphical methods in the analysis of disturbances. Summary-based tests and shape indexes, together with increases in sample size, create problems in detecting distributional irregularities in the disturbances. Therefore, according to many researchers, graphical methods should be the first choice for analyzing violations of the GLRM assumptions (Wilkinson & TFSI, 1999).

The question now arises: what should be done if the data are found to suffer from violations of the GLRM assumptions? Although this question is outside the scope of this study, I provide some helpful directions for rectifying such violations.

In the view of the “do nothing” school of thought of Blanchard (1967), multicollinearity is essentially due to micronumerosity, and in the social sciences researchers have no control over the data available for empirical analysis. Therefore, researchers should follow a do-nothing approach. Although one or more OLS estimators cannot be estimated with great precision, an estimable function (a linear combination of them) can still be estimated efficiently (Conlisk, 1971). By contrast, statisticians suggest several methods for dealing with multicollinearity: dropping variables (at the risk of specification bias), transforming variables, collecting new data, reducing collinearity in polynomial regressions, using orthogonal polynomials (Draper & Smith, 1981), and principal components and ridge regression (Chatterjee & Price, 1977; Vinod, 1978).

Although heteroscedasticity does not destroy the consistency and unbiasedness of the OLS estimators, it can affect the precision of hypothesis testing and make its results ambiguous. Gujarati (2004) suggests that, if the variances of the disturbances are known, the weighted least squares method yields BLUE estimators (heteroscedasticity correction). These variances are rarely known, however, so the other way to correct for heteroscedasticity is to use White’s heteroscedasticity-consistent variances and standard errors.
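For a regression with a single regressor, White’s heteroscedasticity-consistent variance of the slope has a simple closed form: the sum of (x_i - mean)² e_i² divided by Sxx², where e_i are the OLS residuals. The sketch below computes it next to the conventional standard error. It implements only the basic version without small-sample adjustment (often labelled HC0), and the three data points and the function name are hypothetical, chosen so that the arithmetic can be followed by hand.

```python
from math import sqrt


def slope_standard_errors(x, y):
    """Return (slope, conventional SE, White HC0 SE) for y = b1 + b2 * x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    d = [xi - x_bar for xi in x]  # regressor in deviation form
    sxx = sum(di ** 2 for di in d)
    b2 = sum(di * (yi - y_bar) for di, yi in zip(d, y)) / sxx
    b1 = y_bar - b2 * x_bar
    e = [yi - (b1 + b2 * xi) for xi, yi in zip(x, y)]  # OLS residuals
    # Conventional: one common error variance, s^2 = RSS / (n - 2).
    conventional = sqrt(sum(ei ** 2 for ei in e) / (n - 2) / sxx)
    # White (HC0): each squared residual stands in for its own variance.
    robust = sqrt(sum((di * ei) ** 2 for di, ei in zip(d, e)) / sxx ** 2)
    return b2, conventional, robust


b2, se_ols, se_white = slope_standard_errors([-1.0, 0.0, 1.0], [0.0, 0.0, 3.0])
print(b2, se_ols, se_white)
```

The slope estimate is the same under both approaches; only the standard error, and therefore the t statistic and the inference drawn from it, changes.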

As with multicollinearity, transformation of variables is also recommended for heteroscedasticity, outliers, non-normality and non-linearity of data (Tabachnick & Fidell, 2001). The ultimate objective of a transformation (e.g. a log, square root or inverse transformation) is to normalize the data. However, Tabachnick and Fidell (2001) suggested avoiding transformations and using them only in extreme cases. It is also worth mentioning that many researchers, such as Bollinger & Chandra (2005) and Garson (2012), prefer winsorizing the data to dropping outliers directly.
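Winsorizing keeps every case in the sample but pulls the most extreme values in to the nearest retained value, instead of deleting them. The sketch below is one simple way to do this, assuming a symmetric proportion in each tail; the function name, the default proportion and the scores are hypothetical, and the cited authors may define the cut-offs differently.

```python
def winsorize(values, proportion=0.1):
    """Clamp the lowest and highest `proportion` of cases to the nearest kept value."""
    n = len(values)
    k = min(int(proportion * n), (n - 1) // 2)  # cases to clamp in each tail
    if k == 0:
        return list(values)
    ordered = sorted(values)
    low, high = ordered[k], ordered[n - k - 1]
    return [min(max(v, low), high) for v in values]  # original order is kept


# Hypothetical scores with one extreme case.
scores = [1, 2, 3, 4, 100]
winsorized = winsorize(scores, proportion=0.2)
print(winsorized)
```

The sample size is unchanged, which is the practical advantage over dropping the outlying case.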

Conclusion

This illustrative study motivates researchers to understand the true value of the GLRM assumptions and provides a “fingertips approach” to data health diagnostics. Social scientists and researchers must test the health of their collected data before testing, building or extending social science theories. The soul-body relationship is a fitting analogy for the social sciences: the body has no value without the soul.

Note: This article was published in European Online Journal of Natural and Social Sciences.

Data Health Assurance in Social and Behavioral Sciences Research
