Welcome to the continuation of our look at statistical analysis with ArcMap. Recall that the theme being explored with statistics is methamphetamine lab busts around Charleston, West Virginia. These past two weeks of analysis have been the bulk of the work for this project. The overall objectives of the analysis portion of this project are to review and understand regression analysis basics and a couple of key techniques; define the dependent and independent variables for the study as they apply to the regression analysis; perform multiple renditions of an Ordinary Least Squares (OLS) regression model; and finally, complete six statistical sanity checks based on the OLS model outcomes.
In the previous post we looked at a big overview of the area that is being analyzed. There are 54 lab busts from the 2004-2008 time frame in the DEA's National Clandestine Laboratory data. Decennial census data from 2000 and 2010 at the census tract level was spatially joined to these 54 lab busts. The data was then normalized to percentages by census tract across 31 categories for analysis in the OLS model. These 31 categories were fed into the model and then systematically removed while analyzing their effect on the model. Ultimately, as good a model as possible was arrived at, with some results below.
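For readers who want to see the normalization step spelled out, here is a minimal sketch of turning raw tract counts into percentages. The field names and numbers are made up for illustration and are not taken from the project data; the actual work was done inside ArcMap.

```python
def to_percent(count, total):
    """Return count as a percentage of total; 0.0 when the tract total is zero."""
    if total == 0:
        return 0.0
    return 100.0 * count / total


# Hypothetical tract records: names and values are illustrative only.
tracts = [
    {"tract": "A", "population": 4000, "renters": 1000},
    {"tract": "B", "population": 2500, "renters": 500},
]

# Add a normalized field so tracts of different sizes can be compared.
for t in tracts:
    t["pct_renters"] = to_percent(t["renters"], t["population"])

for t in tracts:
    print(t["tract"], t["pct_renters"])
```

Normalizing this way keeps a large tract from dominating the regression simply because it has more people in it.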
This is an extract of the table that I put together from the ArcMap-generated output to depict the OLS results. This is a cleaner format than a straight screenshot because it incorporates the descriptions of the individually labeled data elements. The key thing to note in the table is that only 11 of the original 31 variables are now incorporated into the OLS model. How were variables removed, you might ask? There are six checks, or questions to answer, to determine the validity of a variable's use in the OLS model: does an independent variable help or hurt the model; is its relationship to the dependent variable as expected; are there redundant explanatory variables; is the model biased; are there variables missing or unexplained residuals; and how well does the model predict the dependent variable? The first three of these were generally grouped into one solid check for determining whether a variable should stay or go. The remaining checks were applied to the model results as a whole. The key attributes to look at for a variable are the coefficient [a], probability [b], and VIF [c], as depicted on the table. As long as a variable had a coefficient not near zero, a probability lower than .4, and a VIF less than 7.5, it could stay. Not all of the remaining variables meet these criteria now; however, you have to look at model functionality as a whole. The R-Squared [d] value is right at .7 (rounded up), which means that the model as it stands explains about 70% of the variation in meth lab locations based on the variables in use. This is pretty good when working with sociological data of this type. After looking at this data table, it's time to transition to the visual interpretation seen below.
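The per-variable screen and the R-Squared value can be sketched in a few lines of Python. This is only an illustration of the rules of thumb described above, not what ArcMap runs internally. The function names and the `coef_eps` cutoff for "not near zero" are my own assumptions; the .4 and 7.5 thresholds are the ones stated in this post.

```python
def passes_screen(coefficient, probability, vif,
                  coef_eps=1e-6, max_p=0.4, max_vif=7.5):
    """Apply the three per-variable checks: coefficient, probability, VIF.

    coef_eps is an assumed cutoff for "near zero"; pick one that suits
    the units of the variable being screened.
    """
    return abs(coefficient) > coef_eps and probability < max_p and vif < max_vif


def r_squared(observed, predicted):
    """Share of variation in observed explained by predicted: 1 - SS_res / SS_tot."""
    mean_obs = sum(observed) / len(observed)
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    return 1.0 - ss_res / ss_tot


# Illustrative values only, not taken from the results table.
print(passes_screen(coefficient=0.8, probability=0.03, vif=2.1))   # strong candidate
print(passes_screen(coefficient=0.8, probability=0.03, vif=9.0))   # redundant variable
print(r_squared([1, 2, 3, 4], [1, 2, 3, 4]))
```

A variable that fails the screen is not automatically dropped here; as noted above, the model as a whole still has to be weighed.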
This map depicts the standardized residuals for the OLS model shown in the table. It symbolizes areas in standard deviation classes. However, rather than wanting a Gaussian-curve spread that shows some of every color, you ideally want values in the -0.5 to +0.5 range, because residuals that small mean the prediction was close to the observed value. Darker browns indicate areas where the model predicted fewer meth labs than there actually were, and darker blues indicate areas where the model expected more meth labs than were actually present. Remember, though, that per the table above the model only explains about 70% of the variation in meth labs, and the majority of the study area is still within 1 standard deviation.
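To make the map classes concrete, here is a rough sketch of how standardized residuals could be computed and sorted into the three groups described above. It assumes a residual is observed minus predicted and that the residuals average to zero, as they do for an OLS model with an intercept; ArcMap's exact formula may differ. The function names, labels, and sample numbers are illustrative only.

```python
from statistics import pstdev


def standardized_residuals(observed, predicted):
    """Divide each residual (observed - predicted) by the residual spread.

    Assumes the residuals average to zero, so no centering is applied.
    """
    residuals = [o - p for o, p in zip(observed, predicted)]
    spread = pstdev(residuals)
    if spread == 0:
        return [0.0 for _ in residuals]
    return [r / spread for r in residuals]


def classify(z, band=0.5):
    """Label one standardized residual using the +/-0.5 band from the post."""
    if z > band:
        return "under-predicted"   # more labs than the model expected
    if z < -band:
        return "over-predicted"    # fewer labs than the model expected
    return "close"


# Illustrative lab counts per tract, not the project's data.
observed = [3, 1, 2, 2]
predicted = [2, 2, 2, 2]
z_scores = standardized_residuals(observed, predicted)
labels = [classify(z) for z in z_scores]
print(labels)
```

In the terms of the map, "under-predicted" corresponds to the brown classes and "over-predicted" to the blue ones.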
This week's focus was not to describe the data results, but to accomplish the analysis leading up to them. Please follow up next week for a look at the finalized product. Thank you.