Close to Home

Posted by Hunter Sapienza on December 21, 2019

Four modules later, the first project I completed in Flatiron School’s online data science program remains my favorite. Rewind slightly to my time as an undergraduate at the University of Washington, where I studied human geography with a focus on urban development and city demographics, specifically the gentrification of the Capitol Hill neighborhood, where I lived during my senior year. With that educational background under my belt and the majority of my childhood spent in a King County suburb just outside the city, I found that the King County Housing Data project gave me the opportunity to draw on all of those prior experiences in an engaging and informative first project.

The King County Housing Data set provided my first comprehensive experience in tackling a data science problem through each stage of the OSEMN framework, from start to finish. Using the knowledge I gained in the first module of the bootcamp program, I started by obtaining and scrubbing the dataset to create a usable dataframe for the more engaging, exploratory component of the project. Even the scrubbing step provided some excitement, however, requiring me to make decisions about missing data that best suited the context of the problem. By exploring the meaning and particular values of the ‘waterfront’, ‘views’, and ‘yr_renovated’ features, I was able to fill the missing data in these columns with values that added to the understanding of the data, rather than taking away from the completeness of the dataframe. Such decisions were exciting and empowering to make, especially early in the program when I was just starting to gather experience working with big data.
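To make the missing-data decisions described above a little more concrete, here is a minimal sketch in pandas. The toy rows are invented for illustration, and the fill rule (treating a missing entry as "none" and encoding it as 0) is an assumption chosen for the example, not necessarily the exact rule used in the project. The extra 'renovated' flag is likewise a hypothetical addition.

```python
import numpy as np
import pandas as pd

# Toy rows standing in for the housing data; all values are invented.
df = pd.DataFrame({
    "price": [221900.0, 538000.0, 180000.0, 604000.0],
    "waterfront": [np.nan, 0.0, 1.0, np.nan],
    "views": [0.0, np.nan, 3.0, 0.0],
    "yr_renovated": [0.0, 1991.0, np.nan, 0.0],
})

# Assumption for this sketch: a missing value means the feature is absent,
# so it is filled with 0 instead of dropping the whole row.
fill_values = {"waterfront": 0, "views": 0, "yr_renovated": 0}
cleaned = df.fillna(value=fill_values)

# Hypothetical derived column: 1 if the home has a recorded renovation year.
cleaned["renovated"] = (cleaned["yr_renovated"] > 0).astype(int)

print(cleaned)
```

Filling instead of dropping keeps every sale in the dataframe, which matters when the rows with gaps are otherwise complete.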

It was in the exploration phase of the OSEMN framework, however, that I truly began to immerse myself in a process that I have come to love throughout the program. Using various methods and visualization tools, I pulled apart the data in many different ways, exploring the features through various lenses while focusing on the specific features that impact housing value. Because I could pair these findings with my preexisting knowledge of the King County region, I thoroughly enjoyed the insights I began to uncover and found myself passionately immersed in the project. Any moment I could spare, I would flip open my laptop, hop onto my Jupyter notebook, and begin to explore the dataframe from yet another angle. During this process, I created scatterplots to better understand the impact of living space characteristics, barplots to reveal the highest-value zipcodes, lineplots that illustrated trends in housing price over time, and more, all in an effort to better understand property values across the region.

The most challenging component of the project arrived when I attempted to model the data. While we had used various techniques for optimizing linear models in labs throughout the first module, the King County Housing Data project was the first time I needed to decide which combination of optimization strategies would be best for a particular dataset. I tested many different versions of the linear regression model and ultimately found that I needed to standardize and log transform many of my continuous features and the target variable, as well as reduce the number of features used as predictors by evaluating p-values from the initial model and examining multicollinearity issues. This trimming process allowed me to reach an r-squared value of 0.882, with low mean-squared error values for both the training and test sets. Although it was a substantial challenge in my first ever data science project, I found the process deeply fulfilling, as I came to understand the underlying concepts of linear regression more deeply and saw the rewards of these efforts when analyzing the final result. In a more specific look at the top predictors of housing value, I noticed several zipcodes at the top of the list, representing Queen Anne, South Lake Union, Laurelhurst, Medina, and Madison Park, some of the wealthiest pockets of the Seattle area. Waterfront views appeared high on that list as well, reflective of the value Seattleites place on their lakefront properties. Pairing my experience as a Seattle native with this concrete data was exhilarating, a feeling that has stayed with me throughout the remainder of the program as I begin to envision my future career in data science.
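The modeling steps described above (log transforming, standardizing, fitting a linear regression, and checking multicollinearity) can be sketched with NumPy alone. This is a rough illustration on synthetic data, not the project's actual code: the feature names, the generated values, and the vif helper are all assumptions made for the example, and the p-value screening step is left out because it would normally rely on a statistics package.

```python
import numpy as np

# Synthetic, right-skewed stand-ins for two continuous features and price.
rng = np.random.default_rng(0)
n = 200
sqft_living = rng.lognormal(mean=7.5, sigma=0.4, size=n)
sqft_lot = rng.lognormal(mean=9.0, sigma=0.6, size=n)
noise = rng.lognormal(mean=0.0, sigma=0.2, size=n)
price = 50 * sqft_living**0.8 * sqft_lot**0.1 * noise

def standardize(x):
    """Rescale to zero mean and unit standard deviation."""
    return (x - x.mean()) / x.std()

# Log transform to tame skew, then standardize the predictors.
X = np.column_stack([
    standardize(np.log(sqft_living)),
    standardize(np.log(sqft_lot)),
])
y = np.log(price)

# Ordinary least squares with an intercept column.
A = np.column_stack([np.ones(n), X])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
pred = A @ coef
r_squared = 1 - np.sum((y - pred) ** 2) / np.sum((y - y.mean()) ** 2)

def vif(X, j):
    """Variance inflation factor: regress column j on the other columns."""
    target = X[:, j]
    others = np.column_stack([np.ones(len(X)), np.delete(X, j, axis=1)])
    beta, *_ = np.linalg.lstsq(others, target, rcond=None)
    resid = target - others @ beta
    r2 = 1 - (resid @ resid) / np.sum((target - target.mean()) ** 2)
    return 1 / (1 - r2)

vifs = [vif(X, j) for j in range(X.shape[1])]
print("R-squared:", round(r_squared, 3))
print("VIFs:", [round(v, 2) for v in vifs])
```

A VIF near 1 suggests a predictor shares little information with the others, while large values flag the kind of multicollinearity that makes a feature a candidate for removal.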

All in all, when the context of a dataset hits that close to home, exploring the data becomes all the more engaging and thorough, and the results become easier to analyze successfully. Throughout that first project, I truly enjoyed drawing upon my past educational and personal experiences to inform the course of exploration and to interpret my findings. Moving forward, I hope that future projects (and eventually my career) can provide that same engagement and excitement as I apply both my skills and experience to impact our world with data.