Reflections On My First Classification Problem

Posted by Hunter Sapienza on December 21, 2019

Of all the module projects I’ve completed through the Flatiron School data science program, this one proved to be the most challenging thus far. Whereas in the first two projects students are provided with a dataset and assigned very specific goals to meet, in the third project students must find their own dataset in order to answer a question of their choosing. Students can either start with a problem and find a dataset suited to solving it, or begin with a dataset and develop questions to answer by exploring the data. Either way, the project must be of their choosing and involve some sort of classification modeling, in addition to thoroughly addressing the other components of the data science process covered in the first two modules.

While offering numerous challenges, both those of a technical nature and those testing my perseverance, this project offered the opportunity to work through the entire OSEMN data science process independently for the first time. Through the experience I gained a better understanding of: (1) what it means to discover passion for a particular topic, (2) how to obtain usable data from the vast resources available to our society, (3) the deeper conceptual nature of the many different models available to data scientists, and (4) how to contextualize those models appropriately within a real-world scenario. In this blog post, I will touch upon various points throughout this process and explain the knowledge I gained, with regard to data science both as a career field and as technical work.

Finding a Data Set

When I first read the Module 3 project description, I glanced briefly over the guideline to “find a topic you’re passionate about.” The project involved answering a question using classifier models, and I assumed I could find any dataset labeled “classification problem” on Kaggle, UCI, or any other database, and simply get started on the models! I couldn’t have been more wrong. As I have learned throughout this project, it is key to find a topic that you have a passion for, as it increases the quality and thoroughness of the final product. As I sent dataset upon dataset to my instructor, he asked, “Does it interest you?” or “Are you passionate about it?” I thought to myself, I guess. Maybe yes, maybe no, but I could certainly create models for it! How naïve I was… While it remains important to bring persistence and dedication to work that may not be of the greatest interest to me, I found that as my passion for this topic diminished over the course of the project (if it even existed to begin with), a topic that truly interests me and drives my motivation emerged, as I began taking frequent breaks to brainstorm and build ideas for my final Flatiron project instead. I still think frequently of new ideas and questions to explore, eager to begin that project instead. In those moments, I recognize the feelings I hope to foster within myself as an amateur data scientist.

As I begin what I hope to be a long data science career, I am discovering the importance of finding a cause I truly believe in. Data is everywhere in today’s society. Nearly every industry employs, or will soon employ, data scientists to make sense of this data and use it to move their mission forward. As I move closer to completing the program, I must begin to think more deeply about how I wish to use these skills and what I hope to accomplish for society as a data scientist.

Determining What Question to Answer

When I finally settled on UCI’s Cervical Cancer Risk Factors Classification dataset via Kaggle, I jumped immediately into exploring the data, assuming the simple end goal of classifying positive and negative cases of cancer across the 858 records. However, I soon discovered a problem: four target variables exist within the data, with little indication of how to use them together in the classification. The dataset provides the results of four tests for cervical cancer: Hinselmann, Schiller, Citology, and Biopsy. At first, I attempted to combine the four by creating a new column, ‘Cancer’, as the sum of all four for each record. This quickly turned into a convoluted multiclass classification problem with little to no indication of how the four targets were actually related, so I went back to square one. This time, after some research, I found that the biopsy is the definitive test for a cervical cancer diagnosis, so I eliminated the other three entirely and focused only on ‘Biopsy’ as the target variable. However, the models I developed proved weak at classifying cancer cases without the information from Hinselmann, Schiller, and Citology, so I started from the beginning once again. On my third and final attempt, I included the other three tests as predictive features and used them in my models alongside the other variables to classify the biopsy variable (renamed ‘Cancer’) as positive or negative. Although this assumes the three test results are available before classifying cases of cancer via biopsy, the resulting models performed much more effectively than any prior, so I stuck with this decision for the sake of the project; a minimal code sketch of this setup appears at the end of this section. For any future work, I would attempt to:

  • Perform a multiclass classification model and analysis with a combination of the four tests as the target variable.
  • Compare models for classification between Hinselmann, Schiller, Citology, and Biopsy.
  • Create classification models for Biopsy in the absence of the other three tests.

Ultimately, through this process, I learned the importance of defining a question at the onset of a project and developing a thorough understanding of the data provided. I created much more work for myself than needed, as I ran models over and over again due to my lack of understanding. While I learned a great deal through this exploratory process, I commit to doing my research from the start next time, before jumping too quickly into modeling and data exploration.
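For reference, here is a minimal sketch of the final setup described above, with the three screening tests kept as features and the biopsy result treated as the binary target. The file name and column names are assumptions based on the Kaggle version of the dataset, so adjust them as needed.

```python
# Minimal sketch of the final target setup (file name and column names
# are assumptions based on the Kaggle version of the dataset).
import pandas as pd

df = pd.read_csv("risk_factors_cervical_cancer.csv", na_values="?")

# Treat the biopsy result as the binary target, renamed 'Cancer',
# and keep the other three tests (Hinselmann, Schiller, Citology)
# among the predictive features.
df = df.rename(columns={"Biopsy": "Cancer"})

X = df.drop(columns=["Cancer"])
y = df["Cancer"]  # 1 = positive biopsy, 0 = negative
```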

Principal Component Analysis

Throughout the Module 3 lessons and labs, I recognized the usefulness of principal component analysis (PCA): it reduced the complexity of our data, sped up the runtime of our code, and, in many cases, allowed us to create more effective models. However, I failed to understand the true conceptual nature of the technique. As we employed vector theory to transform our data, I was unable to connect the beginning and end products of the process. What does the new data mean? What are my columns titled? How do I know which features were selected as most influential? All valid, but (as I now realize) naïve questions.

In my initial code, I seamlessly ran PCA to find the number of components I needed to describe 80% of the variance in my data, extracted these components to create a PCA-transformed dataframe, and used those features to develop additional candidate models for my classification problem (a short sketch of this workflow appears at the end of this section). When I returned later to dive deeper into the concepts within my markdown cells, I found that I could not adequately interpret or summarize the code I had run before. Fortunately, I hopped over to Google and quickly found some articles to help me out. Through other data scientists’ contributions on Medium, Stack Overflow, and Towards Data Science, I learned that you should only use PCA if you can answer yes to the following three questions:

  1. Do you want to reduce the number of variables, but aren’t able to identify variables to remove from consideration?
  2. Do you want to ensure your variables are independent of one another?
  3. Are you comfortable making your independent variables less interpretable?

The article focused especially on the third question. Why? Because, by using PCA, you completely transform your features: each principal component is a linear combination of the originals, so the new features have little to nothing in common with the original ones beyond the formulas used for the transformation. Thus, the new dataframe has no meaningful column titles (only PC1, PC2, PC3, … , PCn), and we cannot use the components to interpret the data; they serve solely as inputs to the remaining classification models. While this represents a simple understanding of PCA, it meant a great deal to me, as I not only understood more about the conceptual nature of the topic, but, more importantly, learned how vital it is to have this conceptual understanding before simply running code without any regard for what it actually means.
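To make this concrete, here is a rough sketch of the workflow I described above, using scikit-learn: standardize the features, keep enough components to explain roughly 80% of the variance, and build a new dataframe whose columns are simply PC1 through PCn. The variable names are placeholders, and X stands in for whatever numeric feature dataframe you are working with.

```python
# Rough sketch of the PCA workflow: scale, fit, and rebuild a dataframe
# with generically named components (X is a numeric feature dataframe).
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

X_scaled = StandardScaler().fit_transform(X.fillna(X.median()))

pca = PCA(n_components=0.80)        # keep enough components for ~80% variance
X_pca = pca.fit_transform(X_scaled)

pca_df = pd.DataFrame(
    X_pca,
    columns=[f"PC{i + 1}" for i in range(pca.n_components_)],
)
print(pca.explained_variance_ratio_.sum())  # total variance explained
```

The new columns carry no domain meaning on their own, which is exactly the trade-off raised by the third question above.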

Contextualizing My Data Science Models

Throughout this project, I ran dozens of models using different variations of the data. While I found it easy to press “command+return” to run my code, I quickly discovered it to be much more difficult to understand and interpret the output. Many of the models produced similar results, with accuracy scores ranging from 91-98% and little means to differentiate their performance. However, I soon realized that accuracy scores tell only a partial story: with a small proportion of positive biopsy cases, it remained easy for a model to obtain a high accuracy score by favoring the negative classification. Within the medical community, however, false positives are much preferred over false negatives, as the emotional toll of a misdiagnosis as positive is far less severe than that of a missed diagnosis that remains untreated. Thus, I began to investigate my models’ confusion matrices, looking at precision (true positives divided by all predicted positives) alongside how many true cases were missed. Through this exploration, I discovered that while some models had lower accuracy scores, their precision scores remained high, substantiating the claim that they may be more effective models within the context of this cancer classification problem. Furthermore, I learned how to use ROC curves and the AUC score to further support certain models, balancing precision with recall to limit both false negatives and false positives (a small sketch of these checks follows at the end of this section).

No one model can rise above the rest as a frontrunner until you contextualize each model within a thorough understanding of the problem. I learned this throughout the Module 3 project, and I will continue to return to this lesson in all future work. It is easy to simply run cells of code and compare the output; the hard part is connecting these numbers to their real-world context and impact. Data can do a great deal of good, but it can also cause a great deal of harm. It’s up to us, as the scientists behind the data, to determine which road we take.
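As mentioned above, the kind of check I ended up running looked something like the sketch below. It assumes a fitted scikit-learn classifier called model and a held-out test split X_test, y_test; the names are placeholders rather than the exact ones from my notebook.

```python
# Look past accuracy: confusion matrix, precision, recall, and ROC AUC.
from sklearn.metrics import (confusion_matrix, precision_score,
                             recall_score, roc_auc_score)

y_pred = model.predict(X_test)

print(confusion_matrix(y_test, y_pred))               # [[TN, FP], [FN, TP]]
print("precision:", precision_score(y_test, y_pred))  # TP / (TP + FP)
print("recall:   ", recall_score(y_test, y_pred))     # TP / (TP + FN)

# ROC AUC is computed from predicted probabilities, not hard labels.
y_proba = model.predict_proba(X_test)[:, 1]
print("ROC AUC:  ", roc_auc_score(y_test, y_proba))
```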

In Summary

The Module 3 project proved challenging for me, but through these difficulties, I learned several important lessons. As I move forward with my work as a data scientist, I commit to the following:

  • Find work that drives my motivation. Pick projects that build my passion for data!
  • Choose an initial question before jumping into the data and determine how my dataset will support me in finding an answer.
  • Understand the conceptual nature of every model and piece of code that I run. If I can’t explain it, I shouldn’t run it.
  • Recognize the implications of how I contextualize my data. The impact of my findings remains my responsibility.

As I continue toward the end of the Flatiron School program, I remain excited to put these lessons into practice as I build my data science skills and portfolio. Click here for a more in-depth technical and conceptual overview of my Module 3 project.