UW Foster OPMGT 565 Final Project Writeup
WATER POTABILITY
Introduction
Access to clean and potable water is a basic human right. However, declining quality of water has become a global
issue of concern as climate change, human population explosion, and expansion of industrial and agricultural
activities threaten to cause major alterations to the hydrological cycle.
The quality of water used for consumption is influenced by multiple natural and human factors. Typically, water
quality is determined by comparing the physical and chemical characteristics of a water sample with water quality
guidelines or standards set by the WHO. An example of a prevalent water quality problem is eutrophication – a
result of high-nutrient loads (mainly phosphorus and nitrogen), which substantially impairs beneficial uses of water.
The presence of certain contaminants in our water can lead to health issues, including gastrointestinal illness,
reproductive problems, and neurological disorders. Hence, testing the quality of water is extremely important on all
geographical levels to collect vital information on potability of a particular body of water, and whether it may need
special treatment before consumption.
Objective
For this project, we use a dataset that records chemical and physical characteristics such as pH value, hardness,
presence of carbon compounds, etc., for samples collected from various water bodies. We use this dataset to
perform a step-by-step analysis with three Machine Learning models and summarize our insights and
recommendations. The goal of this project is to identify the business analytics model with the highest accuracy
relative to the others and use it to predict whether a sample of water is safe for consumption based on its
physical characteristics and chemical composition.
Data Definition
In summary, the dataset has 10 features/variables and 3276 observations. The data definition is as follows:
pH value: pH indicates the acid-base balance of water, i.e., whether the water is acidic or alkaline. WHO
recommends a permissible pH range of 6.5 to 8.5.
Hardness (in mg/L): Hardness is the capacity of water to precipitate soap. It is mainly caused by calcium and
magnesium salts. These salts are dissolved from geologic deposits through which water travels.
Solids (Total dissolved solids – TDS in ppm): Inorganic and some organic minerals may produce unwanted taste
and a diluted color in the appearance of water. The desired TDS range for drinking water is 500-1000 ppm.
Chloramines (in ppm): Chloramines are major disinfectants used in public water systems and are most commonly
formed when ammonia is added to chlorine to treat drinking water. Levels up to 4 ppm are considered safe in
drinking water.
Sulfate (in mg/L): Sulfates naturally occur in minerals, soil, and rock and are commercially used in the chemical
industry. Sulfate concentration in seawater is about 2,700 mg/L.
Conductivity (in μS/cm): Electrical conductivity (EC) measures the ionic process of a solution that enables it to
transmit current. According to WHO standards, EC value should not exceed 400 μS/cm.
Organic_carbon (in ppm): Total organic carbon (TOC) is a measure of the total amount of carbon in organic
compounds in pure water. According to US EPA
Turbidity (in NTU): Turbidity is a measure of cloudiness or haziness of a fluid caused by large numbers of
individual particles that are generally invisible to the naked eye. The higher the turbidity level, the higher the
risk that people may develop gastrointestinal diseases. The WHO recommended value is 5.00 NTU.
Potability: Indicates if water is safe for human consumption where 1 means Potable and 0 means Not potable.
Data Analysis
1. Data exploration
A summary of the data is defined as follows:
Key points to note from this summary:
– The Potability variable is an integer column; all other variables are doubles.
– The pH, Sulfate, and Trihalomethanes variables have missing values, shown by the NA counts.
– The mean of the Potability variable is 0.39, i.e., about 39% of the samples are labeled potable.
Hence, the data requires pre-processing to give accurate analysis results. This step may reduce the number of
usable rows, but treating the data is preferable because missing values of variables that affect potability, such
as the sulfate and trihalomethanes content of a water sample, could lead to wrong predictions of water quality.
2. Pre-processing
2.1 Missing Values
As identified in our data exploration phase, three variables have the following counts of missing values:
– pH: 491
– Sulfate: 781
– Trihalomethanes: 162
We considered the following options to treat missing values:
1. Continue the analysis with missing values
2. Remove the rows of data that have missing values
3. Replace missing values with the median of the variable (imputation)
Option 1 risks producing a model that gives wrong predictions of water quality, and option 2 may lead to
significant data loss. Hence, we chose option 3 and replaced the missing values with the median of each variable.
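Option 3 can be sketched in a few lines. This is a minimal Python illustration, not the project's R code (which is linked in Appendix 1); it assumes missing entries are represented as None, and the function name and toy pH values are made up for the example.

```python
from statistics import median

def impute_median(values):
    """Replace None entries with the median of the observed entries."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in values]

# Toy pH column with two missing readings (illustrative values, not from the dataset)
ph = [7.0, None, 6.5, 8.0, None, 7.5]
ph_imputed = impute_median(ph)
print(ph_imputed)  # [7.0, 7.25, 6.5, 8.0, 7.25, 7.5]
```

The median is computed from the observed values only, so the imputed column keeps the same row count as the original.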
A summary of the imputed dataset is as follows:
2.2 Multicollinearity
We ran an initial correlation check to see whether any pair of variables is strongly correlated, so that we could
account for multicollinearity in the main analysis. The correlation table can be seen as follows:
As observed above, no pair of variables has a correlation coefficient greater than 0.7 (70%).
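The pairwise check described above could look roughly like the following Python sketch. It assumes the Pearson correlation coefficient and a cutoff of 0.7 on its absolute value; the column names and values are toy data chosen so that exactly one pair is flagged.

```python
from itertools import combinations
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def highly_correlated(columns, cutoff=0.7):
    """Return (name_a, name_b, r) for every pair of columns with |r| above the cutoff."""
    flagged = []
    for (name_a, col_a), (name_b, col_b) in combinations(columns.items(), 2):
        r = pearson(col_a, col_b)
        if abs(r) > cutoff:
            flagged.append((name_a, name_b, r))
    return flagged

# Toy columns (illustrative only): ph_copy is exactly 2 * ph, so that pair is flagged
toy = {
    "ph": [6.5, 7.0, 7.5, 8.0, 8.5],
    "hardness": [210.0, 180.0, 205.0, 190.0, 200.0],
    "ph_copy": [13.0, 14.0, 15.0, 16.0, 17.0],
}
flagged = highly_correlated(toy)
print(flagged)
```

An empty result from such a check corresponds to the finding reported here: no pair exceeds the cutoff.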
2.3 Outlier detection
We also ran a box plot outlier test to check for outliers in the data. The boxplots are shown in Appendix 2. We
didn’t find any significant outliers in the dataset except in one variable. Hence, we decided not to remove
outliers, as doing so wouldn’t significantly impact our analysis.
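A box plot flags points that fall more than 1.5 times the interquartile range (IQR) beyond the quartiles. The Python sketch below applies that rule directly; it assumes the default quartile method of the standard library, which may differ slightly from the convention used by a given plotting tool, and the readings are toy values.

```python
from statistics import quantiles

def iqr_outliers(values, k=1.5):
    """Return the values lying more than k * IQR outside the first and third quartiles."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

# Toy readings (illustrative only) with one extreme value
readings = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 50.0]
outliers = iqr_outliers(readings)
print(outliers)  # [50.0]
```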
3. Methodology
After initial discussions on the approach to solve the problem statement, we concluded that it could be best
solved by using classification models. We chose classification for this project because we are trying to predict
the outcome of a dependent variable (Potability) based on previous observations to answer a yes/no (discrete)
question: Is this water sample potable? That is, we are classifying a water sample as potable or non-potable.
Hence, we chose logistic regression (binary classification), Classification and Regression Trees (CART), and
random forest (ensemble classification) models. For each model, we used the following method of analysis:
1. Split the dataset: 75% training, 25% test.
2. Fit the chosen ML model on the training data and measure accuracy and specificity.
3. Predict on the test data using a 0.5 (default) probability threshold. Measure accuracy and specificity.
4. Predict on the test data using a 0.75 probability threshold. Measure accuracy and specificity.
5. Measure the area under the curve (AUC) using the ROC graph.
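Step 1 of the method can be sketched as a random 75/25 split. This Python version is only an illustration: the seed is arbitrary, the rows are integer stand-ins for the 3276 observations, and the project's actual split was done in R and may have been stratified or seeded differently.

```python
import random

def train_test_split(rows, train_fraction=0.75, seed=565):
    """Shuffle a copy of the rows and cut it into training and test parts."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

rows = list(range(3276))  # stand-ins for the 3276 observations
train, test = train_test_split(rows)
print(len(train), len(test))  # 2457 819
```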
We chose a refined threshold probability value of 75% because we want high specificity. Falsely labeling a
sample of non-potable water as potable (a false positive) could lead to significant health risks.
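The effect of the threshold on accuracy and specificity can be illustrated with a small Python sketch. The predicted probabilities and labels below are toy values, not model output from this project; specificity is computed as TN / (TN + FP), where a false positive is a non-potable sample predicted as potable.

```python
def evaluate(probs, labels, threshold=0.5):
    """Return (accuracy, specificity), predicting potable (1) when prob >= threshold."""
    tn = fp = correct = 0
    for p, y in zip(probs, labels):
        pred = 1 if p >= threshold else 0
        correct += pred == y
        if y == 0:
            if pred == 0:
                tn += 1
            else:
                fp += 1  # non-potable water labeled potable: the costly error
    return correct / len(labels), tn / (tn + fp)

# Toy predicted probabilities of potability and true labels (illustrative only)
probs = [0.10, 0.30, 0.55, 0.40, 0.80, 0.60, 0.65, 0.70]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

acc_50, spec_50 = evaluate(probs, labels, threshold=0.5)
acc_75, spec_75 = evaluate(probs, labels, threshold=0.75)
print(acc_50, spec_50)  # 0.875 0.75
print(acc_75, spec_75)  # 0.625 1.0
```

In this toy case, raising the threshold removes the false positive (specificity rises to 1.0) but misses more potable samples, so accuracy falls.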
The analysis with each of these models is discussed as follows.
Logistic Regression
Logistic regression is a type of predictive modeling technique used to find the relationship between a dependent
(categorical) variable and one or more independent variables. As observed in our data exploration phase, the
dependent variable Potability is an integer type. Hence, to apply logistic regression, we first converted the
Potability into a factor to use it as a categorical variable.
We then split the dataset into training and test sets and fit the logistic regression model on the training
dataset. The result is as follows:
A key observation here is that none of the variables has a p-value below 0.05, i.e., none is statistically
significant at the 5% level. Moreover, all variables have negative coefficients except chloramines and solids,
which means that higher values of every variable except solids and chloramines are associated with higher odds
of non-potability.
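The link between a coefficient's sign and the odds follows from the logistic model: a one-unit increase in a predictor multiplies the odds of Potability = 1 by exp(coefficient). The Python sketch below uses hypothetical coefficient values purely to show the sign effect; they are not the fitted values from this project.

```python
from math import exp

def odds_multiplier(coefficient, increase=1.0):
    """Factor by which the odds of Potability = 1 change when a predictor rises by `increase`."""
    return exp(coefficient * increase)

# Hypothetical coefficients, chosen only to illustrate the sign effect
negative_effect = odds_multiplier(-0.02)  # below 1: odds of potability fall
positive_effect = odds_multiplier(0.03)   # above 1: odds of potability rise
print(negative_effect, positive_effect)
```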
Further, we predict on the test data and measure the accuracy of the model against the benchmark.
As seen, the logistic regression model's accuracy of 60.97% is almost the same as the benchmark accuracy of
60.99%, and its AUC value of 0.48 is slightly below the 0.5 expected from random guessing.
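Both reference quantities can be computed by hand. The Python sketch below assumes the benchmark is the majority-class baseline (always predicting the most common label) and computes AUC as the probability that a randomly chosen potable sample receives a higher score than a randomly chosen non-potable one, with ties counted as half. The labels and scores are toy values.

```python
from collections import Counter

def benchmark_accuracy(labels):
    """Accuracy obtained by always predicting the most common class."""
    return Counter(labels).most_common(1)[0][1] / len(labels)

def auc(probs, labels):
    """Share of (potable, non-potable) pairs ranked correctly; ties count as half."""
    pos = [p for p, y in zip(probs, labels) if y == 1]
    neg = [p for p, y in zip(probs, labels) if y == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

# Toy labels and scores (illustrative only)
labels = [0, 0, 0, 1, 1]
probs = [0.2, 0.6, 0.4, 0.5, 0.7]

baseline = benchmark_accuracy(labels)        # 3 of 5 samples are class 0
model_auc = auc(probs, labels)               # 5 of 6 pairs ranked correctly
constant_auc = auc([0.5] * 5, labels)        # an uninformative score gives 0.5
print(baseline, model_auc, constant_auc)
```

A model whose AUC sits near 0.5 ranks samples no better than a constant score, which is why it cannot beat the benchmark by much at any threshold.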
CART Classification
We fit the Classification and Regression Trees (CART) classification model on the training dataset and then
measured its accuracy on the test dataset. The decision tree on training data is as follows:
As seen above, the model selects Sulfate as the key variable for splitting the dataset. The model gave an accuracy
of 62.31%, which is greater than the benchmark accuracy. We then measured the accuracy of our model at a
probability threshold of 75%. Results are as follows:
As seen, the highest specificity, 97.6%, is for the model with a probability threshold of 75%, at a slightly lower
accuracy than the model with a 50% probability threshold. At 0.55, the AUC value of the model is better than
that of logistic regression.
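How a tree picks a splitting variable can be sketched for a single feature. CART implementations commonly choose the split that minimizes the weighted Gini impurity of the two child nodes; the Python sketch below assumes that criterion, and the sulfate values and labels are toy data that make no claim about the direction of the real relationship.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)

def best_split(values, labels):
    """Return (threshold, weighted_gini) for the best 'value < threshold' split."""
    pairs = sorted(zip(values, labels))
    best = (None, gini(labels))  # no split yet: impurity of the parent node
    for i in range(1, len(pairs)):
        if pairs[i][0] == pairs[i - 1][0]:
            continue  # cannot split between identical values
        threshold = (pairs[i][0] + pairs[i - 1][0]) / 2
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(pairs)
        if score < best[1]:
            best = (threshold, score)
    return best

# Toy data (illustrative only)
sulfate = [250.0, 260.0, 300.0, 310.0, 400.0, 420.0]
potable = [1, 1, 0, 0, 0, 0]
split = best_split(sulfate, potable)
print(split)  # (280.0, 0.0)
```

A full tree repeats this search over every variable at every node and keeps the variable with the lowest resulting impurity.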
Random Forest Classification
To implement the random forest model, we first ran the model on training dataset, and then made predictions on
test dataset using 0.5 and 0.75 probability thresholds respectively, measuring the accuracy and specificity of
each model. Results are as follows:
As seen, this model has the highest AUC value among all the models evaluated, at 0.67, and has a high
specificity of 99.6% at the 75% probability threshold. However, increasing the probability threshold lowers the
model accuracy by about 5 percentage points (from 66.82% to 61.9%).
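One common way a random forest produces a probability is as the fraction of trees voting for a class; the threshold is then applied to that fraction. The Python sketch below assumes this vote-fraction definition (some implementations average per-tree probabilities instead), and the votes are hypothetical.

```python
def forest_probability(tree_votes):
    """Fraction of trees voting 'potable' (1) for one sample."""
    return sum(tree_votes) / len(tree_votes)

def classify(tree_votes, threshold=0.5):
    """Label the sample potable (1) only if the vote fraction reaches the threshold."""
    return 1 if forest_probability(tree_votes) >= threshold else 0

# Hypothetical votes from 10 trees for one sample: 6 vote potable
votes = [1, 1, 1, 0, 1, 0, 1, 0, 1, 0]
label_50 = classify(votes, threshold=0.5)   # 0.6 >= 0.5, so potable
label_75 = classify(votes, threshold=0.75)  # 0.6 < 0.75, so non-potable
print(label_50, label_75)  # 1 0
```

A sample with a narrow majority is labeled potable at the default threshold but not at 0.75, which is how the stricter threshold trades accuracy for specificity.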
Summary
The accuracy and specificity results are summarized as follows:
Benchmark Accuracy = 60.99%
Model                            Accuracy %   Specificity %     AUC value
Logistic Regression              62.31%       99% on training   0.48
Logistic Regression with t=75%   61.46%       NA                0.48
CART model                       62.31%       90.2%             0.55
CART model with t=75%            61.46%       97.6%             0.55
Random Forest                    66.82%       89.6%             0.67
Random Forest with t=75%         61.9%        99.6%             0.67
Conclusion
Based on our analysis, we conclude the following about predicting water potability:
1. The dataset needs more informative variables to accurately determine the potability of water samples.
Examples include the content of zinc, cyanide, etc., or proximity to a populated or industrial area.
2. The Random Forest classification model at a probability threshold of 50% best solves the problem
statement, with an accuracy of 66.82%.
3. Since this problem involves the potentially harmful effects of classifying a polluted water sample as
potable, specificity should be high, and the model should be further evaluated with different threshold
values to reach a balanced accuracy and specificity level.
Hence, we recommend using Random Forest classification model to predict water potability for future samples.
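The further threshold evaluation suggested in the conclusion could be organized as a simple sweep. The Python sketch below is one possible approach under assumed inputs: the probabilities, labels, and candidate thresholds are toy values, and the function name is illustrative.

```python
def sweep(probs, labels, thresholds):
    """Return (threshold, accuracy, specificity) for each candidate threshold."""
    rows = []
    for t in thresholds:
        preds = [1 if p >= t else 0 for p in probs]
        tn = sum(1 for pr, y in zip(preds, labels) if pr == 0 and y == 0)
        fp = sum(1 for pr, y in zip(preds, labels) if pr == 1 and y == 0)
        correct = sum(1 for pr, y in zip(preds, labels) if pr == y)
        rows.append((t, correct / len(labels), tn / (tn + fp)))
    return rows

# Toy predicted probabilities and labels (illustrative only)
probs = [0.10, 0.30, 0.55, 0.40, 0.80, 0.60, 0.65, 0.70]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

results = sweep(probs, labels, thresholds=[0.5, 0.6, 0.75])
for threshold, accuracy, specificity in results:
    print(threshold, accuracy, specificity)
```

In this toy case an intermediate threshold of 0.6 reaches full specificity without the accuracy loss seen at 0.75; on real data the sweep would be run on held-out predictions.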
Appendix 1: R Code
Link to our final R code is here: https://drive.google.com/file/d/1uAe6dZ3fTXyB8WNA40RyYIPem4rKh3X/view?usp=sharing
Dataset: https://www.kaggle.com/adityakadiwal/water-potability
We share the code as a link to avoid increasing the final page count.
Appendix 2: Data Visualization
2.1 Missing Data Rates
2.2 Variable summary histograms
2.3 Correlation Plot
2.4 Outliers Box Plot
2.5 ROC Curve Logistic Regression
2.6 ROC Curve CART
2.7 ROC Curve Random Forest