Interpretation of surface water quality using principal components analysis and cluster analysis

A variety of approaches are used to interpret the concealed variables that determine the variance of observed water quality at various source points. A considerable proportion of these approaches are statistical, multivariate statistical techniques in particular. Multivariate statistical techniques are required for easy and robust evaluation when the number of variables is large (greater than two). Using the multivariate statistics of principal components analysis (PCA) and cluster analysis (CA), this study attempted to determine the major factors responsible for variations in the quality of 30 surface ponds used for domestic purposes in six selected communities of Akoko Northeast LGA, Ondo State, Nigeria. The sample locations were classified into mutually exclusive, previously unknown groups that share similar characteristics. The laboratory results for 20 parameters (6 physical, 8 chemical, 4 heavy metal and 2 microbial) from the sampled springs were subjected to PCA and CA for further interpretation. The results show that 5 components account for 97.52% of the total variance in surface spring quality, while 2 cluster groups were identified for the locations. Based on the parameter concentrations and the land-use impacts, it was concluded that domestic and agricultural waste strongly influenced the variation and quality of the ponds in the area.


INTRODUCTION
The complexity of water quality as a subject is reflected in the various types of measurements. These measurements include simple (in situ), basic and more complex (laboratory) parameters. For instance, pH, temperature and DO can be measured with a portable in-situ pH meter, a mercury thermometer and an M90 Mettler Toledo AG DO meter, respectively (USGS, 2006). BOD, TSS, Cu, Fe, total bacterial counts, total coliforms, etc. can be analyzed in the laboratory using standard methods for the examination of water samples (Ayoade, 1988; APHA, 2005; WHO, 2006; USGS, 2006).
Surface water quality assessment is a matter of serious concern today because of its role in servicing the domestic water needs of water-stressed areas (Yerel, 2010; Ayeni et al., 2011). Surface water quality is principally influenced by natural and anthropogenic processes, particularly in urban areas, and by agricultural activities around rural areas (Ayeni, 2010; Ayeni et al., 2011). The level of water quality is determined by the physical, chemical and biological parameters present in the water. A relationship between two parameters may also lead to an increase or decrease in the concentration of others. This relationship or association is usually assessed using multivariate statistical techniques (Ifabiyi, 1997; Mazlum et al., 1999; Jaji et al., 2007). This is because some analyses are primarily concerned with relationships between samples, while others are largely concerned with relationships between variables.
*Corresponding author. E-mail: aayeni@unilag.edu.ng.
According to Mazlum et al. (1999) and Yerel (2010), many multivariate statistical techniques have the capacity to summarize large datasets by means of relatively few parameters. Nonetheless, the choice of multivariate statistical technique depends on the nature of the data, the problem, and the objectives of the study. Because the daily drinking and domestic water needs of the majority of residents in the area are met by unsafe surface water, in particular surface springs (Ayeni, 2010), there is a need to understand the variables that control the variations in their quality. The multivariate techniques of Principal Component Analysis (PCA) and Cluster Analysis (CA) are therefore adopted for the study. According to Praus (2005), PCA is used to search for new abstract orthogonal components (eigenvectors) which explain most of the data variation in a new, harmonized structure. Each principal component (PC) is a linear combination of the original variables and describes a different source of information through an eigenvalue based on the decomposition of the covariance/correlation matrix (Geladi and Kowalski, 1986). PCA is designed to transform the observed variables into uncorrelated linear combinations of the original variables, called "principal components" (Praus, 2005; Yerel, 2010), as well as to investigate the factors that cause variations in the observed datasets (Mazlum et al., 1999). The principal components therefore provide information for interpretation and better understanding of the most meaningful parameters, which describe the whole dataset through data reduction with a minimum loss of the original information.
Cluster analysis (CA) is an exploratory analysis technique for classifying a set of observations into two or more mutually exclusive, previously unknown groups based on combinations of interval variables (Stockburger, 1997; Trochim, 2006; Murali-Krishna et al., 2008; Yerel, 2010). According to Yerel (2010), CA organizes sampling entities into discrete clusters such that within-group similarity is maximized and among-group similarity is minimized according to some objective criteria. Its purpose is to discover a system for organizing observations and to sort them into groups that share similar properties, so that it is statistically easier to predict the behavior of such observations from their group membership. In this study, observations and sampling locations were classified using the Hierarchical Cluster Analysis (HCA) procedure. HCA identifies relatively homogeneous groups of variables (cases) through a dendrogram based on selected characteristics. The dendrogram clearly distinguishes the behaviours of the locations and presents the hierarchical clustering in a graphical format (Hastie et al., 2001; Ryberg, 2006).
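The idea that a principal component comes from decomposing the correlation matrix can be illustrated with a minimal sketch. It is not the software procedure used in the study: the function names, the power-iteration approach and the sample values are all assumptions chosen for illustration, and only the first (dominant) component is extracted.

```python
import math

def standardize(columns):
    """Scale each variable (one list per variable) to zero mean, unit sample variance."""
    scaled = []
    for col in columns:
        n = len(col)
        mean = sum(col) / n
        sd = math.sqrt(sum((v - mean) ** 2 for v in col) / (n - 1))
        scaled.append([(v - mean) / sd for v in col])
    return scaled

def correlation_matrix(columns):
    """Pearson correlation matrix computed from the standardized variables."""
    z = standardize(columns)
    n = len(z[0])
    return [[sum(a * b for a, b in zip(zi, zj)) / (n - 1) for zj in z] for zi in z]

def first_component(matrix, iterations=200):
    """Power iteration: dominant eigenvalue and unit eigenvector of a symmetric matrix.

    The start vector is deliberately non-uniform; a start vector exactly
    orthogonal to the dominant eigenvector would not converge.
    """
    p = len(matrix)
    vec = [1.0 / (k + 1) for k in range(p)]
    for _ in range(iterations):
        nxt = [sum(matrix[r][c] * vec[c] for c in range(p)) for r in range(p)]
        norm = math.sqrt(sum(v * v for v in nxt))
        vec = [v / norm for v in nxt]
    product = [sum(matrix[r][c] * vec[c] for c in range(p)) for r in range(p)]
    eigenvalue = sum(v * w for v, w in zip(vec, product))
    return eigenvalue, vec

# Hypothetical readings for three variables at five locations (not study data).
tds = [120.0, 150.0, 90.0, 200.0, 170.0]
hardness = [60.0, 80.0, 40.0, 110.0, 95.0]
ph = [6.8, 6.5, 7.1, 6.0, 6.3]

corr = correlation_matrix([tds, hardness, ph])
eigenvalue, loadings_direction = first_component(corr)
share = 100.0 * eigenvalue / len(corr)  # percent of total variance explained
```

With a correlation matrix, the eigenvalues sum to the number of variables, so dividing an eigenvalue by that number gives the share of total variance attributed to the component.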

Study area and sampling locations
The study area lies between longitude 5° 38' and 6° 04' E, and latitude 7° 26' and 7° 42' N (Ayeni, 2010). The geological complex is composed mainly of granite, mica schist, gneisses and metasediments (Barbour et al., 1982; Adekunle et al., 2007). The area falls within a sub-tropical climate with average rainfall of over 1500 mm per annum. The temperature ranges from about 30 to 38°C, while the vegetation cover is dominated by derived secondary rain forest. The soil is classified as Ferric Acrisols with relatively high cation profiles (Fasona et al., 2007; Nwachokor and Uzu, 2008).

METHODS
Twenty water quality parameters from 30 surface springs were monitored for 12 months. Each month, water samples from the selected springs were collected and analysed in the laboratory using the APHA (2005) standard methods for the examination of water and wastewater. The coordinates of the sampled springs were interpolated onto a geo-rectified map of the study area (Figure 1). The surface water quality parameters selected for the study are pH, temperature, dissolved oxygen (DO), biochemical oxygen demand (BOD), total suspended solids (TSS), total dissolved solids (TDS), turbidity, total hardness (TH), calcium hardness (Ca²⁺), magnesium hardness (Mg²⁺), chloride (Cl⁻), nitrate (NO₃⁻), phosphate (PO₄³⁻), oil and grease, copper (Cu), iron (Fe), manganese (Mn), zinc (Zn), total bacterial counts (TBC) and total coliforms (TC).
The laboratory results were evaluated using the multivariate statistical techniques of PCA for the selected parameters and CA for the sample locations. The principal component is given by the formula:

$$z_{ij} = a_{i1}x_{1j} + a_{i2}x_{2j} + \cdots + a_{in}x_{nj}$$

where z = component score, a = component loading, x = measured value of the variable, i = component number, j = sample number, and n = the total number of variables.
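Using the symbols defined above, the component score of sample j on component i is the sum, over all n variables, of the loading a multiplied by the measured value x. The following is a minimal sketch of that weighted sum; the function name and the numbers are hypothetical and are not taken from the study's tables.

```python
def component_scores(loadings, samples):
    """Return z[i][j], the score of sample j on component i.

    loadings[i][k] is the loading a of variable k on component i.
    samples[j][k] is the measured (typically standardized) value x
    of variable k in sample j.
    """
    n_variables = len(loadings[0])
    for row in list(loadings) + list(samples):
        if len(row) != n_variables:
            raise ValueError("every row must have one entry per variable")
    return [
        [sum(a * x for a, x in zip(component, sample)) for sample in samples]
        for component in loadings
    ]

# Hypothetical example: 2 components, 2 variables, 2 samples.
example_loadings = [[0.6, 0.8], [0.8, -0.6]]
example_samples = [[1.0, 2.0], [0.0, -1.0]]
scores = component_scores(example_loadings, example_samples)
```

Each row of `scores` corresponds to one component and each column to one sample, mirroring the indices i and j in the definition.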
In the case of cluster analysis, the formula is given as:

$$d_{ij}^{2} = \sum_{k=1}^{m} \left(z_{ik} - z_{jk}\right)^{2}$$

where $d_{ij}^{2}$ = the squared Euclidean distance between objects i and j, $z_{ik}$ = the value of variable k for object i, $z_{jk}$ = the value of variable k for object j, and m = the number of variables.
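The distance defined above sums, over the m variables, the squared difference between the values of two objects. A minimal sketch follows, with hypothetical function names and made-up values; it also builds the pairwise distance matrix from which a hierarchical clustering would start.

```python
def squared_euclidean(zi, zj):
    """Squared Euclidean distance between two objects described by m variables."""
    if len(zi) != len(zj):
        raise ValueError("both objects need the same number of variables")
    return sum((a - b) ** 2 for a, b in zip(zi, zj))

def distance_matrix(objects):
    """Symmetric matrix of squared Euclidean distances between all objects."""
    return [[squared_euclidean(p, q) for q in objects] for p in objects]

# Hypothetical standardized values for three locations and two variables.
locations = [[0.0, 0.0], [3.0, 4.0], [3.0, 0.0]]
distances = distance_matrix(locations)
```

Because variables measured in different units would otherwise dominate the sum unequally, the inputs are usually standardized first; whether the study did so is not stated in this section.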

Quality assessment
The pH of the spring water ranges between 5.3 and 8.3 (Figure 2a). Findings reveal that 13 ponds (Omiidu, Agboomi, Isemo, Alapoti, Arae, Agbo-Ilepa, Isun, Imurun, Omi-Alagoke, Omi-Olokungboye, Odewo and Adangbara in Ikare, and Ajagun in Ise) fall below the lower bound of the regulatory range of 6.5 to 8.5. The spring water temperature ranges between 19.6 and 29.2°C, with the lowest and highest values recorded at Gurusi in Ikare and Ajagun in Ise respectively (Figure 2b). DO in the springs ranges from 1.13 to 10.0 mg/l. DO is generally lower than the mean, except at Arae, Igbarake, Omi-Alagoke, Omi-Olokungboye, Asanmo, Ogbogi and Arae at Ikare, Ajagun at Ise, Otuu at Iboropa, and Otunadumi and Gonga-Obane at Akunnu, which are above the regulatory limits. The lowest and highest values were recorded at Otunshoe and Otunadumi at Akunnu respectively (Figure 2c). TSS in the springs ranges between 2.0 and 75.0 mg/l, with a mean value of 15.0 mg/l (Figure 2d).

Cl⁻ in all the sampled springs ranges between 8.0 and 144 mg/l (Figure 2k). Findings reveal that Cl⁻ in all sampled sources is generally low when compared with the WHO (2006) minimum limit of 200 mg/l. NO₃⁻ in the spring water ranges between 0.10 and 0.91 mg/l (Figure 2l). When compared with the WHO (2006) regulatory limits of 5 and 10 mg/l, NO₃⁻ in all springs is below the limits.
PO₄³⁻ in the sampled springs ranges between 0.80 and 8.20 mg/l (Figure 2m). Findings reveal that PO₄³⁻ in all sampled sources exceeds the regulatory limits of 0.30 and 0.05 mg/l for WHO (2006) and SON (2007) respectively. Oil and grease in all sampled springs ranges between 0.01 and 3.12 mg/l (Figure 2n). In comparison with the WHO limit of 0.1 mg/l, fifteen springs (Isemo, Ajagun, Ounra, Isun, Imurun, Omi-Alagoke, Omi-Olokungboye, Odewo, Oroki, Asanmo, Arae and Adangbara in Ikare, and Otunshoe, Otunadumi and Gonga-Obane in Akunnu) are above the limit, while the remaining sampled sources fall below it. TBC in the sampled springs ranges between 1.10 and 2.40 cfu/ml, with a mean value of 1.47 cfu/ml (Figure 2o). Spring water TC ranges between 0.00 and 3.50 cfu/ml, with a mean value of 0.50 cfu/ml (Figure 2p).
Zn detected in the sampled springs ranges between 0.01 and 0.66 mg/l (Figure 2q). The findings reveal that the Zn content of all sampled sources is generally low when compared with the WHO (2006) regulatory limit of 5.0 mg/l. Fe in all sampled springs ranges between 0.4 and 4.7 mg/l (Figure 2r). Fe detected in all sampled springs is higher than the WHO (2006) and SON (2007) regulatory limit of 0.3 mg/l. Mn was detected in 17 springs, with values ranging between 0.01 and 1.15 mg/l (Figure 2s). Findings reveal that Mn in 3 springs (Agboomi, Isemo, Omi-Olokungboye and Oroki at Ikare) is higher than the WHO (2006) limits of 0.05 and 0.1 mg/l and the SON limit of 0.2 mg/l. Three springs (Otuu at Iboropa and Gonga-Obane at Akunnu) are higher than the WHO (2006) limits of 0.05 and 0.1 mg/l. Cu in the sampled springs ranges between 0.0 and 2.10 mg/l (Figure 2t). From the findings, Cu in 7 springs (Gurusi, Arae, Agbo-Ilepa, Omi-Alagoke, Odewo and Ogbogi at Ikare, and Otunshoe at Akunnu) is above the WHO (2006) and SON (2007) regulatory limits of 1.0 and 0.1 mg/l respectively.
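The comparisons above follow one pattern: a reported range (minimum and maximum) is set against a regulatory limit. As a rough sketch, the helper below (a hypothetical function, not part of the study's method) reproduces that reasoning for the Zn, Fe and Cu ranges and the WHO (2006) limits quoted in the text.

```python
def compare_range_to_limit(lowest, highest, limit):
    """Describe how a reported range sits relative to an upper limit."""
    if lowest > limit:
        return "all above limit"
    if highest <= limit:
        return "none above limit"
    return "some above limit"

# (minimum, maximum, WHO 2006 limit) in mg/l, as quoted in the text.
reported = {
    "Zn": (0.01, 0.66, 5.0),
    "Fe": (0.4, 4.7, 0.3),
    "Cu": (0.0, 2.10, 1.0),
}

summary = {
    name: compare_range_to_limit(low, high, limit)
    for name, (low, high, limit) in reported.items()
}
```

A range check of this kind can only say whether none, some or all sources exceed a limit; identifying which springs exceed it requires the per-location values shown in Figure 2.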
In summary, this finding corroborates Mallo et al. (2001), who analyzed pH, temperature, hardness, Cl, Ca, nitrates and bacteriology and reported that surface waters in the Tandil region of Argentina are highly polluted when compared with other sources such as pipe-borne water systems, wells and springs. Drinking water with microbial pollution poses a major threat to human health.

Principal component and cluster analyses
The result of the principal components analysis in Table 1 shows that, of the 20 components, only 5 had extracted eigenvalues over 1. This is based on the Chatfield and Collin (1980) assumption that components with an eigenvalue of less than 1 should be eliminated. The 5 extracted components were subsequently rotated using varimax rotation in order to make interpretation easier and to clarify the fundamental significance of the extracted components to the water quality status of the selected springs. The rotation further revealed that the percentages of total variance of the 5 extracted components, when added, account for 97.52% (their cumulative variance) of the total variance of the observed variables. This indicates that the variance of the observed variables is accounted for by these 5 extracted components. The calculated component loadings, eigenvalues, total variance and cumulative variance are shown in Table 2, while the scree plot of the eigenvalues of the observed components is depicted in Figure 3.
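The retention rule described above (keep components whose eigenvalue exceeds 1) can be sketched as follows. The eigenvalues here are hypothetical values for a four-variable problem, not those of Table 1, and the function name is an assumption.

```python
def retain_components(eigenvalues, threshold=1.0):
    """Keep eigenvalues above the threshold and report the variance they explain.

    Assumes a correlation-matrix PCA, where the eigenvalues sum to the
    number of variables, so each eigenvalue's share is eigenvalue / total.
    """
    total = sum(eigenvalues)
    retained = [value for value in eigenvalues if value > threshold]
    explained_percent = 100.0 * sum(retained) / total
    return retained, explained_percent

# Hypothetical eigenvalues for 4 variables (they sum to 4).
hypothetical_eigenvalues = [2.1, 1.2, 0.5, 0.2]
kept, explained = retain_components(hypothetical_eigenvalues)
```

In this made-up case two components pass the rule and together explain a little over four fifths of the total variance; in the study, five components pass and explain 97.52%.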

Ayeni and Soneye 135
Based on the component loadings, the variables are grouped with their designated components as follows:
-Component 1: TH, Mg²⁺, TDS, Cl⁻ and Ca²⁺
-Component 2: Zn, turbidity, DO, pH, TSS and PO₄³⁻
-Component 3: Cu, temperature, TC, TBC and NO₃⁻
-Component 4: Oil and grease and Fe
-Component 5: Mn and BOD
Components 1, 2, 3, 4 and 5 explained 36.996, 22.058, 16.674, 12.908 and 8.882% of the variance respectively. Following the classification of component loadings by Liu et al. (2003), absolute loading values greater than 0.75 signify "strong" loading, values between 0.75 and 0.50 indicate "moderate" loading, and values between 0.50 and 0.30 denote "weak" loading. Using this classification, all variables in components 1 and 2 had strong positive loadings, except PO₄³⁻, which had a moderate positive loading. Of the 5 variables in component 3, two had strong positive loadings (temperature and TC) and NO₃⁻ had a moderate positive loading, while Cu and TBC had strong and moderate loadings respectively. All variables in components 4 and 5 had strong positive loadings.
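Two pieces of arithmetic in this paragraph can be checked with a short sketch: the cumulative variance obtained by adding the five reported percentages, and the strong/moderate/weak labelling of a loading. The function name, the "negligible" label for values under 0.30 and the handling of values exactly on a boundary are assumptions, since the text does not specify them.

```python
# Percent of variance explained by components 1 to 5, as reported in the text.
reported_variance = [36.996, 22.058, 16.674, 12.908, 8.882]
cumulative_variance = round(sum(reported_variance), 2)

def classify_loading(loading):
    """Label a loading by its absolute value, after Liu et al. (2003).

    Boundary handling and the "negligible" label are illustrative assumptions.
    """
    size = abs(loading)
    if size > 0.75:
        return "strong"
    if size >= 0.50:
        return "moderate"
    if size >= 0.30:
        return "weak"
    return "negligible"

# Hypothetical loadings, not values from Table 2.
labels = [classify_loading(value) for value in (0.91, -0.62, 0.35, 0.10)]
```

The sign of a loading is kept separate from its strength: a loading of -0.62 is moderate in size and negative in direction.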
The 5 rotated principal components are interpreted by examining the component loadings and noting their relationship to the original variables. Component 1 gives information about the variations in TH, Mg²⁺, TDS, Cl⁻ and Ca²⁺. In this component, the loadings indicate that organic matter and organic acids, which could be attributed to various anthropogenic activities and to the geological formation and/or composition of the area, greatly influenced the quality of the selected springs. The same interpretation applies to component 2, although its eigenvalue and share of total variance are considerably lower than those of component 1.
Component 3 gives information about Cu, temperature, TC, TBC and NO₃⁻. This component represents pollution from domestic and agricultural waste as well as the geological composition of the area. However, the significance of NO₃⁻ in component 3 indicates that nitrification takes place in the vicinity of the springs. In component 4, dissolved or emulsified oil and grease extracted from water, especially unsaturated fats and fatty acids, and Fe derived from the parent rock of the area are the significant variables. In component 5, the presence of Mn is an indication of parent rock influence, while BOD indicates a high level of organic pollution, usually caused by poorly treated waste water.
The dendrogram of the observed locations dataset was generated using the Euclidean distance of HCA for the CA result (Figure 4 and Table 3). Based on Euclidean distance, two major clustering groups (cluster 1 and cluster 2) were observed. Cluster 1, characterized by low Euclidean distance, corresponds to 23 locations (6, 9, 7, 3, 13, 19, 17, 22, 10, 30, 4, 5, 2, 12, 23, 27, 18, 26, 29, 21, 25, 28 and 24). Cluster 2, which has high Euclidean distance, corresponds to 7 locations (15, 16, 1, 20, 8, 11 and 14). Subgroup clusters were also identified within the major cluster 1, and these vary with significant Euclidean distance. The dendrogram identifies cluster 1 as the abnormal observation, with high variation in the concentrations of the surface water quality parameters compared with the cluster 2 surface water samples. The variation in cluster 1 might be due to low-polluted effluents from non-point sources (agricultural and urban activities). Cluster 2 shows high pollution from the agricultural area that encompasses the springs.
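The grouping of locations into two clusters by Euclidean distance can be sketched with a small agglomerative procedure. This section does not state which linkage rule was used, so single linkage is assumed here purely for illustration; the function names and the five made-up points are likewise hypothetical and do not represent the 30 study locations.

```python
def squared_euclidean(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def agglomerate(points, n_clusters):
    """Merge the two closest clusters repeatedly until n_clusters remain.

    Single linkage (distance between the closest members) is an assumption.
    Returns clusters as sorted lists of point indices.
    """
    if not 1 <= n_clusters <= len(points):
        raise ValueError("n_clusters must be between 1 and the number of points")
    clusters = [[index] for index in range(len(points))]
    while len(clusters) > n_clusters:
        best = None  # (distance, first cluster position, second cluster position)
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                distance = min(
                    squared_euclidean(points[i], points[j])
                    for i in clusters[a]
                    for j in clusters[b]
                )
                if best is None or distance < best[0]:
                    best = (distance, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return [sorted(cluster) for cluster in clusters]

# Hypothetical standardized values for five locations and two variables.
sample_points = [(0.0, 0.0), (0.1, 0.0), (0.2, 0.1), (5.0, 5.0), (5.1, 5.0)]
two_groups = agglomerate(sample_points, 2)
```

The sequence of merges and the distances at which they occur is the information a dendrogram displays; cutting the tree at two groups corresponds to stopping the loop when two clusters remain.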

Conclusion
This study demonstrates the usefulness of multivariate statistical techniques for large and complex datasets in obtaining better information and interpretation concerning surface water quality. Principal component analysis helped in identifying the factors responsible for surface water quality variations in the 6 selected communities. The result revealed that the percentages of total variance of the 5 extracted components, when added, account for 97.52% (their cumulative variance) of the total variance of the observed variables. The variation in the loadings of components 1 and 2 indicates that organic matter and organic acids could greatly influence the quality of the selected springs. Component 3 is ascribed mainly to domestic and agricultural waste in the springs' environment, while components 4 and 5 are respectively attributed to dissolved/emulsified oil and grease and to poorly treated waste water. On the other hand, the result of cluster analysis revealed 2 major clustering groups resulting from the influences of agricultural and urban activities around the samples'

Figure 1. Selected spring locations in Akoko Northeast LGA of Ondo State, SW Nigeria.

Figure 3. The scree plot of the eigenvalues.
the northern senatorial part of Ondo State, SW Nigeria (Figure 1). It is bounded

Table 3. Pond names and their cluster membership.
Cluster 1, characterized by low Euclidean distance, corresponds to 23 locations and contains subgroups that vary with significant Euclidean distance, while cluster 2 corresponds to 7 locations with high Euclidean distance and subgroups of insignificant Euclidean distance. Therefore, it is worthwhile to conclude that PCA and CA are useful tools for better understanding the concealed information about parameter variance and the discrete information in datasets in water quality assessment studies.