ABSTRACT
The primary objective of this article is to investigate how survival modelling can be used in traffic accident analysis to explain driver accident risk factors. Accident records of 398 drivers from 2007 to 2009 were obtained from the Motor Traffic and Transport Department (MTTD) of the Ghana Police Service, Northern Region. The Cox proportional hazards regression model was employed for the analysis using the SAS package. The conclusion was that survival modelling promises to be a useful tool for road safety analysis and that the most significant variables for accident risk were driver characteristics (age, gender, experience), driver behaviour in traffic (speed, use of alcohol, use of safety belt), the nature of exposure (annual kilometreage, road surface condition) and vehicle characteristics (vehicle age, weight, tyre condition). Implementation of the findings of this study will enable policy makers to put in place better measures to reduce accident occurrence in the region in particular and in the country as a whole.
Key words: Cox proportional model, accident risks, annual vehicle kilometreage, survival modelling, SAS package.
Road transport is a predominant means of commuting in Ghana, accounting for a large share of passenger travel and the carting of goods in the country. Road transportation facilitates the movement of people, goods and services in all sectors of the economy, including tourism, mining, trade, health, education and agriculture, among others. At the same time, road crashes have become a major national issue, receiving front-page coverage in the press and on national TV news on a regular basis. Drivers are faced with risky situations and potential accidents every time they are on the road. Countermeasures are taken by society to prevent accidents or moderate their consequences (Hakkanen and Summala, 2001). Accidents happen when road users cannot adapt their actions to the varying demands of the traffic environment. Consequently, the risk of accident can be lowered by improving road users’ performance in traffic or by reducing system demands on road users (Elvik, 1996). Many factors affecting driver accident involvement are reported in the literature. Factors may be considered accident causes if they either increase or decrease the probability of accident occurrence. Therefore, to prevent accidents, one must know which of the numerous traffic risk factors have a strong influence on the number and probability of accidents. At any given time, driver accident risk is affected by personal risk factors, vehicle risk factors, environmental factors, and risks created by other drivers and traffic (Elvik, 1996).
Past research (ChiengMeng et al., 2016; Elvik et al., 2004; Sullman et al., 2002; Häkkänen and Summala, 2000; Dagan et al., 2006; Taylor and Dorn, 2006; Rodríguez et al., 2003; Jovanis et al., 1991) has determined that speeding, age, experience, sleep quality, driving mileage, vehicle weight, limited stopping distances, substantial traffic volume, the habits of drivers and time of day have some relationship with the occurrence of road accidents. Predictive accident models have been developed by various authors around the world. For example, Calliendo et al. (2013) studied a crash prediction model for road tunnels. They used bivariate negative binomial regression, jointly applied to non-severe and severe crashes, to model the frequency of accident occurrence. The regression parameters were estimated using the maximum likelihood method. Oppe (1989) used multiple linear regression models, in which the dependent variable (either number of accidents or accident rate) is a function of a series of independent variables such as speed or traffic volume. These models assume the occurrence of accidents to be normally distributed; they therefore lack the distributional property necessary to describe adequately the random and discrete vehicle accident events on the road, and hence are inappropriate for making probabilistic statements about accident occurrence.
Saccomanno and Buyco (1988) and Blower et al. (1993) used a Poisson log-linear regression model to explain variations in accident rates. This regression model is especially suitable for handling data with large numbers of zero counts; however, it may not be appropriate for road accident counts, since it fails to account for extra-Poisson variation (the variance can exceed the mean) in the observed accident counts. To solve this problem of extra-Poisson variation, several authors such as Miaou (1994) developed two types of negative binomial models, one using a maximum likelihood method and one using a method of moments. The maximum likelihood model was found to be more reliable than the Poisson regression model in predicting accidents where overdispersion is present. In 1949, R. J. Smeed also developed a log-linear regression model and found an inverse relationship between the traffic risk (fatalities per motor vehicle) and the level of motorisation (number of vehicles per inhabitant). This means that with annually increasing traffic volume, fatalities per vehicle decrease. Smeed concluded that fatalities (F) in any country in a given year are related to the number of registered vehicles (V) and the population (P) of that country by the following equation:
$$\frac{F}{V} = \alpha \left(\frac{V}{P}\right)^{-\beta}$$
where F = number of fatalities in road accidents in the country, V = number of vehicles in the country, P = population of the country, α = 0.003 and β = 2/3. This formula became popular and has been used in many studies. It is often called Smeed's formula.
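The inverse relationship described by Smeed can be explored numerically. The following is a minimal sketch, assuming the per-vehicle form F/V = α(V/P)^(−β) with β = 2/3 as quoted above; the function names and the vehicle and population figures are hypothetical and serve only as illustration. The ratio computed at the end does not depend on α.

```python
def smeed_fatalities(vehicles, population, alpha, beta=2 / 3):
    """Fatalities F under the assumed form F/V = alpha * (V/P) ** (-beta)."""
    motorisation = vehicles / population              # vehicles per inhabitant
    fatalities_per_vehicle = alpha * motorisation ** (-beta)
    return vehicles * fatalities_per_vehicle


def per_vehicle_ratio(motorisation_factor, beta=2 / 3):
    """Factor by which F/V changes when V/P is multiplied by motorisation_factor."""
    return motorisation_factor ** (-beta)


# Hypothetical scenario: motorisation doubles. Fatalities per vehicle are then
# scaled by 2 ** (-2/3), a reduction of roughly 37%, whatever the value of alpha.
print(round(per_vehicle_ratio(2.0), 3))
```

Under this form, total fatalities can still rise as the vehicle fleet grows, even though fatalities per vehicle fall.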
It was generally observed in the literature that the development of accident prediction models has largely focused on parametric modeling (Collett, 2003), where the functional form of the model is completely specified. These models are appropriate only if we are sure that the model is correctly specified. However, if we are not completely certain, as is typically the case, then the semiparametric survival modeling approach proposed by Cox (1972) is more appropriate. It is a “robust” model in the sense that it provides results that closely approximate those of the true parametric model, and therefore the user does not need to worry about whether a wrong parametric model has been chosen. The objectives of this study, therefore, are to investigate how the principles of survival modeling can be used in modeling accident occurrence and to identify the factors that influence the accident risks of drivers. The survival modeling approach (Klembaum, 1996) has not been widely adopted by researchers in the area of accident data analysis. Survival modeling is commonly applied in medicine to the study of serious diseases and treatment methods. This study will assess the development needs of survival models in the area of traffic accident analysis, and the findings can serve as a basis for health care professionals and policy makers to create preventive measures for traffic accidents.
Modeling approach
In order to achieve the set objectives of this research, we formulated the following two specific research questions. Can driver involvement in road traffic accidents be examined with survival models? Do driver characteristics and behaviour (such as driver’s sex, age, experience, use of belt, use of alcohol, route familiarity, speeding), vehicle characteristics (such as vehicle’s age, weight, tyre tread, ownership, annual kilometreage) and traffic environment characteristics (such as accident scene, road surface condition, other traffic demands) contribute to accident risk? These questions will be answered using the developed survival models and the estimated parameters. The general procedure for modeling the accident data is summarized in the following seven steps:
Step 1: MTTD Data collection and processing.
Step 2: Study the variables in the data and their categorization/codes.
Step 3: Preliminary analysis of the data is performed using the Kaplan-Meier estimate of survival curves and the log-rank test (Kaplan and Meier, 1958). This univariate analysis was performed to ascertain the significance of the variables under study and to use it as a basis for including or excluding the covariates in the final model.
Step 4: Fit the Cox model using the significant covariates suggested in step 3, together with other relevant variables, by Maximum Partial Likelihood Estimation (Cox, 1975).
Step 5: Refit the model with only the significant variables obtained in step 4 to obtain the final model.
Step 6: The final model is then evaluated to ascertain the goodness of the fit of the model.
Step 7: If the model fits the data well, the final model will be considered the complete model for the accident data, otherwise, we return to step 4 to refit the model using transformed values of the variables.
The flowchart of the modeling procedure is illustrated in the following Figure 1.
Source of data
Three years of data containing detailed information on 398 accidents that involved drivers for the period 2007 to 2009 were taken from the Motor Traffic and Transport Department (MTTD) of the Ghana Police Service, Northern Regional Office, Tamale. The subsequent definitions and sentences in the “Methods” are mainly a summary based on textbooks by Kalbfleisch and Prentice (2002), Klein and Moeschberger (1997), Allison (1995), Klembaum (1996) and Lawless (1982).
Survival time distribution
Survival analysis can simply be defined as time-to-event analysis (Klembaum, 1996); for example, time to death from a disease such as cancer. Survival data can be generated by observing a set of individuals from some well-defined point in time and following them for some substantial period, recording the times at which the events of interest occur and possibly some covariates associated with each individual on which the risk of the event may depend. An important issue in survival research is how to deal with individuals whose survival cannot be followed during the entire research period. Such individuals are called censored individuals. There are generally three reasons why censoring may occur: a person does not experience the event before the study ends, a person is lost to follow-up during the study period, or a person withdraws from the study for some reason other than the event of interest. Censoring can happen in the following three ways:
Type I: the duration of the study is fixed to a chosen period. The individuals are monitored from a set starting point and individuals who are lost to the monitoring, or are withdrawn from the study or do not experience the event at the end of the study period, are censored observations.
Type II: The length of the monitoring period depends on the desired number or proportion of uncensored observations. The length of the study period is the same as the survival of the individual with the longest life span. Individuals, who are removed from the study for various reasons or survive less than the monitoring period, are censored observations.
Type III: The duration of monitoring is fixed. However, individuals may enter the study at different starting points. Censored observations are the ones whose survival period continues after the overall monitoring period has ended.
Survival studies can be divided into the following two groups:
i) Monitoring censored to the right: the investigation begins at a certain selected moment when the individuals entering the examination are exposed to the phenomenon under investigation, e.g. a medicine or treatment, and continues from that moment on for a certain length of time.
ii) Monitoring censored to the left: the investigation begins at a certain selected moment, but includes individuals whose exposure to the phenomenon under investigation began before the examination period started (as in the present study).
Let the Greek letter delta (δ) denote a {0, 1} random variable indicating either failure or censorship. That is, δ = 1 for failure if the event occurs during the study period, or δ = 0 if the survival time is censored by the end of the study period. The survival time, T, can either be assumed to follow a certain distribution or be described directly from the observed data. The most commonly used survival distributions are the negative exponential distribution, the Weibull distribution, the Gumbel distribution, the log-normal distribution, or their combinations. The type of distribution that best describes the survival times depends mainly on the data. If T represents a continuous survival time, then its distribution is characterized by three functions: the survival function, S(t) = P(T > t), which gives the probability that a person survives beyond time t; the probability density function, f(t) = -dS(t)/dt; and the hazard function, h(t) = f(t)/S(t), which gives the instantaneous rate of the event at time t given survival up to that time.
If one of these functions is known, the other two can be determined.
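The link between the three functions can be checked numerically. The sketch below assumes a Weibull survival time, one of the distributions listed above, with hypothetical shape and scale values that are not estimated from the MTTD data. It verifies the standard identities h(t) = f(t)/S(t) and f(t) = -dS(t)/dt.

```python
import math


def weibull_survival(t, shape, scale):
    """S(t) = exp(-(t / scale) ** shape)."""
    return math.exp(-((t / scale) ** shape))


def weibull_hazard(t, shape, scale):
    """h(t) = (shape / scale) * (t / scale) ** (shape - 1)."""
    return (shape / scale) * (t / scale) ** (shape - 1)


def weibull_density(t, shape, scale):
    """f(t) = h(t) * S(t)."""
    return weibull_hazard(t, shape, scale) * weibull_survival(t, shape, scale)


SHAPE, SCALE = 1.5, 200.0  # hypothetical parameters, for illustration only

for t in (30.0, 120.0, 244.0):
    s = weibull_survival(t, SHAPE, SCALE)
    f = weibull_density(t, SHAPE, SCALE)
    h = weibull_hazard(t, SHAPE, SCALE)
    # Numerical derivative of S(t) as an independent check of f(t) = -dS/dt
    eps = 1e-5
    minus_ds_dt = -(weibull_survival(t + eps, SHAPE, SCALE)
                    - weibull_survival(t - eps, SHAPE, SCALE)) / (2 * eps)
    print(t, round(s, 4), abs(f / s - h) < 1e-12, abs(minus_ds_dt - f) < 1e-6)
```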
The Cox Proportional Hazards model and its characteristics
The survival model used is the Cox proportional hazards (PH) regression model (Cox, 1972), which is applied to examine drivers’ accident risks and their dependence on characteristics connected with drivers and vehicles, as well as the prevailing roadway conditions. The Cox PH model is usually written in terms of the hazard model formula
$$h(t, X) = h_{0}(t)\,\exp\left(\sum_{i=1}^{p} \beta_{i} X_{i}\right)$$
where h(t,X) is the hazard function, that is, the hazard at time t for an individual with a given specification of a set of explanatory variables X = (X_{1}, ..., X_{p}), which are assumed to be time-independent, and h_{0}(t) is the baseline hazard function, which is the hazard function for an individual prior to considering any of the X’s (it represents the nonparametric part of the model) and can be thought of as the intercept in multiple regression. The exponent \sum_{i} \beta_{i} X_{i} is a linear function formed by the variables and their parameters, representing the parametric part of the model. It is this combination that makes the Cox regression model a semiparametric model. The measure of effect is called the hazard ratio. The hazard ratio (HR) of two individuals with different covariates X^{*} and X is
$$HR = \frac{h(t, X^{*})}{h(t, X)} = \exp\left(\sum_{i=1}^{p} \beta_{i}\,(X^{*}_{i} - X_{i})\right)$$
This hazard ratio depends only on the predictor variables and not on time; hence it is time-independent, which is also why the Cox regression is called the proportional hazards model. For indicator (dummy) variables with values 1 and 0, one can interpret the hazard/risk ratio (e^{\beta_{i}}) as the ratio of the estimated hazard for those with a value of 1 to the estimated hazard for those with a value of 0 (controlling for other covariates). However, for quantitative covariates, a more helpful statistic is obtained by subtracting 1.0 from the hazard ratio and multiplying by 100. This gives the estimated percentage change in the hazard for each one-unit increase in the covariate, holding other covariates constant (Allison, 1995). The corresponding survival function for the Cox proportional hazards regression model is
$$S(t, X) = \left[S_{0}(t)\right]^{\exp\left(\sum_{i=1}^{p} \beta_{i} X_{i}\right)}$$
where S_{0}(t) is the baseline survival function.
The parameters of the Cox model are usually estimated with the “partial” likelihood function rather than the “complete” likelihood function. The likelihood is called partial because its formula considers probabilities only for those individuals who experienced the event, and does not explicitly consider probabilities for those individuals who are censored. The partial likelihood function for the Cox PH model (Cox, 1975) is
$$L(\beta) = \prod_{j:\,\delta_{j} = 1} \frac{\exp\left(\sum_{i=1}^{p} \beta_{i} X_{ji}\right)}{\sum_{l \in R(t_{j})} \exp\left(\sum_{i=1}^{p} \beta_{i} X_{li}\right)}$$
where X_{ji} is the value of covariate i for individual j and R(t_{j}) is the risk set, that is, the set of individuals still under observation at time t_{j}.
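To make the estimation criterion concrete, the following is a minimal sketch of the Cox log partial likelihood in its standard form: each uncensored individual contributes its linear predictor minus the logarithm of the summed exp(linear predictor) over everyone still at risk at that time. Tied event times are handled in the Breslow manner, which is an assumption of this sketch. The function name and the toy data are hypothetical; this is not the SAS routine used in the study.

```python
import math


def cox_log_partial_likelihood(beta, times, events, covariates):
    """Log partial likelihood of a Cox PH model (Breslow handling of ties).

    beta       : list of coefficients, one per covariate
    times      : observed time for each individual
    events     : 1 if the event was observed, 0 if censored
    covariates : list of covariate lists, one per individual
    """
    def linear_predictor(x):
        return sum(b * v for b, v in zip(beta, x))

    total = 0.0
    for j, (t_j, d_j) in enumerate(zip(times, events)):
        if d_j != 1:
            continue  # censored individuals enter only through risk sets
        # Risk set: everyone whose observed time is at least t_j
        risk_sum = sum(
            math.exp(linear_predictor(covariates[l]))
            for l, t_l in enumerate(times)
            if t_l >= t_j
        )
        total += linear_predictor(covariates[j]) - math.log(risk_sum)
    return total


# Hypothetical toy data: three drivers, one binary covariate
times = [50, 120, 244]
events = [1, 0, 1]
covariates = [[1], [0], [0]]
print(cox_log_partial_likelihood([0.0], times, events, covariates))
```

Maximising this function over beta, for example by Newton-Raphson iteration, yields the maximum partial likelihood estimates.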
Proportional Hazards (PH) assumption checking
The goal of statistical model development is to obtain the model which best describes the data. That is to say, the fitted model must provide an adequate summary of the data upon which it is based. Therefore, a complete and thorough examination of the model’s fit and adherence to the model’s assumptions is of great importance. The Cox PH model assumes that the hazard of one individual is proportional to the hazard of any other individual, where the proportionality constant is independent of time. This means that the ratio of the accident risks of two drivers is the same no matter how long they have been driving. This requires that the covariates not be time-dependent. If any of the covariates varies with time, the proportional hazards assumption is violated. There are several methods for verifying that a model satisfies the assumption of proportionality: the graphical method, the method of adding time-dependent covariates to the Cox model, and tests based on the Schoenfeld residuals (Schoenfeld, 1982).
In this study, the method of adding time-dependent covariates (Crowley and Hu, 1977) to the Cox model was employed. This is done by including time-covariate interaction terms in the model and testing whether the coefficient of each interaction is significantly different from zero. If a time-dependent covariate is significant, this indicates a violation of the proportionality assumption for that specific predictor. In this analysis, interactions with log(time) were used because this is the most common function of time used in time-dependent covariates, but any function of time could be used.
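Written out, the check amounts to fitting an extended model of the following form. This is a sketch under the assumption that each covariate X_{i} is interacted with log t; the symbol \gamma_{i} for the interaction coefficients is introduced here for illustration and does not appear in the original tables.

```latex
h(t, X) = h_{0}(t)\,\exp\!\left(\sum_{i=1}^{p} \beta_{i} X_{i}
          + \sum_{i=1}^{p} \gamma_{i}\, X_{i} \log t\right),
\qquad H_{0}:\ \gamma_{i} = 0 .
```

Under this extended model, the hazard ratio for a one-unit difference in X_{i} is \exp(\beta_{i} + \gamma_{i} \log t), which is free of t only when \gamma_{i} = 0. Failing to reject H_{0} is therefore consistent with proportional hazards for that covariate.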
Cox Proportional Hazards model diagnostics
Several methods can be used to check the adequacy of a Cox PH model: the Cox-Snell residual method (Cox and Snell, 1968), the deviance residual method (Thernaeau et al., 1990), the Schoenfeld residual method (Schoenfeld, 1982) and diagnostics for influential observations (Cain and Lange, 1984). In this study, the diagnostic for influential observations (Cain and Lange, 1984) was employed. This method is used to identify which observations, if any, exert an undue influence on the estimates of the parameters and hence on the fit of the model. According to Cain and Lange, an observation is said to be influential if removing it substantially changes the estimates of the coefficients. The delta-beta (DfBeta) statistic is what is considered in this research; it tells one how much each coefficient will change on removal of a single observation. Therefore, we can check whether there are influential observations for any particular explanatory variable. The signs of the DfBeta statistics are the reverse of what one might expect: a negative sign means that the coefficient increases when the observation is removed.
Fitting Proportional Hazards Model for the MTTD data
The accident data contain, by definition, only drivers involved in accidents; this means that they have all experienced the event of interest. In order to analyse the data using survival analysis, it was assumed that all the drivers entered the study on the first day of the year, and the survival, or accident, time was defined as the number of days counted from 1st January of the year to the day the accident occurred. Drivers who were involved in accidents within the first 244 days (that is, the first eight months) of each year were considered “uncensored drivers” and those involved after the 244 days were considered “censored drivers”. Since this approach makes it possible to classify each observation as censored or uncensored, the MTTD accident data can be analyzed using the survival approach. This description is depicted schematically in Figure 2 (Allison, 1995). The horizontal axis represents survival time (in days). Each of the horizontal lines labeled A through F represents a single driver. An x indicates that the accident occurred at that point in time. The vertical line at 244 is the point at which we stop following the drivers. Accidents that occurred at 244 days or earlier were considered uncensored and those that occurred after 244 days were considered censored. Therefore, drivers A, C, D and E have uncensored accident times while drivers B and F have censored accident times.
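The censoring rule described above can be sketched in a few lines. Two details are assumptions of this sketch rather than statements from the source: 1 January is counted as day 1, and a driver whose accident falls after the cutoff is recorded with a time equal to the cutoff. The function name and the example dates are hypothetical.

```python
from datetime import date

CUTOFF_DAYS = 244  # end of follow-up used in the text (first eight months)


def accident_record(accident_date, cutoff=CUTOFF_DAYS):
    """Return (time_in_days, delta) for one driver.

    delta = 1 marks an uncensored accident time, delta = 0 a censored one.
    Assumptions: 1 January is day 1; censored drivers are recorded at the cutoff.
    """
    day_of_year = accident_date.timetuple().tm_yday
    if day_of_year <= cutoff:
        return day_of_year, 1   # accident observed within follow-up
    return cutoff, 0            # accident after follow-up ended


print(accident_record(date(2007, 3, 1)))    # accident early in the year
print(accident_record(date(2008, 11, 15)))  # accident late in the year
```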
Preliminary analysis
In any data analysis, it is always advisable to do some univariate analysis before proceeding to more complicated models. In survival analysis, it is highly recommended to look at the Kaplan-Meier (KM) curves and log-rank tests (Kaplan and Meier, 1958) for all the categorical predictors. The KM curves provide insight into the shape of the survival functions for the categories of each variable, showing pictorially the survival experience of the categories as well as their proportionality (that is, whether the survival functions are approximately parallel). The test of equality across strata (categories), called the log-rank test, was employed in the preliminary analysis to explore whether or not to include a predictor in the final model. A variable was considered for inclusion in the final model if it was significant, that is, if the log-rank test had a p-value of 0.05 or less. If a predictor is not significant in a univariate analysis, it is highly unlikely that it will contribute anything to the model when it is included in the final model with other predictors.
Detailed information on the list of variables, the variables’ categorizations/codes, the reference categories of the variables, the total frequencies of accidents under each variable, the number uncensored (that is, the number of drivers involved in an accident within the first 8 months) and the log-rank tests of equality across strata (categories) for the MTTD data is provided in Table 1. The Kaplan-Meier survival curves for the variables found significant in Table 1 are shown in Figures 3 to 9, in order to compare the survival experience of the different levels of each variable.
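For readers unfamiliar with the Kaplan-Meier estimate, the following is a minimal sketch of the product-limit calculation: at each distinct event time the running survival probability is multiplied by (1 - d/n), where d is the number of events at that time and n is the number still at risk. The function name and the small data set are hypothetical and are not taken from Table 1.

```python
def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct event time.

    times  : observed times (event or censoring), e.g. in days
    events : 1 if the event was observed at that time, 0 if censored
    """
    survival = 1.0
    curve = []
    event_times = sorted({t for t, d in zip(times, events) if d == 1})
    for t in event_times:
        at_risk = sum(1 for u in times if u >= t)
        n_events = sum(1 for u, d in zip(times, events) if u == t and d == 1)
        survival *= 1.0 - n_events / at_risk
        curve.append((t, survival))
    return curve


# Hypothetical accident times (days) for one category of a predictor
times = [40, 95, 95, 180, 244, 244]
events = [1, 1, 0, 1, 0, 0]
for t, s in kaplan_meier(times, events):
    print(t, round(s, 3))
```

Computing one such curve per category and plotting them together gives the kind of comparison shown in the survival figures; the log-rank test then assesses whether the curves differ by more than chance.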
Fitted Cox Proportional Hazard Models
The Cox regression model (Cox, 1975) was first fitted to obtain model 1A, displayed in Table 2, using the most important variables from the point of view of the Kaplan-Meier test as well as other variables of interest. The model was then rerun with the significant variables in model 1A to arrive at the final model 1B. This final model 1B was then evaluated by checking the proportionality assumption (Table 5) and influential observations (Table 7) in the dataset.
Interpretation of the above outputs
Table 1 presents detailed information on the Kaplan-Meier estimates for all the variables under study, the categories of each variable and their associated accident frequencies and reference levels. It can be seen that the variables that significantly contribute to accident time or probability according to the Kaplan-Meier estimate include driver age, sex, use of safety belt, use of alcohol, speed, age of driving license and age of vehicle, since the log-rank test of equality across levels for each of them resulted in a p-value less than 0.05. The Kaplan-Meier survival curves comparing the survival experience of the levels of these significant variables are shown in Figures 3 through 9. These significant variables, together with the additional relevant variables annual vehicle kilometreage, tyre condition, weight of vehicle and route familiarity, were used to estimate Cox regression model 1A (Table 2). The model provides the coefficient estimates and their associated standard errors and significance. Note that there is no intercept estimate, a characteristic feature of partial likelihood estimation (Allison, 1995). The hazard ratios along with their 95% confidence intervals are also shown in the last three columns.
The variable driver_agp had three levels, so the estimated coefficients for the last two levels in model 1A are to be compared with the omitted first level, which is the reference level. The same explanation applies to the variable annukil in the same model. However, a more useful result is the global test reported for both variables at the bottom of model 1A (Table 3). It can be seen that both driver_agp and annukil are significant at the 0.05 significance level. A variable is considered significant if its p-value is less than or equal to 0.05. This means that the significant variables in model 1A include use of safety belt, use of alcohol, speed, tyre condition, age of driver and annual vehicle kilometreage. The variables that were not significant, and therefore did not qualify for inclusion in the final model, include sex, vehicle age, age of driving license, weight of vehicle and route familiarity. However, the predictor sex is considered a very important variable to have in the final model; it was therefore added to the significant variables and the model was refitted to obtain the final model 1B (Table 4).
Hazard ratios/Relative risks to drivers
Driver age
From these models (Tables 2 and 4), it can be seen that driver age proved to be a major significant accident risk factor. In model 1A, which indicates the individual contribution of each age group, drivers aged 26 to 50 years had 1.854 times (95% confidence interval: 1.136 to 3.027) the accident risk of drivers aged 25 or younger, and those older than 50 years had 2.170 times the risk of drivers aged 25 or younger. This conclusion is confirmed in Figure 3, which shows the survival function for each age group. It can be seen that drivers aged between 26 and 50 had the highest share of accidents, followed by those above 50 years and then those aged 25 or younger. In general, the pattern of one survivorship function lying above another means that the group defined by the upper curve lived longer, or had a more favourable survival experience, than the group defined by the lower curve. This may reflect lack of experience (and perhaps a riskier driving style) among young drivers, whereas for old drivers it can be attributed to reduced capabilities. Traffic conditions set greater demands on all drivers; the young and the very old may have a harder time meeting these greater demands. Also, the final model 1B gave the overall contribution of driver age (p = 0.0146) with a hazard ratio of 1.473, which means that for each one-unit increase in the driver age variable, the hazard of accident increases by 47.3%.
Driver sex
According to the Kaplan-Meier estimate, sex had a strong effect on accident time. However, sex had only a moderate influence in model 1A, with a p-value of 0.0807. Furthermore, in the final model 1B, where the sex variable was deliberately retained, it yielded a p-value of 0.0408 and a hazard ratio of 1.832, which means that the risk of female drivers was 1.832 times that of male drivers. From the survival distribution of driver sex shown in Figure 4, one can say that at any point in time the proportion of drivers estimated to be alive (not involved in an accident) is greater for males (represented by the upper curve) than for females (represented by the lower curve). Generally, in the MTTD data, female drivers drove fewer kilometres, were less often under the influence of alcohol, used safety belts more often than male drivers, had fewer accidents and committed fewer offences. Previous studies demonstrated that differences in risk between the sexes could be explained by differences in mobility. Therefore, the role of sex as a risk factor is less conclusive, but it may remain a practically useful explanatory variable.
Driver’s use of alcohol
Driving under the influence of alcohol significantly increased drivers’ accident risk. According to the final model 1B, alcohol use was a significant variable (p-value 0.0001), and the risk of drivers under the influence of alcohol was 2.5031 times (95% confidence interval: 1.721 to 3.640) that of non-alcohol users. From the survival distribution curve of use of alcohol in Figure 6, it can be seen that the survivorship function for non-alcohol users lies above the survivorship function for drivers who used alcohol, which means that non-alcohol users lived longer, or had a more favourable survival experience, than the alcohol users.
Use of safety belt
Use of safety belt had very strong explanatory power in model 1B, with a p-value of 0.0004 and a hazard ratio of 0.451, indicating that if a driver changes from not using a belt to using a belt, while holding other covariates constant, the hazard of accident decreases by (100% - 45.1%) = 54.9%. The use of a safety belt, which was only supposed to influence the seriousness of injuries, also proved to be a strong risk factor in the models. From the survival graph of use of safety belt in Figure 5, it can be seen that drivers who used safety belts had a better survival experience than those who did not. Aside from its significant effect on accident time, non-use of a safety belt also increases the severity and consequences of an accident.
Speed of vehicle
Speed proved to be a statistically significant variable in predicting the hazard of accidents according to the Kaplan-Meier estimate and the fitted Cox regression models. According to model 1B, the hazard of accidents for drivers who drove over 80 km/h is 3.664 times (95% confidence interval: 2.371 to 5.661) that of those who drove less than 80 km/h. Also, from the survival distribution curve of estimated speed of vehicle at the time of accident in Figure 7, it can be seen that drivers whose speeds exceeded 80 km/h had a poorer survival experience than those who did not.
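The confidence intervals quoted for the hazard ratios can be sanity-checked. The sketch below assumes they are Wald intervals built on the log scale, exp(β ± 1.96·SE), which is the usual construction but is not stated explicitly in the source. Under that assumption the square root of the product of the two limits should reproduce the hazard ratio. The standard errors themselves are not quoted in the text, so only this consistency check is shown; the function names are hypothetical.

```python
import math


def hazard_ratio_ci(beta, se, z=1.96):
    """Hazard ratio and Wald limits from a coefficient and its standard error."""
    return math.exp(beta), math.exp(beta - z * se), math.exp(beta + z * se)


def geometric_midpoint(lower, upper):
    """For a log-scale Wald interval this equals the hazard ratio itself."""
    return math.sqrt(lower * upper)


# (hazard ratio, lower limit, upper limit) as quoted in the text
reported = {
    "age 26-50 vs reference (model 1A)": (1.854, 1.136, 3.027),
    "alcohol use (model 1B)": (2.5031, 1.721, 3.640),
    "speed over 80 km/h (model 1B)": (3.664, 2.371, 5.661),
}

for name, (hr, lower, upper) in reported.items():
    midpoint = geometric_midpoint(lower, upper)
    print(f"{name}: reported HR {hr}, implied HR {midpoint:.3f}")
```

In all three cases the implied value agrees with the reported hazard ratio to within rounding, which is consistent with the assumed construction.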
Age of license
Age of license was used as a proxy for the driver’s level of experience. It was a moderately significant variable in model 1A, with p = 0.0635. Nevertheless, the hazard ratio indicated that drivers who had held a license for less than 5 years had 1.7 times the risk of those who had held one for at least 5 years. From the survival curve of driving experience (age of driving license in years) in Figure 8, it can be seen that drivers whose license age exceeded 5 years had a better survival experience than those with a license age of less than 5 years.
Annual vehicle kilometreage
Drivers’ annual vehicle kilometreage had strong explanatory power in the fitted models. In model 1A, drivers who had travelled between 5,000 km/a and 14,000 km/a had a 42.3% lower hazard than those who had travelled less than 5,000 km/a (the reference group). Also, those who had travelled at least 15,000 km/a had a (100% - 44.8%) = 55.2% lower hazard than those who had travelled less than 5,000 km/a. In the final model 1B, which gave the overall contribution of drivers’ annual kilometreage (p = 0.0084), the hazard ratio was 0.647, which means that each one-unit increase in the annual kilometreage variable is associated with a (100% - 64.7%) = 35.3% decrease in the hazard of accident, holding all other covariates constant. These results indicate that drivers’ accident risk decreases as annual kilometreage increases. It is generally believed that higher exposure to traffic is associated with higher risk, so the findings of this research run contrary to this belief. This might be due to accumulated experience on the part of these drivers.
Route familiarity
The route familiarity variable was not significant in model 1A (p = 0.0816). The hazard ratio was 0.643, indicating that the accident risk was about 35.7% lower for drivers who were familiar with the site of the accident than for other drivers.
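The percentage statements attached to the hazard ratios in this section all follow the same rule, 100 × (HR - 1). A minimal sketch using hazard ratios quoted in the text is given below; the function name is hypothetical.

```python
def percent_change_in_hazard(hazard_ratio):
    """Estimated percentage change in the hazard: 100 * (HR - 1)."""
    return (hazard_ratio - 1.0) * 100.0


# Hazard ratios as quoted in the text
quoted = [
    ("use of safety belt (model 1B)", 0.451),
    ("annual kilometreage (model 1B)", 0.647),
    ("route familiarity (model 1A)", 0.643),
    ("driver age (model 1B)", 1.473),
]

for label, hr in quoted:
    print(f"{label}: {percent_change_in_hazard(hr):+.1f}%")
```

A negative value indicates a lower hazard relative to the reference category, or per unit increase in the covariate; a positive value indicates a higher hazard.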
Vehicle weight
The statistical significance of vehicle weight in model 1A was very weak (p = 0.9039), and therefore its role as a risk factor is less conclusive. However, users of light vehicles had a higher relative risk.
Vehicle age
Vehicle age had no significant effect on accident risk in model 1A, though it proved to be a significant variable in the Kaplan-Meier estimate. In this model, the hazard ratio is 1.236, indicating that vehicles older than 10 years had about 1.2 times the risk of vehicles aged 10 years or less. In other words, the hazard of accident for the older vehicles is higher by an estimated 100 (1.236 - 1) = 23.6%. Also, from the survival curve of age of vehicle in Figure 9, it can be seen that drivers whose vehicles were over 10 years old had a poorer survival experience than those with vehicles less than 10 years old. However, it was observed in the MTTD data that users of old vehicles included many alcohol users, young drivers and non-users of belts. More drivers of newer vehicles were involved in speeding over 80 km/h prior to the accident.
Tyres condition / Tread depth
Tyre condition/tread depth proved to be a statistically significant risk factor (p = 0.018) in determining accident time in model 1A, but had only a moderate influence in the final model (p = 0.0646). Therefore, the role of tyre condition as a risk factor is inconclusive. However, users of worn-out tyres are more at risk than users of unworn tyres. According to the final model 1B, drivers of vehicles with very worn-out tyres had 1.404 times the risk of drivers who used less worn-out tyres (>4 mm). This also means that for every 1 mm decrease in tread depth, the hazard of accident increased by about 40%.
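Under the per-millimetre reading given above, hazard ratios multiply rather than add when the change spans several units. The sketch below assumes tread depth entered the model as a linear covariate, which the text does not confirm; the function name is illustrative.

```python
def hazard_ratio_for_change(per_unit_hr: float, units: float) -> float:
    """Hazard ratio for a change of `units` in a covariate entered linearly
    in a Cox model, given the hazard ratio for a one-unit change."""
    if per_unit_hr <= 0:
        raise ValueError("a hazard ratio must be positive")
    return per_unit_hr ** units


# Hazard ratio quoted in the text for a 1 mm decrease in tread depth (model 1B).
PER_MM_HR = 1.404

for mm in (1, 2, 3):
    hr = hazard_ratio_for_change(PER_MM_HR, mm)
    print(f"{mm} mm less tread: HR = {hr:.3f} ({100 * (hr - 1):+.1f}% hazard)")
```

On this assumption, a 2 mm loss of tread corresponds to a hazard ratio of 1.404 squared, not to twice the 40% increase.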
Assessing the adequacy of the final model 1B
The test of the proportionality assumption is shown in model 1.1B. None of the tests of the time-dependent variables was significant, either individually (Table 5) or collectively (Table 6), with the p-value for each variable greater than 0.05. Therefore, we do not have enough evidence to reject the proportionality assumption for this model. Since the assumption of proportionality is satisfied, the Cox regression model 1B appears to provide a reasonable fit to the MTTD data. The DfBeta statistics were also used to determine whether any observations were influential in fitting the model; they are displayed in Table 7. The table shows the DfBeta statistics for the first 42 individuals only. The signs of the DfBeta statistics are the reverse of what one might expect: a negative sign means that the coefficient increases when the observation is removed. For example, the estimated coefficients for the covariates sex and alcohol in the final model are 0.60565 and 0.91737, respectively. In Table 7, the value 0.012141 for dsex indicates that if observation 2 is removed, the sex coefficient will decrease to approximately 0.60565 − 0.012141 = 0.593509, a decrease of 2%.
Also, the value 0.003132 for dalcohol indicates that if observation 2 is removed, the alcohol coefficient will decrease to approximately 0.91737 − 0.003132 = 0.914238, a decrease of 0.3%. Furthermore, the value −0.015493 for dsex indicates that if observation 42 is removed, the sex coefficient will increase to approximately 0.60565 + 0.015493 = 0.621143, an increase of about 2.6%. Overall, none of the observations exerted an undue influence on the estimated coefficients, and hence on the fit of the model. In summary, the evaluation of the final model based on the proportionality assumption and the DfBeta statistics indicated that the model's structure was acceptable. The practical conclusion is that removing any single observation results in no or only minor changes in the coefficients of the covariates considered, and hence does not distort the models. However, missing data was a general problem. The amount of missing data varied from one variable to another, but the models included all the observations that had no missing data for the variables concerned. A specific cause of missing data was drivers who had died in the accidents and could not be captured in the registry. Some killed drivers were omitted from the MTTD data because too much information about them was missing.
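The DfBeta arithmetic discussed above can be reproduced with a few lines of code. This is a sketch under two assumptions: the alcohol value is taken as 0.003132, the figure used in the subtraction in the text, and the sign of the observation 42 value is taken as negative, in line with the stated sign convention. The function names are illustrative.

```python
def coefficient_without(beta: float, dfbeta: float) -> float:
    """Approximate coefficient after deleting one observation (beta - DfBeta)."""
    return beta - dfbeta


def relative_change_percent(beta: float, dfbeta: float) -> float:
    """Signed change in the coefficient as a percentage of the full-data estimate."""
    return 100.0 * (coefficient_without(beta, dfbeta) - beta) / beta


# Full-data coefficients of the final model, as quoted in the text.
BETA = {"sex": 0.60565, "alcohol": 0.91737}

# (covariate, observation number, DfBeta value).
# The negative sign for observation 42 is an assumption (see lead-in).
CASES = [("sex", 2, 0.012141), ("alcohol", 2, 0.003132), ("sex", 42, -0.015493)]

for covariate, obs, dfbeta in CASES:
    beta = BETA[covariate]
    new_beta = coefficient_without(beta, dfbeta)
    change = relative_change_percent(beta, dfbeta)
    print(f"drop obs {obs}: {covariate} {beta:.5f} -> {new_beta:.6f} ({change:+.1f}%)")
```

Changes of a few percent or less in each coefficient are consistent with the conclusion that no single observation dominates the fit.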
The analysis shows that applying survival models to accident data is a promising approach. The models applied well to the examination of accident risk factors. Table 8 presents the most important risk factors according to the models for the MTTD accident data.
1) This study focused on only one year of driving, but a driver might have been driving for many years before the accident occurred. Therefore, further work should measure the time from the date drivers obtained their driving licence to their first accident involvement.
2) Several covariates that may play an essential role in the development of the models were unfortunately not available. Examples include the date the vehicle was first taken into use, how long the driver had been on the trip when the accident occurred, drivers' criminal records, drivers' history of accident involvement, and drivers' income levels.
3) Stakeholders responsible for ensuring safety on our roads should implement the findings of the study, since doing so will enable them to put in place better measures to reduce the occurrence of accidents in the Northern region in particular and the country as a whole.
The authors have not declared any conflict of interests.
REFERENCES
Allison PD (1995). Survival Analysis using SAS: A practical guide, Cary, NC: SAS Institute Inc.


Blower D, Kenneth L, Campbell, Green P (1993). Accident rates for heavy trucktractors in Michigan. Accid. Anal. Prev. 25(3):307321.
Crossref



Cain KC, Lange NT (1984). Approximate case influence for the proportional hazards regression model with censored data. Biometrika 40:493499.
Crossref



ChiengMing T, MingShan Y, LiYung T, HsinHsien L, MinChi L (2016). A comprehensive analysis of factors leading to speeding offenses among largetruck drivers. Transp. Res. Part F. 38:171181.
Crossref



Calliendo C, DeGuglielmo ML, Guida M (2013). A crashprediction model for road tunnels. Accid. Anal. Prev. 55:107115.
Crossref



Collett D (2003). Modelling Survival Data in Medical Research, Chapman and Hall, London.



Cox DR, Snell EJ (1968). A general definition of residuals with discussion. Royal Statistical Society J. Series B. 30:248275.



Cox DR (1972). Regression models and lifetables. Royal Statistical Society J. Series B. 34:187220.



Cox DR (1975). Partial likelihood. Biometrika 62:269276.
Crossref



Crowley J, Hu M (1977). Covariance analysis of heart transplant survival data. Am. J. Stat. Ass. 78:2736.
Crossref



Dagan Y, Doljansky JT, Green A, Weiner A (2006). Body mass index (BMI) as a first line screening criterion for detection of excessive daytime sleepiness among professional drivers. Traffic Injury Prev. 7(1):4448.
Crossref



Elvik R (1996). A Meta¬analysis of studies concerning the safety effects of daytime running lights on cars. Accid. Anal. Prev. 28(6):6856¬94.



Elvik R, Christensen P, Amundsen A (2004). Speed and road accidents: an evaluation of the power model. Institute of Transport Economics (TØI) Report. 740:134.



Hakkanen J, Summala H (2001). Fatal traffic accidents among trailer truck drivers and accident causes as viewed by other truck drivers. Accid. Anal. Prev. 33:187196.
Crossref



Häkkänen J, Summala H (2000). Sleepiness at work among commercial truck drivers. Sleep 23(1):4957.



Jovanis PP, Kaneko T, Lin TD (1991). Exploratory analysis of motor carrier accident risk and daily driving patterns. Transp. Res. Board. Working paper No. 73.



Kalbfleisch JD, Prentice RL (2002). The Statistical Analysis of Failure Data, 2nd ed. Wiley, New York.
Crossref



Kaplan E, Meier P (1958). Nonparametric estimation from incomplete observations. Am. J. Stat. Ass. 53:457481.
Crossref



Klein JP, Moeschberger ML (1997). Survival Analysis: Techniques for Censored and Truncated Data. Springer, New York.
Crossref



Klembaum DG (1996). Survival Analysis: A Self learning text. Springer, New York.
Crossref



Lawless JF (1982). Statistical Models and Methods for Lifetime Data Analysis. Wiley, New York.



Miaou SP (1994). The relationship between truck accidents and geometric design of road sections: Poisson versus negative binomial regressions. Accid. Anal. Prev. 26(4):471482.
Crossref



Oppe S (1989). Macroscopic models for traffic and traffic safety. Accid. Anal. Prev. 21(3):225232.
Crossref



Rodríguez DA, Rocha M, Khattak AJ, Belzer MH (2003). Effects of truck driver wages and working conditions on highway safety: Case study. Trans. Res. Record 1833:95102.
Crossref



Smeed RJ (1949). Some statistical aspects of road safety research. Journal of Royal Statistical Society, Series A. 112(1):134.
Crossref



Schoenfeld D (1982). Partial residuals for the proportional hazards regression model. Biometrika 69:239241.
Crossref



Saccomanno F, Buyco C (1988). Generalized Loglinear Models of Truck Accident Rates. Trans. Res. Record 1172:2331.



Sullman JM, Meadows ML, Pajo KB (2002). Aberrant driving behaviours amongst New Zealand truck drivers. Trans. Res. Part F Traffic Psychol. Behav. 5:217232.



Taylor AH, Dorn L (2006). Stress, fatigue, health, and risk of road traffic accidents among professional drivers: The contribution of physical inactivity. Annual Rev. Public Health 27:371391.
Crossref



Thernaeau TM, Grambsch PM, Fleming TR (1990). Martingalebased residuals for survival models. Biometrika 77:147160.
Crossref

