Determining the Factors Affecting the Survival of HIV Patients: Comparison of Cox Model and the Random Survival Forest Method



Introduction
Currently, AIDS is the most serious threat to public health (1). This epidemic has so far devastated many individuals, families, and societies and has increasingly eroded civil order and economic growth (2). More than 90% of those infected with HIV live in developing countries, with 80% being infected through sexual relationships. The disease had caused the death of more than 25 million people by 2006. AIDS has been described as one of the most devastating pandemics in history, and it is estimated that 6% of the world's population is infected with this virus (3). AIDS currently has no cure, but the mortality resulting from HIV has diminished thanks to the administration of highly active anti-retroviral therapy (HAART) (1). Anti-retroviral therapy (ART) slows the progression to AIDS in HIV-positive patients and increases their survival (4). ART has therefore transformed a highly fatal infectious disease into a potentially chronic and controllable infection (5). Infection with HIV is associated with gradual quantitative and qualitative reductions in CD4 cells, which puts the patient at risk of many comorbid and opportunistic infections (6). Tuberculosis (TB) is one of the most common opportunistic infections in patients with HIV. HIV significantly increases the number of patients with TB, thus heightening the risk of mortality among the affected patients (7). Ending epidemics such as AIDS falls under the third of the 17 Sustainable Development Goals, under which all countries, including Iran, are committed to ending the AIDS epidemic by 2030 (8). In this regard, suitable statistical models can help identify important prognostic factors reliably and improve the accuracy of predicting patient survival.
In survival analysis, various regression models are used for predicting the probability of future events (9). The Cox proportional hazards regression model is one of the most common models for identifying potential risk factors of diseases, and several studies have used it to determine the survival of patients with AIDS and HIV. Nevertheless, the Cox model has some limitations: it requires the proportional hazards assumption to hold, and it performs poorly in complex settings such as nonlinear and collinear effects of variables (10). Further, the model is less reliable when the censoring rate is high (11). Therefore, models with fewer constraints should be used. Random survival forest (RSF) is a nonparametric machine learning method developed by Ishwaran et al based on random forests (RF) (12). RSF addresses these limitations of the Cox model, including through the concurrent assessment of complex effects and interaction effects between variables (10).
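For reference, the Cox model discussed above is usually written as shown below. The notation is the standard textbook form and is an assumption here, since the paper itself does not print the formula: $h_0(t)$ is the unspecified baseline hazard, $x$ the covariate vector, and $\beta$ the regression coefficients.

```latex
h(t \mid x) = h_0(t)\,\exp\!\left(\beta^{\top} x\right),
\qquad
\frac{h(t \mid x_a)}{h(t \mid x_b)} = \exp\!\left(\beta^{\top}(x_a - x_b)\right)
```

The second expression does not depend on $t$, which is the proportional hazards requirement: the hazard ratio between any two covariate profiles is assumed constant over time.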
Several studies have compared the Cox and RSF models for the selection of covariates in survival analysis (11,13). In these studies, RSF performed better than the Cox model and was able to identify nonlinear effects automatically, an ability the Cox model lacks. On the other hand, when the number of predictors is low, RSF underperforms compared with Cox because of its sensitivity to confounding factors. Under such conditions, RSF is considered unsuitable and the Cox model is proposed instead (10).
Considering the efficiency and ability of different models to predict the factors affecting the survival of patients in different diseases, the aim of this study is to determine the factors predicting the survival of AIDS patients based on the Cox and RSF models and to compare the accuracy of the two models.

Materials and Methods
This retrospective cohort study investigated the survival of patients with HIV in Hamadan province, located in western Iran. For this purpose, 769 patients with HIV who had a medical file in the healthcare center of Hamadan province between 1997 and 2017 were studied. The required information was extracted from the patients' files using checklists. The collected information included age at the time of diagnosis, gender, route of HIV transmission (injection, sexual, mother to baby, unknown), the CD4 cell count, antiretroviral therapy, the time from HIV diagnosis until TB, the date of AIDS diagnosis, date of death, cause of death, and the date of last contact with the patient. The time from disease diagnosis until death was considered as survival time. To identify the variables affecting the survival time of HIV patients, the multivariate Cox regression model and the RSF method were used. RSF is an extension of RF for right-censored survival data; it follows the same principles as RF and retains all of its important features (7). A random forest consists of many trees, each built on a random sample drawn with replacement. Generally, the RSF algorithm is as follows (12):
1. B bootstrap samples are drawn from the original data. Each bootstrap sample leaves out about 37% of the data, which is known as the out-of-bag (OOB) sample.
2. For every bootstrap sample, a survival tree is grown. At every node of the tree, q predictors (covariates) are randomly chosen as candidates for splitting. The node is divided into two daughter nodes using a splitting criterion. The variable chosen for splitting is the one that creates the maximum difference in survival between the two daughter nodes.
3. The tree is grown to its maximum size. The last nodes are called terminal nodes. A terminal node should contain no fewer than d0 > 0 events (d0 represents the number of intended events, which is death in this study).
4. For every tree, a cumulative hazard function (CHF) is calculated, and the mean of these CHFs gives the ensemble CHF.
5. Using the OOB data, the prediction error is calculated.
For splitting every node and creating the daughter nodes, the log-rank splitting rule, the log-rank score rule, and RSF have been used. The accuracy of the splitting rules is compared through the prediction error, with lower values suggesting higher accuracy. The importance of every variable in the prediction is measured by the VIMP index. Positive values indicate variables with predictive ability (important variables), while zero or negative values indicate variables with no predictive ability (10,14). To compare the efficiency of the Cox proportional hazards model and RSF, two criteria, the Brier score and the C-index, have been used. All analyses were performed in R 3.1.2 using the randomForestSRC package and statistical packages for survival analysis.
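The figure of about 37% left out of each bootstrap sample follows from sampling n records with replacement: the chance that a given record is never drawn is (1 - 1/n)^n, which approaches 1/e. The Python sketch below checks this for the cohort size of 769 reported above, both analytically and by simulation. It is only an illustration; the function names, the seed, and the number of simulated bootstraps are assumptions and are not taken from the study, which used R.

```python
import math
import random


def expected_oob_fraction(n):
    """Probability that one record is never drawn in n draws with replacement."""
    return (1.0 - 1.0 / n) ** n


def simulated_oob_fraction(n, n_bootstraps, seed=0):
    """Average share of records left out of bag over repeated bootstrap samples."""
    rng = random.Random(seed)  # fixed seed (hypothetical choice) for repeatability
    total = 0.0
    for _ in range(n_bootstraps):
        # Indices that appear at least once in this bootstrap sample
        in_bag = {rng.randrange(n) for _ in range(n)}
        total += (n - len(in_bag)) / n
    return total / n_bootstraps


n_patients = 769  # cohort size reported in the study
expected = expected_oob_fraction(n_patients)
simulated = simulated_oob_fraction(n_patients, n_bootstraps=200)
print(f"analytic OOB fraction:  {expected:.4f}")
print(f"simulated OOB fraction: {simulated:.4f}")
print(f"limit 1/e:              {math.exp(-1):.4f}")
```

Because the OOB records are never seen by the tree grown on that bootstrap sample, they can serve as a built-in test set for the prediction error in the final step of the algorithm.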

Results
In this study, the survival of 769 patients with AIDS was investigated. Of these, 662 (86.1%) were male and 107 (13.9%) were female. The mean ± SD age of patients at diagnosis was 33.83 ± 9.63 years (range 1-87), with 88.4% of patients being younger than 45 years. Further, 63.8% of patients had primary education while 36.2% had university degrees. The largest group of patients was single (44.3%). Additionally, 45.6% were under ART treatment. Finally, 9.2% of patients had concurrent TB. Table 1 presents the demographic characteristics of the patients.
First, the effect of the factors influencing survival was determined using the Cox proportional hazards regression model, as presented in Table 2. Based on the results of the Cox regression model, history of injection, co-injection, TB status (Yes/No), the first CD4 cell count, and time from diagnosis until developing TB were identified as important and influential factors for the survival of patients. Based on the hazard ratio (HR), the mortality risk for those with a history of injection was 12.328 times that of non-injection patients, and for TB patients it was 13.565 times that of non-TB individuals. The risk of mortality diminished with an increase in the CD4 cell count. Further, the reported HR for those with a history of co-injection was 0.122 compared with those without such a history. Finally, the risk of mortality diminished as the time from disease diagnosis until TB increased.
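The following Python sketch shows how figures of this kind are conventionally read. It assumes that all three reported values (12.328, 13.565, and 0.122) are hazard ratios from Table 2, which is not reproduced here. Under that assumption, an HR above 1 corresponds to a higher hazard and an HR below 1 to a lower hazard relative to the reference group. The helper names are hypothetical.

```python
import math


def hr_to_percent_change(hr):
    """Percent change in hazard relative to the reference group (HR = 1)."""
    return (hr - 1.0) * 100.0


def hr_to_coefficient(hr):
    """Cox regression coefficient (log hazard ratio) corresponding to an HR."""
    return math.log(hr)


# Values quoted in the text; treating all of them as hazard ratios is an assumption.
reported = {
    "history of injection": 12.328,
    "tuberculosis": 13.565,
    "history of co-injection": 0.122,
}

for name, hr in reported.items():
    change = hr_to_percent_change(hr)
    direction = "higher" if change > 0 else "lower"
    print(f"{name}: HR={hr}, beta={hr_to_coefficient(hr):.3f}, "
          f"hazard {abs(change):.1f}% {direction} than reference")
```

Read this way, a value of 0.122 would indicate a lower rather than a greater hazard, so the direction of the co-injection effect should be checked against Table 2.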
To compare the Cox model with RSF, RSF models were fitted using the log-rank, RSF, and log-rank score splitting rules, and the one with the minimum error was chosen as the final model. Table 3 shows the error of the three models. According to the table, the log-rank rule had the minimum error and was identified as the best of the three models. Figure 1 displays the variables ranked by importance according to the log-rank rule. Accordingly, time from diagnosis until TB, the first CD4 cell count, ART, and history of co-injection were identified as the important variables in the survival of patients with AIDS. The error value for this rule was 16.30, and it remained stable beyond 300 trees. The important variables based on the RSF and log-rank score rules are presented in Figures 2 and 3.
To compare the Cox model and the RSF model based on the log-rank rule, the Brier score and C-index were used, with the results provided in Table 4. According to this table, the RSF model based on the log-rank rule was identified as the most suitable of the models applied in this research for determining the important factors in the survival of patients with AIDS. Specifically, time from diagnosis until TB, the first CD4 cell count, ART, and history of co-injection were, in that order, the important variables in predicting the survival of HIV+ patients.
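As an illustration of the two comparison criteria, the Python sketch below computes Harrell's C-index and a simplified Brier score on a small made-up data set. Everything in it is an assumption for illustration: the toy times, events, and predictions are invented, the function names are hypothetical, and the Brier score shown is a naive version that drops patients censored before the evaluation time instead of applying the inverse-probability-of-censoring weights used in standard software. It is not the computation behind Table 4.

```python
def concordance_index(times, events, risk_scores):
    """Harrell's C: share of comparable pairs in which the patient who died
    earlier was assigned the higher predicted risk (ties in risk count 0.5)."""
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # A pair is comparable when i died and did so strictly before j's time
            if events[i] == 1 and times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


def naive_brier_score(times, events, surv_probs, t_star):
    """Mean squared gap between predicted survival at t_star and observed status.
    Simplification: patients censored at or before t_star are excluded."""
    errors = []
    for t, e, p in zip(times, events, surv_probs):
        if t > t_star:
            errors.append((p - 1.0) ** 2)   # known to be alive at t_star
        elif e == 1:
            errors.append((p - 0.0) ** 2)   # died at or before t_star
        # censored at or before t_star: status unknown, skipped
    if not errors:
        raise ValueError("no patients with known status at t_star")
    return sum(errors) / len(errors)


# Invented toy data: follow-up time, death indicator (1 = died, 0 = censored)
times = [2, 4, 6, 8, 10]
events = [1, 1, 0, 1, 0]
risk_scores = [0.9, 0.4, 0.7, 0.5, 0.1]   # higher = predicted to die sooner
surv_at_5 = [0.2, 0.6, 0.7, 0.8, 0.9]     # predicted P(survive beyond t = 5)

c_index = concordance_index(times, events, risk_scores)
brier = naive_brier_score(times, events, surv_at_5, t_star=5)
print("C-index:", c_index)
print("naive Brier score at t=5:", round(brier, 3))
```

A C-index closer to 1 indicates better discrimination (0.5 corresponds to random ranking), while a Brier score closer to 0 indicates more accurate predicted survival probabilities. This is why a model is preferred when it has the higher C-index and the lower Brier score.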

Discussion
This study compared the multivariate Cox regression model and RSF models to determine the factors affecting the survival of HIV+ patients, with mortality as the final event. The aim of the comparison was to select the model with greater accuracy and efficiency in identifying these factors. Accordingly, the effect of demographic, clinical, and laboratory factors on survival was tested in the RSF and Cox models. The RSF model was fitted using different splitting rules. Ranking the important variables based on the log-rank rule, the log-rank score rule, and RSF indicated that many important variables are the same under the three rules. The unimportant variables were also common to the three rules. As can be seen, TB status (Y/N) and the latest marital status were the least important. Of the three rules used, the RSF model with the log-rank splitting rule had the minimum error. This finding is in line with the study by Datema et al on the factors influencing the survival of patients with head and neck cancer. In that study, RSF models were compared with each other based on the error criterion, and the model based on the log-rank splitting rule was determined to be the most suitable of the RSF models (13).
The multivariate Cox model and the RSF model based on the log-rank splitting rule were compared using the Brier score and C-index (15,16). According to the results of this study, time from disease diagnosis until TB, the first CD4 cell count, ART, and history of co-injection are, in that order, the important variables in predicting the survival of HIV+ patients. The time from diagnosis until TB was identified as the most important factor in the mortality of patients. The World Health Organization has identified TB as the cause of death of 23% of AIDS patients. Therefore, this variable plays a significant role in the survival of patients (16).
The first CD4 cell count was identified as the second most important predictive factor for the mortality of patients. Investigation of the effect of CD4 on survival indicated that a reduction in the CD4 cell count is associated with an increased mortality rate, and thus with a higher hazard ratio (HR). The results of many studies have suggested that a reduction in CD4 cell count plays a significant role in increasing the risk of HIV, TB, and AIDS-induced death (17). This finding is in line with previous studies (18,19). Further, Cuong et al indicated that a CD4 cell count of less than 100 is an effective predictive factor for AIDS-induced death (20). Based on the results of this study, ART was identified as one of the important variables for the survival of patients with HIV and was associated with increased survival. Evidence has shown that ART use is associated with diminished mortality. On the other hand, some studies have found that older patients respond worse than younger individuals (21,22). One study showed that ART leads to diminished HIV-induced mortality and an increased CD4 cell count in patients with concurrent TB and HIV infection (23).
Co-injection has been one of the major routes of HIV transmission in recent years (24), and it has been mentioned as the most common one. Further, some people who inject drugs have AIDS. In this study, 65.9% of patients with HIV had a history of addiction.
This research had some limitations. Since it was a retrospective cohort study, inaccuracies in the recorded information may have introduced bias into the results and reduced their validity. Further, when estimating survival time, the precise date of onset of HIV infection is not as clear as it is for other chronic diseases. Hence, in this study, as in other survival studies, the time of diagnosis (i.e., the patient's referral) was taken as the time of onset of HIV infection when calculating the duration of survival.