Sunday, October 13, 2019
Medical Data Analytics Using R
The dataset has five attributes:
1. R for Recency: months since the last donation
2. F for Frequency: total number of donations
3. M for Monetary: total amount of blood donated, in c.c.
4. T for Time: months since the first donation
5. A binary variable: 1 if the customer donated blood, 0 if not

The main idea behind this dataset is the concept of customer relationship management (CRM). Based on three metrics, Recency, Frequency and Monetary (RFM), which are three of the five attributes of the dataset, we can predict whether a customer is likely to donate blood again in response to a marketing campaign. For example, customers who have donated or visited more recently (Recency), more frequently (Frequency), or who have a higher monetary value (Monetary) are more likely to respond to a marketing effort, while customers with a lower RFM score are less likely to react. It is also known from customer behaviour that the time of the first positive interaction (donation, purchase) is not significant; the Recency of the last donation, however, is very important. In the traditional RFM implementation each customer is ranked on their RFM parameters against all the other customers, which produces a score for every customer. Customers with higher scores are more likely to react in a positive way, for example to visit or donate again. The model constructs a formula to address the following problem: keep in the repository only the customers who are more likely to continue donating in the future, and remove those who are less likely to donate within a certain period of time. This statement also defines the problem which will be trained and tested in this project. First, I created a .csv file and generated 748 unique random numbers in Excel in the range [1, 748] in the first column, which correspond to the customer (user) IDs. Then I transferred all the data from the .txt file (transfusion.data) to the .csv file in Excel using the delimited (,) option.
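As a minimal sketch of the RFM idea, each metric can be binned into a 1-3 category and the categories summed into a score; the category boundaries below are illustrative assumptions, not the rules derived later in this project.

```r
# Toy RFM scoring sketch: each attribute is binned into 1-3,
# a higher total score means a customer more likely to respond.
# The boundaries here are illustrative assumptions only.
score_rfm <- function(recency, frequency, monetary) {
  r <- ifelse(recency < 6, 3, ifelse(recency < 18, 2, 1))       # more recent = higher
  f <- ifelse(frequency >= 20, 3, ifelse(frequency > 8, 2, 1))  # more donations = higher
  m <- ifelse(monetary >= 5000, 3, ifelse(monetary > 2000, 2, 1))
  r + f + m
}

score_rfm(2, 25, 6250)   # recent, frequent, high-value donor -> 9
score_rfm(40, 1, 250)    # lapsed one-time donor -> 3
```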
Then I randomly split it into a train file and a test file. The train file contains 530 instances and the test file 218 instances. Afterwards, I read both the training dataset and the test dataset. From the previous results, we can see that there are no missing or invalid values, and the data ranges and units seem reasonable. Figure 1 above depicts boxplots of all the attributes for both the train and test datasets. By examining the figure, we notice that both datasets have similar distributions and that some outliers (Monetary > 2,500) are visible. The blood-volume variable has a high correlation with Frequency. Because the volume of blood donated each time is fixed, the Monetary value is proportional to the Frequency (the number of donations) of each person. For example, if the amount of blood drawn from each person was 250 ml per bag (Taiwan Blood Services Foundation, 2007), then Monetary = 250 * Frequency. This is also why the predictive model will not consider the Monetary attribute in the implementation. So, it is reasonable to expect that customers with a higher Frequency will have a much higher Monetary value. This can also be verified visually by examining the Monetary outliers in the train set; this query returns 83 instances. To better understand the statistical dispersion of the whole dataset (748 instances), we look at the standard deviation (SD) between Recency and the binary variable indicating whether the customer donated blood, and the SD between Frequency and the same binary variable. The distribution of scores around the mean is small, which means the data is concentrated. This can also be noticed in the plots. The correlation matrix verifies what was stated above: Frequency and Monetary are proportional inputs, as can be seen from their high correlation. Another observation is that the various Recency values are not multiples of 3.
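The proportionality between Monetary and Frequency can be checked directly: with a fixed 250 ml per donation, their correlation is exactly 1, which is why one of the two adds no information. A small sketch on made-up donation counts:

```r
# Monetary is a fixed multiple of Frequency (250 ml per donation),
# so the two columns carry the same information (correlation = 1).
frequency <- c(1, 4, 10, 24, 2)   # hypothetical donation counts
monetary  <- 250 * frequency      # total c.c. of blood donated

cor(frequency, monetary)          # exactly 1
```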
This contradicts the description's statement that the data was collected every 3 months. Additionally, there is always a maximum number of times one can donate blood in a given period (e.g. once per month), but the data shows that 36 customers donated blood more than once in the same month and 6 customers donated 3 or more times in the same month. Two features will be used to predict whether a customer is likely to donate again: Recency and Frequency (RF). The Monetary feature will be dropped. The number of categories for the R and F attributes will be 3. The highest RF score will be 33, equivalent to 6 when the two digits are added, and the lowest will be 11, equivalent to 2. The threshold on the added score for deciding whether a customer is likely to donate blood again will be set to 4, which is the median value. Users are assigned to categories by sorting on the RF attributes as well as on their scores. The file with the donors is sorted on Recency first (in ascending order, because we want to see which customers donated blood most recently), and then on Frequency within each Recency category (in descending order this time, because we want to see which customers donated the most times). Apart from sorting, we apply some business rules established after multiple tests.

For Recency (business rule 1):
- If the Recency is less than 15 months, the customer is assigned to category 3.
- If the Recency is at least 15 and less than 26 months, the customer is assigned to category 2.
- If the Recency is 26 months or more, the customer is assigned to category 1.

For Frequency (business rule 2):
- If the Frequency is 25 times or more, the customer is assigned to category 3.
- If the Frequency is greater than 15 and less than 25 times, the customer is assigned to category 2.
- If the Frequency is 15 times or fewer, the customer is assigned to category 1.

RESULTS

The output of the program is two smaller files, one produced from the train file and one from the test file, which exclude the customers who should not be considered future targets and keep those who are likely to respond. Statistics on the precision, recall and balanced F-score of the train and test files are calculated and printed. Furthermore, we compute the absolute difference between the results retrieved from the train and test files to obtain the offset error between these statistics. By doing this and verifying that the error values are negligible, we validate the consistency of the implemented model. Moreover, we depict two confusion matrices, one for the test set and one for the training set, by calculating the true positives, false negatives, false positives and true negatives. In our case, true positives are the customers who donated in March 2007 and were classified as possible future donors. False negatives are the customers who donated in March 2007 but were not classified as future targets for marketing campaigns. False positives are the customers who did not donate in March 2007 but were incorrectly classified as possible future targets. Lastly, true negatives are the customers who did not donate in March 2007 and were correctly classified as implausible future donors and were therefore removed from the data file. By classification we mean the application of the threshold (4) to separate the customers who are more likely from those who are less likely to donate again within a certain future period.
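Given the four confusion-matrix counts defined above, the reported statistics follow directly. The counts below are made up for illustration; they are not the project's actual results.

```r
# Hypothetical confusion-matrix counts (illustrative only)
tp <- 90    # donated in March 2007, score >= 4 (kept as target)
fp <- 160   # did not donate, but score >= 4 (kept)
fn <- 40    # donated, but score < 4 (removed)
tn <- 240   # did not donate, score < 4 (removed)

precision <- tp / (tp + fp)   # relevant retrieved / all retrieved
recall    <- tp / (tp + fn)   # relevant retrieved / all relevant
f_score   <- 2 * precision * recall / (precision + recall)  # harmonic mean

c(precision = precision, recall = recall, f = f_score)
```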
Lastly, we calculate two more single-value metrics for both the train and test files: the Kappa statistic (a general statistic used for classification systems) and the Matthews Correlation Coefficient (MCC), a cost/reward measure. Both are normalized statistics for classification systems whose values never exceed 1, so the same statistic can be used even as the number of observations grows. The errors for both measures are MCC error: 0.002577 and Kappa error: 0.002808, which is very small (negligible), in line with all the previous measures.

REFERENCES

UCI Machine Learning Repository (2008) Blood Transfusion Service Center data set. Available at: http://archive.ics.uci.edu/ml/datasets/Blood+Transfusion+Service+Center (Accessed: 30 January 2017).

Taiwan Blood Services Foundation (2015) Operation department. Available at: http://www.blood.org.tw/Internet/english/docDetail.aspx?uid=7741pid=7681docid=37144 (Accessed: 31 January 2017).

The appendix with the code starts below. The whole code has also been uploaded to my GitHub profile and can be accessed at this link: https://github.com/it21208/RassignmentDataAnalysis/blob/master/RassignmentDataAnalysis.R

library(ggplot2)
library(car)

# read training and testing datasets
traindata <- read.csv("C:/Users/Alexandros/Dropbox/MSc/2nd Semester/Data analysis/Assignment/transfusion.csv")
testdata <- read.csv("C:/Users/Alexandros/Dropbox/MSc/2nd Semester/Data analysis/Assignment/test.csv")

# assign the datasets to dataframes
dftrain <- data.frame(traindata)
dftest <- data.frame(testdata)
sapply(dftrain, typeof)

# give better names to the columns
names(dftrain) <- c("ID", "recency", "frequency", "cc", "time", "donated")
names(dftest) <- c("ID", "recency", "frequency", "cc", "time", "donated")

# drop the time column from both files
dftrain$time <- NULL
dftest$time <- NULL

# sort the (train) dataframe on Recency in ascending order
sorted_dftrain <- dftrain[order(dftrain[, 2]), ]
# add a column to hold the Recency score (rank) of each customer
sorted_dftrain[, "Rrank"] <- 0
# convert the train dataframe to a numeric matrix
matrix_train <- as.matrix(sapply(sorted_dftrain, as.numeric))

# same for the (test) dataframe
sorted_dftest <- dftest[order(dftest[, 2]), ]
sorted_dftest[, "Rrank"] <- 0
matrix_test <- as.matrix(sapply(sorted_dftest, as.numeric))

# categorize matrix_train: Recency scores (business rule 1)
for (i in 1:nrow(matrix_train)) {
  if (matrix_train[i, 2] < 15) {
    matrix_train[i, 6] <- 3
  } else if (matrix_train[i, 2] >= 15 && matrix_train[i, 2] < 26) {
    matrix_train[i, 6] <- 2
  } else {
    matrix_train[i, 6] <- 1
  }
}

# categorize matrix_test: Recency scores (business rule 1)
for (i in 1:nrow(matrix_test)) {
  if (matrix_test[i, 2] < 15) {
    matrix_test[i, 6] <- 3
  } else if (matrix_test[i, 2] >= 15 && matrix_test[i, 2] < 26) {
    matrix_test[i, 6] <- 2
  } else {
    matrix_test[i, 6] <- 1
  }
}

# convert matrix_train back to a dataframe, then sort
# first by Recency rank (desc.) and then by Frequency (desc.)
sorted_dftrain <- data.frame(matrix_train)
sorted_dftrain_2 <- sorted_dftrain[order(-sorted_dftrain[, 6], -sorted_dftrain[, 3]), ]
# add a column to hold the Frequency score (rank) of each customer
sorted_dftrain_2[, "Frank"] <- 0
matrix_train <- as.matrix(sapply(sorted_dftrain_2, as.numeric))

# same for the test data
sorted_dftest <- data.frame(matrix_test)
sorted_dftest2 <- sorted_dftest[order(-sorted_dftest[, 6], -sorted_dftest[, 3]), ]
sorted_dftest2[, "Frank"] <- 0
matrix_test <- as.matrix(sapply(sorted_dftest2, as.numeric))

# categorize matrix_train: Frequency scores (business rule 2)
for (i in 1:nrow(matrix_train)) {
  if (matrix_train[i, 3] >= 25) {
    matrix_train[i, 7] <- 3
  } else if (matrix_train[i, 3] > 15 && matrix_train[i, 3] < 25) {
    matrix_train[i, 7] <- 2
  } else {
    matrix_train[i, 7] <- 1
  }
}

# categorize matrix_test: Frequency scores (business rule 2)
for (i in 1:nrow(matrix_test)) {
  if (matrix_test[i, 3] >= 25) {
    matrix_test[i, 7] <- 3
  } else if (matrix_test[i, 3] > 15 && matrix_test[i, 3] < 25) {
    matrix_test[i, 7] <- 2
  } else {
    matrix_test[i, 7] <- 1
  }
}

# sort the train data first on Recency rank (desc.) then Frequency rank (desc.)
# and add another column for the sum of the Recency and Frequency ranks
sorted_dftrain <- data.frame(matrix_train)
sorted_dftrain_2 <- sorted_dftrain[order(-sorted_dftrain[, 6], -sorted_dftrain[, 7]), ]
sorted_dftrain_2[, "SumRankRAndF"] <- 0
matrix_train <- as.matrix(sapply(sorted_dftrain_2, as.numeric))

# same for the test data
sorted_dftest <- data.frame(matrix_test)
sorted_dftest2 <- sorted_dftest[order(-sorted_dftest[, 6], -sorted_dftest[, 7]), ]
sorted_dftest2[, "SumRankRAndF"] <- 0
matrix_test <- as.matrix(sapply(sorted_dftest2, as.numeric))

# sum the Recency rank and Frequency rank for the train file
for (i in 1:nrow(matrix_train)) {
  matrix_train[i, 8] <- matrix_train[i, 6] + matrix_train[i, 7]
}
# and for the test file
for (i in 1:nrow(matrix_test)) {
  matrix_test[i, 8] <- matrix_test[i, 6] + matrix_test[i, 7]
}

# sort both datasets on the total rank in descending order
sorted_dftrain <- data.frame(matrix_train)
sorted_dftrain_2 <- sorted_dftrain[order(-sorted_dftrain[, 8]), ]
matrix_train <- as.matrix(sapply(sorted_dftrain_2, as.numeric))
sorted_dftest <- data.frame(matrix_test)
sorted_dftest2 <- sorted_dftest[order(-sorted_dftest[, 8]), ]
matrix_test <- as.matrix(sapply(sorted_dftest2, as.numeric))

# apply the threshold: count customers with score >= 4 and whether they donated (train)
count_train_predicted_donations <- 0    # true positives
counter_train <- 0                      # all customers with score >= 4
number_donation_instances_whole_train <- 0
false_positives_train_counter <- 0
for (i in 1:nrow(matrix_train)) {
  if (matrix_train[i, 8] >= 4 && matrix_train[i, 5] == 1) {
    count_train_predicted_donations <- count_train_predicted_donations + 1
  }
  if (matrix_train[i, 8] >= 4 && matrix_train[i, 5] == 0) {
    false_positives_train_counter <- false_positives_train_counter + 1
  }
  if (matrix_train[i, 8] >= 4) {
    counter_train <- counter_train + 1
  }
  if (matrix_train[i, 5] == 1) {
    number_donation_instances_whole_train <- number_donation_instances_whole_train + 1
  }
}

# same counts for the test file
count_test_predicted_donations <- 0
counter_test <- 0
number_donation_instances_whole_test <- 0
false_positives_test_counter <- 0
for (i in 1:nrow(matrix_test)) {
  if (matrix_test[i, 8] >= 4 && matrix_test[i, 5] == 1) {
    count_test_predicted_donations <- count_test_predicted_donations + 1
  }
  if (matrix_test[i, 8] >= 4 && matrix_test[i, 5] == 0) {
    false_positives_test_counter <- false_positives_test_counter + 1
  }
  if (matrix_test[i, 8] >= 4) {
    counter_test <- counter_test + 1
  }
  if (matrix_test[i, 5] == 1) {
    number_donation_instances_whole_test <- number_donation_instances_whole_test + 1
  }
}

# remove the customers who are less likely to donate again from the train file
dftrain <- data.frame(matrix_train)
dftrain_final <- dftrain[c(1:counter_train), 1:8]
# and from the test file
dftest <- data.frame(matrix_test)
dftest_final <- dftest[c(1:counter_test), 1:8]

# save the final dataframes (the reduced sets of target future customers) as CSVs
write.csv(dftrain_final, file = "C:/Users/Alexandros/Dropbox/MSc/2nd Semester/Data analysis/Assignment/train_output.csv", row.names = FALSE)
write.csv(dftest_final, file = "C:/Users/Alexandros/Dropbox/MSc/2nd Semester/Data analysis/Assignment/test_output.csv", row.names = FALSE)

# train precision = relevant instances retrieved / instances retrieved
precision_train <- count_train_predicted_donations / counter_train
# train recall = relevant instances retrieved / all relevant instances
recall_train <- count_train_predicted_donations / number_donation_instances_whole_train
# the balanced F-score is the harmonic mean of precision and recall
f_balanced_score_train <- 2 * (precision_train * recall_train) / (precision_train + recall_train)
# test precision, recall and balanced F-score
precision_test <- count_test_predicted_donations / counter_test
recall_test <- count_test_predicted_donations / number_donation_instances_whole_test
f_balanced_score_test <- 2 * (precision_test * recall_test) / (precision_test + recall_test)

# offset errors between the train and test statistics
error_precision <- abs(precision_train - precision_test)
error_recall <- abs(recall_train - recall_test)
error_f_balanced_scores <- abs(f_balanced_score_train - f_balanced_score_test)

# print the statistics for verification and validation
cat("Precision with training dataset: ", precision_train, "\n")
cat("Recall with training dataset: ", recall_train, "\n")
cat("Precision with testing dataset: ", precision_test, "\n")
cat("Recall with testing dataset: ", recall_test, "\n")
cat("The F-balanced score with training dataset: ", f_balanced_score_train, "\n")
cat("The F-balanced score with testing dataset: ", f_balanced_score_test, "\n")
cat("Error in precision: ", error_precision, "\n")
cat("Error in recall: ", error_recall, "\n")
cat("Error in F-balanced scores: ", error_f_balanced_scores, "\n")

# confusion matrix (true positives, false positives, false negatives, true negatives)
# true positives for train: count_train_predicted_donations
# false positives for train: false_positives_train_counter
false_negatives_for_train <- number_donation_instances_whole_train - count_train_predicted_donations
true_negatives_for_train <- (nrow(matrix_train) - number_donation_instances_whole_train) - false_positives_train_counter
collect_train <- c(false_positives_train_counter, true_negatives_for_train,
                   count_train_predicted_donations, false_negatives_for_train)
# same for test
false_negatives_for_test <- number_donation_instances_whole_test - count_test_predicted_donations
true_negatives_for_test <- (nrow(matrix_test) - number_donation_instances_whole_test) - false_positives_test_counter
collect_test <- c(false_positives_test_counter, true_negatives_for_test,
                  count_test_predicted_donations, false_negatives_for_test)

TrueCondition <- factor(c(0, 0, 1, 1))
PredictedCondition <- factor(c(1, 0, 1, 0))
# plot the confusion matrix for train
df_conf_mat_train <- data.frame(TrueCondition, PredictedCondition, collect_train)
ggplot(data = df_conf_mat_train, mapping = aes(x = PredictedCondition, y = TrueCondition)) +
  geom_tile(aes(fill = collect_train), colour = "white") +
  geom_text(aes(label = sprintf("%1.0f", collect_train)), vjust = 1) +
  scale_fill_gradient(low = "blue", high = "red") +
  theme_bw() + theme(legend.position = "none")
# plot the confusion matrix for test
df_conf_mat_test <- data.frame(TrueCondition, PredictedCondition, collect_test)
ggplot(data = df_conf_mat_test, mapping = aes(x = PredictedCondition, y = TrueCondition)) +
  geom_tile(aes(fill = collect_test), colour = "white") +
  geom_text(aes(label = sprintf("%1.0f", collect_test)), vjust = 1) +
  scale_fill_gradient(low = "blue", high = "red") +
  theme_bw() + theme(legend.position = "none")

# MCC = (TP*TN - FP*FN) / sqrt((TP+FP)*(TP+FN)*(FP+TN)*(TN+FN)) for the train values
mcc_train <- (count_train_predicted_donations * true_negatives_for_train -
              false_positives_train_counter * false_negatives_for_train) /
  sqrt((count_train_predicted_donations + false_positives_train_counter) *
       (count_train_predicted_donations + false_negatives_for_train) *
       (false_positives_train_counter + true_negatives_for_train) *
       (true_negatives_for_train + false_negatives_for_train))
cat("Matthews Correlation Coefficient for train: ", mcc_train, "\n")
# MCC for the test values
mcc_test <- (count_test_predicted_donations * true_negatives_for_test -
             false_positives_test_counter * false_negatives_for_test) /
  sqrt((count_test_predicted_donations + false_positives_test_counter) *
       (count_test_predicted_donations + false_negatives_for_test) *
       (false_positives_test_counter + true_negatives_for_test) *
       (true_negatives_for_test + false_negatives_for_test))
cat("Matthews Correlation Coefficient for test: ", mcc_test, "\n")
# print the MCC error between train and test
cat("Matthews Correlation Coefficient error: ", abs(mcc_train - mcc_test), "\n")

# Total = TP + TN + FP + FN for train and test
total_train <- count_train_predicted_donations + true_negatives_for_train +
  false_positives_train_counter + false_negatives_for_train
total_test <- count_test_predicted_donations + true_negatives_for_test +
  false_positives_test_counter + false_negatives_for_test
# totalAccuracy = (TP + TN) / Total
totalAccuracyTrain <- (count_train_predicted_donations + true_negatives_for_train) / total_train
totalAccuracyTest <- (count_test_predicted_donations + true_negatives_for_test) / total_test
# randomAccuracy = ((TN+FP)*(TN+FN) + (FN+TP)*(FP+TP)) / (Total*Total)
randomAccuracyTrain <- ((true_negatives_for_train + false_positives_train_counter) *
                        (true_negatives_for_train + false_negatives_for_train) +
                        (false_negatives_for_train + count_train_predicted_donations) *
                        (false_positives_train_counter + count_train_predicted_donations)) /
  (total_train * total_train)
randomAccuracyTest <- ((true_negatives_for_test + false_positives_test_counter) *
                       (true_negatives_for_test + false_negatives_for_test) +
                       (false_negatives_for_test + count_test_predicted_donations) *
                       (false_positives_test_counter + count_test_predicted_donations)) /
  (total_test * total_test)
# kappa = (totalAccuracy - randomAccuracy) / (1 - randomAccuracy)
kappa_train <- (totalAccuracyTrain - randomAccuracyTrain) / (1 - randomAccuracyTrain)
kappa_test <- (totalAccuracyTest - randomAccuracyTest) / (1 - randomAccuracyTest)
# print the kappa error
cat("Kappa error: ", abs(kappa_train - kappa_test), "\n")
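The MCC and Kappa computations above can be sanity-checked on a small hypothetical confusion matrix; the four counts below are made up for illustration, but the formulas are the same ones used in the appendix.

```r
# Toy confusion-matrix counts (hypothetical, for checking the formulas)
tp <- 90; fp <- 160; fn <- 40; tn <- 240
total <- tp + tn + fp + fn

# MCC = (TP*TN - FP*FN) / sqrt((TP+FP)*(TP+FN)*(FP+TN)*(TN+FN))
mcc <- (tp * tn - fp * fn) /
  sqrt((tp + fp) * (tp + fn) * (fp + tn) * (tn + fn))

# Kappa = (totalAccuracy - randomAccuracy) / (1 - randomAccuracy)
total_acc  <- (tp + tn) / total
random_acc <- ((tn + fp) * (tn + fn) + (fn + tp) * (fp + tp)) / (total * total)
kappa <- (total_acc - random_acc) / (1 - random_acc)

c(MCC = mcc, Kappa = kappa)   # both bounded above by 1
```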