Crime Type Classification using Neural Networks: A brief walkthrough

Nick Park
21 min read · May 19, 2019


Crime type classifications?

The objective of this article is to provide a brief walkthrough of crime type classification: predicting the most probable crime type at a certain location in Louisville at a certain time, using correlated features such as the premise type and ZIP code from a public dataset. Potentially, this type of work could benefit the police department by improving the crime reporting system. By collecting and inputting crime information, this solution categorizes the corresponding crime type so that the police can operate more efficiently.

The objective is to classify the crime type based on the given locational and temporal information.

Datasets

I will be using the public crime dataset available from the Louisville Metro Open Data Portal (https://data.louisvilleky.gov/dataset/crime-reports).

Understanding the dataset

After downloading the crime data files, which are in CSV format, we first need to understand our dataset and determine its useful features before doing any modeling.

The Louisville crime dataset contains well-structured data, meaning that it is easy to process, visualize, and analyze. For all of my work, I used Python and imported the data with Pandas as a dataframe.

The figure below shows a partial capture of the raw crime dataset.

The partial capture of the raw crime dataset in CSV format

To understand what each column means, the columns (or features) can be briefly described as below:

A total of 14 features is available for each crime record

I used the most recent nine years of data, from 2009 to 2017, although data was available for a longer period of time. This balanced the trade-off between the data size and the computing time of the neural networks.

Additionally, I decided to add another dataset: weather history records gathered from the National Oceanic and Atmospheric Administration (https://www.noaa.gov/). This weather data contains daily precipitation in Louisville from 2009 to 2017. The reason for using weather data in classifying crime types was that a positive relation between rain and some crime types has been reported (Sommer, Lee & Bind, 2018). It will be interesting to see if the weather has any correlation with the crime types in Louisville.

The weather dataset consisted of 3 features, as shown in the table below:

A total of 3 input features is available for each weather record

To narrow down the scope, I decided to convert the precipitation data into a flag representing whether or not it rained on each day. To do that, the PRCP column, which contained float values, was classified into a Boolean type using the table below:

Rain class based on the amount of precipitation
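
As a minimal sketch of this conversion (the file name and the any-measurable-precipitation threshold are my assumptions; the actual cut-off comes from the table above):

```python
import pandas as pd

# Hypothetical file name; PRCP holds daily precipitation as floats
weather = pd.read_csv("weather.csv", parse_dates=["DATE"])

# True if any measurable precipitation fell that day (assumed threshold)
weather["RAIN"] = weather["PRCP"] > 0.0
```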

By appending all the crime data CSV files from 2009 to 2017, together with the weather data, into a single dataset, a total of 201,242 samples was acquired.
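
A sketch of this step might look like the following; the per-year file naming is an assumption, and the weather dataframe is joined on the calendar date of each report:

```python
import glob
import pandas as pd

# Assumed layout: one CSV per year, e.g. crime_2009.csv ... crime_2017.csv
frames = [pd.read_csv(path) for path in sorted(glob.glob("crime_*.csv"))]
crime = pd.concat(frames, ignore_index=True)

# Attach the daily rain flag by the date portion of the report timestamp
crime["DATE"] = pd.to_datetime(crime["DATE_REPORTED"]).dt.normalize()
crime = crime.merge(weather[["DATE", "RAIN"]], on="DATE", how="left")
```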

Data Pre-processing

Once we have the dataset ready, we need to pre-process the raw data to make the dataset more suitable for a neural network.

First of all, some of the data columns are not needed since the information they give is redundant. For example, both “DATE_OCCURRED” and “DATE_REPORTED” contain temporal information. Since some entries of “DATE_OCCURRED” are missing when the occurrence date is unknown, I chose to drop the “DATE_OCCURRED” column.

Also, the format of “DATE_REPORTED” is “YYYY-MM-DD HH:MM:SS”, and this was re-formatted into three separate columns, “MONTH”, “DAY”, and “HOUR”, since I believe each of them has a different correlation.

In addition, some features, including “NIBRS_CODE”, “INCIDENT_NUMBER”, and “UCR_HIERARCHY”, encode the crime type within their values, meaning these three codes are derived directly from “CRIME_TYPE”. In other words, these columns can only be written after knowing the crime type; thus, they are dropped from the training data.

“ID” is nothing more than an index for each crime record; thus, this column can also be dropped. “BLOCK_ADDRESS” is dropped because its locational information is already included in the “ZIP_CODE” feature. A sketch of these pre-processing steps follows below.
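
A minimal pandas sketch of the steps above, assuming the column names match the dataset:

```python
# Split the report timestamp into the temporal features used later
dt = pd.to_datetime(crime["DATE_REPORTED"])
crime["MONTH"] = dt.dt.month
crime["DAY"] = dt.dt.day
crime["HOUR"] = dt.dt.hour

# Drop redundant or crime-type-leaking columns as discussed above
crime = crime.drop(columns=[
    "DATE_OCCURRED", "DATE_REPORTED", "NIBRS_CODE",
    "INCIDENT_NUMBER", "UCR_HIERARCHY", "ID", "BLOCK_ADDRESS",
])
```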

All of the features except for “DATE_REPORTED”, “DATE_OCCURRED” and “ID” are categorical features. Further data pre-processing is required and explained in the next section.

After the data preprocessing, there are 10 categorical features left in total, as listed below: [“LMPD_BEAT”, “UCR_DESC”, “LMPD_DIVISION”, “ZIP_CODE”, “ATT_COMP”, “PREMISE_TYPE”, “CITY”, “MONTH”, “HOUR”, “DAY”]

Data imbalance

In the dataset, there are a total of 16 crime types, and the counts of each are plotted in the figure below. Looking at the bar plot, notice that the data is highly skewed! “THEFT/LARCENY” is the type with the largest occurrence, while some classes like “HOMICIDE”, “ARSON”, and “DUI” have extremely few occurrences. The ratio between the largest and smallest classes was 1812:1.

The skew in the training data causes some potential problems. One is that it is difficult to get representation across classes when making validation or test sets, because the counts of some classes are extremely low. More importantly, the accuracy of the classification model could be misleading, because a model might always output one of the top three classes and still achieve very high accuracy.

Overcoming the data imbalance

To reduce the data skew, I used two approaches.

First, similar classes were combined to reduce the total number of classes. “HOMICIDE”, “SEX CRIMES”, and “ASSAULT” were re-categorized as “VIOLENT”, based on the definition of a violent crime as any offense against the person (“Non-Violent vs. Violent Crimes”, 2019).

Second, “MOTOR VEHICLE THEFT” and “VEHICLE BREAK-IN/THEFT” were merged into a single “VEHICLE BREAK-IN” category, since both were vehicle-related. The second-largest class, “DRUGS/ALCOHOL VIOLATIONS”, was kept unchanged. Then, the top five re-defined crime types were selected as output classes. After re-grouping the data, the skew of the training dataset was improved.

The number of occurrences of each crime type after re-grouping into five categories
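
A dictionary-based re-mapping in pandas might look like this (the label strings are assumed to match the dataset):

```python
# Merge similar classes into broader categories
regroup = {
    "HOMICIDE": "VIOLENT",
    "SEX CRIMES": "VIOLENT",
    "ASSAULT": "VIOLENT",
    "MOTOR VEHICLE THEFT": "VEHICLE BREAK-IN",
    "VEHICLE BREAK-IN/THEFT": "VEHICLE BREAK-IN",
}
crime["CRIME_TYPE"] = crime["CRIME_TYPE"].replace(regroup)

# Keep only the five most frequent re-defined crime types
top5 = crime["CRIME_TYPE"].value_counts().nlargest(5).index
crime = crime[crime["CRIME_TYPE"].isin(top5)]
```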

For further improvement, random under-sampling was performed to balance the dataset. Under-sampling was chosen because over-sampling might cause overfitting by duplicating samples from the given dataset, and the crime dataset was large enough to afford under-sampling.
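
A minimal sketch of random under-sampling with pandas, down-sampling every class to the size of the smallest one (the class column name is assumed):

```python
# Randomly under-sample every class to the size of the smallest class
n_min = crime["CRIME_TYPE"].value_counts().min()
balanced = (
    crime.groupby("CRIME_TYPE", group_keys=False)
         .apply(lambda g: g.sample(n=n_min, random_state=42))
         .reset_index(drop=True)
)
```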

Synthetic sampling techniques such as SMOTE were not considered in this work because all the features were categorical. SMOTE works by interpolating between nearby samples of the minority class, which requires a meaningful distance between feature values (Chawla et al., 2002). A SMOTE variant for categorical-only data was not available in the imbalanced-learn library at the time of writing.

The number of occurrences of each crime type after performing random under-sampling

Dividing the data into training, validation and test set

After the under-sampling, the dataset contained a total of 117,310 samples. It was then divided into training, validation, and test sets as below:

The dataset with 117,310 samples split into training, validation and test sets
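
A sketch using scikit-learn; the 70/15/15 ratios here are an assumption, since the actual split comes from the table above:

```python
from sklearn.model_selection import train_test_split

X = balanced.drop(columns=["CRIME_TYPE"])
y = balanced["CRIME_TYPE"]

# First carve out 30%, then split it evenly into validation and test
X_train, X_tmp, y_train, y_tmp = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(
    X_tmp, y_tmp, test_size=0.50, stratify=y_tmp, random_state=42)
```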

Data correlation analysis

To determine which input features are correlated with the crime types, and by how much, a statistical measure of association for categorical features was needed.

Since all of the features were categorical, Cramér's V was used to measure the level of association between two categorical variables.

Cramér's V builds on the Chi-Square test to measure how strongly two categorical features are associated, outputting a value between 0 and 1. The closer the value is to 1, the stronger the association.

Cramér's V, denoting the level of association between two categorical features, is defined as:

V = \sqrt{\frac{\chi^2}{N \cdot (k - 1)}}

where \chi^2 is Pearson's Chi-Square statistic, N is the total sample size, and k is the smaller of the numbers of categories of the two features (Argyrous, 1997).

Pearson’s Chi-Square statistic measures the independence of the two categorical features to check whether distributions of categorical features differ from one another.

It is computed as:

\chi^2 = \sum \frac{(f_o - f_e)^2}{f_e}

where f_o is the observed frequency (counts) and f_e is the expected frequency (counts).
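
As a small sketch of how this can be computed in Python, using scipy's chi-square test over a contingency table (the function and variable names are my own):

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

def cramers_v(x: pd.Series, y: pd.Series) -> float:
    """Cramer's V association between two categorical series (0 to 1)."""
    table = pd.crosstab(x, y)                 # contingency table of counts
    chi2 = chi2_contingency(table)[0]         # Pearson's Chi-Square statistic
    n = table.to_numpy().sum()                # total sample size
    k = min(table.shape)                      # smaller number of categories
    return float(np.sqrt(chi2 / (n * (k - 1))))
```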

The heatmap below shows the Cramér's V association between each pair of categorical features in the dataset.

Referring to the first row to determine the correlation of features with crime types, any association values less than 0.10 were considered to have no or very weak correlations as they were close to 0. Excluding the features with very weak or zero correlations, the features including “LMPD_BEAT”, “LMPD_DIVISION”, “ZIP_CODE”, “ATT_COMP”, “PREMISE_TYPE” and “HOUR” were selected. Among the selected features, “PREMISE_TYPE” had the highest correlation with crime types.

More data features?!

On top of the given dataset, I chose to create a few more features. This was because the initial trial of the MLP model, shown later, resulted in a poor accuracy of 0.5205. Additional features with relatively high correlations to the crime type could improve accuracy.

By grouping any crime entries directly involving money, a new “FINANCIAL DAMAGE” feature was created. Furthermore, any crimes harming people were categorized as “HUMAN LOSS”. Other temporal and locational information was also re-grouped to create new features, as summarized below:

The summary of the newly created features by categorizing with conditions
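
The exact grouping conditions live in the table above; as an illustration only, a couple of these derived features might be built like this (the premise labels and the night-time window are assumptions):

```python
# Illustrative conditions only; the real definitions come from the table above
NIGHT_HOURS = set(range(21, 24)) | set(range(0, 6))
RESIDENTIAL = {"RESIDENCE / HOME", "APARTMENT / CONDO"}  # assumed labels

crime["RESIDENTIAL AREA"] = crime["PREMISE_TYPE"].isin(RESIDENTIAL)
crime["TIME"] = crime["HOUR"].isin(NIGHT_HOURS).map({True: "NIGHT", False: "DAY"})
crime["RESIDENTIAL AT NIGHT"] = crime["RESIDENTIAL AREA"] & (crime["TIME"] == "NIGHT")
```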

After adding these newly created features, the Cramér's V correlations were re-computed, as shown in the heatmap below.

The heatmap of the Cramér's V correlations between categorical features for the crime dataset after adding more features

Features including “FINANCIAL DAMAGE”, “HUMAN LOSS”, “RESIDENTIAL AREA”, “SHOP AND STORES”, “TIME”, “CROWDED AREA”, “RESIDENTIAL AT NIGHT”, “NEAR HWY OR ROAD”, and “DOWNTOWN AT NIGHT” were found to have relatively high correlations with the crime type, each with an association level greater than 0.10.

Based on the above analysis, a total of 15 categorical features were finally selected for training the classifier model.

The 15 selected features are listed below;

[“LMPD_BEAT”, “LMPD_DIVISION”, “ZIP_CODE”, “ATT_COMP”, “PREMISE_TYPE”, “HOUR”, “FINANCIAL DAMAGE”, “HUMAN LOSS”, “RESIDENTIAL AREA”, “SHOP AND STORES”, “TIME”, “CROWDED AREA”, “RESIDENTIAL AT NIGHT”, “NEAR HWY OR ROAD”, “DOWNTOWN AT NIGHT”]

Modeling

Now, the data is fully ready! The next step is to define the performance metrics so that we can tell whether our model performs well or not.

Performance Metrics

a) Loss function

In this work, cross-entropy is used to measure how closely the prediction matches the true output, by measuring the distance between the model's predicted distribution and the correct classification output.

For multi-class classification problems, the cross-entropy loss function is used along with the SoftMax activation function. The SoftMax activation is needed in order to predict a probability for each class.

This loss can then be used for backpropagation. The model output is given by:

\hat{y}_i = \mathrm{softmax}\left(f(x_i; W, b)\right)

where x_i is an input record and f is the nonlinear function represented by the neural network, with weight matrices W and bias vectors b for each layer. The SoftMax activation function at the output layer produces a vector of probabilities over the classes.

With a SoftMax layer, the cross-entropy loss for a single record is given by:

L_i = -\sum_{c=1}^{C} y_{i,c} \log \hat{y}_{i,c}

where y_{i,c} is 1 if crime type c is the true crime type of record i, and 0 otherwise. The total loss over N records is then:

L = \frac{1}{N} \sum_{i=1}^{N} L_i

The loss is back-propagated through the network by calculating gradients to update the weights and biases.

b) Accuracy

Accuracy is also used to compare how well the model finds correct predictions; however, the accuracy formula is not differentiable, so it cannot be used for backpropagation. Accuracy is still a good metric for evaluating model performance. One thing to note is that higher accuracy does not always mean better performance if the dataset is skewed. For instance, on a dataset with 99% of instances in one class and only 1% in the other, predicting that every instance belongs to the majority class yields an accuracy of 99%. This is not a problem in this work because the data has been re-sampled to be balanced, so accuracy can be used for model evaluation.

c) F1 Score

In addition to accuracy, the F1 score, derived from a confusion matrix, can also be used to describe the performance of a classification model. The entries of the confusion matrix are:

  • True positive (TP): correct positive prediction
  • False positive (FP): incorrect positive prediction
  • True negative (TN): correct negative prediction
  • False negative (FN): incorrect negative prediction

From these, precision and recall are defined as:

\text{precision} = \frac{TP}{TP + FP}, \qquad \text{recall} = \frac{TP}{TP + FN}

The F1 score is the harmonic mean of precision and recall:

F_1 = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}

The higher these values are, the better the model performs.

There are two types of F1 scores which are macro and micro. For macro, F1 is calculated separately for each class and averaged. For micro, F1 is calculated over all entries and it takes the class imbalance into account. In this work, because the classes in the dataset are resampled to be balanced, macro F1 is used to give equal importance to each class.

d) AUC Score

As another performance metric for classification, the Receiver Operating Characteristic (ROC) curve is considered. The ROC curve plots the true positive rate against the false positive rate. The Area Under the Curve (AUC) of the ROC curve summarizes how well the model separates positives from negatives: the closer the AUC score is to 1, the better the model distinguishes between them.

Example of ROC Curve
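
As a sketch of how these metrics can be computed with scikit-learn (the variable names are assumptions: y_true holds integer class labels and y_prob the per-class probabilities output by a model):

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score

y_pred = np.argmax(y_prob, axis=1)                    # most probable class
acc = accuracy_score(y_true, y_pred)
macro_f1 = f1_score(y_true, y_pred, average="macro")  # equal weight per class
auc = roc_auc_score(y_true, y_prob, multi_class="ovr", average="macro")
```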

Baseline classifiers

When modeling a neural network classifier, we need a baseline against which to compare our model's performance. As a baseline classifier, a Multinomial Naïve Bayes classifier was used on the crime dataset of categorical features. This classifier is based on Bayes' theorem, which gives the probability of each class as:

P(y \mid X) = \frac{P(X \mid y) \, P(y)}{P(X)}

where y is a class and X represents the features, given as:

X = (x_1, x_2, \ldots, x_n)

Using the Naïve Bayes classifier from the Sklearn library in Python, the accuracy was 64.09% and the F1 score was 0.64. This performance, along with the AUC score for each class, was used as a baseline and compared with the other models' performance in later sections. The Naïve Bayes classifier assumes that the features are independent, but the Cramér's V heatmap showed that some of the input features are correlated with one another; thus, improvements in performance are expected from models that can account for the relationships between input features.

As another baseline model, a basic multi-class logistic regression classifier was used. The Logistic Regression (LR) classifier from the Sklearn library uses the one-vs-rest scheme to extend binary classification to the multi-class case: a separate binary classifier computes a probability for each output class, and the class with the highest score is chosen as the output. The LR model achieved an accuracy of 66.4%, about 2.3 percentage points higher than the Naïve Bayes classifier.
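
A minimal sketch of both baselines (assuming X_train_enc and X_test_enc are one-hot encoded feature matrices):

```python
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression

nb = MultinomialNB().fit(X_train_enc, y_train)
lr = LogisticRegression(multi_class="ovr", max_iter=1000).fit(X_train_enc, y_train)

print("Naive Bayes accuracy:", nb.score(X_test_enc, y_test))
print("Logistic Regression accuracy:", lr.score(X_test_enc, y_test))
```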

Multi-layer perceptron classifier

As a multi-layered extension of the logistic regression classifier, a feed-forward neural network was designed in this work. It consisted of a first dense layer with 256 neurons, three hidden layers with 128 nodes each, and an output layer with 5 nodes.

As seen in the figure below, the first dense layer takes an input of size 168, which is the number of one-hot encoded input features. Since all input features from the dataset were categorical or binary, the one-hot encoding scheme was used to convert them into the numeric vector form required for neural network inputs.

The output of this first dense layer was set to 256. Most of the parameters were determined by hyperparameter optimization, while some, like the dropout rate, were determined empirically. The output layer had a size of 5, equal to the number of output classes. The entire implementation was done using Keras.

The fully connected feedforward model. The model consisted of 3 hidden layers along with an input and output layer. Dropout layers were added to overcome overfitting. The sizes of the input and output layers were determined by the number of features and classes, respectively. The number of neurons in the hidden layers was configured during hyperparameter optimization
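
A minimal Keras sketch of the architecture just described; the dropout placement and the 0.6 rate follow the later sections, so treat this as an outline rather than the exact training script:

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout

model = Sequential([
    Dense(256, activation="relu", input_shape=(168,)),  # 168 one-hot inputs
    Dropout(0.6),
    Dense(128, activation="relu"),
    Dropout(0.6),
    Dense(128, activation="relu"),
    Dropout(0.6),
    Dense(128, activation="relu"),
    Dropout(0.6),
    Dense(5, activation="softmax"),                     # 5 crime-type classes
])
model.compile(optimizer="adam",
              loss="categorical_crossentropy",
              metrics=["accuracy"])
```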

Choosing an activation function

In neural network models, activation functions determine whether the feature represented by each neuron is activated and introduce non-linearity into the network. There are various types of activation functions, such as Sigmoid, Tanh, and ReLU. Ertam and Aydin showed that ReLU outperforms other activation functions (Tanh, ELU, Sigmoid, SoftPlus, and SoftSign) in classifying the MNIST dataset. They compared the classification accuracy of a CNN with a SoftMax classifier across these activation functions, and 98.43% accuracy was achieved with ReLU (Ertam & Aydin, 2017). Thus, for the crime type classification in this work, the ReLU activation function is a reasonable choice for the hidden layers.

Choosing an optimizer for backpropagation

To train neural networks, gradient descent is performed, with gradients computed via backpropagation. After the forward propagation, the error between the expected output and the predicted value is calculated with the defined loss function. Then, the gradient of the loss over each batch with respect to every weight in the network is calculated, and the weights are updated so that the forward propagation produces a smaller error in the next epoch.

In order to improve the convergence of gradient descent, adaptive step size methods, equivalent to adaptive learning rate methods, were introduced. One of the most popular, the Adaptive Moment Estimation (Adam) optimizer, was proposed by Kingma and Ba; it computes individual learning rates for different parameters (Kingma & Ba, 2014). Mini-batch gradient descent with the Adam optimizer is widely used along with ReLU (Agarap, 2019). Thus, the Adam optimizer was chosen for the baseline model in this project.

Choosing a weight initializer

Neural networks are trained by starting with random initial parameter values and iteratively updating the parameters through backpropagation; thus, the way initial parameters are chosen affects the optimization. It has also been emphasized that proper initialization of the weights is critical to the convergence of neural networks (Kumar, 2017).

For the sigmoid function, the network loses non-linearity if the initial weights are too close to 0, and if the initial weights are too large, the variance becomes large, causing the gradients to approach zero. To resolve the issue of choosing initial weights, the Xavier initialization was proposed, which keeps the variance the same for the input and output of each layer (Glorot & Bengio, 2010). However, it was shown that different initializers lead to drastic differences in convergence for activation functions that are non-differentiable at 0 (He, Zhang, Ren & Sun, 2015). It was also experimentally illustrated that Xavier initialization results in a much smaller variance of inputs as the neural network gets deeper, and that a 30-layer neural network does not converge with Xavier initialization while it does with He initialization (Kumar, 2017).
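
In Keras, swapping the default Glorot (Xavier) initializer for He initialization is a one-argument change; a small sketch:

```python
from tensorflow.keras.layers import Dense

# He initialization pairs well with ReLU activations;
# Keras defaults to glorot_uniform when not specified
layer = Dense(128, activation="relu", kernel_initializer="he_normal")
```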

Training the model

The figure below shows the training and validation accuracy and loss as training progresses. Overfitting was easily identified: the loss on the validation dataset began to increase after about 2 epochs, while the loss on the training dataset kept decreasing.

Dropout layer to overcome overfitting

Dropout layers were added after each hidden layer to prevent overfitting. The dropout rate was determined empirically as 0.6 through trial and error. In Keras, a dropout rate of 0.6 means that 60% of the nodes are randomly ignored. As seen in the figure below, the accuracy and loss on the validation dataset followed similar trends to the training dataset, and no overfitting was observed. The accuracy of both training and validation converged to about 0.67 after about 60 epochs, and the loss converged to about 0.9 after about 70 epochs.

For hyperparameter optimization, a tuning tool called Talos was used to iteratively record the accuracy under different parameter settings. Using the tool, the training process was repeated for each combination of configurations, and the results were stored in a spreadsheet file.

The figure below was plotted to analyze performance based on the Talos results. Performance was best with a learning rate of 0.0001. The loss improved the most with 3 hidden layers of 128 neurons each, and performance was better in most cases with a batch size of 256. While producing these optimization plots, the number of epochs was kept at 100, and the dropout value was determined empirically. After applying the tuned parameters, the accuracy on the test dataset was 0.6720 and the loss was 0.8424.
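
A sketch of how such a scan can be set up with Talos; the grid values and the builder below are assumptions (Talos expects a build-and-fit function that returns the Keras history and model, and depending on the Talos version an experiment name argument may also be required):

```python
import talos
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.optimizers import Adam

# Assumed parameter grid covering the ranges discussed in the text
p = {"lr": [0.01, 0.001, 0.0001],
     "neurons": [64, 128, 256],
     "batch_size": [128, 256, 512]}

def crime_model(x_train, y_train, x_val, y_val, params):
    model = Sequential([
        Dense(256, activation="relu", input_shape=(x_train.shape[1],)),
        Dropout(0.6),
        Dense(params["neurons"], activation="relu"),
        Dense(5, activation="softmax"),
    ])
    model.compile(optimizer=Adam(learning_rate=params["lr"]),
                  loss="categorical_crossentropy", metrics=["accuracy"])
    history = model.fit(x_train, y_train, validation_data=(x_val, y_val),
                        batch_size=params["batch_size"], epochs=100, verbose=0)
    return history, model

scan = talos.Scan(x=x_train, y=y_train, params=p, model=crime_model)
```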

Encoding Schemes

All the features in this work are categorical or binary, and these categorical features had to be encoded into numeric representations, since neural networks require numeric inputs. Initially, one-hot encoding was applied, but other encoding schemes are available. Potdar, Pardawala, and Pai showed that different encoding techniques lead to different classification accuracies for neural network models (Potdar, Pardawala, and Pai, 2017). In their experiment, the Sum encoding and Backward Difference encoding schemes outperformed one-hot encoding by 4% in accuracy. Based on this paper, to find the optimal encoding scheme for this particular dataset, Sum encoding and Backward Difference encoding were tried and compared with one-hot encoding.
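
One way to produce the three encodings is the category_encoders package (my choice here, not named in the original work), which implements all three schemes behind a common fit/transform interface:

```python
import category_encoders as ce

# categorical_cols is assumed to hold the 15 selected feature names
encoders = {
    "one-hot": ce.OneHotEncoder(cols=categorical_cols),
    "sum": ce.SumEncoder(cols=categorical_cols),
    "backward-difference": ce.BackwardDifferenceEncoder(cols=categorical_cols),
}
encoded = {name: enc.fit_transform(X_train) for name, enc in encoders.items()}
```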

The table below shows the performance comparison of MLPs with the different encoding schemes. The accuracy and F1 score were highest with Sum encoding, 0.35 percentage points above one-hot encoding.

Referring to the figure below, which shows the AUC scores of the ROC curves for each class, one-hot encoding produced the highest AUC score for three out of five crime types. Sum encoding performed better in terms of overall model accuracy because it improved classification of the “THEFT/LARCENY” class. The MLP with Backward Difference encoding performed the worst of the three variations.

The comparison of the AUC scores for each class when varying categorical feature encoding schemes. Sum encoding was higher in AUC score on “Theft/Larceny” and “Violent” while one-hot encoding was better on other types

MLP with Embedding layer instead of feature encoding

The main drawback of one-hot or Sum encoding is that every value of a categorical feature is treated independently of the others, ignoring any intrinsic relations between values. Entity embedding layers overcome this drawback by learning a set of weights for each categorical column. These weights are updated during backpropagation, so that similar categories end up closer to one another in terms of their vector representations (Guo & Berkhahn, 2016).

One advantage of using embeddings is that the number of dimensions needed to represent a categorical feature can be reduced compared to one-hot or Sum encoding, where a feature must be expanded into as many columns as it has unique values. With entity embedding layers, each categorical feature is instead represented as a dense, lower-dimensional vector.

More importantly, by optimizing the weight matrices in the embedding layers, similar values move close to each other in the embedding space, revealing relations between the categorical values. This has been shown to boost neural network performance: Guo and Berkhahn reported that the Mean Absolute Percentage Error (MAPE) dropped from 0.101 to 0.093 after using entity embeddings instead of one-hot encoding (Guo & Berkhahn, 2016).

As seen in the figure below, the MLP model was re-designed with entity embedding layers. Each categorical feature was fed into an embedding layer whose output dimension was half the dimension of its one-hot or Sum encoded form. All the embedding outputs were then concatenated and followed by fully connected dense layers, with the input dimension of the dense stack reduced from 168 to 105. The fully connected layers consisted of 256 neurons each, except for the 3rd layer with 128 neurons, and each used the ReLU activation function with a dropout rate of 0.6. The output layer used SoftMax to produce the probability of each class.

The feedforward fully connected model with entity embedding. The size of the input dimension was reduced significantly
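
A Keras sketch of this design using the functional API; the per-feature embedding sizes follow the halving rule described above, and the inputs are assumed to be integer-encoded categories:

```python
from tensorflow.keras.layers import (Input, Embedding, Flatten,
                                     Concatenate, Dense, Dropout)
from tensorflow.keras.models import Model

inputs, embedded = [], []
for col in categorical_cols:
    n_values = int(X_train[col].nunique())   # cardinality of the feature
    inp = Input(shape=(1,))
    emb = Embedding(input_dim=n_values,
                    output_dim=max(1, n_values // 2))(inp)  # half the one-hot size
    inputs.append(inp)
    embedded.append(Flatten()(emb))

x = Concatenate()(embedded)
for units in (256, 256, 128):                # dense stack described above
    x = Dense(units, activation="relu")(x)
    x = Dropout(0.6)(x)
out = Dense(5, activation="softmax")(x)

model = Model(inputs=inputs, outputs=out)
model.compile(optimizer="adam", loss="categorical_crossentropy",
              metrics=["accuracy"])
```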

F1 score comparison

The F1 score comparison on each class with different classifiers. The F1 score of two MLP models outperformed both LR and Naïve Bayes classifiers

The above plot shows the comparison of F1 score for each classifier. The score of MLP models was noticeably higher than that of the Naïve Bayes and LR classifiers, and the score of Naïve Bayes classifier was the lowest for all output classes. The MLP with entity embedding layers significantly outperformed all other classifiers on the “DRUGS/ALCOHOL VIOLATION” and “BURGLARY” classes.

AUC score comparison

The AUC score comparison on each class with different classifiers. The AUC score of the MLP with entity embedding layers outperformed the other classifiers for most of the classes except for “DRUGS/ALCOHOL VIOLATION”

The above graph compares the AUC score of each classifier per class. Similar to the F1 score comparison, the AUC scores of the MLP models outperformed the Naïve Bayes and LR classifiers for most classes, except for “DRUGS/ALCOHOL VIOLATION”. One noticeable point is that the AUC score of the LR classifier was the highest for the “DRUGS/ALCOHOL VIOLATION” class, which differed from the F1 score result. The AUC score of the Naïve Bayes classifier was the lowest for all classes. Based on this comparison, the overall performance of the MLP with entity embedding layers was the highest among all classifiers tested in this work.

Conclusively!

· The MLP with entity embedding layers showed the best performance in terms of accuracy, average F1 score, and the AUC score of each output class.

· Specifically, referring to the AUC of each class, a 1.3% to 3.1% improvement was observed for every class when comparing the MLP with entity embedding layers against the Naïve Bayes classifier.

· For each classifier, the accuracy and the average F1 score were nearly identical, which was expected since the dataset was re-sampled to be perfectly balanced.

· Out of all classes, all the classifiers commonly showed the best classification performance on the class of “VEHICLE BREAK-IN” and the least for “DRUGS/ALCOHOL VIOLATION”. It could be improved by adding more features like whether there was a liquor store nearby.

To wrap up,

In this article, neural network models were designed and studied to classify crime types based on categorical input data containing spatial and temporal information about crime occurrences. The crime dataset from 2009 to 2017 was acquired from the Louisville Metro Open Data Portal. To remove the skew in the dataset, re-grouping and re-sampling techniques were applied: the crime types were re-grouped into the top five types, and random under-sampling was performed to make the dataset perfectly balanced. For the classifier, a fully connected feed-forward model was developed and its hyperparameters, including the learning rate, the number of neurons in the hidden layers, the batch size, and the dropout rate, were optimized to avoid overfitting and increase accuracy. To improve classification accuracy, feature analysis was done using Cramér's V to select correlated features and to create additional ones. As part of the optimization, different encoding schemes were tested to see their effect on model performance. Finally, an entity embedding layer was added to take the relations between category values into account during training. As a result, the accuracy of the MLP model with entity embedding layers was 67.78%, higher than that of the Naïve Bayes classifier by about 3.7 percentage points.

References

Louisville Metro Open Data Portal, Crime Report, Retrieved Feb 20, 2019,
from https://data.louisvilleky.gov/dataset/crime-reports

Sommer, A. J., Lee. M., & Bind, M. C. (2018). Comparing apples to apples: an environmental criminology analysis of the effects of heat and rain on violent crimes in Boston. Palgrave Communications 4, Article number: 138. Retrieved from https://www.nature.com/articles/s41599-018-0188-3

Chawla, N. V., Bowyer, K. W., Hall, L. O., & Kegelmeyer, W. P. (2002). SMOTE: Synthetic Minority Over-sampling Technique. Journal of Artificial Intelligence Research 16, 321–357. Retrieved from https://arxiv.org/pdf/1106.1813.pdf

LegalMatch, Non-Violent vs. Violent Crimes, Retrieved Apr 4, 2019, from https://www.legalmatch.com/law-library/article/non-violent-vs-violent-crimes.html

Argyrous, G. (1997). Measures of association for nominal data. London, Palgrave.

Kadar, C., & Pletikosa, I. (2018). Mining large-scale human mobility data for long-term crime prediction. EPJ Data Science, 7(1), 26.

Kingma, D. P., & Ba, J. (2014). Adam: A Method for Stochastic Optimization. arXiv preprint arXiv:1412.6980.

Agarap, A. B. M. (2019). Deep Learning using Rectified Linear Unit (ReLU). arXiv preprint arXiv:1803.08375.

Kumar, S. K. (2017). On weight initialization in deep neural networks. arXiv preprint arXiv:1704.08863v2.

Ertam, F., & Aydin, G. (2017). Data Classification with Deep Learning using TensorFlow. 2nd International Conference on Computer Science and Engineering (UBMK), IEEE. Retrieved from https://ieeexplore.ieee.org/document/8093521

Ku, C. H., & Leroy, G. A. (2014). A decision support system: Automated crime report analysis and classification for e-government. Government Information Quarterly, 31(4), 534–544. Retrieved from https://doi.org/10.1016/j.giq.2014.08.003

Glorot, X., & Bengio, Y. (2010). Understanding the difficulty of training deep feedforward neural networks. Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, PMLR 9:249–256.

Wang, B., Luo, X., Zhang, F., Yuan, B., Bertozzi, A. L., & Brantingham, P. J. (2018). Graph-Based Deep Modeling and Real Time Forecasting of Sparse Spatio-Temporal Data. arXiv preprint arXiv:1804.00684.

He, K., Zhang. X., Ren. S., & Sun, J. (2015). Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification. IEEE International Conference on Computer Vision.

Potdar, K., Pardawala, T. S., & Pai, C. D. (2017). A Comparative Study of Categorical Variable Encoding Techniques for Neural Network Classifiers. International Journal of Computer Applications, 175(4), 0975–8887

Guo, C., & Berkhahn, F. (2016). Entity Embeddings of Categorical Variables. arXiv preprint arXiv: 1604.06737v1. Retrieved from https://arxiv.org/pdf/1604.06737.pdf
