04-bullets3.Rmd

\addtocounter{chapter}{1}\setcounter{section}{0}\specialchapt{CHAPTER 4. A MODERN BULLET MATCHING DATABASE AND WEB APPLICATION}

\begin{center}

A paper to be submitted to the \textbf{Journal of Forensic Science}. \\
Eric Hare, Heike Hofmann, Alicia Carriquiry

\textbf{Abstract}

\end{center}

Bullet matching is a process used to determine whether two bullets have been fired from the same gun barrel. While traditionally a manual process performed by trained forensic examiners, recent work has been done to add statistical validity and objectivity to the procedure. In this paper, we build upon the algorithms explored in Automatic Matching of Bullet Lands by describing a database structure which tracks the parameters used through the course of the algorithm's run, and allows for the seamless replication of results by researchers, adding a layer of reproducibility to the bullet matching process. Finally, we describe two web applications, one intended for use by forensic examiners, and one intended for use by developers of bullet matching algorithms, which utilize the database to allow bullet matching to be done more efficiently and seamlessly.

\newpage

# Background

The need for advancements in terms of scientific objectivity and reproducibility of forensic methods is well known. Note, for example, the recent report by the President's Council of Advisors on Science on Technology (PCAST) [@pcast2016]. The report references a number of areas of common practice in the field in which subjectivity is far too common, including but not limited to fingerprint analysis, bitemark analysis and firearms analysis.

Work has been done to address these concerns and add objectivity to these procedures. In the case of firearms analysis, the focus of this paper, some examples include @vorburger:2011, @thompson:2013, and @riva:2014. Our own work, "Automatic Matching of Bullet Lands", describes procedures used to produce an estimate of the probability of a match between two bullet lands. It does so by deriving a number of features, some from the literature and some original, and computing these features on pairs of reference bullets from the NIST Ballistics Toolmark Research Database (NBTRD). The algorithms used are published as open-source R code available in the package `bulletr` [@bulletr]. In spite of these steps towards transparency, however, the process of duplicated and assessing the performance of the algorithm in hopes of improving predictive accuracy was cumbersome in a number of ways:

1. Doing so requires specialized statistical software (specifically, R and associated R packages)
2. Computing statistics on all pairs of bullet lands is a time consuming process even on high-powered machines (on the order of several days)
3. Updates to our `bulletr` package, or any package dependencies of `bulletr`, may change the results such that our findings are not completely reproducible even if each step is correctly followed

In this paper, we add a new layer of reproducibility to the algorithms to allow for forensic scientists, statisticians, and other interested researchers to duplicate and iterate on the results in a seamless fashion. We do so by introducing a new database structure that supplements the NIST database by storing all necessary parameters and intermediate results needed to arrive at a matching probability between two bullet lands. We describe the structure from a technical perspective, and then describe the front-end and back-end application structure which utilizes the database to provide results to the researchers. Finally, we provide a case study analysis on features of bullet land pairs as an example of the capabiilities of this structure when it is leveraged. Reproducible R [@R] code used to access the database is provided as a convenience to future researchers working in this domain.

# Database Structure

Figure \ref{fig:database_schema} displays the database schema along with links between the relevant id columns. This structure provides the necessary links between the raw input data, and the processed signatures used to compute features and ultimately provide matching probabilities between pairs of lands. This diagram will be explained in depth in this section.

\begin{figure}[H]
\centering
\includegraphics[width=\linewidth]{images/schema.png}
\caption{A schematic of the database.}
\label{fig:database_schema}
\end{figure}

We can connect to the database using the `dplyr` package.

```{r, message=FALSE, cache=FALSE}
library(xtable)
library(RMySQL)
library(tidyverse)
library(bulletr)
library(gridExtra)
library(randomForest)
library(caret)
library(broom)
library(matrixcalc)

dbname <- "bullets"
user <- "buser"
password <- readLines("buser_pass.txt")
host <- "127.0.0.1"
port <- 3306

my_db <- src_mysql(dbname, host, port, user, password)
```

The remainder of this section will walk through the most relevant database tables, and include reproducible R code for accessing, parsing, and displaying the data.

## Data

The data table is essentially a mirror of the bullets stored in the NBTRD. It currently includes a long-form version of the two Hamby bullet sets (Set 252 and Set 44) [@hamby:2009] and the Cary Persistence study [@cary]. A sample of 20 rows of this table can be seen in Table \ref{tab:data}. The land id column identifies the bullet land under consideration. The x coordinate is the location along the shorter axis, while the y coordinate is along the longer axis. The value column represents the height of the bullet at that particular location.

```{r}
my_data <- tbl(my_db, "data")

# Get Hamby Barrel 1 Bullet 1 Land 3
result <- my_data %>% 
    filter(land_id == 39, !is.na(value)) %>%
    arrange(x, y) %>%
    head(n = 20) %>%
    as.data.frame
```

```{r, echo=FALSE, message=FALSE, results='asis'}
print(xtable(result, digits = c(0, 0, 0, 0, 2, 2), label = "tab:data", caption = "A sample of 20 rows of the data table. The land id column identifies the bullet land under consideration. The x coordinate is the location along the shorter axis, while the y coordinate is along the longer axis. The value column represents the height of the bullet at that particular location."), comment = FALSE)
```

In comparison to a regular x3p file, this data table is less space efficient. An x3p file uses a surface matrix of dimension $(x, y)$ where the value of each $(x_i, y_j)$ is the height of the bullet at $x = i$ and $y = j$. Our expanded version of the format turns each cell into a single row $x_i, y_j, z_{ij}$. Where $z_{ij}$ where $z_{ij}$ is the height of the bullet at $x = i$, $y = j$. Thus, for a bullet with 500 $x$ values and 1572 $y$ values (as in land id 39), an x3p file uses a matrix of size 786000, while we use a data frame of 786000 rows and 3 columns, storing $3x$ as much information. While certainly less space efficient, the format allows for easy querying of specific lands or specific profiles from the database.

## Metadata

As shown in Table \ref{tab:metadata}, the metadata table includes one row for each unique bullet land. The name of the bullet is derived from the file path, as provided by the NBTRD. Several other parameters are given, including the number of profiles, (x values), number of observations per profile (y values), and the increments of each in micrometers. (in this case, one x unit is equivalent to 1.5625 micrometers). Note that this table includes parameters that are derived solely from the properties of the data itself. Hence, the previous data table in conjunction with metadata forms the information needed to generate x3p files. Conversely, x3p files can be used to regenerate the data contained in both these tables.

```{r}
my_metadata <- tbl(my_db, "metadata")

result <- my_metadata %>% 
    filter(land_id >= 61, land_id <= 72) %>%
    as.data.frame
```

```{r, echo=FALSE, results='asis'}
print(xtable(result[,1:7], digits = c(0, 0, 0, 0, 0, 0, 0, 0), label = "tab:metadata", caption = "A sample of 12 rows of the metadata table. The land id column identifies the bullet land under consideration. The name of the bullet is derived from the file path, as provided by the NBTRD. Several other parameters are given, including the number of profiles, (x values), number of observations per profile (y values), and the increments of each in micrometers. (in this case, one x unit is equivalent to 1.5625 micrometers)"), comment = FALSE, include.rownames = FALSE)
```

## Metadata Derived

Similarly to the metadata table, the metadata_derived table (Sampled in Table \ref{tab:metadata_derived}) includes one row per bullet land. The difference is that the columns of this table were derived by our algorithm rather than properties of the data. The run_id, which will be discussed in more depth later, indicates the algorithm run that yielded the following derived parameters.

```{r}
my_metadata_derived <- tbl(my_db, "metadata_derived")

result <- my_metadata_derived %>% 
    filter(land_id >= 61, land_id <= 72, run_id == 3) %>%
    as.data.frame
```

```{r, echo=FALSE, results='asis'}
print(xtable(result, digits = c(0, 0, 0, 0, 4, 4, 0, 0), label = "tab:metadata_derived", caption = "A sample of 12 rows of the metadata\\_derived table. Once again, a land id identifies a particular bullet land. The run id, which will be discussed in more depth later, indicates the algorithm run that yielded the following derived parameters."), comment = FALSE, include.rownames = FALSE)
```

A number of derived parameters are given for a particular land and a particular run:

1. **ideal_crosscut** - The location of the ideal cross section (or ideal x coordinate) at which to extract a profile, as given by Hare, Hofmann, Carriquiry (2017).
2. **left_twist** - The calculated twist of the scan as determined by the left shoulder.
3. **right_twist** - The calculated twist of the scan as determined by the right shoulder.
4. **left_sample** - The number of samples used to compute the left twist.
5. **right_sample** - The number of samples used to compute the right twist.

## Profiles

Table \ref{tab:profiles} displays 10 profiles from land id 39, using the first 20 x coordinate values. Profiles are defined by properties of the grooves or shoulders. Hence, given this information, we can post-process the data table to extract particular profiles. For instance, we can use profile_id 32448, which is the profile obtained by extracting at x = 100 for land_id 39. Using properties of the grooves as determined by our algorithm, we can extract the the profile with the shoulders and grooves removed.

```{r}
my_profiles <- tbl(my_db, "profiles")

result <- my_profiles %>% 
    filter(land_id == 39, x > 92, x < 107) %>%
    select(-groove_left_pred, -groove_right_pred) %>%
    as.data.frame
```

```{r, echo=FALSE, results='asis'}
print(xtable(result, digits = c(0, 0, 0, 0, 2, 2, 2), label = "tab:profiles", caption = "A sample of 10 rows of the profiles table. A profile id is uniquely identified by a land id, a run id, and an x value."), comment = FALSE, include.rownames = FALSE)
```

Figure \ref{fig:prof} displays the profile obtained by extracting land id 39 at x = 100. Dashed vertical lines indicate the location of the shoulders. Within the bounds of the dashed line, the profiles that are relevant for bullet matching are obtained.

```{r prof, message=FALSE, warning=FALSE, fig.height=3, fig.width=6, fig.cap='The profile obtained by extracting land id 39 at x = 100. Dashed vertical lines indicate the location of the shoulders. Within the bounds of the dashed line, the profiles that are relevant for bullet matching are obtained.'}
myprof <- filter(result, x == 100)

land39 <- my_data %>% 
    filter(land_id == 39, x == 100) %>%
    as.data.frame

ggplot(data = land39, aes(x = y, y = value)) +
    geom_line() +
    geom_vline(xintercept = myprof$groove_left, linetype = 2) +
    geom_vline(xintercept = myprof$groove_right, linetype = 2) +
    theme_bw()
```

## Signatures

The land signature represents the processed data that is ultimately used for matching. In our case, a land signature represents the smoothed and processed residuals obtained from fitting a Locally Weighted Scatterplot Smoothing Regression (LOESS) to the profiles from above [@cleveland:1979]. Figure \ref{fig:sigs} displays the signature obtained by processing the profile of land id 39 at x = 100. It can be seen that the signature represents an attempt at reducing a bullet land to the peaks and valleys that represent striations, by removing the global structure of the bullet that dominates the view of the profile.

```{r sigs, message=FALSE, warning=FALSE, fig.height=3, fig.width=6, fig.cap='The signature obtained by processing the profile of land id 39 at x = 100.'}
my_signatures <- tbl(my_db, "signatures")

result <- my_signatures %>% 
    filter(profile_id == myprof$profile_id, run_id == 1) %>%
    as.data.frame

ggplot(data = result, aes(x = y, y = l30)) +
    geom_line() +
    theme_bw()
```

## CCF

The CCF table contains features computed on cross-comparisons between different signatures. The name is a bit of a misnomer; the ccf, or cross-correlation function, is only one of the features.  Table \ref{tab:ccf} displays a subset of the derived features for a comparison of the derived profile for land id 39, from above, with six other land profiles. This land's known match is the fourth row in the table, and the features immediately stand out as more pronounced, including a ccf above 90% and a number of matches far exceeding the other comparisons.

```{r}
my_ccf <- tbl(my_db, "ccf")

result <- my_ccf %>% 
    filter(profile1_id == myprof$profile_id, 
           compare_id == 4) %>%
    select(profile1_id, profile2_id, ccf, rough_cor, D, overlap, matches, cms, sum_peaks) %>%
    collect() %>%
    slice(5:10) %>%
    as.data.frame
```

```{r, echo=FALSE, results='asis'}
print(xtable(result, digits = c(0, 0, 0, 3, 3, 3, 3, 3, 3, 3), label = "tab:ccf", caption = "A subset of the derived features for a comparison of the derived profile for land id 39, from above, with six other land profiles. This land's known match is the fourth row in the table, and the features immediately stand out as more pronounced, including a ccf above .9 and a number of matches far exceeding the other comparisons."), comment = FALSE, include.rownames = FALSE)
```

# Web Applications

We now turn our attention to two applications which build upon the previously described database. These applications were designed to be web-based, easy to use, interactive applications which supplement forensic examiners and forensic scientists in either performing a bullet matching routine, or participating in bullet matching research.

## Front-End

Figure \ref{fig:frontend-app} displays the first page of the application. Using the database, the application populates two lists along the left-hand side, allowing selection of any of the bullet lands currently stored in the database. By default, two lands that are known to match, from bullets fired from the first Hamby 252 barrel, are used. The forensic examiner can also upload their own bullet lands, in the x3p format [@x3pr], and the application will use those lands rather than lands stored in the database.

\begin{figure}[H]
\centering
\includegraphics[width=\linewidth]{images/frontapp1.png}
\caption{User Interface for the front-end bullet matching algorithm.}
\label{fig:frontend-app}
\end{figure}

The first page is called Stage 0. Stage 0 involves selection of the bullet lands, and provides some preliminary information. Beneath the land selection are two buttoms to either select "Step-By-Step Mode" or "Easy Mode". This document will focus on "Step-By-Step Mode", which enforces that the user interact with the process of the algorithm. By contrast, "Easy Mode" will choose parameters of the algorithm, such as the location to take a cross section, or the level of smoothing, automatically based on the procedures described in Automatic Matching of Bullet Lands. 

Using the `plotly` package [@plotly], the two lands are rendered in a 3D viewing framework. This framework allows panning, rotation, zooming, and other features to aid in the manual and visual inspection of the bullet land surfaces. These surface renderings will be displayed on each page of the application so that they can be used to help inform some parameter choices.

When the forensic examiner has chosen the bullet lands and read the information on Stage 0, they can choose the "Confirm Lands" button underneath "Step-By-Step Mode" to begin the matching process. The application then moves onto Stage 1, "Finding a Stable Region". In this stage, the goal is to select the coordinate of the ideal cross-section of each land. Using the algorithm described in Hare, Hofmann, Carriquiry (2016), the application attempts to select what it believes is the ideal cross-section, and provides those for each land as the default choice. When satisfied with the choice, the forensic examiner then can select "Confirm Coordinates" to continue to Stage 2.

Stage 2 involves automatic detection and removal of the grooves. A portion of the application at this stage is shown in Figure \ref{fig:frontend-app2}. In the top left, sliders are available representing the regions at which to extract the profile. While typically the default choice works well, some profiles may have unusual patterns that make automatic groove detection inaccurate. In these cases, the sliders can be adjusted as necessary to fine tune the locations. The vertical blue lines plotted show the groove location coordinates overlaid with the profile. Performing this step allows the LOESS fit to be impacted only by the curved structure of the bullet land itself, and not by the dominating grooves.

\begin{figure}[H]
\centering
\includegraphics[width=\linewidth]{images/frontapp2.png}
\caption{Stage 2 of the front-end bullet matching application. In this stage, the grooves are automatically detected and removed from the profile of the land.}
\label{fig:frontend-app2}
\end{figure}

After groove detection, the application moves into Stage 3 (Figure \ref{fig:frontend-app3}). The application automatically fits a LOESS regression with a span of 3%, and allows the forensic examiner to adjust the span as desired to control the level of smoothing. The fits for each land, along with the extracted residuals (the land signature), are displayed. 

\begin{figure}[H]
\centering
\includegraphics[width=\linewidth]{images/frontapp3.png}
\caption{Stage 3 of the front-end bullet matching application. In this stage, a LOESS regression is fit to the resulting profile (with grooves removed), and the smoothed residuals or the "signature" of the bullet lands are extracted.}
\label{fig:frontend-app3}
\end{figure}

Stage 4 (Figure \ref{fig:frontend-app4}) is an alignment stage. Using the ccf, the two signatures are aligned automatically for purposes of extracting features. As in prior stages, the amount of lag can be adjusted manually if the automatic choice is not ideal.

\begin{figure}[H]
\centering
\includegraphics[width=\linewidth]{images/frontapp4.png}
\caption{Stage 4 of the front-end bullet matching application. Here, the two land signatures are automatically aligned.}
\label{fig:frontend-app4}
\end{figure}

The final stage, Stage 5, involves using the aligned signatures in order to detect peaks and valleys. By smoothing over the signatures of each land, locations in which the derivative is equal to zero can be detected. Figure \ref{fig:frontend-app5} displays the application at this stage. Note that the level of smoothing, called the "Smoothing Factor", can be adjusted as desired. The detected peaks and valleys in the aligned signatures are indicated by red and blue vertical lines, respectively.

\begin{figure}[H]
\centering
\includegraphics[width=\linewidth]{images/frontapp5.png}
\caption{Stage 5 of the front-end bullet matching application, where peaks and valleys are detected in the smoothed and aligned signatures.}
\label{fig:frontend-app5}
\end{figure}

Finally, all stages are completed, and the resulting report is generated. A subset of the report is shown in Figure \ref{fig:frontend-app6}. In particular, note that the probability of a match based on the trained random forest is given, along with the derived features. Although not shown in the figure, the report also includes the results and parameters of all previous stages, for reproducibility. In this way, by printing the results, a step-by-step trace of each stage of the algorithm can be performed.

\begin{figure}[H]
\centering
\includegraphics[width=\linewidth]{images/frontapp6.png}
\caption{A portion of the final output of the bullet matching application.}
\label{fig:frontend-app6}
\end{figure}

## Back-End

The back-end application, shown in Figure \ref{fig:backend-app}, stands in contrast to the front-end application by being primarily intended for researches looking to improve the matching performance of the algorithm. Like the front-end app, it uses the database to generate a list of all the bullet lands available. Unlike the front-end app, however, it allows the selection of only one land. Once it is selected, the ideal cross section will be displayed on the left.

\begin{figure}[H]
\centering
\includegraphics[width=\linewidth]{images/backapp1.png}
\caption{User Interface for the back-end bullet matching algorithm.}
\label{fig:backend-app}
\end{figure}

On the right hand side, there will be two tables. The first table is the metadata for that particular land, which comes directly from the original x3p file. The second table is information on the profile based on the selected cross-section. Beneath this table, the profile with the detected grooves overlaid is shown. Finally, beneath that, the signature derived from this profile is shown.

While functionally simple, this application allows for the assessment of the generated signatures. In the course of tuning parameters and optimizing performance, we can examine signatures that may have issues, for instance if the grooves were not properly detected or a poor cross section was taken.

# Conclusion

In this paper, we have introduced and described a formal database housing raw bullet data, and the results of each processing stage of our bullet matching algorithm. Because this database is openly accessible, and all parameters are tracked, this allows for researchers to more easily use the results in order to iterate on components of the algorithm in hopes of improving the matching performance. Furthermore, use of the database will ensure that algorithms built upon it will automatically update as new data is provided, so that over time results can improve.

The two web applications both serve important purposes in the process of bullet matching research. The front-end application serves as an entry point to the algorithms and the features which a non-programmer can utilize. By simply uploading the surface scans of two bullet lands, a predicted matching probability can be obtained in a few seconds. Perhaps more importantly, values of the features can also be obtained, and the bullet lands themselves can be rotated and viewed in a 3D plotting framework. This application can act as a supplement to classical comparison methods, which are traditionally done manually under a comparison microscope.

On the other hand, the back-end application is intended for researchers who intend to develop code to improve upon the algorithms. It allows an assessment of where the algorithm is having the most issue distinguishing matches from non-matches, and can be used to quickly implement improvements as needed.