Title: | Handwriting Analysis in R |
---|---|
Description: | Perform statistical writership analysis of scanned handwritten documents. Webpage provided at: <https://github.com/CSAFE-ISU/handwriter>. |
Authors: | Iowa State University of Science and Technology on behalf of its Center for Statistics and Applications in Forensic Evidence [aut, cph, fnd], Nick Berry [aut], Stephanie Reinders [aut, cre], James Taylor [aut], Felix Baez-Santiago [ctb], Jon González [ctb] |
Maintainer: | Stephanie Reinders <[email protected]> |
License: | GPL-3 |
Version: | 3.2.1.9000 |
Built: | 2024-11-07 00:53:01 UTC |
Source: | https://github.com/csafe-isu/handwriter |
about_variable() returns information about the model variable.
about_variable(variable, model)
variable |
A variable in the fitted model output by |
model |
A fitted model created by |
Text that explains the variable
about_variable(
  variable = "mu[1,2]",
  model = example_model
)
addToFeatures
addToFeatures(FeatureSet, LetterList, vectorDims)
FeatureSet |
The current list of features that have been calculated |
LetterList |
List of all letters and their information |
vectorDims |
Vector of image dimensions |
A list consisting of current features calculated in FeatureSet as well as measures of compactness, loop count, and loop dimensions
analyze_questioned_documents() estimates the posterior probability of writership for the questioned documents using Markov Chain Monte Carlo (MCMC) draws from a hierarchical model created with fit_model().
analyze_questioned_documents( main_dir, questioned_docs, model, num_cores, writer_indices, doc_indices )
main_dir |
A directory that contains a cluster template created by |
questioned_docs |
A directory containing questioned documents |
model |
A fitted model created by |
num_cores |
An integer number of cores to use for parallel processing
with the |
writer_indices |
A vector of start and stop characters for writer IDs in file names |
doc_indices |
A vector of start and stop characters for document names in file names |
A list of likelihoods, votes, and posterior probabilities of writership for each questioned document.
## Not run: 
main_dir <- "/path/to/main_dir"
questioned_docs <- "/path/to/questioned_images"
analysis <- analyze_questioned_documents(
  main_dir = main_dir,
  questioned_docs = questioned_docs,
  model = model,
  num_cores = 2,
  writer_indices = c(2, 5),
  doc_indices = c(7, 18)
)
analysis$posterior_probabilities
## End(Not run)
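The return value lists likelihoods, votes, and posterior probabilities, but this page does not spell out how they relate. The base R sketch below shows one plausible reading, stated as an assumption: each MCMC iteration votes for the writer with the highest likelihood, and the vote shares are treated as posterior probabilities. The likelihood values are made up for illustration and are not output from handwriter.

```r
# Hypothetical likelihoods for one questioned document:
# rows are MCMC iterations, columns are known writers
likelihoods <- matrix(
  c(0.6, 0.3, 0.1,
    0.5, 0.4, 0.1,
    0.2, 0.7, 0.1,
    0.8, 0.1, 0.1),
  nrow = 4, byrow = TRUE,
  dimnames = list(NULL, c("w1", "w2", "w3"))
)

# Assumption: each iteration votes for the writer with the highest likelihood
winners <- colnames(likelihoods)[max.col(likelihoods, ties.method = "first")]
votes <- table(factor(winners, levels = colnames(likelihoods)))

# Assumption: vote shares play the role of posterior probabilities
posterior <- votes / sum(votes)
posterior
```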
Fit a model with fit_model() and calculate posterior probabilities of writership with analyze_questioned_documents() for a set of test documents where the ground truth is known. Then use calculate_accuracy() to measure the accuracy of the fitted model on the test documents. Accuracy is calculated as the average posterior probability assigned to the true writer.
calculate_accuracy(analysis)
analysis |
Writership analysis output by
|
The model's accuracy on the test set as a number
# calculate the accuracy for example analysis performed on test documents
# and a model with 1 chain
calculate_accuracy(example_analysis)

## Not run: 
main_dir <- "/path/to/main_dir"
test_images_dir <- "/path/to/test_images"
analysis <- analyze_questioned_documents(
  main_dir = main_dir,
  questioned_docs = test_images_dir,
  model = model,
  num_cores = 2,
  writer_indices = c(2, 5),
  doc_indices = c(7, 18)
)
calculate_accuracy(analysis)
## End(Not run)
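The accuracy definition above (the average posterior probability assigned to the true writer) can be sketched in base R. The matrix of posterior probabilities and the ground-truth vector are hypothetical; the layout of the real analysis object may differ.

```r
# Hypothetical posterior probabilities: rows are test documents,
# columns are known writers
post <- matrix(
  c(0.90, 0.10, 0.00,
    0.20, 0.70, 0.10,
    0.05, 0.15, 0.80),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("docA", "docB", "docC"), c("w1", "w2", "w3"))
)

# Hypothetical ground truth: the true writer of each test document
true_writer <- c(docA = "w1", docB = "w2", docC = "w3")

# Pick the probability assigned to each document's true writer, then average
p_true <- post[cbind(names(true_writer), true_writer)]
accuracy <- mean(p_true)
accuracy
```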
Removes the alpha channel from a PNG image.
cleanBinaryImage(img)
img |
A matrix of 1s and 0s. |
png image with the alpha channel removed
Cursive written word: csafe
csafe
Binary image matrix. 111 rows and 410 columns.
csafe_document <- list()
csafe_document$image <- csafe
plotImage(csafe_document)
csafe_document$thin <- thinImage(csafe_document$image)
plotImageThinned(csafe_document)
csafe_processList <- processHandwriting(csafe_document$thin, dim(csafe_document$image))
drop_burnin() removes the burn-in from the Markov Chain Monte Carlo (MCMC) draws.
drop_burnin(model, burn_in)
model |
A list of MCMC draws from a model fit with |
burn_in |
An integer number of starting iterations to drop from each MCMC chain. |
A list of data frames of MCMC draws with burn-in dropped.
model <- drop_burnin(model = example_model, burn_in = 25)
plot_trace(variable = "mu[1,2]", model = model)
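Conceptually, dropping burn-in means discarding the first burn_in iterations of every chain. A minimal base R sketch follows; the list of data frames is a hypothetical stand-in for MCMC draws, and drop_first_rows() is an illustrative helper, not part of handwriter.

```r
# Hypothetical stand-in for MCMC output: two chains, each a data frame of draws
set.seed(1)
chains <- list(
  data.frame(mu = rnorm(100), pi = runif(100)),
  data.frame(mu = rnorm(100), pi = runif(100))
)

# Illustrative helper: keep only iterations after the first `burn_in`
drop_first_rows <- function(chains, burn_in) {
  lapply(chains, function(draws) {
    stopifnot(burn_in >= 0, burn_in < nrow(draws))
    draws[seq_len(nrow(draws)) > burn_in, , drop = FALSE]
  })
}

trimmed <- drop_first_rows(chains, burn_in = 25)
sapply(trimmed, nrow)
```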
Example of writership analysis
example_analysis
The results of analyze_questioned_documents() stored in a named list with 5 items:
A data frame that shows the writer, document name, cluster assignment, slope, principal component rotation angle, and wrapped principal component rotation angle for each graph in each questioned document.
A data frame of the cluster fill counts for each questioned document.
A list of data frames where each data frame contains the likelihoods for a questioned document for each MCMC iteration.
A list of vote tallies for each questioned document.
A list of posterior probabilities of writership for each questioned document and each known writer in the closed set used to train the hierarchical model.
plot_cluster_fill_counts(formatted_data = example_analysis)
plot_posterior_probabilities(analysis = example_analysis)
An example cluster template created with make_clustering_template(). The cluster template was created from handwriting samples "w0016_s01_pLND_r01.png", "w0080_s01_pLND_r01.png", "w0124_s01_pLND_r01.png", "w0138_s01_pLND_r01.png", and "w0299_s01_pLND_r01.png" from the CSAFE Handwriting Database. The template has K=5 clusters.
example_cluster_template
A list containing a single cluster template created by make_clustering_template(). The cluster template was created by sorting a random sample of 1000 graphs from 10 training documents into 10 clusters with a K-means algorithm. The cluster template is a named list with 16 items:
An integer for the random number generator.
A vector of cluster assignments for each graph used to create the cluster template.
The final cluster centers produced by the K-Means algorithm.
The number of clusters to build (10) with the K-means algorithm.
The number of training graphs to use (1000) in the K-means algorithm.
A vector that lists the training document from which each graph originated.
A vector that lists the writer of each graph.
The maximum number of iterations for the K-means algorithm (3).
A vector of the number of graphs that changed clusters on each iteration of the K-means algorithm.
A vector of the outlier cutoff values calculated on each iteration of the K-means algorithm.
The reason the K-means algorithm terminated.
A matrix of the within cluster distances on each iteration of the K-means algorithm. More specifically, the distance between each graph and the center of the cluster to which it was assigned on each iteration.
A vector of the within-cluster sum of squares on each iteration of the K-means algorithm.
# view cluster fill counts for template training documents
template_data <- format_template_data(example_cluster_template)
plot_cluster_fill_counts(template_data, facet = TRUE)
Example of a hierarchical model
example_model
A hierarchical model created by fit_model() with a single chain of 100 MCMC iterations. It is a named list of 4 objects:
A data frame of model training data that shows the writer, document name, cluster assignment, slope, principal component rotation angle, and wrapped principal component rotation angle for each training graph.
A data frame of the cluster fill counts for each model training document.
The model training information from graph_measurements and cluster_fill_counts formatted for RJAGS.
A model fit using the rjags_data and the RJAGS and coda packages. It is an MCMC list that contains a single MCMC object.
# convert to a data frame and view all variable names
df <- as.data.frame(coda::as.mcmc(example_model$fitted_model))
colnames(df)

# view a trace plot
plot_trace(variable = "mu[1,1]", model = example_model)

# drop the first 25 MCMC iterations for burn-in
model <- drop_burnin(model = example_model, burn_in = 25)

## Not run: 
# analyze questioned documents
main_dir <- "/path/to/main_dir"
questioned_docs <- "/path/to/questioned_documents_directory"
analysis <- analyze_questioned_documents(
  main_dir = main_dir,
  questioned_docs = questioned_docs,
  model = example_model,
  num_cores = 2,
  writer_indices = c(2, 5),
  doc_indices = c(7, 18)
)
analysis$posterior_probabilities
## End(Not run)
Lifecycle: superseded
extractGraphs(source_folder = getwd(), save_folder = getwd())
source_folder |
path to folder containing .png images |
save_folder |
path to folder where graphs are saved to |
Development on 'extractGraphs()' is complete. We recommend using 'process_batch_dir()' instead.
Extracts graphs from .png images and saves them by writer.
Saves graphs in an RDS file.
## Not run: 
sof = "path to folder containing .png images"
saf = "path to folder where graphs will be saved to"
extractGraphs(sof, saf)
## End(Not run)
fit_model() fits a Bayesian hierarchical model to the model training data in model_docs and draws samples from the model as Markov Chain Monte Carlo (MCMC) estimates.
fit_model( main_dir, model_docs, num_iters, num_chains = 1, num_cores, writer_indices, doc_indices, a = 2, b = 0.25, c = 2, d = 2, e = 0.5 )
main_dir |
A directory that contains a cluster template created by
|
model_docs |
A directory containing model training documents |
num_iters |
An integer number of iterations of MCMC. |
num_chains |
An integer number of chains to use. |
num_cores |
An integer number of cores to use for parallel processing of cluster assignments. The model fitting is not done in parallel. |
writer_indices |
A vector of the start and stop character of the writer ID in the model training file names. E.g., if the file names are writer0195_doc1, writer0210_doc1, writer0033_doc1 then writer_indices is 'c(7,10)'. |
doc_indices |
A vector of the start and stop character of the "document name" in the model training file names. This is used to distinguish between two documents written by the same writer. E.g., if the file names are writer0195_doc1, writer0195_doc2, writer0033_doc1, writer0033_doc2 then doc_indices are 'c(12,15)'. |
a |
The shape parameter for the Gamma distribution in the hierarchical model |
b |
The rate parameter for the Gamma distribution in the hierarchical model |
c |
The first shape parameter for the Beta distribution in the hierarchical model |
d |
The second shape parameter for the Beta distribution in the hierarchical model |
e |
The scale parameter for the hyperprior for mu in the hierarchical model |
A list of training data used to fit the model and the fitted model
## Not run: 
main_dir <- "/path/to/main_dir"
model_docs <- "path/to/model_training_docs"
questioned_docs <- "path/to/questioned_docs"

model <- fit_model(
  main_dir = main_dir,
  model_docs = model_docs,
  num_iters = 100,
  num_chains = 1,
  num_cores = 2,
  writer_indices = c(2, 5),
  doc_indices = c(7, 18)
)

model <- drop_burnin(model = model, burn_in = 25)

analysis <- analyze_questioned_documents(
  main_dir = main_dir,
  questioned_docs = questioned_docs,
  model = model,
  num_cores = 2,
  writer_indices = c(2, 5),
  doc_indices = c(7, 18)
)
analysis$posterior_probabilities
## End(Not run)
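The writer_indices and doc_indices arguments are character positions within each file name. The base R sketch below shows how such positions pick out the writer ID and document name with substr(), using the file names from the argument descriptions above. Whether handwriter extracts them in exactly this way internally is an assumption.

```r
# File names taken from the argument descriptions above
file_names <- c("writer0195_doc1", "writer0195_doc2", "writer0033_doc1")

writer_indices <- c(7, 10)  # start and stop characters of the writer ID
doc_indices <- c(12, 15)    # start and stop characters of the document name

# substr() extracts the characters between the start and stop positions
writers <- substr(file_names, writer_indices[1], writer_indices[2])
docs <- substr(file_names, doc_indices[1], doc_indices[2])

data.frame(file = file_names, writer = writers, doc = docs)
```

Checking the extracted IDs this way before fitting can catch an off-by-one index early.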
format_template_data() formats the template data for use with plot_cluster_fill_counts(). The output is a list that contains a data frame called cluster_fill_counts.
format_template_data(template)
template |
A single cluster template created by
|
List that contains the cluster fill counts
template_data <- format_template_data(template = example_cluster_template)
plot_cluster_fill_counts(formatted_data = template_data, facet = TRUE)
get_cluster_fill_counts() creates a data frame that shows the number of graphs in each cluster for each input document.
get_cluster_fill_counts(df)
df |
A data frame with columns |
A data frame of cluster fill counts for each document in the input data frame.
writer <- c(rep(1, 20), rep(2, 20), rep(3, 20))
docname <- c(rep('doc1', 20), rep('doc2', 20), rep('doc3', 20))
doc <- c(rep(1, 20), rep(2, 20), rep(3, 20))
cluster <- sample(3, 60, replace = TRUE)
df <- data.frame(docname, writer, doc, cluster)
get_cluster_fill_counts(df)
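A cluster fill count is the number of graphs from a document that fall in each cluster. The idea can be illustrated with a base R cross-tabulation. This is a sketch of the concept on toy data, not the package's implementation, and the output layout of get_cluster_fill_counts() may differ.

```r
# Toy graph-level data: one row per graph, with its document and cluster
toy <- data.frame(
  docname = rep(c("doc1", "doc2"), each = 5),
  cluster = c(1, 1, 2, 3, 3, 2, 2, 2, 3, 1)
)

# Make every cluster a factor level so empty clusters still get a column
toy$cluster <- factor(toy$cluster, levels = 1:3)

# Cross-tabulate documents by cluster to count graphs per cluster
counts <- table(toy$docname, toy$cluster)
counts
```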
get_clusters_batch
get_clusters_batch( template, input_dir, output_dir, writer_indices = NULL, doc_indices = NULL, num_cores = 1, save_master_file = FALSE )
template |
A cluster template created with |
input_dir |
A directory containing graphs created with
|
output_dir |
Output directory for cluster assignments |
writer_indices |
Optional. A vector of start and end indices for the writer ID in the graph file names. |
doc_indices |
Optional. A vector of start and end indices for the document ID in the graph file names. |
num_cores |
Integer number of cores to use for parallel processing |
save_master_file |
TRUE or FALSE. If TRUE, a master file named 'all_clusters.rds' containing the cluster assignments for all documents in the input directory will be saved to the output directory. If FALSE, a master file will not be saved, but the individual files for each document in the input directory will still be saved to the output directory. |
A list of cluster assignments
## Not run: 
template <- readRDS('path/to/template.rds')
get_clusters_batch(template = template, input_dir = 'path/to/dir', output_dir = 'path/to/dir',
                   writer_indices = c(2, 5), doc_indices = c(7, 18), num_cores = 1)

get_clusters_batch(template = template, input_dir = 'path/to/dir', output_dir = 'path/to/dir',
                   writer_indices = c(1, 4), doc_indices = c(5, 10), num_cores = 5)
## End(Not run)
In a model created with fit_model(), the pi parameters are estimates of the true cluster fill count for a particular writer and cluster. The function get_credible_intervals() calculates the credible intervals of the pi parameters for each writer in the model.
get_credible_intervals(model, interval_min = 0.05, interval_max = 0.95)
model |
A model output by |
interval_min |
The lower bound for the credible interval. The number must be between 0 and 1. |
interval_max |
The upper bound for the credible interval. The number
must be greater than |
A list of data frames. Each data frame lists the credible intervals for a single writer.
get_credible_intervals(model = example_model)
get_credible_intervals(model = example_model, interval_min = 0.05, interval_max = 0.95)
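A credible interval of this kind can be read off the MCMC draws as a pair of quantiles. The base R sketch below uses simulated draws as a hypothetical stand-in for one pi parameter. It assumes an equal-tailed quantile interval, which this page does not confirm is the method handwriter uses.

```r
# Hypothetical MCMC draws of one pi parameter (not output from fit_model())
set.seed(42)
draws <- rbeta(1000, shape1 = 2, shape2 = 8)

interval_min <- 0.05
interval_max <- 0.95

# Quantiles of the draws give the lower bound, median, and upper bound
ci <- quantile(draws, probs = c(interval_min, 0.5, interval_max))
ci
```

With the defaults above, roughly 90 percent of the draws fall between the lower and upper bounds.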
Get the posterior probabilities for a questioned document analyzed with analyze_questioned_documents().
get_posterior_probabilities(analysis, questioned_doc)
analysis |
The output of |
questioned_doc |
The filename of the questioned document |
A data frame of posterior probabilities for the questioned document
get_posterior_probabilities(
  analysis = example_analysis,
  questioned_doc = "w0030_s03_pWOZ_r01"
)
A graph prototype consists of the starting and ending points of each path in the graph, as well as evenly spaced points along each path. The prototype also stores the center point of the graph. All points are represented as xy-coordinates and the center point is at (0,0).
graphToPrototype(graph, numPathCuts = 8)
graph |
A graph from a handwriting sample |
numPathCuts |
Number of segments to cut the path(s) into |
List of pathEnds, pathQuarters, and pathCenters given as (x,y) coordinates with the graph centroid at (0,0). The returned list also contains path lengths. pathQuarters gives the (x,y) coordinates of the path at the cut points and despite the name, the path might not be cut into quarters.
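The idea of evenly spaced cut points along a path, expressed relative to a center at (0,0), can be sketched in base R. Everything here is illustrative: the path coordinates are made up, the center is taken as the mean of the path vertices (an assumption), and linear interpolation along arc length is one possible way to place the cut points, not necessarily the method graphToPrototype() uses.

```r
# Hypothetical path given as ordered (x, y) points along a pen stroke
path <- cbind(x = c(0, 3, 3, 7), y = c(0, 4, 10, 10))
num_path_cuts <- 4  # number of segments to cut the path into

# Cumulative arc length at each vertex of the path
seg_len <- sqrt(rowSums(diff(path)^2))
arc <- c(0, cumsum(seg_len))

# Interior cut points at equal arc-length spacing
targets <- seq(0, max(arc), length.out = num_path_cuts + 1)
targets <- targets[-c(1, length(targets))]
cuts <- cbind(
  x = approx(arc, path[, "x"], xout = targets)$y,
  y = approx(arc, path[, "y"], xout = targets)$y
)

# Assumption: use the mean of the vertices as the center, then shift
# the path ends and cut points so that the center sits at (0, 0)
center <- colMeans(path)
ends_centered <- sweep(path[c(1, nrow(path)), ], 2, center)
cuts_centered <- sweep(cuts, 2, center)
cuts_centered
```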
Cursive written word: London
london
Binary image matrix. 148 rows and 481 columns.
london_document <- list()
london_document$image <- london
plotImage(london_document)
london_document$thin <- thinImage(london_document$image)
plotImageThinned(london_document)
london_processList <- processHandwriting(london_document$thin, dim(london_document$image))
make_clustering_template() applies a K-means clustering algorithm to the input handwriting samples pre-processed with process_batch_dir() and saved in the input folder main_dir > data > template_graphs. The K-means algorithm sorts the graphs in the input handwriting samples into groups, or clusters, of similar graphs.
make_clustering_template( main_dir, template_docs, writer_indices, centers_seed, K = 40, num_dist_cores = 1, max_iters = 25 )
main_dir |
Main directory that will store template files |
template_docs |
A directory containing template training images |
writer_indices |
A vector of the starting and ending location of the writer ID in the file name. |
centers_seed |
Integer seed for the random number generator when selecting starting cluster centers. |
K |
Integer number of clusters |
num_dist_cores |
Integer number of cores to use for the distance calculations in the K-means algorithm. Each iteration of the K-means algorithm calculates the distance between each input graph and each cluster center. |
max_iters |
Maximum number of iterations to allow the K-means algorithm to run |
List containing the cluster template
## Not run: 
main_dir <- "path/to/folder"
template_docs <- "path/to/template_training_docs"
template_list <- make_clustering_template(
  main_dir = main_dir,
  template_docs = template_docs,
  writer_indices = c(2, 5),
  K = 10,
  num_dist_cores = 2,
  max_iters = 25,
  centers_seed = 100
)
## End(Not run)
Full page image of the handwritten London letter.
message
Binary image matrix. 1262 rows and 1162 columns.
message_document <- list()
message_document$image <- message
plotImage(message_document)

## Not run: 
message_document <- list()
message_document$image <- message
plotImage(message_document)
message_document$thin <- thinImage(message_document$image)
plotImageThinned(message_document)
message_processList <- processHandwriting(message_document$thin, dim(message_document$image))
## End(Not run)
Full page image of the 4th sample (nature) of handwriting from the first writer.
nature1
Binary image matrix. 811 rows and 1590 columns.
nature1_document <- list()
nature1_document$image <- nature1
plotImage(nature1_document)

## Not run: 
nature1_document <- list()
nature1_document$image <- nature1
plotImage(nature1_document)
nature1_document$thin <- thinImage(nature1_document$image)
plotImageThinned(nature1_document)
nature1_processList <- processHandwriting(nature1_document$thin, dim(nature1_document$image))
## End(Not run)
Plot the cluster centers of a cluster template created with make_clustering_template(). make_clustering_template() uses a K-means type algorithm to sort graphs from training documents into clusters. On each iteration of the algorithm, it calculates the mean graph of each cluster and finds the graph in each cluster that is closest to the mean graph. The graphs closest to the mean graphs are used as the cluster centers for the next iteration.
Handwriter stores the cluster centers of a cluster template as graph prototypes. A graph prototype consists of the starting and ending points of each path in the graph, as well as evenly spaced points along each path. The prototype also stores the center point of the graph. All points are represented as xy-coordinates and the center point is at (0,0).
plot_cluster_centers(template, plot_graphs = FALSE, size = 100)
template |
A cluster template created with |
plot_graphs |
TRUE plots all graphs in each cluster in addition to the cluster centers. FALSE only plots the cluster centers. |
size |
The size of the output plot |
A plot
# plot cluster centers from example template
plot_cluster_centers(example_cluster_template)
plot_cluster_centers(example_cluster_template, plot_graphs = TRUE)
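The center update described above (take the cluster mean, then use the member closest to that mean as the next center) can be sketched on toy data. In this base R sketch, 2-D points stand in for graphs and Euclidean distance stands in for handwriter's graph distance. Both substitutions are assumptions made for illustration, and the sketch does not handle empty clusters.

```r
# Toy stand-in: 2-D points play the role of graphs
pts <- rbind(c(0, 0), c(1, 0), c(0, 1), c(10, 10), c(11, 10), c(10, 12))
centers <- pts[c(1, 4), , drop = FALSE]  # starting centers are data points

for (iter in 1:5) {
  # Distance from every point to every center (rows: points, cols: centers)
  d <- sapply(seq_len(nrow(centers)), function(k) {
    sqrt(rowSums(sweep(pts, 2, centers[k, ])^2))
  })
  # Assign each point to its nearest center
  cluster <- max.col(-d, ties.method = "first")

  # New center: the cluster member closest to the cluster mean
  new_centers <- centers
  for (k in seq_len(nrow(centers))) {
    members <- pts[cluster == k, , drop = FALSE]
    m <- colMeans(members)
    closest <- which.min(rowSums(sweep(members, 2, m)^2))
    new_centers[k, ] <- members[closest, ]
  }

  # Stop when the centers no longer change
  if (identical(new_centers, centers)) break
  centers <- new_centers
}

cluster
centers
```

Because each center is always an actual member of its cluster, the centers stay valid objects of the same kind as the data, which matters when the "points" are graphs rather than numeric vectors.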
Plot the cluster fill counts for each document in formatted_data.
plot_cluster_fill_counts(formatted_data, facet = TRUE)
formatted_data |
Data created by |
facet |
|
ggplot plot of cluster fill counts
# Plot cluster fill counts for template training documents
template_data <- format_template_data(example_cluster_template)
plot_cluster_fill_counts(formatted_data = template_data, facet = TRUE)

# Plot cluster fill counts for model training documents
plot_cluster_fill_counts(formatted_data = example_model, facet = TRUE)

# Plot cluster fill counts for questioned documents
plot_cluster_fill_counts(formatted_data = example_analysis, facet = FALSE)
Plot the cluster fill rates for each document in formatted_data.
plot_cluster_fill_rates(formatted_data, facet = FALSE)
formatted_data |
Data created by |
facet |
|
ggplot plot of cluster fill rates
# Plot cluster fill rates for template training documents
template_data <- format_template_data(example_cluster_template)
plot_cluster_fill_rates(formatted_data = template_data, facet = TRUE)

# Plot cluster fill rates for model training documents
plot_cluster_fill_rates(formatted_data = example_model, facet = TRUE)

# Plot cluster fill rates for questioned documents
plot_cluster_fill_rates(formatted_data = example_analysis, facet = FALSE)
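This page does not define a cluster fill rate. The base R sketch below assumes it is a document's cluster fill count divided by the total number of graphs in that document, so that each document's rates sum to 1. The counts are toy values.

```r
# Toy cluster fill counts: rows are documents, columns are clusters
counts <- matrix(
  c(2, 1, 2,
    1, 3, 1),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("doc1", "doc2"), c("1", "2", "3"))
)

# Assumption: rate = count / total graphs in the document
rates <- prop.table(counts, margin = 1)
rates
```

Rates make documents of different lengths comparable, which raw counts do not.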
Plot credible intervals for the model's pi parameters that estimate the true writer cluster fill counts.
plot_credible_intervals( model, interval_min = 0.025, interval_max = 0.975, facet = FALSE )
model |
A model created by |
interval_min |
The lower bound of the credible interval. It must be greater than zero and less than 1. |
interval_max |
The upper bound of the credible interval. It must be greater than the interval minimum and less than 1. |
facet |
|
ggplot plot of credible intervals
plot_credible_intervals(model = example_model)
plot_credible_intervals(model = example_model, facet = TRUE)
Use processDocument() to split handwriting into component shapes called graphs. plot_graphs() creates a plot that displays the graphs. ggplot2::facet_wrap() places each graph in its own facet, and ncol sets the number of columns of facets.
plot_graphs(doc, ncol = NULL)
doc |
A PNG image of handwriting processed with |
ncol |
Optionally, set the number of columns in the output plot. The default is |
A plot of all graphs in the document
image_path <- system.file("extdata", "phrase_example.png", package = "handwriter")
doc <- processDocument(image_path)
plot_graphs(doc)
Creates a tile plot of posterior probabilities of writership for each questioned document and each known writer analyzed with analyze_questioned_documents().
plot_posterior_probabilities(analysis)
analysis |
A named list of analysis results from |
A tile plot of posterior probabilities of writership.
plot_posterior_probabilities(analysis = example_analysis)
Create a trace plot for all chains for a single variable of a fitted model created by fit_model(). If the model contains more than one chain, the chains will be combined by pasting them together.
plot_trace(variable, model)
variable |
The name of a variable in the model |
model |
A model created by |
A trace plot
plot_trace(model = example_model, variable = "pi[1,1]")
plot_trace(model = example_model, variable = "mu[2,3]")
This function plots a basic black and white image.
plotImage(doc)
doc |
A document processed with |
ggplot plot
csafe_document <- list()
csafe_document$image <- csafe
plotImage(csafe_document)

## Not run: 
document <- processDocument('path/to/image.png')
plotImage(document)
## End(Not run)
This function returns a plot with the full image plotted in light gray and the thinned skeleton printed in black on top.
plotImageThinned(doc)
doc |
A document processed with |
ggplot plot of thinned image
csafe_document <- list()
csafe_document$image <- csafe
csafe_document$thin <- thinImage(csafe_document$image)
plotImageThinned(csafe_document)
This function returns a plot of a single graph extracted from a document. It uses the letterList parameter from the processHandwriting() or processDocument() function and accepts a single value as whichLetter. Dims requires the dimensions of the entire document, since this isn't contained in processHandwriting() or processDocument().
plotLetter( doc, whichLetter, showPaths = TRUE, showCentroid = TRUE, showSlope = TRUE, showNodes = TRUE )
doc |
A document processed with |
whichLetter |
Single value in 1:length(letterList) denoting which letter to plot. |
showPaths |
Whether the calculated paths on the letter should be shown with numbers. |
showCentroid |
Whether the centroid should be shown |
showSlope |
Whether the slope should be shown |
showNodes |
Whether the nodes should be shown |
Plot of single letter.
twoSent_document = list()
twoSent_document$image = twoSent
twoSent_document$thin = thinImage(twoSent_document$image)
twoSent_document$process = processHandwriting(twoSent_document$thin, dim(twoSent_document$image))
plotLetter(twoSent_document, 1)
plotLetter(twoSent_document, 4, showPaths = FALSE)
This function returns a plot of a single line extracted from a document. It uses the letterList parameter from the processHandwriting() function and accepts a single value as whichLine. The dims argument requires the dimensions of the entire document, since these are not contained in the output of processHandwriting().
plotLine(letterList, whichLine, dims)
letterList |
Letter list from processHandwriting function |
whichLine |
Single value denoting which line to plot. The function checks whether the value is too large. |
dims |
Dimensions of the original document |
ggplot plot of single line
twoSent_document = list()
twoSent_document$image = twoSent
twoSent_document$thin = thinImage(twoSent_document$image)
twoSent_processList = processHandwriting(twoSent_document$thin, dim(twoSent_document$image))
dims = dim(twoSent_document$image)
plotLine(twoSent_processList$letterList, 1, dims)
This function returns a plot with the full image plotted in light gray and the skeleton printed in black, with red triangles over the vertices. Also called from plotPath, which is a more useful function, in general.
plotNodes(doc, plot_break_pts = FALSE, nodeSize = 3, nodeColor = "red")
doc |
A document processed with |
plot_break_pts |
Logical value indicating whether to plot nodes or break points. plot_break_pts=FALSE plots nodes and plot_break_pts=TRUE plots break points. |
nodeSize |
Size of triangles printed. 3 by default. Move down to 2 or 1 for small text images. |
nodeColor |
Which color the nodes should be |
Plot of full and thinned image with vertices overlaid.
csafe_document <- list()
csafe_document$image <- csafe
csafe_document$thin <- thinImage(csafe_document$image)
csafe_document$process <- processHandwriting(csafe_document$thin, dim(csafe_document$image))
plotNodes(csafe_document)
plotNodes(csafe_document, nodeSize=6, nodeColor="black")
Process a list of handwriting samples saved as PNG images in a directory:
(1) Load the image and convert it to black and white with readPNGBinary()
(2) Thin the handwriting to one pixel in width with thinImage()
(3) Run processHandwriting() to split the handwriting into parts called edges and place nodes at the ends of edges. Then combine edges into component shapes called graphs.
(4) Save the processed document in an RDS file.
(5) Optional. Return a list of the processed documents.
process_batch_dir(input_dir, output_dir = ".", skip_docs_on_retry = TRUE)
input_dir |
Input directory that contains images |
output_dir |
A directory to save the processed images |
skip_docs_on_retry |
Logical value indicating whether to skip documents in input_dir that
caused errors on a previous run. The errors and document names are stored
in the file problems.txt in output_dir. If this is the first run,
|
No return value, called for side effects
## Not run:
process_batch_dir("path/to/input_dir", "path/to/output_dir")
## End(Not run)
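The skip-on-retry behavior described for process_batch_dir() can be sketched in base R. This is an illustration under stated assumptions, not the package's code: the helper `process_with_retry_log()`, the stand-in `fake_process()`, and the "name: message" layout of problems.txt are all hypothetical.

```r
# Hypothetical helper: process each file, log failures to problems.txt, and
# skip files that failed on a previous run. `process_one` stands in for the
# real per-document processing step.
process_with_retry_log <- function(files, output_dir, process_one,
                                   skip_docs_on_retry = TRUE) {
  log_path <- file.path(output_dir, "problems.txt")

  # Document names recorded on a previous run (assumed format "name: message").
  previous <- character(0)
  if (skip_docs_on_retry && file.exists(log_path)) {
    previous <- sub(":.*$", "", readLines(log_path))
  }

  for (f in files) {
    name <- basename(f)
    if (name %in% previous) next      # failed before, so skip it
    out_path <- file.path(
      output_dir, paste0(tools::file_path_sans_ext(name), ".rds")
    )
    if (file.exists(out_path)) next   # already processed and saved
    tryCatch({
      result <- process_one(f)
      saveRDS(result, out_path)
    }, error = function(e) {
      # Record the document name and the error message, then carry on.
      cat(name, ": ", conditionMessage(e), "\n",
          sep = "", file = log_path, append = TRUE)
    })
  }
  invisible(NULL)
}

# Usage with a stand-in processor that fails on one document.
out <- tempfile("batch_demo")
dir.create(out)
fake_process <- function(path) {
  if (basename(path) == "b.png") stop("could not process")
  list(docname = basename(path))
}
files <- c("a.png", "b.png")
process_with_retry_log(files, out, fake_process)
process_with_retry_log(files, out, fake_process)  # b.png is skipped this time
```

Logging failures instead of stopping lets a long batch finish, and the log doubles as the skip list on the next run.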
Process a list of handwriting samples saved as PNG images:
(1) Load the image and convert it to black and white with readPNGBinary()
(2) Thin the handwriting to one pixel in width with thinImage()
(3) Run processHandwriting() to split the handwriting into parts called edges and place nodes at the ends of edges. Then combine edges into component shapes called graphs.
(4) Save the processed document in an RDS file.
(5) Optional. Return a list of the processed documents.
process_batch_list(images, output_dir, skip_docs_on_retry = TRUE)
images |
A vector of image file paths |
output_dir |
A directory to save the processed images |
skip_docs_on_retry |
Logical value indicating whether to skip documents in the images argument that
caused errors on a previous run. The errors and document names are stored
in the file problems.txt in output_dir. If this is the first run,
|
No return value, called for side effects
## Not run:
images <- c('path/to/image1.png', 'path/to/image2.png', 'path/to/image3.png')
process_batch_list(images, "path/to/output_dir", FALSE)
process_batch_list(images, "path/to/output_dir", TRUE)
## End(Not run)
Load a handwriting sample from a PNG image. Then binarize, thin, and split the handwriting into graphs.
processDocument(path)
path |
File path for handwriting document. The document must be in PNG file format. |
The processed document as a list
image_path <- system.file("extdata", "phrase_example.png", package = "handwriter")
doc <- processDocument(image_path)
plotImage(doc)
plotImageThinned(doc)
plotNodes(doc)
The main driver of handwriting processing. Takes in an image of thinned handwriting created with thinImage() and splits the handwriting into shapes called graphs. Instead of processing the entire document at once, the thinned writing is separated into connected components and each component is split into graphs.
processHandwriting(img, dims)
img |
Thinned binary image created with thinImage() |
dims |
Dimensions of thinned binary image. |
A list of the processed image
twoSent_document <- list()
twoSent_document$image <- twoSent
twoSent_document$thin <- thinImage(twoSent_document$image)
twoSent_processList <- processHandwriting(twoSent_document$thin, dim(twoSent_document$image))
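The idea of separating writing into connected components, as processHandwriting() is described as doing, can be sketched with a simple flood fill in base R. This is not the package's implementation: the function `label_components()` is hypothetical, and the sketch assumes a binary matrix in which 0 is black ink, 1 is white paper, and pixels touching diagonally belong to the same component.

```r
# Label 8-connected components of black pixels (value 0) in a binary matrix.
# Returns an integer matrix: 0 for white pixels, 1, 2, ... for components.
label_components <- function(img) {
  nr <- nrow(img)
  nc <- ncol(img)
  labels <- matrix(0L, nr, nc)
  current <- 0L
  for (start in which(img == 0)) {
    if (labels[start] != 0L) next          # already part of a component
    current <- current + 1L
    labels[start] <- current
    stack <- start                         # pixels whose neighbors need checking
    while (length(stack) > 0) {
      idx <- stack[length(stack)]
      stack <- stack[-length(stack)]
      # Convert the column-major linear index to row and column.
      row_i <- (idx - 1L) %% nr + 1L
      col_i <- (idx - 1L) %/% nr + 1L
      for (dr in -1:1) for (dc in -1:1) {
        rr <- row_i + dr
        cc <- col_i + dc
        if (rr < 1 || rr > nr || cc < 1 || cc > nc) next
        if (img[rr, cc] == 0 && labels[rr, cc] == 0L) {
          labels[rr, cc] <- current
          stack <- c(stack, (cc - 1L) * nr + rr)
        }
      }
    }
  }
  labels
}

# Two strokes: one with a diagonal join, one separate.
img <- matrix(1, nrow = 5, ncol = 7)
img[2:3, 2] <- 0
img[4, 3] <- 0      # touches the first stroke diagonally
img[1:2, 6] <- 0    # separate stroke
labels <- label_components(img)
max(labels)         # number of components found
```

Each labeled component could then be handled on its own, which keeps the work per step small compared with treating the whole page at once.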
Development on read_and_process() is complete. We recommend using processDocument(). read_and_process(image_name, "document") is equivalent to processDocument(image_name).
read_and_process(image_name, transform_output)
image_name |
The file path to an image |
transform_output |
The type of transformation to perform on the output |
A list of the processed image components
# use handwriting example from handwriter package
image_path <- system.file("extdata", "phrase_example.png", package = "handwriter")
doc <- read_and_process(image_path, "document")
This function reads in and binarizes a PNG image.
readPNGBinary( path, cutoffAdjust = 0, clean = TRUE, crop = TRUE, inversion = FALSE )
path |
File path for image. |
cutoffAdjust |
Multiplicative adjustment to the K-means estimated binarization cutoff. |
clean |
Whether to fill in white pixels with 7 or 8 neighbors. This helps considerably when thinning because it prevents small white bubbles from forming in the text. |
crop |
Logical value dictating whether or not to crop the white space around the image. TRUE by default. |
inversion |
Logical value dictating whether or not to flip each pixel of binarized image. Flipping happens after binarization. FALSE by default. |
Returns image from path. 0 represents black, and 1 represents white by default.
image_path <- system.file("extdata", "phrase_example.png", package = "handwriter")
csafe_document <- list()
csafe_document$image = readPNGBinary(image_path)
plotImage(csafe_document)
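The binarization step of readPNGBinary() can be illustrated with a short base R sketch. It is a guess at the general idea, not the package's code: the function `binarize()` is hypothetical, the cutoff is assumed to be the midpoint between two k-means cluster centers, and the multiplicative adjustment is assumed to scale the cutoff by `1 + cutoffAdjust`, so that the default of 0 leaves it unchanged.

```r
# Sketch: binarize a grayscale matrix with values in [0, 1].
binarize <- function(gray, cutoffAdjust = 0, inversion = FALSE) {
  # Cluster pixel intensities into two groups (dark ink, light paper),
  # starting from the darkest and lightest values so the result is
  # deterministic.
  km <- stats::kmeans(as.vector(gray), centers = range(gray))

  # Assumed cutoff: midpoint between the two cluster centers, scaled by
  # the adjustment.
  cutoff <- mean(km$centers) * (1 + cutoffAdjust)

  # 0 represents black and 1 represents white.
  bin <- (gray > cutoff) * 1

  # Optionally flip every pixel after binarization.
  if (inversion) bin <- 1 - bin
  bin
}

# Hypothetical 2 x 3 grayscale image: three dark and three light pixels.
gray <- matrix(c(0.05, 0.1, 0.9, 0.95, 0.92, 0.08), nrow = 2)
bin <- binarize(gray)
inverted <- binarize(gray, inversion = TRUE)
```

Under these assumptions, a positive `cutoffAdjust` raises the cutoff and turns more pixels black, while a negative value does the opposite.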
Changes RGB image to grayscale
rgb2grayscale(img)
img |
A 3D array with slices R, G, and B |
img as a 3D array as grayscale
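A grayscale conversion of the kind rgb2grayscale() performs can be sketched as a weighted sum of the three color slices. The weights below are the common ITU-R BT.601 luma weights and are an assumption; the source does not say which weights the package uses, and the `rgb_img` array is a made-up example.

```r
# Hypothetical 2 x 2 RGB image with channel values in [0, 1].
rgb_img <- array(0, dim = c(2, 2, 3))
rgb_img[1, 1, ] <- c(1, 1, 1)   # white
rgb_img[1, 2, ] <- c(1, 0, 0)   # red
rgb_img[2, 2, ] <- c(0, 1, 0)   # green; pixel [2, 1] stays black

# Assumed luma weights for R, G, and B (ITU-R BT.601).
weights <- c(0.299, 0.587, 0.114)

# Weighted sum of the three slices gives one intensity per pixel.
gray <- rgb_img[, , 1] * weights[1] +
  rgb_img[, , 2] * weights[2] +
  rgb_img[, , 3] * weights[3]
```

In this sketch white maps to 1 and black to 0, with colored pixels falling in between according to their weighted brightness.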
Removes the alpha channel from a PNG image.
rgba2rgb(img)
img |
A 3D array with slices R, G, B, and alpha. |
img as a 3D array with alpha channel removed
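Dropping the alpha channel as rgba2rgb() describes amounts to keeping the first three slices of the array. The sketch below assumes the alpha slice is simply discarded rather than blended against a background, which the source does not specify; `rgba_img` is a made-up example.

```r
# Hypothetical 2 x 3 RGBA image: slices are R, G, B, and alpha.
set.seed(1)
rgba_img <- array(runif(2 * 3 * 4), dim = c(2, 3, 4))

# Keep the R, G, and B slices; drop = FALSE preserves the 3D shape.
rgb_img <- rgba_img[, , 1:3, drop = FALSE]

dim(rgb_img)
```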
This function returns a vector of locations for black pixels in the thinned image. Thinning is done using the Zhang-Suen algorithm.
thinImage(img)
img |
A binary matrix of the text that is to be thinned. |
A thinned, one pixel wide, image.
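A vector of black-pixel locations, as mentioned in the thinImage() description, can be related to a binary matrix in base R. The sketch below assumes the locations are column-major linear indices, which is how R indexes matrices by default; the source does not confirm the exact format, and `img` is a made-up example.

```r
# Hypothetical 3 x 4 binary image: 0 is black, 1 is white.
img <- matrix(1, nrow = 3, ncol = 4)
img[2, 1] <- 0
img[2, 2] <- 0
img[3, 3] <- 0

# Column-major linear indices of the black pixels.
black <- which(img == 0)

# Convert the indices to (row, column) pairs.
coords <- arrayInd(black, dim(img))

# Rebuild the binary matrix from the index vector and the dimensions.
rebuilt <- matrix(1, nrow(img), ncol(img))
rebuilt[black] <- 0
```

Rebuilding the matrix needs the original dimensions as well as the index vector.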
Two sentence printed example handwriting
twoSent
Binary image matrix. 396 rows and 1947 columns
twoSent_document <- list()
twoSent_document$image <- twoSent
plotImage(twoSent_document)

## Not run:
twoSent_document <- list()
twoSent_document$image <- twoSent
plotImage(twoSent_document)
twoSent_document$thin <- thinImage(twoSent_document$image)
plotImageThinned(twoSent_document)
twoSent_processList <- processHandwriting(twoSent_document$thin, dim(twoSent_document$image))
## End(Not run)
Finds pixels in the image that shouldn't be white and makes them black. This is a quick, helpful cleaning step to run before the thinning algorithm.
whichToFill(img)
img |
A binary matrix. |
A cleaned up image.
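The cleaning rule mentioned for readPNGBinary(), filling white pixels that have 7 or 8 neighbors, can be sketched in base R as follows. This is an illustration, not the package's whichToFill() code: the function `fill_enclosed_white()` is hypothetical, neighbors are assumed to mean black pixels among the eight surrounding positions, and pixels outside the image are assumed to count as white.

```r
# Sketch: turn white pixels (1) black (0) when at least 7 of their
# 8 neighbors are black. A single pass based on the original image.
fill_enclosed_white <- function(img) {
  nr <- nrow(img)
  nc <- ncol(img)

  # Pad with a white border so edge pixels also have eight neighbors.
  padded <- matrix(1, nr + 2, nc + 2)
  padded[2:(nr + 1), 2:(nc + 1)] <- img
  black <- 1 - padded

  # Count black neighbors by summing the eight shifted copies of the image.
  counts <- matrix(0, nr, nc)
  for (dr in -1:1) for (dc in -1:1) {
    if (dr == 0 && dc == 0) next
    counts <- counts +
      black[(2:(nr + 1)) + dr, (2:(nc + 1)) + dc, drop = FALSE]
  }

  out <- img
  out[img == 1 & counts >= 7] <- 0
  out
}

# A black block with one white hole in the middle and a white top row.
img <- matrix(0, nrow = 5, ncol = 5)
img[3, 3] <- 1
img[1, ] <- 1
cleaned <- fill_enclosed_white(img)
```

In this example the isolated hole at row 3, column 3 is filled, while the white top row is left alone because its pixels have too few black neighbors.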