R/AllGenerics.R, R/method-cure_barcode.R
bc_cure_cluster.Rd
bc_cure_cluster clusters barcodes by edit distance and removes the minority
barcodes whose sequences are similar to a more abundant barcode. This function
only applies to a BarcodeObj object with a cleanBc slot. Among similar barcodes,
those with the smaller read count are removed.
bc_cure_cluster(
barcodeObj,
dist_threshold = 1,
depth_fold_threshold = 1,
dist_method = "hamm",
cluster_method = "greedy",
count_threshold = 1e+09,
dist_costs = list(replace = 1, insert = 1, delete = 1)
)
# S4 method for class 'BarcodeObj'
bc_cure_cluster(
barcodeObj,
dist_threshold = 1,
depth_fold_threshold = 1,
dist_method = "hamm",
cluster_method = "greedy",
count_threshold = 1e+07,
dist_costs = list(replace = 1, insert = 1, delete = 1)
)
barcodeObj
A BarcodeObj object.
dist_threshold
A single integer, or a vector of integers whose length equals the number of
samples, specifying the edit distance threshold that defines two similar
barcode sequences. If the input is a vector, each value applies to one sample,
following the sample order in the BarcodeObj object. Sequences with an edit
distance equal to or less than the threshold are considered similar barcodes.
depth_fold_threshold
A single numeric value, or a numeric vector whose length equals the number of
samples, specifying the depth fold change threshold for removing a similar
minority barcode. A minority barcode is removed only when a similar, more
abundant barcode has at least depth_fold_threshold times as many reads.
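The fold change rule above can be sketched in base R. This is only an illustration of the comparison as described here, not the package's internal code; `is_removable` is a hypothetical helper name.

```r
# Hypothetical helper (not part of CellBarcode): TRUE when the majority
# barcode has at least `depth_fold_threshold` times the reads of the
# similar minority barcode.
is_removable <- function(major_count, minor_count, depth_fold_threshold = 1) {
  major_count >= depth_fold_threshold * minor_count
}

is_removable(60, 14, depth_fold_threshold = 2)  # 60 >= 2 * 14
is_removable(60, 40, depth_fold_threshold = 2)  # 60 <  2 * 40
```

With the default of 1, any similar barcode with an equal or higher read count satisfies the rule.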
dist_method
A character string specifying the edit distance used to evaluate barcode similarity. It can be "hamm" for Hamming distance or "leven" for Levenshtein distance.
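To make the difference between the two distances concrete, here is a small base-R sketch. The `hamming` function is a hypothetical helper written for this illustration, and `utils::adist()` is base R's generalized Levenshtein distance; neither is claimed to be what the package uses internally.

```r
# Hamming distance: number of mismatching positions.
# Only defined for strings of equal length.
hamming <- function(a, b) {
  stopifnot(nchar(a) == nchar(b))
  sum(strsplit(a, "")[[1]] != strsplit(b, "")[[1]])
}

# One substitution: both distances can express this.
hamming("AAGAG", "AAAAG")
utils::adist("AAGAG", "AAAAG")[1, 1]

# One insertion: the lengths differ, so only Levenshtein applies.
utils::adist("AAGAG", "AAGAAG")[1, 1]
```

Barcodes of different lengths, such as those differing by an insertion or deletion, can therefore only be compared with the Levenshtein distance.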
cluster_method
A character string specifying the algorithm used to cluster the barcodes. Currently only "greedy" is available. In this case, the most abundant and the least abundant barcodes are compared, and the least abundant barcode is preferentially removed.
count_threshold
An integer giving the read depth above which a barcode is considered a true barcode. If a barcode has a count higher than this threshold, it will not be removed, even if it is similar to a more abundant one. Default is 1e9 (the usage above shows 1e+09 for the generic and 1e+07 for the BarcodeObj method).
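The following base-R sketch shows one way the thresholds described above could interact in a greedy pass. It is a simplified illustration under stated assumptions, not CellBarcode's implementation: `greedy_cluster_sketch` is a hypothetical function, it uses `utils::adist()` (Levenshtein) as the distance, and it simply drops removed barcodes.

```r
# Illustrative only: a simplified greedy pass, not the package's algorithm.
greedy_cluster_sketch <- function(seqs, counts, dist_threshold = 1,
                                  depth_fold_threshold = 1,
                                  count_threshold = 1e9) {
  # Sort barcodes from most to least abundant.
  ord <- order(counts, decreasing = TRUE)
  seqs <- seqs[ord]
  counts <- counts[ord]
  keep <- rep(TRUE, length(seqs))
  # Visit candidates from the least to the most abundant barcode.
  for (i in rev(seq_along(seqs))) {
    # Barcodes deeper than count_threshold are never removed.
    if (counts[i] > count_threshold) next
    # Compare against more abundant barcodes that are still kept.
    for (j in which(keep & seq_along(seqs) < i)) {
      similar <- utils::adist(seqs[i], seqs[j])[1, 1] <= dist_threshold
      deep_enough <- counts[j] >= depth_fold_threshold * counts[i]
      if (similar && deep_enough) {
        keep[i] <- FALSE
        break
      }
    }
  }
  data.frame(seq = seqs[keep], count = counts[keep])
}

# "AAAAG" is one substitution away from the more abundant "AAGAG".
greedy_cluster_sketch(
  seqs = c("AAGAG", "AAAAG", "CGCGT"),
  counts = c(60, 14, 10)
)
```

Lowering `count_threshold` below 14 in this toy input would protect "AAAAG" from removal, which is the role the argument plays in the description above.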
dist_costs
A list giving the cost of each edit event. It applies only when the Levenshtein
distance is used. The list elements must be named insert, delete and replace,
specifying the weights of insertion, deletion and replacement events
respectively. The default cost for each event is 1.
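The effect of unequal event costs can be illustrated with base R's `utils::adist()`, which accepts a `costs` argument. Note that `adist()` names its costs insertions, deletions and substitutions, whereas this function uses insert, delete and replace; that the two weight events in the same way is an assumption of this sketch.

```r
# Weight insertions and deletions twice as heavily as substitutions.
costs <- c(insertions = 2, deletions = 2, substitutions = 1)

# One substitution: weighted distance 1.
utils::adist("AAGAG", "AAAAG", costs = costs)[1, 1]

# One insertion: weighted distance 2.
utils::adist("AAGAG", "AAGAAG", costs = costs)[1, 1]
```

Under these weights and a distance threshold of 1, only the substitution pair would count as similar.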
A BarcodeObj object with the cleanBc slot updated.
data(bc_obj)
d1 <- data.frame(
seq = c(
"ACTTCGATCGATCGAAAAGATCGATCGATC",
"AATTCGATCGATCGAAGAGATCGATCGATC",
"CCTTCGATCGATCGAAGAAGATCGATCGATC",
"TTTTCGATCGATCGAAAAGATCGATCGATC",
"AAATCGATCGATCGAAGAGATCGATCGATC",
"CCCTCGATCGATCGAAGAAGATCGATCGATC",
"GGGTCGATCGATCGAAAAGATCGATCGATC",
"GGATCGATCGATCGAAGAGATCGATCGATC",
"ACTTCGATCGATCGAACAAGATCGATCGATC",
"GGTTCGATCGATCGACGAGATCGATCGATC",
"GCGTCCATCGATCGAAGAAGATCGATCGATC"
),
freq = c(
30, 60, 9, 10, 14, 5, 10, 30, 6, 4, 6
)
)
pattern <- "([ACTG]{3})TCGATCGATCGA([ACTG]+)ATCGATCGATC"
bc_obj <- bc_extract(list(test = d1), pattern, sample_name = c("test"),
  pattern_type = c(UMI = 1, barcode = 2))
# Remove barcodes with depth < 5
(bc_cured <- bc_cure_depth(bc_obj, depth=5))
#> ------------
#> bc_cure_depth: isUpdate is TRUE, update the cleanBc.
#> ------------
#> Bonjour le monde, This is a BarcodeObj.
#> ----------
#> It contains:
#> ----------
#> @metadata: 3 field(s) available:
#> raw_read_count barcode_read_count depth_cutoff
#> ----------
#> @messyBc: 1 sample(s) for raw barcodes:
#> In sample $test there are: 10 Tags
#> ----------
#> @cleanBc: 1 samples for cleaned barcodes
#> In sample $test there are: 4 barcodes
# Cluster the barcodes and remove the less abundant
# ones within Hamming distance <= 1
bc_cure_cluster(bc_cured, dist_threshold = 1)
#> Bonjour le monde, This is a BarcodeObj.
#> ----------
#> It contains:
#> ----------
#> @metadata: 3 field(s) available:
#> raw_read_count barcode_read_count depth_cutoff
#> ----------
#> @messyBc: 1 sample(s) for raw barcodes:
#> In sample $test there are: 10 Tags
#> ----------
#> @cleanBc: 1 samples for cleaned barcodes
#> In sample $test there are: 2 barcodes
# Levenshtein distance <= 2, with insertions and deletions costing 2
bc_cure_cluster(bc_cured, dist_threshold = 2, dist_method = "leven",
dist_costs = list("insert" = 2, "replace" = 1, "delete" = 2))
#> Bonjour le monde, This is a BarcodeObj.
#> ----------
#> It contains:
#> ----------
#> @metadata: 3 field(s) available:
#> raw_read_count barcode_read_count depth_cutoff
#> ----------
#> @messyBc: 1 sample(s) for raw barcodes:
#> In sample $test there are: 10 Tags
#> ----------
#> @cleanBc: 1 samples for cleaned barcodes
#> In sample $test there are: 1 barcodes