Clean barcodes by editing distance — bc_cure

bc_cure_cluster performs clustering of barcodes by editing distance, and remove the minority barcodes with a similar sequence. This function is only applicable for the BarcodeObj object with a cleanBc slot. The barcodes with a smaller reads count will be removed.

bc_cure_cluster(
  barcodeObj,
  dist_threshold = 1,
  depth_fold_threshold = 1,
  dist_method = "hamm",
  cluster_method = "greedy",
  count_threshold = 1e+09,
  dist_costs = list(replace = 1, insert = 1, delete = 1)
)

# S4 method for class 'BarcodeObj'
bc_cure_cluster(
  barcodeObj,
  dist_threshold = 1,
  depth_fold_threshold = 1,
  dist_method = "hamm",
  cluster_method = "greedy",
  count_threshold = 1e+07,
  dist_costs = list(replace = 1, insert = 1, delete = 1)
)

Arguments

barcodeObj: A BarcodeObj object.
dist_threshold: A single integer, or vector of integers with the length of sample number, specifying the editing distance threshold for defining two similar barcode sequences. If the input is a vector, each value in the vector relates to one sample according to its order in BarcodeObj object. The sequences with editing distance equal to or less than the threshold will be considered similar barcodes.
depth_fold_threshold: A single numeric or vector of numeric with the length of sample number, specifying the depth fold change threshold of removing the similar minority barcode. The majority of barcodes should have at least depth_fold_threshold times of reads of the similar minotiry one, to remove the minority similar barcode. (TODO: more precise description)
dist_method: A character string, specifying the editing distance used for evaluating barcode similarity. It can be "hamm" for Hamming distance or "leven" for Levenshtein distance.
cluster_method: A character string specifying the algorithm used to perform the clustering of barcodes. Currently only "greedy" is available, in this case, The most and the least abundant barcode will be used for comparing, the least abundant barcode is preferentially removed.
count_threshold: An integer, read depth threshold to consider a barcode as a true barcode. If a barcode with a count higher than this threshold it will not be removed, even if the barcode is similar to a more abundant one. Default is 1e9.
dist_costs: A list, the cost of the events of distance algorithm, applicable when Levenshtein distance is applied. The names of vector have to be insert, delete and replace, specifying the weight of insertion, deletion, and replacement events respectively. The default cost for each event is 1.

Value

A BarcodeObj object with cleanBc slot updated.

Examples

data(bc_obj)

d1 <- data.frame(
    seq = c(
        "ACTTCGATCGATCGAAAAGATCGATCGATC",
        "AATTCGATCGATCGAAGAGATCGATCGATC",
        "CCTTCGATCGATCGAAGAAGATCGATCGATC",
        "TTTTCGATCGATCGAAAAGATCGATCGATC",
        "AAATCGATCGATCGAAGAGATCGATCGATC",
        "CCCTCGATCGATCGAAGAAGATCGATCGATC",
        "GGGTCGATCGATCGAAAAGATCGATCGATC",
        "GGATCGATCGATCGAAGAGATCGATCGATC",
        "ACTTCGATCGATCGAACAAGATCGATCGATC",
        "GGTTCGATCGATCGACGAGATCGATCGATC",
        "GCGTCCATCGATCGAAGAAGATCGATCGATC"
        ),
    freq = c(
        30, 60, 9, 10, 14, 5, 10, 30, 6, 4 , 6
        )
    )

pattern <- "([ACTG]{3})TCGATCGATCGA([ACTG]+)ATCGATCGATC"
bc_obj <- bc_extract(list(test = d1), pattern, sample_name=c("test"), 
    pattern_type=c(UMI=1, barcode=2))

# Remove barcodes with depth < 5
(bc_cured <- bc_cure_depth(bc_obj, depth=5))
#> ------------
#> bc_cure_depth: isUpdate is TRUE, update the cleanBc.
#> ------------
#> Bonjour le monde, This is a BarcodeObj.
#> ----------
#> It contains: 
#> ----------
#> @metadata: 3 field(s) available:
#> raw_read_count  barcode_read_count  depth_cutoff
#> ----------
#> @messyBc: 1 sample(s) for raw barcodes:
#>     In sample $test there are: 10 Tags
#> ----------
#> @cleanBc: 1 samples for cleaned barcodes
#>     In sample $test there are: 4 barcodes 

# Do the clustering, remove the less abundant barcodes
# one by hamming distance <= 1 
bc_cure_cluster(bc_cured, dist_threshold = 1)
#> Bonjour le monde, This is a BarcodeObj.
#> ----------
#> It contains: 
#> ----------
#> @metadata: 3 field(s) available:
#> raw_read_count  barcode_read_count  depth_cutoff
#> ----------
#> @messyBc: 1 sample(s) for raw barcodes:
#>     In sample $test there are: 10 Tags
#> ----------
#> @cleanBc: 1 samples for cleaned barcodes
#>     In sample $test there are: 2 barcodes 

# Levenshtein distance <= 1
bc_cure_cluster(bc_cured, dist_threshold = 2, dist_method = "leven",
    dist_costs = list("insert" = 2, "replace" = 1, "delete" = 2))
#> Bonjour le monde, This is a BarcodeObj.
#> ----------
#> It contains: 
#> ----------
#> @metadata: 3 field(s) available:
#> raw_read_count  barcode_read_count  depth_cutoff
#> ----------
#> @messyBc: 1 sample(s) for raw barcodes:
#>     In sample $test there are: 10 Tags
#> ----------
#> @cleanBc: 1 samples for cleaned barcodes
#>     In sample $test there are: 1 barcodes 

###