bc_extract identifies the barcodes (and UMI) from the sequences using
regular expressions. The pattern and pattern_type arguments are required;
they provide the barcode (and UMI) pattern and its location within the
sequences.
bc_extract(
  x,
  pattern = "",
  sample_name = NULL,
  metadata = NULL,
  maxLDist = 0,
  pattern_type = c(barcode = 1),
  costs = list(sub = 1, ins = 99, del = 99),
  ordered = TRUE
)
# S4 method for class 'data.frame'
bc_extract(
  x,
  pattern = "",
  sample_name = NULL,
  maxLDist = 0,
  pattern_type = c(barcode = 1),
  costs = list(sub = 1, ins = 99, del = 99),
  ordered = TRUE
)
# S4 method for class 'ShortReadQ'
bc_extract(
  x,
  pattern = "",
  sample_name = NULL,
  maxLDist = 0,
  pattern_type = c(barcode = 1),
  costs = list(sub = 1, ins = 99, del = 99),
  ordered = TRUE
)
# S4 method for class 'DNAStringSet'
bc_extract(
  x,
  pattern = "",
  sample_name = NULL,
  maxLDist = 0,
  pattern_type = c(barcode = 1),
  costs = list(sub = 1, ins = 99, del = 99),
  ordered = TRUE
)
# S4 method for class 'integer'
bc_extract(
  x,
  pattern = "",
  sample_name = NULL,
  maxLDist = 0,
  pattern_type = c(barcode = 1),
  costs = list(sub = 1, ins = 99, del = 99),
  ordered = TRUE
)
# S4 method for class 'character'
bc_extract(
  x,
  pattern = "",
  sample_name = NULL,
  metadata = NULL,
  maxLDist = 0,
  pattern_type = c(barcode = 1),
  costs = list(sub = 1, ins = 99, del = 99),
  ordered = TRUE
)
# S4 method for class 'list'
bc_extract(
  x,
  pattern = "",
  sample_name = NULL,
  metadata = NULL,
  maxLDist = 0,
  pattern_type = c(barcode = 1),
  costs = list(sub = 1, ins = 99, del = 99),
  ordered = TRUE
)
x: A single item or a list of fastq files, ShortReadQ, DNAStringSet, data.frame, or named integer vector objects.
pattern: A string, or a string vector with one element per file, giving the regular expression with capture groups that matches the barcode (and UMI).
sample_name: A string vector, applicable when x is a list or a vector of fastq files. It specifies the sample names. If not provided, the function looks for sample names in the row names of metadata, the fastq file names, or the list names.
metadata: A data.frame with sample names as the row names and one metadata field per column, describing the sample characteristics.
maxLDist: An integer. The maximum mismatch threshold for barcode matching. When maxLDist is 0, str_match is invoked for barcode matching, which is faster; otherwise aregexec is invoked and the costs parameter can be used to specify the weights of the distance calculation.
pattern_type: A vector. It defines which capture group holds the barcode (and UMI). See Details.
costs: A named list, applicable when maxLDist > 0, specifying the weight of each mismatch event while extracting the barcodes. The list element names must be sub (substitution), ins (insertion) and del (deletion). The default value is list(sub = 1, ins = 99, del = 99). See aregexec for more detailed information.
ordered: A logical value. If TRUE, the returned barcodes (UMI-barcode tags) are sorted by read count.
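The difference between exact matching (maxLDist = 0) and approximate matching with costs can be sketched with base R alone. This is only an illustration under assumptions: the two reads are made up, regexpr stands in for the exact-matching step, and aregexec is called directly rather than through bc_extract, so it does not show the package's internal code.

```r
# Hypothetical reads: the second one carries a single substitution (A -> T)
# inside the 5' constant region "AAAAA".
pattern <- "AAAAA([ACGT]+)CCCCC"
reads <- c("TTAAAAAGGCCCCCTT", "TTAAATAGGCCCCCTT")

# Exact matching: the mutated read is not matched.
exact_hit <- regexpr(pattern, reads) > 0

# Approximate matching: allow a total cost of 1. With sub = 1 and
# ins = del = 99, one substitution fits within that cost, while any
# insertion or deletion exceeds it.
costs <- list(sub = 1, ins = 99, del = 99)
approx <- aregexec(pattern, reads, max.distance = 1, costs = costs)
approx_hit <- vapply(approx, function(m) m[1] > 0, logical(1))

print(data.frame(read = reads, exact_hit, approx_hit))
```

The high insertion and deletion weights in the default costs effectively restrict approximate matching to substitutions, which keeps the barcode position fixed relative to the constant regions.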
This function returns a BarcodeObj object if the input is a list or a
vector of fastq files; otherwise it returns a data.frame. In the latter case
the data.frame has columns:
umi_seq (optional): UMI sequence, applicable when there is a UMI in the pattern and pattern_type arguments.
barcode_seq: barcode sequence.
count: read count.
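To make the shape of the returned table concrete, the following base R sketch builds a table with the same three columns from a small set of reads. It is an approximation for illustration only: the reads data.frame is made up, and regexec plus aggregate are stand-ins chosen here, not the functions bc_extract uses internally.

```r
# Hypothetical input: sequences with their read frequencies.
reads <- data.frame(
  seq = c(
    "ACTTCGATCGATCGAAAAGATCGATCGATC",
    "ACTTCGATCGATCGAAAAGATCGATCGATC",
    "AATTCGATCGATCGAAGAGATCGATCGATC",
    "GCGTCCATCGATCGAAGAAGATCGATCGATC"  # backbone mismatch, will not match
  ),
  freq = c(30L, 5L, 60L, 6L),
  stringsAsFactors = FALSE
)
pattern <- "([ACTG]{3})TCGATCGATCGA([ACTG]+)ATCGATCGATC"

# regmatches() gives character(0) for reads that do not match.
m <- regmatches(reads$seq, regexec(pattern, reads$seq))
matched <- lengths(m) > 0

# Element 1 is the full match, so capture group i sits at position i + 1.
tags <- data.frame(
  umi_seq = vapply(m[matched], `[`, character(1), 2),
  barcode_seq = vapply(m[matched], `[`, character(1), 3),
  count = reads$freq[matched],
  stringsAsFactors = FALSE
)

# Sum the reads per UMI-barcode tag, then sort by count (cf. ordered = TRUE).
tags <- aggregate(count ~ umi_seq + barcode_seq, data = tags, FUN = sum)
tags <- tags[order(tags$count, decreasing = TRUE), ]
print(tags)
```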
The pattern argument is a regular expression in which the capture operation
() identifies the barcode or UMI. The pattern_type argument annotates the
capture groups, denoting which one is the UMI and which is the barcode. In
the example:
([ACTG]{3})TCGATCGATCGA([ACTG]+)ATCGATCGATC
|---------| starts with 3 base pairs UMI.
           |----------| constant sequence in the backbone.
                       |-------| flexible barcode sequences.
                                |---------| 3' constant sequence.
In the UMI part [ACTG]{3}, [ACTG] means the base can be any one of "A",
"C", "T" or "G", and {3} means it repeats 3 times.
In the barcode pattern [ACTG]+, the + denotes one or more bases, each being
A, C, T or G.
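How the capture groups line up with pattern_type can be checked on a single read with base R. This is a minimal sketch, assuming regexec as a stand-in for the matching step; the read is taken from the examples on this page, and the variable names are chosen here for illustration.

```r
pattern <- "([ACTG]{3})TCGATCGATCGA([ACTG]+)ATCGATCGATC"
pattern_type <- c(UMI = 1, barcode = 2)
read <- "ACTTCGATCGATCGAAAAGATCGATCGATC"

# m[1] is the whole match; capture group i is stored at m[i + 1].
m <- regmatches(read, regexec(pattern, read))[[1]]

umi <- m[[pattern_type[["UMI"]] + 1]]
barcode <- m[[pattern_type[["barcode"]] + 1]]

cat("UMI:", umi, "\n")          # UMI: ACT
cat("barcode:", barcode, "\n")  # barcode: AAAG
```

Swapping the values in pattern_type would swap which capture group is read as the UMI and which as the barcode, so the vector has to follow the order of the parentheses in pattern.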
fq_file <- system.file("extdata", "simple.fq", package="CellBarcode")
library(ShortRead)
# barcode from fastq file
bc_extract(fq_file, pattern = "AAAAA(.*)CCCCC")
#> Bonjour le monde, This is a BarcodeObj.
#> ----------
#> It contains:
#> ----------
#> @metadata: 2 field(s) available:
#> raw_read_count barcode_read_count
#> ----------
#> @messyBc: 1 sample(s) for raw barcodes:
#> In sample $simple.fq there are: 1 Tags
# barcode from ShortReadQ object
sr <- readFastq(fq_file) # ShortReadQ
bc_extract(sr, pattern = "AAAAA(.*)CCCCC")
#> barcode_seq count
#> <char> <int>
#> 1: GG 1
# barcode from DNAStringSet object
ds <- sread(sr) # DNAStringSet
bc_extract(ds, pattern = "AAAAA(.*)CCCCC")
#> barcode_seq count
#> <char> <int>
#> 1: GG 1
# barcode from integer vector
iv <- tables(ds, n = Inf)$top # integer vector
bc_extract(iv, pattern = "AAAAA(.*)CCCCC")
#> barcode_seq count
#> <char> <int>
#> 1: GG 1
# barcode from data.frame
df <- data.frame(seq = names(iv), freq = as.integer(iv)) # data.frame
bc_extract(df, pattern = "AAAAA(.*)CCCCC")
#> barcode_seq count
#> <char> <int>
#> 1: GG 1
# barcode from list of DNAStringSet
l <- list(sample1 = ds, sample2 = ds) # list
bc_extract(l, pattern = "AAAAA(.*)CCCCC")
#> Bonjour le monde, This is a BarcodeObj.
#> ----------
#> It contains:
#> ----------
#> @metadata: 2 field(s) available:
#> raw_read_count barcode_read_count
#> ----------
#> @messyBc: 2 sample(s) for raw barcodes:
#> In sample $sample1 there are: 1 Tags
#> In sample $sample2 there are: 1 Tags
# Extract UMI and barcode
d1 <- data.frame(
  seq = c(
    "ACTTCGATCGATCGAAAAGATCGATCGATC",
    "AATTCGATCGATCGAAGAGATCGATCGATC",
    "CCTTCGATCGATCGAAGAAGATCGATCGATC",
    "TTTTCGATCGATCGAAAAGATCGATCGATC",
    "AAATCGATCGATCGAAGAGATCGATCGATC",
    "CCCTCGATCGATCGAAGAAGATCGATCGATC",
    "GGGTCGATCGATCGAAAAGATCGATCGATC",
    "GGATCGATCGATCGAAGAGATCGATCGATC",
    "ACTTCGATCGATCGAACAAGATCGATCGATC",
    "GGTTCGATCGATCGACGAGATCGATCGATC",
    "GCGTCCATCGATCGAAGAAGATCGATCGATC"
  ),
  freq = c(
    30, 60, 9, 10, 14, 5, 10, 30, 6, 4, 6
  )
)
# barcode backbone with UMI and barcode
pattern <- "([ACTG]{3})TCGATCGATCGA([ACTG]+)ATCGATCGATC"
bc_extract(
  list(test = d1),
  pattern,
  sample_name = c("test"),
  pattern_type = c(UMI = 1, barcode = 2))
#> Bonjour le monde, This is a BarcodeObj.
#> ----------
#> It contains:
#> ----------
#> @metadata: 2 field(s) available:
#> raw_read_count barcode_read_count
#> ----------
#> @messyBc: 1 sample(s) for raw barcodes:
#> In sample $test there are: 10 Tags
###