Biostrings package test 1 (Wednesday, 2020-01-29)

1. Set the current working directory

setwd("Biostrings/")

2. Load the R package

library(Biostrings)

3. Brief package information

3.1 Description

Package: Biostrings

Title: Efficient manipulation of biological strings

Description: Memory efficient string containers, string matching

algorithms, and other utilities, for fast manipulation of large

biological sequences or sets of sequences.

Version: 2.54.0

Encoding: UTF-8

Author: H. Pagès, P. Aboyoun, R. Gentleman, and S. DebRoy

Maintainer: H. Pagès <hpages@fredhutch.org>

biocViews: SequenceMatching, Alignment, Sequencing, Genetics,

DataImport, DataRepresentation, Infrastructure

Depends: R (>= 3.5.0), methods, BiocGenerics (>= 0.31.5), S4Vectors (>=

0.21.13), IRanges, XVector (>= 0.23.2)

Imports: graphics, methods, stats, utils

LinkingTo: S4Vectors, IRanges, XVector

Enhances: Rmpi

Suggests: BSgenome (>= 1.13.14), BSgenome.Celegans.UCSC.ce2 (>=

1.3.11), BSgenome.Dmelanogaster.UCSC.dm3 (>= 1.3.11),

BSgenome.Hsapiens.UCSC.hg18, drosophila2probe, hgu95av2probe,

hgu133aprobe, GenomicFeatures (>= 1.3.14), hgu95av2cdf, affy

(>= 1.41.3), affydata (>= 1.11.5), RUnit

License: Artistic-2.0

LazyLoad: yes

Collate: 00datacache.R utils.R IUPAC_CODE_MAP.R AMINO_ACID_CODE.R

GENETIC_CODE.R XStringCodec-class.R seqtype.R XString-class.R

XStringSet-class.R XStringSet-comparison.R XStringViews-class.R

MaskedXString-class.R XStringSetList-class.R xscat.R

XStringSet-io.R letter.R getSeq.R letterFrequency.R

dinucleotideFrequencyTest.R chartr.R reverseComplement.R

translate.R toComplex.R replaceAt.R replaceLetterAt.R

injectHardMask.R padAndClip.R strsplit-methods.R misc.R

SparseList-class.R MIndex-class.R lowlevel-matching.R

match-utils.R matchPattern.R maskMotif.R matchLRPatterns.R

trimLRPatterns.R matchProbePair.R matchPWM.R findPalindromes.R

PDict-class.R matchPDict.R XStringPartialMatches-class.R

XStringQuality-class.R QualityScaledXStringSet.R InDel-class.R

AlignedXStringSet-class.R PairwiseAlignments-class.R

PairwiseAlignmentsSingleSubject-class.R PairwiseAlignments-io.R

align-utils.R pmatchPattern.R pairwiseAlignment.R stringDist.R

needwunsQS.R MultipleAlignment.R matchprobes.R zzz.R

git_url: https://git.bioconductor.org/packages/Biostrings

git_branch: RELEASE_3_10

git_last_commit: b8982e7

git_last_commit_date: 2019-10-29

Date/Publication: 2019-10-29

NeedsCompilation: yes

Packaged: 2019-10-30 01:22:38 UTC; biocbuild

Built: R 3.6.1; i386-w64-mingw32; 2019-10-30 12:46:43 UTC; windows

Archs: i386, x64
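
The fields above can also be read programmatically. A minimal sketch, assuming Biostrings is installed; packageDescription() and packageVersion() come from the utils package that ships with R, and the variable names desc and ver are only illustrative:

```r
# Read the DESCRIPTION metadata of an installed package.
desc <- utils::packageDescription("Biostrings")

# Individual fields are available as list elements.
desc$Title
desc$Version

# packageVersion() returns a comparable version object,
# handy for guarding code that needs a minimum release.
ver <- utils::packageVersion("Biostrings")
ver >= "2.54.0"
```

The version check is useful because some outputs in this post may differ in other Biostrings releases.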

3.2 Main functions

ls("package:Biostrings")

[1] “%in%”

[2] “AA_ALPHABET”

[3] “AA_PROTEINOGENIC”

[4] “AA_STANDARD”

[5] “AAMultipleAlignment”

[6] “AAString”

[7] “AAStringSet”

[8] “AAStringSetList”

[9] “aligned”

[10] “alignedPattern”

[11] “alignedSubject”

[12] “alphabet”

[13] “alphabetFrequency”

[14] “AMINO_ACID_CODE”

[15] “as.data.frame”

[16] “as.list”

[17] “as.matrix”

[18] “BString”

[19] “BStringSet”

[20] “BStringSetList”

[21] “chartr”

[22] “codons”

[23] “coerce”

[24] “collapse”

[25] “colmask”

[26] “colmask<-”

[27] “compareStrings”

[28] “complement”

[29] “computeAllFlinks”

[30] “consensusMatrix”

[31] “consensusString”

[32] “consensusViews”

[33] “countPattern”

[34] “countPDict”

[35] “countPWM”

[36] “coverage”

[37] “deletion”

[38] “detail”

[39] “dinucleotideFrequency”

[40] “dinucleotideFrequencyTest”

[41] “DNA_ALPHABET”

[42] “DNA_BASES”

[43] “DNAMultipleAlignment”

[44] “DNAString”

[45] “DNAStringSet”

[46] “DNAStringSetList”

[47] “duplicated”

[48] “encoding”

[49] “end”

[50] “endIndex”

[51] “errorSubstitutionMatrices”

[52] “extract_character_from_XString_by_positions”

[53] “extract_character_from_XString_by_ranges”

[54] “extractAllMatches”

[55] “extractAt”

[56] “fasta.index”

[57] “fasta.seqlengths”

[58] “fastq.geometry”

[59] “fastq.seqlengths”

[60] “findPalindromes”

[61] “gaps”

[62] “GENETIC_CODE”

[63] “GENETIC_CODE_TABLE”

[64] “get_seqtype_conversion_lookup”

[65] “getGeneticCode”

[66] “getSeq”

[67] “gregexpr2”

[68] “hasAllFlinks”

[69] “hasLetterAt”

[70] “hasOnlyBaseLetters”

[71] “head”

[72] “IlluminaQuality”

[73] “indel”

[74] “initialize”

[75] “injectHardMask”

[76] “insertion”

[77] “intersect”

[78] “is.unsorted”

[79] “isMatchingAt”

[80] “isMatchingEndingAt”

[81] “isMatchingStartingAt”

[82] “IUPAC_CODE_MAP”

[83] “lcprefix”

[84] “lcsubstr”

[85] “lcsuffix”

[86] “letter”

[87] “letterFrequency”

[88] “letterFrequencyInSlidingView”

[89] “longestConsecutive”

[90] “make_XString_from_string”

[91] “make_XStringSet_from_strings”

[92] “mask”

[93] “maskeddim”

[94] “maskedncol”

[95] “maskednrow”

[96] “maskedratio”

[97] “maskedwidth”

[98] “maskGaps”

[99] “maskMotif”

[100] “masks”

[101] “masks<-”

[102] “match”

[103] “matchLRPatterns”

[104] “matchPattern”

[105] “matchPDict”

[106] “matchProbePair”

[107] “matchprobes”

[108] “matchPWM”

[109] “maxScore”

[110] “maxWeights”

[111] “mergeIUPACLetters”

[112] “minScore”

[113] “minWeights”

[114] “mismatch”

[115] “mismatchSummary”

[116] “mismatchTable”

[117] “mkAllStrings”

[118] “N50”

[119] “nchar”

[120] “nedit”

[121] “neditAt”

[122] “neditEndingAt”

[123] “neditStartingAt”

[124] “needwunsQS”

[125] “nindel”

[126] “nmatch”

[127] “nmismatch”

[128] “nnodes”

[129] “nucleotideFrequencyAt”

[130] “nucleotideSubstitutionMatrix”

[131] “oligonucleotideFrequency”

[132] “oligonucleotideTransitions”

[133] “order”

[134] “padAndClip”

[135] “pairwiseAlignment”

[136] “PairwiseAlignments”

[137] “PairwiseAlignmentsSingleSubject”

[138] “palindromeArmLength”

[139] “palindromeLeftArm”

[140] “palindromeRightArm”

[141] “parallelSlotNames”

[142] “parallelVectorNames”

[143] “pattern”

[144] “patternFrequency”

[145] “pcompare”

[146] “PDict”

[147] “PhredQuality”

[148] “pid”

[149] “pmatchPattern”

[150] “PWM”

[151] “PWMscoreStartingAt”

[152] “quality”

[153] “QualityScaledAAStringSet”

[154] “QualityScaledBStringSet”

[155] “QualityScaledDNAStringSet”

[156] “QualityScaledRNAStringSet”

[157] “qualitySubstitutionMatrices”

[158] “rank”

[159] “readAAMultipleAlignment”

[160] “readAAStringSet”

[161] “readBStringSet”

[162] “readDNAMultipleAlignment”

[163] “readDNAStringSet”

[164] “readQualityScaledDNAStringSet”

[165] “readRNAMultipleAlignment”

[166] “readRNAStringSet”

[167] “relistToClass”

[168] “replaceAmbiguities”

[169] “replaceAt”

[170] “replaceLetterAt”

[171] “reverse”

[172] “reverseComplement”

[173] “RNA_ALPHABET”

[174] “RNA_BASES”

[175] “RNA_GENETIC_CODE”

[176] “RNAMultipleAlignment”

[177] “RNAString”

[178] “RNAStringSet”

[179] “RNAStringSetList”

[180] “rowmask”

[181] “rowmask<-”

[182] “saveXStringSet”

[183] “score”

[184] “seqtype”

[185] “seqtype<-”

[186] “setdiff”

[187] “setequal”

[188] “show”

[189] “showAsCell”

[190] “SolexaQuality”

[191] “sort”

[192] “stackStrings”

[193] “start”

[194] “startIndex”

[195] “stringDist”

[196] “strsplit”

[197] “subject”

[198] “subpatterns”

[199] “subseq”

[200] “subseq<-”

[201] “substr”

[202] “substring”

[203] “summary”

[204] “tail”

[205] “tb”

[206] “tb.width”

[207] “threebands”

[208] “toComplex”

[209] “toString”

[210] “translate”

[211] “trimLRPatterns”

[212] “trinucleotideFrequency”

[213] “twoWayAlphabetFrequency”

[214] “type”

[215] “unaligned”

[216] “union”

[217] “uniqueLetters”

[218] “unitScale”

[219] “unmasked”

[220] “unstrsplit”

[221] “updateObject”

[222] “vcountPattern”

[223] “vcountPDict”

[224] “Views”

[225] “vmatchPattern”

[226] “vmatchPDict”

[227] “vwhichPDict”

[228] “which.isMatchingAt”

[229] “which.isMatchingEndingAt”

[230] “which.isMatchingStartingAt”

[231] “whichPDict”

[232] “width”

[233] “width0”

[234] “windows”

[235] “write.phylip”

[236] “writePairwiseAlignments”

[237] “writeQualityScaledXStringSet”

[238] “writeXStringSet”

[239] “xscat”

[240] “xscodes”
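
Because the export list is long, it can help to search it instead of reading it. A small sketch, assuming the package is attached; the search pattern and the variable name exports are my own choices:

```r
library(Biostrings)

# Collect every name visible in the attached package environment.
exports <- ls("package:Biostrings")
length(exports)

# Keep only the names that start with "match".
grep("^match", exports, value = TRUE)

# Check whether a specific function is available.
"reverseComplement" %in% exports
```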

3.3 Introduction

The package's string containers are designed to:

(1) use R external pointers to store the string data,

(2) use bit patterns to encode the string data,

(3) provide the user with a convenient class of objects where each instance can store a set of views on the same big string (these views being typically the matches returned by a search algorithm).
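
Point (3) can be illustrated with a search: matchPattern() returns its hits as views on the subject rather than as copied substrings. A minimal sketch; the sequence, the pattern, and the names dna and hits are made-up examples:

```r
library(Biostrings)

dna <- DNAString("TAATAATG")

# Search for an exact pattern; the result is an XStringViews object
# whose views all point into the same subject string.
hits <- matchPattern("TAA", dna)
class(hits)

# Coordinates of the matches on the subject.
start(hits)
end(hits)
width(hits)
```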

4. Tests

4.1 The XString class and its subsetting operator [

b <- BString("I am a BString object")

#@ Contents of b
b

21-letter "BString" instance

seq: I am a BString object

#@ Length of b
length(b)

[1] 21

#@ A DNAString object:
d <- DNAString("TTGAAAA-CTC-N")
d

13-letter "DNAString" instance

seq: TTGAAAA-CTC-N

#@ Length of d
length(d)

[1] 13

#@ The differences with a BString object are: (1) only letters from the IUPAC extended genetic alphabet + the gap letter (-) are allowed and (2) each letter in the argument passed to the DNAString function is encoded in a special way before it’s stored in the DNAString object
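
The restriction in (1) can be checked directly. A short sketch, assuming Z as an example of a letter outside the nucleotide alphabet; the helper name is_valid_dna is hypothetical and not part of Biostrings:

```r
library(Biostrings)

# The letters a DNAString may contain
# (IUPAC codes plus a few special symbols such as the gap "-").
DNA_ALPHABET

# Hypothetical helper: TRUE if the string can be stored as a DNAString.
is_valid_dna <- function(s) {
  tryCatch({
    DNAString(s)
    TRUE
  }, error = function(e) FALSE)
}

is_valid_dna("TTGAAAA-CTC-N")  # only letters from DNA_ALPHABET
is_valid_dna("TTGAZ")          # Z is not in DNA_ALPHABET

# A BString accepts the same text without complaint.
BString("TTGAZ")
```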

Access to the individual letters:

#@ The third letter of d
d[3]

1-letter "DNAString" instance

seq: G

#@ Letters 7 through 12 of d
d[7:12]

6-letter "DNAString" instance

seq: A-CTC-

#@ Letters 1 through 3 of d
d[1:3]

3-letter "DNAString" instance

seq: TTG

#@ All letters of d
d[]

13-letter "DNAString" instance

seq: TTGAAAA-CTC-N

#@ Compare b in reverse order with b in its original order
b[length(b):1]

21-letter "BString" instance

seq: tcejbo gnirtSB a ma I

b

21-letter "BString" instance

seq: I am a BString object

#@ Only in-bounds positive numeric subscripts are supported. In fact the subsetting operator for XString objects is not efficient and one should always use the subseq method to extract a substring from a big string:

bb <- subseq(b, 3, 6)
bb

4-letter "BString" instance

seq: am a

dd1 <- subseq(d, end=7)
dd1

7-letter "DNAString" instance

seq: TTGAAAA

dd2 <- subseq(d, start=8)
dd2

6-letter "DNAString" instance

seq: -CTC-N
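
subseq() also accepts a width argument, so a window can be given as start plus width or as end plus width. A small sketch on the same sequence; the names w1 and w2 are illustrative:

```r
library(Biostrings)

d <- DNAString("TTGAAAA-CTC-N")

# Four letters starting at position 3.
w1 <- subseq(d, start = 3, width = 4)
w1

# The last six letters, given as end plus width.
w2 <- subseq(d, end = length(d), width = 6)
w2
```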

#@ To dump an XString object as a character vector (of length 1), use the toString method:
toString(dd2)

[1] "-CTC-N"

Note that length(dd2) is equivalent to nchar(toString(dd2)) but the latter would be very inefficient on a big DNAString object.
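
A quick check of that equivalence, as a sketch that rebuilds dd2 so it runs on its own (the variable s is illustrative):

```r
library(Biostrings)

dd2 <- subseq(DNAString("TTGAAAA-CTC-N"), start = 8)

# Both expressions give the number of letters...
length(dd2)
nchar(toString(dd2))

# ...but the second one first copies the whole sequence into an
# ordinary R character string.
s <- toString(dd2)
class(s)
```

On a short sequence the cost is negligible; the difference only matters for very long sequences.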

[TODO: Make a generic of the substr() function to work with XString objects. It will be essentially doing toString(subseq()).]

4.2 The == binary operator for XString objects

#@ The 2 following comparisons are TRUE:

bb == "am a"

[1] TRUE

bb

4-letter "BString" instance

seq: am a

dd2 != DNAString("TG")

[1] TRUE

dd2

6-letter "DNAString" instance

seq: -CTC-N

#@ When the 2 sides of == don't belong to the same class then the side belonging to the "lowest" class is first converted to an object belonging to the class of the other side (the "highest" class).
#@ The class (pseudo-)order is character < BString < DNAString. When both sides are XString objects of the same subtype (e.g. both are DNAString objects) then the comparison is very fast because it only has to call the C standard function memcmp() and no memory allocation or string encoding/decoding is required.
#@ The 2 following expressions provoke an error because the right member can't be "upgraded" (converted) to an object of the same class as the left member:

bb == ""

Error in bb == "" :

comparison between a "BString" object and a character vector of length != 1 or an empty string or an NA is not supported

d == bb

Error in d == bb :

comparison between a "DNAString" instance and a "BString" instance is not supported

#@ When comparing an RNAString object with a DNAString object, U and T are considered equal:

r <- RNAString(d)
r

13-letter "RNAString" instance

seq: UUGAAAA-CUC-N

r == d

[1] TRUE
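
The comparison above ignores the T/U difference, whereas the plain character dumps do not. A brief sketch of that contrast, reusing the sequence from this section:

```r
library(Biostrings)

d <- DNAString("TTGAAAA-CTC-N")
r <- RNAString(d)   # every T is rewritten as U

# XString comparison treats U and T as the same letter.
r == d

# The character representations still differ.
toString(r)
toString(d)
toString(r) == toString(d)
```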

4.3 The XStringViews class and its subsetting operators [ and [[

#@ An XStringViews object contains a set of views on the same XString object called the subject string. Here is an XStringViews object with 4 views:

v4 <- Views(dd2, start=3:0, end=5:8)
class(v4)

[1] "XStringViews"

attr(,"package")

[1] "Biostrings"

v4

Views on a 6-letter DNAString subject

subject: -CTC-N

views:

start end width

[1] 3 5 3 [TC-]

[2] 2 6 5 [CTC-N]

[3] 1 7 7 [-CTC-N ]

[4] 0 8 9 [ -CTC-N ]

length(v4)

[1] 4

test_v <- Views(dd2, start = 4:1, end = 5:8)
class(test_v)

[1] "XStringViews"

attr(,"package")

[1] "Biostrings"

test_v

Views on a 6-letter DNAString subject

subject: -CTC-N

views:

start end width

[1] 4 5 2 [C-]

[2] 3 6 4 [TC-N]

[3] 2 7 6 [CTC-N ]

[4] 1 8 8 [-CTC-N ]

#@ Note that the last 2 views are out of limits (they extend beyond the 6-letter subject).
#@ You can select a subset of views from an XStringViews object:
v4[4:2]

Views on a 6-letter DNAString subject

subject: -CTC-N

views:

start end width

[1] 0 8 9 [ -CTC-N ]

[2] 1 7 7 [-CTC-N ]

[3] 2 6 5 [CTC-N]

#@ The returned object is still an XStringViews object, even if we select only one element.
#@ You need to use double-brackets to extract a given view as an XString object:
v4[[2]]

5-letter "DNAString" instance

seq: CTC-N

#@ You can’t extract a view that is out of limits:
v4[[3]]

Error in getListElement(x, i, ...) : view is out of limits

#@ Note that, when start and end are numeric vectors and i is a single integer, Views(b, start, end)[[i]] is equivalent to subseq(b, start[i], end[i]).
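
That equivalence can be verified on views that stay within the subject. A minimal sketch; the start and end positions are arbitrary choices of mine:

```r
library(Biostrings)

b <- BString("I am a BString object")
starts <- c(1L, 3L, 8L)
ends   <- c(1L, 6L, 14L)
v <- Views(b, start = starts, end = ends)

# Compare each extracted view with the matching subseq() call.
same <- vapply(seq_along(starts), function(i) {
  toString(v[[i]]) == toString(subseq(b, starts[i], ends[i]))
}, logical(1))
same

# The third view covers positions 8 to 14.
toString(v[[3]])
```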
#@ Subsetting also works with negative or logical values with the expected semantics (the same as for R built-in vectors):
v4[-3]

Views on a 6-letter DNAString subject

subject: -CTC-N

views:

start end width

[1] 3 5 3 [TC-]

[2] 2 6 5 [CTC-N]

[3] 0 8 9 [ -CTC-N ]

v4[c(TRUE, FALSE)]

Views on a 6-letter DNAString subject

subject: -CTC-N

views:

start end width

[1] 3 5 3 [TC-]

[2] 1 7 7 [-CTC-N ]

#@ Note that the logical vector is recycled to the length of v4

4.4 A few more XStringViews objects

12 views (all of the same width):

v12 <- Views(DNAString("TAATAATG"), start=-2:9, end=0:11)
v12

Views on a 8-letter DNAString subject

subject: TAATAATG

views:

start end width

[1] -2 0 3 [ ]

[2] -1 1 3 [ T]

[3] 0 2 3 [ TA]

[4] 1 3 3 [TAA]

[5] 2 4 3 [AAT]

… … … … …

[8] 5 7 3 [AAT]

[9] 6 8 3 [ATG]

[10] 7 9 3 [TG ]

[11] 8 10 3 [G ]

[12] 9 11 3 [ ]

This is the same as doing Views(d, start=1, end=length(d)):

as(d, "Views")

Views on a 13-letter DNAString subject

subject: TTGAAAA-CTC-N

views:

start end width

[1] 1 13 13 [TTGAAAA-CTC-N]

#@ Hence the following will always return the d object itself:
as(d, "Views")[[1]]

13-letter "DNAString" instance

seq: TTGAAAA-CTC-N

#@ 3 XStringViews objects with no view:
v12[0]

Views on a 8-letter DNAString subject

subject: TAATAATG

views: NONE

v12[FALSE]

Views on a 8-letter DNAString subject

subject: TAATAATG

views: NONE

Views(d)

Views on a 13-letter DNAString subject

subject: TTGAAAA-CTC-N

views: NONE

4.5 The == binary operator for XStringViews objects

#@ This operator is the vectorized version of the == operator defined previously for XString objects:
v12 == DNAString("TAA")

[1] FALSE FALSE FALSE TRUE FALSE FALSE TRUE FALSE FALSE FALSE

[11] FALSE FALSE

v12

Views on a 8-letter DNAString subject

subject: TAATAATG

views:

start end width

[1] -2 0 3 [ ]

[2] -1 1 3 [ T]

[3] 0 2 3 [ TA]

[4] 1 3 3 [TAA]

[5] 2 4 3 [AAT]

… … … … …

[8] 5 7 3 [AAT]

[9] 6 8 3 [ATG]

[10] 7 9 3 [TG ]

[11] 8 10 3 [G ]

[12] 9 11 3 [ ]

v12 == DNAString("ATG")

[1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE

[11] FALSE FALSE

v12 == DNAString("ATGA")

[1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE

[11] FALSE FALSE

#@ To display all the views in v12 that are equal to a given view, you can type R cuties like:
v12[v12 == v12[4]]

Views on a 8-letter DNAString subject

subject: TAATAATG

views:

start end width

[1] 1 3 3 [TAA]

[2] 4 6 3 [TAA]

v12[v12 == v12[1]]

Views on a 8-letter DNAString subject

subject: TAATAATG

views:

start end width

[1] -2 0 3 [ ]

[2] 9 11 3 [ ]
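
The same vectorized comparison can serve as a naive sliding-window search. A sketch that keeps every view inside the subject (the window width of 3 and the names dna, k, wins, is_taa are my choices); for real searches matchPattern() is the more appropriate tool:

```r
library(Biostrings)

dna <- DNAString("TAATAATG")
k <- 3L

# One view per window of width k, all within the subject.
starts <- seq_len(length(dna) - k + 1L)
wins <- Views(dna, start = starts, width = k)

# Vectorized comparison, then the indices of the equal windows.
is_taa <- wins == DNAString("TAA")
which(is_taa)
```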

#@ The following comparison returns FALSE in this session:
v12[3] == Views(RNAString("AU"), start=0, end=2)

[1] FALSE

4.6 The start, end and width methods

start(v4)

[1] 3 2 1 0

end(v4)

[1] 5 6 7 8

width(v4)

[1] 3 5 7 9

#@ Note that start(v4)[i] is equivalent to start(v4[i]), except that the former will not issue an error if i is out of bounds (same for end and width methods).
#@ Also, when i is a single integer, width(v4)[i] is equivalent to length(v4[[i]]) except that the former will not issue an error if i is out of bounds or if view v4[i] is out of limits.
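
Both notes can be tried out as follows. A minimal sketch that rebuilds v4 so it runs on its own, and uses tryCatch() to capture the out-of-limits error instead of stopping:

```r
library(Biostrings)

dd2 <- DNAString("-CTC-N")
v4 <- Views(dd2, start = 3:0, end = 5:8)

# In-limits view: both forms agree.
width(v4)[2] == length(v4[[2]])
start(v4)[2] == start(v4[2])

# Out-of-limits view: the accessor still answers...
width(v4)[3]

# ...while extracting the view itself raises an error, caught here
# and turned into its message text.
res <- tryCatch(v4[[3]], error = function(e) conditionMessage(e))
res
```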

5. End

sessionInfo()

R version 3.6.2 (2019-12-12)

Platform: x86_64-w64-mingw32/x64 (64-bit)

Running under: Windows 10 x64 (build 18363)

Matrix products: default

locale:

[1] LC_COLLATE=Chinese (Simplified)_China.936

[2] LC_CTYPE=Chinese (Simplified)_China.936

[3] LC_MONETARY=Chinese (Simplified)_China.936

[4] LC_NUMERIC=C

[5] LC_TIME=Chinese (Simplified)_China.936

attached base packages:

[1] stats4 parallel stats graphics grDevices utils

[7] datasets methods base

other attached packages:

[1] Biostrings_2.54.0 XVector_0.26.0 IRanges_2.20.2

[4] S4Vectors_0.24.2 BiocGenerics_0.32.0

loaded via a namespace (and not attached):

[1] Seurat_3.1.2 TH.data_1.0-10

[3] Rtsne_0.15 colorspace_1.4-1

[5] seqinr_3.6-1 pryr_0.1.4

[7] ggridges_0.5.2 rstudioapi_0.10

[9] leiden_0.3.2 listenv_0.8.0

[11] npsurv_0.4-0 ggrepel_0.8.1

[13] alakazam_0.3.0 mvtnorm_1.0-12

[15] codetools_0.2-16 splines_3.6.2

[17] R.methodsS3_1.7.1 mnormt_1.5-5

[19] lsei_1.2-0 TFisher_0.2.0

[21] zeallot_0.1.0 ade4_1.7-13

[23] jsonlite_1.6 packrat_0.5.0

[25] ica_1.0-2 cluster_2.1.0

[27] png_0.1-7 R.oo_1.23.0

[29] uwot_0.1.5 sctransform_0.2.1

[31] readr_1.3.1 compiler_3.6.2

[33] httr_1.4.1 backports_1.1.5

[35] assertthat_0.2.1 Matrix_1.2-18

[37] lazyeval_0.2.2 htmltools_0.4.0

[39] prettyunits_1.1.0 tools_3.6.2

[41] rsvd_1.0.2 igraph_1.2.4.2

[43] gtable_0.3.0 glue_1.3.1

[45] RANN_2.6.1 reshape2_1.4.3

[47] dplyr_0.8.3 Rcpp_1.0.3

[49] Biobase_2.46.0 vctrs_0.2.1

[51] multtest_2.42.0 gdata_2.18.0

[53] ape_5.3 nlme_3.1-142

[55] gbRd_0.4-11 lmtest_0.9-37

[57] stringr_1.4.0 globals_0.12.5

[59] lifecycle_0.1.0 irlba_2.3.3

[61] gtools_3.8.1 future_1.16.0

[63] zlibbioc_1.32.0 MASS_7.3-51.4

[65] zoo_1.8-7 scales_1.1.0

[67] hms_0.5.3 sandwich_2.5-1

[69] RColorBrewer_1.1-2 reticulate_1.14

[71] pbapply_1.4-2 gridExtra_2.3

[73] ggplot2_3.2.1 stringi_1.4.3

[75] mutoss_0.1-12 plotrix_3.7-7

[77] caTools_1.17.1.4 bibtex_0.4.2.2

[79] Rdpack_0.11-1 SDMTools_1.1-221.2

[81] rlang_0.4.2 pkgconfig_2.0.3

[83] bitops_1.0-6 lattice_0.20-38

[85] ROCR_1.0-7 purrr_0.3.3

[87] htmlwidgets_1.5.1 cowplot_1.0.0

[89] tidyselect_0.2.5 RcppAnnoy_0.0.14

[91] plyr_1.8.5 magrittr_1.5

[93] R6_2.4.1 gplots_3.0.1.2

[95] multcomp_1.4-12 pillar_1.4.3

[97] sn_1.5-4 fitdistrplus_1.0-14

[99] survival_3.1-8 tibble_2.1.3

[101] future.apply_1.4.0 tsne_0.1-3

[103] crayon_1.3.4 KernSmooth_2.23-16

[105] plotly_4.9.1 progress_1.2.2

[107] grid_3.6.2 data.table_1.12.8

[109] metap_1.2 digest_0.6.23

[111] tidyr_1.0.0 numDeriv_2016.8-1.1

[113] R.utils_2.9.2 RcppParallel_4.4.4

[115] munsell_0.5.0 viridisLite_0.3.0
