Skip to contents

Introduction

This vignette has the objective to introduce MS2extract. The main goal of this package is providing a tool to create in-house MS/MS compound libraries. Users can access a specific function help through the command help([‘function name’]).

A simplified workflow is presented in Figure 1. Briefly, mzML/mzXML files are imported in memory. Then, based on metadata provided by the user such as metabolite chemical formula, and ionization mode, MS2extract computes the theoretical precursor m/z. With this information, product ion scans that matches the theoretical ionized mass are extracted, with a given ppm tolerance. Next, low intensity signals, or background noise can be removed from the spectra. Finally, users can export the extracted MS/MS spectra to a MS/MS library format (.msp/.mgf) to be used as reference library for further compound identification and annotation, or deposit the created library in different repositories such as GNPS, or MassBank.

Figure 1. Overview of the processing pipeline of MS2extract

Figure 1. Overview of the processing pipeline of MS2extract

Basic workflow

The workflow has four main steps:

  • data import,
  • extract MS/MS scans,
  • detect masses, and
  • export the MS/MS library

In this section, we will explain in a more detailed manner the main steps to create in-house MS/MS libraries, as well as provide information about the required and optional arguments that users may need to provide in order to effectively use this package.

Additionally, this package also includes a set of batch_*() functions that allows to process multiple .mzXML files at once. However, more metadata is required to run this automated pipeline and the use of this batch_*() functions will is described in the Using MS2extract Batch Pipeline.

Data import

This section is focused on describing how MS2extract package imports MS/MS data. We also include a more detailed document about this process in the Behind the curtains of importing MS/MS data vignette.

The main import function relies on R package metID. We adapted the import function in order to read mass spectrometry data from mzML/mzXML files. The new adaptation consists in importing scans data in a list (S3 object) rather than into a S4 object, facilitating the downstream tidy analysis of this object.

This function execute a back-end calculation of theoretical ionized m/z of the compound in order to extract the precursor ions that match that mass with a given ppm.

The arguments of the import_mzxml() function are:

  • file: mzML/mzXML file name
  • met_metadata: metadata of the analyte
  • ppm: error mass expressed in ppm
# Loading the package
library(MS2extract)


# Print function arg
formals(import_mzxml)
#> $file
#> NULL
#> 
#> $met_metadata
#> NULL
#> 
#> $ppm
#> [1] 10

file

This argument is to pass the file name of your MS/MS data. We provide an example file of procyanidin A2, collected in negative ionization mode at a 20 eV CE.

# Importing  Procyanidin A2 MS/MS spectra in negative ionization mode
# and 20 eV as the collision energy
ProcA2_file <- system.file("extdata",
  "ProcyanidinA2_neg_20eV.mzXML",
  package = "MS2extract"
)
# File name
ProcA2_file
#> [1] "/home/runner/work/_temp/Library/MS2extract/extdata/ProcyanidinA2_neg_20eV.mzXML"

met_metadata

This argument refers to the compound metadata that users need to provide in order to properly import scans that are related to the compound of interest.

The met_metadata is a data frame that has required and optional columns. The required columns are employed to calculate the theoretical ionized m/z value, for a given formula and ionization mode. In the optional columns, we have the option to provide a chromatographic region of Interest (ROI), specifying at what time the the compound elutes, in order to only keep this retention time window.

The required columns are:

  • Formula: A character string specifying the metabolite formula
  • Ionization_mode: The ionization mode employed in data collection.
  • COLLISIONENERGY: CE applied in MS/MS fragmentation

The optional columns are:

  • min_rt: a double, minimum retention time to keep (in seconds)
  • max_rt: a double, maximum retention time to keep (in seconds)
# Procyanidin A2 metadata
ProcA2_data <- data.frame(
  Formula = "C30H24O12", Ionization_mode = "Negative",
  min_rt = 163, max_rt = 180
)
ProcA2_data
#>     Formula Ionization_mode min_rt max_rt
#> 1 C30H24O12        Negative    163    180

ppm

ppm refers to the maximum m/z deviation from the theoretical mass. A ppm of 10 units will mean that the total allows m/z window in 20 ppm. By default, 10 ppm is used.

import_mzxml()

With all arguments explained, we can use the import_mzxml() function.

# Import Procyanidin A2 data
ProcA2_raw <- import_mzxml(ProcA2_file, met_metadata = ProcA2_data, ppm = 5)
#> • Processing: ProcyanidinA2_neg_20eV.mzXML
#> • Found 1 CE value: 20
#> • Remember to match CE velues in spec_metadata when exporting your library
#> • m/z range given 5 ppm: 575.11663 and 575.12238

# 24249 rows = ions detected in all scans
dim(ProcA2_raw)
#> [1] 17829     6

Extracting MS/MS spectra

Now that we have the data imported, we can proceed to extract the most intense MS/MS scan.

This function computes the MS/MS total ion chromatogram (TIC) by summing up all intensities of the MS/MS scans, and selects the scan with the highest total intensity. It is worth noting that we only imported MS/MS scans that the precursor ion matches the theoretical m/z value of the compound provided in the previous step. Therefore, it is more accurrate to interpret this chromatogram as an EIC of the precursor ion, where only the MS/MS scans are included.

This function takes three arguments:

  • spec: the imported MS/MS spectra
  • verbose: a boolean, if verbose = TRUE, the MS/MS TIC and spectra is printed, if verbose = FALSE, plots are not displayed
  • out_list: a boolean, if out_list = TRUE, the extracted MS/MS spectra table and plots are returned as list, otherwise only the MS/MS spectra is returned as data frame.
ProcA2_extracted <- extract_MS2(ProcA2_raw, verbose = TRUE, out_list = FALSE)
#> Scale for x is already present.
#> Adding another scale for x, which will replace the existing scale.
#> Warning: `position_stack()` requires non-overlapping x
#> intervals.

We generated two plots, the MS/MS EIC of the precursor ion (top plot), and the MS/MS spectra of the most intense MS/MS scan (bottom plot). In the MS/MS spectra, the blue diamond is placed on top of the precursor m/z ion. If the diamond is a blue fill diamond, it means the precursor ion was found in the MS/MS fragmentation data, while a diamond that is not blue filled, it will represent that the precursor ion was not found in the fragmentation data.

Furthermore, we can note that the x axis in the MS/MS spectra ranges from 0 to 1700 m/z. This is more related to the acquisition parameters used in data collection m/z range: 50-1700, which creates low intensity signals that are captured and included in the resulting MS/MS spectra.

range(ProcA2_extracted$mz)
#> [1] 100.0852 578.1123

The range of the MS/MS m/z values are from 100 to 1699 m/z, but intensities are too low to be seen in the plot.

Detecting masses

Similarly to the MZmine pipeline, detecting masses refers to setting a signal intensity threshold that MS/MS ions has to meet to be kept in the data, while signals that do not meet the defined threshold are removed. This function can also normalize the spectra ion intensity to percentage based on the base peak. This is a filtering step that is based on percentage of the base peak (most intense ion).

The three required arguments are:

  • spec: a data frame containing the MS/MS spectra.
  • normalize: a boolean indicating if the MS/MS spectra is normalized by the base peak before proceeding to filter out low intensity signals (normalize = TRUE), if normalize = FALSE the user has to provide the minimum ion count.
  • min_int: an integer referring to the minimum ion intensity. The value of min_int has to be in line if user decides to normalize the spectra or not. If the spectra is normalized, the min_intensity value is in percentage, otherwise the min_intensity value is expressed in ion count units.

By default, the normalization is set to TRUE and the minimum intensity is set to 1% to remove background noise.

ProcA2_detected <- detect_mass(ProcA2_extracted, normalize = TRUE, min_int = 1)

We can see now the range of m/z values and the maximum value is 576.1221 m/z.

range(ProcA2_detected$mz)
#> [1] 125.0243 576.1221

MS/MS spectra plot

We can proceed to plot the filtered MS/MS spectra with plot_MS2spectra() function. This is a ggplot2 based function; the blue diamond refers to the precursor ion.

If we take a look to the previous MS/MS plot, there is less background noise in this MS/MS spectra because the low intensity ions have been removed.

plot_MS2spectra(ProcA2_detected)
#> Scale for x is already present.
#> Adding another scale for x, which will replace the existing scale.

Exporting MS/MS spectra

NIST .msp format

Finally after extracting the MS/MS spectra and removing background noise, we can proceed to export the MS/MS in a NIST .msp format.

For this task, we need extra information about the compound, such as SMILES, COLLISIONENERGY, etc. You can find the minimum required information by accessing the write_msp() function help by running the command ?write_msp.

An example of this table can be found at:

# Reading the metadata
metadata_file <- system.file("extdata",
  "msp_metadata.csv",
  package = "MS2extract"
)

metadata <- read.csv(metadata_file)
dplyr::glimpse(metadata)
#> Rows: 1
#> Columns: 8
#> $ NAME            <chr> "Procyanidin A2"
#> $ PRECURSORTYPE   <chr> "[M-H]-"
#> $ FORMULA         <chr> "C30H24O12"
#> $ INCHIKEY        <chr> "NSEWTSAADLNHNH-LSBOWGMISA-N"
#> $ SMILES          <chr> "C1C(C(OC2=C1C(=CC3=C2C4C(C(O3)(OC5=CC(=CC(=C45)O)O)C6…
#> $ IONMODE         <chr> "Negative"
#> $ INSTRUMENTTYPE  <chr> "LC-ESI-QTOF"
#> $ COLLISIONENERGY <chr> "20 eV"

The three arguments for this function are:

  • spec: a data frame containing the extracted MS/MS spectra
  • spec_metadata: a data frame containing the values to be including in the resulting .msp file
  • msp_name: a string with the name of the msp file not containing (.msp) extension
write_msp(
  spec = ProcA2_detected,
  spec_metadata = metadata,
  msp_name = "Procyanidin_A2"
)

After writing the msp file, you will see the following file content:

#> • Filtering MS/MS scans for 20 CE
#> NAME: Procyanidin A2
#> PRECURSORMZ: 575.11957
#> PRECURSORTYPE: [M-H]-
#> FORMULA: C30H24O12
#> RETENTIONTIME: 2.844
#> IONMODE: Negative
#> COMMENT: Spectra extracted with MS2extract R package
#> INCHIKEY: NSEWTSAADLNHNH-LSBOWGMISA-N
#> SMILES: C1C(C(OC2=C1C(=CC3=C2C4C(C(O3)(OC5=CC(=CC(=C45)O)O)C6=CC(=C(C=C6)O)O)O)O)C7=CC(=C(C=C7)O)O)O
#> CCS: 
#> COLLISIONENERGY: 20 eV
#> INSTRUMENTTYPE: LC-ESI-QTOF
#> Num Peaks: 38
#> 125.02431 10
#> 137.02441 3
#> 161.02449 2
#> 163.00355 3
#> 165.01881 2
#> 217.04996 2
#> 241.05002 2
#> 245.04547 2
#> 245.0817 2
#> 257.0451 2
#> 285.04063 62
#> 286.04387 4
#> 287.05579 7
#> 289.0718 48
#> 290.07495 3
#> 297.03993 4
#> 307.06114 2
#> 313.03573 2
#> 327.05044 7
#> 407.07693 16
#> 408.08161 2
#> 411.07227 9
#> 423.07231 53
#> 424.07537 5
#> 435.07138 3
#> 447.07296 5
#> 449.08799 51
#> 450.09044 6
#> 452.07453 15
#> 453.08155 7
#> 471.1086 2
#> 513.11796 2
#> 531.13006 2
#> 539.09834 22
#> 540.10156 3
#> 557.10809 5
#> 575.11968 100
#> 576.12208 13

Session info

sessionInfo()
#> R version 4.4.1 (2024-06-14)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 22.04.5 LTS
#> 
#> Matrix products: default
#> BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so;  LAPACK version 3.10.0
#> 
#> locale:
#>  [1] LC_CTYPE=C.UTF-8       LC_NUMERIC=C           LC_TIME=C.UTF-8       
#>  [4] LC_COLLATE=C.UTF-8     LC_MONETARY=C.UTF-8    LC_MESSAGES=C.UTF-8   
#>  [7] LC_PAPER=C.UTF-8       LC_NAME=C              LC_ADDRESS=C          
#> [10] LC_TELEPHONE=C         LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C   
#> 
#> time zone: UTC
#> tzcode source: system (glibc)
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] MS2extract_0.99.0
#> 
#> loaded via a namespace (and not attached):
#>   [1] Rdpack_2.6.1                readxl_1.4.3               
#>   [3] rlang_1.1.4                 magrittr_2.0.3             
#>   [5] clue_0.3-65                 matrixStats_1.4.1          
#>   [7] compiler_4.4.1              systemfonts_1.1.0          
#>   [9] vctrs_0.6.5                 reshape2_1.4.4             
#>  [11] stringr_1.5.1               ProtGenerics_1.36.0        
#>  [13] pkgconfig_2.0.3             crayon_1.5.3               
#>  [15] fastmap_1.2.0               backports_1.5.0            
#>  [17] XVector_0.44.0              labeling_0.4.3             
#>  [19] utf8_1.2.4                  rmarkdown_2.28             
#>  [21] tzdb_0.4.0                  UCSC.utils_1.0.0           
#>  [23] preprocessCore_1.66.0       ragg_1.3.3                 
#>  [25] purrr_1.0.2                 xfun_0.48                  
#>  [27] MultiAssayExperiment_1.30.3 zlibbioc_1.50.0            
#>  [29] cachem_1.1.0                GenomeInfoDb_1.40.1        
#>  [31] jsonlite_1.8.9              highr_0.11                 
#>  [33] DelayedArray_0.30.1         BiocParallel_1.38.0        
#>  [35] broom_1.0.7                 parallel_4.4.1             
#>  [37] cluster_2.1.6               R6_2.5.1                   
#>  [39] bslib_0.8.0                 stringi_1.8.4              
#>  [41] limma_3.60.6                car_3.1-3                  
#>  [43] cellranger_1.1.0            GenomicRanges_1.56.2       
#>  [45] jquerylib_0.1.4             iterators_1.0.14           
#>  [47] Rcpp_1.0.13                 SummarizedExperiment_1.34.0
#>  [49] knitr_1.48                  readr_2.1.5                
#>  [51] IRanges_2.38.1              Matrix_1.7-0               
#>  [53] igraph_2.1.1                tidyselect_1.2.1           
#>  [55] abind_1.4-8                 yaml_2.3.10                
#>  [57] doParallel_1.0.17           codetools_0.2-20           
#>  [59] affy_1.82.0                 lattice_0.22-6             
#>  [61] tibble_3.2.1                plyr_1.8.9                 
#>  [63] withr_3.0.1                 Biobase_2.64.0             
#>  [65] evaluate_1.0.1              OrgMassSpecR_0.5-3         
#>  [67] desc_1.4.3                  ggpubr_0.6.0               
#>  [69] pillar_1.9.0                affyio_1.74.0              
#>  [71] BiocManager_1.30.25         carData_3.0-5              
#>  [73] MatrixGenerics_1.16.0       foreach_1.5.2              
#>  [75] stats4_4.4.1                MSnbase_2.30.1             
#>  [77] MALDIquant_1.22.3           ncdf4_1.23                 
#>  [79] generics_0.1.3              hms_1.1.3                  
#>  [81] S4Vectors_0.42.1            ggplot2_3.5.1              
#>  [83] munsell_0.5.1               scales_1.3.0               
#>  [85] glue_1.8.0                  lazyeval_0.2.2             
#>  [87] tools_4.4.1                 mzID_1.42.0                
#>  [89] QFeatures_1.14.2            vsn_3.72.0                 
#>  [91] mzR_2.38.0                  ggsignif_0.6.4             
#>  [93] fs_1.6.4                    XML_3.99-0.17              
#>  [95] cowplot_1.1.3               grid_4.4.1                 
#>  [97] impute_1.78.0               tidyr_1.3.1                
#>  [99] rbibutils_2.3               MsCoreUtils_1.16.1         
#> [101] colorspace_2.1-1            GenomeInfoDbData_1.2.12    
#> [103] PSMatch_1.8.0               Formula_1.2-5              
#> [105] cli_3.6.3                   textshaping_0.4.0          
#> [107] fansi_1.0.6                 S4Arrays_1.4.1             
#> [109] dplyr_1.1.4                 AnnotationFilter_1.28.0    
#> [111] pcaMethods_1.96.0           gtable_0.3.5               
#> [113] rstatix_0.7.2               sass_0.4.9                 
#> [115] digest_0.6.37               BiocGenerics_0.50.0        
#> [117] ggrepel_0.9.6               SparseArray_1.4.8          
#> [119] farver_2.1.2                htmlwidgets_1.6.4          
#> [121] htmltools_0.5.8.1           pkgdown_2.1.1              
#> [123] lifecycle_1.0.4             httr_1.4.7                 
#> [125] Rdisop_1.64.0               statmod_1.5.0              
#> [127] MASS_7.3-60.2