Characterization of Biodiesels and Vegetable Oils after Combustion using GC-MS

An Honours student I supervise presented an excellent poster at this year’s Research Week at Bishop’s University: Characterization of Biodiesels and Vegetable Oils and their Corresponding Combustion Residues.

Here is the abstract:

Biodiesel is one of the most common alternative fuels and is becoming more predominant on the market today. Due to the emergence of biodiesel, forensic analysts should be more aware of biodiesel components and properties, since they may be encountered more often in arson crime scene samples. Biodiesels are vegetable-oil- or animal-oil-based diesel fuels. Vegetable oils themselves undergo burning, self-heating, and spontaneous ignition, which means they too, albeit less often, are observed in fire debris samples. Vegetable oils and fuels derived from them are not effectively analyzed using regular fire debris analysis methods. A solvent extraction is more suitable than the typical passive headspace extraction that is used for ignitable liquids. The vegetable oils must also be derivatized in order to convert the fatty acids (FAs) found in the oils to the volatile fatty acid methyl esters (FAMEs) that are necessary for GC-MS analysis. This work will demonstrate and analyze the changes, if any, in the FAME components observed between neat and burned alternative fuel accelerants. Biodiesel blends and multiple household oils, such as soy and canola oils, will be used as the accelerants. The findings of this research will aid in further understanding and recognition of biodiesels and vegetable oils in fire debris.

Download the poster (pdf, 2 MB): ksaunders-gkos_biodiesel2015

… and research for the summer

I was able to do quite a bit of research during the fall and winter terms, despite my high teaching load, preparing and co-writing proposals for atmospheric chemistry studies.

My current work focuses on:

  1. Methodology development and validation for formaldehyde in air
  2. Multivariate statistics model development (mostly PCA-based) for the discrimination of contaminated foodstuffs by FTIR-ATR spectroscopy. The work is part of the EU FP7-funded MYCOSPEC project.

The second project in particular has been most interesting and challenging, working with medium-sized datasets (2000 × 3000 elements) in Matlab. After transitioning from R, I am now fairly well versed in Matlab as well.

Finally, I am occasionally helping to put the finishing touches on the Emissions Chapter of the Canadian Mercury Science Assessment, which I have led as chapter coordinator and lead author.

A review article on atmospheric & aerosol chemistry

I have co-authored a publication on the measurement of organics in the atmosphere and snow. My main contributions to the article are the analytical methodology sections, together with the final editing and seeing the paper through the review process.

P.A. Ariya, G. Kos, R. Mortazavi, E.D. Hudson, V. Kanthasamy, N. Eltouny, J. Sun and C. Wilde, Bio-Organic Materials in the Atmosphere and Snow: Measurement and Characterization, in V.F. McNeill, P.A. Ariya (eds.), Atmospheric and Aerosol Chemistry, ISBN 978-3-642-41214-1, Springer, NY (2013). doi:10.1007/128_2013_461.

Results from my mercury work now available in ACP

An article describing my work on the statistical evaluation of mercury transport model estimates and observations on a continental scale is now available in Atmospheric Chemistry and Physics.

AMNet oxidized mercury observation data from North America were used for the evaluation of GEM/RGM and TPM model estimates. A comprehensive uncertainty analysis is presented for measurement and model parameters. Statistical calculations and plots were done in R (except for the visualization of model output).

G. Kos, A. Ryzhkov, A. Dastoor, J. Narayan, A. Steffen, P. A. Ariya, L. Zhang, Evaluation of Discrepancy between Measured and Modeled Oxidized Mercury Species, Atmospheric Chemistry & Physics 13 (2013) 4839-4863, doi:10.5194/acp-13-4839-2013.

What to do with nominal variables in a mixed data set?

… and I keep looking for a definitive answer … (so do not read on, if you expect one!)

I would like to perform PCA on a mixed dataset of nominal and continuous variables. Examples of such variables are gender (m/f; nominal), living on a farm (y/n; nominal) and organic contaminant concentration data (continuous). The nominal variables cannot be ordered.

According to Kolenikov & Angeles the problem is that

“discrete data tend to have high skewness and kurtosis, especially if the majority of the data points are concentrated in a single category”

[Kolenikov & Angeles, 2004, The Use of Discrete Data in PCA: Theory, Simulations, and Applications to Socioeconomic Indices].

In brief: if I use my data as is, since I have lots of data in a single category, I am potentially introducing a bias. Furthermore, a board post on methodspace.com states that

“the problem is that when the distributions are far from 50:50 for a dichotomous variable, the correlations are suppressed.”

This is certainly the case for my data. However, a first step towards a solution is also offered (reiterating Kolenikov & Angeles’ suggestion):

“You can also go for estimating polychoric/tetrachoric correlations. Here you assume that each binary (or ordered discrete) indicator is a manifestation of underlying continuous variables, and so you estimate the correlations between these underlying variables.”

[http://www.methodspace.com/forum/topics/principal-component-analysis?commentId=2289984%3AComment%3A103267]
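
To make this route a bit more concrete, here is a minimal sketch of the polychoric approach in R, assuming a hypothetical data frame dat with the nominal variables stored as factors and the contaminant concentrations as numeric columns. This is only the shape of the workflow, not a final implementation.

    # Heterogeneous correlation matrix with the polycor package:
    # hetcor() picks Pearson (numeric/numeric), polyserial (numeric/factor)
    # or polychoric (factor/factor) correlations for each pair of variables.
    library(polycor)
    het <- hetcor(dat, std.err = FALSE)

    # PCA on the resulting correlation matrix
    pca <- princomp(covmat = het$correlations)
    summary(pca)     # variance explained per component
    loadings(pca)    # variable loadings

Note that with only a correlation matrix as input, princomp() returns eigenvalues and loadings but no scores, so the observations would still have to be projected separately if score plots are needed.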

On the stats.stackexchange.com board, different avenues are discussed:

“Although a PCA applied on binary data would yield results comparable to those obtained from a Multiple Correspondence Analysis (factor scores and eigenvalues are linearly related), there are more appropriate techniques to deal with mixed data types, namely Multiple Factor Analysis for mixed data available in the FactoMineR R package. […] The challenge with categorical variables is to find a suitable way to represent distances between variable categories and individuals in the factorial space. To overcome this problem, you can look for a non-linear transformation of each variable — whether it be nominal, ordinal, polynomial, or numerical — with optimal scaling.”

[http://stats.stackexchange.com/questions/5774/can-principal-component-analysis-be-applied-to-datasets-containing-a-mix-of-cont]

On the other hand, at least for some data sets, polychoric PCA does not significantly improve results, as Kolenikov & Angeles (2004) admit:

“The gain from using computationally intensive polychoric correlations in getting the “correct” variable weights may not be very large compared to the PCA on ordinal data. However, only the polychoric analysis gives consistent estimates of the explained proportion. […] The misclassification rates, as well as Spearman correlation of the theoretical and empirical welfare indices, are not substantially different between the ordinal, group means and polychoric versions of PCA, although the difference is statistically significant due to huge sample size of the simulation results data set.”

Additionally, in my data set I have a lot of data below the detection limit (<LOD, coded as zero) and some missing data. The former need to be recoded to better represent an “actual value”, since the true value is not zero. Replacement could be, e.g., 1/2 LOD, a simple and often used but bias-prone approach, or, better, Kaplan-Meier estimates (see D. R. Helsel (2005) More than obvious: Better methods for interpreting nondetect data, Environ Sci Technol 39, 419A–423A). A convenient side effect is that the zero values, which are problematic for the subsequent PCA, are removed.
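
For the <LOD handling, a rough sketch of both options in R, assuming a hypothetical concentration vector conc with non-detects coded as zero and a known detection limit lod (the NADA package implements Helsel’s methods for censored environmental data):

    library(NADA)   # Helsel-style methods for left-censored data

    # (b) simple but bias-prone baseline: substitute 1/2 LOD for non-detects
    conc_b <- ifelse(conc == 0, lod / 2, conc)

    # (c) Kaplan-Meier treatment: work with the censored distribution directly;
    # cenfit() takes the reported value (the LOD for non-detects) plus a
    # logical censoring indicator
    cens <- conc == 0
    km   <- cenfit(ifelse(cens, lod, conc), cens)
    mean(km)       # K-M estimate of the mean
    median(km)     # K-M estimate of the median

The Kaplan-Meier fit gives summary statistics rather than per-sample replacement values; for actually filling in individual non-detects before PCA, Helsel’s regression on order statistics (ros() in the same package) would be one option.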

To conclude, I will try the following for now, each in a basic (b) and a comprehensive (c) variant for comparison:

  • 1/2 LOD replacement for <LOD values (b)
  • Kaplan-Meier estimates for <LOD values (c)
  • Replacement of discrete variables: 0.5 & 1 instead of 0 & 1 to make values finite (b)
  • Employ the polycor R package to treat discrete variables, hopefully giving me finite values (c)

Other avenues to explore…

  • not carrying out PCA at all, but using the FactoMineR package for Multiple Factor Analysis (see the sketch after this list)
  • optimal scaling using the homals package (which does not work right now with the most recently released version of R; it aborts when loading the required rgl package)
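
For the FactoMineR route, a minimal sketch of what I would try first, again assuming a hypothetical data frame dat with factors for the nominal variables and numeric concentration columns. FAMD() (factor analysis of mixed data) handles a flat mix of variable types; MFA() proper is meant for structured groups of variables.

    library(FactoMineR)

    # Factor Analysis of Mixed Data: numeric columns are standardized,
    # factor columns are treated as in multiple correspondence analysis
    res <- FAMD(dat, ncp = 5, graph = FALSE)

    res$eig               # eigenvalues / percentage of variance explained
    head(res$ind$coord)   # observation coordinates (analogous to PCA scores)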

The next issue is that I do not yet know how to validate the results. As a first step, I will certainly carry out a detailed analysis and comparison of the scree, loadings, and scores plots resulting from the approaches described above.
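
As a very first pass, that comparison could be as simple as putting the scree plots from the different runs side by side, e.g. (assuming hypothetical objects pca_b from a prcomp() run on the 1/2-LOD/recoded data and pca_c from princomp() on the polychoric correlation matrix):

    op <- par(mfrow = c(1, 2))
    screeplot(pca_b, type = "lines", main = "basic: 1/2 LOD, recoded")
    screeplot(pca_c, type = "lines", main = "comprehensive: polychoric")
    par(op)

    biplot(pca_b)   # scores & loadings in one plot (not available for the
                    # correlation-matrix-only run, which has no scores)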

New reference manager software

I switched from Bookends to Sente as my reference manager. Not cheap, but an impressive alternative workflow. Most importantly, Sente also supports BibTeX, although automatic key assignment is less than satisfying; it needs manual corrections, or the assigned keys are hard to read and write (e.g., a DOI number rather than “AuthorYear”). So it works fine for shorter papers and new imports.

Bookends: http://www.sonnysoftware.com
Sente: http://www.thirdstreetsoftware.com

New Chemometrics Project

I will be employing multivariate data analysis (mostly PCA & factor analysis) for toxins in floor dust samples and studying their health impact in a birth cohort study of rural children. The data set is a matrix of 228 observations of 379 variables, mostly metabolite concentrations of bacterial and fungal species.

The main challenges are that a lot of the concentrations are below the detection limit and that the data set contains some dichotomous data that cannot easily be transformed for PCA.

I will be experimenting with 1/2-LOD replacements (as a baseline only) and Kaplan-Meier treatment (better!) of censored data (see Helsel, 2005) to replace initial zeros for non-detects with estimates.

Furthermore, I need to study the transformation of my non-numeric data matrix (e.g., employing the Filmer-Pritchett procedure or a polychoric PCA) for proper treatment of my dichotomous variables.
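
A first, rough sketch of the Filmer-Pritchett-style route (ordinary PCA on dummy-coded indicators), assuming a hypothetical data frame dust with the dichotomous variables stored as factors and the metabolite concentrations as numeric columns, non-detects already replaced as described above:

    # expand factors into 0/1 dummy columns and run an ordinary PCA
    X   <- model.matrix(~ ., data = dust)[, -1]   # drop the intercept column
    pca <- prcomp(X, center = TRUE, scale. = TRUE)

    summary(pca)          # variance explained
    index <- pca$x[, 1]   # first-component scores, used as an index in
                          # Filmer & Pritchett's original application

Constant columns would have to be dropped before scaling, and the polychoric alternative would follow the same pattern as sketched earlier, with hetcor() feeding princomp().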