A review article on atmospheric & aerosol chemistry

I have co-authored a publication on the measurement of organics in the atmosphere and snow. My main contributions to the article are the analytical methodology parts, together with the final editing and seeing the paper through the review process.

P.A. Ariya, G. Kos, R. Mortazavi, E.D. Hudson, V. Kanthasamy, N. Eltouny, J. Sun and C. Wilde, Bio-Organic Materials in the Atmosphere and Snow: Measurement and Characterization, in V.F. McNeill, P.A. Ariya (eds.), Atmospheric and Aerosol Chemistry, ISBN 978-3-642-41214-1, Springer, NY (2013). doi:10.1007/128_2013_461.

Results from my mercury work now available in ACP

An article describing my work on the statistical evaluation of mercury transport model estimates and observations on a continental scale is now available in Atmospheric Chemistry and Physics.

AMNet oxidized mercury observation data from North America was used for the evaluation of GEM/RGM and TPM model estimates. A comprehensive uncertainty analysis is presented for measurement and model parameters. Statistical calculations and plots were done in R (except for the visualization of model output).

G. Kos, A. Ryzhkov, A. Dastoor, J. Narayan, A. Steffen, P. A. Ariya, L. Zhang, Evaluation of Discrepancy between Measured and Modeled Oxidized Mercury Species, Atmospheric Chemistry & Physics 13 (2013) 4839-4863, doi:10.5194/acp-13-4839-2013.

Teaching and current events

What I like most about teaching ‘Atmospheric Chemistry’ (this semester again at McGill) is the fact that current examples are always at hand: Discussing biogenic aerosols? A dust storm is never far away! Temperature inversions and increased pollutant concentrations in the PBL? Just look to the South-Western US!

In a nutshell – when talking about atmospheric processes, these examples make the taught material relevant and important for students. And in the best of cases they bring these events to class for further discussion, such as recently during a Sudden Stratospheric Warming event (including the opportunity to discuss and rectify some serious mistakes in the article).

Teaching in the current context at its best (and I have not even talked about research papers that are published every week and warrant a discussion in class!)

What to do with nominal variables in a mixed data set?

… and I keep looking for a definitive answer … (so do not read on, if you expect one!)

I would like to perform PCA on a mixed dataset of nominal and continuous variables. Examples of such variables are gender (m/f; nominal), living on a farm (y/n; nominal) and organic contaminant concentration data (continuous). The nominal variables cannot be ordered.

According to Kolenikov & Angeles the problem is that

“discrete data tend to have high skewness and kurtosis, especially if the majority of the data points are concentrated in a single category”

[Kolenikov & Angeles, 2004, The Use of Discrete Data in PCA: Theory, Simulations, and Applications to Socioeconomic Indices].
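To get a feeling for the size of this effect, here is a minimal sketch for the simplest case, a single dichotomous (0/1) variable. It uses the standard closed-form skewness and excess kurtosis of a Bernoulli variable; the function name `bernoulli_shape` is my own, and Python is used only for illustration (the actual analysis is planned in R).

```python
import math

def bernoulli_shape(p):
    """Skewness and excess kurtosis of a 0/1 variable with P(X = 1) = p."""
    q = 1.0 - p
    skewness = (q - p) / math.sqrt(p * q)
    excess_kurtosis = (1.0 - 6.0 * p * q) / (p * q)
    return skewness, excess_kurtosis

# Compare a balanced variable with increasingly lopsided ones.
for p in (0.5, 0.2, 0.05):
    skew, kurt = bernoulli_shape(p)
    print(f"p = {p:.2f}: skewness = {skew:.2f}, excess kurtosis = {kurt:.2f}")
```

A balanced 50:50 variable has zero skewness, while both measures grow quickly once most observations fall into one category.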

In brief – if using my data as is, since I have lots of data in a single category, I am potentially introducing a bias. Furthermore, in a board post on methodspace.com it has been stated that

“the problem is that when the distributions are far from 50:50 for a dichotomous variable, the correlations are suppressed.”

This is certainly the case for my data. But a first step towards a solution is also offered (reiterating Kolenikov & Angeles’ suggestion):

“You can also go for estimating polychoric/tetrachoric correlations. Here you assume that each binary (or ordered discrete) indicator is a manifestation of underlying continuous variables, and so you estimate the correlations between these underlying variables.”

[http://www.methodspace.com/forum/topics/principal-component-analysis?commentId=2289984%3AComment%3A103267]
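The suppression of correlations can be illustrated with a small simulation. This is only a sketch under my own assumptions (two latent normal variables with a correlation of 0.6, cut either at the median or at the 95th percentile); it does not estimate tetrachoric correlations, it only shows why they are being suggested. The helper `phi_after_split` is a hypothetical name.

```python
import numpy as np

rng = np.random.default_rng(42)
rho = 0.6  # assumed correlation of the underlying continuous variables
cov = [[1.0, rho], [rho, 1.0]]
latent = rng.multivariate_normal([0.0, 0.0], cov, size=200_000)

def phi_after_split(latent, quantile):
    """Pearson correlation of the two variables after dichotomizing each at a quantile."""
    cut = np.quantile(latent, quantile, axis=0)
    binary = (latent > cut).astype(float)
    return np.corrcoef(binary[:, 0], binary[:, 1])[0, 1]

phi_balanced = phi_after_split(latent, 0.50)  # 50:50 split
phi_skewed = phi_after_split(latent, 0.95)    # 95:5 split

print(f"latent rho = {rho}")
print(f"50:50 split: r = {phi_balanced:.3f}")
print(f"95:5 split:  r = {phi_skewed:.3f}")
```

Both dichotomized correlations fall below the latent value, and the lopsided split loses more. A tetrachoric estimate tries to recover the latent correlation instead.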

On the stats.stackexchange.com board, different avenues are discussed:

“Although a PCA applied on binary data would yield results comparable to those obtained from a Multiple Correspondence Analysis (factor scores and eigenvalues are linearly related), there are more appropriate techniques to deal with mixed data types, namely Multiple Factor Analysis for mixed data available in the FactoMineR R package. […] The challenge with categorical variables is to find a suitable way to represent distances between variable categories and individuals in the factorial space. To overcome this problem, you can look for a non-linear transformation of each variable — whether it be nominal, ordinal, polynomial, or numerical — with optimal scaling.”

[http://stats.stackexchange.com/questions/5774/can-principal-component-analysis-be-applied-to-datasets-containing-a-mix-of-cont]

On the other hand, at least for some data sets, polychoric PCA does not significantly improve results, as Kolenikov & Angeles, 2004 admit:

“The gain from using computationally intensive polychoric correlations in getting the “correct” variable weights may not be very large compared to the PCA on ordinal data. However, only the polychoric analysis gives consistent estimates of the explained proportion. […] The misclassification rates, as well as Spearman correlation of the theoretical and empirical welfare indices, are not substantially different between the ordinal, group means and polychoric versions of PCA, although the difference is statistically significant due to huge sample size of the simulation results data set.”

Additionally, my data set contains a lot of data below the detection limit (<LOD, coded as zero) and some missing data. The former need to be recoded to better represent an “actual value”, since the true concentrations are not zero. Replacement could be, e.g., 1/2 LOD as a simple and often-used but bias-prone approach or, better, Kaplan-Meier estimates (see D. R. Helsel (2005) More than obvious: Better methods for interpreting nondetect data, Environ Sci Technol 39, 419A–423A). A convenient side-effect is the removal of (non-finite) zero values as required for PCA.
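For illustration, a minimal Kaplan-Meier sketch for left-censored data could look as follows. It is written in Python rather than R and is not taken from Helsel's paper: the function name `km_left_censored`, the toy numbers and the handling of leftover probability mass (placed at the smallest detection limit, an Efron-style convention) are my own assumptions.

```python
import numpy as np

def km_left_censored(values, censored):
    """Kaplan-Meier estimate for left-censored data (nondetects).

    values:   measured value, or the detection limit for a nondetect
    censored: True where the entry is a nondetect (<LOD)
    Returns the support points, their probability mass and the estimated mean.
    """
    x = np.asarray(values, dtype=float)
    c = np.asarray(censored, dtype=bool)
    detects = np.unique(x[~c])[::-1]  # distinct detected values, largest first
    prob_below = 1.0                  # running estimate of P(X < current value)
    support, mass = [], []
    for t in detects:
        at_risk = np.sum(x <= t)          # detects and detection limits at or below t
        events = np.sum((x == t) & ~c)    # detected values equal to t
        new_prob_below = prob_below * (1.0 - events / at_risk)
        support.append(t)
        mass.append(prob_below - new_prob_below)
        prob_below = new_prob_below
    if prob_below > 0:
        # Assumption: mass left below the smallest detect is placed at the
        # smallest detection limit.
        support.append(x[c].min())
        mass.append(prob_below)
    support, mass = np.array(support), np.array(mass)
    return support, mass, float(np.sum(support * mass))

# Toy data with two detection limits: <1, 1.5, <2, 3, 4
support, mass, mean = km_left_censored(
    [1.0, 1.5, 2.0, 3.0, 4.0], [True, False, True, False, False]
)
print(support, mass, mean)
```

With a single detection limit below all detected values this reduces to substituting the LOD itself, so the approach pays off mainly when there are several detection limits.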

As a conclusion I will be trying the following for now – always in a basic (b) and a comprehensive (c) approach for comparison:

  • 1/2 LOD replacement for <LOD values (b)
  • Kaplan-Meier estimates for <LOD values (c)
  • Replacement of discrete variables: 0.5 & 1 instead of 0 & 1 to make values finite (b)
  • Employ the polycor R package to treat discrete variables, hopefully giving me finite values (c)
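The basic (b) route for the concentration data can be sketched as below. This is an illustration in Python on simulated data, not my actual R workflow: the helper names `half_lod_replace` and `pca_svd`, the detection limits and the log transformation before autoscaling are all assumptions made for the example.

```python
import numpy as np

def half_lod_replace(conc, lod):
    """Replace nondetects (coded as 0) with half the detection limit of their variable."""
    conc = np.asarray(conc, dtype=float)
    lod = np.asarray(lod, dtype=float)
    return np.where(conc == 0, lod / 2.0, conc)

def pca_svd(data):
    """PCA via SVD on autoscaled (mean-centred, unit-variance) columns."""
    z = (data - data.mean(axis=0)) / data.std(axis=0, ddof=1)
    u, s, vt = np.linalg.svd(z, full_matrices=False)
    scores = u * s                     # coordinates of the observations
    loadings = vt.T                    # one column per principal component
    explained = s**2 / np.sum(s**2)    # proportion of variance per component
    return scores, loadings, explained

# Simulated concentrations: 50 observations of 4 variables (made-up numbers).
rng = np.random.default_rng(0)
conc = rng.lognormal(mean=0.0, sigma=1.0, size=(50, 4))
lod = np.array([0.5, 0.5, 0.8, 0.3])   # hypothetical detection limits
conc[conc < lod] = 0.0                  # mimic the "<LOD coded as zero" convention

filled = half_lod_replace(conc, lod)
scores, loadings, explained = pca_svd(np.log(filled))
print(np.round(explained, 3))
```

After the replacement no zeros remain, so the log transformation yields finite values only. The comprehensive (c) route would swap the replacement step for Kaplan-Meier based estimates.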

Other avenues to explore…

  • not to carry out PCA, but use the FactoMineR package for Multiple Factor Analysis
  • optimal scaling using the homals package (which does not work right now in the most recently released version of R; it aborts when loading the required rgl package)

The next issue is that I do not yet know how to validate the results. As a first step, I will certainly carry out a detailed analysis & comparison of the (scree, loadings & scores) plots resulting from the approaches described above.

New reference manager software

I switched from Bookends to Sente as my reference manager. Not cheap, but an impressive alternative workflow. Most importantly, Sente also supports BibTeX, although automatic key assignment is less than satisfying: it needs manual corrections, or the assigned keys are hard to read and type (e.g., a DOI number rather than “AuthorYear”). So it works fine for shorter papers and new imports.

Bookends: http://www.sonnysoftware.com
Sente: http://www.thirdstreetsoftware.com

Some more teaching

In addition to teaching labs at Bishop’s University (this semester Analytical Chemistry and Physical Chemistry), I will be teaching “Atmospheric Chemistry” at McGill University (ATOC/CHEM 219).

Check out the ad: ATOC219 Course Advertisement (and spread the word, if you like!)

New Chemometrics Project

I will be employing multivariate data analysis (mostly PCA & factor analysis) to study toxins in floor dust samples and their health impact, using data from a birth cohort study of rural children. The data set is a matrix of 228 observations of 379 variables, mostly metabolite concentrations of bacterial and fungal species.

The main challenges are that a lot of the concentrations are below the detection limit and that the data set includes some dichotomous data that cannot easily be transformed for PCA.

I will be experimenting with 1/2-LOD replacements (as a baseline only) and Kaplan-Meier treatment (better!) of censored data (see Helsel, 2005) to replace initial zeros for non-detects with estimates.

Furthermore, I need to study the transformation of my non-numeric data matrix (e.g., employing the Filmer-Pritchett procedure or a polychoric PCA) for proper treatment of my dichotomous variables.

Article open for discussion

I have been working hard on a manuscript dealing with the evaluation of uncertainties in measurements and model estimations and the result is now available online for discussion:

G. Kos, A. Ryzhkov, A. Dastoor, J. Narayan, A. Steffen, P. A. Ariya, and L. Zhang, Evaluation of discrepancy between measured and modeled oxidized mercury species, Atmos. Chem. Phys. Discuss., 12, 17245-17293, 2012.

http://www.atmos-chem-phys-discuss.net/12/17245/2012/acpd-12-17245-2012.html

or

http://dx.doi.org/10.5194/acpd-12-17245-2012

Some items to keep from the Course Design workshop…

The Course Design workshop I took at McGill Teaching and Learning Services is done and it was a formidable experience. It was a good reminder of several issues that I first discussed during an earlier workshop in 2008.

Here are a couple of keywords about things to investigate in more detail (again):

I also noticed that a lot of methodology originating from the non-profit sector and professional training is increasingly finding its way into university teaching!