r – Greg's Teaching & Researchblog

New paper in press: 25% of agricultural commodities are contaminated with mycotoxins! Really?

With collaborators in Finland, Austria and the United Kingdom I have been working on an update of the decades-old Food and Agriculture Organization (FAO) estimate that, globally, 25% of agricultural commodities are contaminated with mycotoxins.

The origin of this number is very much unknown — the original report seems to have gone missing; yet the number remains widely cited in the scientific community, in articles, at conferences and in the general news media.

We took a good hard look at recent studies and results, mainly from the Joint FAO/WHO Expert Committee on Food Additives (JECFA), publicly available data from the European Food Safety Authority (EFSA) and proprietary data from the feed additive company Biomin, looking at different threshold values to establish a data-based current estimate.

Results are quite surprising and soon to be published in Annual Review of Food Science and Technology. Publication details and citation to follow once it becomes available.

R workshop at Université Laval

Last week I gave a 2-day R workshop at the Parerea group at INAF at Université Laval. While the first day was roughly modeled on my introduction to R from my Advanced Data Analysis course at Concordia (but with agricultural data rather than meteorological data), I have added new material for the treatment of censored data and allergenic risk assessments using Monte-Carlo simulations.

The latter was also new territory for me. While I had immersed myself in allergenic risk assessments during the past 6 months for PhD student advising, implementing distribution estimations and Monte-Carlo simulations in R was a nice challenge!

The R packages fitdistrplus (for distribution estimates) and mc2d (for Monte-Carlo simulations with fitdistrplus output) were extremely helpful to get the job done!
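As a rough illustration of that workflow, here is a minimal sketch. It is not the workshop material: the data are simulated, the intake range and reference dose are made-up numbers, and base R's rlnorm() and runif() stand in for the mc2d simulation step. Only the fitdist() call comes from fitdistrplus.

```r
# Sketch only: simulated data stand in for real measurements.
library(fitdistrplus)

set.seed(42)
# Hypothetical allergen concentrations (mg/kg), simulated as log-normal
conc <- rlnorm(200, meanlog = 1, sdlog = 0.5)

# Estimate the log-normal parameters by maximum likelihood
fit  <- fitdist(conc, "lnorm")
pars <- fit$estimate  # named vector: meanlog, sdlog

# Monte-Carlo step in base R: draw concentrations and intakes, then combine
n_sim      <- 10000
sim_conc   <- rlnorm(n_sim, meanlog = pars[["meanlog"]], sdlog = pars[["sdlog"]])
sim_intake <- runif(n_sim, min = 0.05, max = 0.15)  # kg per eating occasion (assumed)
sim_dose   <- sim_conc * sim_intake                 # mg allergen per occasion

# Share of simulated occasions above a hypothetical reference dose (mg)
ref_dose <- 0.5
mean(sim_dose > ref_dose)
```

With real data, the choice of distribution would of course need checking first, for example by plotting the fitted object.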

A new Advanced Data Analysis course

Last fall I taught my first grad course (for Masters and PhD students) at Concordia: CHEM 498/610, Advanced Data Analysis.

The course is an introduction to data analysis using R and experimental design using Modde Pro and was given by 2 colleagues and myself. After an introduction to R and practice problems to gain some initial programming experience (Chemistry and Biochemistry students have little exposure to programming), we looked at the preparation of (censored and non-censored) data before tackling topics such as regression, ANOVA, and multivariate classification and regression methods.

Grad students used their own data or chose suitable datasets from the public domain after discussion with the instructors!

EFSA data — coding for analysis using R

Whenever I have a few hours, I work on the data analysis code for the European Food Safety Authority data on mycotoxins that I received a few months ago.

The plyr and dplyr packages (dplyr is part of the tidyverse) have been instrumental in getting an overview of the data, answering questions like “How many samples from country X?” or “How many toxins analysed in sample Y?” Basic boxplots of subsets (e.g., from a time range, a specific country or for a specific toxin) provide a crucial overview before conducting a detailed analysis.
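A minimal sketch of such overview queries with dplyr follows. The data frame and its column names (sample_id, country, toxin, conc) are invented for illustration and do not reflect the actual EFSA data structure.

```r
library(dplyr)

# Hypothetical stand-in for the EFSA table; column names are assumptions
efsa <- data.frame(
  sample_id = c("s1", "s1", "s2", "s3", "s3", "s3"),
  country   = c("AT", "AT", "FI", "UK", "UK", "UK"),
  toxin     = c("DON", "OTA", "DON", "DON", "OTA", "AFB1"),
  conc      = c(120, 1.5, 300, 80, 0.7, 0.2)
)

# How many samples from each country?
samples_per_country <- efsa %>%
  distinct(sample_id, country) %>%
  count(country)

# How many toxins were analysed in each sample?
toxins_per_sample <- efsa %>%
  group_by(sample_id) %>%
  summarise(n_toxins = n_distinct(toxin))

# Boxplot of one subset, here a single toxin split by country
boxplot(conc ~ country, data = filter(efsa, toxin == "DON"))
```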

Food safety data from the European Union

I have successfully requested mycotoxin occurrence data from the European Food Safety Authority (EFSA). Getting the data was a lengthy process, although a surprisingly transparent one, and the legal department of EFSA kept me up to date throughout. Once all member states had approved the release of their data, a CD arrived in the mail a few weeks later with more than 500,000 data sets of regulated mycotoxin concentrations in a variety of raw and processed food matrices.

Together with collaboration partners from Europe, I am now in the process of analysing occurrence data for regulated toxin species. I am setting up a series of analysis scripts in R (using tidyverse packages such as dplyr and ggplot2) to shed light on the contamination of food products with toxins such as Aflatoxins, Deoxynivalenol and Ochratoxin A.

Results from my mercury work now available in ACP

An article describing my work on the statistical evaluation of mercury transport model estimates and observations on a continental scale is now available in Atmospheric Chemistry and Physics.

AMNet oxidized mercury observation data from North America was used for the evaluation of GEM/RGM and TPM model estimates. A comprehensive uncertainty analysis is presented for measurement and model parameters. Statistical calculations and plots were done in R (except for the visualization of model output).

G. Kos, A. Ryzhkov, A. Dastoor, J. Narayan, A. Steffen, P. A. Ariya, L. Zhang, Evaluation of Discrepancy between Measured and Modeled Oxidized Mercury Species, Atmospheric Chemistry & Physics 13 (2013) 4839-4863, doi:10.5194/acp-13-4839-2013.

What to do with nominal variables in a mixed data set?

… and I keep looking for a definitive answer … (so do not read on, if you expect one!)

I would like to perform PCA on a mixed dataset of nominal and continuous variables. Examples for such variables are gender (m/f; nominal), living on a farm (y/n; nominal) and organic contaminant concentration data (continuous). The nominal variables cannot be ordered.

According to Kolenikov & Angeles the problem is that

“discrete data tend to have high skewness and kurtosis, especially if the majority of the data points are concentrated in a single category”

[Kolenikov & Angeles, 2004, The Use of Discrete Data in PCA: Theory, Simulations, and Applications to Socioeconomic Indices].

In brief – if using my data as is, since I have lots of data in a single category, I am potentially introducing a bias. Furthermore, in a board post on methodspace.com it has been stated that

“the problem is that when the distributions are far from 50:50 for a dichotomous variable, the correlations are suppressed.”

This is certainly the case for my data. However, a first step towards a solution is also offered (reiterating Kolenikov & Angeles’ suggestion):

“You can also go for estimating polychoric/tetrachoric correlations. Here you assume that each binary (or ordered discrete) indicator is a manifestation of underlying continuous variables, and so you estimate the correlations between these underlying variables.”

[http://www.methodspace.com/forum/topics/principal-component-analysis?commentId=2289984%3AComment%3A103267]
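The suppression effect described in the quotes above can be illustrated with a small simulation. This is only a sketch with made-up parameters (an underlying correlation of 0.6, and splits at 50:50 and 95:5). It shows the ordinary Pearson correlation of the dichotomised variables, not a polychoric or tetrachoric estimate.

```r
set.seed(1)
n   <- 100000
rho <- 0.6  # correlation of the underlying continuous variables (assumed)

# Two standard normal variables with correlation rho
x <- rnorm(n)
y <- rho * x + sqrt(1 - rho^2) * rnorm(n)

# Dichotomise at the median (50:50) and at the 95th percentile (95:5)
cut_even   <- 0
cut_skewed <- qnorm(0.95)
r_even   <- cor(as.numeric(x > cut_even),   as.numeric(y > cut_even))
r_skewed <- cor(as.numeric(x > cut_skewed), as.numeric(y > cut_skewed))

# Compare the correlation of the continuous variables with both binary versions
round(c(continuous = cor(x, y), even = r_even, skewed = r_skewed), 2)
```

In this setup the correlation between the binary versions is lower than the underlying one, and lower still for the skewed split. That is the motivation for estimating the correlation of the underlying variables instead.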

On the stats.stackexchange.com board, different avenues are discussed:

“Although a PCA applied on binary data would yield results comparable to those obtained from a Multiple Correspondence Analysis (factor scores and eigenvalues are linearly related), there are more appropriate techniques to deal with mixed data types, namely Multiple Factor Analysis for mixed data available in the FactoMineR R package. […] The challenge with categorical variables is to find a suitable way to represent distances between variable categories and individuals in the factorial space. To overcome this problem, you can look for a non-linear transformation of each variable — whether it be nominal, ordinal, polynomial, or numerical — with optimal scaling.”

[http://stats.stackexchange.com/questions/5774/can-principal-component-analysis-be-applied-to-datasets-containing-a-mix-of-cont]

On the other hand, at least for some data sets, polychoric PCA does not significantly improve results, as Kolenikov & Angeles, 2004 admit:

“The gain from using computationally intensive polychoric correlations in getting the “correct” variable weights may not be very large compared to the PCA on ordinal data. However, only the polychoric analysis gives consistent estimates of the explained proportion. […] The misclassification rates, as well as Spearman correlation of the theoretical and empirical welfare indices, are not substantially different between the ordinal, group means and polychoric versions of PCA, although the difference is statistically significant due to huge sample size of the simulation results data set.”

Additionally, in my data set I have a lot of data below the detection limit (<LOD, coded as zero) and some missing data. The former need to be recoded to better represent an “actual value”, since it is not zero. The replacement could be, e.g., 1/2 LOD, a simple and often used but bias-prone approach, or, better, Kaplan-Meier estimates (see D. R. Helsel (2005) More than obvious: Better methods for interpreting nondetect data, Environ Sci Technol 39, 419A–423A). A convenient side-effect is the removal of (non-finite) zero values as required for PCA.

As a conclusion I will be trying the following for now – always in a basic (b) and a comprehensive (c) approach for comparison:

1/2 LOD replacement for <LOD values (b)
Kaplan-Meier estimates for <LOD values (c)
Replacement of discrete variables: 0.5 & 1 instead of 0 & 1 to make values finite (b)
Employ the polycor R package to treat discrete variables, hopefully giving me finite values (c)
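The first item in the list, 1/2 LOD replacement, could look like the following sketch. The LOD value and the concentrations are invented, and the log transformation at the end is my assumption about why zeros would count as “non-finite”; it is shown only to illustrate the point.

```r
# Hypothetical concentrations; values below the LOD were coded as zero
lod  <- 0.5
conc <- c(0, 1.2, 0, 3.4, 0.8, NA, 0)

# Replace <LOD zeros with LOD/2; missing values (NA) stay missing
conc_sub <- ifelse(!is.na(conc) & conc == 0, lod / 2, conc)

# Assumption: if the data are log-transformed before PCA,
# the substituted values no longer produce -Inf
log_conc <- log(conc_sub)
```

Missing values are left untouched here and would still need separate treatment before PCA.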

Other avenues to explore…

not to carry out PCA, but use the FactoMineR package for Multiple Factor Analysis
optimal scaling using the homals package (which does not work right now in the most recently released version of R; it aborts when loading the required rgl package)

The next issue is that I do not yet know how to validate the results. First, I will certainly carry out a detailed analysis and comparison of the (scree, loadings & scores) plots resulting from the approaches described above.

Subsetting made easy

Calculating characteristics such as median, mean,… of a subset of data is quite straightforward in R:

For a data set containing results from several “models”, a subset for the model “base” is created by

`base_only <- subset(all, all$model == "base")`

Then, the median of the variable “data” can be computed via

`median(base_only$data)`
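For a self-contained run, here is a small sketch with a toy data frame. The names all, model and data follow the snippets above; the values are invented. The tapply() call at the end is an addition that returns the median for every model in one step.

```r
# Toy data frame; the names all, model and data follow the snippets above
all <- data.frame(
  model = c("base", "base", "base", "test", "test"),
  data  = c(1, 3, 2, 10, 20)
)

# Subset for one model, then its median
base_only   <- subset(all, model == "base")
base_median <- median(base_only$data)

# Medians for every model in one call
medians <- tapply(all$data, all$model, median)
medians
```

Inside subset() the column can be referred to simply as model, so the all$ prefix is optional.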

Fixed: Unable to plot a decent x-Axis in a time series plot using zoo

Here is the link to the original problem. Briefly, I was unable to plot a custom x-axis showing abbreviated months in a time series plot of a zoo object.

In the plot.zoo() function set

`xaxt = "n"`

to suppress plotting of the x-axis

Since I needed abbreviated months to be plotted as the x-axis, I loaded a csv file with the dates of the first of each month as follows and converted it to dates:

“month”
2004-12-01
2005-01-01
2005-02-01
…

(I included Dec 2004, because otherwise “Jan” would not be plotted)

`month <- read.csv("~/R/2005months.csv")`

`month$month <- as.Date(month$month, "%Y-%m-%d")`

After the plot.zoo command I added the following line –

`axis(1, month$month, format(month$month, "%b"))`

The solution was inspired by a post from Gabor Grothendieck on r-help. The original solution is still the only way to plot abbreviated months for time series plots of monthly averages.
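Put together, the steps above might look like the following sketch. The time series is random toy data, and the month starts are generated with seq() instead of being read from the csv file, so the object names z and month_starts are mine.

```r
library(zoo)

set.seed(1)
# Hypothetical monthly averages for 2005
dates <- seq(as.Date("2005-01-01"), by = "month", length.out = 12)
z     <- zoo(rnorm(12, mean = 10), dates)

# First of each month, starting in Dec 2004 as described above
month_starts <- seq(as.Date("2004-12-01"), by = "month", length.out = 13)

# Suppress the default x-axis, then draw one with abbreviated month names
plot(z, xaxt = "n", xlab = "2005", ylab = "Monthly average")
axis(1, at = month_starts, labels = format(month_starts, "%b"))
```

The month abbreviations produced by format(..., "%b") depend on the locale of the R session.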

Boxplots without boxes

Let’s say you have several categories with multiple data points each that you would like to plot as individual points. Even if you have only a single point, the R graphics package will plot a line (without a box for lack of data). Overriding the default setting with e.g. pch = 1 does not help.

R’s boxplot function (or the plot function for that matter) – correctly – generates a boxplot for each category. If you would like to see individual points instead of boxes, the following code snippet could help by using the points function:

# Plot the boxplot as usual, but in white colour to make the boxes invisible, but keep the axes.
boxplot(m$var ~ m$cat, xlab = "Category", ylab = "Variable", border = "white")

# Plot the data again into the same plot and customise the point shape, etc to your liking
points(m$var ~ m$cat, pch = 1)

Voila!
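An alternative worth trying is stripchart() from base graphics, which draws one column of points per category directly. This is only a sketch with toy data; the names m, var and cat follow the code above, and the jitter setting is my choice.

```r
set.seed(1)
# Toy data; m, var and cat follow the names used above
m <- data.frame(
  cat = factor(rep(c("A", "B", "C"), times = c(5, 1, 4))),
  var = rnorm(10)
)

# One column of points per category, no boxes;
# jitter separates overlapping points horizontally
stripchart(var ~ cat, data = m, vertical = TRUE, method = "jitter",
           pch = 1, xlab = "Category", ylab = "Variable")
```

Category B has a single data point here and is drawn as one point rather than a line.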