EFSA data — coding for analysis using R

Whenever I have a few hours, I work on the data analysis code for the European Food Safety Authority data on mycotoxins that I received a few months ago.

The plyr and dplyr packages (dplyr is part of the tidyverse) have been instrumental in getting an overview of the data, answering questions like “How many samples from country X?” or “How many toxins analysed in sample Y?” Basic boxplots of subsets (e.g., from a time range, a specific country or a specific toxin) provide a crucial overview before conducting a detailed analysis.
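As a rough illustration of this kind of overview query, here is a minimal sketch using dplyr. The data frame `myco` and its columns (`sample_id`, `country`, `toxin`, `conc`) are made up for this example and do not reflect the actual structure of the EFSA data.

```r
library(dplyr)

# When run non-interactively, draw to a null device instead of writing Rplots.pdf
if (!interactive()) pdf(NULL)

# Hypothetical toy data; column names are assumptions, not the EFSA format
myco <- data.frame(
  sample_id = c("S1", "S1", "S2", "S3", "S3", "S3"),
  country   = c("AT", "AT", "DE", "DE", "DE", "DE"),
  toxin     = c("DON", "OTA", "DON", "DON", "OTA", "AFB1"),
  conc      = c(120, 0.8, 310, 95, 1.2, 0.3)
)

# How many samples from each country?
samples_per_country <- myco %>%
  group_by(country) %>%
  summarise(n_samples = n_distinct(sample_id))

# How many toxins were analysed in each sample?
toxins_per_sample <- myco %>%
  count(sample_id, name = "n_toxins")

# Quick boxplot of one subset, here a single toxin by country
don <- filter(myco, toxin == "DON")
boxplot(conc ~ country, data = don,
        xlab = "Country", ylab = "DON concentration")
```

Printing `samples_per_country` and `toxins_per_sample` gives the two overview tables; the same pattern extends to time ranges by filtering on a date column.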

Food safety data from the European Union

I have successfully requested mycotoxin occurrence data from the European Food Safety Authority (EFSA). Getting the data was a lengthy process, but a surprisingly transparent one, and the legal department of EFSA kept me up to date throughout. Once all member states had approved the release of their data, a CD arrived in the mail a few weeks later with more than 500,000 data sets of regulated mycotoxin concentrations in a variety of raw and processed food matrices.

Together with collaboration partners from Europe, I am now in the process of analysing occurrence data for regulated toxin species. I am setting up a series of analysis scripts in R (using tidyverse packages such as dplyr and ggplot2) to shed light on the contamination of food products with toxins such as Aflatoxins, Deoxynivalenol and Ochratoxin A.

Results from my mercury work now available in ACP

An article describing my work on the statistical evaluation of mercury transport model estimates and observations on a continental scale is now available in Atmospheric Chemistry and Physics.

AMNet oxidized mercury observation data from North America was used for the evaluation of GEM/RGM and TPM model estimates. A comprehensive uncertainty analysis is presented for measurement and model parameters. Statistical calculations and plots were done in R (except for the visualization of model output).

G. Kos, A. Ryzhkov, A. Dastoor, J. Narayan, A. Steffen, P. A. Ariya, L. Zhang, Evaluation of Discrepancy between Measured and Modeled Oxidized Mercury Species, Atmospheric Chemistry & Physics 13 (2013) 4839-4863, doi:10.5194/acp-13-4839-2013.

What to do with nominal variables in a mixed data set?

… and I keep looking for a definitive answer … (so do not read on, if you expect one!)

I would like to perform PCA on a mixed dataset of nominal and continuous variables. Examples of such variables are gender (m/f; nominal), living on a farm (y/n; nominal) and organic contaminant concentration data (continuous). The nominal variables cannot be ordered.

According to Kolenikov & Angeles the problem is that

“discrete data tend to have high skewness and kurtosis, especially if the majority of the data points are concentrated in a single category”

[Kolenikov & Angeles, 2004, The Use of Discrete Data in PCA: Theory, Simulations, and Applications to Socioeconomic Indices].

In brief – if using my data as is, since I have lots of data in a single category, I am potentially introducing a bias. Furthermore, in a board post on methodspace.com it has been stated that

“the problem is that when the distributions are far from 50:50 for a dichotomous variable, the correlations are suppressed.”

This is certainly the case for my data. However, the same post also offers a first step towards a solution (reiterating Kolenikov & Angeles’ suggestion):

“You can also go for estimating polychoric/tetrachoric correlations. Here you assume that each binary (or ordered discrete) indicator is a manifestation of underlying continuous variables, and so you estimate the correlations between these underlying variables.”

[http://www.methodspace.com/forum/topics/principal-component-analysis?commentId=2289984%3AComment%3A103267]

On the stats.stackexchange.com board, different avenues are discussed:

“Although a PCA applied on binary data would yield results comparable to those obtained from a Multiple Correspondence Analysis (factor scores and eigenvalues are linearly related), there are more appropriate techniques to deal with mixed data types, namely Multiple Factor Analysis for mixed data available in the FactoMineR R package. […] The challenge with categorical variables is to find a suitable way to represent distances between variable categories and individuals in the factorial space. To overcome this problem, you can look for a non-linear transformation of each variable — whether it be nominal, ordinal, polynomial, or numerical — with optimal scaling.”

[http://stats.stackexchange.com/questions/5774/can-principal-component-analysis-be-applied-to-datasets-containing-a-mix-of-cont]

On the other hand, at least for some data sets, polychoric PCA does not significantly improve results, as Kolenikov & Angeles, 2004 admit:

“The gain from using computationally intensive polychoric correlations in getting the “correct” variable weights may not be very large compared to the PCA on ordinal data. However, only the polychoric analysis gives consistent estimates of the explained proportion. […] The misclassification rates, as well as Spearman correlation of the theoretical and empirical welfare indices, are not substantially different between the ordinal, group means and polychoric versions of PCA, although the difference is statistically significant due to huge sample size of the simulation results data set.”

Additionally, in my data set I have a lot of data below the detection limit (<LOD, coded as zero) and some missing data. The former need to be recoded to better represent an “actual value”, since the true concentration is not zero. One option is replacement by 1/2 LOD, a simple and often used but bias-prone approach; a better one is Kaplan-Meier estimates (see D. R. Helsel (2005) More than obvious: Better methods for interpreting nondetect data, Environ Sci Technol 39, 419A–423A). A convenient side-effect is the removal of (non-finite) zero values as required for PCA.
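A minimal sketch of the simple 1/2 LOD substitution, assuming nondetects are coded as zero as described. The concentrations and the LOD value are invented for illustration, and the Kaplan-Meier approach is not shown here.

```r
# Hypothetical concentrations; 0 codes a value below the detection limit
conc <- c(0, 0.42, 0, 1.30, 0.08, NA)
lod  <- 0.05   # assumed detection limit, same unit as conc

# Replace <LOD values (coded as 0) by LOD/2; missing values stay NA
conc_sub <- ifelse(!is.na(conc) & conc == 0, lod / 2, conc)

# After substitution a log transform no longer produces -Inf
log_conc <- log10(conc_sub)
```

In a real data set the LOD usually differs between toxins and laboratories, so `lod` would be a vector of the same length as `conc` rather than a single number.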

In conclusion, I will be trying the following for now, always in a basic (b) and a comprehensive (c) approach for comparison:

  • 1/2 LOD replacement for <LOD values (b)
  • Kaplan-Meier estimates for <LOD values (c)
  • Replacement of discrete variables: 0.5 & 1 instead of 0 & 1 to make values finite (b)
  • Employ the polycor R package to treat discrete variables, hopefully giving me finite values (c)
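For the comprehensive route, here is a sketch of how the polycor package could be combined with a PCA: `hetcor()` estimates Pearson, polyserial or polychoric correlations depending on the column types, and the resulting matrix is then decomposed with `eigen()`. The simulated variables (`gender`, `farm`, `logconc`) are placeholders, not my data. Note that these correlations assume a continuous variable underlying each dichotomous indicator, which is a modelling assumption for truly nominal data.

```r
library(polycor)

set.seed(1)
n <- 300

# Simulated latent variable so that the toy variables are correlated
latent <- rnorm(n)
d <- data.frame(
  gender  = factor(ifelse(latent + rnorm(n) > 0, "m", "f")),
  farm    = factor(ifelse(latent + rnorm(n) > 1, "y", "n")),
  logconc = latent + rnorm(n)   # stands in for a log concentration
)

# Heterogeneous correlation matrix: polychoric (tetrachoric) for factor pairs,
# polyserial for factor/numeric pairs, Pearson for numeric pairs
het <- hetcor(d)

# PCA on the estimated correlation matrix
pca <- eigen(het$correlations)
explained <- pca$values / sum(pca$values)
```

`pca$vectors` holds the loadings and `explained` the proportion of variance per component, which can be compared with a plain `prcomp()` run on the 0/1-coded data.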

Other avenues to explore…

  • not to carry out PCA, but use the FactoMineR package for Multiple Factor Analysis
  • optimal scaling using the homals package (which does not work right now in the most recently released version of R; it aborts when loading the required rgl package)

The next issue is that I do not yet know how to validate the results. As a first step, I will certainly carry out a detailed analysis and comparison of the scree, loadings and scores plots resulting from the approaches described above.

Fixed: Unable to plot a decent x-Axis in a time series plot using zoo

Here is the link to the original problem. Briefly, I was unable to plot a custom x-axis showing abbreviated months in a time series plot of a zoo object.

In the plot.zoo() function set

xaxt = "n"

to suppress plotting of the x-axis

Since I needed abbreviated months to be plotted as the x-axis, I loaded a csv file with the dates of the first of each month as follows and converted it to dates:

"month"
2004-12-01
2005-01-01
2005-02-01

(I included Dec 2004, because otherwise “Jan” would not be plotted)

month <- read.csv("~/R/2005months.csv")
month$month <- as.Date(month$month, "%Y-%m-%d")

After the plot.zoo command I added the following line –

axis(1, month$month, format(month$month, "%b"))

The solution was inspired by a post from Gabor Grothendieck on r-help. The original solution is still the only way I have found to plot abbreviated months for time series plots of monthly averages.
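The CSV file is not strictly necessary: the first-of-month dates can also be generated with `seq()`. The sketch below uses a plain `plot()` on made-up daily data so that it runs without extra packages; the assumption is that the same `axis()` call works unchanged after `plot.zoo(..., xaxt = "n")`.

```r
# When run non-interactively, draw to a null device instead of writing Rplots.pdf
if (!interactive()) pdf(NULL)

# Made-up daily series for 2005
days <- seq(as.Date("2005-01-01"), as.Date("2005-12-31"), by = "day")
set.seed(42)
y <- cumsum(rnorm(length(days)))

# First day of each month, starting in Dec 2004 as in the CSV file
month_start <- seq(as.Date("2004-12-01"), by = "month", length.out = 13)

# Suppress the default x-axis, then draw abbreviated month names
plot(days, y, type = "l", xaxt = "n", xlab = "2005", ylab = "Value")
axis(1, at = month_start, labels = format(month_start, "%b"))
```

The month abbreviations produced by `format(..., "%b")` follow the current locale, so they may not be in English on every system.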

Boxplots without boxes

Let’s say you have several categories with multiple data points each that you would like to plot as individual points. Even if you have only a single point, the R graphics package will plot a line (without a box for lack of data). Overriding the default setting with e.g. pch = 1 does not help.

R’s boxplot function (or the plot function for that matter) – correctly – generates a boxplot for each category. If you would like to see individual points instead of boxes, the following code snippet could help by using the points function:

# Plot the boxplot as usual, but in white colour to make the boxes invisible, but keep the axes.

boxplot(m$var ~ m$cat, xlab = "Category", ylab = "Variable", border = "white")

# Plot the data again into the same plot and customise the point shape, etc to your liking

points(m$var ~ m$cat, pch = 1)

Voila!
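As an alternative to overplotting an invisible boxplot, base R's `stripchart()` draws the individual points per category directly. A small sketch with invented data (`m`, `var` and `cat` mirror the names used above):

```r
# When run non-interactively, draw to a null device instead of writing Rplots.pdf
if (!interactive()) pdf(NULL)

# Invented example data; category "C" has only a single point
m <- data.frame(
  cat = factor(c("A", "A", "A", "B", "B", "C")),
  var = c(1.2, 1.9, 1.5, 3.1, 2.7, 2.0)
)

# One column of points per category; jitter separates overlapping values
stripchart(var ~ cat, data = m, vertical = TRUE, method = "jitter",
           pch = 1, xlab = "Category", ylab = "Variable")
```

The white-boxplot trick keeps the boxplot axes and spacing exactly as they are, so it remains handy when points are to be combined with boxes for some categories later on.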

More fun with boxplots

Here are a few more plotting options for boxplots:

Let’s start plotting the full set
plot(b$mod, b$x)

Plot labels for a subset in full set plot (label all points x < -1)
text(subset(b$mod, b$x < -1), subset(b$x, b$x < -1), subset(b$site, b$x < -1), cex=0.6, pos=4, col="red")

Plot subset with x > -1
plot(subset(b$mod, b$x > -1), subset(b$x, b$x > -1))

Plot horizontal gridlines
grid(nx = NA, ny = NULL)

Converting vectors to numeric in mixed-type dataframe

Combining character and numeric vectors with cbind() produces a character matrix, so data.frame() then stores every column as a factor (or, from R 4.0.0 on, as character):

all <- data.frame(cbind(site, year, model, x, y, z))

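The following converts selected variables from “factor” back to “numeric” (going through as.character() so that the values, not the factor level codes, are recovered)
all$x <- as.numeric(as.character(all$x))
all$y <- as.numeric(as.character(all$y))
all$z <- as.numeric(as.character(all$z))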
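One pitfall when converting factors: `as.numeric()` applied directly to a factor returns the internal level codes rather than the original values, so a detour via `as.character()` is needed. A short sketch with made-up vectors; it also shows that building the data frame without `cbind()` avoids the problem altogether:

```r
# Made-up vectors
site <- c("a", "b", "c")
x    <- c(10.5, 2, 33)

# A factor holding numbers as labels, as produced by data.frame(cbind(...))
# in R versions before 4.0.0
f <- factor(as.character(x))

codes  <- as.numeric(f)                 # level codes, not the values
values <- as.numeric(as.character(f))   # the original values

# Passing the vectors to data.frame() directly keeps x numeric
all2 <- data.frame(site, x)
is.numeric(all2$x)
```

Comparing `codes` and `values` makes the difference visible: the former only numbers the factor levels in their sorted order.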