Exercise 3: Textbook case

It is highly recommended that you go through the first two exercises before starting this one.

The self-organizing map can be used for clustered datasets, in which the data points can be divided into clear, non-overlapping categories. This is the typical textbook case of classification but, unfortunately, such situations are rare in the study of diabetic complications. Nevertheless, it is important to recognize whether a dataset has a clearly defined intrinsic structure.

The goal of this exercise is to demonstrate how the Melikerion software can be used in the description of the data space and how to identify possible errors and peculiarities in the dataset.

Task 1: View the material

Simulated data is useful for technical demonstrations, since the author knows the "true" phenomenon beforehand and can manipulate the dataset to create specific effects for data analysis. Here, a dataset with clearly defined clusters of samples is created. A few erroneous samples have also been added to make the exercise more instructive.
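As an illustration of the idea, a dataset of this kind could be simulated roughly as follows. This is only a sketch and not the procedure used to create the course data: the cluster centers, the spread, the number of samples and the way errors are introduced (multiplying one value by 100) are all assumptions made for the example.

```python
import random

def simulate_clusters(centers, n_per_cluster, spread, n_errors, seed=0):
    """Return (rows, labels): Gaussian clusters plus a few gross errors."""
    rng = random.Random(seed)
    rows, labels = [], []
    for label, center in enumerate(centers):
        for _ in range(n_per_cluster):
            # Each variable is drawn around the cluster center.
            rows.append([rng.gauss(mu, spread) for mu in center])
            labels.append(label)
    # Corrupt a few samples by inflating one variable (hypothetical error model,
    # mimicking for instance a unit or typing mistake).
    for i in rng.sample(range(len(rows)), n_errors):
        j = rng.randrange(len(rows[i]))
        rows[i][j] *= 100.0
        labels[i] = -1  # -1 marks a deliberately erroneous sample
    return rows, labels

# Three well-separated hypothetical clusters in three variables.
centers = [[1.0, 1.0, 1.0], [5.0, 5.0, 1.0], [1.0, 5.0, 5.0]]
rows, labels = simulate_clusters(centers, n_per_cluster=50, spread=0.3, n_errors=5)
```

Because the labels are known, the author of such a dataset can later check whether an analysis recovers the clusters and flags the corrupted samples.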

Download data
Download config/info

Task 2: Create self-organizing map

Submit the configuration and data files to the online system and follow the links until the job is finished. When the results are ready, you should see a collection of map colorings and other images.

Find the image entitled 'qerrors'. You should see a histogram and a few highlighted values on the right. The histogram depicts the distribution of the difference between the profile "predicted" by the SOM and the profile actually observed for a sample. Put differently, for every sample there is a numeric value that tells how accurately the SOM can describe it. A poor description suggests that there is something peculiar about the sample, which may indicate, for instance, a mistake in data collection.
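The per-sample error can be sketched in a few lines. In the SOM literature it is usually called the quantization error and is taken as the distance between a sample and its closest map unit. The example below assumes plain Euclidean distance and uses made-up prototype vectors; Melikerion may compute the value differently (for example with scaling or special handling of missing values).

```python
import math

def quantization_error(sample, prototypes):
    """Distance from a sample to its best-matching unit (closest prototype)."""
    return min(math.dist(sample, p) for p in prototypes)

# Hypothetical prototypes of a trained map and three hypothetical samples.
prototypes = [[0.0, 0.0], [1.0, 1.0], [4.0, 4.0]]
samples = {"s1": [0.1, 0.0], "s2": [1.0, 1.2], "s3": [9.0, 0.0]}

qerrors = {name: quantization_error(x, prototypes) for name, x in samples.items()}
worst = max(qerrors, key=qerrors.get)  # the sample the map describes least well
```

Here the third sample lies far from every prototype, so it receives the largest error and would appear in the right-hand tail of the histogram.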

Go to upload form

After you have finished looking at the colorings and the histograms, please download the entire result archive onto your desktop (link on the right) for a more detailed inspection.

Task 3: Find errors

Download the ZIP-archive onto your desktop, rename it to 'clusters_results1' and view its contents. There are, in fact, several files named 'qerrors'. Choose the one compatible with Excel (screenshot).

You should now see the so-called quantization errors for each sample as a spreadsheet. Sort the data by the QERROR column to bring the samples with the largest errors to the top (focus on the worst five). You can now go back to the original data file 'clusters.xls' to check whether these particular samples have, for instance, strange measurement values.
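The same sorting step can also be done outside Excel. The sketch below assumes that the error table is tab-delimited text with a header row; the QERROR column name comes from the exercise, whereas the ID column name and all the values are invented for the example.

```python
import csv
import io

# Hypothetical contents of the error table (tab-delimited, header row).
text = "\n".join([
    "ID\tQERROR",
    "s01\t0.12", "s02\t0.45", "s03\t3.10", "s04\t0.08",
    "s05\t2.75", "s06\t0.33", "s07\t1.90",
])

def worst_samples(handle, n=5, column="QERROR"):
    """Return the n rows with the largest value in the given column."""
    reader = csv.DictReader(handle, delimiter="\t")
    ranked = sorted(reader, key=lambda row: float(row[column]), reverse=True)
    return ranked[:n]

worst_five = worst_samples(io.StringIO(text))
worst_ids = [row["ID"] for row in worst_five]
```

For a real file, the in-memory text would be replaced by an opened file handle. The resulting identifiers can then be looked up in the original data file.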

Once you have located the suspicious data rows, you can remove them. If the values nevertheless seem to be valid, the sample may simply have an atypical profile. A real-world equivalent could be a patient with a rare hereditary form of diabetes (MODY) who has an overall uncharacteristic metabolic profile but who, judged by blood glucose alone, may be classified as having type 1 or type 2 diabetes.

Task 4: Re-submit data

Save the updated spreadsheet and submit it to the Melikerion tool. Compare the results with the first analysis.

Go to upload form

Questions

  1. The regional distribution of samples is saved in the file 'ubook.png'. Can you see separate groups on the map?
  2. Based on the previous question, what would be the model phenotypes that could characterize the dataset? The files 'proto.png' and 'protolegend.png' might be helpful in this regard.
  3. Melikerion includes principal component analysis (PCA). The PC scores are saved in the file 'pcscores.xls'. Use the plotting functions in Excel to view the scatter plots for PC1 vs. PC2, PC1 vs. PC3 and PC2 vs. PC3 (screenshot). Can you see any groupings of samples?
  4. Repeat the above questions for both the original results with the outliers and the latter analysis of the cleaned dataset. Do the outliers have a significant impact?


Updated 2009-11-27 by vpmakine.