Scientific environment

The work presented in this thesis was carried out at the Faculty of Mathematics and Natural Sciences of the University of Bergen (UiB), at the Department of Biology (Marine Microbiology Group) and Centre for Geomicrobiology, as well as the Computational Biology Unit (Jonassen Group) at the UniComputing department of Uni Research, a non-profit research company affiliated with UiB. The project was funded through a PhD grant from the University of Bergen, and additional funding for sequencing, laboratory, field and travel expenses was provided by a scholarship from L. Meltzers Høyskolefond.

My contributions to the work outlined in papers III and IV (AmpliconNoise) were possible through a long-standing close collaboration between the group of Professor Lise Øvreås at UiB, and Dr. Christopher Quince and Prof. William Sloan at the University of Glasgow. The work outlined in paper V was possible through a research collaboration with the University of Addis Ababa, funded through NUFU (the Norwegian Agency for Development Cooperation).
Acknowledgements

There is a whole legion of people to whom I am grateful, and without them this thesis would not be what it is.

Lise Øvreås, you have been a fantastic main supervisor. Of course, there were times when I did not see that so clearly, like when being forced to do actual wet labbing with almost no experience (at least for eight years). Then, you decided to leave the continent when I needed you the most to write this thesis. But, "no pain, no gain": without this, I would have gone nine years without touching a pipette and I would not have visited Berkeley. I would also like to thank you and your family for opening your Californian house to me and to Agur.

I was also fortunate enough to have three great co-supervisors, Inge Jonassen, Tim Urich and Pål Puntervoll. Inge, I owe a lot to you, mainly that I managed to keep one foot in bioinformatics. Your supervision and expertise have really provided a unique complement, and you have always shown great interest and patience in applying this to exotic problems of microbiology that I may not always have understood or explained very well. Tim, thank you for the third dimension of supervision, friendship and incredible patience, and for taking time to analyse and discuss the most tiny but important issues in great detail. Also, it is thanks to you that I opened my eyes to microbial ecology in the first place. Pål, I am grateful to you for introducing me to a world of new concepts, languages and tools during my time in the Bioinformatics Service Group. This experience was essential for my PhD (incidentally, the thesis was written in LyX, for example).

Gratitude also goes to all my colleagues in the Marine Microbiology group and others at the Department of Biology, the Centre for Geobiology and Uni Computing. You have provided a very rich working environment, with diverse knowledge in everything from supercomputers to deep-sea vents and microbial metabolism.
Thanks especially to Mia Bengtsson and Steffen Jørgensen, for our collaborations and for countless, endless discussions. Science would have lost some of its magic without either of you. Mia, like you wrote in my copy of your thesis: "tack som fan! <3". And Steffen, I was going to make a joke about a Christmas party but I have to save something for a speech. Special thanks also to Antonio García Moyano for your expertise and support, to Dominika Chmolowska for being a dedicated and knowledgeable co-worker in the lab, and to all my other co-workers and co-authors: Svenn Helge Grindhaug, Susanne Balzer, António Pagarete, Vigdis Torsvik, Hallgerd Eydal, Addis Simachew, Ingrid Mørkeseth, Baye Sitotaw, Amare Gessesse, Yemisirach Mulugeta, Runar Stokke, Håkon Dahle, Ida Steen, Irene Roalkvam, Christa Schleper, Ramiro Logares, Eva Lindström, Nathalie Reuter, Kjell Petersen, Kidane Tekla, Pawel Stormwasser, Siv Midtun Hollup and all the members of Inge Jonassen's group (especially Animesh and Matus for laughs, support and philosophical insights). Thanks to Torbjørn Lium and Saerdar Halifu for fantastic 24/7 tech and HPC support (and crazy out-of-work adventures).

Thanks to this thesis project and my supervisors, I had the privilege to visit, work with and get to know some exceptional scientists in Glasgow and Newcastle. I am especially grateful to Chris Quince, Bill Sloan and Tom Curtis for our collaborations and all I have learnt from you. In addition to being a great friend, Chris has arguably acted as an extra, unofficial supervisor. A not insignificant portion of our work was carried out in various pubs around the world, making it yet more enjoyable.

Another special thanks to all past and present members of the "international lunch table" for fantastic company at work and after: Øystein, Eric, Paolo, Jim, Anne-Laure, Paco, Cecile, Nico, Mari, David, Sofia, Sam, Sara, Cindy, Ana, Fabian, Bea, Becky, Laurent, Mahaut, Valentina. There are so many that I cannot list you all, but I have not forgotten. Your everyday support and friendship have been extremely important, and helped to carry me through (without doubt). So did my Swedish friends, helping me relax and gain perspective during my Stockholm visits and always interested in what it really was I was working with ("cod DNA?").
Everyone in my family, back in Sweden: you have also meant a lot for this thesis becoming reality, supporting me and showing interest in my work. Thanks to my parents, for taking care of me in Sweden and for telling me to relax when I needed to hear it. And to my beloved grandmother Hillevi, no longer with us, for wise words.

Finally Agur, thank you for everything, for your constant and heartfelt support, and for incredible patience. Also thanks for proofreading this thesis, and for support with rehearsals of presentations, mathematical problems and R. But, most importantly, thanks for making the last three years the best ones imaginable (actually much better). Although a tough measure, moving from Bergen in advance also provided a final push to finish up quickly, in order to rejoin you in the Basque Country.
Contents

    2.2.2 Using Operational Taxonomic Units (OTUs) as proxies for microbial species
    2.2.3 Diversity estimates, comparison and extrapolation of richness
    2.2.4 Comparison of community composition across datasets
  2.3 Sources of random and systematic errors, and methods for compensation
    2.3.1 Sample handling, nucleic acid extraction and reverse transcription
    2.3.2 PCR amplification bias and random drift
    2.3.3 Chimeras, misincorporations and other PCR artefacts
    2.3.4 Detection and removal of chimeric sequences
    2.3.5 Noise, artefacts and compensation in pyrosequencing and Ion Torrent data
3 Research questions
4 Discussion
  4.1 Taxonomic classification of SSU rRNA sequence data
  4.2 Bias and reproducibility of SSU rRNA targeted pyrosequencing
  4.3 Dealing with sequence noise and determination of microbial diversity
  4.4 Community structure in environmental datasets
  4.5 Complementarity of environmental genomics approaches
5 Conclusions and future perspectives
Bibliography

II Scientific results
  Paper I
  Paper II
  Paper III
  Paper IV
  Paper V
Summary

Most life on this planet is microbial, and for the last two decades environmental genomics has helped to reveal an impressive biodiversity of this microbial life. This approach applies DNA sequencing to environmental samples, with the significant advantage of not relying on cell cultures, since only a minority of microorganisms are easily cultured in the laboratory. This thesis deals primarily with analysis of microbial diversity based on community profiling. This variant of environmental genomics targets defined marker genes to study the structure of microbial communities. The use of the small subunit ribosomal RNA as a phylogenetic marker is discussed and evaluated, with emphasis on taxonomic classification, estimation of diversity and comparison of community structure between samples. Thanks to improved sequencing technologies, community profiling is an increasingly powerful and cost-efficient technique. Like all methodologies, it has limitations and sources of random and systematic errors, many of which remain poorly understood. In relation to this, a number of recommendations and novel analysis methods are developed and provided. These are subsequently applied to study environmental communities, targeting issues like the "rare biosphere" concept, and variation of community structure across space and environmental gradients.

Taxonomic classification is the process of placing environmental sequences in the context of previously studied organisms. Thus, ecologically meaningful information such as putative metabolic functions can be derived. In Paper I, a set of resources for taxonomic classification is provided and evaluated. The performance of the resulting framework, CREST (Classification Resources for Environmental Sequence Tags), is shown to compare favourably to existing methods. It also provides a manually curated taxonomy and functionality for comparing composition across datasets.
In Paper II, a hydrothermal vent-associated microbial mat community is studied, using a set of different environmental genomics methods. Based on this study, several important sources of bias, and the reproducibility of community profiling, are evaluated and discussed. The results highlight the importance of applying complementary methods. They also illustrate the influence of primer choice, PCR bias and whether RNA or DNA is targeted.

Random variation, or noise, is another important factor to consider in community profiling studies. Papers III and IV examine the effect of such noise from PCR amplification and pyrosequencing, the latter currently being the most common sequencing method applied to environmental samples. The results of Paper III demonstrate that early community profiling studies using pyrosequencing have significantly overestimated the extent of biodiversity, because of noise. To compensate for such noise in amplicon sequence datasets, the program AmpliconNoise was developed. Using "mock communities", mixes of clones with known sequences, the performance of AmpliconNoise is demonstrated and compared to alternative methods. Analyses of diversity in the microbial mat community studied in Paper II utilise AmpliconNoise. The resulting estimates are compared to previous findings from similar environments.

In addition to biodiversity per se, the underlying diversity structures of communities, and the mechanisms shaping them, remain important but poorly understood issues in microbial ecology. Because of their many useful characteristics, alkaline soda lakes are used as a model ecosystem to study several such issues in Paper V. Results reveal that these extreme environments harbour surprisingly high microbial diversity. Interestingly, the most alkaline and saline lakes studied also appear to be the most diverse. Further, it is shown that pH, oxygen level, and sodium and potassium concentrations can explain 30% of the compositional variance between the lakes studied. The existence of organisms endemic to individual lakes is also indicated. Although soda lakes are relatively uncommon environments, this study provides an example of how fundamental biogeographical questions can be targeted using a careful choice of experimental design and analysis methodology. The results call into question several established notions, such as that extreme environments are generally less diverse and that few prokaryotic organisms are endemic.
Hopefully the findings will inspire future studies, exploring these relationships further.

In summary, the work presented here illustrates the importance of evaluating and optimising the methodology used in environmental genomics, particularly for amplicon sequencing, taxonomic classification, and estimation of phylogenetic diversity. It is likely that methodological limitations have biased and slowed down data analysis and interpretation of important ecological issues like the rare biosphere and microbial biogeography.