(1) the complete title of one (or more) paper(s) published in the open literature describing the work that the author claims describes a human-competitive result

Paper title: "Automated Alphabet Reduction Method with Evolutionary Algorithms for Protein Structure Prediction"

(2) the name, complete physical mailing address, e-mail address, and phone number of EACH author of EACH paper,

Author 1:
-Name: Jaume Bacardit
-Address: School of Computer Science & IT, University of Nottingham, Jubilee Campus, Wollaton Road, Nottingham, NG8 1BB, UK
-Email: jqb@cs.nott.ac.uk
-Phone: +44 115 951 4234

Author 2:
-Name: Michael Stout
-Address: School of Computer Science & IT, University of Nottingham, Jubilee Campus, Wollaton Road, Nottingham, NG8 1BB, UK
-Email: mqs@cs.nott.ac.uk
-Phone: +44 115 951 4234

Author 3:
-Name: Jonathan D. Hirst
-Address: School of Chemistry, University of Nottingham, University Park, Nottingham, NG7 2RD, UK
-Email: jonathan.hirst@nottingham.ac.uk
-Phone: +44 115 951 3478

Author 4:
-Name: Kumara Sastry
-Address: Department of Industrial and Enterprise Systems Engineering, University of Illinois at Urbana-Champaign, 117 Transportation Bldg., MC-238, 104 S. Mathews Ave, Urbana, IL, 61801, USA
-Email: ksastry@uiuc.edu
-Phone: +1 217 333 2346

Author 5:
-Name: Xavier Llora
-Address: National Center for Supercomputing Applications, University of Illinois at Urbana-Champaign, 1205 W. Clark Street, Urbana, IL, 61801, USA
-Email: xllora@uiuc.edu
-Phone: +1 217 265 0894

Author 6:
-Name: Natalio Krasnogor
-Address: School of Computer Science & IT, University of Nottingham, Jubilee Campus, Wollaton Road, Nottingham, NG8 1BB, UK
-Email: nxk@cs.nott.ac.uk
-Phone: +44 115 846 7592

(3) the name of the corresponding author (i.e., the author to whom notices will be sent concerning the competition),

Natalio Krasnogor & Jaume Bacardit

(4) the abstract of the paper(s),

Paper abstract: This paper focuses on automated procedures to reduce the dimensionality of protein structure prediction datasets by simplifying the way in which the primary sequence of a protein is represented. The potential benefits of this procedure are a faster and easier learning process as well as the generation of more compact and human-readable classifiers. The dimensionality reduction procedure we propose consists of the reduction of the 20-letter amino acid (AA) alphabet, which is normally used to specify a protein sequence, into a lower cardinality alphabet. This reduction comes about by a clustering of AA types according to their physical and chemical similarity. Our automated reduction procedure is guided by a fitness function based on the Mutual Information between the AA-based input attributes of the dataset and the protein structure feature that is being predicted. To search for the optimal reduction, the Extended Compact Genetic Algorithm (ECGA) was used, and afterwards the results of this process were fed into (and validated by) BioHEL, a genetics-based machine learning technique. BioHEL used the reduced alphabet to induce rules for protein structure prediction features. BioHEL results are compared to two standard machine learning systems.
Our results show that it is possible to reduce the size of the alphabet used for prediction from twenty to just three letters, resulting in more compact, i.e. interpretable, rules. Also, a protein-wise accuracy performance measure suggests that the loss of accuracy accrued by this substantial alphabet reduction is not statistically significant when compared to the full alphabet.

(5) a list containing one or more of the eight letters (A, B, C, D, E, F, G, or H) that correspond to the criteria (see above) that the author claims that the work satisfies,

B, D, E and G

(6) a statement stating why the result satisfies the criteria that the contestant claims (see the examples below as a guide to aid in constructing this part of the submission),

(B) The result is equal to or better than a result that was accepted as a new scientific result at the time when it was published in a peer-reviewed scientific journal.

This entry deals with alphabet reduction applied to protein structure prediction (PSP) domains. PSP attempts to solve the as yet unknown mapping from a protein's primary sequence to its three-dimensional structure, called the native state. A primary sequence is a string defined using an alphabet of 20 letters, each of which represents one of the 20 Amino Acids (AA) that appear in nature. This entry demonstrates a protocol based on evolutionary algorithms that transforms the representation of the primary sequence of a protein, using the 20-letter AA alphabet, into another string using a much lower cardinality alphabet *without* losing critical information. Alphabet reduction can be applied to many subproblems of the PSP field provided that, as mentioned above, in each case critical information is preserved. Searching for good reduced alphabets is and has been an active field of research, and some of the best reductions are man-made. For instance, in [1] the authors claim that at least 10 symbols are needed in order to be able to perform successful sequence alignment.
In [2] several reduced alphabets, some man-made and others automatically generated, are compared in a task related to sequence alignment. The best reduced alphabet, of 5 symbols, obtains an accuracy 0.9% lower than the original 20-letter alphabet. In this work we apply our automated evolutionary computation-based alphabet reduction method to a PSP domain called coordination number prediction. We obtain a reduced alphabet of only three symbols that captures the essence of coordination number prediction with a protein-wise accuracy only 0.6% lower than that of the original 20-letter alphabet.

(D) The result is publishable in its own right as a new scientific result - independent of the fact that the result was mechanically created.

The work has been accepted for publication in the GECCO2007 conference. A second, much larger, paper is being submitted to one of the key bioinformatics journals, showing that our evolutionary algorithm-based alphabet reduction method also produces alphabet reductions that are better than the current state of the art (whether human-made or not) for a variety of PSP related problems.

(E) The result is equal to or better than the most recent human-created solution to a long-standing problem for which there has been a succession of increasingly better human-created solutions.

Alphabet reduction is a key component of structural bioinformatics research. One famous alphabet reduction is the HP model (also called Dill's model), which reduces the 20 amino acid alphabet to only two classes, namely hydrophobic and polar. These two classes focus on hydrophobicity, the main physical force that controls the folding process. Textbooks also group the amino acids according to properties such as size (small vs big), charge (charged vs uncharged, positive vs negative), etc. [3]. There is plenty of information on alternative alphabets available; see for example references [4-6].
The method proposed here can be applied to many PSP related problems and supersedes some of the best reduced alphabets.

(G) The result solves a problem of indisputable difficulty in its field.

Protein Structure Prediction (PSP), together with its related problems (e.g. contact map prediction, contact number prediction, solvent accessibility prediction, domain prediction, etc.), is, after decades of research, still one of the main unsolved problems in Computational Molecular Biology. This is due to many factors, such as incomplete understanding of the folding process, imperfect and noisy data, enormous search spaces, etc. The relevance of this problem is enormous, as a better understanding and predictive capacity of the PSP models can lead to improvements in the design of pharmaceuticals, among other benefits. As a result of the above factors, intensive research on how to reduce the dimensionality of these problems has been pursued over the years. One such dimensionality reduction technique is, precisely, alphabet reduction. This entry mainly addresses the last of these factors, the enormous search spaces: it alleviates the huge computational cost of PSP (which can be on the order of 10^4 CPU days to predict the structure of a single protein using state-of-the-art methods [7]) by strategically simplifying, without much information loss, the representation of the primary sequence of a protein, thereby reducing the dimensionality of the search space that PSP methods have to handle. Moreover, the solutions found by the methods in this entry can also contribute to a better understanding of the protein folding process. Perhaps more significantly, our results indicate that no single reduced alphabet fits all problems; hence a human-competitive alphabet reduction algorithm such as the one we present could, in the near future, provide key insights for many different PSP subproblems where a single overall reduction policy would fail.
Thus, our entry is human-competitive not only in the sense that it improves on the current state of the art but also in that it exceeds human capacity for specialising these alphabets to other PSP related problems.

[1] T. Li, K. Fan, J. Wang and W. Wang. Reduction of protein sequence complexity by residue grouping. Protein Engineering, 16(5):323-330, 2003.
[2] F. Melo and M. Marti-Renom. Accuracy of sequence alignment and fold assessment using reduced amino acid alphabets. Proteins, 63:986-995, 2006.
[3] M. Betts and R. Russell. Amino acid properties and consequences of substitutions. In Bioinformatics for Geneticists. Wiley, 2003.
[4] C. Branden and J. Tooze. Introduction to Protein Structure. Routledge, 1998.
[5] L. Mirny and E. Shakhnovich. Evolutionary conservation of folding nucleus. Journal of Molecular Biology, 308:123-129, 2001.
[6] National Center for Biotechnology Information. http://www.ncbi.nlm.nih.gov.
[7] K. M. Misura, D. Chivian, C. A. Rohl, D. E. Kim, and D. Baker. Physically realistic homology models built with Rosetta can be more accurate than their templates. Proc Natl Acad Sci USA, 103(14):5361-5366, 2006.

(7) a full citation of the paper (that is, author names; publication date; name of journal, conference, technical report, thesis, book, or book chapter; name of editors, if applicable, of the journal or edited book; publisher name; publisher city; page numbers, if applicable);

J. Bacardit, M. Stout, J.D. Hirst, K. Sastry, X. Llora and N. Krasnogor. Automated Alphabet Reduction Method with Evolutionary Algorithms for Protein Structure Prediction. In Proceedings of the 9th Annual Conference on Genetic and Evolutionary Computation (GECCO2007), to appear, ACM Press, 2007.
(8) a statement either that "any prize money, if any, is to be divided equally among the co-authors" OR a specific percentage breakdown as to how the prize money, if any, is to be divided among the co-authors;

Any prize money, if any, is to be divided equally among the co-authors

(9) a statement stating why the judges should consider the entry as "best" in comparison to other entries that may also be "human-competitive."

Protein Structure Prediction, as stated above, is still one of the main unsolved problems in Computational Molecular Biology. The potential impact on biomedical research of providing better PSP solutions than the current ones is enormous, for example in genetic therapy and in the synthesis of drugs for incurable diseases. The results obtained through our method take a significant step towards this goal by reducing the dimensionality of the problem through an evolutionary, automated reduction of protein alphabets. We think it should be considered the best because it addresses a very relevant, important, high-profile and timely problem.
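
Illustrative note on the abstract in item (4): the idea of scoring a candidate reduced alphabet by the Mutual Information between the reduced residues and the predicted structural feature can be sketched in a few lines of Python. This is only a minimal sketch under stated assumptions, not the implementation used in the paper: the ECGA search, the windowed input attributes and BioHEL are all omitted, the score is computed per residue rather than per attribute window, and the names EXAMPLE_GROUPS, build_mapping, reduce_sequence, mutual_information and fitness are hypothetical. The two-class grouping is loosely in the spirit of the HP model, and its exact membership is an assumption made for illustration.

```python
from collections import Counter
from math import log2

# Hypothetical two-class grouping in the spirit of the HP model. The exact
# membership of each class is an assumption chosen only for illustration.
EXAMPLE_GROUPS = {
    "H": "AVLIMFWC",      # treated here as hydrophobic
    "P": "GSTYNQDEKRHP",  # treated here as polar
}


def build_mapping(groups):
    """Turn {reduced_letter: amino_acids} into {amino_acid: reduced_letter}."""
    return {aa: letter for letter, members in groups.items() for aa in members}


def reduce_sequence(sequence, mapping):
    """Rewrite a primary sequence using the reduced alphabet."""
    return "".join(mapping[aa] for aa in sequence)


def mutual_information(xs, ys):
    """Empirical mutual information (in bits) between two aligned symbol lists."""
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have the same length")
    n = len(xs)
    if n == 0:
        return 0.0
    count_x = Counter(xs)
    count_y = Counter(ys)
    count_xy = Counter(zip(xs, ys))
    # Sum over observed pairs of p(x,y) * log2( p(x,y) / (p(x) * p(y)) ).
    return sum(
        (c / n) * log2(c * n / (count_x[x] * count_y[y]))
        for (x, y), c in count_xy.items()
    )


def fitness(groups, residues, labels):
    """Score a candidate grouping: higher means the reduced letters are more
    informative about the per-residue labels (a simplification of the
    attribute-level score described in the abstract)."""
    reduced = reduce_sequence(residues, build_mapping(groups))
    return mutual_information(reduced, labels)
```

In such a sketch a search procedure would propose different values of the grouping and keep those with the highest fitness; a grouping that merges residues with different label statistics loses information and scores lower.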