(1) the complete title of one (or more) paper(s) published in the open 
literature describing the work that the author claims describes a 
human-competitive result

Paper title: "Automated Alphabet Reduction Method with Evolutionary Algorithms
for Protein Structure Prediction"


(2) the name, complete physical mailing address, e-mail address, and phone 
number of EACH author of EACH paper,

Author 1:
-Name: Jaume Bacardit,
-Address: School of Computer Science & IT, University of Nottingham,
Jubilee Campus, Wollaton Road, Nottingham, NG8 1BB, UK
-Email: jqb@cs.nott.ac.uk
-Phone: +44 115 951 4234 

Author 2:
-Name: Michael Stout,
-Address: School of Computer Science & IT, University of Nottingham,
Jubilee Campus, Wollaton Road, Nottingham, NG8 1BB, UK
-Email: mqs@cs.nott.ac.uk
-Phone: +44 115 951 4234 

Author 3:
-Name: Jonathan D. Hirst,
-Address: School of Chemistry, University of Nottingham,
University Park, Nottingham, NG7 2RD, UK
-Email: jonathan.hirst@nottingham.ac.uk
-Phone: +44 115 951 3478

Author 4:
-Name: Kumara Sastry,
-Address: Department of Industrial and Enterprise Systems Engineering, 
University of Illinois at Urbana-Champaign, 117 Transportation Bldg., MC-238,
104 S. Mathews Ave, Urbana, IL, 61801, USA
-Email: ksastry@uiuc.edu
-Phone: +1 217 333 2346 

Author 5:
-Name: Xavier Llora,
-Address: National Center for Supercomputing Applications, 
University of Illinois at Urbana-Champaign, 1205 W. Clark Street, 
Urbana, IL, 61801, USA
-Email: xllora@uiuc.edu
-Phone: +1 217 265 0894

Author 6:
-Name: Natalio Krasnogor,
-Address: School of Computer Science & IT, University of Nottingham,
Jubilee Campus, Wollaton Road, Nottingham, NG8 1BB, UK
-Email: nxk@cs.nott.ac.uk
-Phone: +44 115 846 7592


(3) the name of the corresponding author (i.e., the author to whom notices 
will be sent concerning the competition),

Natalio Krasnogor & Jaume Bacardit

(4) the abstract of the paper(s),

Paper abstract:
This paper focuses on automated procedures to reduce the dimensionality of
protein structure prediction datasets by simplifying the way in which the
primary sequence of a protein is represented. The potential benefits of
this procedure are a faster and easier learning process as well as the
generation of more compact and human-readable classifiers. The dimensionality
reduction procedure we propose consists of the reduction of the 20-letter
amino acid (AA) alphabet, which is normally used to specify a protein
sequence, into a lower cardinality alphabet. This reduction comes about by a
clustering of AA types according to their physical and chemical similarity.
Our automated reduction procedure is guided by a fitness function based on the
Mutual Information between the AA-based input attributes of the dataset and
the protein structure feature that is being predicted.

To search for the optimal reduction, the Extended Compact Genetic Algorithm 
(ECGA) was used, and afterwards the results of this process were fed into (and 
validated by) BioHEL, a genetics-based machine learning technique. BioHEL 
used the reduced alphabet to induce rules for protein structure prediction 
features. BioHEL results are compared to two standard machine learning 
systems. Our results show that it is possible to reduce the size of the 
alphabet used for prediction from twenty to just three letters, resulting in 
more compact, i.e. interpretable, rules. Also, a protein-wise accuracy 
performance measure suggests that the loss of accuracy accrued by this 
substantial alphabet reduction is not statistically significant when compared 
to the full alphabet.


(5) a list containing one or more of the eight letters (A, B, C, D, E, F, G, 
or H) that correspond to the criteria (see above) that the author claims that 
the work satisfies,

B, D, E and G

(6) a statement stating why the result satisfies the criteria that the 
contestant claims (see the examples below as a guide to aid in constructing 
this part of the submission),


(B) The result is equal to or better than a result that was accepted as a new 
scientific result at the time when it was published in a peer-reviewed 
scientific journal.

This entry deals with alphabet reduction applied to protein structure 
prediction (PSP) domains. PSP attempts to solve the as yet unknown mapping 
from a protein's primary sequence to its three-dimensional structure, called 
the native state. A primary sequence is a string defined using an alphabet of 
20 letters, each of which represents one of the 20 Amino Acids (AA) that 
appear in nature. This entry demonstrates a protocol based on evolutionary 
algorithms that transforms the representation of the primary sequence of a 
protein, using the AA 20-letter alphabet, into another string using a much 
lower cardinality alphabet *without* losing critical information.
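
As a minimal sketch of what such a transformation looks like once a reduced 
alphabet has been chosen, the Python fragment below rewrites a primary 
sequence with a three-letter alphabet. The partition in GROUPS, the symbols 
"a", "b" and "c", and the function names are hypothetical and chosen only for 
illustration; they are not the alphabet reported in the paper.

```python
# Hypothetical partition of the 20 amino acids into 3 groups.
# This grouping is illustrative only, NOT the alphabet found in the paper.
GROUPS = {
    "a": "ACFILMVWY",
    "b": "DEKNQR",
    "c": "GHPST",
}


def build_mapping(groups):
    """Turn {symbol: residues} into a per-residue lookup table."""
    mapping = {}
    for symbol, residues in groups.items():
        for aa in residues:
            if aa in mapping:
                raise ValueError(f"residue {aa} assigned to two groups")
            mapping[aa] = symbol
    return mapping


def reduce_sequence(sequence, mapping):
    """Rewrite a primary sequence using the reduced alphabet."""
    return "".join(mapping[aa] for aa in sequence)


MAPPING = build_mapping(GROUPS)
reduced = reduce_sequence("MKVGAD", MAPPING)
```

The reduced string has the same length as the original; only the number of 
distinct symbols per position changes.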

Alphabet reduction can be applied to many subproblems of the PSP field 
provided that, as mentioned above, in each case critical information is 
preserved. Searching for good reduced alphabets is and has been an active 
field of research and some of the best reductions are man-made. For instance, 
in [1] the authors claim that at least 10 symbols are needed in order to be 
able to perform successful sequence alignment. In [2] several reduced 
alphabets, some man-made and others automatically generated, are compared 
in a sequence alignment related task. The best reduced alphabet, of 5 symbols, 
obtains an accuracy 0.9% lower than the original 20-letter alphabet.

In this work we apply our automated evolutionary computation-based alphabet 
reduction method to a PSP domain called coordination number prediction. We 
obtain a reduced alphabet of only three symbols that can accurately capture 
the essence of coordination number prediction, with a protein-wise accuracy 
only 0.6% lower than that of the original 20-letter alphabet.
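
The abstract in item (4) states that the search is guided by the Mutual 
Information between the AA-based input attributes and the predicted feature. 
A minimal sketch of that kind of score is given below. It assumes a single 
residue attribute and a two-class label, which is a simplification; the toy 
data, the two groupings and all names are hypothetical and are not taken from 
the paper.

```python
from collections import Counter
from math import log2


def mutual_information(xs, ys):
    """Mutual information (in bits) between two discrete sequences."""
    n = len(xs)
    count_x = Counter(xs)
    count_y = Counter(ys)
    count_xy = Counter(zip(xs, ys))
    mi = 0.0
    for (x, y), c in count_xy.items():
        # p(x,y) * log2( p(x,y) / (p(x) * p(y)) )
        mi += (c / n) * log2(c * n / (count_x[x] * count_y[y]))
    return mi


def fitness(mapping, residues, labels):
    """Score a candidate reduction by the information it keeps about labels."""
    reduced = [mapping[aa] for aa in residues]
    return mutual_information(reduced, labels)


# Toy data (hypothetical): residue at a position and a binary class label.
residues = list("AAVVDDKK")
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Two hypothetical two-letter reductions of the four residues used above.
good = {"A": "a", "V": "a", "D": "b", "K": "b"}
bad = {"A": "a", "D": "a", "V": "b", "K": "b"}

mi_full = mutual_information(residues, labels)
mi_good = fitness(good, residues, labels)
mi_bad = fitness(bad, residues, labels)
```

On this toy data the first grouping keeps all the information that the 
unreduced residues carry about the label, while the second keeps none, which 
is the distinction a fitness of this kind is meant to reward.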

(D) The result is publishable in its own right as a new scientific result - 
independent of the fact that the result was mechanically created.

The work has been accepted for publication in the GECCO2007 conference. A 
second, much larger, paper is being submitted to one of the key bioinformatics 
journals showing that our evolutionary algorithm-based alphabet reduction 
method also produces alphabet reductions better than the current state of the 
art (whether human-made or not) for a variety of PSP-related problems.

(E) The result is equal to or better than the most recent human-created
solution to a long-standing problem for which there has been a
succession of increasingly better human-created solutions.

Alphabet reduction is a key component of structural bioinformatics research. 
One famous alphabet reduction is the HP model (also called Dill's model), 
which reduces the 20-letter amino acid alphabet to only two classes, namely 
hydrophobic and polar. These two classes focus on hydrophobicity, the main
physical force that controls the folding process. Other reductions, found in 
textbooks, group the amino acids according to properties such as size (small 
vs big), charge (charged vs uncharged, positive vs negative), etc. [3]. 

Plenty of information on alternative alphabets is available; see, for 
example, references [4-6].

The method proposed here can be applied to many PSP related problems and 
supersedes some of the best reduced alphabets.

(G) The result solves a problem of indisputable difficulty in its field.

Protein Structure Prediction (PSP) and its related problems (e.g. contact map 
prediction, contact number prediction, solvent accessibility prediction, domain 
prediction, etc.) is, after decades of research, still one of the main unsolved 
problems in Computational Molecular Biology. This is due to many factors, such 
as incomplete understanding of the folding process, imperfect and noisy data, 
enormous search spaces, etc. The relevance of this problem is enormous, as a
better understanding and predictive capacity of the PSP models can lead to
improvements in the design of pharmaceuticals, among other benefits.

As a result of the above factors, intensive research on how to reduce the 
dimensionality of these problems has been pursued over the years. One such 
dimensionality reduction technique is, precisely, alphabet reduction. This 
entry is meant to help mainly with the last of these factors, the enormous 
search spaces. It alleviates the huge computational cost of PSP (which can be 
on the order of 10^4 CPU days to predict the structure of a single protein 
using state-of-the-art methods [7]) by strategically simplifying, without much 
information loss, the representation of the primary sequence of a protein, 
thereby reducing the dimensionality of the search space that the PSP methods 
have to handle.
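
To give a rough sense of scale, the sketch below counts the distinct input 
patterns a learner could face when each instance is described by a window of 
consecutive residues. The window length of 9 and the function name are 
assumptions made only for this illustration; the entry does not state a 
window size.

```python
def input_space_size(alphabet_size, window):
    """Number of distinct residue windows of the given length."""
    return alphabet_size ** window


WINDOW = 9  # hypothetical window length, chosen only for illustration

full_space = input_space_size(20, WINDOW)     # 20-letter AA alphabet
reduced_space = input_space_size(3, WINDOW)   # three-letter reduced alphabet

print(f"20 letters: {full_space:,} possible windows")
print(f" 3 letters: {reduced_space:,} possible windows")
```

Under these assumptions the count falls from 512,000,000,000 to 19,683 
possible windows.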

Moreover, the solutions found by the methods in this entry can also 
contribute to a better understanding of the protein folding process.
Perhaps more significantly, our results indicate that there is no single 
reduced alphabet that fits all problems. Hence a human-competitive alphabet 
reduction algorithm such as the one we present could provide, in the near 
future, key insights for many different PSP subproblems where a single overall 
reduction policy would fail. Thus, our entry is not only human-competitive in 
the sense that it improves on the current state of the art but also in that it 
exceeds human capacity for specialising these alphabets to other PSP-related 
problems.

[1] T. Li, K. Fan, J. Wang and W. Wang. Reduction of protein sequence 
complexity by residue grouping. Protein Engineering vol. 16 no. 5 pp. 
323-330, 2003.

[2] F. Melo and M. Marti-Renom. Accuracy of sequence alignment and fold 
assessment using reduced amino acid alphabets. Proteins, 63:986-995, 2006.

[3] M. Betts and R. Russell. Amino acid properties and consequences of 
substitutions. In Bioinformatics for Geneticists. Wiley, 2003.

[4] C. Branden and J. Tooze. Introduction to Protein Structure. 
Routledge, 1998.

[5] L. Mirny and E. Shakhnovich. Evolutionary conservation of the folding 
nucleus. Journal of Molecular Biology, 308:123-129, 2001.

[6] National Center for Biotechnology Information. 
http://www.ncbi.nlm.nih.gov.

[7] K. M. Misura, D. Chivian, C. A. Rohl, D. E. Kim, and D. Baker. Physically 
realistic homology models built with Rosetta can be more accurate than 
their templates. Proc Natl Acad Sci USA, 103(14):5361-5366, 2006.


(7) a full citation of the paper (that is, author names; publication date; 
name of journal, conference, technical report, thesis, book, or book chapter; 
name of editors, if applicable, of the journal or edited book; publisher name; 
publisher city; page numbers, if applicable);

J. Bacardit, M. Stout, J.D. Hirst, K. Sastry, X. Llora and N. Krasnogor.
Automated Alphabet Reduction Method with Evolutionary Algorithms for Protein 
Structure Prediction. In Proceedings of the 9th Annual Conference on Genetic 
and Evolutionary Computation (GECCO2007), to appear, ACM Press, 2007.

(8) a statement either that "any prize money, if any, is to be divided equally 
among the co-authors" OR a specific percentage breakdown as to how the prize 
money, if any, is to be divided among the co-authors;

Any prize money, if any, is to be divided equally among the co-authors

(9) a statement stating why the judges should consider the entry as "best" in 
comparison to other entries that may also be "human-competitive."

Protein Structure Prediction, as stated above, is still one of the main 
unsolved problems in Computational Molecular Biology. The benefits of being 
able to provide better PSP solutions than the current ones are countless in 
biomedical research: genetic therapy, synthesis of drugs for incurable 
diseases, etc.

The results obtained through our method take a significant step forward
towards achieving this goal by reducing the dimensionality of the problem 
through an evolutionary-based automated reduction of protein alphabets. We 
think it has to be considered as best because it addresses a very relevant, 
important, high-profile and timely problem.