(1) Topic: mAML: automated machine learning model-building pipeline with microbiome data
Due to the current barriers that prevent non-experts from performing customized machine learning (ML) analyses that involve large microbiome data sets, automated ML systems designed for use in this special field are in high demand. Here, we introduced mAML, an ML model-building pipeline, which can automatically and rapidly generate an interpretable model that exhibits high performance for completing certain microbial classification tasks. For users that are not highly skilled in programming, we developed a web server for this pipeline that is user-friendly, flexible, and scalable. Once the feature data are uploaded, the best combination of preprocessors and classifiers and the optimized hyper-parameters will be queried automatically and simultaneously. The server supports the addition and the pruning of any preprocessing method or classifier, and the default settings can be configured to user-specific settings by applying a very simple edit. The user can upload new data to repeatedly use the model, or they can download the pipeline and the corresponding docker image for local use. This pipeline is data-driven and can be easily extended to other data types with no missing data if the domain-specific feature matrix is supplied.
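The automatic search over preprocessor and classifier combinations described above can be sketched as follows. This is a minimal illustration in plain Python, not the mAML implementation: the two preprocessors, the two classifiers, the leave-one-out scoring, and the toy count matrix are all assumptions chosen for brevity, and hyper-parameter optimization is omitted.

```python
from itertools import product
from statistics import mean

def identity(rows):
    """Preprocessor that leaves the feature matrix unchanged."""
    return [list(r) for r in rows]

def relative_abundance(rows):
    """Scale each sample so that its features sum to 1."""
    out = []
    for r in rows:
        total = sum(r) or 1.0
        out.append([v / total for v in r])
    return out

def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest_centroid(train_x, train_y, x):
    """Predict the label whose class mean is closest to x."""
    centroids = {}
    for label in set(train_y):
        members = [r for r, y in zip(train_x, train_y) if y == label]
        centroids[label] = [mean(col) for col in zip(*members)]
    return min(centroids, key=lambda lab: sq_dist(centroids[lab], x))

def one_nn(train_x, train_y, x):
    """Predict the label of the single closest training sample."""
    best = min(range(len(train_x)), key=lambda i: sq_dist(train_x[i], x))
    return train_y[best]

def loo_accuracy(preprocess, classify, rows, labels):
    """Leave-one-out accuracy for one preprocessor/classifier pair.

    Both preprocessors here act on each sample independently, so applying
    them before the split does not leak information between folds.
    """
    x = preprocess(rows)
    hits = 0
    for i in range(len(x)):
        train_x = x[:i] + x[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        hits += classify(train_x, train_y, x[i]) == labels[i]
    return hits / len(x)

def search(rows, labels, preprocessors, classifiers):
    """Score every combination; return the best entry and the full list."""
    scored = [
        (loo_accuracy(p, c, rows, labels), p_name, c_name)
        for (p_name, p), (c_name, c)
        in product(preprocessors.items(), classifiers.items())
    ]
    return max(scored, key=lambda s: s[0]), scored

# Toy count matrix (hypothetical): the class depends on proportions,
# not on sequencing depth.
rows = [[90, 10], [9, 1], [900, 100], [20, 80], [2, 8], [200, 800]]
labels = ["A", "A", "A", "B", "B", "B"]
preprocessors = {"identity": identity, "relative_abundance": relative_abundance}
classifiers = {"nearest_centroid": nearest_centroid, "one_nn": one_nn}

best, scored = search(rows, labels, preprocessors, classifiers)
print(best)
```

On this toy matrix the raw counts mislead both classifiers because sequencing depth varies, while the relative-abundance preprocessor separates the classes, which is the kind of difference an exhaustive search over combinations is meant to surface.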
Fenglong Yang is a postdoctoral researcher in bioinformatics at UESTC, advised by Prof. Quan Zou. He obtained his Ph.D. degree in microbiology from China Agricultural University (CAU). His research interests include exploring microbiome-host relationships with ML and network theory.
(1) Topic: A random forest sub-Golgi protein classifier
To gain insight into the malfunction of the Golgi apparatus and its relationship to various genetic and neurodegenerative diseases, the identification of sub-Golgi proteins, both cis-Golgi and trans-Golgi proteins, is of great significance. In this study, a state-of-the-art random forest sub-Golgi protein classifier, rfGPT, was developed. The rfGPT used 2-gap dipeptide and split amino acid composition for the feature vectors and was combined with the synthetic minority over-sampling technique and an analysis of variance (ANOVA) feature selection method. The rfGPT was trained on a sub-Golgi protein sequence data set (137 sequences) with sequence identity less than 25%. For the optimal rfGPT classifier with 93 features, the accuracy (ACC) was 90.5%; the Matthews correlation coefficient (MCC) was 0.811; the sensitivity (Sn) was 92.6%; and the specificity (Sp) was 88.4%. The independent testing scores for the rfGPT were ACC = 90.6%; MCC = 0.696; Sn = 96.1%; and Sp = 69.2%. Although the independent testing accuracy was 4.4% lower than that for the best reported sub-Golgi classifier trained on a data set with 40% sequence identity (304 sequences), the rfGPT is currently the top sub-Golgi protein predictor utilizing feature vectors without any position-specific scoring matrix and its derivative features. Therefore, the rfGPT is a more practical tool, because no sequence alignment against tens of millions of protein sequences is required. Thus, to date, the rfGPT is the Golgi classifier with the best independent testing scores among those optimized by training on smaller benchmark data sets.
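The 2-gap dipeptide features and the four reported scores can be illustrated with a short sketch. It assumes the common definition of a g-gap dipeptide (two residues separated by g intervening residues, giving 400 frequencies) and the standard confusion-matrix formulas for ACC, Sn, Sp and MCC; the function names and example inputs are hypothetical and are not taken from the rfGPT code.

```python
from itertools import product
from math import sqrt

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def gap_dipeptide_composition(sequence, gap=2):
    """Return 400 frequencies of residue pairs separated by `gap` residues."""
    pairs = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]
    counts = dict.fromkeys(pairs, 0)
    total = 0
    for i in range(len(sequence) - gap - 1):
        pair = sequence[i] + sequence[i + gap + 1]
        if pair in counts:  # skip non-standard residues such as X
            counts[pair] += 1
            total += 1
    return [counts[p] / total if total else 0.0 for p in pairs]

def binary_metrics(tp, fn, tn, fp):
    """Compute ACC, Sn, Sp and MCC from confusion-matrix counts.

    Assumes at least one positive and one negative sample.
    """
    acc = (tp + tn) / (tp + fn + tn + fp)
    sn = tp / (tp + fn)  # sensitivity: recall on the positive class
    sp = tn / (tn + fp)  # specificity: recall on the negative class
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"ACC": acc, "Sn": sn, "Sp": sp, "MCC": mcc}

# Hypothetical inputs, for illustration only.
features = gap_dipeptide_composition("ACDEFG", gap=2)
scores = binary_metrics(tp=40, fn=10, tn=30, fp=20)
print(len(features), scores)
```

Because MCC uses all four cells of the confusion matrix, it can drop sharply when one class is predicted poorly even if ACC stays high, which is consistent with the gap between ACC and MCC in the independent test above.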
Zhibin Lv is a postdoctoral researcher in machine learning. He obtained his bachelor’s degree in materials science from Xiamen University in 2008. From 2008 to 2013, he carried out Ph.D. research on solar cells at Peking University, advised by Prof. Dechun Zou. In 2013, he joined the China Academy of Engineering Physics and worked as a senior engineer in nuclear weapon science, technology, engineering and management (STEM) informatics. From May 2017, he worked as a vice-general manager in the fine chemicals department of Chengrand Company of ChemChina. Since May 25, 2019, he has worked as a postdoctoral researcher at UESTC, collaborating with Prof. Quan Zou. His research interests include machine learning applications in materials science, cheminformatics, bioinformatics and drug discovery.
(1) Topic: Shape of Data from the Algorithms’ View
Single-cell RNA sequencing (scRNA-seq) has generated vast amounts of data and renewed our understanding of biological phenomena at the cellular scale. Identifying cell types through classification or clustering followed by annotation has been the most prevalent means of interpreting scRNA-seq data, and it is on this basis that connections are made between the transcriptome and phenotype.
When the performance of clustering or classification algorithms on scRNA-seq data is evaluated, attention naturally goes to the algorithms rather than the datasets: statistical indices are compared across algorithms to indicate their performance on a chosen group of datasets. In this setting, all datasets are trusted equally rather than weighted, and the so-called “gold standard” or “silver standard” datasets are selected based on the experience (if not the intuition) of biologists. There is hardly any quantitative standard for evaluating how suitable a dataset is for algorithm evaluation.
A second question arises from the development of clustering or classification algorithms, particularly tools designed specifically for scRNA-seq data, where the characteristics of datasets could inspire new ideas. For example, in the development of scRNA-seq classifiers, the intrinsic structure of a dataset has been taken into consideration and explicitly used to guide the classification of cells. Whether other dataset characteristics can be exploited in tool development and refinement is therefore an intriguing question.
Both problems call for a quantitative understanding of the characteristics of datasets and of the relationships among datasets. Because the problems arise from the evaluation of algorithms, we attempt to define these characteristics and relationships from the perspective of how algorithms act on datasets across their parameter space. The aim is to learn more about the datasets themselves and their relationships to one another, and to offer a new aspect to consider in future algorithm development and evaluation.
Ziwei Wang is a postdoctoral researcher in bioinformatics at UESTC, advised by Prof. Quan Zou, and holds a Ph.D. degree in microbiology from Peking University (PKU). Ziwei Wang's research interest is in single-cell sequencing data processing.
(1) Topic: Its2vec: fungal species identification using sequence embedding and random forest classification
Fungi play essential roles in many ecological processes, and taxonomic classification is fundamental for microbial community characterization and vital for the study and preservation of fungal biodiversity. Fungal barcodes, especially the internal transcribed spacer (ITS) region of ribosomal DNA, have proven valuable for the identification of fungal species, especially those that are unculturable and for which morphological information is incomplete. To cope with massive fungal barcode data, tools that can handle large volumes of barcode sequences are necessary. However, high variation in the ITS region and the computational requirements of processing high-dimensional features remain challenging for existing predictors.
Here, we developed Its2vec, a bioinformatics tool for the classification of fungal ITS barcodes to the species level. An ITS database covering more than 25,000 species in a broad range of fungal taxa was assembled. For dimensionality reduction, a word embedding algorithm was used to represent an ITS sequence as a dense low-dimensional vector. A random forest-based classifier was built for species identification. Benchmarking results showed that our model achieved an accuracy comparable to that of several state-of-the-art predictors, and more importantly, it could handle large datasets and greatly reduce dimensionality. We expect the Its2vec model to be helpful for fungal species identification and, thus, for revealing microbial community structures and deepening our understanding of their functional mechanisms.
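The idea of representing a sequence as a dense low-dimensional vector can be sketched as follows. This is not the Its2vec code: treating overlapping k-mers as "words", the choice k = 3, and averaging the k-mer vectors into one sequence vector are assumptions made for illustration, and the tiny embedding table is hypothetical (a real table would be learned by a word embedding model).

```python
def kmer_tokens(sequence, k=3):
    """Split a DNA sequence into overlapping k-mers ("words")."""
    seq = sequence.upper()
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def embed_sequence(sequence, kmer_vectors, k=3):
    """Average the vectors of the known k-mers into one dense vector."""
    vectors = [kmer_vectors[t] for t in kmer_tokens(sequence, k)
               if t in kmer_vectors]
    if not vectors:
        raise ValueError("no k-mer of the sequence is in the embedding table")
    return [sum(col) / len(vectors) for col in zip(*vectors)]

# Hypothetical 2-dimensional vectors standing in for a learned embedding.
toy_vectors = {"ACG": [1.0, 0.0], "CGT": [0.0, 1.0], "GTA": [1.0, 1.0]}

tokens = kmer_tokens("acgta")
vector = embed_sequence("ACGTA", toy_vectors)
print(tokens, vector)
```

The resulting vector has a fixed length set by the embedding dimension regardless of sequence length, in contrast to a k-mer count vector whose length grows as 4^k; fixed-length vectors of this kind are what a random forest classifier would take as input.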
Chao Wang received his Ph.D. degree from the Institute of Microbiology, Chinese Academy of Sciences in 2019. He currently works as a Postdoctoral Fellow in IFFS of UESTC. His current interests focus on using bioinformatics and machine learning tools to classify fungal DNA barcodes and to predict the functions of fungal genes and proteins.
(1) Topic: Oscillations, travelling fronts and patterns in a supramolecular system
Supramolecular polymers, such as microtubules, operate under non-equilibrium conditions to drive crucial functions in cells, such as motility, division and organelle transport. In vivo and in vitro size oscillations of individual microtubules (dynamic instabilities) and collective oscillations have been observed. In addition, dynamic spatial structures, like waves and polygons, can form in non-stirred systems. Here we describe an artificial supramolecular polymer made of a perylene diimide derivative that displays oscillations, travelling fronts and centimetre-scale self-organized patterns when pushed far from equilibrium by chemical fuels. Oscillations arise from a positive feedback due to nucleation–elongation–fragmentation, and a negative feedback due to size-dependent depolymerization. Travelling fronts and patterns form due to self-assembly-induced density differences that cause system-wide convection. In our system, the species responsible for the nonlinear dynamics and those that self-assemble are one and the same. In contrast, other reported oscillating assemblies formed by vesicles, micelles or particles rely on the combination of a known chemical oscillator and a stimuli-responsive system, either by communication through the solvent (for example, by changing pH), or by anchoring one of the species covalently (for example, a Belousov–Zhabotinsky catalyst). The design of self-oscillating supramolecular polymers and large-scale dissipative structures brings us closer to the creation of more life-like materials that respond to external stimuli similarly to living cells, or to creating artificial autonomous chemical robots.
Yixi Wang received his Ph.D. degree from the School of Chemistry and Chemical Engineering, Shihezi University in 2019. He currently works as a Postdoctoral Fellow in IFFS of UESTC. His research interests include polymer self-assembly, non-equilibrium molecular systems, and reversible behavior at the molecular level.
Editor: 杨棋凌 / Reviewer: 林坤 / Publisher: 陈伟