The Mining of Complex Data

Participants: Amedeo Napoli, Chedy Raïssi, Elias Egho, Florence Le Ber, Luis Felipe Melo, Nicolas Jay, Yannick Toussaint, Zainab Assaghir

Formal concept analysis, itemset search, and association rule extraction are symbolic methods suitable for KDDK that can be used in real-sized applications. They can still be improved in ease of use, in efficiency, and in their ability to adapt to evolving situations. Accordingly, the team is working on extending these symbolic methods to complex data such as objects with multi-valued attributes (e.g. domains or intervals), n-ary relations, sequences, trees, graphs, and documents.


FCA, RCA, and Pattern Structures

Recent advances in data and knowledge engineering have emphasized the need for Formal Concept Analysis (FCA) tools that take structured data into account. There are a few extensions of FCA for handling contexts involving complex data formats, e.g. graphs or relational data. Among them, Relational Concept Analysis (RCA) is a process for analyzing objects described by both binary and relational attributes [118]. The RCA process takes as input a collection of contexts and of inter-context relations, and yields a set of lattices, one per context, whose concepts are linked by relations. RCA plays an important role in KDDK, especially in text mining [91] [92].

Another extension of FCA is based on Pattern Structures (PS) [100], which allow a concept lattice to be built from complex data, e.g. nominal, numerical, and interval data. In [6], pattern structures are used for building a concept lattice from intervals, in full compliance with FCA (thus benefiting from the efficiency of FCA algorithms). The notion of similarity between objects is closely related to these extensions of FCA: two objects are similar as soon as they share the same attributes (binary case), attributes with similar values, or the same description (at least in part). Various results were obtained on the relations between FCA with an embedded explicit similarity measure and FCA with pattern structures [42] [67] [68]. Moreover, similarity is not a transitive relation, which led us to study tolerance relations [41]. In addition, a new research perspective aims at using frequent itemset search methods, guided by pattern structures, for mining interval-based data [6].
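
To make the interval case concrete, here is a minimal sketch in Python. It assumes the usual interval pattern-structure setting, in which the similarity (meet) of two interval descriptions is their component-wise convex hull. The function names and the toy data are illustrative assumptions and are not taken from the cited work; the enumeration is brute force and only suitable for tiny examples.

```python
from itertools import combinations

def meet(d1, d2):
    """Similarity of two interval descriptions: component-wise convex hull."""
    return tuple((min(a1, a2), max(b1, b2))
                 for (a1, b1), (a2, b2) in zip(d1, d2))

def subsumed(d, pattern):
    """True if every interval of d lies inside the matching interval of pattern."""
    return all(pa <= a and b <= pb for (a, b), (pa, pb) in zip(d, pattern))

def common_description(objects, data):
    """Description shared by a non-empty set of objects (iterated meet)."""
    objs = list(objects)
    desc = data[objs[0]]
    for o in objs[1:]:
        desc = meet(desc, data[o])
    return desc

def extent(pattern, data):
    """All objects whose description falls within the pattern."""
    return frozenset(o for o, d in data.items() if subsumed(d, pattern))

def pattern_concepts(data):
    """Map each closed extent to its intent by closing every non-empty
    subset of objects (exponential, for illustration only)."""
    concepts = {}
    names = list(data)
    for r in range(1, len(names) + 1):
        for subset in combinations(names, r):
            intent = common_description(subset, data)
            concepts[extent(intent, data)] = intent
    return concepts

# Hypothetical numerical context: each value v is encoded as the interval (v, v).
data = {
    "g1": ((5, 5), (7, 7)),
    "g2": ((6, 6), (8, 8)),
    "g3": ((4, 4), (8, 8)),
}
concepts = pattern_concepts(data)
for ext, intent in sorted(concepts.items(), key=lambda c: len(c[0])):
    print(sorted(ext), intent)
```

In this sketch, the more objects a concept covers, the wider its intervals become, which mirrors the usual ordering of a concept lattice (larger extents, more general intents).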

Pattern structures combined with a similarity measure were applied to decision support in agronomy. In this domain, a set of agro-ecological indicators is designed to help farmers improve their agricultural practices by estimating the impact of cultivation practices on the “agrosystem”. The modeling and assessment of environmental risk require a large number of parameters whose measurement is imprecise. The propagation of the imprecision and the different types of imprecision have to be taken into account when computing the value of indicators for decision support. Based on pattern structures with an associated similarity measure, this problem has been approached as an information fusion problem, with substantial results [12] [34] [59] [60].

Still in the context of agronomy, research work is concerned with the design of models for representing and reasoning about spatial structures in knowledge-based systems and, in parallel, with the mining of complex hydrobiological data with concept lattices. FCA was compared and combined with statistical approaches to deal with multi-valued contexts in hydrobiology [2] [66] [69].

To complete the work on itemset search, there is ongoing work on frequent and rare itemset search, with several goals: improving standard algorithms, building lattices from very large data, and completing the algorithm collection of the Coron platform. This year, substantial results were obtained on the search for rare itemsets, an activity that is very important in biology and medicine because of the existence of rare symptoms [30] [57].
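
The distinction between frequent and rare itemsets can be sketched as follows in Python. This is a brute-force illustration under stated assumptions, not an algorithm from the Coron platform: an itemset is taken to be frequent when its support reaches a threshold `min_support`, rare when it occurs at least once but below that threshold, and minimal rare when all of its proper non-empty subsets are frequent. The symptom data are invented for the example.

```python
from itertools import combinations

def support(itemset, transactions):
    """Number of transactions containing every item of the itemset."""
    return sum(1 for t in transactions if itemset <= t)

def split_itemsets(transactions, min_support):
    """Return (frequent, rare) dicts mapping itemsets to their support.
    Rare means: occurs at least once, but fewer than min_support times."""
    items = sorted(set().union(*transactions))
    frequent, rare = {}, {}
    for size in range(1, len(items) + 1):
        for combo in combinations(items, size):
            s = frozenset(combo)
            n = support(s, transactions)
            if n >= min_support:
                frequent[s] = n
            elif n > 0:
                rare[s] = n
    return frequent, rare

def minimal_rare(frequent, rare):
    """Rare itemsets all of whose proper non-empty subsets are frequent."""
    return {
        s: n for s, n in rare.items()
        if all(frozenset(sub) in frequent
               for k in range(1, len(s))
               for sub in combinations(s, k))
    }

# Hypothetical patient records, each a set of observed symptoms.
transactions = [
    frozenset({"fever", "cough"}),
    frozenset({"fever", "cough", "rash"}),
    frozenset({"fever", "cough"}),
    frozenset({"fever", "rash"}),
]
frequent, rare = split_itemsets(transactions, min_support=3)
print(minimal_rare(frequent, rare))
```

Enumerating every subset is exponential in the number of items; practical miners prune the search space by using the fact that every superset of a rare itemset is itself rare or absent.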

Privacy, anonymization, skylines, and streams

In the past decade, most of the research in privacy-preserving data mining has focused on privacy issues for relational data. Techniques such as k-anonymity, l-diversity and t-closeness have been proposed to address related problems. The publication of transaction data, such as market basket data, medical records, and query logs, serves the public benefit. Mining such data allows for the derivation of association rules that connect certain items to others with measurable confidence. Still, this type of data analysis poses a privacy threat: an adversary with partial information on a person’s behavior may confidently associate that person with an item deemed to be sensitive. Ideally, an anonymization of such data should lead to an inference-proof version that prevents the association of individuals with sensitive items, while otherwise allowing truthful associations to be derived. The original approaches to this problem were based on value perturbation, which damages data integrity. More recently, value generalization has been proposed as an alternative; still, approaches based on it have assumed either that all items are equally sensitive, or that some are sensitive and can be known to an adversary only by association, while others are non-sensitive and can be known directly. Yet in reality there is a distinction between sensitive and non-sensitive items, and an adversary may possess information on any of them. Most critically, no previous method aims at a clear inference-proof privacy guarantee. In our research work, we propose the first, to our knowledge, privacy concept that inherently safeguards against sensitive associations without constraining the nature of an adversary’s knowledge and without falsifying data [37].

Recently, skyline analysis has attracted a lot of interest due to its importance in multi-criteria decision making applications. In our research work, we introduce a novel approach that significantly reduces both the number of domination tests for a given subspace and the number of subspaces searched [9]. Technically, we identify two types of skyline points that can be directly derived without using any domination tests. Moreover, based on formal concept analysis, we introduce two closure operators that enable a concise representation of skyline cubes. We show that this concise representation is easy to compute and develop an efficient algorithm, which only needs to search a small portion of the huge search space.
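
For readers unfamiliar with the terminology, the following Python sketch shows what a domination test, a skyline, and a skyline cube are. It is only the naive baseline, which compares every pair of points in every subspace; it does not implement the optimizations or the closure operators described above. The convention that smaller values are better, the function names, and the hotel data are assumptions made for the example.

```python
from itertools import combinations

def dominates(p, q, dims):
    """p dominates q on dims: no worse on every dimension and strictly
    better on at least one. Smaller values are assumed to be better."""
    return (all(p[d] <= q[d] for d in dims)
            and any(p[d] < q[d] for d in dims))

def skyline(points, dims):
    """Points that no other point dominates in the given subspace."""
    return [p for p in points
            if not any(dominates(q, p, dims) for q in points)]

def skycube(points, n_dims):
    """Skyline of every non-empty subspace of the n_dims dimensions."""
    cube = {}
    for k in range(1, n_dims + 1):
        for dims in combinations(range(n_dims), k):
            cube[dims] = skyline(points, dims)
    return cube

# Hypothetical hotels described by (price, distance to the beach in km).
hotels = [(50, 3.0), (80, 1.0), (60, 2.0), (90, 2.5)]
for dims, pts in skycube(hotels, 2).items():
    print(dims, pts)
```

The baseline performs a quadratic number of domination tests in each of the 2^d - 1 subspaces, which is the cost that skyline cube algorithms aim to avoid.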

Sampling streams of continuous data with limited memory, or “reservoir sampling”, is a utility algorithm. Standard reservoir sampling maintains a random sample of the entire stream as it has arrived so far. This does not meet the requirement of many applications to give preference to recent data. The simplest algorithm for maintaining a random sample of a sliding window periodically reproduces the same sample design, which is undesirable for many applications. In our research work, we propose an effective algorithm, which is very simple and therefore very efficient, for maintaining an almost random sample of a sliding window [48].
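
As a point of reference, standard reservoir sampling over the whole stream can be sketched in Python as below. This is the classical baseline mentioned above, not the sliding-window algorithm of [48], whose details are not given here. The function name and parameters are chosen for the example.

```python
import random

def reservoir_sample(stream, k, rng=None):
    """Keep a uniform random sample of k items from a stream of unknown
    length, using memory proportional to k only."""
    rng = rng or random.Random()
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Item number i + 1 replaces a random slot with probability k / (i + 1).
            j = rng.randint(0, i)  # randint is inclusive on both ends
            if j < k:
                reservoir[j] = item
    return reservoir

# Usage: sample 5 values from a stream of 10,000 integers.
sample = reservoir_sample(range(10_000), k=5, rng=random.Random(42))
print(sample)
```

Because every item seen so far is equally likely to be in the reservoir, old items are never favored less than recent ones, which is exactly why this baseline does not suit applications that prefer recent data.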

KDDK in Text Mining

Ontologies help software and human agents to communicate by providing shared and common domain knowledge, and by supporting various tasks, e.g. problem-solving and information retrieval. In practice, building an ontology depends on a number of “ontological resources” of different types: thesauri, dictionaries, texts, databases, and ontologies themselves. We are currently working on the design of a methodology and the implementation of a system for ontology engineering from heterogeneous ontological resources [32]. This methodology is based on both FCA and RCA, and was previously applied successfully in domains such as astronomy and biology.

This year, an engineer will be in charge of implementing a new and robust system guided by the previous research results, which should also open new research directions involving trees and graphs.

Besides text mining, pharmacovigilance (PV) is concerned with the study and the prevention of adverse drug reactions (ADRs), based on data collected by specialized centers and stored in case report databases (CRDBs). The CRDBs are then mined to find unexpected associations between drugs and ADRs that can be interpreted as signals. One objective of the ANR Project Vigitermes, which ended in June 2010, was to design a knowledge-based system for the management and the documentation of case reports, as well as for the detection of unexpected pharmacological associations. Following expert needs, we propose a method based on FCA for identifying candidate pharmacological associations to be investigated in clinical trials [11]. In addition, this identification method uses statistical components for filtering significant associations. It was implemented within a prototype system and validated through an experiment on a database from the “Georges Pompidou” hospital.

Another line of work in text mining is concerned with the extraction of pharmacogenomics relationships from texts. A large amount of biomedical knowledge lies in texts embedded in published articles, clinical files or biomedical public databases. For building operational knowledge bases from these textual sources, it is important to capture and formalize this knowledge. Here, relationships (also known as facts or events in the NLP literature) between biological entities represent elementary but interesting and reusable knowledge units. In [4], we propose a method based on syntactic parsing for extracting rich semantic relationships between pairs of entities co-occurring in a single sentence. The method was applied in pharmacogenomics (the study of the impact of individual genomic variation on drug responses) and we obtained a resource encoded in RDF that summarizes pharmacogenomics relationships mentioned in roughly 17 million Medline abstracts. This resource appears to be of major interest since it is used to guide human curation of biomedical databases, and to derive new knowledge about drug-drug interactions [102].

KDDK in Chemical Reaction databases

The mining of chemical reaction databases is an important task for at least two reasons:

  1. the challenge represented by this task regarding KDDK,
  2. the industrial needs that can be met whenever substantial results are obtained.

Chemical reactions are complex data that may be modeled as undirected labeled graphs. They are the main elements on which synthesis in organic chemistry relies, and synthesis (and thus chemical reaction databases) is of prime importance in chemistry, but also in biology, drug design, and pharmacology. From a problem-solving point of view, synthesis in organic chemistry must be considered at two main levels of abstraction: a strategic level where general synthesis methods are involved (a kind of meta-knowledge) and a tactical level where specific chemical reactions are applied. An objective for improving computer-based synthesis in organic chemistry is to discover general synthesis methods from currently available chemical reaction databases for designing generic and reusable synthesis plans. Graph-mining methods have been successfully used for the discovery of general synthesis methods in collaboration with chemists and in accordance with the needs of the chemical industry [8].


  • [1] - Z.Assaghir, Analyse formelle de concepts et fusion d'informations : application à l'estimation et au contrôle d'incertitude des indicateurs agri-environnementaux, PhD Thesis, Institut National Polytechnique de Lorraine - INPL, November 2010,

  • [2] - F.Badra, Extraction de connaissances d'adaptation en raisonnement à partir de cas, PhD Thesis, Université Henri Poincaré - Nancy I, November 2009,

  • [3] - R.Bendaoud, Analyses formelle et relationnelle de concepts pour la construction d'ontologies de domaines à partir de ressources textuelles hétérogènes, PhD Thesis, Université Henri Poincaré - Nancy I, July 2009,

  • [4] - M.Chavent, Vers une nouvelle stratégie pour l'assemblage interactif de macromolécules, PhD Thesis, Université Henri Poincaré - Nancy I, January 2009,

  • [5] - A.Coulet, Construction et utilisation d'une base de connaissances pharmacogénomique pour l'intégration de données et la découverte de connaissances, PhD Thesis, Université Henri Poincaré - Nancy I, October 2008,

  • [6] - N.Jay, Découverte et représentation des trajectoires de soins par analyse formelle de concepts, PhD Thesis, Université Henri Poincaré - Nancy I, October 2008,

  • [7] - M.Kaytoue, Traitement de données numériques par analyse formelle de concepts et structures de patrons, PhD Thesis, Université Henri Poincaré - Nancy I, April 2011,

  • [8] - N.Messai, Analyse de concepts formels guidée par des connaissances de domaine : Application à la découverte de ressources génomiques sur le Web, PhD Thesis, Université Henri Poincaré - Nancy I, March 2009,

  • [9] - F.Pennerath, Méthodes d'extraction de connaissances à partir de données modélisables par des graphes. Application à des problèmes de synthèse organique., PhD Thesis, Université Henri Poincaré - Nancy I, July 2009,

  • [10] - J.Lieber, Contributions à la conception de systèmes de raisonnement à partir de cas, HDR Thesis, Université Henri Poincaré - Nancy I, January 2008,

  • [11] - D.Ritchie, Algorithmes Haute-Performance pour la Reconnaissance de Formes Moléculaires, HDR Thesis, Université Henri Poincaré - Nancy I, April 2011,

  • [12] - Y.Asses, V.Leroux, S.Tairi-Kellou, R.Dono, F.Maina, B.Maigret, Analysis of c-Met Kinase Domain Complexes: A New Specific Catalytic Site Receptor Model for Defining Binding Modes of ATP-Competitive Ligands, Chemical Biology & Drug Design 74, 6, 2009, p.560--570,

  • [13] - A.Beautrait, A.S. Karaboga, M.Souchet, B.Maigret, Cluster Induced fit in liver X receptor beta: a molecular dynamics-based investigation, Proteins Structure Function and Bioinformatics 72, 3, 2008, p.873--882,

  • [14] - A.Beautrait, V.Leroux, M.Chavent, L.Ghemtio, M.-D. Devignes, M.Smail-Tabbone, W.Cai, X.Shao, G.Moreau, P.Bladon, J.Yao, B.Maigret, Multiple-step virtual screening using VSM-G: overview and validation of fast geometrical matching enrichment., Journal of Molecular Modeling 14, 2, 2008, p.135--148,

  • [15] - S.Benabderrahmane, M.Smaïl-Tabbone, O.Poch, A.Napoli, M.-D. Devignes, IntelliGO: a new vector-based semantic similarity measure including annotation origin, BMC Bioinformatics 11, 1, December 2010, p.588,

  • [16] - A.Bertaux, F.LeBer, A.Braud, M.Trémolières, Mining Complex Hydrobiological Data with Galois Lattices, International Journal of Computing & Information Sciences 7, 2, 2010, p.63--77,

  • [17] - C.Bonnon, C.Bel, L.Goutebroze, B.Maigret, J.-A. Girault, C.Faivre-Sarrailh, PGY Repeats and N-Glycans Govern the Trafficking of Paranodin and Its Selective Association with Contactin and Neurofascin-155, Molecular Biology of the Cell 18, 1, 2007, p.229--241,

  • [18] - C.Brassac, S.Lardon, F.LeBer, L.Mondada, P.-L. Osty, Analyse de l'émergence de connaissances au cours d'un processus collectif. Re-catégorisations, reformulations, stabilisations, Revue d'Anthropologie des Connaissances 2, 2, 2008, p.267--286,

  • [19] - W.Cai, J.Xu, X.Shao, V.Leroux, A.Beautrait, B.Maigret, SHEF: a vHTS geometrical filter using coefficients of spherical harmonic molecular surfaces, Journal of Molecular Modeling 14, 5, 2008, p.393--401,

  • [20] - A.Carrieri, V.I. Pérez-Nueno, A.Fano, C.Pistone, D.Ritchie, J.Teixidó, Biological Profiling of Anti-HIV Agents and Insight into CCR5 Antagonist Binding Using in silico Techniques, ChemMedChem 4, 7, June 2009, p.1153--1163,

  • [21] - M.Chavent, B.Lévy, B.Maigret, MetaMol: High-quality visualization of molecular skin surface, Journal of Molecular Graphics and Modelling 27, 2, 2008, p.209--216,

  • [22] - C.Claperon, I.Banegas-Font, X.Iturrioz, R.Rozenfeld, B.Maigret, C.Llorens-Cortes, Identification of threonine 348 as a residue involved in aminopeptidase A substrate specificity., The Journal of Biological Chemistry 284, 16, April 2009, p.10618--26,

  • [23] - C.Claperon, R.Rozenfeld, X.Iturrioz, N.Inguimbert, M.Okada, B.Roques, B.Maigret, C.Llorens-Cortes, Asp218 participates with Asp213 to bind a Ca2+ atom into the S1 subsite of aminopeptidase A: a key element for substrate specificity., Biochemical Journal 416, 1, November 2008, p.37--46,

  • [24] - C.Claperon, R.Rozenfeld, X.Iturrioz, N.Inguimbert, M.Okada, B.Roques, B.Maigret, C.Llorens-Cortes, Contribution of molecular modeling and site-directed mutagenesis to the identification of threonine 348 as a residue involved in aminopeptidase a substrate specificity., The Journal of Biological Chemistry, 2008,

  • [25] - A.Coulet, Y.Garten, M.Dumontier, R.B. Altman, M.Musen, N.H. Shah, Integration and publication of heterogeneous text-mined relationships on the Semantic Web, Journal of Biomedical Semantics 2, S2, May 2011, p.S10,

  • [26] - A.Coulet, M.Smail-Tabbone, P.Benlian, A.Napoli, M.-D. Devignes, Ontology-guided data preparation for discovering genotype-phenotype relationships, BMC Bioinformatics 9, Suppl 4, 2008, p.S3,

  • [27] - A.Coulet, M.Smaïl-Tabbone, A.Napoli, M.-D. Devignes, Ontology-based knowledge discovery in pharmacogenomics., Advances in experimental medicine and biology 696, 2011, p.357--66,

  • [28] - M.Crampes, J.Oliveira-Kumar, S.Ranwez, J.Villerd, Visualizing Social Photos on a Hasse Diagram for Eliciting Relations and Indexing New Photos, IEEE Computer Graphics and Applications 15, 6, November 2009, p.985--992,

  • [29] - E.DeOliveira, C.Humeau, L.Chebil, E.Maia, F.Dehez, B.Maigret, M.Ghoul, J.-M. Engasser, A molecular modelling study to rationalize the regioselectivity in acylation of flavonoïd glycosides catalysed by Candida antartica lipase B, Journal of Molecular Catalysis B Enzymatic 59, 1-3, 2009, p.96--105,

  • [30] - N.Déliot, M.Chavent, C.Nourry, P.Lécine, C.Arnaud, A.Hermant, B.Maigret, J.Borg, Biochemical studies and Molecular Dynamics Simulations of Smad3-Erbin interaction identify a non-classical Erbin PDZ binding, Biochemical and Biophysical Research Communications 378, 3, 2009, p.360--365,

  • [31] - M.-D. Devignes, P.Franiatte, N.Messai, E.Bresso, A.Napoli, M.Smaïl-Tabbone, BioRegistry: Automatic extraction of metadata for biological database retrieval and discovery, International Journal of Metadata Semantics and Ontologies 5, 3, 2010, p.184--193,

  • [32] - C.Eng, C.Asthana, B.Aigle, S.Hergalant, J.-F. Mari, P.Leblond, A new data mining approach for the detection of bacterial promoters combining stochastic and combinatorial methods, Journal of Computational Biology 16, 9, September 2009, p.1211--1225,

  • [33] - C.Eng, A.Thibessard, M.Danielsen, T.B. Rasmussen, J.-F. Mari, P.Leblond, In silico prediction of horizontal gene transfer in Streptococcus thermophilus, Archives of Microbiology 193, 4, January 2011, p.287--297,

  • [34] - A.Estacio-Moreno, Y.Toussaint, C.Bousquet, Mining for adverse drug events with formal concept analysis., Studies in health technology and informatics 136, 2008, p.803--808,

  • [35] - S.Ferraresso, H.Kuhl, M.Milan, D.W. Ritchie, C.J. Secombes, R.Reinhardt, L.Bargelloni, Identification and characterisation of a novel immune-type receptor (NITR) gene cluster in the European sea bass, Dicentrarchus labrax, reveals recurrent gene expansion and diversification by positive selection, Immunogenetics 61, 11-12, October 2009, p.773--788,

  • [36] - N.Floquet, P.Durand, B.Maigret, B.Badet, M.-A. Badet-Denisot, D.Perahia, Collective motions in glucosamine-6-phosphate synthase: influence of ligand binding and role in ammonia channelling and opening of the fructose-6-phosphate binding site., Journal of Molecular Biology 385, 2, January 2009, p.653--64,

  • [37] - N.Floquet, S.Mouilleron, R.Daher, B.Maigret, B.Badet, M.-A. Badet-Denisot, Ammonia channeling in bacterial glucosamine-6-phosphate synthase (Glms): molecular dynamics simulations and kinetic studies of protein mutants, FEBS Letters 581, 16, 2007, p.2981--2987,

  • [38] - N.Floquet, C.Richez, P.Durand, B.Maigret, B.Badet, M.-A. Badet-Denisot, Discovering new inhibitors of bacterial glucosamine-6P synthase (GlmS) by docking simulations, Bioorganic & Medicinal Chemistry Letters 17, 7, 2007, p.1966--1970,

  • [39] - M.Foucaud, E.Archer-Lahlou, E.Marco, I.G. Tikhonova, B.Maigret, C.Escrieut, I.Langer, D.Fourmy, Insights into the binding and activation sites of the receptors for cholecystokinin and gastrin, Regulatory Peptides 145, 1-3, 2008, p.17--23,

  • [40] - L.Ghemtio, M.-D. Devignes, M.Smaïl-Tabbone, M.Souchet, V.Leroux, B.Maigret, Comparison of three preprocessing filters efficiency in virtual screening: identification of new putative LXRbeta regulators as a test case, Journal of chemical information and modeling 50, 5, May 2010, p.701--715,

  • [41] - L.Ghemtio, E.Jeannot, B.Maigret, Efficiency of a hierarchical protocol for highthroughput structure-based virtual screening on Grid5000 cluster grid, Open Access Bioinformatics 2, May 2010, p.41--53,

  • [42] - M.Huchard, M.HaceneRouane, C.Roume, P.Valtchev, Relational Concept Discovery in Structured Datasets, Annals of Mathematics and Artificial Intelligence 49, 1/4, April 2007, p.39--76,

  • [43] - X.Iturrioz, R.Alvear-Perez, N.DeMota, C.Franchet, F.Guillier, V.Leroux, H.Dabire, M.LeJouan, H.Chabane, R.Gerbier, D.Bonnet, A.Berdeaux, B.Maigret, J.-L. Galzi, M.Hibert, C.Llorens-Cortes, Identification and pharmacological properties of E339-3D6, the first nonpeptidic apelin receptor agonist, The FASEB Journal 24, 5, May 2010, p.1506--1517,

  • [44] - X.Iturrioz, S.ElMessari, N.DeMota, C.Fassot, R.Alvear-Perez, B.Maigret, C.Llorens-Cortes, Functional dissociation between apelin receptor signaling and endocytosis: implications for the effects of apelin on arterial blood pressure, Archives des maladies du coeur et des vaisseaux 100, 8, August 2007, p.704--8,

  • [45] - X.Iturrioz, R.Gerbier, V.Leroux, R.Alvear-Perez, B.Maigret, C.Llorens-Cortes, By interacting with the C-terminal Phe of apelin, Phe255 and Trp259 in helix VI of the apelin receptor are critical for internalization., The Journal of Biological Chemistry 285, 42, October 2010, p.32627--32637,

  • [46] - M.Kaytoue, S.O. Kuznetsov, A.Napoli, S.Duplessis, Mining gene expression data with pattern structures in formal concept analysis, Information Sciences 181, 10, August 2010, p.1989--2001,

  • [47] - A.Khalfa, W.Treptow, B.Maigret, M.Tarek, Self assembly of peptides near or within membranes using coarse grained MD simulations, Chemical Physics 358, 1-2, 2009, p.161--170,

  • [49] - E.G. Lazrak, J.-F. Mari, B.Marc, Landscape regularity modelling for environmental challenges in agriculture, Landscape Ecology 25, 2, September 2009, p.169--183,

  • [50] - F.LeBer, C.Brassac, Étude longitudinale d'une procédure de modélisation de connaissances en matière de gestion du territoire agricole, Revue d'Anthropologie des Connaissances 2, 2, 2008, p.151--168,
