IE & IR - Information Extraction and Information Retrieval

  • INFORMATION EXTRACTION and INFORMATION RETRIEVAL 

    INFORMATION EXTRACTION: Automatically extracting structured information from unstructured and/or semi-structured machine-readable documents.

    In information extraction (IE), the work focuses on using machine learning techniques to overcome the main drawbacks of applying IE and its inherent dependence on a particular domain by reducing the need for supervision. Specifically, work is under way on pattern acquisition methods for IE in restricted and unrestricted domains (for both structured and unstructured texts), on document clustering techniques (a preliminary step that the unsupervised learning of IE patterns in open domains may require), and on robust methods for extracting information from different media (both texts and transcriptions of speech).

    INFORMATION RETRIEVAL: Searching for information within multimedia documents.

    Information retrieval (IR), of both texts and multimedia resources, is an important part of the processes of collection indexing and document or passage retrieval (based on previously indexed collections or by means of Internet wrappers).
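
    As an illustration of collection indexing followed by passage retrieval, the following is a minimal sketch in Python and not the TALP Center's own system. The function names (tokenize, build_index, retrieve), the sample passages and the ranking by number of shared query terms are all assumptions made for this example.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"\w+", text.lower())


def build_index(passages):
    """Build an inverted index: token -> set of passage ids containing it."""
    index = defaultdict(set)
    for pid, text in passages.items():
        for tok in tokenize(text):
            index[tok].add(pid)
    return index


def retrieve(index, query):
    """Rank passages by how many distinct query tokens they contain."""
    scores = defaultdict(int)
    for tok in set(tokenize(query)):
        for pid in index.get(tok, ()):
            scores[pid] += 1
    # Highest score first; ties are broken by passage id for determinism.
    return sorted(scores, key=lambda pid: (-scores[pid], pid))


# Hypothetical toy collection, indexed once and then queried.
passages = {
    "p1": "Barcelona is the capital of Catalonia.",
    "p2": "Madrid is the capital of Spain.",
    "p3": "The Ebro is a river in Spain.",
}
index = build_index(passages)
ranked = retrieve(index, "capital of Spain")
print(ranked)
```

    Real systems usually replace the overlap count with a weighted scheme such as TF-IDF or BM25; the separation between an indexing step and a retrieval step stays the same.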

    The TALP Center works on question answering (Q&A) tasks. As a result, a multilingual question answering system was developed for the TREC competitions, in the open-domain category for English, and for the CLEF competition, also in the open-domain category but for Spanish. A geography demonstrator in Spanish was also designed for the ALIADO project, for a restricted-domain environment, and the center took part in the first GEOCLEF competition for the same domain in English. In addition, the Q&A system was extended to handle spoken questions about facts, lists, definitions, information and biographies, and work is under way to extend the system’s multilingual capabilities to Catalan.

    The automatic production of summaries is also tackled at various levels: monolingual, multilingual and cross-lingual summaries; mono- and multi-document summaries; text and speech summaries; extract and abstract summaries; general summaries; and guided summaries based on the questions, profiles or interests of users.

    Document analysis involves recognizing and extracting written text and pre-processing it (lexical and sentence segmentation, morpho-syntactic analysis and disambiguation, the detection and classification of noun phrases, shallow and deep syntactic analysis, semantic analysis, resolution of cross-references, etc.).
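
    The first two pre-processing steps, sentence and lexical segmentation, can be sketched as follows. This is a deliberately naive, regex-based illustration under stated assumptions: the function names (split_sentences, split_tokens) and the sample text are invented for the example, and the rules ignore abbreviations, numbers and other cases that a production segmenter has to handle.

```python
import re


def split_sentences(text):
    """Naive sentence segmentation: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def split_tokens(sentence):
    """Naive lexical segmentation: runs of word characters, or single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", sentence)


# Hypothetical input text.
text = "The system answers questions. It also produces summaries!"
sentences = split_sentences(text)
tokens = [split_tokens(s) for s in sentences]

for sent, toks in zip(sentences, tokens):
    print(sent, "->", toks)
```

    Later stages such as morpho-syntactic analysis and parsing would take these token lists as input.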

    The tasks that make up this line of research are:

    • Classification of documents and passages
    • Clustering of documents
    • Detection of subject matter in documents and collections
    • Detection of links to and in documents
    • Measurement of distances (semantic or distributional) between language units, etc.
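
    As one concrete example of the last task, a simple distance between two text units is the cosine distance between their term-frequency vectors. The sketch below is only an illustration under stated assumptions: the helper names (tf_vector, cosine_distance) and the sample strings are invented here, and the source does not say which distance measures are actually used.

```python
import math
import re
from collections import Counter


def tf_vector(text):
    """Represent a text unit as a bag of lowercase word counts."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two count vectors (1.0 if either is empty)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


# Hypothetical text units.
d_same = cosine_distance(tf_vector("question answering system"),
                         tf_vector("question answering system"))
d_partial = cosine_distance(tf_vector("question answering system"),
                            tf_vector("question answering in Spanish"))
d_disjoint = cosine_distance(tf_vector("question answering"),
                             tf_vector("speech summaries"))
print(d_same, d_partial, d_disjoint)
```

    Identical units come out at a distance of about 0, units with no words in common at 1, and partially overlapping units in between. The same pairwise distances could serve as input to a document clustering step.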