Procesamiento de Lenguaje Natural
ISSN (Print) 1135-5948 - ISSN (Online) 1989-7553
  • Issue 69

    • Pages: 1 - 307
      PubDate: 2022-09-13
      Issue No: Vol. 69 (2022)
  • A computational psycholinguistic evaluation of the syntactic abilities of
           Galician BERT models at the interface of dependency resolution and
           training time

    • Authors: Iria de-Dios-Flores, Marcos Garcia
      Pages: 15 - 26
      Abstract: This paper explores the ability of Transformer models to capture subject-verb and noun-adjective agreement dependencies in Galician. We conduct a series of word prediction experiments in which we manipulate dependency length together with the presence of an attractor noun that acts as a lure. First, we evaluate the overall performance of the existing monolingual and multilingual models for Galician. Second, to observe the effects of the training process, we compare the performance of two monolingual BERT models at different training points. We also release their checkpoints and propose an alternative evaluation metric. Our results confirm previous findings from similar work that uses the agreement prediction task and provide insights into the number of training steps a Transformer model requires to solve long-distance dependencies.
      PubDate: 2022-09-13
      Issue No: Vol. 69 (2022)
  • Information fusion for mental disorders detection: multimodal BERT against
           fusioning multiple BERTs

    • Authors: Mario Ezra Aragón, A. Pastor López-Monroy, Luis C. González-Gurrola, Manuel Montes-y-Gómez
      Pages: 27 - 38
      Abstract: Given the increasing number of modalities in modern classification problems, a multimodal BERT transformer (MMBT) was recently proposed. The timely detection of mental disorders in social media users offers an interesting opportunity to evaluate the effectiveness of such a model. For this problem, a multi-channel perspective involves extracting different types of information from each user post, such as thematic, emotional and stylistic content. This study evaluates the suitability of the apparently ad hoc MMBT for this problem. We also evaluate whether regular BERT models can be combined or fused so that they remain competitive in a multi-channel setting. For the evaluation, we use recent public data sets for three important mental disorders: depression, anorexia, and self-harm. Results suggest that BERT models can each learn a data representation on their own that can later be fused, boosting classification performance by at least 5% in F1 measure and even surpassing the MMBT.
      PubDate: 2022-09-13
      Issue No: Vol. 69 (2022)
  • Un redactor asistido para adaptar textos administrativos a lenguaje claro

    • Authors: Iria da Cunha
      Pages: 39 - 49
      Abstract: Plain language advocates that texts addressed to citizens be written in simpler, more transparent language, so that they can easily understand the message being conveyed. In this context, our objective is to develop an assisted writing tool for Spanish that helps public administration staff write the texts they address to citizens in plain language. The system, which is free and available online, integrates different Natural Language Processing (NLP) tools to detect, in the texts written by users, the linguistic features that conflict with plain language recommendations. It also offers users information for making their text simpler. The algorithms were evaluated on a manually annotated corpus using precision and recall. The results are very positive, although they also reveal some aspects that can be improved in the future.
      PubDate: 2022-09-13
      Issue No: Vol. 69 (2022)
  • Exploiting user-frequency information for mining regionalisms in
           Argentinian Spanish from Twitter

    • Authors: Juan Manuel Pérez, Damián E. Aleman, Santiago N. Kalinowski, Agustín Gravano
      Pages: 51 - 62
      Abstract: The task of detecting regionalisms (expressions or words used in certain regions) has traditionally relied on questionnaires and surveys, depending heavily on the expertise and intuition of the surveyor. The emergence of social media and microblogging services has produced an unprecedented wealth of content (mainly informal text generated by users), opening new opportunities for linguists to extend their studies of language variation. Previous work on the automatic detection of regionalisms depended mostly on word frequencies. In this work, we present a novel metric based on Information Theory that incorporates user frequency. We tested this metric on a corpus of Argentinian Spanish tweets in two ways: via manual annotation of the relevance of the retrieved terms, and as a feature selection method for geolocation of users. In both cases, our metric outperformed other techniques based on word frequency, suggesting that the number of users who use a word is an informative feature. This tool has helped lexicographers discover several unregistered words of Argentinian Spanish, as well as different meanings assigned to registered words.
      PubDate: 2022-09-13
      Issue No: Vol. 69 (2022)
  • Reflexive pronouns in Spanish Universal Dependencies: from annotation to
           automatic morphosyntactic analysis

    • Authors: Jasper Degraeuwe, Patrick Goethals
      Pages: 63 - 72
      Abstract: In this follow-up to Degraeuwe and Goethals (2020), we present the annotation scheme used to reannotate the 7298 potentially reflexive pronouns included in the Universal Dependencies Spanish AnCora v2.6 treebank, which resulted in significant modifications for the “Case” feature (100% changed) and dependency relations (87% changed). Next, we evaluate the performance of spaCy v3.2.2 and Stanza v1.3.0 (both trained on AnCora v2.8, and thus based on our reannotations) on the AnCora v2.8 test set, which yielded weighted F1 scores up to 0.88 and 0.98 for the “Case” and “Reflex” features, respectively, and up to 0.71 for the dependency relations. Finally, the error analysis of the spaCy results underlines the (generalisation) potential of the model, but also reveals some of the remaining issues in the automatic morphosyntactic analysis of reflexive pronouns in Spanish, such as determining whether expletive relations denote an impersonal, passive or inherently reflexive use.
      PubDate: 2022-09-13
      Issue No: Vol. 69 (2022)
  • Multi-label Text Classification for Public Procurement in Spanish

    • Authors: Maria Navas-Loro, Daniel Garijo, Oscar Corcho
      Pages: 73 - 82
      Abstract: Public procurement accounts for 14% of the annual budget of the different governments of the European Union. In Europe, contracting processes are classified using Common Procurement Vocabulary codes (CPVs), a taxonomy designed to facilitate statistical reporting, search and the creation of alerts that can be used by potential bidders. CPVs are commonly assigned manually by the public employees in charge of contracting processes. However, CPV classification is not a trivial task, as there are more than 9,000 different CPV categories, which are often assigned following heterogeneous criteria. In this paper we present a CPV classifier that takes the textual description of the contracting process as input and assigns CPVs from the 45 top-level CPV categories. We work only with texts in Spanish, although our approach may be easily extended to other languages. Our results improve on the state of the art (10% F1-score improvement) and are available online.
      PubDate: 2022-09-13
      Issue No: Vol. 69 (2022)
  • Selección de colocaciones académicas en español a través de un filtro
           de interdisciplinariedad

    • Authors: Eleonora Guzzi, Margarita Alonso-Ramos
      Pages: 83 - 94
      Abstract: This article proposes a methodology for compiling a list of noun-based academic collocations to be integrated into a lexical tool (Alonso-Ramos, García-Salido and Garcia, 2017). To this end, we establish a filter that measures the interdisciplinarity of the academic nouns from which the collocations are extracted (García-Salido, 2021), in order to keep nouns that are frequent and well distributed across different academic disciplines, and to discard those that belong to terminology or are more characteristic of general language. We use three criteria: (1) IDF (Jones, 1972); (2) analysis of the distribution of collocations; (3) comparison with English academic vocabulary lists. The results show that these criteria are useful for identifying the prototypical nouns of academic discourse and make it possible to filter the list of academic collocations. However, the problem of how to handle semantic disambiguation in relation to the different disciplines remains.
      PubDate: 2022-09-13
      Issue No: Vol. 69 (2022)
  • Compilación del corpus académico de noveles en euskera HARTAeus y su
           explotación para el estudio de la fraseología académica

    • Authors: María Jesús Aranzabe, Antton Gurrutxaga, Igone Zabala
      Pages: 95 - 103
      Abstract: A corpus of academic writing by novices has been compiled for Basque, comparable to the HARTA-noveles corpus for Spanish. From this corpus, an academic vocabulary list for Basque has been extracted, along with separate lists of collocations and formulas, which have been assigned discursive functions. The ultimate goal of the HARTAes-vas project, within which this work is framed, is to design an academic writing aid for both languages that focuses on academic lexical combinations and integrates a dictionary and a corpus.
      PubDate: 2022-09-13
      Issue No: Vol. 69 (2022)
  • Extraction and Semantic Representation of Domain-Specific Relations in
           Spanish Labour Law

    • Authors: Artem Revenko, Patricia Martín-Chozas
      Pages: 105 - 116
      Abstract: Despite freedom of information and the development of various open data repositories, access to legal information for general audiences remains hindered by the difficulty of understanding and interpreting it. In this paper we aim to employ modern language models to extract the most important information from legal documents and structure it in a knowledge graph. This knowledge graph can later be used to retrieve information and answer legal questions. To evaluate the performance of different models, we formalize the task as event extraction and manually annotate 133 instances. We evaluate two models: GRIT and Text2Event. The latter achieves the better results, with an F1 score of approximately 0.8 for identifying legal classes and 0.5 for identifying roles in legal relations. We demonstrate how the resulting legal knowledge graph could be exploited with two example use cases. Finally, we annotate the whole Workers’ Statute using the fine-tuned Text2Event model and publish the results in an open repository.
      PubDate: 2022-09-13
      Issue No: Vol. 69 (2022)
