This article presents a hybrid methodology for building a multilingual corpus designed to support the study of emerging concepts in the humanities and social sciences (HSS), illustrated here through the case of ``non-technological innovation''. The corpus draws on two complementary sources: (1) textual content automatically extracted from company websites and cleaned for French and English, and (2) annual reports collected and automatically filtered according to document-level criteria (year, format, duplication). The processing pipeline includes automatic language detection, filtering of irrelevant content, extraction of relevant segments, and enrichment with structural metadata. From this initial corpus, an English-language dataset is derived for machine learning purposes. For each occurrence of a term from an expert lexicon, a contextual block of five sentences is extracted: the sentence containing the term plus the two preceding and the two following sentences. Each occurrence is annotated with the thematic category associated with the term, yielding data suitable for supervised classification tasks. The approach results in a reproducible and extensible resource, suitable both for analyzing lexical variability around emerging concepts and for generating datasets dedicated to natural language processing applications.
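As an illustration of the contextual-block step described above, the following minimal Python sketch shows one way to extract, for each occurrence of a lexicon term, the sentence containing it together with its two neighbours on each side, and to label the block with the term's thematic category. The lexicon entries, function names, and the naive sentence splitter are hypothetical placeholders, not the actual expert lexicon or tooling used in the pipeline.

\begin{verbatim}
import re

# Hypothetical expert lexicon: term -> thematic category (illustrative entries only).
LEXICON = {
    "organizational innovation": "organizational",
    "marketing innovation": "marketing",
}

def split_sentences(text):
    """Naive sentence splitter; a real pipeline would likely rely on spaCy or NLTK."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def extract_contextual_blocks(text, lexicon=LEXICON, window=2):
    """For each lexicon-term occurrence, return the sentence containing the term
    plus `window` sentences before and after it (five sentences in total when
    window=2), annotated with the term's thematic category."""
    sentences = split_sentences(text)
    blocks = []
    for i, sentence in enumerate(sentences):
        lowered = sentence.lower()
        for term, category in lexicon.items():
            if term in lowered:
                start = max(0, i - window)
                end = min(len(sentences), i + window + 1)
                blocks.append({
                    "term": term,
                    "category": category,
                    "context": " ".join(sentences[start:end]),
                })
    return blocks

if __name__ == "__main__":
    sample = ("Our firm invests heavily in change. Last year we redesigned workflows. "
              "This organizational innovation reduced delays. Teams adapted quickly. "
              "Customers noticed the difference.")
    for block in extract_contextual_blocks(sample):
        print(block["category"], "->", block["context"])
\end{verbatim}

In the full pipeline, this step would presumably follow language detection, filtering, and deduplication, with each resulting block serialized alongside its structural metadata for supervised classification.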