SIGNHPC: Nordic High-Performance Computing for Natural Language Processing

Language technologies are inherently data-oriented and computational. Over the past decade or so, state-of-the-art research has often assumed the use of massive volumes of natural language data, for example in model training or in the analysis of web-scale corpora. Natural language processing (NLP) researchers in Northern Europe have started to explore public high-performance computing (HPC) facilities, i.e. large, typically national, storage and supercomputing services. To date, there are relatively few such use cases, and there is hardly any communication among HPC users across research group boundaries.

Modern and extensive HPC e-infrastructures are available in Northern Europe, but NLP does not yet have much of a tradition of HPC utilization. Moreover, NLP use patterns differ somewhat from those in ‘classic’ HPC disciplines: the focus is on throughput computing and some distributed-memory parallelization, but often with high memory demands and random access to, or aggregation over, large on-disk data sets. MapReduce-style frameworks like Hadoop and Apache Spark are gaining popularity in NLP research but are not currently supported by national HPC facilities. In terms of data set characteristics, general machine learning techniques are widely applied to NLP, but modeling natural language often leads to very high-dimensional parameter spaces and proportionally small numbers of training instances (even in very large training data sets).

Besides growing demand for compute cycles, large-scale language analysis can call for storage infrastructures that exceed department- or university-level facilities. The ‘pure’ textual content of a 2014 snapshot from the Common Crawl alone, for example, comprises several terabytes of compressed data. Research on improved content extraction (from the ‘raw’ documents, containing mark-up), in turn, would presuppose at least one hundred terabytes of available storage.

This SIG grew out of an informal community meeting in late 2014, which led to an ongoing dialogue with the Nordic e-Infrastructure Collaboration (NeIC). The SIG seeks to (a) facilitate knowledge (and possibly tool) exchange among NLP researchers in Northern Europe who want to make good use of supercomputing facilities and (b) provide a platform for communication between HPC-active NLP researchers and the providers of national and Nordic e-infrastructures. The SIG maintains a low-traffic mailing list, to which interested parties are encouraged to self-subscribe, and will hold regular but infrequent meetings (for example co-located with either the NoDaLiDa or NeIC conferences). Depending on general interest, the SIG may also help organize HPC-related training events, e.g. a summer school in large-scale scientific programming (with applications to NLP).

Filip Ginter, University of Turku, Finland
Joakim Nivre, Uppsala University, Sweden
Stephan Oepen (chair), University of Oslo, Norway
Anders Søgaard, University of Copenhagen, Denmark
Jörg Tiedemann, University of Helsinki, Finland

-- StephanOepen - 2015-01-19