Primary vertex reconstruction using GPUs for the upgrade of the Inner Tracking System of the ALICE experiment at LHC

Year
2020
Degree
PhD
Author
Concas, Matteo
Mail
matteo.concas@cern.ch
Institution
CERN
Abstract

In 2021 the Large Hadron Collider at CERN will start its third data-taking period, the so-called Run 3, which will last until 2023. During the Long Shutdown 2, the pause between Run 2 and Run 3, the four major experiments at the accelerator (ALICE, ATLAS, CMS, LHCb) are undergoing major upgrades and apparatus maintenance. In particular, ALICE (A Large Ion Collider Experiment) is performing a major upgrade of several of its detectors at the hardware level, and a substantial effort is dedicated to the online and offline software to improve data acquisition and processing. The ALICE experiment aims at studying the Quark-Gluon Plasma, a state of matter in which quarks and gluons are not confined into hadrons and which can be inspected and characterised using high-energy ion collisions. Plans for ALICE in Run 3 include the collection of 10 nb⁻¹ of Pb–Pb collisions at 5.5 TeV, with an instantaneous luminosity up to 6 × 10²⁷ cm⁻² s⁻¹ and a collision rate of 50 kHz, corresponding to a total of 10¹¹ recorded interactions. This is the minimum rate required to address the proposed physics programme, which focuses on rare probes at both low and high momentum. For the proton collision programme, the experimental apparatus will acquire data at interaction rates up to 400 kHz at 13 TeV, to obtain a meaningful reference dataset, instrumental for the heavy-ion physics programme. ALICE will also move to a brand new paradigm for the data acquisition of many sub-detectors, the so-called continuous readout. This approach foresees a trigger-less mode in which data are acquired continuously; furthermore, data are processed and reconstructed as a stream, online, during the data taking. Such conditions impose stringent constraints on the detector performance in terms of acquisition rate and pose challenges for the output data bandwidth, forcing a general upgrade of the hardware and software of the experiment. To comply with this ambitious scenario, many interventions are being carried out on the sub-detectors of the ALICE apparatus, mostly dedicated to the readout electronics and to the sensor readout capabilities, to provide a faster acquisition rate compared to the Run 2 working conditions. Two brand new detectors have also been designed and constructed to improve the precision of the physics measurements: the Muon Forward Tracker (MFT) and a completely renovated Inner Tracking System (ITS), the sub-detector dedicated to the reconstruction of the interaction vertex, placed in the innermost section of the detector, in the central-barrel volume of the ALICE apparatus. The latter is a cylindrical detector built from seven concentric layers of silicon pixel chips adopting a Monolithic Active Pixel Sensor (MAPS) layout, a new technology for faster and more precise measurements. The upgraded layout is characterised by a finer pixel granularity, an inner barrel closer to the beam pipe with respect to the ITS used during LHC Run 2, and one additional layer, for a total of seven layers. These hardware features translate into a better spatial resolution on the interaction point (the three-dimensional space point where the beams collide) and a better resolution on low-momentum particles and track parameters, useful at the data-analysis level. The continuous readout also sets critical challenges in terms of online data reduction, compression and reconstruction.
It is estimated that the whole online reconstruction workflow will be able to cope with an input data bandwidth of 3.5 TB/s and will write ∼100 GB/s of data to persistent storage. To enable the processing of the continuous-readout data stream, a completely renewed software stack has been developed, which includes parts to be run both online and offline on reconstructed data. The new framework is called Online-Offline (O2). This new platform is designed to be deployed both on large computing clusters, such as the Event Processing Node (EPN) farm, the main computing centre used for online reconstruction, located on-site close to the ALICE experimental apparatus, and on smaller computing resources, down to the single workstation of a scientist. The O2 framework presents a new operational design compared with the former software used in ALICE. It is based on workflows operated by generic logical "devices": computing processes that continuously execute a given routine depending on their designated role. Devices are then connected to each other to create pipelines, or topologies, where they communicate and cooperate to execute more complex tasks. O2 embodies a multi-process paradigm in which each device is represented by a process on a computing resource and the workflows are described by topologies connecting devices following a data-flow model. More in detail, each device is responsible for performing a piece of a larger routine, and the output data are exchanged across devices by means of messages that can be implemented using different technologies, depending on the underlying computing infrastructure. Every type of workflow in O2 is described and implemented using this design, from Monte Carlo simulation to online data reconstruction, and even the analysis framework is being developed using the same core approach. The scheme is flexible and scalable: each entity can be replicated, if required, to cope with specific demands, and the deployment of a workflow can be done on a laptop as well as, in some cases, on a High-Performance Computing cluster. These features are completely transparent to the final user, who will be able to work with both kinds of resources with minimal technical effort. Ultimately, the framework supports heterogeneous architectures, allowing devices to offload payloads to computing accelerators such as Graphics Processing Units (GPUs) and Field-Programmable Gate Arrays (FPGAs), to obtain high-throughput computations with a fully integrated data model to support them. The work in this thesis presents the design, development and implementation of a primary vertex reconstruction algorithm based on ITS-only data, which will be used by ALICE in the online reconstruction phase during Run 3. The estimation of the primary vertex position is instrumental for calibrating detectors, because it provides useful information on the beam position. More importantly, it is a critical input for some online reconstruction processes such as ITS tracking, the process that reconstructs the trajectories of charged particles and uses the vertex as its starting point. The work presented is integrated in the O2 framework and provides both a CPU and a GPU-accelerated parallel version of the algorithm. The algorithm is also able to identify the so-called pile-up of events, when data related to more than one collision vertex are present in the same input dataset, and to provide their positions with a resolution compliant with the physics programme requirements.
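To make the idea concrete, the sketch below shows a deliberately simplified, histogram-based search for vertex candidates along the beam axis (z), in which several peaks above threshold stand in for pile-up vertices. It is only an illustration of the general technique, not the algorithm implemented in O2: all names (findZVertices, ZCandidate, the default cut values) are hypothetical, and the real vertexer operates on tracklets built from ITS clusters with additional selection and fitting steps.

```cpp
// Illustrative sketch only: a simplified 1D histogramming vertexer along the
// beam axis. We assume the z-intercepts of tracklets from the innermost ITS
// layers are already available as input. All names are hypothetical.
#include <cmath>
#include <vector>

struct ZCandidate {
  float z;          // estimated vertex position along the beam axis (cm)
  int   nTracklets; // number of tracklet intercepts contributing to the peak
};

// Fill a histogram of tracklet z-intercepts and return one candidate per local
// maximum above `minContributors`, so that pile-up (several vertices in the
// same input data) can also be spotted.
std::vector<ZCandidate> findZVertices(const std::vector<float>& zIntercepts,
                                      float zRange = 20.f,   // |z| acceptance (cm)
                                      float binWidth = 0.1f, // bin granularity (cm)
                                      int minContributors = 5)
{
  const int nBins = static_cast<int>(2 * zRange / binWidth);
  std::vector<int> histo(nBins, 0);
  for (float z : zIntercepts) {
    if (std::abs(z) < zRange) {
      ++histo[static_cast<int>((z + zRange) / binWidth)];
    }
  }

  std::vector<ZCandidate> candidates;
  for (int i = 1; i + 1 < nBins; ++i) {
    // A local maximum above threshold is taken as a vertex candidate.
    if (histo[i] >= minContributors && histo[i] >= histo[i - 1] && histo[i] > histo[i + 1]) {
      // A weighted mean over the peak bin and its neighbours refines the estimate.
      const int   n = histo[i - 1] + histo[i] + histo[i + 1];
      const float z = ((i - 0.5f) * histo[i - 1] + (i + 0.5f) * histo[i] +
                       (i + 1.5f) * histo[i + 1]) * binWidth / n - zRange;
      candidates.push_back({z, n});
    }
  }
  return candidates;
}
```

Histogramming of this kind is attractive in the continuous-readout context because each intercept is processed independently, which is also what makes the step easy to parallelise on a GPU.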
Concerning the GPU version, two implementations are presented: one using Nvidia GPUs and one using Advanced Micro Devices (AMD) GPUs. The two versions are coded using the two corresponding development frameworks, the Compute Unified Device Architecture (CUDA) for Nvidia and the Heterogeneous-Computing Interface for Portability (HIP) for AMD. Consistency checks among the three implementations are performed, together with performance benchmarks; execution-time measurements are also reported to compare the three implementations. The primary vertex reconstruction is proven to be compliant with the O2 requirements in terms of resolution and time performance. The knowledge acquired in this work will also be used in the future to further extend the domain of GPU-accelerated workflows in the O2 context.
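As an illustration of the portability argument only, the following is a minimal CUDA-style sketch of the histogram-filling step, where each GPU thread handles one tracklet intercept and increments the corresponding bin atomically. It is not the thesis code: kernel and variable names are invented and the real implementation is integrated in the O2 data model. The HIP counterpart is nearly mechanical to obtain, since the HIP runtime API mirrors CUDA (hipMalloc/hipMemcpy/hipMemset in place of the cuda* calls, with the same kernel-launch syntax).

```cuda
// Illustrative CUDA sketch of the histogram-filling step on the GPU.
// A HIP build differs essentially only in the runtime API prefix.
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

__global__ void fillZHistogram(const float* zIntercepts, int n,
                               int* histo, float zRange, float binWidth)
{
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i >= n) return;
  float z = zIntercepts[i];
  if (fabsf(z) < zRange) {
    int bin = static_cast<int>((z + zRange) / binWidth);
    atomicAdd(&histo[bin], 1); // concurrent increments from many threads
  }
}

int main()
{
  const float zRange = 20.f, binWidth = 0.1f;
  const int nBins = static_cast<int>(2 * zRange / binWidth);
  std::vector<float> zIntercepts(10000, 1.2f); // dummy input data
  const int n = static_cast<int>(zIntercepts.size());

  float* dZ = nullptr;
  int* dHisto = nullptr;
  cudaMalloc(&dZ, n * sizeof(float));
  cudaMalloc(&dHisto, nBins * sizeof(int));
  cudaMemcpy(dZ, zIntercepts.data(), n * sizeof(float), cudaMemcpyHostToDevice);
  cudaMemset(dHisto, 0, nBins * sizeof(int));

  const int block = 256;
  const int grid = (n + block - 1) / block;
  fillZHistogram<<<grid, block>>>(dZ, n, dHisto, zRange, binWidth);

  std::vector<int> histo(nBins);
  cudaMemcpy(histo.data(), dHisto, nBins * sizeof(int), cudaMemcpyDeviceToHost);
  printf("entries in bin 212: %d\n", histo[212]); // (1.2 + 20) / 0.1 = 212
  cudaFree(dZ);
  cudaFree(dHisto);
  return 0;
}
```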

Supervisors
Demarchi, Danilo
Report number
CERN-THESIS-2020-422
Date of last update
2023-11-07