uB-VisioGeoloc (2023)

[1] : Imagerie et Vision Artificielle (UR 7535) (Université de Bourgogne)
[2] : Laboratoire d'étude de l'apprentissage et du développement (Université de Bourgogne)
Description :
The proposed dataset is a collection of pedestrian navigation sequences combining visual and spatial information. The sequences cover situations encountered by a pedestrian walking in an urban outdoor environment, such as moving along the sidewalk, navigating through a crowd, or crossing a street when the pedestrian traffic light is green. The acquired data is time-stamped and provides RGB-D images associated with GPS and inertial data (acceleration, rotation). These recordings were acquired by separate processes to avoid delays during capture and to keep the moment of acquisition by the sensor synchronized with the moment of recording on the system. The acquisition took place in the city of Dijon, France, and covers narrow streets, wide avenues, and parks.
Annotations of the RGB-D images are also provided as bounding boxes indicating the position of relevant static or dynamic objects present in a pedestrian area, such as a tree, a bench, or a person.
This pedestrian navigation dataset is intended for the development of a mobility support system for visually impaired people in their daily movements in an outdoor environment. The visual data and localization sequences are used to develop visual processing methods that extract relevant information about obstacles and the current position of the path. Alongside the dataset, a visual-to-auditory substitution method was used to convert each image sequence into corresponding stereophonic sound files, allowing for comparison and evaluation. Synthetic sequences with the same set of information are also provided, based on recordings of a displacement within the 3D model of a real place in Dijon.

More information about the project : https://imvia.u-bourgogne.fr/projet/3d-sound-glasses.html
Disciplines :
computer science, artificial intelligence (engineering science), computer science, software engineering (engineering science), engineering, industrial (engineering science)

General metadata

Data acquisition date : from 28 Mar 2023 ongoing
Data acquisition methods :
  • Observational data :
    The synthetic data was generated using the virtual camera of the Unity game engine and annotated directly, while the real data was captured using an RGB-D camera and annotated using a deep convolutional neural network. This section explains in detail the techniques used to acquire and annotate the videos, and describes the method used to produce the accompanying sound files.

    a. Real data
    The data collection process in the city of Dijon involved a person equipped with a tracking system. The system, mounted on the user's helmet at a height of 1.85 meters, replicated the visual perspective of the person. The elevation of the camera was adjusted with the fixation screw, allowing the camera angle to be fine-tuned. The capture system, as shown in Figure 4, consisted of several components. The primary component was the Intel RealSense D435 RGB-D camera, which captured both color (RGB) and depth (D) information at 30 frames per second as the person navigated. In addition to the camera, the acquisition system included an Adafruit BNO055 Inertial Measurement Unit (IMU) sensor and a GPS antenna, which recorded data at 100 Hz and 10 Hz respectively. An Adafruit MCP2221A UART-to-USB module was also part of the device to handle communication between the IMU and GPS sensors and the laptop. Data acquisition was performed by a C++ program using the RealSense 2 and OpenCV libraries. A multi-threaded approach was used to keep the recording process synchronized with the data acquisition rate. Each sensor had its own dedicated thread running independently of the others, so that no sensor was slowed down by the performance of the other threads.
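
The one-thread-per-sensor design described above can be illustrated with a minimal Python sketch. It is not the authors' C++ program: the `read_fn` callables are hypothetical stand-ins for the real camera, IMU and GPS drivers, and the names `sensor_worker`, `record` and `SENSORS` are assumptions made for this example. Only the sampling rates (30, 100 and 10 Hz) come from the description.

```python
import threading
import time
from queue import Queue


def sensor_worker(name, rate_hz, n_samples, read_fn, out_queue):
    """Poll one sensor at its own rate and time-stamp each sample on capture."""
    period = 1.0 / rate_hz
    for i in range(n_samples):
        sample = read_fn(i)  # hypothetical driver call
        out_queue.put((name, time.monotonic(), sample))
        time.sleep(period)  # a slow sensor never delays the other threads


def record(sensors, n_samples=5):
    """Run one thread per sensor and return the samples grouped by sensor."""
    queue = Queue()
    threads = [
        threading.Thread(
            target=sensor_worker,
            args=(name, rate_hz, n_samples, read_fn, queue),
        )
        for name, (rate_hz, read_fn) in sensors.items()
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()

    log = {name: [] for name in sensors}
    while not queue.empty():
        name, stamp, sample = queue.get()
        log[name].append((stamp, sample))
    return log


# Hypothetical stand-ins for the three devices; rates follow the description.
SENSORS = {
    "camera": (30.0, lambda i: f"frame_{i}"),
    "imu": (100.0, lambda i: (0.0, 0.0, 9.81)),
    "gps": (10.0, lambda i: (47.32, 5.04)),
}

if __name__ == "__main__":
    for name, samples in record(SENSORS).items():
        print(name, len(samples), "samples")
```

Because every sample carries its own capture time, the streams can be aligned afterwards by timestamp instead of forcing the sensors to run in lockstep.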

    b. Synthetic data
    The synthetic environment models Place Darcy, a popular square in the city of Dijon, France. This low-poly virtual model was built from point clouds acquired with a LiDAR (Light Detection And Ranging) scan. The trajectory of the head was recorded by wearing a virtual reality headset (Oculus Quest 2.0) in an empty gymnasium. The 3D model and the head trajectories were reused in the 3ds Max software for optimal graphical rendering. The bounding box of each object was defined as the axis-aligned portion of the screen where the object is visible. To avoid a multiplicity of very small bounding boxes for objects far from the camera but still visible on the screen, we only kept objects that were less than 50 m away from the camera and had a bounding box of more than 50 pixels. The same list of classes as for the real data was used to annotate the relevant elements. The annotation was generated automatically by a C# script running in the Unity software, using the labelling of the 3D virtual objects.
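
The filtering rule above can be sketched as follows. This is an illustrative Python version, not the C# script used in Unity. The `Box` class and both function names are hypothetical, and the "more than 50 pixels" criterion is read here as the box area, which is an assumption since the description does not say whether it refers to area or side length.

```python
from dataclasses import dataclass


@dataclass
class Box:
    label: str
    x_min: float
    y_min: float
    x_max: float
    y_max: float
    distance_m: float  # distance from the camera to the object


def axis_aligned_box(label, screen_points, distance_m):
    """Smallest axis-aligned rectangle enclosing the visible screen points."""
    xs = [x for x, _ in screen_points]
    ys = [y for _, y in screen_points]
    return Box(label, min(xs), min(ys), max(xs), max(ys), distance_m)


def keep(box, max_distance_m=50.0, min_size_px=50.0):
    """Keep objects closer than 50 m whose box exceeds 50 pixels.

    Assumption: the pixel threshold is applied to the box area.
    """
    area = (box.x_max - box.x_min) * (box.y_max - box.y_min)
    return box.distance_m < max_distance_m and area > min_size_px


if __name__ == "__main__":
    tree = axis_aligned_box("tree", [(10, 20), (30, 5), (25, 40)], 12.0)
    print(tree, keep(tree))
```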

    c. Sound generation
    A visual-to-auditory encoding scheme was applied to the dataset to produce the associated audio files. The encoding scheme, based on the Monotonic scheme, consists of an image processing step followed by an image-to-sound conversion. The video processing extracts the brightness variation of each pixel by differencing two successive depth maps, previously remapped into 8-bit images covering the range 0.2 m to 5.2 m at a resolution of 160 x 120 pixels. The range was selected to focus on nearby elements that could be a danger to visually impaired people. The image-to-sound conversion associates each pixel position and brightness with a unique 3D spatialized sound, so that the encoding scheme combines information about elevation, azimuth, and distance. The spatialized sound is a pure tone whose frequency depends on the elevation (from 250 Hz to 1492 Hz) and that is convolved with HRTFs from the CIPIC database to obtain a stereophonic sound spatialized in azimuth and elevation. Finally, distance is encoded by modulating the sound intensity and the envelope amplitude as a function of the pixel brightness. All the generated sounds are combined to produce an audio frame, and the process is repeated until the end of the image set to obtain a sequence of audio frames that forms an audio stream.
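
The image processing step and the elevation-to-frequency mapping can be sketched in Python as below. Several choices are assumptions, not taken from the description: nearer pixels are mapped to brighter values, depths outside the 0.2 m to 5.2 m range are set to 0, the frequency is interpolated linearly between the two stated endpoints, and the top image row is given the highest frequency. The HRTF convolution and the intensity modulation are left out.

```python
NEAR_M, FAR_M = 0.2, 5.2                # depth range from the description
F_LOW_HZ, F_HIGH_HZ = 250.0, 1492.0     # frequency range from the description
N_ROWS = 120                            # image height (160 x 120 pixels)


def depth_to_gray(depth_m):
    """Remap a depth in meters to an 8-bit value.

    Assumptions: nearer is brighter, out-of-range depths become 0.
    """
    if depth_m < NEAR_M or depth_m > FAR_M:
        return 0
    return round(255 * (FAR_M - depth_m) / (FAR_M - NEAR_M))


def frame_difference(previous, current):
    """Absolute per-pixel difference between two successive 8-bit frames."""
    return [
        [abs(c - p) for p, c in zip(prev_row, curr_row)]
        for prev_row, curr_row in zip(previous, current)
    ]


def row_to_frequency(row, n_rows=N_ROWS):
    """Pure-tone frequency for an image row.

    Assumptions: linear interpolation, row 0 (top) gets the highest pitch.
    """
    t = row / (n_rows - 1)
    return F_HIGH_HZ - t * (F_HIGH_HZ - F_LOW_HZ)


if __name__ == "__main__":
    previous = [[depth_to_gray(d) for d in (4.0, 4.0)]]
    current = [[depth_to_gray(d) for d in (1.0, 4.0)]]
    print(frame_difference(previous, current))
    print(row_to_frequency(0), row_to_frequency(N_ROWS - 1))
```

With these assumptions, only pixels whose depth changed between two frames produce a non-zero value, so a static scene would yield silence.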
Update periodicity : as needed
Language : English (eng)
Formats : audio/mpeg, image/png, text/xml
Audience : Research, Informal Education


Spatial coverage :

  • Dijon: latitude between 47° 22' 39" N and 47° 17' 10" N, longitude between 4° 57' 44" E and 5° 6' 7" E
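
As a convenience, the coverage box above can be converted to decimal degrees to check whether a GPS fix from a sequence falls inside it. This is a minimal Python sketch; the function names are hypothetical and the test coordinates in the usage example are illustrative values, not points taken from the dataset.

```python
def dms_to_decimal(degrees, minutes, seconds):
    """Convert degrees, minutes, seconds to decimal degrees (N/E positive)."""
    return degrees + minutes / 60.0 + seconds / 3600.0


# Bounds copied from the spatial coverage entry.
LAT_MIN = dms_to_decimal(47, 17, 10)
LAT_MAX = dms_to_decimal(47, 22, 39)
LON_MIN = dms_to_decimal(4, 57, 44)
LON_MAX = dms_to_decimal(5, 6, 7)


def in_coverage(lat, lon):
    """True if a decimal-degree position lies inside the coverage box."""
    return LAT_MIN <= lat <= LAT_MAX and LON_MIN <= lon <= LON_MAX


if __name__ == "__main__":
    print(round(LAT_MIN, 4), round(LAT_MAX, 4))
    print(round(LON_MIN, 4), round(LON_MAX, 4))
    print(in_coverage(47.32, 5.04))  # illustrative point
```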
Projects and funders :
Additional information :
Data collected as part of the theses of Florian Scalvini (under the supervision of Professor Julien Dubois (ImVia), Professor Maxime Ambard (LEAD) and Professor Cyrille Migniot (ImVia)) and Camille Bordeau (under the supervision of Professor Emmanuel Bigand (LEAD) and Professor Maxime Ambard (LEAD)).
Record created 13 Jul 2023 by Cyrille Migniot.
Last modification : 30 Aug 2023.
Local identifier: FR-13002091000019-2023-07-13.


dat@ImViA is a sub-portal of dat@UBFC, a metadata catalogue for research data produced at UBFC.
