International Journal of
Computer Engineering Research

  • Abbreviation: Int. J. Comput. Eng. Res.
  • Language: English
  • ISSN: 2141-6494
  • DOI: 10.5897/IJCER
  • Start Year: 2010
  • Published Articles: 32

Full Length Research Paper

Lessons learned and perspectives on constrained data collection and preparation for a predictive machine learning model applied to transportation industry in a non-digitalised environment

  • Simon Isaac KABEYA MWEPU
  • Higher Institute of Statistics of Lubumbashi, Democratic Republic of Congo.
  • Patrick MUKALA
  • University of Wollongong in Dubai, United Arab Emirates.

  •  Received: 28 October 2023
  •  Accepted: 15 January 2024
  •  Published: 31 March 2024


Machine learning algorithms rely on qualitative and quantitative historical data to build predictive models for tasks such as shape recognition and autonomous systems, using classifiers such as K-Nearest Neighbors (KNN) and neural networks. Data and its treatment are therefore the undisputed fuel that powers any machine learning endeavour. A standard data preparation methodology comprises several steps: data collection, cleaning, resampling, resizing, variable selection, feature extraction, data transformation and projection, and removal of noise and irrelevant information. In this paper, we report on a case study based on the collection of data for predicting train derailments at the Société Nationale des Chemins de fer du Congo (SNCC), the rail carrier in the Democratic Republic of Congo. We share the lessons learned at a company where almost everything is done manually and relies on experts' opinions. Our data collection at SNCC covers 117,473 vehicles, including 15,280 derailed vehicles, of which 111 come from networks outside SNCC. A total of 25,727 vehicles were excluded for one of the reasons mentioned earlier. The remaining 86,463 vehicles were split into two blocks of 69,170 vehicles for the training data and 17,293 vehicles for the test data. The KNN classifier predicts the occurrence of derailments with a rate of 87% for 3-NN and 85% for 3-NN. With this rate, it is possible to avoid derailments by predicting their occurrence. However, the model must still be improved to avoid the consequences of derailments for people and equipment.
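The split and the classifier described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The abstract does not state a split ratio; the reported counts (69,170 and 17,293 out of 86,463) are consistent with roughly an 80/20 split, which the sketch assumes. The helper names (`split_sizes`, `knn_predict`, `accuracy`), the two-feature toy records and the Euclidean distance are all hypothetical choices made for illustration, and the toy data are invented, not SNCC data.

```python
import math
from collections import Counter


def split_sizes(n_records, train_fraction=0.8):
    """Return (train, test) sizes; the 0.8 fraction is an assumption."""
    n_train = int(n_records * train_fraction)
    return n_train, n_records - n_train


def knn_predict(train_X, train_y, query, k=3):
    """Label a query point by majority vote among its k nearest neighbours."""
    # Sort the training points by Euclidean distance to the query.
    nearest = sorted(
        zip(train_X, train_y),
        key=lambda pair: math.dist(pair[0], query),
    )[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


def accuracy(train_X, train_y, test_X, test_y, k=3):
    """Fraction of test points whose predicted label matches the true label."""
    hits = sum(
        knn_predict(train_X, train_y, x, k) == y
        for x, y in zip(test_X, test_y)
    )
    return hits / len(test_y)


# Sizes reported in the abstract: 86,463 vehicles retained after exclusions.
train_size, test_size = split_sizes(86_463)

# Invented toy records: two hypothetical numeric features per vehicle,
# label 1 = derailed, label 0 = not derailed.
toy_train_X = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
toy_train_y = [0, 0, 0, 1, 1, 1]
toy_test_X = [(0.2, 0.2), (5.5, 5.5)]
toy_test_y = [0, 1]

toy_accuracy = accuracy(toy_train_X, toy_train_y, toy_test_X, toy_test_y, k=3)
```

An odd value such as k = 3 avoids tied votes in a two-class problem, which is one common reason for choosing it. On real data the features would normally be scaled before computing distances.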

Key words: Machine learning, K-nearest neighbors (KNN), neural-network, spiking neural networks (SNNs), data constraints, predictive maintenance, train vehicles.