Review by Shengming Zhang 1, MSc, BSc; Chaohai Zhang 1, BSc; Jiaxin Zhang 1,2, PhD
1 School of Automation and Intelligent Manufacturing, Southern University of Science and Technology, Shenzhen, Guangdong, China
2 Guangdong Provincial Key Laboratory of Fully Actuated System Control Theory and Technology, School of Automation and Intelligent Manufacturing, Southern University of Science and Technology

AI Can Now Assess Surgical Skills Using a 1270-Video Training Dataset
Laparoscopic surgery demands rigorous training, and recent advances in machine learning offer potential for automated, video-based assessment of surgical skills. However, progress is currently limited by the scarcity of large, annotated datasets. To overcome this challenge, Isabel Funke, Sebastian Bodenstedt, Felix von Bechtolsheim, Florian Oehme, Michael Maruschke, Stefanie Herrlich, Jürgen Weitz, Marius Distler, Sören Torge Mees and Stefanie Speidel present the Laparoscopic Skill Analysis and Assessment (LASANA) dataset, comprising 1270 stereo video recordings of four fundamental laparoscopic training tasks. Each recording includes a structured skill rating, derived from three independent raters, alongside binary labels identifying task-specific errors. The majority of recordings were captured during a laparoscopic training course, reflecting the natural skill variation among participants. To enable robust benchmarking of current and future video-based skill assessment and error recognition methods, the authors provide predefined data splits for each task and present baseline model results for comparative analysis.
This new resource addresses a critical limitation in the field: the scarcity of large, annotated datasets that has hindered progress in deep learning models for surgical skill assessment.
Comprising 1270 stereo video recordings, LASANA captures four fundamental laparoscopic training tasks performed by a diverse group of participants. Each video is meticulously annotated with structured skill ratings, derived from the consensus of three independent expert raters, alongside binary labels identifying specific task errors.
The dataset’s creation reflects a commitment to realistic training scenarios, with the majority of recordings originating from an actual laparoscopic training course. This ensures the data encompasses the natural variation in skill levels exhibited by trainees, providing a more robust foundation for model development.
To promote standardized benchmarking, the authors provide predefined data splits for each task, enabling fair comparison of existing and novel approaches to video-based skill assessment and error recognition. A baseline model has also been implemented and its results published, serving as a crucial reference point for future research endeavours.
LASANA’s comprehensive annotations extend beyond simple skill scores, incorporating detailed error identification and structured skill ratings. These granular labels allow for more nuanced analysis of surgical technique and facilitate the development of models capable of providing targeted feedback to trainees.
The dataset’s size, significantly larger than previously available resources like the JIGSAWS dataset with 103 videos, ROSMA with 206 videos, and AIxSuture with 314 videos, promises to unlock new capabilities in automated assessment. Mean video duration per task ranges from 2 minutes 32 seconds to 4 minutes 30 seconds, providing ample data for robust model training.
This work has the potential to transform laparoscopic surgical training by providing objective, consistent, and cost-effective evaluation methods. By automating skill assessment, LASANA paves the way for personalized training programs, improved feedback mechanisms, and ultimately, enhanced surgical performance and patient outcomes. The study captured 1270 recordings of four basic laparoscopic training tasks (peg transfer, circle cutting, balloon resection, and suture & knot) performed by 70 participants.
Recordings were obtained using a Karl Storz TIPCAM 1 S 3D LAP 30° endoscope, providing synchronized stereo views of the surgical scene within a Laparo Aspire training box. Each participant’s performance was documented using laparoscopic instruments from Karl Storz, with the left camera’s video stream displayed as visual feedback during task completion.
Skill ratings were aggregated from three independent raters, employing a structured assessment inspired by the Global Operative Assessment of Laparoscopic Skills (GOALS) tool. This GOALS-inspired rating system evaluates performance across five dimensions, assigning scores on a five-point Likert scale, and culminating in a total score representing overall skill.
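A GOALS-style rating as described above can be sketched as follows. This is a minimal illustration, not the authors' code: the dimension names follow the original GOALS tool, and averaging the three raters' totals is an assumed aggregation rule (the LASANA adaptation may differ).

```python
from statistics import mean

# The five dimensions of the original GOALS tool; the LASANA
# adaptation of these names is an assumption.
DIMENSIONS = ("depth_perception", "bimanual_dexterity", "efficiency",
              "tissue_handling", "autonomy")

def total_score(rating: dict) -> int:
    """Sum the five 1-5 Likert items into a total score (range 5-25)."""
    assert all(1 <= rating[d] <= 5 for d in DIMENSIONS)
    return sum(rating[d] for d in DIMENSIONS)

def aggregate(ratings: list) -> float:
    """Combine the three raters' totals; averaging is an assumed rule."""
    return mean(total_score(r) for r in ratings)
```

With three raters each scoring 3 on every dimension, for example, the aggregated total is 15.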
To ensure data reliability, the research team quantified inter-rater agreement using Lin's concordance correlation coefficient (ρc), achieving values exceeding 0.65 for all tasks except circle cutting, which yielded a ρc of 0.49. Furthermore, the dataset includes binary labels denoting the presence or absence of task-specific errors, such as dropped objects or balloon punctures, enabling the development of complementary error recognition algorithms.
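Lin's ρc has a simple closed form: twice the covariance of two raters' scores divided by the sum of their variances plus the squared difference of their means. A minimal NumPy sketch (not the authors' implementation):

```python
import numpy as np

def lins_ccc(x, y):
    """Lin's concordance correlation coefficient between two raters' scores."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    cov_xy = np.mean((x - x.mean()) * (y - y.mean()))  # population covariance
    return 2.0 * cov_xy / (x.var() + y.var() + (x.mean() - y.mean()) ** 2)
```

Unlike Pearson's correlation, ρc penalises systematic offsets between raters: identical scores give ρc = 1, but a rater who consistently scores one point higher is pulled below 1 even though the two score lists are perfectly correlated.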
The dataset predominantly features recordings from a laparoscopic training course, accurately reflecting the varied skill levels of participating trainees. Predefined data splits are provided to enable benchmarking of existing and novel video-based skill assessment approaches and error recognition algorithms for each task.
A deep learning model was implemented to establish baseline results, serving as a comparative reference point for future investigations. The four tasks are object manipulation, cutting, and suturing exercises commonly used in surgical curricula, with mean video durations of approximately 2 minutes 32 seconds for the first task, 3 minutes 32 seconds for the second, 3 minutes 55 seconds for the third, and 4 minutes 30 seconds for the final task.
The dataset includes annotations detailing experience level, skill rating, surgical gestures, and task-specific errors, providing a comprehensive resource for analysis. The 70 participants contributing to the LASANA dataset generated a total of 1270 videos, with approximately 314-329 videos dedicated to each of the four tasks.
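The per-video annotations listed above could be modelled as a simple record. The field names below are illustrative assumptions, not the dataset's actual schema:

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class LasanaRecording:
    """Hypothetical per-video annotation record; field names are assumptions."""
    video_id: str
    task: str                  # e.g. "peg_transfer"
    experience_level: str      # participant's prior laparoscopic experience
    skill_total: float         # aggregated GOALS-style total score
    gestures: List[str] = field(default_factory=list)       # surgical gestures
    errors: Dict[str, bool] = field(default_factory=dict)   # task-specific error flags
```

Grouping the structured rating and the binary error flags in one record mirrors the dual use of the dataset for skill estimation and error detection.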
Three independent raters provided skill ratings for each video, ensuring a robust and reliable assessment of surgical performance. The availability of both structured skill ratings and binary error labels facilitates the development of models capable of both skill level estimation and error detection.
Each video is accompanied by structured skill ratings, derived from multiple independent assessors, and binary labels identifying task-specific errors, providing detailed information for automated analysis. This dataset facilitates the development and benchmarking of video-based systems designed to automatically assess surgical skills and recognise errors.
The recordings predominantly originate from a laparoscopic training course, capturing the natural variation in performance levels expected during skill acquisition. Predefined data splits are included to standardise evaluation procedures and enable meaningful comparisons between different approaches, with baseline results from a deep learning model provided for reference.
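Consuming such predefined splits might look like the sketch below. The JSON layout and file name are assumptions for illustration; the article does not specify the distribution format:

```python
import json

def load_split(task: str, split_path: str):
    """Return (train, val, test) video-ID lists for one task.

    Assumed JSON layout (not the published format):
    {"peg_transfer": {"train": [...], "val": [...], "test": [...]}, ...}
    """
    with open(split_path) as f:
        splits = json.load(f)
    s = splits[task]
    return s["train"], s["val"], s["test"]
```

Fixing the splits in a shared file like this is what makes reported numbers comparable: every method trains and evaluates on the same videos.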
Acknowledging the current limitations in available data for training robust assessment models, LASANA offers a valuable resource for the surgical training community. The authors anticipate that this dataset will accelerate progress in automated skill assessment, potentially leading to more objective and efficient training programmes. Future work could focus on expanding the dataset to include more complex surgical procedures and diverse participant populations, further enhancing the generalisability and applicability of automated assessment tools.
👉 More information
🗞 A benchmark for video-based laparoscopic skill analysis and assessment
🧠 ArXiv: https://arxiv.org/abs/2602.09927
