Executive Summary
A metal forging manufacturer needed a better way to understand exactly when burn-off and vaporization occurred during its forging process. The footage already existed, but it was used for manual review, not measurement.
We used SmolVLM, a lightweight video-understanding model, to analyze forging videos and identify the precise point where visible burn-off or vaporization began.
Instead of "something burned off during the cycle," the system could produce: "Visible vaporization began at 6.42 seconds, near the front-right edge of the workpiece, before full die contact." That timestamp became a measurable process signal.
Across a pilot of several thousand archived cycles, the system turned footage that had only ever been watched into a column of data that could be charted, compared, and correlated against quality outcomes.
1) The Manufacturing Challenge
In forging, timing matters. A heated workpiece moves through a short, high-energy window, often just a handful of seconds, and during that window coatings, lubricants, residual moisture, and oxides may burn off or vaporize.
The manufacturer wanted to understand this event more clearly because its timing and appearance can reflect surface temperature, die temperature, lubricant behavior, transfer timing, contamination, and cycle-to-cycle consistency. A shift in when burn-off starts can be the earliest visible hint that something upstream has changed.
An experienced operator could watch a clip and say "the vapor came off early" or "that one burned off differently." But those observations were nearly impossible to compare across hundreds or thousands of cycles. They lived in memory, not in a record, and they walked out the door at the end of a shift. The company needed a way to turn visual judgment into repeatable data.
Where Things Stood
The cameras were already there. Lines had been recording cycles for years, and the archive ran to thousands of clips across stations, parts, and shifts. The footage was used reactively: pulled up when something went wrong, watched, and then set aside.
Nothing about that footage was measurable. There was no timestamp for when burn-off started, no record of where on the workpiece it appeared, no way to ask whether the night shift ran differently from the day shift.
The single richest source of process information the plant had was effectively write-only.
The goal of the pilot was deliberately narrow: do not try to predict failures or control the press. Just take the event everyone already cared about, burn-off onset, and turn it into a number that is the same number every time, no matter who is watching.
2) What We Built
We built a video analysis workflow that detects and describes the burn-off / vaporization event, then writes it into a structured record. Each cycle moved from a video file to a searchable process-event entry.
The output was designed to be analyzed, not just read. Every field is something an engineer can group by, filter on, or plot (station, part type, onset time, location, confidence), so a year of footage becomes a dataset instead of a folder.
For each cycle, the workflow produced:
- First visible frame of burn-off or vaporization
- Timestamp of the event
- Region of the workpiece where it began
- Plain-language description of what was visible
- Confidence score and a review flag
- A structured output record for downstream analysis
3) Why SmolVLM
The task needed more than motion detection. A traditional computer-vision approach can detect changes in pixels, brightness, or motion, but it does not always understand what those changes mean. A bright flash could be a reflection, a spark, or the start of vaporization, and a pixel-difference threshold cannot tell them apart.
SmolVLM provided the visual-language reasoning layer. It could look at a frame or short clip and describe the event in manufacturing terms ("a light vapor plume begins rising from the right side of the heated workpiece"), which made the output easy for engineers and operators to review.
Its size was the point. A lightweight model is cheap enough to run across an enormous archive and fast enough to keep up with new footage, without a rack of GPUs behind it. The pilot did not need the largest possible model; it needed one good enough to read a forge cycle and small enough to run thousands of times without the cost becoming the project.
4) The Pipeline
SmolVLM worked best as part of a structured pipeline, not as a standalone detector. Video was tied to its process run and metadata, then sampled around the critical window rather than processed whole, which kept the analysis fast and focused.
Sampling was the key efficiency trick. Rather than feed every frame of a multi-second clip to the model, the pipeline narrowed in on the window where onset was plausible and sampled densely there, so the model spent its attention on the moment that mattered instead of the empty seconds around it.
Peak smoke or vapor is easy to detect but arrives late. Onset, the first visible moment the event begins, is subtle and early, and it is the only moment that actually carries information about the process.
- Video intake linked to station, part type, run ID, operator, and material batch for traceability
- Frame sampling around the critical process window
- Dense sampling near the candidate onset, sparse sampling elsewhere
- Visual event detection for first vapor, smoke onset, localized haze, or brightness shift near the die
- SmolVLM interpretation of candidate frames into a structured description
- Timestamp and location extraction for each detected onset
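The dense-near-onset, sparse-elsewhere sampling described above can be sketched in a few lines of Python. This is a minimal illustration, not the pilot's code: the function name `sample_times`, the step sizes, and the example window are all assumptions chosen for readability.

```python
def sample_times(clip_len_s, window, dense_step_s=0.05, sparse_step_s=0.5):
    """Return sorted timestamps in seconds: dense inside `window`, sparse elsewhere.

    `window` is a (start_s, end_s) tuple marking where onset is plausible.
    All step sizes are illustrative assumptions.
    """
    start, end = window
    times = set()

    # Sparse pass over the whole clip so nothing is skipped entirely.
    n_sparse = int(clip_len_s / sparse_step_s)
    for i in range(n_sparse + 1):
        times.add(round(i * sparse_step_s, 3))

    # Dense pass inside the candidate onset window.
    n_dense = int(round((end - start) / dense_step_s))
    for i in range(n_dense + 1):
        times.add(round(start + i * dense_step_s, 3))

    # Rounding to milliseconds merges timestamps shared by both passes.
    return sorted(t for t in times if 0.0 <= t <= clip_len_s)


# Hypothetical 10-second clip with onset expected between 6 and 7 seconds.
example_times = sample_times(10.0, (6.0, 7.0))
print(len(example_times), "frames to send to the model")
```

With these illustrative settings a 10-second clip yields 39 sampled timestamps, 21 of them inside the one-second window, rather than every recorded frame.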
The target was not peak smoke or peak vapor. The target was onset.
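The difference between onset and peak is easy to show on a toy signal. The sketch below assumes each sampled frame has already been reduced to a single hypothetical "haze score" (the source does not describe such a score); onset is taken as the first frame where the score stays above a baseline-derived threshold for several frames in a row, which also rejects one-frame flashes such as reflections. The names `find_onset` and `find_peak` and all parameter values are assumptions.

```python
from statistics import mean, pstdev


def find_onset(scores, baseline_n=5, k=3.0, min_run=3):
    """Index of the first frame that starts a sustained rise above baseline.

    The threshold is the baseline mean plus k standard deviations, estimated
    from the first `baseline_n` frames. Returns None if no run is found.
    """
    base = scores[:baseline_n]
    threshold = mean(base) + k * max(pstdev(base), 1e-6)
    run = 0
    for i in range(baseline_n, len(scores)):
        if scores[i] > threshold:
            run += 1
            if run == min_run:
                return i - min_run + 1  # first frame of the sustained run
        else:
            run = 0  # an isolated spike does not count as onset
    return None


def find_peak(scores):
    """Index of the highest score: easy to find, but late."""
    return max(range(len(scores)), key=scores.__getitem__)


# Made-up scores: quiet baseline, a one-frame flash, then a real rise.
scores = [0.10, 0.11, 0.09, 0.10, 0.10, 0.30, 0.10,
          0.25, 0.30, 0.40, 0.80, 0.95, 0.70]
print("onset frame:", find_onset(scores), "peak frame:", find_peak(scores))
```

On this made-up signal the flash at index 5 is ignored, onset lands at index 7, and the peak does not arrive until index 11.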
5) Structured, Auditable Output
For each candidate event, the model produced a structured description with a timestamp, frame number, location, the visual evidence behind the call, a confidence score, and a review flag.
That mattered because the result needed to be auditable. Engineers did not just need a number; they needed to know why the system selected that frame. A timestamp with no evidence behind it is a guess; a timestamp paired with the frame and the visual reasoning is a measurement someone can stand behind.
Storing the evidence alongside the result also made the system improvable. When a reviewer disagreed with a call, the disputed frame and its description were right there to learn from, rather than lost in a black box.
- event: burn_off_vaporization_onset
- timestamp: 6.42s, frame 385
- location: front-right edge of workpiece
- visual evidence: first vapor plume, localized haze near die contact, surface brightness change
- confidence: 0.89, review_required: false
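A record like the one above could be represented and checked roughly as follows. This is a sketch under stated assumptions: the class name `OnsetRecord`, the JSON field names, and the 60 fps frame rate are not from the source (the camera frame rate is never stated, although 6.42 s and frame 385 do agree at 60 fps). The consistency check between timestamp and frame is one simple way to make the record auditable.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class OnsetRecord:
    event: str
    timestamp_s: float
    frame: int
    location: str
    visual_evidence: list
    confidence: float
    review_required: bool


def frame_for(timestamp_s, fps):
    """Nearest frame index for a timestamp at the given frame rate."""
    return round(timestamp_s * fps)


def parse_record(raw_json, fps=60):
    """Parse model output into a record, rejecting inconsistent values.

    fps=60 is an assumption; use the camera's actual frame rate.
    """
    record = OnsetRecord(**json.loads(raw_json))
    if not 0.0 <= record.confidence <= 1.0:
        raise ValueError("confidence must be between 0 and 1")
    if record.frame != frame_for(record.timestamp_s, fps):
        raise ValueError("frame number does not match timestamp")
    return record


# Hypothetical JSON emitted by the interpretation step.
raw = json.dumps({
    "event": "burn_off_vaporization_onset",
    "timestamp_s": 6.42,
    "frame": 385,
    "location": "front-right edge of workpiece",
    "visual_evidence": ["first vapor plume",
                        "localized haze near die contact",
                        "surface brightness change"],
    "confidence": 0.89,
    "review_required": False,
})
example_record = parse_record(raw)
print(asdict(example_record)["timestamp_s"])
```

Keeping the evidence list inside the same record as the timestamp means a reviewer who disputes a call has everything needed in one row.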
6) Human Review Loop
When confidence was high, the system produced a timestamped event record. When confidence was low, it flagged the clip for a person. Only a modest fraction of cycles needed a human look, which is what made the approach scale across the whole archive.
Low-confidence cases included poor lighting, obstructed views, multiple vapor sources, reflections, camera shake, or smoke already present before the target window. The AI was never treated as an unquestionable authority; it screened and structured, and humans reviewed the uncertain cases.
Reviewer corrections did not just fix one record; they sharpened the thresholds for which cases get flagged, so the system steadily got better at knowing what it did not know.
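One simple way such a loop could work is sketched below. The source does not say how thresholds were adjusted, so everything here is an assumption: the starting cutoff of 0.80, the 95% agreement target, and the rule of choosing the lowest cutoff at which reviewers agreed with at least that share of the calls at or above it.

```python
REVIEW_THRESHOLD = 0.80  # assumed starting cutoff, not a value from the pilot


def route(confidence, threshold=REVIEW_THRESHOLD):
    """Send a detection to the record store or to a human reviewer."""
    return "auto_accept" if confidence >= threshold else "human_review"


def tune_threshold(reviewed, target_agreement=0.95):
    """Pick the lowest cutoff whose accepted calls meet the agreement target.

    `reviewed` is a list of (confidence, reviewer_agreed) pairs collected
    from past human reviews. Returns 1.0 if no cutoff meets the target.
    """
    for cutoff in sorted({conf for conf, _ in reviewed}):
        accepted = [agreed for conf, agreed in reviewed if conf >= cutoff]
        if sum(accepted) / len(accepted) >= target_agreement:
            return cutoff
    return 1.0


# Made-up review history: (model confidence, did the reviewer agree?)
history = [(0.95, True), (0.90, True), (0.85, True),
           (0.80, False), (0.75, True), (0.60, False)]
new_threshold = tune_threshold(history)
print(route(0.82), "->", route(0.82, threshold=new_threshold))
```

With this made-up history the cutoff moves from 0.80 to 0.85, so a call at 0.82 that would previously have been accepted is now sent to a person.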
7) Why It Was Valuable
The value was not detecting smoke. The value was extracting consistent timing from a process that had previously been judged by eye.
Once burn-off onset was timestamped, the manufacturer could finally compare cycles: Does vaporization start at the same time across good runs? Does one press station burn off earlier than another? Does timing vary by shift or material batch? Does delayed vaporization correlate with inspection issues?
Before the project, videos required manual review and burn-off timing was never consistently captured. After it, every forge cycle produced a timestamped onset, a location, the supporting evidence, and a flag for anything uncertain. Onset timing became a new measurable variable that could be compared against process parameters and quality outcomes.
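Once onset is a column, questions like "does one station burn off earlier than another?" reduce to a group-by. The sketch below uses only the Python standard library; the field names (`station`, `shift`, `onset_s`) and every value in `cycles` are invented for illustration and are not pilot data.

```python
from collections import defaultdict
from statistics import median


def median_onset_by(records, key):
    """Median onset time in seconds for each distinct value of `key`."""
    groups = defaultdict(list)
    for rec in records:
        groups[rec[key]].append(rec["onset_s"])
    return {group: median(values) for group, values in groups.items()}


# Illustrative records only; not measurements from the pilot.
cycles = [
    {"station": "A", "shift": "day", "onset_s": 6.40},
    {"station": "A", "shift": "night", "onset_s": 6.42},
    {"station": "A", "shift": "day", "onset_s": 6.44},
    {"station": "B", "shift": "day", "onset_s": 6.10},
    {"station": "B", "shift": "night", "onset_s": 6.15},
    {"station": "B", "shift": "night", "onset_s": 6.30},
]

print(median_onset_by(cycles, "station"))
print(median_onset_by(cycles, "shift"))
```

The same function answers the shift and material-batch questions by changing the `key` argument, which is the practical benefit of storing every cycle in one consistent shape.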
What It Unlocks Next
A consistent onset measurement is a foundation, not an endpoint. With timing captured cycle after cycle, the obvious next step is to chart it over time and set expectation bands, so a station that starts burning off noticeably earlier or later than its own history becomes a signal worth investigating before it becomes a scrap problem.
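An expectation band can be as simple as a station's historical mean plus or minus a few standard deviations. The sketch below is one common choice, not something the source specifies: the multiplier `k=3.0`, the function names, and the sample history are all assumptions.

```python
from statistics import mean, stdev


def expectation_band(history, k=3.0):
    """Return (low, high) bounds around a station's own onset history."""
    center = mean(history)
    spread = stdev(history)  # sample standard deviation; needs 2+ values
    return (center - k * spread, center + k * spread)


def out_of_band(onset_s, band):
    """True if a new onset time falls outside the expected range."""
    low, high = band
    return onset_s < low or onset_s > high


# Made-up onset history for one station, in seconds.
station_history = [6.40, 6.42, 6.44, 6.41, 6.43]
band = expectation_band(station_history)

for onset in (6.45, 6.10):
    status = "investigate" if out_of_band(onset, band) else "within band"
    print(f"onset {onset:.2f}s: {status}")
```

Because the band is built from each station's own history, a station that normally burns off early is not flagged just for differing from its neighbors, only for drifting from itself.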
The same pipeline pattern generalizes. The plant has other visual events that have always been judged by eye: transfer timing, die contact, the look of a good versus a marginal part. Each one is a candidate for the same treatment: sample the right window, let a small vision-language model describe what it sees, keep the evidence, and route the uncertain cases to people.
The longer-term prize is correlation. Once burn-off timing sits in the same place as process parameters and quality results, the manufacturer can start asking whether onset timing predicts downstream outcomes, turning a previously invisible moment into an early indicator of how the cycle is going to turn out.
Results
- Burn-off onset timestamped to the frame
- Event location and visual evidence captured per cycle
- Onset timing now comparable across thousands of cycles
- Low-confidence clips auto-flagged for human review