
Data Extraction for CMS Open Data

(NanoAOD · Run2016H · SinglePhoton)

Why Are There More Events Than in the Certified JSON? (CMS Data Analysis)

Certified and Non-Certified Events in CMS NanoAOD

A common surprise when working with CMS Open Data is that the total number of events in a NanoAOD dataset often exceeds the number of events that survive filtering with the Certified (Golden) JSON file.
This raises a natural question:
if some events are not certified, why are they present in the dataset at all?

This confusion usually appears during the first serious extraction step — when event counts suddenly drop after applying a certified JSON filter, or when yields no longer match expectations from documentation.
Understanding why this happens is essential for both correct physics analysis and responsible data handling.

What the Certified (Golden) JSON Actually Represents

The Certified JSON is not a list of “good events.”
It is a list of good luminosity sections — time intervals where detector subsystems met strict data quality criteria.
Only runs and lumi sections included in this JSON are considered fully validated for precision physics measurements.

When analysts apply a certified JSON, they are effectively saying:
“I only trust events recorded during periods where the detector was operating within well-understood and approved conditions.”
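To make this concrete, the Golden JSON maps run numbers to lists of certified luminosity-section ranges. The snippet below is a minimal sketch of that format and a membership check; the run numbers and ranges are illustrative placeholders, not real Run2016H certification data.

```python
import json

# Hypothetical excerpt in the Golden JSON format: keys are run numbers
# (as strings), values are lists of [first_lumi, last_lumi] ranges
# certified for physics. Real files contain hundreds of runs.
golden_text = """
{
  "283171": [[1, 58], [60, 120]],
  "283270": [[76, 575]]
}
"""
golden = {int(run): ranges for run, ranges in json.loads(golden_text).items()}

def is_certified(run, lumi, golden):
    """Return True if (run, lumi) falls inside a certified range."""
    return any(lo <= lumi <= hi for lo, hi in golden.get(run, []))

print(is_certified(283171, 59, golden))   # False: lumi 59 sits in a gap
print(is_certified(283270, 100, golden))  # True
```

Note that the granularity is the luminosity section, not the event: every event recorded in a certified range passes, and every event outside one is excluded.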

Why Non-Certified Events Still Exist in NanoAOD

CMS Open Data intentionally preserves all recorded events, including those outside certified lumi sections.
These events are not automatically removed because:

• detector issues are often subsystem-specific, not global
• some subsystems may perform nominally while others are flagged
• certification criteria are conservative by design
• future re-interpretation or methodological studies may require access to raw or semi-validated data

In other words, non-certified does not mean “corrupted” or “random noise.”
It means “not approved for precision physics without further validation.”

Where Non-Certified Events Can Still Be Informative

While non-certified events should never be used directly to claim new physics results, they can still play a meaningful role in exploratory analysis.

For example, non-certified data may still exhibit:

• unusual calorimeter timing patterns
• low-energy deposits close to noise thresholds
• atypical correlations between subdetector responses
• rare configurations filtered out by conservative quality criteria

Such signals may not pass certification due to strict timing, stability, or calibration constraints — even if all detector channels technically responded.

These events can serve as a secondary exploration layer: a way to detect anomalies, patterns, or correlations that later guide a focused search within fully certified data.

From Exploration to Validation

This distinction is critical:

Non-certified events may suggest where to look, but certified events are required to prove anything.

Any hypothesis, correlation, or anomaly identified in non-certified data must be re-tested exclusively on certified lumi sections before it can be considered scientifically valid.
This two-step approach — exploration followed by strict validation — is a cornerstone of responsible analysis.
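The two-step workflow can be sketched as a simple partition of the dataset: explore on everything recorded, then validate any candidate selection on certified lumi sections only. The golden dictionary and event list below are hypothetical placeholders.

```python
# Illustrative certified ranges: run -> list of (first_lumi, last_lumi).
golden = {283171: [(1, 58), (60, 120)]}

def is_certified(run, lumi):
    """True if this (run, lumi) pair lies inside a certified range."""
    return any(lo <= lumi <= hi for lo, hi in golden.get(run, ()))

# Per-event (run, lumi) pairs, as stored in NanoAOD branches.
events = [(283171, 30), (283171, 59), (283171, 200)]

exploration = events                                      # everything recorded
validation = [ev for ev in events if is_certified(*ev)]   # certified only

# Any anomaly spotted in `exploration` must reappear in `validation`
# before it can support a physics claim.
```

In a real analysis the same mask would be applied to array-valued run and luminosityBlock branches rather than Python tuples; libraries such as coffea provide a LumiMask utility for exactly this.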

Why Event Counts Change After JSON Filtering

When a certified JSON is applied during extraction:

• entire lumi sections are removed
• all events within those lumi sections disappear from the dataset
• resulting yields can differ significantly from raw NanoAOD counts

This is expected behavior, not a bug.
The mismatch reflects the difference between “everything recorded” and “everything approved.”
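A short sketch makes the yield drop tangible: removing a lumi section removes every event recorded inside it, and whole runs absent from the JSON vanish entirely. The numbers here are illustrative placeholders, not real Run2016H yields.

```python
# Illustrative certified ranges and per-event (run, lumi) pairs.
golden = {283171: [(1, 58), (60, 120)], 283270: [(76, 575)]}

events = [
    (283171, 10), (283171, 59), (283171, 61),
    (283270, 50), (283270, 100), (283999, 5),
]

def in_golden(run, lumi):
    return any(lo <= lumi <= hi for lo, hi in golden.get(run, ()))

kept = [ev for ev in events if in_golden(*ev)]
print(f"{len(events)} recorded -> {len(kept)} certified")
# 6 recorded -> 3 certified: the gap lumi (283171, 59), the
# pre-range lumi (283270, 50), and the uncertified run 283999 all drop.
```

The "missing" events were never individually rejected; they simply belong to time intervals the certification does not cover.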

Practical Takeaway

CMS NanoAOD contains more than just publication-ready events.
It contains the full experimental history — including regions that require careful interpretation.

Understanding the boundary between certified and non-certified data allows researchers to:

• avoid invalid conclusions
• design cleaner analysis pipelines
• explore subtle detector effects responsibly
• maintain scientific rigor while preserving curiosity

Data quality is not binary.
It is contextual — and knowing how to navigate that context is part of working seriously with collider data.