Behind the Scenes: How Sentinel-3 Works

Reading time: 15 minutes
Audience: Developers, ML engineers, security architects

This is the technical deep dive on Sentinel-3. If you are looking for a more accessible introduction, read Introducing Sentinel-3 first.

Table of contents

Architecture overview
Layer 1: Monitoring and threat intelligence
Layer 2: Automatic adversary generation
Layer 3: Adaptive detection model
Inference in production
Benchmarks and evaluation
Design decisions and trade-offs

Architecture overview

Sentinel-3 is built on three main components that run in parallel and share a model registry:

┌─────────────────────────────────────────────────────────────┐
│                      SENTINEL-3 SYSTEM                      │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  ┌──────────────┐     ┌──────────────┐     ┌────────────┐   │
│  │   Threat     │────▶│  Adversarial │────▶│  Adaptive  │   │
│  │ Intelligence │     │   Generator  │     │   Model    │   │
│  │              │     │              │     │            │   │
│  │  Monitors    │     │  Generates   │     │  Detects   │   │
│  │  ecosystem   │     │  examples    │     │  spoofs    │   │
│  └──────────────┘     └──────────────┘     └────────────┘   │
│         │                     │                    │        │
│         └─────────────────────┴────────────────────┘        │
│                               │                             │
│                     ┌─────────▼─────────┐                   │
│                     │  Model Registry   │                   │
│                     │  & Versioning     │                   │
│                     └───────────────────┘                   │
│                                                             │
└─────────────────────────────────────────────────────────────┘

Tech stack

Base model: AASIST (Audio Anti-Spoofing using Integrated Spectro-Temporal Graph Attention Networks)
Framework: PyTorch 2.1 + TorchScript
Inference: ONNX Runtime + TensorRT
Orchestration: Kubernetes + ArgoCD
Feature extraction: Librosa + custom DSP pipeline
Model registry: MLflow + DVC
Monitoring: Prometheus + Grafana
CI/CD: GitHub Actions + custom runners

Layer 1: Monitoring and threat intelligence

Monitored sources

The monitoring system tracks several sources in parallel:

GitHub (open source)

Repos matching keywords: “voice cloning”, “TTS”, “voice conversion”, “speaker synthesis”
Stars > 100 or recent activity
Languages: Python, JavaScript, C++
License: any (prioritizing Apache/MIT)

Hugging Face

Models tagged “text-to-speech” or “voice-conversion”
Downloads > 1,000 or trending
Repos that release new versions

Papers with Code

Papers on speech synthesis published in the last 30 days
ASVspoof and VoxCeleb benchmarks
Code available = true

Commercial APIs

Status pages for ElevenLabs, PlayHT, Resemble, etc.
Changelog monitoring
Version detection via API fingerprinting

Detection pipeline

# Simplified detector logic
class ThreatDetector:
    def __init__(self):
        self.sources = [
            GitHubMonitor(keywords=VOICE_SYNTHESIS_KEYWORDS),
            HuggingFaceMonitor(tags=TTS_TAGS),
            PapersMonitor(arxiv_categories=["eess.AS", "cs.SD"]),
            CommercialAPIMonitor(services=COMMERCIAL_SERVICES)
        ]
        self.db = ThreatDatabase()

    async def scan(self):
        threats = []
        for source in self.sources:
            new_threats = await source.discover()
            for threat in new_threats:
                if self.is_new_or_updated(threat):
                    threats.append(threat)
                    await self.enqueue_for_analysis(threat)
        return threats

    def is_new_or_updated(self, threat):
        existing = self.db.get(threat.signature)
        if not existing:
            return True
        return threat.version > existing.version
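
The version check above only behaves as intended if versions are stored in a comparable form. If they arrive as raw strings, a plain comparison orders them lexicographically, so "1.10" sorts before "1.9". Here is a minimal sketch of one way to handle that, assuming dotted numeric versions; parse_version and is_newer are hypothetical helper names, not part of Sentinel-3.

```python
from itertools import takewhile


def parse_version(version: str) -> tuple[int, ...]:
    """Turn a dotted version such as 'v1.10.2' into (1, 10, 2).

    Only the leading digits of each component are used, so '2rc1' counts as 2.
    Components with no leading digits count as 0.
    """
    cleaned = version.strip().lstrip("vV")
    parts = []
    for piece in cleaned.split("."):
        digits = "".join(takewhile(str.isdigit, piece))
        parts.append(int(digits) if digits else 0)
    return tuple(parts)


def is_newer(candidate: str, known: str) -> bool:
    """True when `candidate` is a strictly newer release than `known`."""
    return parse_version(candidate) > parse_version(known)
```

Tools that use date-based or hash-based versions would need a different comparison, such as a release timestamp.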

Threat classification

Each detected threat is classified automatically:

Level	Criterion	Action
CRITICAL	Popular commercial tool (>10K users)	Immediate generation, urgent deploy
HIGH	Open source with >1K stars, or paper with code	Generation within 24h, normal testing
MEDIUM	Research code with promising results	Generation within 72h, extended validation
LOW	Experimental, no adoption	Continuous monitoring, no immediate action
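
The table can be read as an ordered set of rules, checked from most to least severe. A minimal sketch of that reading follows. The ThreatInfo fields are assumptions made for illustration, and the real classifier may weigh more signals than these.

```python
from dataclasses import dataclass


@dataclass
class ThreatInfo:
    # Hypothetical fields; the actual threat record is not shown in this post.
    is_commercial: bool = False
    users: int = 0
    stars: int = 0
    has_paper_with_code: bool = False
    promising_results: bool = False


def classify(threat: ThreatInfo) -> str:
    """Apply the table's criteria in order of severity."""
    if threat.is_commercial and threat.users > 10_000:
        return "CRITICAL"
    if threat.stars > 1_000 or threat.has_paper_with_code:
        return "HIGH"
    if threat.promising_results:
        return "MEDIUM"
    return "LOW"
```

Checking the rules in this order means a threat that satisfies several criteria always gets the most severe matching level.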

Layer 2: Automatic adversary generation

This is the most complex layer. It has to install and run arbitrary tools safely and automatically.

Execution environment

Each tool runs in an isolated container:

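FROM python:3.10-slim

# Common dependencies for voice synthesis tools (installed system-wide, as root)
RUN pip install --no-cache-dir torch torchaudio librosa soundfile numpy

# Monitoring entrypoint, made executable before privileges are dropped
COPY monitor.sh /usr/local/bin/
RUN chmod +x /usr/local/bin/monitor.sh

# Sandboxing: run everything else as an unprivileged user
RUN useradd -m -s /bin/bash sandbox
USER sandbox
WORKDIR /home/sandbox

# Intended resource limits. ENV only exposes these values to monitor.sh;
# enforcement has to come from the container runtime.
ENV MAX_MEMORY=4G
ENV MAX_CPU=2
ENV TIMEOUT=3600

CMD ["/usr/local/bin/monitor.sh"]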
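
Environment variables in an image describe limits but do not enforce them. Enforcement usually happens when the container is started. The sketch below builds a docker run command using standard Docker CLI flags for memory, CPU, process count, capabilities and networking. The build_sandbox_command name, the image name and the defaults are assumptions for illustration, not Sentinel-3's actual launcher.

```python
def build_sandbox_command(
    image: str,
    memory: str = "4g",
    cpus: float = 2.0,
    allow_network: bool = False,
) -> list[str]:
    """Build a `docker run` argument list with runtime-enforced limits."""
    cmd = [
        "docker", "run", "--rm",
        "--memory", memory,          # hard memory cap
        "--cpus", str(cpus),         # CPU quota
        "--pids-limit", "256",       # guards against fork bombs
        "--cap-drop", "ALL",         # drop all Linux capabilities
        "--security-opt", "no-new-privileges",
    ]
    if not allow_network:
        # Generation usually needs no network; installation may.
        cmd += ["--network", "none"]
    cmd.append(image)
    return cmd
```

A wall-clock limit such as the 3600-second TIMEOUT still has to be applied separately, for example by having the orchestrator stop the container once the deadline passes.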

Generation pipeline

Step 1: Automatic installation

class ToolInstaller:
    async def install(self, threat: Threat):
        if threat.source_type == "github":
            await self.clone_repo(threat.repo_url)
            await self.detect_install_method()  # pip, conda, docker, etc.
            await self.install_dependencies()

        elif threat.source_type == "huggingface":
            await self.download_model(threat.model_id)

        # Verify installation
        await self.run_sanity_check()

Step 2: Interface detection

The system tries to detect automatically how the tool is meant to be used:

from pathlib import Path

class InterfaceDetector:
    def detect(self, tool_path: Path):
        # Try common entry-point patterns
        if (tool_path / "inference.py").exists():
            return PythonScriptInterface(tool_path / "inference.py")

        if (tool_path / "app.py").exists():  # Gradio apps
            return GradioInterface(tool_path / "app.py")

        if (tool_path / "cli.py").exists():
            return CLIInterface(tool_path / "cli.py")

        # Fall back to parsing the README for usage examples
        readme = (tool_path / "README.md").read_text(encoding="utf-8")
        return self.parse_usage_from_readme(readme)
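
The README fallback is the least structured path, so a concrete example may help. This is a rough sketch, under the assumption that usage examples appear as command lines inside Markdown fenced blocks. The extract_usage_commands name is hypothetical, and a production parser would need to handle many more README styles.

```python
import re

FENCE = "`" * 3  # Markdown code-fence delimiter


def extract_usage_commands(readme_text: str) -> list[str]:
    """Return lines inside fenced blocks that invoke a Python script."""
    commands = []
    in_block = False
    for line in readme_text.splitlines():
        stripped = line.strip()
        if stripped.startswith(FENCE):
            in_block = not in_block  # opening or closing fence
            continue
        if not in_block:
            continue
        if stripped.startswith("$ "):  # drop a shell prompt if present
            stripped = stripped[2:].strip()
        if re.match(r"python3?\s+\S+", stripped):
            commands.append(stripped)
    return commands
```

Commands found this way are only candidates. They would still need to run inside the sandbox before being trusted as the tool's interface.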

Step 3: Example generation

class AdversarialGenerator:
    async def generate(self, tool: Tool, target_samples: int = 3000):
        # Select source speakers
        speakers = self.select_diverse_speakers(n=50)

        # Select target texts
        texts = self.select_text_corpus(
            length_distribution="natural",  # 1-10 seconds
            languages=["es", "en"],
            domains=["banking", "general"]
        )

        examples = []
        for speaker in speakers:
            for text in texts[:target_samples // len(speakers)]:
                try:
                    synthetic = await tool.synthesize(
                        text=text,
                        reference_audio=speaker.sample,
                        timeout=30
                    )

                    examples.append({
                        "audio": synthetic,
                        "label": "spoof",
                        "tool": tool.name,
                        "tool_version": tool.version,
                        "speaker_id": speaker.id,
                        "text": text
                    })

                except Exception as e:
                    self.log_error(tool, e)

        return examples

Step 4: Quality control

Not every generated example is useful. We apply filters:

class QualityFilter:
    def filter(self, examples):
        filtered = []
        for ex in examples:
            # Examples are dicts (see AdversarialGenerator), so index by key
            audio = ex["audio"]

            # Audio too short/long
            if not (1.0 <= duration(audio) <= 10.0):
                continue

            # Silence detection
            if self.is_mostly_silence(audio):
                continue

            # Noise/quality check
            if self.snr(audio) < 15:  # dB
                continue

            # Duplicate detection
            if self.is_duplicate(audio, filtered):
                continue

            filtered.append(ex)

        return filtered
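
The silence and SNR checks are not spelled out above, so here is a minimal energy-based sketch of what they could look like. It assumes audio as a list of float samples in [-1, 1] at 16 kHz. The frame length, the 0.01 RMS threshold and the function names are illustrative assumptions, and the SNR estimate is deliberately crude.

```python
import math


def frame_rms(samples, frame_len=400):
    """RMS energy per non-overlapping frame (400 samples = 25 ms at 16 kHz)."""
    starts = range(0, len(samples) - frame_len + 1, frame_len)
    return [
        math.sqrt(sum(s * s for s in samples[i:i + frame_len]) / frame_len)
        for i in starts
    ]


def is_mostly_silence(samples, threshold=0.01, max_silent_ratio=0.8):
    """True when more than `max_silent_ratio` of frames fall below `threshold`."""
    rms = frame_rms(samples)
    if not rms:
        return True
    silent = sum(1 for r in rms if r < threshold)
    return silent / len(rms) > max_silent_ratio


def estimate_snr_db(samples, threshold=0.01):
    """Crude SNR: mean power of active frames over mean power of quiet frames."""
    rms = frame_rms(samples)
    active = [r * r for r in rms if r >= threshold]
    quiet = [r * r for r in rms if r < threshold]
    if not active or not quiet:
        return None  # cannot estimate without both speech and background
    noise_power = sum(quiet) / len(quiet)
    if noise_power == 0:
        return math.inf
    return 10 * math.log10((sum(active) / len(active)) / noise_power)
```

A clip with no quiet frames returns None here, so a caller would have to decide whether to keep or drop it.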

Layer 3: Adaptive detection model

Model architecture

Sentinel-3 uses AASIST (Audio Anti-Spoofing using Integrated Spectro-Temporal Graph Attention Networks) as its backbone.

Why AASIST:

SOTA performance on ASVspoof 2021 (EER 0.83%)
Generalizes well to unseen attacks
Reasonably fast for production (150-200ms)
Architecture allows fine-tuning without catastrophic forgetting

Modifications on top of vanilla AASIST:

import torch
import torch.nn as nn

class SentinelAASIST(nn.Module):
    def __init__(self):
        super().__init__()

        # Base AASIST encoder
        self.encoder = AASISTEncoder(
            input_channels=1,
            out_channels=128
        )

        # Our addition: graph attention with historical context
        self.graph_attention = TemporalGraphAttention(
            hidden_dim=128,
            num_heads=4,
            temporal_window=10  # Takes historical examples into account
        )

        # Classifier head with uncertainty estimation
        self.classifier = nn.Sequential(
            nn.Linear(128, 64),
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(64, 2)  # bonafide vs spoof
        )

        # Uncertainty head (for confidence scores)
        self.uncertainty = nn.Linear(128, 1)

    def forward(self, x, historical_context=None):
        # Extract features
        features = self.encoder(x)

        # Apply graph attention with context
        if historical_context is not None:
            features = self.graph_attention(features, historical_context)

        # Classify
        logits = self.classifier(features)
        uncertainty = torch.sigmoid(self.uncertainty(features))

        return logits, uncertainty

Fine-tuning strategy

When examples from a new tool arrive:

1. Continual learning

We do not retrain from scratch. We use continual learning to update the model without forgetting earlier threats.

import random

import torch
from torch.utils.data import DataLoader

class ContinualLearner:
    def __init__(self, base_model):
        self.model = base_model
        self.replay_buffer = ReplayBuffer(size=10000)
        self.optimizer = torch.optim.AdamW(
            self.model.parameters(),
            lr=1e-5,  # Low learning rate
            weight_decay=0.01
        )

    def update(self, new_examples):
        # Mix new examples with historical ones (replay)
        old_examples = self.replay_buffer.sample(
            n=len(new_examples) // 2
        )

        batch = new_examples + old_examples
        random.shuffle(batch)

        # Fine-tune with strong regularization
        for epoch in range(3):  # Few epochs
            for mini_batch in DataLoader(batch, batch_size=32):
                loss = self.compute_loss(mini_batch)

                # Elastic Weight Consolidation (EWC):
                # penalizes large changes to important parameters
                ewc_loss = self.compute_ewc_penalty()

                total_loss = loss + 0.1 * ewc_loss

                self.optimizer.zero_grad()
                total_loss.backward()
                torch.nn.utils.clip_grad_norm_(self.model.parameters(), 1.0)
                self.optimizer.step()

        # Add new examples to replay buffer
        self.replay_buffer.add(new_examples)
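
The replay buffer itself is not shown above. One common way to keep a fixed-size, unbiased sample of everything seen so far is reservoir sampling. The sketch below uses that technique. It is an assumption about how such a buffer could work, not a description of Sentinel-3's implementation.

```python
import random


class ReplayBuffer:
    """Fixed-size buffer that keeps a uniform sample of all examples seen."""

    def __init__(self, size: int, seed=None):
        self.size = size
        self.items = []
        self.seen = 0  # total number of examples ever offered
        self.rng = random.Random(seed)

    def add(self, examples):
        for example in examples:
            self.seen += 1
            if len(self.items) < self.size:
                self.items.append(example)
            else:
                # Reservoir sampling (Algorithm R): keep the newcomer with
                # probability size / seen, replacing a random resident.
                j = self.rng.randrange(self.seen)
                if j < self.size:
                    self.items[j] = example

    def sample(self, n: int):
        """Draw up to n distinct stored examples."""
        return self.rng.sample(self.items, min(n, len(self.items)))
```

With this policy older tools stay represented in proportion to how many examples they contributed. A per-tool quota would be an alternative if small tools must never be crowded out.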

2. Continuous validation

After every update, we validate against the ENTIRE history:

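class Validator:
    def __init__(self, test_sets, baseline_eer):
        # test_sets contains examples from ALL known tools
        self.test_sets = test_sets
        # baseline_eer maps each tool to the EER of the currently deployed model
        self.baseline_eer = baseline_eer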

    def validate(self, updated_model):
        results = {}

        for tool, test_set in self.test_sets.items():
            eer = compute_eer(updated_model, test_set)
            results[tool] = eer

            # Regression detection
            if eer > self.baseline_eer[tool] * 1.2:  # 20% worse
                raise RegressionDetected(
                    f"Performance degraded on {tool}: "
                    f"{self.baseline_eer[tool]:.2%} → {eer:.2%}"
                )

        return results
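
For readers less familiar with the metric: the equal error rate is the error at the threshold where false positives and false negatives balance. Below is a small pure-Python sketch that computes it from scores and labels. The eer_from_scores name and the convention that higher scores mean "more likely spoof" are assumptions for illustration. The compute_eer function used above takes a model and a test set instead.

```python
import math


def eer_from_scores(scores, labels):
    """Approximate EER. labels: 1 = spoof, 0 = bonafide; higher score = spoof."""
    n_spoof = sum(labels)
    n_bona = len(labels) - n_spoof
    best_gap, eer = math.inf, None
    for threshold in sorted(set(scores)):
        # Predict "spoof" when score >= threshold
        false_pos = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= threshold)
        false_neg = sum(1 for s, y in zip(scores, labels) if y == 1 and s < threshold)
        fpr = false_pos / n_bona
        fnr = false_neg / n_spoof
        gap = abs(fpr - fnr)
        if gap < best_gap:
            # Closest point to FPR == FNR so far; report their midpoint
            best_gap, eer = gap, (fpr + fnr) / 2
    return eer
```

With the 20% tolerance used in the validator, a tool whose baseline EER is 1.0% triggers RegressionDetected once its EER rises above 1.2%.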

3. Gradual rollout

New versions do not go straight to production:

class GradualRollout:
    stages = [
        ("shadow", 0.05, 24),    # 5% shadow mode, 24h
        ("canary", 0.10, 12),    # 10% canary, 12h
        ("gradual", 0.50, 6),    # 50% gradual, 6h
        ("full", 1.00, None)     # 100% rollout
    ]

    async def rollout(self, new_version):
        for stage_name, traffic_pct, duration_hours in self.stages:
            # Route traffic
            await self.route_traffic(new_version, traffic_pct)

            # Monitor
            metrics = await self.monitor(duration_hours)

            # Check for issues
            if metrics.error_rate > 0.01 or metrics.latency_p99 > 500:
                await self.rollback(new_version)
                raise RolloutFailed(stage_name, metrics)

            self.log(f"Stage {stage_name} passed")

        self.log("Rollout complete")
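
How traffic is split at each stage is not shown. A common approach is to hash a stable request or customer identifier into a bucket, so the same caller always sees the same version within a stage. The sketch below illustrates that idea. The function names and the choice of SHA-256 are assumptions, not Sentinel-3's actual router.

```python
import hashlib


def bucket(request_id: str) -> float:
    """Map an identifier to a stable value in [0, 1)."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


def use_new_version(request_id: str, traffic_pct: float) -> bool:
    """True when this request falls inside the new version's traffic share."""
    return bucket(request_id) < traffic_pct
```

Because buckets are fixed, anyone included at the 10% canary stage stays included at 50%. That keeps behavior consistent for a caller as the rollout widens.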

Inference in production

Feature extraction pipeline

import io

import librosa
import numpy as np
import torch

class FeatureExtractor:
    def __init__(self):
        self.sample_rate = 16000
        self.n_fft = 512
        self.hop_length = 160

    def extract(self, audio_bytes):
        # Decode audio
        audio, sr = librosa.load(io.BytesIO(audio_bytes), sr=self.sample_rate)

        # Voice Activity Detection (only process segments that contain speech)
        vad_segments = self.detect_voice_activity(audio)
        audio = self.extract_voice_segments(audio, vad_segments)

        # Normalize
        audio = self.normalize(audio)

        # Extract features
        # 1. Linear spectrogram
        spec = librosa.stft(audio, n_fft=self.n_fft, hop_length=self.hop_length)
        spec_db = librosa.amplitude_to_db(np.abs(spec))

        # 2. CQT (Constant-Q Transform) - better for voice
        cqt = librosa.cqt(audio, sr=self.sample_rate)
        cqt_db = librosa.amplitude_to_db(np.abs(cqt))

        # 3. LFCC (Linear Frequency Cepstral Coefficients)
        lfcc = self.compute_lfcc(audio)

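        # Stack features. The three maps differ in bin count and hop size, so
        # they must first be brought to a common (freq, time) grid;
        # resize_to_common_shape is a hypothetical helper, not shown here.
        features = np.stack(self.resize_to_common_shape([spec_db, cqt_db, lfcc]), axis=0)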

        return torch.from_numpy(features).float()
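
The STFT parameters determine the shape of the spectrogram, which is where the 257 frequency bins in the model's example input come from. The arithmetic is sketched below. It assumes librosa's default centered framing, and stft_shape is a hypothetical helper used only for illustration.

```python
SAMPLE_RATE = 16_000
N_FFT = 512
HOP_LENGTH = 160


def stft_shape(num_samples: int) -> tuple[int, int]:
    """(frequency bins, frames) for a centered STFT of `num_samples` samples."""
    freq_bins = N_FFT // 2 + 1             # one-sided spectrum
    frames = 1 + num_samples // HOP_LENGTH
    return freq_bins, frames


hop_ms = 1000 * HOP_LENGTH / SAMPLE_RATE    # time step between frames
window_ms = 1000 * N_FFT / SAMPLE_RATE      # duration covered by each frame
one_second = stft_shape(SAMPLE_RATE)        # shape for 1 s of audio
```

So each frame advances by 10 ms and spans 32 ms, and roughly 100 frames correspond to one second of speech.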

Inference optimizations

TorchScript compilation:

import torch

# Compile the model to TorchScript for a speedup
model = SentinelAASIST()
model.load_state_dict(torch.load("sentinel-3.pth"))
model.eval()

# Trace with an example input. Tracing records only the path taken for this
# call, i.e. the branch with historical_context=None.
example_input = torch.randn(1, 3, 257, 100)  # (batch, features, freq, time)
traced_model = torch.jit.trace(model, example_input)

# Save
torch.jit.save(traced_model, "sentinel-3-traced.pt")

# Inference is ~2x faster

ONNX + TensorRT for GPU:

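# Convert to ONNX
torch.onnx.export(
    model,
    example_input,
    "sentinel-3.onnx",
    input_names=["audio_features"],
    output_names=["logits", "uncertainty"],
    dynamic_axes={
        "audio_features": {0: "batch", 3: "time"},
        "logits": {0: "batch"},
        "uncertainty": {0: "batch"},
    },
)

# Optimize with TensorRT.
# deserialize_cuda_engine expects a serialized TensorRT engine, so the ONNX
# file must first be built into an engine (for example with trtexec or
# trt.Builder + trt.OnnxParser); serialized_engine stands for those bytes.
import tensorrt as trt
engine = trt.Runtime(...).deserialize_cuda_engine(serialized_engine)

# Inference is ~5x faster on GPU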

Dynamic batching:

import asyncio
import time

import torch

class InferenceServer:
    def __init__(self, model, max_batch_size=32, max_wait_ms=50):
        self.model = model
        self.batch_queue = asyncio.Queue()
        self.max_batch_size = max_batch_size
        self.max_wait_ms = max_wait_ms
        self._worker = None

    def start(self):
        # Start the batch processor; must be called from a running event loop
        self._worker = asyncio.create_task(self.process_batches())

    async def predict(self, audio):
        # Add to queue
        future = asyncio.get_running_loop().create_future()
        await self.batch_queue.put((audio, future))

        # Wait for result
        return await future

    async def process_batches(self):
        while True:
            # Block until the first request arrives, then fill the batch
            audio, future = await self.batch_queue.get()
            batch, futures = [audio], [future]
            deadline = time.monotonic() + self.max_wait_ms / 1000

            while len(batch) < self.max_batch_size:
                remaining = deadline - time.monotonic()
                if remaining <= 0:
                    break
                try:
                    audio, future = await asyncio.wait_for(
                        self.batch_queue.get(), timeout=remaining
                    )
                except asyncio.TimeoutError:
                    break
                batch.append(audio)
                futures.append(future)

            # Batched inference. The model returns (logits, uncertainty);
            # inputs are assumed to share one shape so they can be stacked.
            with torch.no_grad():
                logits, uncertainty = self.model(torch.stack(batch))

            # Distribute per-request results
            for future, logit, unc in zip(futures, logits, uncertainty):
                future.set_result((logit, unc))

Benchmarks and evaluation

Evaluation datasets

Dataset	Description	Size	Use
ASVspoof 2019 LA	Logical access attacks	25K utterances	Baseline validation
ASVspoof 2021 DF	Deepfake detection	611K utterances	Primary test set
In-the-Wild	Real customer audio	100K utterances	Production validation
Adversarial	Generated from newly detected tools	~50K utterances	Continual eval

Metrics

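import numpy as np
from sklearn.metrics import roc_curve

class EvaluationMetrics:
    def compute(self, predictions, labels):
        # Primary metric: EER (operating point where FNR equals FPR)
        fpr, tpr, thresholds = roc_curve(labels, predictions)
        fnr = 1 - tpr
        eer_idx = np.nanargmin(np.abs(fnr - fpr))
        eer_threshold = thresholds[eer_idx]
        eer = fpr[eer_idx]

        # At production operating point (FAR = 1%).
        # fpr is sorted ascending, so the last point with FAR <= target has
        # the lowest FRR; index [0] would always pick the FRR = 1.0 endpoint.
        target_far = 0.01
        frr_at_far = fnr[fpr <= target_far][-1]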

        # Detection Cost Function (ASVspoof metric)
        c_miss = 1.0
        c_fa = 1.0
        p_target = 0.05
        dcf = c_miss * fnr * p_target + c_fa * fpr * (1 - p_target)
        min_dcf = np.min(dcf)

        return {
            "eer": eer,
            "eer_threshold": eer_threshold,
            "frr_at_1pct_far": frr_at_far,
            "min_dcf": min_dcf
        }
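
A small worked example may make the operating-point metrics more concrete. Given a handful of ROC points, the FRR at a FAR target is the lowest miss rate among points that respect the target. The minimum DCF is the lowest weighted cost over all points. The numbers below are made up for illustration, and the helper names are hypothetical.

```python
def frr_at_far(fpr, fnr, target_far=0.01):
    """Lowest FRR among operating points whose FAR does not exceed the target."""
    eligible = [miss for fa, miss in zip(fpr, fnr) if fa <= target_far]
    return min(eligible) if eligible else 1.0


def min_dcf(fpr, fnr, p_target=0.05, c_miss=1.0, c_fa=1.0):
    """Minimum detection cost over all operating points."""
    return min(
        c_miss * miss * p_target + c_fa * fa * (1 - p_target)
        for fa, miss in zip(fpr, fnr)
    )


# Illustrative ROC points, ordered from strictest to loosest threshold
fpr_points = [0.0, 0.005, 0.01, 0.05, 1.0]
fnr_points = [1.0, 0.20, 0.08, 0.02, 0.0]

operating_frr = frr_at_far(fpr_points, fnr_points)
cost = min_dcf(fpr_points, fnr_points)
```

With p_target at 0.05, false alarms carry 19 times the weight of misses, so the cheapest point sits at a low FAR even though its FRR is not the smallest available.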

Comparative results

System	ASVspoof 2021 EER	In-the-Wild EER	Latency (p95)
Baseline AASIST	0.83%	2.1%	180ms
AASIST + fine-tuning	0.71%	1.8%	180ms
Sentinel-3	0.65%	1.2%	220ms

Design decisions and trade-offs

1. Continual learning vs retraining from scratch

Decision: Continual learning with a replay buffer

Trade-off:

✅ Much faster (hours vs days)
✅ No need to re-download all historical data
❌ Theoretical risk of catastrophic forgetting
❌ Performance may fall short of the absolute optimum

Mitigation: Large replay buffer (10K examples) + EWC regularization
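
To show what the EWC term does, here is a toy sketch with plain Python lists instead of tensors. It follows the usual formulation from Kirkpatrick et al.: each parameter's squared drift from its previous value is weighted by an importance estimate (the diagonal Fisher information). The function names are hypothetical, and the 0.1 weight mirrors the fine-tuning loop shown earlier.

```python
def ewc_penalty(params, anchor_params, fisher):
    """Importance-weighted squared drift from the previous parameters."""
    return sum(
        f * (p - a) ** 2
        for p, a, f in zip(params, anchor_params, fisher)
    )


def total_loss(task_loss, params, anchor_params, fisher, ewc_weight=0.1):
    """Task loss on the new data plus the weighted EWC penalty."""
    return task_loss + ewc_weight * ewc_penalty(params, anchor_params, fisher)


anchor = [1.0, 1.0]    # parameters after learning earlier threats
fisher = [10.0, 0.5]   # first parameter matters much more for old tasks

cheap_move = ewc_penalty([1.0, 2.0], anchor, fisher)   # unimportant parameter moved
costly_move = ewc_penalty([2.0, 1.0], anchor, fisher)  # important parameter moved
```

Moving the unimportant parameter by one unit costs far less than moving the important one. This steers fine-tuning toward changes that leave earlier detections intact.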

2. Single model vs ensemble

Decision: Single adaptive model

Trade-off:

✅ Lower latency
✅ Simpler to maintain
❌ Does not benefit from ensemble diversity
❌ Single point of failure

Mitigation: Model versioning + automatic rollback

3. Automatic generation vs manual curation

Decision: Automatic generation with QC

Trade-off:

✅ Scalable
✅ Fast response
❌ Can generate low-quality examples
❌ Requires complex infrastructure

Mitigation: Quality filters + manual spot-checking

4. GPU vs CPU inference

Decision: Both, with intelligent routing

class InferenceRouter:
    def route(self, request):
        # Requests that need low latency → GPU
        if request.latency_requirement == "real-time":
            return self.gpu_endpoint

        # Batch requests → GPU (more efficient)
        if request.batch_size > 8:
            return self.gpu_endpoint

        # Non-urgent single requests → CPU
        return self.cpu_endpoint

Technical conclusion

Sentinel-3 is a complex system that combines:

ML systems engineering (continual learning, model versioning, gradual rollout)
Security engineering (sandboxing, threat intelligence, adversarial generation)
Production ML (low-latency inference, monitoring, reliability)

The result is a system that evolves as fast as the threats it fights.

References

Papers:

AASIST: Jung et al., 2021
ASVspoof Challenge: Nautsch et al., 2021
Continual Learning: Kirkpatrick et al., 2017

Code:

AASIST implementation: GitHub
ASVspoof baseline: GitHub

