Lip Reader

Lip Reader is a deep learning project that recognizes speech from video input by reading the speaker's lip movements.

The model stacks 3D Convolutional Neural Network (3D-CNN) layers followed by Bidirectional Long Short-Term Memory (Bi-LSTM) layers to predict words or sentences from sequences of video frames.

Tech Stack:

TensorFlow

Keras

MediaPipe

OpenCV

NumPy

Streamlit

Jupyter Notebook

Git

GitHub

Key Features:

  • End-to-End Deep Learning Workflow:

    Implements a robust workflow for sequence-to-sequence tasks, including data loading, preprocessing, model building, training, evaluation, and prediction.

  • Efficient Data Processing:

    Processes video data by extracting frames, converting them to grayscale, cropping the mouth region, and normalizing pixel values for stable training.
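    The preprocessing steps above can be sketched in plain NumPy. The luminance weights match OpenCV's standard RGB-to-grayscale conversion; the default crop coordinates are illustrative placeholders, not the project's actual values:

    ```python
    import numpy as np

    def preprocess_frames(frames: np.ndarray,
                          crop=(slice(190, 236), slice(80, 220))) -> np.ndarray:
        """frames: (T, H, W, 3) uint8 RGB video frames -> (T, h, w, 1) float array."""
        # Grayscale via the standard luminance weights (same as cv2.cvtColor RGB2GRAY).
        gray = frames @ np.array([0.299, 0.587, 0.114])
        # Crop the mouth region (coordinates here are illustrative, not the project's).
        cropped = gray[:, crop[0], crop[1]]
        # Normalize the whole clip to zero mean, unit variance.
        mean, std = cropped.mean(), cropped.std()
        return ((cropped - mean) / (std + 1e-8))[..., None]
    ```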

  • Advanced Neural Network Architecture:

    Combines 3D Convolutional Neural Networks (3D-CNN) and Bidirectional LSTMs to capture spatial, temporal, and sequential patterns effectively.
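    A minimal Keras sketch of such an architecture is shown below. The layer sizes, frame count, and vocabulary size are assumptions for illustration, not the project's exact configuration:

    ```python
    import tensorflow as tf
    from tensorflow.keras import layers, models

    def build_model(frames=75, height=46, width=140, vocab_size=40):
        """3D-CNN front end + Bi-LSTM back end; dimensions are hypothetical."""
        return models.Sequential([
            layers.Input(shape=(frames, height, width, 1)),
            # 3D convolutions capture spatial + short-range temporal patterns.
            layers.Conv3D(32, 3, padding="same", activation="relu"),
            layers.MaxPool3D((1, 2, 2)),  # pool only spatially, keep all timesteps
            layers.Conv3D(64, 3, padding="same", activation="relu"),
            layers.MaxPool3D((1, 2, 2)),
            # Collapse spatial dims so the LSTMs see one vector per frame.
            layers.TimeDistributed(layers.Flatten()),
            layers.Bidirectional(layers.LSTM(128, return_sequences=True)),
            layers.Bidirectional(layers.LSTM(128, return_sequences=True)),
            # vocab_size + 1 outputs: the extra class is the CTC blank token.
            layers.Dense(vocab_size + 1, activation="softmax"),
        ])
    ```

    Pooling only over the spatial axes preserves the full frame sequence, which CTC training requires.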

  • CTC Loss for Sequence Alignment:

    Uses Connectionist Temporal Classification (CTC) loss to train the model without explicit alignment between input sequences and output labels.
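    In Keras this is commonly expressed with `tf.keras.backend.ctc_batch_cost`, which takes per-sample input and label lengths instead of frame-level alignments. A minimal sketch (the project's exact training loop is assumed, not shown):

    ```python
    import tensorflow as tf

    def ctc_loss(y_true, y_pred, input_length, label_length):
        """y_pred: (batch, timesteps, vocab + 1) softmax outputs, last class = blank.
        y_true: (batch, max_label_len) integer labels; no alignment is required."""
        return tf.keras.backend.ctc_batch_cost(y_true, y_pred,
                                               input_length, label_length)
    ```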

  • Model Evaluation with Word Error Rate:

    Assesses model performance using Word Error Rate (WER) and detailed error analysis to refine accuracy.
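    WER is the word-level edit distance (substitutions, insertions, deletions) divided by the reference length. A self-contained sketch of the standard dynamic-programming computation:

    ```python
    def wer(reference: str, hypothesis: str) -> float:
        """Word Error Rate: word-level Levenshtein distance / reference length."""
        ref, hyp = reference.split(), hypothesis.split()
        # d[i][j] = edit distance between ref[:i] and hyp[:j]
        d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
        for i in range(len(ref) + 1):
            d[i][0] = i  # deleting all i reference words
        for j in range(len(hyp) + 1):
            d[0][j] = j  # inserting all j hypothesis words
        for i in range(1, len(ref) + 1):
            for j in range(1, len(hyp) + 1):
                sub = 0 if ref[i - 1] == hyp[j - 1] else 1
                d[i][j] = min(d[i - 1][j] + 1,        # deletion
                              d[i][j - 1] + 1,        # insertion
                              d[i - 1][j - 1] + sub)  # match / substitution
        return d[len(ref)][len(hyp)] / max(len(ref), 1)
    ```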

  • Pretrained Model Deployment:

    Provides pretrained model weights for quick inference, so the model can be applied to real-world input without retraining from scratch.
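    Once the weights are loaded (e.g. via `model.load_weights(...)`), the per-frame softmax outputs can be collapsed into text with a greedy CTC decode: take the argmax at each timestep, merge repeats, and drop blanks. A sketch with a hypothetical vocabulary (the real project defines its own character set):

    ```python
    import numpy as np

    # Hypothetical character vocabulary; the CTC blank is the extra last class.
    VOCAB = list("abcdefghijklmnopqrstuvwxyz'?!123456789 ")

    def greedy_ctc_decode(y_pred: np.ndarray) -> str:
        """y_pred: (timesteps, len(VOCAB) + 1) softmax output for one clip."""
        blank = len(VOCAB)          # last class index is the CTC blank
        best = y_pred.argmax(axis=-1)
        chars, prev = [], -1
        for idx in best:
            # Merge consecutive repeats, then drop blanks.
            if idx != prev and idx != blank:
                chars.append(VOCAB[idx])
            prev = idx
        return "".join(chars)
    ```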