📄Automated Text Extraction from PDFs Using AWS Services: Textract, Lambda, SQS, and SNS⚙️

This blog introduces a streamlined, automated solution leveraging AWS services—Textract, Lambda, SQS, and SNS—to simplify text extraction from PDFs as soon as they are uploaded to an S3 bucket.

Published Sep 11, 2024 • 3 min read

In today's fast-paced business world, efficiency is the cornerstone of success. Repetitive manual tasks drain valuable time and resources, which makes automation a critical factor in optimizing operations. One such task is extracting text from PDF files, a process that is both time-consuming and error-prone when done by hand. By automating PDF text extraction, businesses can simplify workflows, enhance productivity, and focus on what truly matters: growth and innovation.

🔍 Overview

Our solution harnesses the power of AWS to deliver an effective and scalable text extraction process:

Amazon S3: Stores your PDF files.

Amazon Textract: Extracts text from the PDFs.

AWS Lambda: Triggers and manages the extraction workflow.

Amazon SQS: Buffers Textract job-completion messages so they can be processed asynchronously.

Amazon SNS: Relays Textract's job-completion messages to the queue and notifies you once the extracted text is ready.

🏛️ Architecture

The architecture for our automated text extraction solution is designed to be both efficient and reliable:

Upload PDF to S3: PDFs are uploaded to a designated S3 bucket.
Trigger Lambda Function: The upload event activates a Lambda function.
Invoke Textract: Lambda starts an asynchronous Textract job for text extraction.
Queue Processing: Textract publishes a completion message to an SNS topic, which delivers it to a subscribed SQS queue.
Text Processing and Notification: Another Lambda function processes SQS messages, retrieves the text, and sends a notification via SNS.

🪜 Step-by-Step Implementation

Step 1: Setting Up the S3 Bucket

Create an S3 bucket where users will upload their PDF files. Enable event notifications to trigger a Lambda function upon file upload.

aws s3api create-bucket --bucket pdf-extraction-bucket --region us-east-1

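Configure the bucket's event notifications to invoke the Lambda function from Step 2 whenever a new object is created (the s3:ObjectCreated:* event). S3 also needs permission to invoke that function.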
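The notification setup can also be scripted. The following is a minimal sketch using boto3, not a definitive implementation: the function names, the ".pdf" suffix filter, and the example ARN are assumptions made for illustration and do not come from the original setup. It also assumes the Lambda function already has a resource-based policy that lets s3.amazonaws.com invoke it.

```python
def build_notification_config(lambda_arn: str, suffix: str = ".pdf") -> dict:
    """Build an S3 notification configuration that invokes a Lambda
    function for every newly created object whose key ends in `suffix`."""
    return {
        "LambdaFunctionConfigurations": [
            {
                "LambdaFunctionArn": lambda_arn,
                "Events": ["s3:ObjectCreated:*"],
                "Filter": {
                    "Key": {
                        "FilterRules": [{"Name": "suffix", "Value": suffix}]
                    }
                },
            }
        ]
    }


def attach_notification(bucket: str, lambda_arn: str) -> None:
    """Apply the configuration to the bucket (replaces any existing one)."""
    # Imported here so the pure helper above works without the AWS SDK.
    import boto3

    s3 = boto3.client("s3")
    s3.put_bucket_notification_configuration(
        Bucket=bucket,
        NotificationConfiguration=build_notification_config(lambda_arn),
    )


if __name__ == "__main__":
    # Hypothetical function ARN; substitute your own account and function.
    attach_notification(
        "pdf-extraction-bucket",
        "arn:aws:lambda:us-east-1:123456789012:function:StartTextractJob",
    )
```

Filtering on the suffix keeps non-PDF uploads from starting Textract jobs; drop the Filter entry if every object in the bucket should be processed.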

Step 2: Creating the Lambda Function

Create a Lambda function to handle the file upload event and invoke Textract.

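import json
import urllib.parse

import boto3

# Create the client once per execution environment so warm invocations reuse it.
textract = boto3.client('textract')

def lambda_handler(event, context):
    job_ids = []

    # An S3 event can carry more than one record, so handle each upload.
    for record in event['Records']:
        s3_bucket = record['s3']['bucket']['name']
        # Object keys arrive URL-encoded in S3 events (spaces become '+').
        document = urllib.parse.unquote_plus(record['s3']['object']['key'])

        # Start an asynchronous job; Textract publishes the completion
        # status to the SNS topic named in NotificationChannel.
        response = textract.start_document_text_detection(
            DocumentLocation={
                'S3Object': {
                    'Bucket': s3_bucket,
                    'Name': document
                }
            },
            NotificationChannel={
                # Replace the placeholders with your own role and topic ARNs.
                'RoleArn': 'arn:aws:iam::account-id:role/TextractRole',
                'SNSTopicArn': 'arn:aws:sns:region:account-id:TextractTopic'
            }
        )
        job_ids.append(response['JobId'])

    return {
        'statusCode': 200,
        'body': json.dumps({'message': 'Textract job started', 'jobIds': job_ids})
    }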

Step 3: Configuring Textract

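Create an IAM role that Textract can assume (trusted principal textract.amazonaws.com) with permission to publish to the SNS topic. This is the role passed as RoleArn in Step 2. The Lambda function's own execution role needs permission to start Textract jobs and to read the uploaded objects from S3.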
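As a rough illustration of what that role could look like, here is a sketch that builds the two policy documents and creates the role with boto3. The helper names and the inline policy name are hypothetical, and the permissions shown are a minimal assumption: tighten or extend them to match your own account's requirements.

```python
import json


def textract_trust_policy() -> dict:
    """Trust policy that lets the Textract service assume the role."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"Service": "textract.amazonaws.com"},
                "Action": "sts:AssumeRole",
            }
        ],
    }


def sns_publish_policy(topic_arn: str) -> dict:
    """Permissions policy allowing publication to one SNS topic only."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {"Effect": "Allow", "Action": "sns:Publish", "Resource": topic_arn}
        ],
    }


def create_textract_role(role_name: str, topic_arn: str) -> str:
    """Create the role, attach the inline policy, and return the role ARN."""
    # Imported here so the policy helpers work without the AWS SDK.
    import boto3

    iam = boto3.client("iam")
    role = iam.create_role(
        RoleName=role_name,
        AssumeRolePolicyDocument=json.dumps(textract_trust_policy()),
    )
    iam.put_role_policy(
        RoleName=role_name,
        PolicyName="PublishTextractCompletion",  # hypothetical policy name
        PolicyDocument=json.dumps(sns_publish_policy(topic_arn)),
    )
    return role["Role"]["Arn"]
```

The ARN returned by create_textract_role is the value to use for RoleArn in the NotificationChannel of the first Lambda function.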

Step 4: Setting Up SQS and SNS

Create an SQS queue and an SNS topic. Textract publishes job completion notifications to the SNS topic named in the NotificationChannel from Step 2, and the topic then delivers those messages to the SQS queue.

aws sns create-topic --name TextractTopic
aws sqs create-queue --queue-name TextractQueue

Subscribe the SQS queue to the SNS topic.
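The subscription has two parts: the queue's access policy must allow the topic to send messages, and the topic must have the queue as a subscriber. A minimal sketch with boto3 follows, assuming you already have the queue URL, queue ARN, and topic ARN from the create commands; the function names are illustrative, not part of the original setup.

```python
import json


def queue_policy(queue_arn: str, topic_arn: str) -> dict:
    """Access policy that lets only the given SNS topic send to the queue."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"Service": "sns.amazonaws.com"},
                "Action": "sqs:SendMessage",
                "Resource": queue_arn,
                "Condition": {"ArnEquals": {"aws:SourceArn": topic_arn}},
            }
        ],
    }


def subscribe_queue(queue_url: str, queue_arn: str, topic_arn: str) -> str:
    """Grant the topic access to the queue, subscribe it, return the subscription ARN."""
    # Imported here so the policy helper works without the AWS SDK.
    import boto3

    sqs = boto3.client("sqs")
    sns = boto3.client("sns")

    # Without this policy, SNS deliveries to the queue are rejected.
    sqs.set_queue_attributes(
        QueueUrl=queue_url,
        Attributes={"Policy": json.dumps(queue_policy(queue_arn, topic_arn))},
    )
    subscription = sns.subscribe(
        TopicArn=topic_arn,
        Protocol="sqs",
        Endpoint=queue_arn,
        ReturnSubscriptionArn=True,
    )
    return subscription["SubscriptionArn"]
```

With raw message delivery left at its default (off), each SQS message body is an SNS envelope whose Message field holds Textract's JSON, which matters when the messages are parsed later.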

Step 5: Processing the Extracted Text

Create another Lambda function to process the messages in the SQS queue, retrieve the extracted text from Textract, and send a notification via SNS.

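import json
import boto3

# Create clients once per execution environment so warm invocations reuse them.
textract = boto3.client('textract')
sns = boto3.client('sns')

def lambda_handler(event, context):
    for record in event['Records']:
        # The SQS body is the SNS envelope; Textract's own JSON is in 'Message'
        # (assumes raw message delivery is off, the subscription default).
        envelope = json.loads(record['body'])
        message = json.loads(envelope['Message'])
        job_id = message['JobId']

        # Skip jobs that Textract reports as failed.
        if message.get('Status') != 'SUCCEEDED':
            print(f"Textract job {job_id} ended with status {message.get('Status')}")
            continue

        # Results are paginated, so follow NextToken until every page is read.
        lines = []
        next_token = None
        while True:
            kwargs = {'JobId': job_id}
            if next_token:
                kwargs['NextToken'] = next_token
            response = textract.get_document_text_detection(**kwargs)
            for block in response['Blocks']:
                if block['BlockType'] == 'LINE':
                    lines.append(block['Text'])
            next_token = response.get('NextToken')
            if not next_token:
                break
        extracted_text = '\n'.join(lines)

        # SNS messages are capped at 256 KB, so long documents are truncated here.
        body = f'Text extraction completed: {extracted_text}'
        sns.publish(
            TopicArn='arn:aws:sns:region:account-id:CompletionTopic',
            Message=body.encode('utf-8')[:250000].decode('utf-8', errors='ignore')
        )

    return {
        'statusCode': 200,
        'body': json.dumps('Text processing completed')
    }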

Step 6: Testing the Solution

Upload a PDF file to the S3 bucket and verify that the text extraction process completes successfully. You should receive a notification with the extracted text.
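Before testing end to end, it can help to check the message parsing locally. The sketch below builds a synthetic SQS-triggered event that mimics a Textract completion message wrapped in an SNS envelope, then unwraps it. The helper names and the job ID are made up for illustration, and the event contains only the fields needed here, not everything AWS sends.

```python
import json


def make_sqs_event(job_id: str, status: str = "SUCCEEDED") -> dict:
    """Build a minimal SQS-triggered Lambda event carrying a Textract
    completion message inside an SNS envelope (raw delivery disabled)."""
    textract_message = {
        "JobId": job_id,
        "Status": status,
        "API": "StartDocumentTextDetection",
    }
    sns_envelope = {
        "Type": "Notification",
        "Message": json.dumps(textract_message),  # nested JSON string
    }
    return {"Records": [{"body": json.dumps(sns_envelope)}]}


def extract_job(record: dict) -> tuple:
    """Unwrap one SQS record and return (job_id, status)."""
    envelope = json.loads(record["body"])
    message = json.loads(envelope["Message"])
    return message["JobId"], message["Status"]


if __name__ == "__main__":
    event = make_sqs_event("example-job-id")  # hypothetical job ID
    for record in event["Records"]:
        print(extract_job(record))
```

The same synthetic event can be pasted into the Lambda console's test feature, although the Textract lookup will only succeed with the JobId of a real job.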

✅ Conclusion

By automating the text extraction process using AWS services, you can save time and reduce errors associated with manual extraction. This solution is scalable, cost-effective, and easy to implement, making it ideal for businesses of all sizes.

For more information about our services and how Techlusion can help you automate your business processes, visit our website techlusion.io or contact us at info@techlusion.io.
