We have around 100 Parquet files in S3, each up to 300 MB in size. We need to do some processing on those files and then insert the results into DynamoDB.
Currently we are using a Lambda to do this; for just 3 files, it takes around a minute and a half and 900 MB of Lambda memory.
Lambda has both memory and time constraints: a maximum of 15 minutes of execution time and 3 GB of memory.
It can stop in the middle because of an out-of-memory error, a timeout, or any other error, and the data might be left in an inconsistent state.
The data load is only done once in a while.
There is no relationship between any of the files; they can be processed independently of each other.
We usually create a new table, load all the data into it, and then delete the old table.
Can you please suggest any approaches that could solve this problem?
Any help would be much appreciated
An AWS Lambda function is designed for small tasks, with execution time limited to 15 minutes; this limit cannot be increased. As you mentioned, a single 300 MB file finishes well within this timeout, so you can continue using Lambda in your scenario.

However, for bigger tasks there is the AWS Batch service. Batch computing runs jobs asynchronously and automatically across multiple compute instances. You can set up a queue of jobs, one per S3 file to process. AWS Batch will execute your tasks on the EC2 instance type you choose, and it will automatically shut down all resources when they finish, so you are charged only for the resources used during execution. With a small script you can write the results into DynamoDB as well.
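As a rough illustration, the per-file task that AWS Batch would run could look like the sketch below: read one Parquet file from S3, convert the rows into DynamoDB items, and batch-write them. The bucket, key, and table names are placeholders, and the float-to-Decimal conversion is only one example of the processing you might need (DynamoDB does not accept Python floats).

```python
from decimal import Decimal

def to_dynamo_item(record):
    """DynamoDB rejects Python floats, so convert them to Decimal.
    This is a minimal example transform; real processing would go here."""
    return {k: Decimal(str(v)) if isinstance(v, float) else v
            for k, v in record.items()}

def process_file(bucket, key, table_name):
    """Process one Parquet file from S3 and load it into DynamoDB.
    Names are placeholders; run this inside the Batch container."""
    # Lazy imports so the pure helper above stays usable without AWS libraries.
    import boto3
    import pandas as pd

    # Reading directly from S3 requires the s3fs and pyarrow packages.
    df = pd.read_parquet(f"s3://{bucket}/{key}")
    items = [to_dynamo_item(r) for r in df.to_dict(orient="records")]

    table = boto3.resource("dynamodb").Table(table_name)
    # batch_writer() groups puts into batches of 25 and retries unprocessed items.
    with table.batch_writer() as writer:
        for item in items:
            writer.put_item(Item=item)
```

Because the files are independent, one such job per file lets Batch run many of them in parallel, and a failed job can simply be resubmitted for that one file.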
Follow these steps to create a batch job:

Go to the AWS Batch service in the console.

Select a compute environment that matches the needs of your processing.

Create a job definition. Provide a liberal amount of time for your program to finish its execution.

And finally, submit a job.
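The last step can also be scripted. The sketch below submits one Batch job per S3 key using boto3's `submit_job`, passing the file to process through environment variables. The queue name, job definition name, and bucket are all hypothetical placeholders.

```python
def batch_job_request(job_name, queue, job_def, bucket, key):
    """Build the arguments for one submit_job call.
    All names here are placeholders for your own resources."""
    return {
        "jobName": job_name,
        "jobQueue": queue,
        "jobDefinition": job_def,
        # The container reads these variables to know which file to process.
        "containerOverrides": {
            "environment": [
                {"name": "S3_BUCKET", "value": bucket},
                {"name": "S3_KEY", "value": key},
            ]
        },
    }

def submit_all(keys, queue="parquet-queue", job_def="parquet-loader:1",
               bucket="my-bucket"):
    """Submit one independent job per Parquet file."""
    import boto3  # lazy import: only needed when actually submitting

    batch = boto3.client("batch")
    for i, key in enumerate(keys):
        batch.submit_job(**batch_job_request(f"parquet-load-{i}", queue,
                                             job_def, bucket, key))
```

Since each job handles exactly one file, a timeout or out-of-memory failure affects only that file, and retrying it cannot leave the rest of the load inconsistent.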
You can read more about AWS Batch here.

Follow this link to learn more about the use cases AWS Batch is used for.