Before we dig in, let's address why you would build an ETL pipeline in Python rather than use a dedicated ETL system or tool. After all, ETL platforms are developed and maintained by specialists who live and breathe ETL.
For most teams, ETL tools become the go-to once you start dealing with complex schemas and huge volumes of data. You certainly can use SQLAlchemy and Pandas to implement ETL in Python, but that approach becomes time-consuming, labor-intensive, and often overwhelming once your database grows complicated.
There are three main situations where Python makes sense.
1. You are genuinely comfortable with Python and have a complete grip on structuring your own ETL tooling.
2. You have very straightforward ETL requirements.
3. You have a niche, hyper-specific need that can only be met by writing a custom ETL solution in Python.
If you fit into one of those three categories, you have a wide variety of options available.
USE OF PANDAS
Pandas brings the concept of a DataFrame to Python and is widely used in the data science community for cleaning and analyzing datasets. It is extremely useful as the transformation stage of ETL because it makes manipulating data simple and intuitive.
• Widely used for data manipulation
• Simple, intuitive syntax
• Integrates well with other Python tools, including visualization libraries
• Support for common data formats (read from SQL databases, CSV files, and so on)
Because it loads all data into memory, it is not scalable and can be a poor choice for very large (larger-than-memory) datasets.
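To make the transformation role concrete, here is a minimal Pandas ETL sketch for a dataset that fits in memory. The column names and values are hypothetical, and an in-memory buffer stands in for real source and destination files:

```python
import io
import pandas as pd

# Hypothetical raw data standing in for an extracted CSV file.
raw = io.StringIO(
    "order_id,amount,country\n"
    "1,10.5,us\n"
    "2,,de\n"
    "3,7.25,us\n"
)

# Extract: read the source into a DataFrame.
df = pd.read_csv(raw)

# Transform: drop rows with missing amounts, normalize country codes,
# and derive a new column.
df = df.dropna(subset=["amount"])
df["country"] = df["country"].str.upper()
df["amount_cents"] = (df["amount"] * 100).astype(int)

# Load: write the cleaned result out (here, to an in-memory buffer
# rather than a real file or database).
out = io.StringIO()
df.to_csv(out, index=False)
print(out.getvalue())
```

The same three steps (read, clean/derive, write) are the skeleton of most Pandas-based pipelines; only the sources, sinks, and transformations change.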
To work around these limitations of Pandas, libraries such as Dask and Modin can help. They use parallel computing techniques to let Pandas-style code tackle large data in chunks, and they can scale out across distributed environments. On a single machine, a Dask DataFrame behaves much like a Pandas one.
Modin typically runs on Ray, a framework for task parallelization, while Dask ships with its own scheduler.
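To make the "chunks" idea concrete, plain Pandas can already stream a file in fixed-size pieces via `read_csv(chunksize=...)`; Dask and Modin automate this pattern and schedule the pieces in parallel. A minimal sketch, using a small in-memory buffer as a stand-in for a file too large to load at once:

```python
import io
import pandas as pd

# Hypothetical CSV standing in for a larger-than-memory file.
big_csv = io.StringIO("value\n" + "\n".join(str(i) for i in range(10)))

# Process the file in fixed-size chunks instead of one giant DataFrame.
# Dask/Modin apply the same idea, but run the chunks in parallel.
total = 0
for chunk in pd.read_csv(big_csv, chunksize=4):
    total += chunk["value"].sum()

print(total)
```

Each iteration only ever holds one chunk in memory, which is exactly the property that makes the chunked approach scale past RAM limits.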
• Increased performance with identical functionality, even on the same hardware
• Minimal code changes to switch from Pandas (often just amending the import statement)
• Modin supports most of the Pandas API.
• Ability to work with datasets that don't fit in memory
• Dask is designed to integrate with other Python libraries.
There are trade-offs as well:
• For some workloads, other optimizations improve Pandas performance more than parallelism does.
• For small computations, the parallel overhead outweighs the benefit.
• Some Pandas functions are not implemented in Dask.
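The "minimal code changes" point is nearly literal: with Modin installed, switching the import line is typically the only change needed. The sketch below runs on plain Pandas (the data and column names are made up), with the Modin variant shown as a comment:

```python
# With Modin installed, the only edit is the import line:
#   import modin.pandas as pd
# The code below would then run on parallel workers instead of one core.
import pandas as pd

# Hypothetical clickstream data.
df = pd.DataFrame({"user": ["a", "b", "a"], "clicks": [3, 1, 6]})

# A typical aggregation; the API is identical under Modin.
per_user = df.groupby("user")["clicks"].sum()
print(per_user)
```

This drop-in property is what makes Modin attractive for existing Pandas codebases, whereas Dask's API is merely similar and may require adjustments.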
For small datasets, Pandas is the best choice: it offers a rich set of functions and is quick to work with. When datasets grow too large, reach for options like Modin or Dask instead. Many other Python ETL tools are also available. Analyze your own needs, based on the size of your datasets and the transformations required, and choose among the tools above accordingly.