Gauging the quality of a tennis player by calibrating and analyzing the aces they notch up, or predicting the next Pele or Cristiano Ronaldo by training a machine on the goals scored by football players – problem statements like these, if truly answered, would receive acclaim the world over. All of them have one thing in common: they require huge (read: humongous) amounts of data.
As you enter this world of machine learning or data science, you realize –
what numbers are to a mathematician, data is to a data scientist.
One just cannot live without the other! The sheer existence of this realm of computer science stands on the fact that no matter what one feels, no matter what one’s opinion is – there is one eternal truth – Data does not lie. It just can’t!
Data (a funny English plural, of the often forgotten singular form – datum) is the only truth.
So what is data? "Facts and statistics collected together for reference or analysis," as the dictionary says. But simply stated, it is fodder for analysis.
Miracles can be worked on intuition, gut feeling, notions and beliefs – however, the fact of the matter is that in today's day and age, unless you have data to back up your standpoint, you might as well not have a standpoint. One cannot take crucial decisions without data to support them. This massive potential that data holds creates the need to go and get the data. By that I mean scraping web data from the right sources, in the right manner, to the right extent, by the right people, at the right time, serving the right purpose and reaching the right conclusion!
In comes the problem of data extraction, for one cannot always rely on the open-source datasets available. If you want to crunch data about cancer, or American football league numbers, you can blindly go ahead with the well-kept datasets provided online. However, we often come up with requirements that are specific, and data that meets those specifications is hard to find. This gives rise to the need for collecting "relevant" data. While data extracted from reliable and pertinent sources is nothing short of gold, many a time random, adulterated and incomplete streams or packets make their way into our data lake. Such a mishandled and poorly organized extraction is equivalent to fixing an arcane program running with 1000+ LOC. One would rather build a new mechanism than get into the mess involved in such a cleaning exercise (data or code). All of this highlights that collecting data is more of an art than a science. It has taken me some time of trying my hand at scraping for this realization to sink in.
“Data extraction is the act (read “art” and not science) or process of retrieving data out of (usually unstructured or poorly structured) data sources for further data processing or data storage (data migration).”
Data is being generated at a frenetic pace that is humanly impossible to digest and understand. Day in, day out, massive amounts of data are being churned out across the globe – gifting us an opportunity of humongous proportions. And the best thing about all this: one need not be a programmer or a trained professional to get going! As they rightly say, every person with a brain and logic is a data-man in disguise. So here we are, equipped with knowledge of what data is and why it is quintessential for anyone and everyone to know basic ways of extracting it. Let's march towards the How – the most dreaded yet least attempted of all questions.
For simplicity and ease of understanding, we will restrict ourselves for the time being to one particular method of data extraction – web scraping, the technique of extracting information from websites.
The tool we will be using is Scrapy – an application framework for crawling web sites and extracting structured data, which can be used for a wide range of useful applications like data mining, information processing or historical archival. So what are web scrapers and crawlers?
As the names aptly state, scrapers are entrusted with the job of scratching a website's URL and retrieving/storing the data present on that page, while crawlers are given the responsibility of marching from one URL to another. Scrapy gives us the opportunity to design both. While there are other tools alongside Scrapy that enable quick-fix solutions – Import.io, Octoparse, Dataminer, Data Scraping and others – Scrapy, being a Python-based framework, provides the ease, convenience and efficiency of programmable web scraping. The learning curve associated with Scrapy is negligible, and one can improvise real quick.
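To make that division of labour concrete, here is a minimal, hypothetical sketch of the scraper half in plain Python (standard library only, no Scrapy): it pulls link URLs and their anchor text out of an HTML page. The class name, the sample page and its links are all made up for illustration; a real Scrapy spider does the same job, plus the crawling, with far less ceremony.

```python
from html.parser import HTMLParser

class LinkScraper(HTMLParser):
    """Collects (href, anchor text) pairs from an HTML page --
    the 'scratching' half of a scraper's job. Illustrative only."""
    def __init__(self):
        super().__init__()
        self.links = []            # (href, text) pairs found so far
        self._current_href = None  # href of the <a> tag we are inside

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._current_href = dict(attrs).get("href")

    def handle_data(self, data):
        # Pair the anchor's text with the href we just saw
        if self._current_href and data.strip():
            self.links.append((self._current_href, data.strip()))
            self._current_href = None

# A tiny in-memory page stands in for a real HTTP response.
page = ('<html><body><a href="/players/pele">Pele</a> scored '
        '<a href="/stats/goals">1279 goals</a></body></html>')

scraper = LinkScraper()
scraper.feed(page)
print(scraper.links)
# [('/players/pele', 'Pele'), ('/stats/goals', '1279 goals')]
```

A crawler is then just the loop around this: take each extracted href, fetch it, and feed the new page back into the scraper.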
In short, Scrapy is an awesome leech for sucking in blood (data) from other mortals (websites).
More on it soon. Stay tuned!