A Hands-On Guide to Data Science: Top Tools, Functions, and Practical Applications

An overview of data science applications, tools and techniques, plus information on what data scientists do and the skills they need

Statistics deals with collecting, analyzing, interpreting, and presenting masses of numerical data. Traditionally, data analysis was carried out on numerical data available in a well-defined format. However, with the emergence of the World Wide Web, vast amounts of data are now available online, and almost all businesses today have digital traces and an online presence. This massive amount of online data, available in different formats and maintained in different structures, brings new data analysis challenges, which led to the emergence of the field named “Data Science.”

 

The scale and heterogeneity of data affect both data processing and analytics. Sophisticated machine learning algorithms are required to manage and process the data intelligently. Data science covers the intelligent techniques that help process heterogeneous data at large scale, bringing significant change to data analysis and creating a need to develop and apply machine learning techniques at scale. The important tasks of data science are as follows:

 

Big Data Analytics

Big data analytics applies advanced analytic techniques to enormous, diverse data sets that include structured, semi-structured, and unstructured data from different sources, in sizes ranging from terabytes to zettabytes. With big data analytics, you can make better and faster decisions, model and predict future outcomes, and enhance business intelligence.

Open-source software such as Apache Hadoop, Apache Spark, and the broader Hadoop ecosystem provides cost-effective, flexible data processing and storage tools designed to handle the large volumes of data being generated.

 

Machine Learning

Machine learning involves algorithms and mathematical models, chiefly employed to make machines learn from data and adapt to new information. Machine learning techniques are required to process data available in different formats, e.g., text, audio, and video. Data-mining techniques such as frequent pattern mining, market basket analysis, and time series forecasting are instrumental in business decision making.

These techniques learn patterns from historical data, so the machine can predict outcomes for future months or years and provide intelligent recommendations. Due to the enormous amount and heterogeneity of data, traditional stochastic and probabilistic machine learning models are often insufficient for the task; hence, many deep learning-based techniques have recently evolved to perform data analytics. For example, large-scale deep learning models are now available to process audio, image, video, and text data.
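As a toy illustration of market basket analysis, the sketch below (plain Python, not any particular library, with made-up transaction data) counts how often pairs of items are bought together, which is the first step of frequent pattern mining:

```python
from collections import Counter
from itertools import combinations

# Hypothetical transaction data: each list is one customer's basket.
transactions = [
    ["bread", "milk", "eggs"],
    ["bread", "milk"],
    ["milk", "eggs"],
    ["bread", "eggs"],
    ["bread", "milk", "butter"],
]

# Count every unordered pair of items that appears in the same basket.
pair_counts = Counter()
for basket in transactions:
    for pair in combinations(sorted(set(basket)), 2):
        pair_counts[pair] += 1

# Keep pairs whose support (co-occurrence count) meets a threshold.
min_support = 3
frequent_pairs = {p: c for p, c in pair_counts.items() if c >= min_support}
print(frequent_pairs)  # {('bread', 'milk'): 3}
```

Production systems use smarter algorithms (e.g., Apriori or FP-Growth) to avoid enumerating every pair, but the notion of "support" is the same.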

 

Business Intelligence

Every business produces large amounts of data each day. Analyzed carefully and presented in visual reports with graphs, this data can bring good decision making to life, and data visualization alone can speed decisions up. In addition, data mining techniques such as market basket analysis help management make better decisions by surfacing the patterns and details behind the reports.

 

Data Science Tools

Processing raw data is a cumbersome process, so over the years there have been consistent efforts to develop software programs and tools that help process data. Data science tools are broadly divided into four categories: data visualization, data processing and aggregation, data warehousing, and statistical processing.

1. Data Visualization Tools: Data visualization tools help create visual representations of data, which can be helpful for preliminary inspection and fundamental statistical analysis. Various charts, such as bar graphs, pie charts, and histograms, can be created using data visualization tools. The following are some popular data visualization tools:

Apache Superset: Apache Superset is an open-source data exploration and visualization platform that can process big data. It provides an intuitive interface and a wide range of visualization builders for exploring datasets and crafting interactive dashboards. It supports most relational databases, processes SQL queries, and includes a lightweight semantic layer.

 

Matplotlib & Seaborn: Matplotlib is a data visualization library for Python. Using Matplotlib, we can create publication-quality plots and make interactive figures that can zoom, pan, and update. Matplotlib can also be embedded into JupyterLab and similar graphical user interfaces.

Seaborn is another library for making statistical graphics in Python. It builds on top of matplotlib and integrates closely with pandas data structures. Seaborn’s plotting functions operate on dataframes and arrays containing whole datasets and internally perform the necessary semantic mapping and statistical aggregation to produce informative plots.
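A minimal Matplotlib sketch, using invented monthly sales numbers, shows the typical workflow: build a figure, draw a chart on its axes, label it, and export it to a file.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend, so no display is needed
import matplotlib.pyplot as plt

# Hypothetical data: monthly sales figures.
months = ["Jan", "Feb", "Mar", "Apr"]
sales = [120, 135, 150, 142]

fig, ax = plt.subplots(figsize=(6, 4))
ax.bar(months, sales, color="steelblue")  # simple bar chart
ax.set_xlabel("Month")
ax.set_ylabel("Sales (units)")
ax.set_title("Monthly Sales")
fig.savefig("sales.png", dpi=150)         # export the plot as an image
```

Seaborn follows the same pattern but works directly on pandas dataframes and handles the statistical aggregation for you.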

Tableau: Tableau is data visualization software packed with powerful graphics for making interactive visualizations. It is focused on industries working in the field of business intelligence. The most crucial aspect of Tableau is its ability to interface with databases, spreadsheets, OLAP (Online Analytical Processing) cubes, and more. With these features, Tableau can visualize geographical data, plotting longitudes and latitudes on maps. You can also use its analytics tool to analyze data alongside the visualizations. While Tableau is enterprise software, it comes with a free version called Tableau Public.

Microsoft Power BI: Power BI is a business analytics service by Microsoft that provides an intuitive interface through which end users create their own reports and dashboards. Power BI is part of the Microsoft Power Platform, which provides interactive visualizations and business intelligence capabilities.

2. Data Processing and Aggregation Tools: Data processing and aggregation tools provide a framework to combine and process heterogeneous data from different sources. Some popular data aggregation tools are:

Spark: Apache Spark is a powerful analytics engine and one of the most widely used data science tools. Spark is specifically designed to handle both batch processing and stream processing. It comes with many APIs that give data scientists repeated access to data for machine learning, SQL storage, and more.

It is an improvement over Hadoop and can run up to 100 times faster than MapReduce. Spark has many machine learning APIs that help data scientists make robust predictions from the given data. Unlike analytical tools that process only historical data in batches, Spark can also process real-time data.

Spark offers APIs programmable in Python, Java, and R, but its most powerful pairing is with the Scala programming language. Spark is highly efficient at cluster management, whereas Hadoop is used mainly for storage; this cluster management system allows Spark to process applications at high speed.

Hadoop: The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale from single servers to thousands of machines, each offering local computation and storage.

Rather than rely on hardware to deliver high availability, the library itself is designed to detect and handle failures at the application layer, providing a highly-available service on top of a cluster of computers, each of which may be prone to failures.
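The "simple programming model" Hadoop popularized is MapReduce. A pure-Python toy version of the word-count job, with none of the distribution or fault tolerance the framework provides, makes the three phases concrete:

```python
from collections import defaultdict

def map_phase(document):
    # Map: emit a (word, 1) pair for every word in the input split.
    for word in document.split():
        yield (word, 1)

def shuffle(pairs):
    # Shuffle: group all values by key, as the framework does between phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: combine the values for each key into a final result.
    return {key: sum(values) for key, values in groups.items()}

documents = ["big data is big", "data is everywhere"]
pairs = [pair for doc in documents for pair in map_phase(doc)]
counts = reduce_phase(shuffle(pairs))
print(counts)  # {'big': 2, 'data': 2, 'is': 2, 'everywhere': 1}
```

In real Hadoop, each map and reduce task runs on the node that holds its data split, which is what makes the model scale to thousands of machines.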

 

Dask: Dask is a library for parallel computing in Python, which can be used for processing Big Data.

Dask is open source and freely available. It is developed in coordination with other community projects like NumPy, pandas, and scikit-learn. Dask provides Big Data components like parallel arrays, dataframes, and lists that extend standard interfaces like NumPy, Pandas, or Python iterators to larger-than-memory or distributed environments. These parallel collections run on top of dynamic task schedulers.
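The larger-than-memory idea behind Dask's collections can be sketched in plain Python: split the work into chunks, compute partial results per chunk, and combine them. Dask's task schedulers automate exactly this, potentially in parallel across a cluster:

```python
def chunked(seq, size):
    # Yield fixed-size chunks, standing in for Dask's partitioned collections.
    for i in range(0, len(seq), size):
        yield seq[i:i + size]

data = list(range(1, 10_001))  # pretend this is too big to process at once

# Compute a partial (sum, count) per chunk, then combine the partials.
partials = [(sum(c), len(c)) for c in chunked(data, 1_000)]
total = sum(s for s, _ in partials)
count = sum(n for _, n in partials)
mean = total / count
print(mean)  # 5000.5
```

With Dask the equivalent would be a `dask.array` or `dask.dataframe` call that looks like NumPy or pandas, while the chunking and scheduling happen behind the scenes.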

3. Data Warehousing Tools: A data warehouse provides a central repository that can receive and store data from different sources and in different formats. Data warehousing tools are used to store such data and perform business intelligence operations.

Amazon Redshift: Amazon Redshift is a data warehouse product that forms part of the larger cloud-computing platform Amazon Web Services. Redshift uses parallel processing and compression to decrease command execution time.

This allows Redshift to perform operations on billions of rows at once. This also makes Redshift useful for storing and analyzing large quantities of data from logs or live feeds through a source such as Amazon Kinesis.

 

Snowflake: Snowflake offers a cloud-based data warehouse for data storage and analysis, allowing corporate users to store and analyze data using cloud-based hardware and software. It has run on Amazon S3 since 2014, on Microsoft Azure since 2018, and on Google Cloud Platform since 2019. The company is credited with reviving the data warehouse industry by building and perfecting a cloud-based data platform.
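The core job these warehouses perform, consolidating records and answering aggregate business queries, can be sketched with Python's built-in sqlite3 standing in for an engine like Redshift or Snowflake (the table and column names below are invented for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")

# Load records that might arrive from several source systems.
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("north", 100.0), ("north", 250.0), ("south", 300.0), ("south", 50.0)],
)

# A typical BI-style aggregate query: revenue per region.
rows = conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY REGION ORDER BY region"
).fetchall()
print(rows)  # [('north', 350.0), ('south', 350.0)]
conn.close()
```

Real warehouses run the same kind of SQL, but distribute storage and execution across many nodes and compress data aggressively, which is what makes billion-row aggregates feasible.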

4. Statistical Processing: Statistical programming tools help perform basic statistical operations on data.

SAS: SAS is closed-source, proprietary software designed specifically for statistical operations. Large organizations widely use SAS for statistical modeling and data analysis. In addition, SAS offers numerous statistical libraries and tools that data scientists can use to model and organize their data.

R: R is a free software environment for statistical computing and graphics. It is widely used among statisticians and data miners for developing statistical software and performing data analysis, and it is equally popular in industry and academia. R is an interpreted programming language with libraries for statistical and graphical techniques, including linear and nonlinear modeling, classical statistical tests, spatial and time series analysis, classification, and clustering.
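Linear modeling, one of the techniques listed above, reduces in its simplest form to ordinary least squares. A plain-Python sketch on toy data shows the arithmetic that a call like R's `lm()` performs for one predictor:

```python
# Simple linear regression (y = a + b*x) by ordinary least squares on toy data.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 4.0, 6.2, 7.9, 10.1]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# Slope = covariance(x, y) / variance(x); intercept follows from the means.
cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
var_x = sum((x - mean_x) ** 2 for x in xs)
slope = cov_xy / var_x
intercept = mean_y - slope * mean_x
print(round(slope, 3), round(intercept, 3))  # roughly 1.99 and 0.09
```

Statistical packages add what this sketch omits: standard errors, significance tests, diagnostics, and support for many predictors at once.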

5. Complete Solutions:

Qlik

ThoughtSpot

SAP

Google BigQuery

 

Data Science Applications

* Google’s LYNA: Google has developed a new tool, LYNA (Lymph Node Assistant), for identifying breast cancer tumors that metastasize to nearby lymph nodes. These metastases can be difficult for the human eye to see, especially when the new cancer growth is small. In one trial, LYNA’s machine learning algorithm accurately identified metastatic cancer 99 percent of the time. However, more testing is required before doctors can use it in hospitals.

 

* Clue: Predicting Periods. The popular Clue app employs data science to forecast users’ menstrual cycles and reproductive health by tracking cycle start dates, moods, stool type, hair condition, and many other metrics. Behind the scenes, data scientists mine this wealth of anonymized data with tools like Python and Jupyter Notebook. Users are then algorithmically notified when they’re fertile, on the cusp of a period, or at an elevated risk for conditions like an ectopic pregnancy.

 

* Oncora Medical: Cancer Care Recommendations. Oncora’s software uses machine learning to create personalized recommendations for current cancer patients based on data from past ones. Health care facilities using the company’s platform include New York’s Northwell Health, whose radiology team collaborated with Oncora data scientists to mine 15 years’ worth of data on diagnoses, treatment plans, outcomes, and side effects from more than 50,000 cancer records. Based on this data, Oncora’s algorithm learned to suggest personalized chemotherapy and radiation regimens.

 

* UPS: Optimizing Package Routing. UPS uses data science to optimize package transport from drop-off to delivery. Its latest platform for doing so, Network Planning Tools (NPT), incorporates machine learning and AI to crack challenging logistics puzzles, such as how packages should be rerouted around bad weather or service bottlenecks. NPT lets engineers simulate a variety of workarounds and pick the best ones; the AI also suggests routes on its own. According to a company forecast, the platform could save UPS $100 million to $200 million by 2020.

 

* StreetLight Data: Traffic Patterns, and Not Just for Cars. StreetLight uses data science to model traffic patterns for cars, bikes, and pedestrians on North American streets. StreetLight’s traffic maps stay up to date thanks to a monthly influx of trillions of data points from smartphones, in-vehicle navigation devices, and more. They’re more granular than mainstream map apps, too: they can, for instance, identify groups of commuters who use multiple transit modes to get to work, like a train followed by a scooter. The company’s maps inform various city planning projects, including commuter transit design.

 

* Uber Eats: Delivering Food While It’s Hot. The data scientists at Uber Eats, Uber’s food delivery app, have a fairly simple goal: getting hot food delivered quickly. Making that happen across the country takes machine learning, advanced statistical modeling, and staff meteorologists. To optimize the complete delivery process, the team has to predict how every possible variable, from storms to holiday rushes, will impact traffic and cooking time.

 

* Liverpool F.C.: Moneyballing Soccer. Liverpool’s soccer team almost won the 2019 Premier League championship with data science, which the team uses to ferret out and recruit undervalued players. According to the New York Times, Liverpool was long in the same bind as the Oakland A’s: it didn’t have nearly the budget of competitors like Manchester United, so it had to find great players before rich teams realized how great they were. Data scientist Ian Graham, now head of Liverpool’s research team, figured out how to do that. Given the chaotic, continuous nature of play and the rarity of goals, it’s not easy to quantify soccer prowess, but Graham built a proprietary model that calculates how every pass, run, and goal attempt influences a team’s overall chance of winning. Liverpool has used it both to recruit players and for general strategy.

 

* Equivant: Data-Driven Crime Predictions. Widely used by the American judicial system and law enforcement, Equivant’s Northpointe software suite attempts to gauge an incarcerated person’s risk of reoffending. Its algorithms predict that risk based on a questionnaire covering the person’s employment status, education level, and more. No questionnaire items explicitly address race, but according to a ProPublica analysis that Northpointe disputed, the Equivant algorithm pegs black people as higher recidivism risks than white people 77 percent of the time, even when they’re the same age and gender with similar criminal records. ProPublica also found that Equivant’s predictions were 60 percent accurate.

 

* IRS: Evading Tax Evasion. Tax evasion costs the U.S. government $458 billion a year, by one estimate, so it’s no wonder the IRS has modernized its fraud-detection protocols for the digital age. To the dismay of privacy advocates, the agency has improved efficiency by constructing multidimensional taxpayer profiles from public social media data, assorted metadata, email analysis, electronic payment patterns, and more. The agency forecasts individual tax returns; anyone whose actual and forecasted returns differ wildly gets flagged for auditing.

 

* Sovrn: Automated Ad Placement. Sovrn brokers deals between advertisers and outlets like Bustle, ESPN, and Encyclopedia Britannica. Since these deals happen millions of times a day, Sovrn has mined a great deal of data for insights, which manifest in its intelligent advertising technology. Compatible with Google’s and Amazon’s server-to-server bidding platforms, its interface can monetize media with minimal human oversight or, on the advertiser end, target campaigns to customers with specific intentions.

 

* Facebook: People You Almost Definitely Know. Facebook uses data science in various ways, but one of its buzzier data-driven features is the “People You May Know” sidebar, which appears on the social network’s home screen. Often creepily prescient, it’s based on a user’s friend list, the people they’ve been tagged with in photos, and where they’ve worked and gone to school. According to the Washington Post, it’s also based on “really good math”: specifically, a type of data science known as network science, which essentially forecasts the growth of a user’s social network based on the growth of similar users’ networks.

 

* Tinder: The Algorithmic Matchmaker. When singles match on Tinder, they can thank the company’s data scientists. A carefully crafted algorithm works behind the scenes, boosting the probability of matches. Once upon a time, this algorithm relied on users’ Elo scores, essentially an attractiveness ranking. Now, though, it prioritizes matches between active users, users near each other, and users who seem like each other’s “types” based on their swiping history.
