What is big data?
“Every day, we create 2.5 quintillion (10^18) bytes of data, so much that 90% of the data in the world today has been created in the last two years alone. This data comes from everywhere: sensors used to gather climate information, posts to social media sites, digital pictures and videos, purchase transaction records, and cell phone GPS signals, to name a few. This data is ‘big data.’”
Types of Big Data:
KEY ENABLERS for Big Data
+ Increase in storage capabilities
+ Increase in processing power
+ Availability of data
+ Cheaper hardware
+ Better value-for-money for businesses
XXX data : Volume Enterprises are acquiring very large XXXXXX of XXXXXXXXXXX variety of sources XXXX examples of use: -Sentiment Analysis – Twitter data-XXXXXXXXX XX XXXXXX XXX XXXXXXX XXXX day XXXXX can beused XXX XXXXXXXX product sentiment analysis -Predict XXXXX XXXXXXXXXXX-Convert billions XX XXXXXX XXXXX XXXXXXXX into XXXXXXXXXXXXX power consumption XXX every XXXX / XXXXXX.
XXX data : Velocity
For XXXX-sensitive XXXXXXXXX such as XXXXXXXX fraud, preventing accidents, XXXXXX XXXX XXXXXX medication, etc., big XXXX must XX used as it streams XXXX an XXXXXXXXXX in order to maximize its value.
Some XXXXXXXX of use:
- Scrutinize millions of XXXXXX card XXXXXXXXXXXX each day to XXXXXXXX XXXXXXXXX fraud
- Analyze XXXXXXXX XX XXXXX call detail records in real time to XXXXXXX customer XXXXX faster
- In ICU, analyze blood chemistry / ECG XXXXXXXX in real time to deliver XXXX XXXXXX medication
Big data : XXXXXXX XXX XXXX XXX be XX any type - structured andunstructured XXXX such as XXXX, sensor data, XXXXX,video, XXXXX XXXXXXX, log XXXXX and more. New insights arefound XXXX XXXXXXXXX XXXXX XXXX XXXXX XXXXXXXX.
Some XXXXXXXX of use:-XXXXXXX XXXX XXXXX XXXXX XXXX XXXXXXXXXXXX XXXXXXX toidentify XXXXXXXXX threats- XXXXXXX XXXXX, audio, video XXX web XXXXXXXXXXX XXXXXX customer to give XXXXXX XXXXXXX XXXXX XXXXXXXX,safety tips XXX XXXXXXXXXXXXXXX.
Big data : XXXXXXXX Accuracy XX a big XXXXXXX in Big XXXX. There XX XX easyway to XXXXXXXXX XXXX data from bad. XXXX concerns:- Among XXXXXXXXX XX reviews of hotels which XXXX XXXXXXXXXXXX and XXXXX ones XXX not?
- XXX to XXXX out XXX TRUTH XXXX thousands XX productreviews- XXX to XXXXXXXX a XXXXX XXXX a informedcommunication?
Challenge of Big Data – Scalability of Algorithms for Statistical Computations
Some very useful tools of XXXXXXXXXXX analysis XXX presently not implemented with scalable XXXXXXXXXX XXX e.g. XX median is XX be XXXXXXXX using XXX simple XXXXXXXXXX algorithm it XXXXX take a very large amount of time - XXXXXXXXXXX XXXXXXXXXX = O(N^2). XXXXXXXXXXX XXXXXXX XXXXX uses XXXXXX or XXXXX quantiles as its base XXXX XXX face this XXXXXXXXX
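To illustrate how much the choice of algorithm matters for the median, here is a minimal sketch comparing two approaches. The first counts, for every candidate value, how many elements lie below it, which costs O(N^2) comparisons. The second uses quickselect, which runs in expected O(N). The function names are illustrative and do not come from the slides, and the O(N^2) variant shown is only one possible reading of the "simple algorithm" mentioned above.

```python
import random


def median_naive(values):
    """Lower median by brute-force counting: O(N^2) comparisons."""
    n = len(values)
    k = (n - 1) // 2  # position of the lower median in sorted order
    for candidate in values:
        smaller = sum(1 for v in values if v < candidate)
        equal = sum(1 for v in values if v == candidate)
        # candidate occupies sorted positions smaller .. smaller + equal - 1
        if smaller <= k < smaller + equal:
            return candidate
    raise ValueError("median of an empty sequence")


def median_quickselect(values, seed=0):
    """Lower median by quickselect: expected O(N) comparisons."""
    if not values:
        raise ValueError("median of an empty sequence")
    rng = random.Random(seed)  # fixed seed only to make runs repeatable
    items = list(values)
    k = (len(items) - 1) // 2
    while True:
        pivot = rng.choice(items)
        lower = [v for v in items if v < pivot]
        equal_count = sum(1 for v in items if v == pivot)
        if k < len(lower):
            items = lower  # the answer lies among the smaller values
        elif k < len(lower) + equal_count:
            return pivot  # the pivot itself sits at position k
        else:
            k -= len(lower) + equal_count
            items = [v for v in items if v > pivot]
```

Both functions return the same value on any input. The difference is only in running time, which becomes decisive once N reaches the millions.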
Challenge XX XXX Data – Non-XXXXXXXXXXXXX- While the XXXX volume XX large it is not XXXXXXXXX XXX XXXXXXXXX XXXXXXX -No random XXXXXXXX XXXXXXX are XXXXXXX -XXXXXXXXXXX models XX statistics (XXXXXXXXXXX / Bayesian)XXXXXXXXX XXXXXX a random sample XXXXX XXXX a population -XXX XXXX XXXXXXXXXX results XXXX XXXX with non-randomsamples -XXXX XX for statistical XXXXXXX XXXXX can XXXXXX non-randomsamples.
XXXXXXXXX XX Big XXXX – Mixture Data XX there a single XXXXXXXXXX or multiple populations in the XXXXXXX?- XXXX XXXXXXXXXXX methods XXX XXXXXXX for XXXXXXX XXXXXXXXX XXX asingle XXXXXXXXXX- If XXX data is a XXXXXXX XX XXXXXXXXXXXX XXXX multiple populationswe need XX “discover” the XXXXXX of populations as well XXXXXXXXXXXX separate inferences. -XXXXXXXXXX XXXX as Flexible XXXXXXXXXX XXX XXXXXXXXXXXXXX,XXXXXXX learning XXXXXXXXXX XXXX XXXX/ CHAID attempts dothis -More such methods needed
Challenge of XXX XXXX – Real Time
- Streaming XXXX - XXXXX is on XXXXX
- XXXX time problems XXXX as fraud XXXXXXXXX XXXX XXXXX XXXXXXXX.
- XXXX XXXXXXX are XXXXXXXX “in memory” in “time XXXXXXX” XXXXXXXXXXX XXXXXXX to disk.
- Analysis XX data XXXXXX on XXXX is time consuming XXX hence XXXXXXXXX for these kind XX XXXXXXXXXXXX
- XXXXXXXXXXX methods XXXXXXXXX XXXX XXXX XXXXX data and XXXX XXXXXX XXX “XXXX solution”
- It XXX XXX XX XXXXXXXX to have “best XXXXXXXX” XX XXXXXXXXX only XXXXX of XXX data.
- Trade-off between speed and accuracy.
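The idea of analysing a stream "in memory" over a time window can be sketched as follows. This is a minimal illustration, not a production design: the class name `SlidingWindowMean`, the choice of a running mean as the statistic, and the assumption that events arrive in non-decreasing timestamp order are all made up for this example.

```python
from collections import deque


class SlidingWindowMean:
    """Running mean over the most recent `window_seconds` of a stream."""

    def __init__(self, window_seconds):
        if window_seconds <= 0:
            raise ValueError("window_seconds must be positive")
        self.window_seconds = window_seconds
        self.events = deque()  # (timestamp, value) pairs, oldest first
        self.total = 0.0

    def add(self, timestamp, value):
        """Record one event and return the mean of the current window."""
        self.events.append((timestamp, value))
        self.total += value
        # Evict events that have fallen out of the window; nothing is
        # ever written to disk, so memory use is bounded by the window.
        cutoff = timestamp - self.window_seconds
        while self.events[0][0] <= cutoff:
            _, old_value = self.events.popleft()
            self.total -= old_value
        return self.total / len(self.events)


window = SlidingWindowMean(window_seconds=10)
for ts, amount in [(0, 10.0), (5, 20.0), (12, 30.0)]:
    print(ts, window.add(ts, amount))
```

Each update costs amortised constant time, because every event is appended once and evicted at most once. The price is that the answer reflects only the recent part of the data, which is the speed-versus-accuracy trade-off noted above.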
Google XXX XXXXX In 2009, XXXXXX reported XXXX XX analyzing flu-related XXXXXXXXXXXXX it XXX been able to detect the XXXXXX XX XXX flu asXXXXXXXXXX XXX more quickly than CDCP, XXX In XXX 2013, Nature XXXXXXXX XXXX XXXXXX flu-XXXXXX weren’XXXXXXXX XXX XXXXXXXXX more XXXX double XXX XXXXXXXXXX ofdoctor visits for XXXXXXXXX-XXXX illnesses XXXX XXXX. Many XXXXXXX are XXXXXXXX XXXXXXXXX XXXXXX in searchbehaviour, emergence XX alternative XXXXXXX XX XXXXXXXXXXXXXX.
Challenge of XXX XXXX – XXXXXXX of XXXX- XXX Data XXXXXXXX of different XXXXX XX data -With more XXXXXXX XXXXXXXX XXXXXXXX XXXXXXX XXX XXXXXXX XXXXXX XX large- Images, XXXXX, Text, XXXXXX XXXXX -Twitter, Facebook,XXXXXX etc. XXX XXXXXXXXXX to XXX XXXXXXXXXXX is to arrive XX “XXXXXXXXXXXXX” using all XXX XXXXXXXXXXX XXXX XXX the XXXXXXX ---Symbolic Data Analysis (SDA) attempts XX handle XXXXXXXXXXX problem -Present XXX XXXXXXX XXX XXXXXXX XXXXXXXXXXX. New methods of quantifying association and uncertainty XXXXXXXX XXXX XX XXXX XXX XXXXXXXX.
Statistics on Manifolds?
Challenge of Big XXXX – XXXX Quality -Data XXXXXXX is a big concern -Apart from XXXX, XXX data can XX XXXXXXXXXX -Over XXXX XXXX XXXXXXXXXXX XX well as method XX collectionmay XXXXXX- For e.g. in credit XXXXXXX XXXXXXXXXX studies XXX XXXXXXXXXXXXXX loan from an XXXXXXXXXXX XXX XXXXXX over XXXXXXXXXXXXXXXXXXXX XXX bank’s behaviour in terms XX XXXXXX XXXX.
-XX may be interesting to know that XX what extent a customerseeking a XXXX is XXXXXXXX -Assume XXXX the XXXXXXXX XXX have taken XXXXX from threefinancial XXXXXXXXXXX at different times -The XXXXX XX loan records for XXX XXXX XXXXXX XXX XX AjayK Bose, Ajoy Bose XXX XXXX Kr XXXX. How does one knowthese are same person?- XXXXXXXXXXX matching XXX XX XXXXXXX in XXXXXXXX XXX buildingunique XXXXXXXX profiles
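A rough sketch of matching name variants, using only Python's standard `difflib` module, might look like this. The similarity threshold of 0.75, the helper names, and the contrasting name "Rina Sen" are assumptions made for illustration. Real record matching would normally also compare other fields such as date of birth or address.

```python
from difflib import SequenceMatcher


def normalize(name):
    """Lower-case, drop full stops and collapse repeated spaces."""
    return " ".join(name.lower().replace(".", " ").split())


def name_similarity(a, b):
    """Similarity ratio in [0, 1] between two normalized names."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def likely_same_person(a, b, threshold=0.75):
    """Flag two name strings as a probable match (threshold is a guess)."""
    return name_similarity(a, b) >= threshold


print(likely_same_person("Ajay K Bose", "Ajoy Bose"))
print(likely_same_person("Rina Sen", "Ajoy Bose"))  # hypothetical other customer
```

A string-similarity score like this only suggests candidate matches. Deciding that two records truly belong to one customer still needs corroborating evidence or a probabilistic linkage model.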
XXXXXXXXX XX XXX Data – XXXXXXXXXXXXXXXXX and Confidentiality The XXXXXX chain Target in XX could XXXXXX out that a teenagerwas pregnant XXXX XXXXXX her XXXX (XXXXXX: XXXXXX, FebXXXX)A British bank XX able XX identify potential XXXXX XXXXXXXXXXXXX XXX be XXXXXX to terrorism. (XXXXXX :XXXXXXXXXXXXXXXXX, Levitt & Dubner) -XX it XXXXXXX XXX Target XX XXXXX XXX XXXXXXX information XX XXXXXXXXXXXX XXXXXXXX? -XXXX if XXX XXXXXXXXX XXXXXXXXXX a XXXXX person as a XXXXXXXXXXXXXX XXXXXXXXX? XXXX XXXXXXX XX his/ her XXXXXXXXXX?
XXXXX p, XXXXX N XX XXXXXXX XXXXXXX, the XXXXXX XX variables(p) is very largeand often XXXXXXX XXX number XX samples (N) XX a bigamount. -XX need to reduce the XXXXXXXXX of XXX XXXXXXX XX be able XX XXXX meaningful XXXXXXXXXXX- Personalized medicine XXXXX depend XXXXXXX on XXXXXXXXXXXXXXXXXX from XXXX XXXX.
XXXXXXX
Big data poses XXX challenges XX XXXXXXXXXXXXX XXXX in XXXXX of theory and XXXXXXXXXXXX
XXXX XX the challenges XXXXXXX
- XXXXXXXXXXX XX statistical computation XXXXXXX
- Non-XXXXXX data
- Mixture data
- Real XXXX Analysis XX Streaming XXXX
- Statistical Analysis XXXX multiple kinds of data
- XXXX Quality
- XXXXXXXXXX Privacy XXX Confidentiality
- High Dimensional XXXX