Friday, June 28, 2013

Who is a Data Scientist? - Data Science Training - Big Data in Chennai @ Geoinsyssoft


 Big Data Needs Data Scientists




The United States alone faces a shortage of 140,000 to 190,000 people with deep analytical expertise, and of 1.5 million managers and analysts with the skills to understand and make decisions based on the analysis of big data (McKinsey Global Institute).


A data scientist combines several skill areas:

Quantitative: statistics and data modelling
Technical: computer science and hacking skills
Business: domain knowledge
Communication: visualization and explaining findings in a data-driven business
Mindset: skeptical, always asking "what does the data actually say?"

In short, QHD: Quantitative, Hacking and Domain knowledge.

People who combine all of these are the "deep analytical talent" that is in short supply.


Statistician/mathematician: more quantitative, less technical
Traditional researcher: more business, more quantitative, less technical
Business intelligence: more technical, more business, less quantitative
Data scientist: more technical, more business, more quantitative



Phase 1: Statistics - fundamentals: methods, processes, theorems, techniques
Phase 2: Big data
Phase 3: Big data analytics using R
Phase 4: Machine learning and NLP
Phase 5: Predictive and competitive intelligence

The 4 A's:

Data architecture
Data acquisition
Data analysis
Data archiving


Data architecture: the design of the software/hardware systems that read and store data for the business, covering where the data originates and how it supports the various people across the business.

A data scientist would help the system architect by providing input on how the data would need to be routed and organized to support the analysis, visualization, and presentation of the data to the appropriate people.



Measurement
Big data is to the analyst what the microscope is to the biologist or chemist: an instrument that increases productivity and profitability in a data-driven business.
Charts and graphs usually present conclusions that have already been decided; for the analyst, the real experiment is choosing among the many options for handling the data.
This takes skills to collect and analyse different kinds of data, including non-financial and non-numeric data: customer experience, emotions, likes and so on.

Landscape
Big data is not only about volume; it is also nano-data, the individual grains of data.
It pushes BI to view the same data in different ways.
Conceive the data for advantage: break down the opponent's statistics in seconds to succeed in the game.








Thursday, June 20, 2013

Word Count in Hive and Pig





WordCount in Hive

The example below assumes a tab-delimited word-list input (one row per word, such as the Google Books n-gram dataset) and aggregates counts by word length:

hive> CREATE TABLE wordlist (word STRING, year INT, wordcount INT, pagecount INT, bookcount INT) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

hive> LOAD DATA LOCAL INPATH '/inputfile' OVERWRITE INTO TABLE wordlist;

hive> CREATE TABLE wordlengths (wordlength INT, wordcount INT);

hive> INSERT OVERWRITE TABLE wordlengths SELECT length(word), wordcount FROM wordlist;

hive> SELECT wordlength, sum(wordcount) FROM wordlengths GROUP BY wordlength;
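The queries above aggregate pre-tokenized counts by word length. For raw text, a per-word count can be sketched with split and explode; the table name docs and the input path here are illustrative:

```sql
CREATE TABLE docs (line STRING);
LOAD DATA LOCAL INPATH '/input.txt' OVERWRITE INTO TABLE docs;
SELECT word, count(1) AS count
FROM (SELECT explode(split(line, ' ')) AS word FROM docs) w
GROUP BY word;
```

The subquery turns each line into one row per word, and the outer GROUP BY sums the occurrences, mirroring the Pig script below.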


Word Count in Pig : 


Lines = LOAD './input.txt' AS (line:chararray);

-- TOKENIZE splits the line into a bag of words
-- FLATTEN produces a separate record for each item from the bag
Words = FOREACH Lines GENERATE FLATTEN(TOKENIZE(line)) AS word;

-- group records together by word
Groups = GROUP Words BY word;

-- count the words in each group
Counts = FOREACH Groups GENERATE group, COUNT(Words);

-- store the results
STORE Counts INTO './wordcount';

-- note: keywords are case-insensitive, but aliases (Lines, Words, Groups, Counts)
-- and built-in functions/UDFs such as TOKENIZE and COUNT are case-sensitive








Big Data - Hadoop - Simple Learning - Geoinsyssoft Chennai - Training and Consulting


Big data  -Simple learning



“Big data” is high-volume, high-velocity and high-variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making (Gartner's definition).


Hadoop - an open-source framework for distributed storage (HDFS) and parallel processing (MapReduce) of large data sets across clusters of commodity hardware. Its main components:

§ Map Reduce - Parallel processing of large data sets
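The map, shuffle/sort and reduce phases can be sketched in plain Python for the word-count case (the function names here are illustrative, not the Hadoop API):

```python
from itertools import groupby
from operator import itemgetter

def map_phase(lines):
    # Map: emit a (word, 1) pair for every word in every input line
    for line in lines:
        for word in line.split():
            yield (word, 1)

def shuffle_sort(pairs):
    # Shuffle/sort: bring all values for the same key together
    for key, group in groupby(sorted(pairs, key=itemgetter(0)), key=itemgetter(0)):
        yield key, [value for _, value in group]

def reduce_phase(grouped):
    # Reduce: sum the counts for each word
    for word, counts in grouped:
        yield word, sum(counts)

lines = ["big data big insight", "data science"]
counts = dict(reduce_phase(shuffle_sort(map_phase(lines))))
print(counts)  # {'big': 2, 'data': 2, 'insight': 1, 'science': 1}
```

In real Hadoop the three phases run on different machines and the shuffle moves data over the network; the dataflow, however, is exactly this.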




§ Hive - Hadoop data warehouse



§ Hbase - NoSQL column-family (wide-column) database

·         Written in: Java
·         Main point: Billions of rows X millions of columns
·         License: Apache
·         Protocol: HTTP/REST (also Thrift)
·         Modeled after Google's BigTable
·         Uses Hadoop's HDFS as storage
·         Map/reduce with Hadoop
·         Query predicate push down via server side scan and get filters
·         Optimizations for real time queries
·         A high performance Thrift gateway
·         HTTP supports XML, Protobuf, and binary
·         Jruby-based (JIRB) shell
·         Rolling restart for configuration changes and minor upgrades
·         Random access performance is like MySQL
·         A cluster consists of several different types of nodes
Best used: Hadoop is probably still the best way to run Map/Reduce jobs on huge datasets. Best if you use the Hadoop/HDFS stack already.
For example: search engines, analysing log data, any place where scanning huge, two-dimensional join-less tables is a requirement.
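The data model (rows, column families, cells) is easiest to see from the HBase shell; the table and column names below are illustrative:

```
hbase> create 'users', 'info'
hbase> put 'users', 'row1', 'info:name', 'Alice'
hbase> get 'users', 'row1'
hbase> scan 'users'
```

Each cell is addressed by (row key, column family:qualifier, timestamp), and rows are stored sorted by row key, which is what makes huge range scans cheap.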





§ Mahout - Machine learning
§ Pig - Scripting language
§ Hue - Graphical user interface
§ Whirr - Libraries for running cloud services
§ Oozie - Workflow engine
§ Zookeeper - Distributed coordination service
§ Avro - Serialization
§ Flume - Streaming log collection
§ Sqoop - RDBMS connectivity
§ Chukwa - Data collection




Wednesday, June 12, 2013

Big data training Daywise curriculum @Geoinsyssoft


                                         


For more details on course curriculum, duration and fees, click here.
Classroom and online training:

For a demo, call 9884218531 or mail: info@geoinsyssoft.com





Big data training day-wise content:


Day 1:
Introduction to Big Data
Real-time use cases
Volume, Variety, Velocity, Value
Comparison with existing OLTP, ETL, DWH, OLAP
Day 2
Introduction to Hadoop 1.0 and Hadoop 2.0
Architecture
HDFS Cluster – Data Storage Framework
Map Reduce  - Data Processing Framework
HBASE – NOSQL Database
HIVE Warehouse
PIG Latin data flow scripts
SQOOP – Bulk data transfer for relational databases
Flume – Streaming logs

DAY 3
Setup – VM with Linux (Ubuntu/CentOS)
Java
Hadoop setup and configuration – versions 1.1.2 and 2.0.5
Hadoop 1.0 cluster and Daemons
Name node – Metadata , fsimage ,Editlog , Block reports
Rack awareness policy
Safe mode ,rebalancing and load optimization
Data node – Writing, reading and replication of blocks
Job tracker – Initialization, execution, IO, failure
Task tracker – Initialization , progress, failure
Secondary Namenode – Not a backup
DAY 4
Installation and configuration of Hadoop 2.0 – YARN
Resource Manager – resource and job management
Application Manager
Scheduler – Fair, Capacity, Priority
Node Manager
Application Master
Container – YARN child and task execution
UBER job
Failure handling for Application, RM, AM, NM

Day 5:
Unix and Java basics
HDFS file operations – fs shell

 
Day 6:
Introduction to MapReduce
Architecture of MR v1 and v2
Key-value pairs
Mapper – setup/config, init, map, cleanup, close
Shuffle and sort
Combiner
Partitioner
Reducer
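The shuffle stage routes each key to exactly one reducer via a partitioner; a minimal sketch in Python (zlib.crc32 stands in for Hadoop's default hash partitioner, and the word list is illustrative):

```python
import zlib

def partition(key, num_reducers):
    # Hash partitioner: the same key always maps to the same reducer,
    # so every (word, 1) pair for one word meets in a single reduce task.
    return zlib.crc32(key.encode("utf-8")) % num_reducers

words = ["big", "data", "big", "science", "data"]
buckets = {r: [] for r in range(3)}
for w in words:
    buckets[partition(w, 3)].append(w)
# every occurrence of a given word lands in the same bucket
```

A combiner exploits the same property: because all pairs for a key go to one place, partial sums computed on the map side remain correct after the final reduce.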

Day 7:
MapReduce word count program

Structured and unstructured data handling
Data processing
Map-only jobs

Day 8 and Day 9:
MR programs 2:
Combiner and Partitioner
Single and multiple columns
Inverted index
XML – semi-structured data
Map-side joins
Reduce-side joins

Day 10:
Introduction to HIVE data warehouse
Architecture and installation
Basic HQL commands
Load, external tables
Joins
Partitioning
Buckets
Advanced HQL commands
Beeswax – web console
Word count in Hive

Day 11:
Introduction to PIG
Installation
Data flow scripts
Handling structured and unstructured data

Day 12:
Introduction to NOSQL
ACID / CAP / BASE
Key-value pair – Map reduce
Column family – HBase
Document – MongoDB
Graph DB – Neo4j

Day 13:
Introduction to HBASE and installation. 
The HBase Data Model
The HBase Shell
HBase Architecture
Schema Design
The HBase API
HBase Configuration and Tuning

Day 14:
Introduction to Sqoop and installation.
Bulk loading
Hadoop Streaming.

Day 15:
Flume NG
Source, Sink, Channel – Agent
Avro
ZooKeeper
Chukwa and Oozie
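A Flume NG agent is declared in a properties file that wires a source, a channel and a sink together; the sketch below follows the standard netcat-to-logger example from the Flume user guide (the agent name a1 is illustrative):

```properties
# name the components of agent a1
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# netcat source listening on localhost:44444
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444

# sink that logs events to the console
a1.sinks.k1.type = logger

# in-memory channel buffering events between source and sink
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000

# bind source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
```

Swapping the logger sink for an HDFS sink is the usual next step when streaming logs into Hadoop.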

Day 16:
Integrate With ETL
Talend Data studio

Day 17 :
Big data Analytics-Visualization
Tableau or Jaspersoft
Cloudera /Hortonworks/Greenplum

Day 18:
Introduction to Data science
Data mining -Machine learning
Statistical Analysis – Predictive modelling
Sentiment Analysis or opinion mining

Day 19 :
Use cases, case studies and proofs of concept

Day 20 and Day 21(Optional)

CCD-410 - Cloudera Certification Questions Discussion.





                                           www.geoinsyssoft.com/courses