Friday, June 28, 2013

Who is a Data Scientist? - Data Science Training - Big Data in Chennai @ Geoinsyssoft


 Big Data Needs Data Scientists




The United States alone faces a shortage of 140,000 to 190,000 people with deep analytical expertise, and of 1.5 million managers and analysts with the skills to understand and make decisions based on the analysis of big data (McKinsey Global Institute).


A data scientist combines several skill areas:

Quantitative: statistics and data modelling
Technical: computer science and hacking skills
Business: domain knowledge
Communication: visualization and explaining findings in a data-driven business
Mindset: skeptical, always asking "what does the data actually say?"

In short, QHD: Quantitative, Hacking and Domain knowledge.

People who combine all of these are the "deep analytical talent" that is in short supply.


Statistician/mathematician: more quantitative, less technical
Traditional researcher: more business, more quantitative, less technical
Business intelligence: more technical, more business, less quantitative
Data scientist: more technical, more business, more quantitative



Phase 1: Statistics - fundamentals: methods, processes, theorems, techniques
Phase 2: Big data
Phase 3: Big data analytics using R
Phase 4: Machine learning and NLP
Phase 5: Predictive and competitive intelligence

The 4 A's:

Data architecture
Data acquisition
Data analysis
Data archiving


Data architecture: the design of the software/hardware systems that read and store data for the business, covering where the data originates and how it supports the various people across the business.

A data scientist would help the system architect by providing input on how the data would need to be routed and organized to support the analysis, visualization, and presentation of the data to the appropriate people.



Measurement
Big data is to the analyst what the microscope is to the biologist or chemist: an instrument that increases productivity and profitability in a data-driven business.
Charts and graphs usually present conclusions that have already been decided; for the analyst, the real experiment is choosing among the many options for handling the data.
This takes skills to collect and analyse different kinds of data, including non-financial and non-numeric data: customer experience, emotions, likes and so on.

Landscape
Big data is not only about volume; it is also nano-data, the individual grains of data.
It pushes BI to view the same data in different ways.
Conceive the data for advantage: break down the opponent's statistics in seconds to succeed in the game.








Thursday, June 20, 2013

Word Count in Hive and Pig





WordCount in Hive

The example below assumes a tab-delimited word-list input (one row per word, such as the Google Books n-gram dataset) and aggregates counts by word length:

hive> CREATE TABLE wordlist (word STRING, year INT, wordcount INT, pagecount INT, bookcount INT) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

hive> LOAD DATA LOCAL INPATH '/inputfile' OVERWRITE INTO TABLE wordlist;

hive> CREATE TABLE wordlengths (wordlength INT, wordcount INT);

hive> INSERT OVERWRITE TABLE wordlengths SELECT length(word), wordcount FROM wordlist;

hive> SELECT wordlength, sum(wordcount) FROM wordlengths GROUP BY wordlength;
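The queries above aggregate pre-tokenized counts by word length. For raw text, a per-word count can be sketched with split and explode; the table name docs and the input path here are illustrative:

```sql
CREATE TABLE docs (line STRING);
LOAD DATA LOCAL INPATH '/input.txt' OVERWRITE INTO TABLE docs;
SELECT word, count(1) AS count
FROM (SELECT explode(split(line, ' ')) AS word FROM docs) w
GROUP BY word;
```

The subquery turns each line into one row per word, and the outer GROUP BY sums the occurrences, mirroring the Pig script below.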


Word Count in Pig : 


Lines = LOAD './input.txt' AS (line:chararray);

-- TOKENIZE splits the line into a bag of words
-- FLATTEN produces a separate record for each item from the bag
Words = FOREACH Lines GENERATE FLATTEN(TOKENIZE(line)) AS word;

-- group records together by word
Groups = GROUP Words BY word;

-- count the words in each group
Counts = FOREACH Groups GENERATE group, COUNT(Words);

-- store the results
STORE Counts INTO './wordcount';

-- note: keywords are case-insensitive, but aliases (Lines, Words, Groups, Counts)
-- and built-in functions/UDFs such as TOKENIZE and COUNT are case-sensitive








Big Data - Hadoop - Simple Learning - Geoinsyssoft Chennai - Training and Consulting


Big data  -Simple learning



“Big data” is high-volume, high-velocity and high-variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making (Gartner's definition).


Hadoop - an open-source framework for distributed storage (HDFS) and parallel processing (MapReduce) of large data sets across clusters of commodity hardware. Its main components:

§ Map Reduce - Parallel processing of large data sets
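The map, shuffle/sort and reduce phases can be sketched in plain Python for the word-count case (the function names here are illustrative, not the Hadoop API):

```python
from itertools import groupby
from operator import itemgetter

def map_phase(lines):
    # Map: emit a (word, 1) pair for every word in every input line
    for line in lines:
        for word in line.split():
            yield (word, 1)

def shuffle_sort(pairs):
    # Shuffle/sort: bring all values for the same key together
    for key, group in groupby(sorted(pairs, key=itemgetter(0)), key=itemgetter(0)):
        yield key, [value for _, value in group]

def reduce_phase(grouped):
    # Reduce: sum the counts for each word
    for word, counts in grouped:
        yield word, sum(counts)

lines = ["big data big insight", "data science"]
counts = dict(reduce_phase(shuffle_sort(map_phase(lines))))
print(counts)  # {'big': 2, 'data': 2, 'insight': 1, 'science': 1}
```

In real Hadoop the three phases run on different machines and the shuffle moves data over the network; the dataflow, however, is exactly this.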




§ Hive - Hadoop data warehouse



§ Hbase - NoSQL column-family (wide-column) database

·         Written in: Java
·         Main point: Billions of rows X millions of columns
·         License: Apache
·         Protocol: HTTP/REST (also Thrift)
·         Modeled after Google's BigTable
·         Uses Hadoop's HDFS as storage
·         Map/reduce with Hadoop
·         Query predicate push down via server side scan and get filters
·         Optimizations for real time queries
·         A high performance Thrift gateway
·         HTTP supports XML, Protobuf, and binary
·         Jruby-based (JIRB) shell
·         Rolling restart for configuration changes and minor upgrades
·         Random access performance is like MySQL
·         A cluster consists of several different types of nodes
Best used: Hadoop is probably still the best way to run Map/Reduce jobs on huge datasets. Best if you use the Hadoop/HDFS stack already.
For example: search engines, analysing log data, any place where scanning huge, two-dimensional join-less tables is a requirement.
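The data model (rows, column families, cells) is easiest to see from the HBase shell; the table and column names below are illustrative:

```
hbase> create 'users', 'info'
hbase> put 'users', 'row1', 'info:name', 'Alice'
hbase> get 'users', 'row1'
hbase> scan 'users'
```

Each cell is addressed by (row key, column family:qualifier, timestamp), and rows are stored sorted by row key, which is what makes huge range scans cheap.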





§ Mahout - Machine learning
§ Pig - Scripting language
§ Hue - Graphical user interface
§ Whirr - Libraries for running cloud services
§ Oozie - Workflow engine
§ Zookeeper - Distributed coordination service
§ Avro - Serialization
§ Flume - Streaming log collection
§ Sqoop - RDBMS connectivity
§ Chukwa - Data collection




Wednesday, June 12, 2013

Big data training Daywise curriculum @Geoinsyssoft


                                         


For more details on course curriculum, duration and fees, click here.
Classroom and online training:

For a demo, call 9884218531 or mail: info@geoinsyssoft.com





Big data training day-wise content:


Day 1:
Introduction to Big Data
Real-time use cases
Volume, Variety, Velocity, Value
Comparison with existing OLTP, ETL, DWH, OLAP
Day 2
Introduction to Hadoop 1.0 and Hadoop 2.0
Architecture
HDFS Cluster – Data Storage Framework
Map Reduce  - Data Processing Framework
HBASE – NOSQL Database
HIVE Warehouse
PIG Latin data flow scripts
SQOOP – Bulk data transfer for relational databases
Flume – Streaming logs

DAY 3
Setup – VM with Linux (Ubuntu/CentOS)
Java
Hadoop setup and configuration – versions 1.1.2 and 2.0.5
Hadoop 1.0 cluster and Daemons
Name node – Metadata , fsimage ,Editlog , Block reports
Rack awareness policy
Safe mode ,rebalancing and load optimization
Data node – Writing, reading and replication of blocks
Job tracker – Initialization, execution, IO, failure
Task tracker – Initialization , progress, failure
Secondary Namenode – Not a backup
DAY 4
Installation and configuration of Hadoop 2.0 – YARN
Resource Manager – resource and job management
Application Manager
Scheduler – Fair, Capacity, Priority
Node Manager
Application Master
Container – YARN child and task execution
UBER job
Failure handling for Application, RM, AM, NM

Day 5:
Unix and Java basics
HDFS file operations – fs shell

 
Day 6:
Introduction to MapReduce
Architecture of MR v1 and v2
Key-value pairs
Mapper – setup/config, init, map, cleanup, close
Shuffle and sort
Combiner
Partitioner
Reducer
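The shuffle stage routes each key to exactly one reducer via a partitioner; a minimal sketch in Python (zlib.crc32 stands in for Hadoop's default hash partitioner, and the word list is illustrative):

```python
import zlib

def partition(key, num_reducers):
    # Hash partitioner: the same key always maps to the same reducer,
    # so every (word, 1) pair for one word meets in a single reduce task.
    return zlib.crc32(key.encode("utf-8")) % num_reducers

words = ["big", "data", "big", "science", "data"]
buckets = {r: [] for r in range(3)}
for w in words:
    buckets[partition(w, 3)].append(w)
# every occurrence of a given word lands in the same bucket
```

A combiner exploits the same property: because all pairs for a key go to one place, partial sums computed on the map side remain correct after the final reduce.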

Day 7:
MapReduce word count program

Structured and unstructured data handling
Data processing
Map-only jobs

Day 8 and Day 9:
MR programs 2:
Combiner and Partitioner
Single and multiple columns
Inverted index
XML – semi-structured data
Map-side joins
Reduce-side joins

Day 10:
Introduction to HIVE data warehouse
Architecture and installation
Basic HQL commands
Load, external tables
Joins
Partitioning
Buckets
Advanced HQL commands
Beeswax – web console
Word count in Hive

Day 11:
Introduction to PIG
Installation
Data flow scripts
Handling structured and unstructured data

Day 12:
Introduction to NOSQL
ACID / CAP / BASE
Key-value pair – Map reduce
Column family – HBase
Document – MongoDB
Graph DB – Neo4j

Day 13:
Introduction to HBASE and installation. 
The HBase Data Model
The HBase Shell
HBase Architecture
Schema Design
The HBase API
HBase Configuration and Tuning

Day 14:
Introduction to Sqoop and installation.
Bulk loading
Hadoop Streaming.

Day 15:
Flume NG
Source, Sink, Channel – Agent
Avro
ZooKeeper
Chukwa and Oozie
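A Flume NG agent is declared in a properties file that wires a source, a channel and a sink together; the sketch below follows the standard netcat-to-logger example from the Flume user guide (the agent name a1 is illustrative):

```properties
# name the components of agent a1
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# netcat source listening on localhost:44444
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444

# sink that logs events to the console
a1.sinks.k1.type = logger

# in-memory channel buffering events between source and sink
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000

# bind source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
```

Swapping the logger sink for an HDFS sink is the usual next step when streaming logs into Hadoop.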

Day 16:
Integrate With ETL
Talend Data studio

Day 17 :
Big data Analytics-Visualization
Tableau or Jaspersoft
Cloudera /Hortonworks/Greenplum

Day 18:
Introduction to Data science
Data mining -Machine learning
Statistical Analysis – Predictive modelling
Sentiment Analysis or opinion mining

Day 19 :
Use cases, case studies and proofs of concept

Day 20 and Day 21(Optional)

CCD-410 - Cloudera Certification Questions Discussion.





                                           www.geoinsyssoft.com/courses