Next up on Hadoop Radio, Moving Data ft. Sqoop and Flume.
There are a few types of data:
- Data in Rest
- Data in Motion
- Data from a Data Warehouse
Data in Rest
This means the data has stopped being produced and is now sat resting in the file system. It is complete and ready for operation. In this circumstance we want to use standard HDFS shell commands like
Data in Motion
This is continuously updated and appended data. These need to be frozen / snapshotted at some point and then treated like Data in Rest. Data from a Web-server is typically in motion as it is constantly appended.
Data from a Data Warehouse
A variety of tools are on offer but the general practice is to get it into a relational form (think excel with rows with data for each entry). After this, treat it like Data in Rest. There are also tools to simplify this such as Jaql, Sqoop, Flume, or BigSQL Load.
Sqoop helps move structured / relational DB data, sections of it, or SQL query results to and from a Hadoop file system. There are a variety of connector modules to help with all ranges of projects.
Why is Sqoop clever? Well when you ask it to move some files, it generates its own MapReduce job to do the work.
How do I get it to do that? Simply say whether you want to import or export, where to/from and the authentication credentials:
sqoop import/export --connect LOCATION --username USERNAME --password PASSWORD
As these are Java files essentially you need to use a location they can understand and hence have to give a JDBC (Java DataBase Connection) location. More specific features are possible. For example, you can define which columns you want (
columns “column1,column2,column3"), provide filters (
--where “salary > 4000”) and more.
Exporting is similar. The table must already exist but it then populates it. It can also be applied as an update to an existing table.
The data sqoop deliver can be text or binary type, or for more control you can send it to Hadoop features like HBase or Hive.
The name comes from Latin/French for River. In English it is an artificial water way, typically used to move logs when tree felling. Here lyes why the name is relevant — we get log files from web-servers and databases.
Logs, in the computer sense, are typically always a stream of data. These work by essentially having nodes which are designated as constant lookouts. When a log comes in the workout who to send it to and then store it.
Flume needs some initial ideas to be able to run.
- A source: where the data is coming from
- A channel: a temporary memory store before logs are written in the right place
- A sink: the right place to write the data
Each of these have prebuilt types which should be defined.
You can also add metadata to a log files by using “Interceptors”. Interceptors (addition bit of code for flume you can use or write your own) can also add filtering too.
To get these working you need to give it a configuration file including all the information above as well as some type definitions. This file should be a simple text file with the extension
.properties. Then each lookout (known as an Agent) must be started via the Command Line Interface (CLI).
For lookouts to direct info to one another (for example Lookout1/Agent1 might watch a source and combine info with Lookout2/Agent2 and then feed that information to Lookout3/Agent3), they need to have the same sink-source type. If Agent1 uses Apache Avro as a sink type then Agent2 must use Avro as a source type.