Top 30 Apache Flume Questions and Answers
1. What is Apache Flume?
Apache Flume is a distributed, reliable, and available system for efficiently collecting, aggregating, and moving large amounts of log data from many different sources, such as Twitter, Facebook, and LinkedIn, into a centralized data store such as Hadoop HDFS or HBase.
2. What is the use of Apache Flume?
Apache Flume is used to fetch log or streaming data from different sources, such as social media feeds, and to persist it asynchronously in a Hadoop cluster for further analysis.
3. Who developed Apache Flume?
Apache Flume was originally developed by Cloudera for aggregating and moving very large amounts of data. In June 2011 it was contributed to the Apache Software Foundation.
4. What is Apache Flume Agent and what are the components of Flume Agent?
An Apache Flume agent is a JVM process that hosts the components through which events flow from an external source, such as a web server, to the next destination, such as Hadoop HDFS. The components of a Flume agent are the Source, the Channel, and the Sink.
5. What is an event in Apache Flume?
An Apache Flume event is the unit of data that flows through an agent. It has a byte payload and an optional set of string attributes (headers). Each type of Flume source consumes events in a format it understands; for example, the Avro source receives Avro events.
6. What are the core components of Apache Flume?
The core components of Apache Flume are the Source, the Channel, and the Sink. A Source receives events from the source system and stores them in one or more Channels. A Channel is temporary storage that holds events until they are consumed by a Sink. A Sink removes events from the Channel and puts them into external storage such as Hadoop HDFS.
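To make the wiring between the three components concrete, here is a minimal sketch of a single-agent configuration. The agent name b1 and the component names r1, c1, and k1 are arbitrary labels chosen for illustration, and the NetCat source and Logger sink are used only because they need no external system; an HDFS sink could take the Logger sink's place.

```properties
# Declare the components that belong to agent b1
b1.sources = r1
b1.channels = c1
b1.sinks = k1

# Source: listens for newline-terminated text on a TCP port
b1.sources.r1.type = netcat
b1.sources.r1.bind = localhost
b1.sources.r1.port = 44444

# Channel: buffers events in memory between source and sink
b1.channels.c1.type = memory
b1.channels.c1.capacity = 1000
b1.channels.c1.transactionCapacity = 100

# Sink: writes each event to the agent's log
b1.sinks.k1.type = logger

# Wiring: a source can feed several channels, a sink drains exactly one
b1.sources.r1.channels = c1
b1.sinks.k1.channel = c1
```

Note the asymmetry in the last two lines: the source property is the plural "channels", while the sink property is the singular "channel".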
7. What is Apache Flume Source?
Apache Flume source is used to receive events from external sources like a web server and put them into one or more channels.
8. Which Sources are supported by Apache Flume?
Apache Flume supports different types of Sources as mentioned below.
- Avro Source
- Thrift Source
- Exec Source
- JMS Source
- Spooling Directory Source
- Kafka Source
- NetCat TCP Source
- NetCat UDP Source
- Sequence Generator Source
- Syslog Source
- HTTP Source
- Custom Source
- Scribe Source
9. What is Apache Flume Channel?
An Apache Flume channel temporarily stores event data until it is consumed by another component of the Flume agent, the sink. The JDBC channel is one example: it stores events in an embedded database backed by the file system until the sink takes them.
10. Which types of Channels does Apache Flume support?
Apache Flume supports different types of Channels as mentioned below.
- Memory Channel
- JDBC Channel
- Kafka Channel
- File Channel
- Spillable Memory Channel
- Pseudo Transaction Channel
- Custom Channel
11. Which Flume Channel is the most reliable Channel to ensure no data loss?
The File Channel is the most reliable and durable channel because it stores events on disk. If the system fails, the JVM crashes, or the machine reboots, events that have not yet been delivered are still on disk, and delivery resumes when Flume is restarted.
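As an illustration, a File Channel might be configured as sketched below. The agent and channel names (b1, c1) and the directory paths are placeholders; the property names checkpointDir and dataDirs are the ones documented for the File Channel.

```properties
b1.channels = c1

# Durable channel: events are written to disk before the transaction commits
b1.channels.c1.type = file
# Directory that holds the channel checkpoint (placeholder path)
b1.channels.c1.checkpointDir = /mnt/flume/checkpoint
# Comma-separated list of directories that hold the event log files (placeholder path)
b1.channels.c1.dataDirs = /mnt/flume/data
```

Placing the data directories on separate disks is a common way to improve File Channel throughput.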
12. How to use Apache Flume with HBase?
Apache Flume can be used with HBase through the HBase Sink (org.apache.flume.sink.hbase.HBaseSink) or the AsyncHBase Sink (org.apache.flume.sink.hbase.AsyncHBaseSink).
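A configuration for the two sinks could look roughly like the sketch below. The table name, column family, and component names are made up for illustration, and the target table is assumed to exist in HBase already; the type aliases hbase and asynchbase select HBaseSink and AsyncHBaseSink respectively.

```properties
b1.sinks = k1 k2

# HBaseSink: synchronous writes
b1.sinks.k1.type = hbase
b1.sinks.k1.table = flume_events
b1.sinks.k1.columnFamily = cf
b1.sinks.k1.serializer = org.apache.flume.sink.hbase.RegexHbaseEventSerializer
b1.sinks.k1.channel = c1

# AsyncHBaseSink: asynchronous writes through the asynchbase client
b1.sinks.k2.type = asynchbase
b1.sinks.k2.table = flume_events
b1.sinks.k2.columnFamily = cf
b1.sinks.k2.serializer = org.apache.flume.sink.hbase.SimpleAsyncHbaseEventSerializer
b1.sinks.k2.channel = c2
```

In practice an agent would normally use one of the two sinks, not both; they appear together here only for comparison.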
13. What is the Apache Flume configuration file?
The Apache Flume configuration file stores the details of each agent's sources, sinks, and channels. Each component has a name, a type, and a set of properties, all of which are defined in the configuration file. For example, an Avro source needs a hostname (or IP address) and a port number on which to receive data from an external client.
Example configuration of an Avro source for agent b1 (the channel type and the sink definitions are omitted):
b1.sources = r1
b1.channels = c1
b1.sources.r1.type = avro
b1.sources.r1.channels = c1
b1.sources.r1.bind = 0.0.0.0
b1.sources.r1.port = 4141
14. What are the core concepts of Apache Flume?
Flume is designed to provide a distributed, reliable, and available system for efficiently collecting, aggregating, and moving large amounts of log data from many different sources to a centralized data store.
The architecture of Flume NG is based on a few concepts that work together to achieve this objective.
The core concepts of Flume are mentioned below.
- Event: It is the data that gets transported from its origin to its destination.
- Flow: The movement of events from source to destination is referred to as data flow.
- Client: It is an interface implementation that operates at the point where events originate and delivers them to an Apache Flume agent.
- Agent: It is used to store, process, and deliver Apache Flume events. It hosts the sources, channels, and sinks.
- Source: It uses a specific mechanism to consume events.
- Channel: It stores the events received from the Flume sources until those events are consumed by the sink.
- Sink: It takes events from the channel and sends them to the next agent or to the destination system.
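15. Does Apache Flume support third-party plugins?
Yes. Flume has a plugin-based architecture. Besides the components that ship with Flume, custom sources, channels, and sinks can be plugged in to load data from external sources and to deliver data to external destinations that Flume does not support out of the box.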
16. What is Streaming Data?
Streaming data is data that is generated continuously by different sources. Examples include log files produced by mobile applications, online video streaming services such as Netflix, social media activity, and market trading data.
17. What are Apache Flume Interceptors?
Apache Flume Interceptors are attached to a source and process events in flight, after the source receives them and before they are written to the channel. An interceptor can modify or drop events based on conditions defined by the developer.
18. What are the types of Interceptors supported by Apache Flume?
Apache Flume supports the below list of Interceptors.
- Timestamp Interceptor
- Host Interceptor
- Static Interceptor
- Remove Header Interceptor
- UUID Interceptor
- Morphline Interceptor
- Search and Replace Interceptor
- Regex Filtering Interceptor
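The way interceptors are attached to a source can be sketched as follows. The names b1, r1, i1, and i2 are illustrative labels, and the header name hostname is an arbitrary choice; interceptors run in the order in which they are listed.

```properties
# Chain two interceptors on source r1; they run in the listed order
b1.sources.r1.interceptors = i1 i2

# Timestamp Interceptor: adds a "timestamp" header with the processing time
b1.sources.r1.interceptors.i1.type = timestamp

# Host Interceptor: adds the agent's host to the header named by hostHeader
b1.sources.r1.interceptors.i2.type = host
b1.sources.r1.interceptors.i2.hostHeader = hostname
```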
19. What are Flume Channel Selectors?
A Channel Selector is used to determine which channel should be chosen from a group of channels to transfer events from source to destination.
20. What are the types of Channel Selectors supported by Flume?
Apache Flume supports the below list of Channel Selectors.
- Replicating Channel Selector
- Multiplexing Channel Selector
- Custom Channel Selector
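As a rough illustration of a Multiplexing Channel Selector, the fragment below routes events according to the value of an event header. The header name state, the header values CZ and US, and the channel names are hypothetical.

```properties
b1.sources = r1
b1.channels = c1 c2 c3 c4
b1.sources.r1.channels = c1 c2 c3 c4

# Route each event by the value of its "state" header
b1.sources.r1.selector.type = multiplexing
b1.sources.r1.selector.header = state
b1.sources.r1.selector.mapping.CZ = c1
b1.sources.r1.selector.mapping.US = c2 c3
# Events with no matching header value go to the default channel
b1.sources.r1.selector.default = c4
```

If no selector is configured, the source uses the Replicating Channel Selector.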
21. What is Multi-hop data flow?
In a multi-hop flow, events travel through multiple agents before reaching the final destination: the sink of one agent sends events to the source of the next agent.
22. What is Avro Source in Apache Flume?
The Avro source listens on an Avro port and receives events from external Avro client streams. When it is paired with the built-in Avro Sink on another Flume agent, it can create tiered collection topologies.
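A two-tier topology of this kind could be sketched as below. The agent names edge and collector, the host collector.example.com, and the port are assumptions made for the example; channel and remaining component definitions are left out for brevity.

```properties
# First-tier agent "edge": its Avro sink forwards events to the next hop
edge.sinks = k1
edge.sinks.k1.type = avro
edge.sinks.k1.channel = c1
edge.sinks.k1.hostname = collector.example.com
edge.sinks.k1.port = 4141

# Second-tier agent "collector": its Avro source receives those events
collector.sources = r1
collector.sources.r1.type = avro
collector.sources.r1.channels = c1
collector.sources.r1.bind = 0.0.0.0
collector.sources.r1.port = 4141
```

The sink's hostname and port must match the address on which the downstream Avro source is listening.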
23. What is Apache Flume Sink?
An Apache Flume Sink consumes events from a channel and writes them to an external destination. Sinks can be grouped for behaviors such as failover or load balancing by using a SinkGroup and a SinkProcessor.
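For instance, a sink group with a failover processor might be set up as in the sketch below. The group name g1, the sink names k1 and k2, and the priority values are illustrative; the sink with the higher priority value is preferred while it is healthy.

```properties
# Group two sinks so that one can take over when the other fails
b1.sinkgroups = g1
b1.sinkgroups.g1.sinks = k1 k2

# Failover processor: always use the highest-priority available sink
b1.sinkgroups.g1.processor.type = failover
b1.sinkgroups.g1.processor.priority.k1 = 5
b1.sinkgroups.g1.processor.priority.k2 = 10
# Upper limit, in milliseconds, on the backoff applied to a failed sink
b1.sinkgroups.g1.processor.maxpenalty = 10000
```

Setting processor.type to load_balance instead distributes events across the sinks in the group.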
24. What are the types of Sink supported by Apache Flume?
Apache Flume provides the Sinks listed below.
- HDFS Sink
- Hive Sink
- Logger Sink
- Avro Sink
- Thrift Sink
- IRC Sink
- File Roll Sink
- Null Sink
- HBase Sink
- MorphlineSolr Sink
- ElasticSearch Sink
- Kite Dataset Sink
- HTTP Sink
- Custom Sink
25. How is HBaseSink different from AsyncHBaseSink?
Both HBaseSink and AsyncHBaseSink send events to HBase. HBaseSink uses the HTable API to write the data, whereas AsyncHBaseSink uses the asynchbase API to write the stream data asynchronously. With AsyncHBaseSink, failures are handled by callbacks.
26. What is the difference between Apache Flume and Apache Kafka?
Apache Flume uses Sinks to push messages to the destination, whereas with Kafka, consumers use the Kafka consumer API to pull messages from the Kafka broker.
27. What is Apache Flume Multiplexing?
Apache Flume Multiplexing routes each event to one or more channels selected by the value of an event header, instead of copying every event to all channels.
28. What is Apache Flume event batching?
Apache Flume can batch events. The batch size is the maximum number of events that a sink or client takes from a channel in a single transaction. A small batch size lowers throughput but causes less duplication if a failure occurs. A large batch size raises throughput but causes more duplication if a failure occurs.
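The trade-off above is tuned through configuration. The fragment below is a sketch that assumes a Memory Channel and an HDFS sink; the numbers and the HDFS path are placeholders, not recommendations.

```properties
# Channel: transactionCapacity caps how many events one transaction can carry
b1.channels.c1.type = memory
b1.channels.c1.capacity = 10000
b1.channels.c1.transactionCapacity = 1000

# HDFS sink: number of events taken from the channel and written per flush
b1.sinks.k1.type = hdfs
b1.sinks.k1.channel = c1
b1.sinks.k1.hdfs.path = /flume/events
b1.sinks.k1.hdfs.batchSize = 1000
```

The sink's batch size should not exceed the channel's transactionCapacity, because a whole batch has to fit in a single channel transaction.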
29. What is Apache Flume Fan-out data flow?
Apache Flume can send an event from one source to multiple channels. There are two types of fan-out: replicating and multiplexing. With replicating, each event is forwarded to all channels. With multiplexing, each event is forwarded only to the selected channels.
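A replicating fan-out can be sketched as follows; the component names are illustrative. Marking a channel as optional means that a failure to write to it does not fail the whole transaction.

```properties
b1.sources = r1
b1.channels = c1 c2 c3

# Copy every event from r1 into all three channels
b1.sources.r1.channels = c1 c2 c3
b1.sources.r1.selector.type = replicating
# A write failure on c3 is ignored; failures on c1 or c2 fail the transaction
b1.sources.r1.selector.optional = c3
```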
30. What is Apache Flume topology design?
When designing an Apache Flume topology, the first step is to identify all the data sources and the destination sinks. The next step is to decide whether events need to be aggregated or rerouted. If data is collected from many sources, aggregation and rerouting are usually required to direct the events to their different destinations.