Flink DataStream Window Join: Inner Join, Left Join, Right Join


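Four operators are currently known to support joining two streams in the Flink DataStream API:

  1. Connect: Covered in an earlier article. Typically used together with a BroadcastStream (for example, a configuration stream).

  2. Join: Joins two streams. Syntax: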

// Left DataStream
stream
  // Right DataStream
  .join(otherStream)
  // Left key
  .where(<KeySelector>)
  // Right key
  .equalTo(<KeySelector>)
  // WindowAssigner, e.g. TumblingEventTimeWindows, SlidingEventTimeWindows ...
  .window(<WindowAssigner>)
  // JoinFunction or FlatJoinFunction
  .apply(<JoinFunction / FlatJoinFunction>)

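Notes:

A. Only `Inner Join` is supported. Elements are paired and emitted only when, within the same window, the key is present in both streams.

B. It is implemented on top of `CoGroup`.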

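C. For a plain `Inner Join`, prefer `Join`: it states the intent directly and needs less code. Because it is built on `CoGroup` (see B), it should not be expected to outperform an equivalent hand-written `CoGroupFunction`.
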
  3. CoGroup: Co-groups the two DataStreams within the same window: elements are grouped by the same key, and a user-defined CoGroupFunction is applied to each group. Syntax:

// Left DataStream
stream
  // Right DataStream
  .coGroup(otherStream)
  // Left key
  .where(<KeySelector>)
  // Right key
  .equalTo(<KeySelector>)
  // WindowAssigner, e.g. TumblingEventTimeWindows, SlidingEventTimeWindows ...
  .window(<WindowAssigner>)
  // CoGroupFunction
  .apply(<CoGroupFunction>)

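Notes:

A. Within the same window, if one of the two streams has no data for a key, the function receives an empty collection for that side.

B. It is more general than Join, and makes `Left Join`, `Right Join`, and similar cases straightforward to implement.

  4. IntervalJoin: Joins over a time interval, for example joining each element of one stream with the elements of the other stream from a preceding period. Covered in the next article.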

This article uses CoGroup to implement Inner Join, Left Join, and Right Join within a window.
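
To make the per-window behaviour described above concrete, here is a minimal plain-Java sketch of coGroup semantics. It is not Flink code: `CoGroupSketch`, `GroupHandler`, and the helper methods are hypothetical names used only for illustration, and each element doubles as its own key to keep the example short. It groups both inputs of a single window by key, hands both groups to a handler (an empty list for a missing side), and derives the three join types from that.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Plain-Java illustration of per-window coGroup semantics; not Flink code.
class CoGroupSketch {

    /** Receives both sides of one key, similar in spirit to CoGroupFunction#coGroup. */
    interface GroupHandler<L, R> {
        void handle(List<L> left, List<R> right, List<String> out);
    }

    /** Groups both inputs of ONE window by key and calls the handler once per key. */
    static <L, R> List<String> coGroup(List<L> left, List<R> right,
                                       Function<L, String> leftKey,
                                       Function<R, String> rightKey,
                                       GroupHandler<L, R> handler) {
        Map<String, List<L>> leftGroups = new LinkedHashMap<>();
        Map<String, List<R>> rightGroups = new LinkedHashMap<>();
        for (L l : left) {
            leftGroups.computeIfAbsent(leftKey.apply(l), k -> new ArrayList<>()).add(l);
        }
        for (R r : right) {
            rightGroups.computeIfAbsent(rightKey.apply(r), k -> new ArrayList<>()).add(r);
        }
        // Union of keys: left keys first, then keys that only occur on the right.
        List<String> keys = new ArrayList<>(leftGroups.keySet());
        for (String k : rightGroups.keySet()) {
            if (!leftGroups.containsKey(k)) {
                keys.add(k);
            }
        }
        List<String> out = new ArrayList<>();
        for (String key : keys) {
            // A side without data for this key is passed as an empty list.
            handler.handle(leftGroups.getOrDefault(key, Collections.emptyList()),
                           rightGroups.getOrDefault(key, Collections.emptyList()), out);
        }
        return out;
    }

    static List<String> innerJoin(List<String> left, List<String> right) {
        return coGroup(left, right, s -> s, s -> s, (ls, rs, out) -> {
            for (String l : ls) for (String r : rs) out.add(l + "=" + r);
        });
    }

    static List<String> leftJoin(List<String> left, List<String> right) {
        return coGroup(left, right, s -> s, s -> s, (ls, rs, out) -> {
            for (String l : ls) {
                if (rs.isEmpty()) out.add(l + "=null");
                for (String r : rs) out.add(l + "=" + r);
            }
        });
    }

    static List<String> rightJoin(List<String> left, List<String> right) {
        return coGroup(left, right, s -> s, s -> s, (ls, rs, out) -> {
            for (String r : rs) {
                if (ls.isEmpty()) out.add("null=" + r);
                for (String l : ls) out.add(l + "=" + r);
            }
        });
    }

    public static void main(String[] args) {
        List<String> left = List.of("01", "02", "05");
        List<String> right = List.of("01", "02", "03");
        System.out.println("inner: " + innerJoin(left, right));
        System.out.println("left:  " + leftJoin(left, right));
        System.out.println("right: " + rightJoin(left, right));
    }
}
```

The emission order in this sketch follows the order in which keys were first seen; in a real Flink job the order in which keys are emitted should not be relied on.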

Implementing Inner Join, Left Join, and Right Join with CoGroup

Test data

// Test data

// Left stream
// A user browsed a product at a given time; the record also carries the product's price
// {"userID": "user_2", "eventTime": "2019-11-16 17:30:01", "eventType": "browse", "productID": "product_1", "productPrice": 10}

// Right stream
// A user clicked a page at a given time
// {"userID": "user_2", "eventTime": "2019-11-16 17:30:02", "eventType": "click", "pageID": "page_1"}
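
The job below imports `UserBrowseLog` and `UserClickLog` from `com.bigdata.flink.beans`, but the article does not show them. The following is a hypothetical reconstruction, assuming plain beans whose field names match the JSON keys above and whose `toString` matches the format printed in the results; the `int` type for `productPrice` and the `BeansDemo` class are assumptions made for this sketch.

```java
// Hypothetical browse-event bean; field names follow the JSON keys of the left stream.
class UserBrowseLog {
    private String userID;
    private String eventTime;
    private String eventType;
    private String productID;
    private int productPrice; // assumed to be an integer, as in the sample record

    public UserBrowseLog() { } // no-arg constructor, needed by bean-style JSON mapping

    public String getUserID() { return userID; }
    public void setUserID(String userID) { this.userID = userID; }
    public String getEventTime() { return eventTime; }
    public void setEventTime(String eventTime) { this.eventTime = eventTime; }
    public String getEventType() { return eventType; }
    public void setEventType(String eventType) { this.eventType = eventType; }
    public String getProductID() { return productID; }
    public void setProductID(String productID) { this.productID = productID; }
    public int getProductPrice() { return productPrice; }
    public void setProductPrice(int productPrice) { this.productPrice = productPrice; }

    @Override
    public String toString() {
        return "UserBrowseLog{userID='" + userID + "', eventTime='" + eventTime
                + "', eventType='" + eventType + "', productID='" + productID
                + "', productPrice=" + productPrice + "}";
    }
}

// Hypothetical click-event bean; field names follow the JSON keys of the right stream.
class UserClickLog {
    private String userID;
    private String eventTime;
    private String eventType;
    private String pageID;

    public UserClickLog() { }

    public String getUserID() { return userID; }
    public void setUserID(String userID) { this.userID = userID; }
    public String getEventTime() { return eventTime; }
    public void setEventTime(String eventTime) { this.eventTime = eventTime; }
    public String getEventType() { return eventType; }
    public void setEventType(String eventType) { this.eventType = eventType; }
    public String getPageID() { return pageID; }
    public void setPageID(String pageID) { this.pageID = pageID; }

    @Override
    public String toString() {
        return "UserClickLog{userID='" + userID + "', eventTime='" + eventTime
                + "', eventType='" + eventType + "', pageID='" + pageID + "'}";
    }
}

class BeansDemo {
    public static void main(String[] args) {
        UserClickLog click = new UserClickLog();
        click.setUserID("user_2");
        click.setEventTime("2019-11-16 17:30:02");
        click.setEventType("click");
        click.setPageID("page_1");
        System.out.println(click);
    }
}
```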

Example code

package com.bigdata.flink.dataStreamWindowJoin.tumblingTimeWindow.coGroup;

import com.alibaba.fastjson.JSON;
import com.bigdata.flink.beans.UserBrowseLog;
import com.bigdata.flink.beans.UserClickLog;
import lombok.extern.slf4j.Slf4j;
import org.apache.flink.api.common.functions.CoGroupFunction;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.api.java.functions.KeySelector;
import org.apache.flink.api.java.utils.ParameterTool;
import org.apache.flink.streaming.api.TimeCharacteristic;
import org.apache.flink.streaming.api.datastream.CoGroupedStreams;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.ProcessFunction;
import org.apache.flink.streaming.api.functions.timestamps.BoundedOutOfOrdernessTimestampExtractor;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.streaming.api.windowing.windows.TimeWindow;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer010;
import org.apache.flink.util.Collector;
import org.joda.time.DateTime;
import org.joda.time.format.DateTimeFormat;
import org.joda.time.format.DateTimeFormatter;

import java.util.Properties;

/**
 * Summary:
 *  Implements Inner Join, Left Join, and Right Join with CoGroup
 */
@Slf4j
public class Test {
    public static void main(String[] args) throws Exception{

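        // Hard-coded for local testing only; remove this line to honor the real command-line arguments
        args=new String[]{"--application","flink/src/main/java/com/bigdata/flink/dataStreamWindowJoin/application.properties"};

        // 1. Parse the command-line arguments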
        ParameterTool fromArgs = ParameterTool.fromArgs(args);
        ParameterTool parameterTool = ParameterTool.fromPropertiesFile(fromArgs.getRequired("application"));

        String kafkaBootstrapServers = parameterTool.getRequired("kafkaBootstrapServers");

        String browseTopic = parameterTool.getRequired("browseTopic");
        String browseTopicGroupID = parameterTool.getRequired("browseTopicGroupID");

        String clickTopic = parameterTool.getRequired("clickTopic");
        String clickTopicGroupID = parameterTool.getRequired("clickTopicGroupID");

        // 2. Set up the execution environment
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);
        // Parallelism 1, so that console output is not interleaved
        env.setParallelism(1);

        // 3. Add the Kafka sources
        // Browse stream
        Properties browseProperties = new Properties();
        browseProperties.put("bootstrap.servers",kafkaBootstrapServers);
        browseProperties.put("group.id",browseTopicGroupID);
        DataStream<UserBrowseLog> browseStream=env
                .addSource(new FlinkKafkaConsumer010<>(browseTopic, new SimpleStringSchema(), browseProperties))
                .process(new BrowseKafkaProcessFunction())
                .assignTimestampsAndWatermarks(new BrowseBoundedOutOfOrdernessTimestampExtractor(Time.seconds(0)));

        // Click stream
        Properties clickProperties = new Properties();
        clickProperties.put("bootstrap.servers",kafkaBootstrapServers);
        clickProperties.put("group.id",clickTopicGroupID);
        DataStream<UserClickLog> clickStream = env
                .addSource(new FlinkKafkaConsumer010<>(clickTopic, new SimpleStringSchema(), clickProperties))
                .process(new ClickKafkaProcessFunction())
                .assignTimestampsAndWatermarks(new ClickBoundedOutOfOrdernessTimestampExtractor(Time.seconds(0)));

        //browseStream.print();
        //clickStream.print();

        // 4. Join
        CoGroupedStreams.WithWindow<UserBrowseLog, UserClickLog, String, TimeWindow> joinedStream =
                // Left DataStream
                browseStream
                    // Right DataStream
                    .coGroup(clickStream)
                    // Left Key
                    .where(new KeySelector<UserBrowseLog, String>() {
                        @Override
                        public String getKey(UserBrowseLog value) throws Exception {
                            return value.getUserID()+"_"+value.getEventTime();
                        }
                    })
                    // Right Key
                    .equalTo(new KeySelector<UserClickLog, String>() {
                        @Override
                        public String getKey(UserClickLog value) throws Exception {
                            return value.getUserID()+"_"+value.getEventTime();
                        }
                    })
                    // WindowAssigner: TumblingEventTimeWindows
                    .window(TumblingEventTimeWindows.of(Time.seconds(10)));

        // 5. Inner Join / Left Join / Right Join (enable exactly one)
        //joinedStream.apply(new InnerJoinFunction()).print();
        //joinedStream.apply(new LeftJoinFunction()).print();
        joinedStream.apply(new RightJoinFunction()).print();


        env.execute();

    }

    /**
     * Inner Join
     * Emits each user's browse and click for the same moment, only when both are present.
     */
    static class InnerJoinFunction implements CoGroupFunction<UserBrowseLog, UserClickLog, String>{
        @Override
        public void coGroup(Iterable<UserBrowseLog> left, Iterable<UserClickLog> right, Collector<String> out) throws Exception {

            for (UserBrowseLog userBrowseLog : left) {
                for (UserClickLog userClickLog : right) {
                    out.collect(userBrowseLog+" ==Inner Join=> "+userClickLog);
                }
            }
        }
    }

    /**
     * Left Join
     * Emits every browse per user and moment, with the matching click if there is one, otherwise null.
     */
    static class LeftJoinFunction implements CoGroupFunction<UserBrowseLog, UserClickLog, String>{
        @Override
        public void coGroup(Iterable<UserBrowseLog> left, Iterable<UserClickLog> right, Collector<String> out) throws Exception {

            for (UserBrowseLog userBrowseLog : left) {
                boolean noElements = true;
                for (UserClickLog userClickLog : right) {
                    noElements = false;
                    out.collect(userBrowseLog+" ==Left Join=> "+userClickLog);
                }

                if (noElements){
                    out.collect(userBrowseLog+" ==Left Join=> "+"null");
                }
            }
        }
    }

    /**
     * Right Join
     * Emits every click per user and moment, with the matching browse if there is one, otherwise null.
     */
    static class RightJoinFunction implements CoGroupFunction<UserBrowseLog, UserClickLog, String>{
        @Override
        public void coGroup(Iterable<UserBrowseLog> left, Iterable<UserClickLog> right, Collector<String> out) throws Exception {

            for (UserClickLog userClickLog : right) {
                boolean noElements = true;
                for (UserBrowseLog userBrowseLog : left) {
                    noElements = false;
                    out.collect(userBrowseLog+" ==Right Join=> "+userClickLog);
                }

                if(noElements){
                    out.collect("null"+" ==Right Join=> "+userClickLog);
                }
            }
        }
    }


    /**
     * Parses a Kafka record into a UserBrowseLog; malformed records are logged and skipped.
     */
    static class BrowseKafkaProcessFunction extends ProcessFunction<String, UserBrowseLog> {
        @Override
        public void processElement(String value, Context ctx, Collector<UserBrowseLog> out) throws Exception {
            try {
                // Named browseLog so it does not shadow the Slf4j logger field `log`
                UserBrowseLog browseLog = JSON.parseObject(value, UserBrowseLog.class);
                if(browseLog!=null){
                    out.collect(browseLog);
                }
            }catch (Exception ex){
                log.error("Failed to parse Kafka record...",ex);
            }
        }
    }

    /**
     * Parses a Kafka record into a UserClickLog; malformed records are logged and skipped.
     */
    static class ClickKafkaProcessFunction extends ProcessFunction<String, UserClickLog> {
        @Override
        public void processElement(String value, Context ctx, Collector<UserClickLog> out) throws Exception {
            try {
                // Named clickLog so it does not shadow the Slf4j logger field `log`
                UserClickLog clickLog = JSON.parseObject(value, UserClickLog.class);
                if(clickLog!=null){
                    out.collect(clickLog);
                }
            }catch (Exception ex){
                log.error("Failed to parse Kafka record...",ex);
            }
        }
    }

    /**
     * Extracts the event timestamp used to generate watermarks
     */
    static class BrowseBoundedOutOfOrdernessTimestampExtractor extends BoundedOutOfOrdernessTimestampExtractor<UserBrowseLog> {

        BrowseBoundedOutOfOrdernessTimestampExtractor(Time maxOutOfOrderness) {
            super(maxOutOfOrderness);
        }

        @Override
        public long extractTimestamp(UserBrowseLog element) {
            DateTimeFormatter dateTimeFormatter = DateTimeFormat.forPattern("yyyy-MM-dd HH:mm:ss");
            DateTime dateTime = DateTime.parse(element.getEventTime(), dateTimeFormatter);
            return dateTime.getMillis();
        }
    }

    /**
     * Extracts the event timestamp used to generate watermarks
     */
    static class ClickBoundedOutOfOrdernessTimestampExtractor extends BoundedOutOfOrdernessTimestampExtractor<UserClickLog> {

        ClickBoundedOutOfOrdernessTimestampExtractor(Time maxOutOfOrderness) {
            super(maxOutOfOrderness);
        }

        @Override
        public long extractTimestamp(UserClickLog element) {
            DateTimeFormatter dateTimeFormatter = DateTimeFormat.forPattern("yyyy-MM-dd HH:mm:ss");
            DateTime dateTime = DateTime.parse(element.getEventTime(), dateTimeFormatter);
            return dateTime.getMillis();
        }
    }

}
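
The job assigns each record to a 10-second tumbling event-time window. As a rough illustration of that assignment, the sketch below aligns a timestamp to the start of its window, assuming windows aligned to the epoch with no offset. `TumblingWindowSketch` and its methods are hypothetical names, and the sketch reads the event time as UTC with `java.time`, whereas the job above parses it with Joda-Time in the JVM's default zone; for 10-second windows this does not change which window a record falls into.

```java
import java.time.LocalDateTime;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

// Illustrates how a timestamp maps to a tumbling window; not Flink code.
class TumblingWindowSketch {
    static final DateTimeFormatter FMT = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss");

    /** Parses the article's eventTime format; UTC is an assumption of this sketch. */
    static long toEpochMillis(String eventTime) {
        return LocalDateTime.parse(eventTime, FMT).toInstant(ZoneOffset.UTC).toEpochMilli();
    }

    /** Start of the epoch-aligned tumbling window that contains the timestamp. */
    static long windowStart(long timestampMillis, long sizeMillis) {
        return timestampMillis - Math.floorMod(timestampMillis, sizeMillis);
    }

    static String format(long epochMillis) {
        return LocalDateTime.ofEpochSecond(Math.floorDiv(epochMillis, 1000L), 0, ZoneOffset.UTC).format(FMT);
    }

    /** Renders the half-open window [start, end) that the event time belongs to. */
    static String describeWindow(String eventTime, long sizeMillis) {
        long start = windowStart(toEpochMillis(eventTime), sizeMillis);
        return "[" + format(start) + ", " + format(start + sizeMillis) + ")";
    }

    public static void main(String[] args) {
        long tenSeconds = 10_000L;
        for (String t : new String[]{"2019-11-16 17:30:01", "2019-11-16 17:30:05", "2019-11-16 17:30:10"}) {
            System.out.println(t + " -> " + describeWindow(t, tenSeconds));
        }
    }
}
```

Under these assumptions, records at 17:30:01 and 17:30:05 share the window starting at 17:30:00, while a record at 17:30:10 opens the next window.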

Results

Send the following records to the two topics:

// Browse-log topic
{"userID": "user_2", "eventTime": "2019-11-16 17:30:01", "eventType": "browse", "productID": "product_1", "productPrice": 10}
{"userID": "user_2", "eventTime": "2019-11-16 17:30:02", "eventType": "browse", "productID": "product_1", "productPrice": 10}
{"userID": "user_2", "eventTime": "2019-11-16 17:30:05", "eventType": "browse", "productID": "product_1", "productPrice": 10}
{"userID": "user_2", "eventTime": "2019-11-16 17:30:10", "eventType": "browse", "productID": "product_1", "productPrice": 10}

// Click-log topic
{"userID": "user_2", "eventTime": "2019-11-16 17:30:01", "eventType": "click", "pageID": "page_1"}
{"userID": "user_2", "eventTime": "2019-11-16 17:30:02", "eventType": "click", "pageID": "page_1"}
{"userID": "user_2", "eventTime": "2019-11-16 17:30:03", "eventType": "click", "pageID": "page_1"}
{"userID": "user_2", "eventTime": "2019-11-16 17:30:10", "eventType": "click", "pageID": "page_1"}

Inner Join

UserBrowseLog{userID='user_2', eventTime='2019-11-16 17:30:01', eventType='browse', productID='product_1', productPrice=10} ==Inner Join=> UserClickLog{userID='user_2', eventTime='2019-11-16 17:30:01', eventType='click', pageID='page_1'}
UserBrowseLog{userID='user_2', eventTime='2019-11-16 17:30:02', eventType='browse', productID='product_1', productPrice=10} ==Inner Join=> UserClickLog{userID='user_2', eventTime='2019-11-16 17:30:02', eventType='click', pageID='page_1'}

Left Join

UserBrowseLog{userID='user_2', eventTime='2019-11-16 17:30:02', eventType='browse', productID='product_1', productPrice=10} ==Left Join=> UserClickLog{userID='user_2', eventTime='2019-11-16 17:30:02', eventType='click', pageID='page_1'}
UserBrowseLog{userID='user_2', eventTime='2019-11-16 17:30:05', eventType='browse', productID='product_1', productPrice=10} ==Left Join=> null
UserBrowseLog{userID='user_2', eventTime='2019-11-16 17:30:01', eventType='browse', productID='product_1', productPrice=10} ==Left Join=> UserClickLog{userID='user_2', eventTime='2019-11-16 17:30:01', eventType='click', pageID='page_1'}

Right Join

null ==Right Join=> UserClickLog{userID='user_2', eventTime='2019-11-16 17:30:03', eventType='click', pageID='page_1'}
UserBrowseLog{userID='user_2', eventTime='2019-11-16 17:30:02', eventType='browse', productID='product_1', productPrice=10} ==Right Join=> UserClickLog{userID='user_2', eventTime='2019-11-16 17:30:02', eventType='click', pageID='page_1'}
UserBrowseLog{userID='user_2', eventTime='2019-11-16 17:30:01', eventType='browse', productID='product_1', productPrice=10} ==Right Join=> UserClickLog{userID='user_2', eventTime='2019-11-16 17:30:01', eventType='click', pageID='page_1'}
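
None of the three outputs contains the 17:30:10 records. A likely explanation is that those records belong to the next window, which has not fired yet: an event-time window fires once the watermark reaches its last millisecond, and with an out-of-orderness bound of 0 the watermark only trails the highest timestamp seen so far. The sketch below replays that reasoning in plain Java. `WatermarkFiringSketch` is a hypothetical name, and timestamps are expressed as millisecond offsets from 17:30:00 to keep the arithmetic readable.

```java
// Replays the watermark arithmetic for the sample data; not Flink code.
class WatermarkFiringSketch {
    static final long WINDOW_SIZE = 10_000L;      // 10-second tumbling windows, as in the job
    static final long MAX_OUT_OF_ORDERNESS = 0L;  // Time.seconds(0), as in the job

    /** Watermark of one stream: highest timestamp seen minus the out-of-orderness bound. */
    static long watermark(long[] timestamps) {
        long max = Long.MIN_VALUE;
        for (long t : timestamps) {
            max = Math.max(max, t);
        }
        return max - MAX_OUT_OF_ORDERNESS;
    }

    /** A two-input operator advances to the smaller of its two input watermarks. */
    static long combined(long leftWatermark, long rightWatermark) {
        return Math.min(leftWatermark, rightWatermark);
    }

    /** The window [start, start + size) fires once the watermark reaches start + size - 1. */
    static boolean fires(long windowStart, long watermark) {
        return watermark >= windowStart + WINDOW_SIZE - 1;
    }

    public static void main(String[] args) {
        // Offsets from 17:30:00 in milliseconds, taken from the sample records.
        long[] browse = {1_000L, 2_000L, 5_000L, 10_000L};
        long[] click = {1_000L, 2_000L, 3_000L, 10_000L};
        long wm = combined(watermark(browse), watermark(click));
        System.out.println("watermark offset: " + wm);
        System.out.println("window [17:30:00, 17:30:10) fired: " + fires(0L, wm));
        System.out.println("window [17:30:10, 17:30:20) fired: " + fires(10_000L, wm));
    }
}
```

Under these assumptions the 17:30:10 records serve only to push the watermark past the end of the first window; they would themselves be emitted once later records advance the watermark to 17:30:19.999 or beyond.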