
Apache HTTP client hangs when doing a second request


Issue

In a Flink job I wrote a sink that uses the Apache HTTP client to send POST requests to an API. After deploying it, the first few requests always succeeded, but after that the sink just hung.

Troubleshooting

  1. Added a timeout to the request. The result: the first few requests still succeeded, and every request after that threw a ConnectTimeoutException.
  2. Checked the API server's logs: only the first few requests had log entries; the later requests left no corresponding logs at all.
  3. Then configured the connection pool manually:

    val connManager: PoolingHttpClientConnectionManager = new PoolingHttpClientConnectionManager
    connManager.setMaxTotal(1)
    connManager.setDefaultMaxPerRoute(1)
    val httpClient = HttpClients.custom().setConnectionManager(connManager).build()
    

    With both set to 1, only one request succeeded; with both set to 10, only the first 10 requests succeeded.

Solution

By now it was fairly clear that the connections were being held and never returned to the pool, so later requests kept waiting for an idle connection until they timed out, and the server never received them.
The root cause: the Apache HTTP client requires you to release the connection manually!

val httpPost = new HttpPost(url)
httpPost.setEntity(new StringEntity(payload))
httpPost.setHeader("Accept", "application/json")
httpPost.setHeader("Content-type", "application/json")
val requestConfig = RequestConfig.custom()
    .setConnectTimeout(timeout * 1000)
    .setConnectionRequestTimeout(timeout * 1000)
    .setSocketTimeout(timeout * 1000).build()
httpPost.setConfig(requestConfig)
var response: CloseableHttpResponse = null
try {
    response = httpClient.execute(httpPost)
} catch {
    case e: ConnectTimeoutException => logger.error("Request timeout")
    case e: IOException => logger.error("IO exception")
    case _: Throwable => logger.error("Got some other kind of Throwable exception")
} finally {
    httpPost.releaseConnection()  // This line is important!!!
}
if (response != null) {
    logger.info("Response: " + response.toString)
}

A round of SQLAlchemy query tuning


Issue

I wrote a backend API that joins two MySQL tables and returns a list of results. Reading just the last month of data was already quite slow (around 8 s). This post records how the problem was found and optimized.
The API is written with Python 3, Flask and flask-restful; the ORM is flask-sqlalchemy. Here is the original code:

class AlertCollection(Resource):
    parser = reqparse.RequestParser()
    parser.add_argument('priority')
    parser.add_argument('status')
    parser.add_argument('type')
    parser.add_argument('start')
    parser.add_argument('end')
    parser.add_argument('rule_id')
    parser.add_argument('rule_name')
    parser.add_argument('offset', type=int, default=0)
    parser.add_argument('limit', type=int, default=20)

    def get(self):
        args = self.parser.parse_args()
        logger.info('Get alert notification collection with params: {}'.format(args))
        alert_query = AggAlertModel.query
        if args.get('rule_id'):
            alert_query = alert_query.filter(AggAlertModel.rule_id == args['rule_id'])
        if args.get('status'):
            alert_query = alert_query.filter(AggAlertModel.status.in_(args['status'].split(',')))
        if args.get('start'):
            alert_query = alert_query.filter(AggAlertModel.last_trigger_time >= int(args['start']))
        if args.get('end'):
            alert_query = alert_query.filter(AggAlertModel.last_trigger_time <= int(args['end']))
        alert_query = alert_query.join(AlertRuleMeta, AggAlertModel.rule_id == AlertRuleMeta.id)
        if args.get('rule_name'):
            alert_query = alert_query.filter(AlertRuleMeta.name == args['rule_name'])
        if args.get('priority'):
            alert_query = alert_query.filter(AlertRuleMeta.priority.in_(args['priority'].split(',')))
        if args.get('type'):
            alert_query = alert_query.filter(AlertRuleMeta.type.in_(args['type'].split(',')))
        alert_query = alert_query.order_by(AggAlertModel.last_trigger_time.desc())
        try:
            alert_count = alert_query.count()
            if args['limit'] != -1:
                alert_query = alert_query.limit(args['limit']).offset(args['offset'])
            alert_result = alert_query.all()
            data = {'items': [item.row_to_dict() for item in alert_result], 'total': alert_count}
        except Exception as e:
            return {'status': 'failed', 'message': getattr(e, 'message', repr(e))}, 500
        return {'status': 'success', 'data': data}, 200

The ORM model:

class AggAlertModel(db.Model):
    __tablename__ = 'agg_alert'

    id = db.Column(db.Integer, primary_key=True)
    rule_id = db.Column(db.Integer, db.ForeignKey('alert_rules_meta.id'), nullable=False)
    rule = db.relationship("AlertRuleMeta", lazy="joined")
    count = db.Column(db.SmallInteger, nullable=False, default=0)
    status = db.Column(db.String(32), nullable=False, default='Triggered',
                       comment='Enum: Triggered, Acknowledged, Resolved')
    follower = db.Column(db.String(64), default='', nullable=False)
    rca = db.Column(db.TEXT, nullable=False, default='')
    first_trigger_time = db.Column(db.Integer, default=time.time, nullable=False)
    last_trigger_time = db.Column(db.Integer, default=time.time, nullable=False)
    resolve_time = db.Column(db.Integer, nullable=True)
    update_time = db.Column(db.Integer, default=time.time,
                            onupdate=time.time, nullable=False, comment='update time')
    labels = db.Column(db.String(1024), nullable=False, default='{}')
    channel_id = db.Column(db.String(64), nullable=False, default='')
    alert_ts = db.Column(db.String(64), nullable=False, default='')

    def row_to_dict(self):
        """Return object data in serializeable format"""
        return {
            'id': self.id,
            'count': self.count,
            'status': self.status,
            'follower': self.follower,
            'rca': self.rca,
            'first_trigger_time': self.first_trigger_time,
            'last_trigger_time': self.last_trigger_time,
            'resolve_time': self.resolve_time,
            'update_time': self.update_time,
            'priority': self.rule.priority,
            'summary': self.rule.name,
            'type': self.rule.type,
            'team': self.rule.team,
            'owner': self.rule.owner,
            'domain': self.rule.domain,
            'labels': {**json.loads(self.rule.labels), **(json.loads(self.labels) if self.labels else {})},
            'sop': self.rule.sop,
            'channel_id': self.channel_id,
            'alert_ts': self.alert_ts,
        }

I will not paste the AlertRuleMeta model here; it is simply joined via the rule_id foreign key.

Debugging

Flask can print the actual SQL statements it executes once you add this setting, which lets us analyze the SQL performance further:

app.config['SQLALCHEMY_ECHO'] = True

Now let's look at which SQL statements were issued and how long each one took:

2022-02-23 11:51:46,635 INFO sqlalchemy.engine.base.Engine SELECT count(*) AS count_1 
FROM (SELECT agg_alert.id AS agg_alert_id, agg_alert.rule_id AS agg_alert_rule_id, agg_alert.count AS agg_alert_count, agg_alert.`status` AS agg_alert_status, agg_alert.follower AS agg_alert_follower, agg_alert.rca AS agg_alert_rca, agg_alert.first_trigger_time AS agg_alert_first_trigger_time, agg_alert.last_trigger_time AS agg_alert_last_trigger_time, agg_alert.resolve_time AS agg_alert_resolve_time, agg_alert.update_time AS agg_alert_update_time, agg_alert.labels AS agg_alert_labels, agg_alert.channel_id AS agg_alert_channel_id, agg_alert.alert_ts AS agg_alert_alert_ts 
FROM agg_alert INNER JOIN alert_rules_meta ON agg_alert.rule_id = alert_rules_meta.id 
WHERE agg_alert.last_trigger_time >= %(last_trigger_time_1)s AND agg_alert.last_trigger_time <= %(last_trigger_time_2)s AND alert_rules_meta.priority IN (%(priority_1)s, %(priority_2)s) ORDER BY agg_alert.last_trigger_time DESC) AS anon_1
2022-02-23 11:51:46,635 INFO sqlalchemy.engine.base.Engine {'last_trigger_time_1': 1642928019, 'last_trigger_time_2': 1645520019, 'priority_1': 'p1', 'priority_2': 'p2'}
2022-02-23 11:51:47,076 INFO sqlalchemy.engine.base.Engine SELECT agg_alert.id AS agg_alert_id, agg_alert.rule_id AS agg_alert_rule_id, agg_alert.count AS agg_alert_count, agg_alert.`status` AS agg_alert_status, agg_alert.follower AS agg_alert_follower, agg_alert.rca AS agg_alert_rca, agg_alert.first_trigger_time AS agg_alert_first_trigger_time, agg_alert.last_trigger_time AS agg_alert_last_trigger_time, agg_alert.resolve_time AS agg_alert_resolve_time, agg_alert.update_time AS agg_alert_update_time, agg_alert.labels AS agg_alert_labels, agg_alert.channel_id AS agg_alert_channel_id, agg_alert.alert_ts AS agg_alert_alert_ts 
FROM agg_alert INNER JOIN alert_rules_meta ON agg_alert.rule_id = alert_rules_meta.id 
WHERE agg_alert.last_trigger_time >= %(last_trigger_time_1)s AND agg_alert.last_trigger_time <= %(last_trigger_time_2)s AND alert_rules_meta.priority IN (%(priority_1)s, %(priority_2)s) ORDER BY agg_alert.last_trigger_time DESC 
 LIMIT %(param_1)s, %(param_2)s
2022-02-23 11:51:47,076 INFO sqlalchemy.engine.base.Engine {'last_trigger_time_1': 1642928019, 'last_trigger_time_2': 1645520019, 'priority_1': 'p1', 'priority_2': 'p2', 'param_1': 0, 'param_2': 20}
2022-02-23 11:51:47,563 INFO sqlalchemy.engine.base.Engine SELECT alert_rules_meta.id AS alert_rules_meta_id, alert_rules_meta.name AS alert_rules_meta_name, alert_rules_meta.description AS alert_rules_meta_description, alert_rules_meta.type AS alert_rules_meta_type, alert_rules_meta.sub_type AS alert_rules_meta_sub_type, alert_rules_meta.team AS alert_rules_meta_team, alert_rules_meta.domain AS alert_rules_meta_domain, alert_rules_meta.labels AS alert_rules_meta_labels, alert_rules_meta.enabled AS alert_rules_meta_enabled, alert_rules_meta.owner AS alert_rules_meta_owner, alert_rules_meta.sop AS alert_rules_meta_sop, alert_rules_meta.priority AS alert_rules_meta_priority, alert_rules_meta.create_time AS alert_rules_meta_create_time, alert_rules_meta.update_time AS alert_rules_meta_update_time 
FROM alert_rules_meta 
WHERE alert_rules_meta.id = %(param_1)s
2022-02-23 11:51:47,563 INFO sqlalchemy.engine.base.Engine {'param_1': 2442}

Surprisingly, the ORM issued three SQL queries. The first two are expected: one for the count and one for the actual paginated rows. The third one is the strange one: it fetches a single specific AlertRuleMeta record.
Looking more closely at the second query, it does perform the join, but it never adds the AlertRuleMeta columns we need to the SELECT list. That forces the ORM to go back and query the AlertRuleMeta table again for every record it needs, so if the rows returned by the second query reference N distinct AlertRuleMeta records, it issues N extra queries! Even with an index on alert_rules_meta.id, that many round trips add up.
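Incidentally, these extra per-row lookups can also be eliminated without listing columns by hand (the approach taken in the Optimization section below): SQLAlchemy's contains_eager tells the ORM that the explicit join already loads the rule relationship. A minimal sketch, assuming the same models as above:

from sqlalchemy.orm import contains_eager

# Reuse the explicit join to populate AggAlertModel.rule in the same query,
# so row_to_dict() can read self.rule.* without issuing one SELECT per rule.
alert_query = (AggAlertModel.query
               .join(AlertRuleMeta, AggAlertModel.rule_id == AlertRuleMeta.id)
               .options(contains_eager(AggAlertModel.rule)))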
Besides that, two more things can be optimized:

  • When limit=-1 there is no need for the extra count query, because the size of the un-paginated result is exactly the count we want
  • Some columns of AlertRuleMeta are never used; filtering them out saves data-transfer time

Optimization

The optimized API code:

class AlertCollection(Resource):
    parser = reqparse.RequestParser()
    parser.add_argument('priority')
    parser.add_argument('status')
    parser.add_argument('type')
    parser.add_argument('start')
    parser.add_argument('end')
    parser.add_argument('rule_id')
    parser.add_argument('rule_name')
    parser.add_argument('offset', type=int, default=0)
    parser.add_argument('limit', type=int, default=20)

    def get(self):
        args = self.parser.parse_args()
        logger.info('Get alert notification collection with params: {}'.format(args))
        # Only fetch the fields we want
        fields = [AggAlertModel.id, AggAlertModel.count, AggAlertModel.status, AggAlertModel.follower,
                  AggAlertModel.rca, AggAlertModel.first_trigger_time, AggAlertModel.last_trigger_time,
                  AggAlertModel.resolve_time, AggAlertModel.update_time, AggAlertModel.labels, AlertRuleMeta.name,
                  AlertRuleMeta.priority, AlertRuleMeta.type, AlertRuleMeta.team, AlertRuleMeta.owner,
                  AlertRuleMeta.domain, AlertRuleMeta.labels]
        alert_query = AggAlertModel.query.join(AlertRuleMeta, AggAlertModel.rule_id == AlertRuleMeta.id).add_columns(
            *fields)
        if args.get('rule_id'):
            alert_query = alert_query.filter(AggAlertModel.rule_id == args['rule_id'])
        if args.get('status'):
            alert_query = alert_query.filter(AggAlertModel.status.in_(args['status'].split(',')))
        if args.get('start'):
            alert_query = alert_query.filter(AggAlertModel.last_trigger_time >= int(args['start']))
        if args.get('end'):
            alert_query = alert_query.filter(AggAlertModel.last_trigger_time <= int(args['end']))
        if args.get('rule_name'):
            alert_query = alert_query.filter(AlertRuleMeta.name == args['rule_name'])
        if args.get('priority'):
            alert_query = alert_query.filter(AlertRuleMeta.priority.in_(args['priority'].split(',')))
        if args.get('type'):
            alert_query = alert_query.filter(AlertRuleMeta.type.in_(args['type'].split(',')))
        alert_query = alert_query.order_by(AggAlertModel.last_trigger_time.desc())
        try:
            alert_count = -1
            if args['limit'] != -1:
                # No need to query count when limit=-1
                alert_count = alert_query.count()
                alert_query = alert_query.limit(args['limit']).offset(args['offset'])
            alert_result = alert_query.all()
            if alert_count == -1:
                alert_count = len(alert_result)
            data = {'items': [item.AggAlert.row_to_dict() for item in alert_result], 'total': alert_count}
        except Exception as e:
            return {'status': 'failed', 'message': getattr(e, 'message', repr(e))}, 500
        return {'status': 'success', 'data': data}, 200

The SQL statements after the optimization:

2022-02-23 10:43:10,926 INFO sqlalchemy.engine.base.Engine SELECT count(*) AS count_1 
FROM (SELECT agg_alert.id AS agg_alert_id, agg_alert.rule_id AS agg_alert_rule_id, agg_alert.count AS agg_alert_count, agg_alert.`status` AS agg_alert_status, agg_alert.follower AS agg_alert_follower, agg_alert.rca AS agg_alert_rca, agg_alert.first_trigger_time AS agg_alert_first_trigger_time, agg_alert.last_trigger_time AS agg_alert_last_trigger_time, agg_alert.resolve_time AS agg_alert_resolve_time, agg_alert.update_time AS agg_alert_update_time, agg_alert.labels AS agg_alert_labels, agg_alert.channel_id AS agg_alert_channel_id, agg_alert.alert_ts AS agg_alert_alert_ts, alert_rules_meta.name AS alert_rules_meta_name, alert_rules_meta.priority AS alert_rules_meta_priority, alert_rules_meta.type AS alert_rules_meta_type, alert_rules_meta.team AS alert_rules_meta_team, alert_rules_meta.owner AS alert_rules_meta_owner, alert_rules_meta.domain AS alert_rules_meta_domain, alert_rules_meta.labels AS alert_rules_meta_labels 
FROM agg_alert INNER JOIN alert_rules_meta ON agg_alert.rule_id = alert_rules_meta.id 
WHERE agg_alert.last_trigger_time >= %(last_trigger_time_1)s AND agg_alert.last_trigger_time <= %(last_trigger_time_2)s AND alert_rules_meta.priority IN (%(priority_1)s, %(priority_2)s) ORDER BY agg_alert.last_trigger_time DESC) AS anon_1
2022-02-23 10:43:10,926 INFO sqlalchemy.engine.base.Engine {'last_trigger_time_1': 1642928019, 'last_trigger_time_2': 1645520019, 'priority_1': 'p1', 'priority_2': 'p2'}
2022-02-23 10:43:11,380 INFO sqlalchemy.engine.base.Engine SELECT agg_alert.id AS agg_alert_id, agg_alert.rule_id AS agg_alert_rule_id, agg_alert.count AS agg_alert_count, agg_alert.`status` AS agg_alert_status, agg_alert.follower AS agg_alert_follower, agg_alert.rca AS agg_alert_rca, agg_alert.first_trigger_time AS agg_alert_first_trigger_time, agg_alert.last_trigger_time AS agg_alert_last_trigger_time, agg_alert.resolve_time AS agg_alert_resolve_time, agg_alert.update_time AS agg_alert_update_time, agg_alert.labels AS agg_alert_labels, agg_alert.channel_id AS agg_alert_channel_id, agg_alert.alert_ts AS agg_alert_alert_ts, alert_rules_meta.name AS alert_rules_meta_name, alert_rules_meta.priority AS alert_rules_meta_priority, alert_rules_meta.type AS alert_rules_meta_type, alert_rules_meta.team AS alert_rules_meta_team, alert_rules_meta.owner AS alert_rules_meta_owner, alert_rules_meta.domain AS alert_rules_meta_domain, alert_rules_meta.labels AS alert_rules_meta_labels, alert_rules_meta_1.id AS alert_rules_meta_1_id, alert_rules_meta_1.name AS alert_rules_meta_1_name, alert_rules_meta_1.description AS alert_rules_meta_1_description, alert_rules_meta_1.type AS alert_rules_meta_1_type, alert_rules_meta_1.sub_type AS alert_rules_meta_1_sub_type, alert_rules_meta_1.team AS alert_rules_meta_1_team, alert_rules_meta_1.domain AS alert_rules_meta_1_domain, alert_rules_meta_1.labels AS alert_rules_meta_1_labels, alert_rules_meta_1.enabled AS alert_rules_meta_1_enabled, alert_rules_meta_1.owner AS alert_rules_meta_1_owner, alert_rules_meta_1.sop AS alert_rules_meta_1_sop, alert_rules_meta_1.priority AS alert_rules_meta_1_priority, alert_rules_meta_1.create_time AS alert_rules_meta_1_create_time, alert_rules_meta_1.update_time AS alert_rules_meta_1_update_time 
FROM agg_alert INNER JOIN alert_rules_meta ON agg_alert.rule_id = alert_rules_meta.id LEFT OUTER JOIN alert_rules_meta AS alert_rules_meta_1 ON alert_rules_meta_1.id = agg_alert.rule_id 
WHERE agg_alert.last_trigger_time >= %(last_trigger_time_1)s AND agg_alert.last_trigger_time <= %(last_trigger_time_2)s AND alert_rules_meta.priority IN (%(priority_1)s, %(priority_2)s) ORDER BY agg_alert.last_trigger_time DESC 
 LIMIT %(param_1)s, %(param_2)s
2022-02-23 10:43:11,380 INFO sqlalchemy.engine.base.Engine {'last_trigger_time_1': 1642928019, 'last_trigger_time_2': 1645520019, 'priority_1': 'p1', 'priority_2': 'p2', 'param_1': 0, 'param_2': 20}

The latency dropped from 8 s to 2.6 s, roughly a 3x improvement.

Flink in practice


How to debug

Flink jobs can run locally, but note that if the Flink dependencies in pom.xml use the provided scope, running the job directly fails with NoClassDefFoundError. The Flink runtime already ships those dependencies, so they don't need to be packaged into the jar, but they must be on the classpath for local execution. (When running from IntelliJ, you sometimes need to restart or delete the run/debug configuration before running again.)
Without touching pom.xml, you can also include provided dependencies via the run/debug configuration:

(Screenshot: IntelliJ IDEA run/debug configuration)

Also, Flink is incompatible with slf4j-log4j12; if that package shows up among the dependencies, local startup will fail.

A simple test

Socket Stream

A fairly simple test is to send data over a socket and have Flink receive and print it.
Start a socket process locally:

nc -lk 7777

The Flink code:

import org.apache.flink.streaming.api.scala._

object SimpleTest {
  def main(args: Array[String]): Unit = {
    // set up streaming execution environment
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env.setParallelism(1)
    // Receive from socket
    val stream = env.socketTextStream("localhost", 7777)
    // Sink to stdout
    stream.print()
    // Execute flink job
    env.execute("Test Job")
  }
}

After starting the Flink job locally, type a few lines into the socket process and you will see Flink's output.

From elements

A simple word-count job:

import org.apache.flink.api.common.functions.ReduceFunction
import org.apache.flink.streaming.api.scala._
import org.apache.log4j._

object TestJob {

  val logger: Logger = Logger.getLogger(getClass.getName)

  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env.setParallelism(1)

    val dataStream = env.fromElements("aaa bbb ccc", "qqq ccc fff")
    val keyedStream = dataStream.flatMap { x => x.split(" ")}.map(x => (x, 1)).keyBy(0)
    keyedStream.print("keyed")
    val reduceStream = keyedStream.reduce(new ReduceFunction[(String, Int)] {
      override def reduce(t1: (String, Int), t2: (String, Int)): (String, Int) = {
        (t1._1, t1._2 + t2._2)
      }
    })
    reduceStream.print("reduced")
    env.execute("Test job")
  }
}

Fixing a MySQL phantom read problem


Issue

I recently wrote an API that reads a row from MySQL and performs different operations depending on the value it reads. Testing revealed a problem under concurrent calls. The problem can be simplified to:

  1. Read a counter field count from MySQL.
  2. Act according to the value of count:
    a. if count == 0, perform operation A
    b. if count > 0, perform operation B and decrement count by 1

The problem: when two (or more) requests arrive at the same time, a phantom read occurs. For example, both requests read count as 1 at the same moment, so both perform operation B and decrement count; the end result is that count becomes 0 and operation B has run twice. What we actually want is operation B once, followed by operation A once.

Solution

Depending on which MySQL transaction isolation level you pick, there are two ways to solve this:

  1. Raise the isolation level to the strictest one, SERIALIZABLE (a sketch follows this list)
  2. Lock the records in the table
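For the first approach, the isolation level is set on the engine. A minimal sketch (the DSN is made up; the SQLALCHEMY_ENGINE_OPTIONS key assumes a reasonably recent flask-sqlalchemy):

from sqlalchemy import create_engine

# Plain SQLAlchemy: raise the isolation level for every connection of this engine
engine = create_engine('mysql+pymysql://user:pass@localhost/mydb',
                       isolation_level='SERIALIZABLE')

# flask-sqlalchemy: pass the same option through the engine options config
app.config['SQLALCHEMY_ENGINE_OPTIONS'] = {'isolation_level': 'SERIALIZABLE'}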

For the second approach, here is my SQLAlchemy ORM code (it uses .with_for_update() to lock the single record):

try:
    db.session.begin()
    # Lock the record using FOR UPDATE
    task_record = db.session.query(WorkflowTask).filter_by(workflow_id=workflow_id,
                                                           check_signature=check_signature).with_for_update().first()
    if task_record.parent_count == 1:
        logger.info('starting check point')
        res = signature.delay(workflow_id=workflow_id, start_time=args.get('start_time'),
                              end_time=args.get('end_time'))
        db.session.query(WorkflowTask).filter_by(workflow_id=workflow_id, id=task_record.id).update(
            (dict(status='STARTED', parent_count=0, task_id=res.id, is_triggered=True)))
        db.session.commit()
        return {'status': 'success', 'data': {'task_id': res.id}}, 201
    else:
        logger.info('not ready to start, mark parent_count to {}'.format(task_record.parent_count - 1))
        db.session.query(WorkflowTask).filter_by(workflow_id=workflow_id, id=task_record.id).update(
            (dict(parent_count=WorkflowTask.parent_count - 1,
                  is_triggered=True if args.get('trigger') == 1 else False)))
        db.session.commit()
        return {'status': 'success', 'data': {'task_id': None}}, 201
except Exception as e:
    logger.error(getattr(e, 'message', repr(e)), exc_info=True)
    db.session.rollback()
    return {'status': 'failed', 'message': getattr(e, 'message', repr(e))}, 500

K8S troubleshooting notes

  1. minikube does not use the locally installed Docker daemon, so it will not pick up locally built images; instead it pulls them from a remote registry. If you want to build an image and use it in minikube right away, switch to minikube's Docker daemon before building:

    eval $(minikube docker-env)
    

    and set imagePullPolicy to Never in the corresponding K8S schema, so that the freshly built local image is used (see the snippet below).
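    A minimal sketch of the relevant part of a pod spec (container and image names are made up):

    containers:
      - name: my-app              # hypothetical container name
        image: my-app:latest      # image built against minikube's Docker daemon
        imagePullPolicy: Never    # never pull; use only the local image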

  2. While building images with minikube's Docker I once ran into pip install failures. The root cause was that minikube's Docker had too few resources (CPU, memory), so packages that needed to be compiled during installation failed. Giving minikube more resources fixed it, e.g.:

    minikube start --memory 4096 --cpus 4
    
  3. Want to mount a local file or directory into a container of some pod? First mount the local path into the VM that minikube created:

    minikube start --mount --mount-string="$HOME/.minikube:/data/.minikube"
    

    You can run minikube ssh to log into the minikube VM and check whether the files were mounted successfully.

    Then create and mount the corresponding volume in the K8S schema:

          containers:
            - name: nginx
              image: nginx
              volumeMounts:
                - mountPath: /opt/kube-cert/
                  name: kube-cert
          volumes:
            - name: kube-cert
              hostPath:
                path: /data/.minikube/
                type: Directory
    
  4. When a container in a pod fails to start, a few commands help with troubleshooting:

    # Get detailed info of the pod, xxx is the pod name
    kubectl get pod xxx --output=yaml
    # Print stdout of failed pod/container
    kubectl logs xxx
    # For container which can start up but got some issues, you can login to the box
    kubectl exec -it xxx bash
    
  5. A K8S service can be accessed directly from pods of other deployments: the cluster DNS gives every service a DNS name, which is simply the service's name (alternatively, service addresses can be read from environment variables, since K8S injects the services that exist at pod startup into the environment, though services created later will not show up there). Containers within the same pod just talk to each other via localhost; for example, if a pod runs a MongoDB container and an Nginx container, Nginx can reach MongoDB at localhost:27017. See the examples below.
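    For example (assuming a service named my-api exposing port 8080 in the default namespace; both are made up):

    # the short name works from pods in the same namespace
    curl http://my-api:8080/health
    # the fully qualified name works from any namespace (default cluster DNS suffix)
    curl http://my-api.default.svc.cluster.local:8080/health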

  6. Multiple services, deployments, etc. can be defined in a single yaml file, so one kubectl apply -f xxx.yaml brings them all up at once (see the sketch below).
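    A minimal sketch with a Service and a Deployment in one file, separated by ---; the file name and labels are made up:

    # nginx-demo.yaml
    apiVersion: v1
    kind: Service
    metadata:
      name: nginx
    spec:
      selector:
        app: nginx
      ports:
        - port: 80
    ---
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: nginx
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: nginx
      template:
        metadata:
          labels:
            app: nginx
        spec:
          containers:
            - name: nginx
              image: nginx

    kubectl apply -f nginx-demo.yaml then creates both objects in one go.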

  7. To call the K8S API from inside a pod, see Accessing the API from within a Pod. The prerequisite is that the pod has the corresponding permissions, which can be granted by creating and binding an RBAC role, for example:

    ---
    kind: ClusterRole
    apiVersion: rbac.authorization.k8s.io/v1
    metadata:
      name: jobs-create
    rules:
    - apiGroups: ["batch", "extensions"]
      resources: ["jobs"]
      verbs: ["create", "get", "list", "watch", "update", "patch", "delete"]
    ---
    # Bind to default service account in default namespace
    kind: ClusterRoleBinding
    apiVersion: rbac.authorization.k8s.io/v1
    metadata:
      name: jobs-create
    subjects:
    - kind: ServiceAccount
      name: default
      namespace: default
    roleRef:
      kind: ClusterRole
      name: jobs-create
      apiGroup: rbac.authorization.k8s.io
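    With the role bound to the default service account, a container running under it can talk to the API server directly. A sketch of listing Jobs from inside such a pod, using the standard in-cluster token and CA paths (the URL targets the jobs resource granted above):

    TOKEN=$(cat /var/run/secrets/kubernetes.io/serviceaccount/token)
    curl -sS --cacert /var/run/secrets/kubernetes.io/serviceaccount/ca.crt \
         -H "Authorization: Bearer $TOKEN" \
         https://kubernetes.default.svc/apis/batch/v1/namespaces/default/jobs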
    
  8. A K8S Job is not cleaned up automatically after it finishes. So far K8S only has an alpha feature that attaches a TTL to Jobs and periodically removes finished Jobs whose TTL has expired, so for now you either enable that feature manually (a sketch of the TTL field follows) or write your own service to clean up Jobs.
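    A sketch of the TTL field (requires the TTLAfterFinished feature gate while it is still alpha; the Job name and image are made up):

    apiVersion: batch/v1
    kind: Job
    metadata:
      name: cleanup-demo
    spec:
      ttlSecondsAfterFinished: 3600   # delete the Job and its pods 1 hour after it finishes
      template:
        spec:
          containers:
            - name: main
              image: busybox
              command: ["echo", "done"]
          restartPolicy: Never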

  9. K8S Jobs have no notion of rerun: even if an existing Job with the same name is already in the completed state, you cannot create another Job with that name. You have to delete the old Job first or use a different name.
