StreamSets Data Collector / SDC-15724

GCS Origin does not correctly handle idle case.

    Details

    • Type: Task
    • Status: Resolved
    • Priority: P3 (Limited Impact)
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 3.19.0
    • Component/s: None
    • Labels:
    • Testing Status:
      Manual Testing
    • Testing Description:
      Manually tested that the stage waits idly for the configured time (Batch Wait Time (ms)) instead of actively polling SFTP
    • Team:
      Data Plane

      Description

      I have a simple Google Cloud Storage pipeline (attached). 

      When there are no files in the bucket, this pipeline (running on my laptop) loops, checking the GCS bucket 15-20 times a second.

      We need to introduce some sort of delay when the pipeline does not have any work available. 
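One way to picture the requested delay is the minimal sketch below. It is not the SDC implementation: `poll_with_idle_wait`, `list_objects`, `process`, and `max_iterations` are hypothetical names used only for illustration, and `batch_wait_time_ms` is assumed to play the role of the stage's Batch Wait Time (ms) setting.

```python
import time
from typing import Callable, List


def poll_with_idle_wait(
    list_objects: Callable[[], List[str]],
    process: Callable[[List[str]], None],
    batch_wait_time_ms: int,
    max_iterations: int,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Poll for work, sleeping whenever a check finds nothing to do.

    All names here are hypothetical. Returns the number of idle waits.
    """
    idle_waits = 0
    for _ in range(max_iterations):
        objects = list_objects()
        if objects:
            # Work is available: process it and check again right away.
            process(objects)
        else:
            # No work available: wait instead of re-checking immediately.
            sleep(batch_wait_time_ms / 1000.0)
            idle_waits += 1
    return idle_waits
```

With a wait in place, an empty bucket is checked at most once per wait interval rather than as fast as the loop can spin.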

      In this specific case, SDC is running on a GCP VM. For now, the concern is that the volume of access to the offset.json file is causing the GCP VM to be throttled on IOPS. If the VM has better connectivity to GCS than my laptop, it can cycle through the loop faster, so the number of IOPS used will be higher.

      Since we write the offsets for every batch, each batch may cost about 6 I/O operations (not sure):

      rename - create - write - close (update metadata) - delete original - update the timestamp?

      I have 6 similar pipelines running as parameterized SCH jobs.
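A rough back-of-the-envelope estimate of the combined load, using only the figures in this ticket. It assumes every idle loop writes the offset file, that the unconfirmed estimate of 6 operations per write holds, and that all 6 pipelines loop at the laptop-observed rate. The 2000 ms wait is a hypothetical value, not a setting taken from the pipeline.

```python
def offset_iops(loops_per_second: float, ops_per_offset_write: int, pipelines: int) -> float:
    """Estimated disk operations per second caused by offset writes."""
    return loops_per_second * ops_per_offset_write * pipelines


# Observed idle loop rate: 15-20 checks per second, about 6 operations each, 6 pipelines.
low = offset_iops(15, 6, 6)    # 540
high = offset_iops(20, 6, 6)   # 720

# Hypothetical: with a 2000 ms idle wait, each pipeline loops at most 0.5 times per second.
with_wait = offset_iops(1000 / 2000, 6, 6)  # 18.0

print(low, high, with_wait)
```

Under these assumptions the idle pipelines alone would account for roughly 540-720 IOPS, and more on a VM that cycles through the loop faster.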

        • Also, I agree we can provision more IOPS and switch from magnetic disk to SSD, but that does cost $$.

       

       

            People

            Assignee:
            sebas Sebastian Sanchez
            Reporter:
            bob bob plotts